TLDR West is experiencing continuous pod crashes with etcd in HA mode. After discussion, Hengfei explains the likely causes and a restoration plan, and shares a document about manually restoring the etcd cluster.
only one pod is working?
yes
which one?
etcd-0?
etcd-1 is running, and 0 and 2 are crashed
Yep, in this case we need to manually restore the cluster.
It always happens. We uninstalled and reinstalled OpenObserve, but we still have the same issue.
In our experience, this happens in two cases: 1. The k8s nodes are recreated frequently, and the etcd pods are recreated with them; if all 3 pods are recreated at the same time, the cluster breaks and only one pod keeps working. 2. The data in etcd is very large, which makes etcd unstable.
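A quick way to tell which case applies is to check the database size on the surviving member. This is a minimal sketch assuming the pod naming from the logs below (openobs-etcd-1) and a namespace of openobserve; adjust both to your deployment, and note that a cluster with auth enabled will also need etcdctl credentials:

```sh
# Exec into the one healthy member and inspect its status.
# Pod name and namespace are assumptions; adjust to your release.
kubectl exec -n openobserve openobs-etcd-1 -- \
  etcdctl endpoint status --write-out=table
# The DB SIZE column shows how large the keyspace has grown; a value
# approaching the storage quota points at case 2 (etcd destabilized
# by a large data set).
```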
As for how to manually restore the etcd cluster, I will write a document for this and share it with you later.
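Until that document is ready, here is a rough sketch of the usual recovery sequence, assuming the bitnami etcd chart conventions implied by the logs below (StatefulSet openobs-etcd, PVCs named data-openobs-etcd-N, namespace openobserve). The crashed member's log ("data-dir used by this member must be removed") is exactly the state this cleans up:

```sh
# 1. From the healthy member, list the cluster and remove any member
#    that still corresponds to a crashed pod. In this incident the log
#    says the member was already "permanently removed", so this step
#    may be a no-op.
kubectl exec -n openobserve openobs-etcd-1 -- etcdctl member list
kubectl exec -n openobserve openobs-etcd-1 -- etcdctl member remove <member-id>

# 2. Drop the stale data dirs by deleting the broken pods' PVCs
#    (PVC names are an assumption based on the chart's convention).
#    Each PVC stays in Terminating until its pod is deleted below.
kubectl delete pvc -n openobserve data-openobs-etcd-0 data-openobs-etcd-2

# 3. Recreate the crashed pods ONE BY ONE, waiting for each to become
#    Ready, so each rejoins as a fresh member. (Assumption: the chart's
#    startup script re-adds a fresh member automatically; otherwise an
#    explicit `etcdctl member add` is needed first.)
kubectl delete pod -n openobserve openobs-etcd-0
kubectl wait -n openobserve --for=condition=Ready pod/openobs-etcd-0 --timeout=5m
kubectl delete pod -n openobserve openobs-etcd-2
```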
First, can you check what the volume size of the etcd PVC is?
I guess it is very small, about 100MB
yes it is very small
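For reference, a quick way to confirm the size (the namespace and the label selector are assumptions based on the bitnami etcd chart's defaults):

```sh
# List the etcd PVCs with their requested capacity and storage class.
kubectl get pvc -n openobserve -l app.kubernetes.io/name=etcd
```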
Instead of using the default storage class for the PV, if we use file storage, will it fix the problem?
It should make no difference. The key is that we can't recreate the 3 pods at the same time; recreating them one by one will be no problem.
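One way to enforce the one-by-one constraint, at least for voluntary disruptions such as node drains, is a PodDisruptionBudget that keeps at most one etcd pod down at a time. A sketch, with the selector again assumed from the chart's labels:

```sh
# Let evictions/drains take down at most one etcd pod at a time.
kubectl create poddisruptionbudget etcd-pdb \
  -n openobserve \
  --selector=app.kubernetes.io/name=etcd \
  --max-unavailable=1
```

This does not protect against involuntary failures (e.g. several nodes dying at once), but it prevents routine maintenance from recreating all 3 pods simultaneously.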
Thank you Hengfei
West
Wed, 18 Oct 2023 07:50:57 UTC
Hi Team, I am having an issue with etcd in HA mode. I am running with 3 replicas, but only one replica is ever running; the other two pods are crashing continuously with the below error:
{"level":"info","ts":"2023-10-18T07:46:49.368156Z","caller":"embed/etcd.go:569","msg":"cmux::serve","address":"[::]:2380"}
{"level":"info","ts":"2023-10-18T07:46:49.368153Z","caller":"embed/etcd.go:278","msg":"now serving peer/client/metrics","local-member-id":"428e4cdf52bbee23","initial-advertise-peer-urls":[""],"listen-peer-urls":[""],"advertise-client-urls":["",""],"listen-client-urls":[""],"listen-metrics-urls":[]}
{"level":"warn","ts":"2023-10-18T07:46:49.368646Z","caller":"etcdserver/server.go:1127","msg":"server error","error":"the member has been permanently removed from the cluster"}
{"level":"warn","ts":"2023-10-18T07:46:49.368673Z","caller":"etcdserver/server.go:1128","msg":"data-dir used by this member must be removed"}
{"level":"warn","ts":"2023-10-18T07:46:49.368703Z","caller":"etcdserver/server.go:2073","msg":"stopped publish because server is stopped","local-member-id":"428e4cdf52bbee23","local-member-attributes":"{Name:openobs-etcd-0 ClientURLs:[ ]}","publish-timeout":"7s","error":"etcdserver: server stopped"}
{"level":"info","ts":"2023-10-18T07:46:49.36874Z","caller":"rafthttp/peer.go:330","msg":"stopping remote peer","remote-peer-id":"cc79cc83e0b16b1f"}
{"level":"info","ts":"2023-10-18T07:46:49.368788Z","caller":"rafthttp/stream.go:294","msg":"stopped TCP streaming connection with remote peer","stream-writer-type":"unknown stream","remote-peer-id":"cc79cc83e0b16b1f"}
{"level":"info","ts":"2023-10-18T07:46:49.368812Z","caller":"rafthttp/stream.go:294","msg":"stopped TCP streaming connection with remote peer","stream-writer-type":"unknown stream","remote-peer-id":"cc79cc83e0b16b1f"}
{"level":"info","ts":"2023-10-18T07:46:49.368837Z","caller":"rafthttp/pipeline.go:85","msg":"stopped HTTP pipelining with remote peer","local-member-id":"428e4cdf52bbee23","remote-peer-id":"cc79cc83e0b16b1f"}
{"level":"info","ts":"2023-10-18T07:46:49.368862Z","caller":"rafthttp/stream.go:442","msg":"stopped stream reader with remote peer","stream-reader-type":"stream MsgApp v2","local-member-id":"428e4cdf52bbee23","remote-peer-id":"cc79cc83e0b16b1f"}
{"level":"info","ts":"2023-10-18T07:46:49.368891Z","caller":"rafthttp/stream.go:442","msg":"stopped stream reader with remote peer","stream-reader-type":"stream Message","local-member-id":"428e4cdf52bbee23","remote-peer-id":"cc79cc83e0b16b1f"}
{"level":"info","ts":"2023-10-18T07:46:49.368908Z","caller":"rafthttp/peer.go:335","msg":"stopped remote peer","remote-peer-id":"cc79cc83e0b16b1f"}
{"level":"info","ts":"2023-10-18T07:46:49.368914Z","caller":"rafthttp/peer.go:330","msg":"stopping remote peer","remote-peer-id":"8c766ebebb7cfed8"}
{"level":"info","ts":"2023-10-18T07:46:49.368925Z","caller":"rafthttp/stream.go:294","msg":"stopped TCP streaming connection with remote peer","stream-writer-type":"unknown stream","remote-peer-id":"8c766ebebb7cfed8"}
{"level":"info","ts":"2023-10-18T07:46:49.368945Z","caller":"rafthttp/stream.go:294","msg":"stopped TCP streaming connection with remote peer","stream-writer-type":"unknown stream","remote-peer-id":"8c766ebebb7cfed8"}
{"level":"info","ts":"2023-10-18T07:46:49.368967Z","caller":"rafthttp/pipeline.go:85","msg":"stopped HTTP pipelining with remote peer","local-member-id":"428e4cdf52bbee23","remote-peer-id":"8c766ebebb7cfed8"}
{"level":"info","ts":"2023-10-18T07:46:49.368986Z","caller":"rafthttp/stream.go:442","msg":"stopped stream reader with remote peer","stream-reader-type":"stream MsgApp v2","local-member-id":"428e4cdf52bbee23","remote-peer-id":"8c766ebebb7cfed8"}
{"level":"info","ts":"2023-10-18T07:46:49.369016Z","caller":"rafthttp/stream.go:442","msg":"stopped stream reader with remote peer","stream-reader-type":"stream Message","local-member-id":"428e4cdf52bbee23","remote-peer-id":"8c766ebebb7cfed8"}
{"level":"info","ts":"2023-10-18T07:46:49.369043Z","caller":"rafthttp/peer.go:335","msg":"stopped remote peer","remote-peer-id":"8c766ebebb7cfed8"}
{"level":"info","ts":"2023-10-18T07:46:49.376294Z","caller":"etcdmain/main.go:44","msg":"notifying init daemon"}
{"level":"info","ts":"2023-10-18T07:46:49.376311Z","caller":"etcdmain/main.go:50","msg":"successfully notified init daemon"}