What happened?
We injected a 12-second network failure (by disabling the switch port). Etcd should elect a new leader right after the port recovers. However, the actual time from the failure until a new leader was elected was 26 seconds, far exceeding expectations.
After repeated fault injection tests with failure durations of 15s, 17s, and 20s, the time from fault injection to recovery and election of a new leader was around 26 seconds each time.
What did you expect to happen?
Etcd should elect a new leader right after the network failure is recovered, or at least not take as long as 26 seconds.
How can we reproduce it (as minimally and precisely as possible)?
Inject a network failure on the switch port lasting more than 10 seconds.
Anything else we need to know?
We have 3 etcd members deployed.
Etcd version (please run commands below)
$ etcd --version
#
$ etcdctl version
# paste output here
3.5.0
Etcd configuration (command line flags or environment variables)
{"level":"info","ts":"2022-10-25T17:14:22.835+0800","caller":"embed/etcd.go:309","msg":"starting an etcd server","etcd-version":"3.5.0","go-version":"go1.16.5","go-os":"linux","go-arch":"amd64","max-cpu-set":4,"max-cpu-available":4,"member-initialized":false,"data-dir":"/opt/ETCDdata/","wal-dir":"","wal-dir-dedicated":"","member-dir":"/opt/ETCDdata/member","force-new-cluster":false,"heartbeat-interval":"100ms","election-timeout":"1s","initial-election-tick-advance":true,"snapshot-count":100000,"snapshot-catchup-entries":5000,"initial-cluster-state":"new","initial-cluster-token":"haf-etcd-cluster","quota-size-bytes":2147483648,"pre-vote":true,"initial-corrupt-check":false,"corrupt-check-time-interval":"0s","auto-compaction-mode":"revision","auto-compaction-retention":"1µs","auto-compaction-interval":"1µs","discovery-url":"","discovery-proxy":"","downgrade-check-interval":"5s"}
Etcd debug information (please run commands below, feel free to obfuscate the IP address or FQDN in the output)
$ etcdctl member list -w table
# paste output here
$ etcdctl --endpoints=<member list> endpoint status -w table
# paste output here
Relevant log output
No response
It would be great to add a test that validates leader election latency.
I would also love to reproduce this. We saw something similar with OpenShift a while ago on some cloud providers. @EricLiuRan Do you use a special switch to inject this failure, or can this be done on the OS as well?
This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.