
When the election time exceeds the network error #14653

Closed

EricLiuRan opened this issue Oct 29, 2022 · 3 comments

Comments
@EricLiuRan

What happened?

We injected a 12-second network error (disabled the switch port). Etcd should elect a new leader right after the port recovers; however, the actual time from the failure until a new leader was elected was 26 seconds, far exceeding expectations.
In repeated fault-injection tests with network error durations of 15s, 17s, and 20s, the time from fault injection to recovery and new-leader election was around 26 seconds each time.

What did you expect to happen?

Etcd should elect a new leader right after the network error recovers; certainly not take as long as 26 seconds.

How can we reproduce it (as minimally and precisely as possible)?

Inject a network error into the switch port for more than 10 seconds.
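As an OS-level alternative to disabling a switch port, a fault like this can be sketched with iptables on one etcd member host. This is an assumption, not the reporter's actual method: the peer port (etcd's default, 2380) and the root-shell environment are assumed.

```shell
# Hypothetical OS-level fault injection on one etcd member host (requires root).
# Drops all etcd peer traffic (default peer port 2380) for 12 seconds, then
# restores it. Port number and duration are assumptions matching this report.
iptables -A INPUT  -p tcp --dport 2380 -j DROP
iptables -A OUTPUT -p tcp --dport 2380 -j DROP
sleep 12
iptables -D INPUT  -p tcp --dport 2380 -j DROP
iptables -D OUTPUT -p tcp --dport 2380 -j DROP
```

Blocking only the peer port isolates the member from raft traffic while leaving client traffic reachable, which approximates a switch-port failure for election purposes.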

Anything else we need to know?

We have a 3-member etcd cluster deployed.

Etcd version (please run commands below)

$ etcd --version
$ etcdctl version
3.5.0

Etcd configuration (command line flags or environment variables)


{"level":"info","ts":"2022-10-25T17:14:22.835+0800","caller":"embed/etcd.go:309","msg":"starting an etcd server","etcd-version":"3.5.0","go-version":"go1.16.5","go-os":"linux","go-arch":"amd64","max-cpu-set":4,"max-cpu-available":4,"member-initialized":false,"data-dir":"/opt/ETCDdata/","wal-dir":"","wal-dir-dedicated":"","member-dir":"/opt/ETCDdata/member","force-new-cluster":false,"heartbeat-interval":"100ms","election-timeout":"1s","initial-election-tick-advance":true,"snapshot-count":100000,"snapshot-catchup-entries":5000,"initial-cluster-state":"new","initial-cluster-token":"haf-etcd-cluster","quota-size-bytes":2147483648,"pre-vote":true,"initial-corrupt-check":false,"corrupt-check-time-interval":"0s","auto-compaction-mode":"revision","auto-compaction-retention":"1µs","auto-compaction-interval":"1µs","discovery-url":"","discovery-proxy":"","downgrade-check-interval":"5s"}
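For reference, the flags in the log above imply the following raft timing. This is a sketch of the arithmetic, assuming etcd's documented behavior that raft counts time in heartbeat ticks and randomizes the election timeout between 1x and 2x the configured value:

```python
# Timing values taken from the configuration log above.
heartbeat_interval_ms = 100   # heartbeat-interval
election_timeout_ms = 1000    # election-timeout

# etcd's raft layer counts time in heartbeat ticks; a follower starts an
# election after a randomized timeout between 1x and 2x election-timeout.
election_ticks = election_timeout_ms // heartbeat_interval_ms
worst_case_election_start_ms = 2 * election_timeout_ms

print(election_ticks)                # ticks without a leader heartbeat -> 10
print(worst_case_election_start_ms)  # upper bound to start an election -> 2000
```

With this configuration a new election should normally begin within about 2 seconds of connectivity being restored, which is why the observed 26 seconds is surprising.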

Etcd debug information (please run commands below; feel free to obfuscate the IP address or FQDN in the output)

$ etcdctl member list -w table
# paste output here

$ etcdctl --endpoints=<member list> endpoint status -w table
# paste output here

Relevant log output

No response

@serathius
Member

Interesting. Could you reproduce it on the latest version of etcd, v3.5.5?
It would be great to add a test that validates the leader election latency.
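One rough way to measure that latency from the outside is to poll `etcdctl endpoint status` until a leader is reported again. A sketch, assuming `etcdctl` v3 is on the PATH and a reachable cluster; the endpoint URLs are placeholders, not values from this issue:

```shell
# Hypothetical measurement loop: after injecting the fault, poll until a
# leader is reported again and print how long that took.
# ENDPOINTS is an assumption; replace with the real member client URLs.
ENDPOINTS="http://10.0.0.1:2379,http://10.0.0.2:2379,http://10.0.0.3:2379"
start=$(date +%s)
until etcdctl --endpoints="$ENDPOINTS" endpoint status -w json 2>/dev/null \
      | grep -q '"leader":[1-9]'; do
  sleep 0.5
done
echo "leader elected after $(( $(date +%s) - start ))s"
```

The grep treats a nonzero `leader` id in the JSON output as "a leader exists"; a proper test would compare leader ids before and after the fault rather than just presence.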

@tjungblu
Contributor

tjungblu commented Nov 1, 2022

Would be great to add a test that validates that leader election latency.

I would also love to reproduce this; we've seen something similar with OpenShift a while ago on some cloud providers. @EricLiuRan Do you use a special switch to inject this failure, or can it be done at the OS level as well?

@stale

stale bot commented Mar 18, 2023

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Mar 18, 2023
@stale stale bot closed this as completed May 22, 2023