
When the election time exceeds the network error #14653

Closed

EricLiuRan opened this issue Oct 29, 2022 · 3 comments

Comments
@EricLiuRan

What happened?

We injected a 12-second network error (disabled the switch port). Etcd should elect a new leader right after the port recovers; however, the actual time from the failure until a new leader was elected was 26 seconds, far exceeding expectations.
In repeated fault-injection tests with network error durations of 15s, 17s, and 20s, the time from fault injection to recovery and new-leader election was around 26 seconds each time.

What did you expect to happen?

Etcd should elect a new leader right after the network error recovers; certainly not take as long as 26 seconds.

How can we reproduce it (as minimally and precisely as possible)?

Inject a network error into the switch port for more than 10 seconds.
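As an OS-level alternative to disabling a switch port, a fault like this can be sketched with iptables on one etcd member host. This is an assumption, not the reporter's actual method: the peer port (etcd's default, 2380) and the root-shell environment are assumed.

```shell
# Hypothetical OS-level fault injection on one etcd member host (requires root).
# Drops all etcd peer traffic (default peer port 2380) for 12 seconds, then
# restores it. Port number and duration are assumptions matching this report.
iptables -A INPUT  -p tcp --dport 2380 -j DROP
iptables -A OUTPUT -p tcp --dport 2380 -j DROP
sleep 12
iptables -D INPUT  -p tcp --dport 2380 -j DROP
iptables -D OUTPUT -p tcp --dport 2380 -j DROP
```

Blocking only the peer port isolates the member from raft traffic while leaving client traffic reachable, which approximates a switch-port failure for election purposes.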

Anything else we need to know?

We have a 3-member etcd cluster deployed.

Etcd version (please run commands below)

$ etcd --version
$ etcdctl version
3.5.0

Etcd configuration (command line flags or environment variables)


{"level":"info","ts":"2022-10-25T17:14:22.835+0800","caller":"embed/etcd.go:309","msg":"starting an etcd server","etcd-version":"3.5.0","go-version":"go1.16.5","go-os":"linux","go-arch":"amd64","max-cpu-set":4,"max-cpu-available":4,"member-initialized":false,"data-dir":"/opt/ETCDdata/","wal-dir":"","wal-dir-dedicated":"","member-dir":"/opt/ETCDdata/member","force-new-cluster":false,"heartbeat-interval":"100ms","election-timeout":"1s","initial-election-tick-advance":true,"snapshot-count":100000,"snapshot-catchup-entries":5000,"initial-cluster-state":"new","initial-cluster-token":"haf-etcd-cluster","quota-size-bytes":2147483648,"pre-vote":true,"initial-corrupt-check":false,"corrupt-check-time-interval":"0s","auto-compaction-mode":"revision","auto-compaction-retention":"1µs","auto-compaction-interval":"1µs","discovery-url":"","discovery-proxy":"","downgrade-check-interval":"5s"}
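For reference, the flags in the log above imply the following raft timing. This is a sketch of the arithmetic, assuming etcd's documented behavior that raft counts time in heartbeat ticks and randomizes the election timeout between 1x and 2x the configured value:

```python
# Timing values taken from the configuration log above.
heartbeat_interval_ms = 100   # heartbeat-interval
election_timeout_ms = 1000    # election-timeout

# etcd's raft layer counts time in heartbeat ticks; a follower starts an
# election after a randomized timeout between 1x and 2x election-timeout.
election_ticks = election_timeout_ms // heartbeat_interval_ms
worst_case_election_start_ms = 2 * election_timeout_ms

print(election_ticks)                # ticks without a leader heartbeat -> 10
print(worst_case_election_start_ms)  # upper bound to start an election -> 2000
```

With this configuration a new election should normally begin within about 2 seconds of connectivity being restored, which is why the observed 26 seconds is surprising.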

Etcd debug information (please run commands below; feel free to obfuscate the IP address or FQDN in the output)

$ etcdctl member list -w table
# paste output here

$ etcdctl --endpoints=<member list> endpoint status -w table
# paste output here

Relevant log output

No response

@serathius
Member

Interesting. Could you reproduce it on the latest version of etcd, v3.5.5?
It would be great to add a test that validates the leader election latency.
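One rough way to measure that latency from the outside is to poll `etcdctl endpoint status` until a leader is reported again. A sketch, assuming `etcdctl` v3 is on the PATH and a reachable cluster; the endpoint URLs are placeholders, not values from this issue:

```shell
# Hypothetical measurement loop: after injecting the fault, poll until a
# leader is reported again and print how long that took.
# ENDPOINTS is an assumption; replace with the real member client URLs.
ENDPOINTS="http://10.0.0.1:2379,http://10.0.0.2:2379,http://10.0.0.3:2379"
start=$(date +%s)
until etcdctl --endpoints="$ENDPOINTS" endpoint status -w json 2>/dev/null \
      | grep -q '"leader":[1-9]'; do
  sleep 0.5
done
echo "leader elected after $(( $(date +%s) - start ))s"
```

The grep treats a nonzero `leader` id in the JSON output as "a leader exists"; a proper test would compare leader ids before and after the fault rather than just presence.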

@tjungblu
Contributor

tjungblu commented Nov 1, 2022

Would be great to add a test that validates that leader election latency.

I would also love to reproduce this; we've seen something similar with OpenShift a while ago on some cloud providers. @EricLiuRan Do you use a special switch to inject this failure, or can it be done at the OS level as well?

@stale

stale bot commented Mar 18, 2023

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Mar 18, 2023
@stale stale bot closed this as completed May 22, 2023