
etcd old leader still revokes lease after it stepped to follower if CPU is high #12528

Closed
jiapeish opened this issue Dec 7, 2020 · 22 comments

@jiapeish

jiapeish commented Dec 7, 2020

**Issue Overview**
Hi,

I have found a possible bug related to leases. It seems that the old etcd leader sends a LeaseRevoke
request to the new leader, which causes the lease to be revoked and the key deleted, even though the
client is still sending lease keepalives to the new leader.

**Our use case**

  1. We use a 3-node etcd cluster (v3.4.6); each node runs as a container with 4 vCPU cores. The 3 containers are deployed on 3 different VMs. The client writes a key with a 4-second lease to the etcd cluster (see the client sketch after this list).

  2. The system architecture is as follows:

[architecture diagram]
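For reference, here is a minimal client sketch of the use case above (the endpoints and key name are placeholders, not taken from our deployment): grant a 4-second lease, attach a key to it, and keep the lease alive with clientv3.

```go
package main

import (
	"context"
	"log"
	"time"

	"go.etcd.io/etcd/clientv3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		// placeholder endpoints for the 3 VMs
		Endpoints:   []string{"http://vm1:2379", "http://vm2:2379", "http://vm3:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	// grant a 4-second lease, matching the use case above
	grant, err := cli.Grant(context.TODO(), 4)
	if err != nil {
		log.Fatal(err)
	}
	// attach a key to the lease
	if _, err := cli.Put(context.TODO(), "/service/instance", "alive",
		clientv3.WithLease(grant.ID)); err != nil {
		log.Fatal(err)
	}
	// keep the lease alive; the key should survive as long as keepalives
	// reach a healthy quorum, which is what does not hold in this report
	ch, err := cli.KeepAlive(context.TODO(), grant.ID)
	if err != nil {
		log.Fatal(err)
	}
	for range ch {
		// drain keepalive responses
	}
}
```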

**Steps to reproduce**

  1. In etcdserver/raft.go, add a small piece of code to simulate heavy CPU load in func (r *raftNode) start(rh *raftReadyHandler) (start and flag are locals added for the simulation):

					rh.updateLeadership(newLeader)
					r.td.Reset()
				}

				// mock stuck: after the node has run for 10s, block the
				// Ready-handling loop once for 5 seconds
				if time.Since(start) > 10*time.Second && !flag {
					plog.Warningf("sleep!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!")
					time.Sleep(5 * time.Second)
					flag = true
				}

				if len(rd.ReadStates) != 0 {
  2. Compile the etcd from step 1 as etcd-stuck.exe, start another 2 etcd nodes, and try to make etcd-stuck the leader;

  3. Write a key into the cluster and send lease keepalives to the 3-node cluster;

  4. After 10s, etcd-stuck sleeps and a new leader is elected.

  5. At the same time, the old leader sends a lease revoke request to the new leader, and the key is
    deleted.

**What we expect**

  1. We expect the key not to be deleted, because 2 nodes are healthy and can process the clients'
    lease keepalives.

  2. Why does the old leader send a lease revoke request when it has already stepped down to follower at that time?

  3. Why is the key deleted? Why doesn't the new leader check the revoke request and refuse it?

**Logs**

I've attached the logs of the 3 nodes.
etcd1.txt
etcd2.txt
etcd3.txt

@jiapeish jiapeish changed the title from "etcd old server still revokes lease after it stepped to follower if CPU is high" to "etcd old leader still revokes lease after it stepped to follower if CPU is high" Dec 7, 2020
@tangcong
Contributor

tangcong commented Dec 8, 2020

It seems that in extreme cases it is indeed possible for an old leader to send a revoke-lease request to the cluster. When the cluster leader changes, the old leader will clear all lessors. However, during the period between raft stepping the old leader down and the leadership update being applied, the old leader will still revoke expired leases.
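A minimal sketch of that window, assuming the v3.4 lease.Lessor interface (Promote, Demote, ExpiredLeasesC); the loop and helper names below are illustrative, not the actual server code:

```go
package sketch

import "go.etcd.io/etcd/lease"

// runExpiredLeaseLoop drains the lessor's expired-lease channel. The lessor
// keeps emitting expired leases as long as it believes it is primary, and
// Demote() is only called from updateLeadership once the leader change has
// been processed, so an old raft leader can still issue revokes in this window.
func runExpiredLeaseLoop(le lease.Lessor, revoke func(id lease.LeaseID)) {
	for expired := range le.ExpiredLeasesC() {
		for _, l := range expired {
			// may still run after raft has already stepped the node down
			revoke(l.ID)
		}
	}
}
```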

@jiapeish
Author

jiapeish commented Dec 8, 2020

When the old leader received the message with a higher term, it became a follower, but it seems that another goroutine still marks itself as "leader". Is this the period you mean by "updating leadership"? If so, why don't we clear the lessors immediately when it receives a message with a higher term?

[screenshot]

@tangcong
Contributor

tangcong commented Dec 8, 2020

https://github.com/etcd-io/etcd/blob/v3.4.6/etcdserver/raft.go#L174-L195
When the leader changes, the updateLeadership function will clear the leases. The etcd server obtains and processes raft log entries serially from the raft module. Before the old leader calls updateLeadership, I guess it may take a long time to process the last message, causing this issue: for example, high CPU load, high disk latency, or your sleep function.
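To illustrate the serial processing described above, here is a rough sketch of a raft.Ready consumer; it mirrors the shape of raftNode.start but is heavily simplified, and onLeaderChange stands in for rh.updateLeadership:

```go
package sketch

import "go.etcd.io/etcd/raft"

// processReady handles Ready batches one at a time. If one iteration stalls
// (slow fsync, CPU starvation, or the injected sleep in the repro above),
// the SoftState carrying the leader change is only observed afterwards, so
// the lessor demotion done by updateLeadership is delayed by the same amount.
func processReady(n raft.Node, onLeaderChange func(lead uint64)) {
	for rd := range n.Ready() {
		if rd.SoftState != nil {
			onLeaderChange(rd.SoftState.Lead)
		}
		// ... persist rd.Entries, apply rd.CommittedEntries, send rd.Messages ...
		n.Advance()
	}
}
```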

@jiapeish
Author

jiapeish commented Dec 8, 2020

> https://github.com/etcd-io/etcd/blob/v3.4.6/etcdserver/raft.go#L174-L195
> When the leader changes, the updateLeadership function will clear the leases. The etcd server obtains and processes raft log entries serially from the raft module. Before the old leader calls updateLeadership, I guess it may take a long time to process the last message, causing this issue: for example, high CPU load, high disk latency, or your sleep function.

Hi, @tangcong

In fact, we have already encountered this issue several times, with no sleep function :). In a distributed system, phenomena such as high CPU load and network latency are the norm, and we think it would be better if the community could solve this issue. Also, why don't we try to avoid calling lease revoke as soon as the old leader receives a MsgApp with a higher term?

@tangcong
Contributor

tangcong commented Dec 8, 2020

Yeah, we need to spend more time to find a simple and safe way to fix it. If you have a good idea, please submit a PR. Thanks.

I submitted a PR to try to reduce the probability of this bug. Can you help test it in your cluster? I have been a little busy recently. Thanks. @jiapeish

@jiapeish
Author

jiapeish commented Dec 9, 2020

> Yeah, we need to spend more time to find a simple and safe way to fix it. If you have a good idea, please submit a PR. Thanks.
>
> I submitted a PR to try to reduce the probability of this bug. Can you help test it in your cluster? I have been a little busy recently. Thanks. @jiapeish

Yes, I can test this PR in our cluster and will share the test results as soon as possible.

@jiapeish
Author

jiapeish commented Dec 9, 2020

Hi @tangcong,
Unfortunately, this PR doesn't fix the problem; the issue still occurs. I believe it is related to concurrency/synchronization between different goroutines, since LeaseRevoke runs in a new goroutine (see the sketch below).
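For context, a simplified sketch of the asynchronous revoke path (the interface and function names here are illustrative; the point is that expired leases are handed to goroutines that call LeaseRevoke, which is why the revoke can race with the step-down):

```go
package sketch

import (
	"context"

	"go.etcd.io/etcd/etcdserver/etcdserverpb"
	"go.etcd.io/etcd/lease"
)

// revoker abstracts the single LeaseRevoke call this sketch needs.
type revoker interface {
	LeaseRevoke(ctx context.Context, r *etcdserverpb.LeaseRevokeRequest) (*etcdserverpb.LeaseRevokeResponse, error)
}

// revokeExpired mimics the shape of the expired-lease handling: each revoke
// runs in its own goroutine, concurrently with the apply loop, so it can
// execute after raft has already demoted the node to follower.
func revokeExpired(ctx context.Context, srv revoker, expiredC <-chan []*lease.Lease) {
	for leases := range expiredC {
		for _, l := range leases {
			lid := l.ID
			go func() {
				_, _ = srv.LeaseRevoke(ctx, &etcdserverpb.LeaseRevokeRequest{ID: int64(lid)})
			}()
		}
	}
}
```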

@jiapeish
Author

Hi all,
Can anybody help us? Is there a solution or plan to fix this potential bug?

Thanks :)

@tangcong
Contributor

@jiapeish, I have updated and verified the PR in my local environment, and it can fix this issue. etcd will verify the leader status from the raft state machine, but it will sacrifice a little performance. Can you give it a try? Thanks.

16:57:52 etcd1 | {"level":"warn","ts":"2020-12-14T16:57:52.831+0800","caller":"etcdserver/server.go:1055","msg":"ignore older leader revoke lease request","lease-id":"32697660774ba905","isLeader":true,"isLeaderInRaftStateMachine":false}
16:57:54 etcd1 | {"level":"warn","ts":"2020-12-14T16:57:54.686+0800","caller":"etcdserver/raft.go:376","msg":"leader failed to send out heartbeat on time; took too long, leader is overloaded likely from slow disk","to":"fd422379fda50e48","heartbeat-interval":"100ms","expected-duration":"200ms","exceeded-duration":"4.889343035s"}
16:57:54 etcd1 | {"level":"warn","ts":"2020-12-14T16:57:54.686+0800","caller":"etcdserver/raft.go:376","msg":"leader failed to send out heartbeat on time; took too long, leader is overloaded likely from slow disk","to":"91bc3c398fb3c146","heartbeat-interval":"100ms","expected-duration":"200ms","exceeded-duration":"4.889457147s"}
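The idea behind that check, roughly (this is a hedged sketch, not the PR's actual code; the parameter names are illustrative), is to ask the raft state machine directly before acting on a revoke that was scheduled while the node still thought it was leader:

```go
package sketch

import "go.etcd.io/etcd/raft"

// okToRevoke compares the server's cached leadership flag with what the raft
// state machine currently reports. In the log above, cachedIsLeader was true
// while the raft status already said follower, so the revoke request is
// ignored ("ignore older leader revoke lease request").
func okToRevoke(node raft.Node, selfID uint64, cachedIsLeader bool) bool {
	st := node.Status()
	inRaft := st.RaftState == raft.StateLeader && st.Lead == selfID
	return cachedIsLeader && inRaft
}
```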

@jiapeish
Author

@tangcong, we tested this PR in our cluster and it does work, after adding another line to check whether lg is nil. We will run further stability and stress tests on this PR and update with the results. Thank you.

@tangcong
Contributor

> @tangcong, we tested this PR in our cluster and it does work, after adding another line to check whether lg is nil. We will run further stability and stress tests on this PR and update with the results. Thank you.

Good, thanks. If you cherry-pick it into etcd 3.4/3.3, you have to check whether lg is nil.

@jiapeish
Author

> @tangcong, we tested this PR in our cluster and it does work, after adding another line to check whether lg is nil. We will run further stability and stress tests on this PR and update with the results. Thank you.
>
> Good, thanks. If you cherry-pick it into etcd 3.4/3.3, you have to check whether lg is nil.

@tangcong Hi, since this PR works, I was wondering whether we/the community can officially cherry-pick/backport it to etcd 3.4.6? And can we submit it to the newer version?

@tangcong
Contributor

OK, I will add some comments and see whether the failed test cases are relevant. @jiapeish

@jiapeish
Author

@tangcong Is there any progress on the failed test cases? It seems that CI-Job-4706.6 failed; I tried to figure out why, but the CI logs confused me...
Could you help look into why the test cases are failing?

@tangcong
Contributor

@jiapeish Sorry, I have been busy recently. Some test cases are flaky, and it seems the failed test case is not related to this PR. The PR is ready for review:
#12531

@ptabor
Contributor

ptabor commented Feb 16, 2021

I've seen this (or a similar) issue on a single-node cluster:

2021-02-16 06:01:30.976722 I | etcdserver/api/etcdhttp: /health OK (status code 200)
(there are no logs at all for 14s, so it's a pretty serious 'starvation')
2021-02-16 06:01:44.004055 W | etcdserver: request "header:<ID:5225433015266645667 > lease_revoke:<id:488477a450910fe1>" with result "size:29" took too long (1.989515846s) to execute
2021-02-16 06:01:44.655677 W | etcdserver: failed to apply request,took 27.653µs,request header:<ID:5225433015266645670 > lease_revoke:<id:488477a450910fe1>,resp size:29,err is lease not found
...
> 2021-02-16 06:01:44.786936 W | etcdserver: failed to revoke 488477a450910fe1 ("lease not found")

@jiapeish
Author

> I've seen this (or a similar) issue on a single-node cluster:
>
> 2021-02-16 06:01:30.976722 I | etcdserver/api/etcdhttp: /health OK (status code 200)
> (there are no logs at all for 14s, so it's a pretty serious 'starvation')
> 2021-02-16 06:01:44.004055 W | etcdserver: request "header:<ID:5225433015266645667 > lease_revoke:<id:488477a450910fe1>" with result "size:29" took too long (1.989515846s) to execute
> 2021-02-16 06:01:44.655677 W | etcdserver: failed to apply request,took 27.653µs,request header:<ID:5225433015266645670 > lease_revoke:<id:488477a450910fe1>,resp size:29,err is lease not found
> ...
> 2021-02-16 06:01:44.786936 W | etcdserver: failed to revoke 488477a450910fe1 ("lease not found")

Yes, there is a similarity between them, so I think this issue is important to fix...

@stale

stale bot commented May 23, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label May 23, 2021
@jiapeish
Author

This issue is used to track #12531

@stale stale bot closed this as completed Jun 14, 2021
@jiapeish
Author

This issue still EXISTS.

@aaronjzhang

I also faced this issue; my case is as follows:
We have 3 etcd nodes running in k8s with Portworx storage. When a Portworx node reboots, fdatasync takes a very long time and the leader etcd server gets stuck processing a Ready. Raft then steps it down to follower after an election timeout, but the lessor still considers itself the primary lessor and revokes the expired leases. Even though those requests time out, once Portworx recovers, fdatasync returns, the etcd server continues with the next Ready, and it still sends out the revoke-lease requests stored in that Ready's Messages.
The fix looks like it works well, so why hasn't it been merged? I also think this is very important to fix.

@ahrtr
Member

ahrtr commented Jan 14, 2023

@aaronjzhang could you raise a new issue and link to this one? We will triage it later.
