Upgrading etcd cluster version from v3.2.24 to v3.3.15 made the k8s cluster apparently frozen #12225
Comments
Do you enable auth? Please see #11689.
Thanks a lot for pointing me to that issue, @tangcong! I believe that is indeed what is happening in my cluster. I noticed that there are some logs like the following in my cluster's leader logs:
Which I guess indicates that auth is indeed enabled and that lease_revoke requests are being issued. I noticed that the only solution you proposed is to first upgrade to the latest 3.2 version, and only then upgrade to 3.3. Does that mean this cluster is in an unrecoverable state and I should just obliterate it, given that the entire cluster is already at v3.3? Also, I couldn't find the note you added related to this issue here. Shouldn't it be there?
If your clusters are already inconsistent, you can only remove the follower nodes one by one and then add them back to the cluster to make the cluster consistent. Note that there is no guarantee that your data is complete.

@gyuho Could you also release a new version for 3.2 when you release new versions for 3.4/3.3 that include the fix for the data inconsistency bug? Thanks.

@Leulz The etcd website doc has not been updated for a long time; I will see how to update it, thank you. The latest note is here.
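For reference, that remove-and-re-add cycle is normally done with etcdctl's member commands, roughly as sketched below. The endpoints, member name, peer URL, and data directory are placeholders, not taken from this issue; add the appropriate TLS flags for your setup. Doing this one follower at a time keeps quorum intact.

```
# 1. Find the member ID of the follower to recycle
ETCDCTL_API=3 etcdctl --endpoints=https://etcd-0.internal:2379 member list

# 2. Remove that follower from the cluster
ETCDCTL_API=3 etcdctl --endpoints=https://etcd-0.internal:2379 member remove <member-id>

# 3. On the removed machine: stop etcd and wipe its data directory
#    (e.g. /var/lib/etcd) so it rejoins with a fresh snapshot from the leader

# 4. Add it back as a new member, then start etcd on that machine with
#    --initial-cluster-state existing
ETCDCTL_API=3 etcdctl --endpoints=https://etcd-0.internal:2379 \
  member add etcd-2 --peer-urls=https://etcd-2.internal:2380
```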
etcd version: 3.3.15
k8s version: 1.16
I am using the etcd-wrapper to run etcd on dedicated machines.
When I upgraded the first machine, I noticed an absurd increase in CPU usage. The average CPU usage with v3.2.24 was at around 40% of an EC2 m3.medium. It jumped to more than 90% after the upgrade, and now, even using m3.large instances, it's still at around 50~60%.
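One rough way to confirm how much of that CPU is attributable to etcd itself is its Prometheus metrics endpoint; a sketch, assuming metrics are served on the client URL without TLS (the endpoint is a placeholder):

```
# Sample etcd's own cumulative CPU counter; compare readings over time
# across the v3.2 and v3.3 members
curl -s http://etcd-0.internal:2379/metrics | grep '^process_cpu_seconds_total'
```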
Alas, I decided to keep upgrading the cluster instead of rolling back, and now the k8s cluster is seemingly immutable. Thankfully it's in a staging environment.
The cluster reports itself as healthy:
But I noticed that the Raft Index is sometimes significantly different across all instances:
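(The original outputs aren't reproduced above, but checks of this kind can be made with etcdctl; the endpoints below are placeholders and TLS flags are omitted:)

```
# Per-endpoint health
ETCDCTL_API=3 etcdctl \
  --endpoints=https://etcd-0.internal:2379,https://etcd-1.internal:2379,https://etcd-2.internal:2379 \
  endpoint health

# Per-endpoint status, including Raft Term and Raft Index
ETCDCTL_API=3 etcdctl \
  --endpoints=https://etcd-0.internal:2379,https://etcd-1.internal:2379,https://etcd-2.internal:2379 \
  endpoint status --write-out=table
```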
Lots (as in, dozens per second) of logs like this can be seen:

```
Aug 15 19:57:10 internal-dns etcd-wrapper[2981]: 2020-08-15 19:57:10.849851 I | auth: deleting token <token> for user root
```

Other logs that look weird are:

```
auth: invalid user name etcd-2 for permission checking
```

```
pkg/fileutil: purged file /var/lib/etcd/member/snap/00000000000000ae-000000000e945a73.snap successfully
```

and lots of

```
etcdserver: read-only range request "key:\"/registry/pods/\" range_end:\"/registry/pods0\" " with result "range_response_count:808 size:13316569" took too long (127.988459ms) to execute
```

The k8s cluster using this etcd cluster is, as mentioned, apparently frozen. I tried editing a deployment we have in the cluster (the kind of edit is sketched after the listings below), and the result was:
Pods before the patch:
Pods after editing a deployment to force a cycle:
Pods after some time:
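(For context, a cycle like this is typically forced with something along the lines of the sketch below; the namespace and deployment name are placeholders, not the real ones from this cluster:)

```
# Force a rollout by restarting the deployment (available since kubectl 1.15),
# then watch whether the new pods ever come up
kubectl -n staging rollout restart deployment/my-app
kubectl -n staging rollout status deployment/my-app --timeout=120s
```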
Is this a known issue? Any insight into what is happening here is much appreciated.