apiserver timeouts after rolling-update of etcd cluster #47131
Comments
@javipolo There are no sig labels on this issue. Please add a sig label by: |
@kubernetes/sig-api-machinery-misc I guess :) |
@javipolo: Reiterating the mentions to trigger a notification: In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
@javipolo Can you provide more logs, ideally covering a 10-minute window? The "compacted" error is expected, because the connection between etcd and the API server was cut off. But it should disappear once etcd is back and the resync has finished. |
Yep. Before, I provided only the ERROR part of the logs. Now I'm including the INFO part as well. The logs run from 16:50 until the restart. We did the rolling update of etcd at around 17:45. If you need anything else, just ask and I'll try to provide it :) (thanks!)
|
When did you restart API server? Are there more logs like |
/cc @jbeda |
The logs I provided end at the apiserver restart, so no, there are no more of those errors after 17:51. I can provide the restart times of the etcds and apiservers (the date/time is the same on every server). etcd restart times:
apiserver restart times:
|
I have encountered the same problem, and it has been troubling me. Kubernetes version: v1.6.4. etcd status:
kube-apiserver logs:
|
+1 |
the same problem |
I also have the same messages on the apiserver log, however it's rare that I get timeouts on operations invoked on it. K8s: 1.6.2 |
/assign @hongchaodeng @xiang90 |
OK. So this issue already exists. I can share my story as well. Basically, we were doing some etcd scale-up and scale-down stress testing. We saw that the API server hung and some resources became unusable. I tried a dozen times and the results were inconsistent:
I have tried to reproduce it with a from-scratch apiserver:
I haven't reproduced it yet, so I suspect it needs some existing data. Note that in my case it happened only when scaling down from 3 to 1; it did not happen when scaling down from 2 to 1. I also tried scaling down slowly, waiting 60s before removing the second member, and it did not happen then either. |
@xiang90 @hongchaodeng etcd version:
kubernetes version:
kube-apiserver logs:
Will this issue affect production environments? Thanks. |
Similar problem (see log below). In short: I am installing a new K8s master node based on this guide: https://coreos.com/kubernetes/docs/latest/getting-started.html. When I start the kubelet service on it, everything starts up, but the apiserver keeps crashing (?). Because of that, the worker node can't register itself and the master node isn't fully working either. CoreOS Linux: 1492.4.0
Occasionally it runs further:
|
Update to the situation - I solved my issue. Turns out that if you are running |
Any progress on this issue? We have also encountered it, and even worse, some update operations failed. The log is:
This has happened twice in our environment, and we didn't find the conditions that reproduce it. k8s version: v1.6.0 |
I have the same problem. Is there any solution? I checked the logs and found many lines like: k8s version: 1.7.2. I also have a strange question. |
|
cc @kubernetes/kubernetes-release-managers as this is most probably a show-stopper for v1.9; we will most probably consider blocking the release on fixing this issue. |
@luxas That's the basic set of steps planned, yes. Note that we're planning to merge the work to the k8s 1.10 branch and then cherry-pick it to a 1.9.x release once it proves stable; we're not targeting the initial 1.9 release. I'll be picking up the engineering work on this. I'll link to an issue explaining in more detail soon. |
[MILESTONENOTIFIER] Milestone Removed From Issue @hongchaodeng @javipolo @xiang90 @kubernetes/sig-api-machinery-misc @kubernetes/sig-cluster-lifecycle-misc @kubernetes/sig-scalability-misc Important: Code freeze is in effect and only issues with |
I have the same problem. What is the cause? @xiang90 |
Any update on this? We've had to implement a brute force restart of kube-apiserver regularly as we roll the etcd credentials frequently (every couple of days) in a deployment with short lived TLS credentials from Vault. We weren't sure what was causing the apiserver to become unresponsive but every symptom in this issue rings true. K8S 1.7.8, etcd 3.2.6. |
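The periodic brute-force restart described above could instead be triggered by failed health probes. The following is only a minimal sketch under stated assumptions: the URL passed to `probe_healthz` (for example a local `/healthz` address), the threshold of three failures, and the names `probe_healthz` and `RestartWatchdog` are all hypothetical, and it is not confirmed in this thread that a health endpoint reflects this particular hang, so probing a real API request may be a better signal.

```python
import http.client
import urllib.request


def probe_healthz(url, timeout_s=5.0):
    """Return True if a GET on `url` answers HTTP 200 within `timeout_s` seconds."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status == 200
    except (OSError, http.client.HTTPException):
        # URLError, HTTPError, refused connections and timeouts are OSError subclasses.
        return False


class RestartWatchdog:
    """Signals a restart after `threshold` consecutive failed probes (assumed policy)."""

    def __init__(self, threshold=3):
        self.threshold = threshold
        self.failures = 0

    def observe(self, healthy):
        """Record one probe result; return True when a restart is due."""
        if healthy:
            self.failures = 0
            return False
        self.failures += 1
        if self.failures >= self.threshold:
            self.failures = 0  # start counting afresh after signalling
            return True
        return False
```

A caller would run `probe_healthz` on a timer, feed each result to `observe`, and restart the kube-apiserver service only when `observe` returns True. How the restart is performed depends on the deployment and is left out here.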
@sgmiller etcd v3.2.11 has just been released and contains multiple fixes in the etcd client's use of gRPC. We plan to upgrade kubernetes to use the new etcd client; just upgrading the etcd server(s) to 3.2.11 is not sufficient here, we need to upgrade the client code vendored into kubernetes as well. See #56114 for details. I should warn that the bulk of the reports found in this (lengthy) issue provide insufficient detail for us to say for certain if they all have the same root cause, or if the fixes in etcd 3.2.11 will resolve them. But they are closely enough related that we'd like to push out these changes and get feedback on what (if any) of these types of issues remain. If you have any more details you can provide about the specific issues you've encountered, please let us know. |
Note that #57061 is very likely grpc/grpc-go#1005, which was fixed after v1.0.4. Users on Kube 1.6 or 1.7 would be broken against etcd 3.2 servers at very large loads. |
Is it possible to run into this issue on a new Kubernetes install with 1.9.1? Stuck here:
It looks like the API server is failing to fully deploy, and I don't see an etcd container. Should I?
|
We see something similar when doing a hard poweroff of a VM on which one of the etcd nodes is running. K8s version 1.9.1 |
This problem should be solved or at least mitigated by #57160. The PR bumps both gRPC and the etcd client to fix the timeout problem caused by connection reset and balancing. I am closing out this issue. If anyone still sees the timeout problem in a release (after 1.10) with #57160, please create a new issue with reproduction steps. |
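To illustrate the kind of behavior at stake in "connection reset and balancing", here is a minimal sketch of endpoint failover: when one endpoint resets the connection or times out, the caller moves on to the next endpoint instead of staying pinned to the dead one. This is a generic illustration, not the etcd client's or gRPC's actual balancer; `call_with_failover` and the `request` callable are hypothetical names.

```python
def call_with_failover(endpoints, request, start=0):
    """Try `request(endpoint)` on each endpoint in turn, starting at index `start`.

    Returns (result, index_of_endpoint_that_answered).
    Raises the last connection error if every endpoint fails.
    """
    if not endpoints:
        raise ValueError("at least one endpoint is required")
    last_error = None
    for offset in range(len(endpoints)):
        index = (start + offset) % len(endpoints)
        try:
            return request(endpoints[index]), index
        except (ConnectionError, TimeoutError) as err:
            last_error = err  # endpoint looks dead; move on to the next one
    raise last_error
```

Passing the returned index back as `start` on the next call keeps the caller on the last endpoint that worked, so a restarted or removed member is skipped only for as long as it keeps failing.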
@xiang90 any chance this gets backported to 1.9.x? |
I agree, #57160 should resolve this. Unfortunately, the PR introduces a lot of change. It bumps the versions of multiple major dependencies (etcd, docker, containerd, grpc, ...). I do not believe it is well suited for backport to stable kubernetes branches at this time. After 1.10 is released and this has had some time to bake in production, I'd be open to revisiting this, but for now let's limit it to k8s 1.10. |
We recently updated to etcd 3.1.13 and run kube 1.6.13. Since moving to the etcd v3 store we appear to have been hit hard by this. We are working our way towards 1.10.x but are a few months from getting there, especially since the api-server lockups are distracting us from moving forward. I see that #56114 should improve the problem. I was wondering whether we would get most of that behavior if we introduced a gRPC proxy from 3.2.11 (or newer) on the api-server host, in front of the etcd server. Digging through https://kubernetes.io/docs/imported/release/notes/ I see 'Upgrade to etcd client 3.2.13 and grpc 1.7.5 to improve HA etcd cluster stability. (#57480, ...)' and 'The supported etcd server version is 3.1.12, as compared to 3.0.17 in v1.9 (#60988)', so I see strong evidence that the client can talk to our etcd server. It seems viable to switch to the gRPC proxy to avoid this. I am looking for confirmation that the proxy would use the load balancer as updated in etcd-io/etcd#8840. |
Kubernetes version: 1.6.2
Etcd version: 3.1.8
Environment:
What happened:
We were changing the etcd configuration (flags election_timeout and heartbeat_interval)
We updated all of our etcd servers one at a time (we have 5)
We checked that the etcd cluster was healthy by issuing etcdctl cluster-health and etcdctl member list
Then kube-apiserver started to behave erratically, timing out on almost every request sent
In the apiserver logs we can see many lines like this:
To get the apiservers behaving correctly again, we had to restart the kube-apiserver service (just that one service, on all of our apiservers)
We did this twice, and the same thing happened both times. The cluster is in production, so we cannot risk a third outage to reproduce it again. Both times the behaviour was pretty consistent
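The procedure above gated each step on etcd health only. As a minimal sketch, the rolling update could also be gated on apiserver responsiveness after every member restart, which might surface the problem before all members have been touched. All names here (`rolling_update`, `restart`, `etcd_healthy`, `apiserver_healthy`) are hypothetical callables supplied by the operator; for instance, `etcd_healthy` could wrap the `etcdctl cluster-health` check mentioned above.

```python
import time


def rolling_update(members, restart, etcd_healthy, apiserver_healthy,
                   retries=30, wait_s=2.0, sleep=time.sleep):
    """Restart members one by one; return the list of members restarted.

    Raises RuntimeError as soon as a health gate does not pass within
    `retries` attempts, leaving the remaining members untouched.
    """
    done = []
    gates = (("etcd", etcd_healthy), ("apiserver", apiserver_healthy))
    for member in members:
        restart(member)
        done.append(member)
        for name, check in gates:
            for _ in range(retries):
                if check():
                    break
                sleep(wait_s)
            else:
                # The retry loop ran out without a healthy result.
                raise RuntimeError(
                    f"{name} unhealthy after restarting {member}; updated so far: {done}"
                )
    return done
```

Stopping at the first failed gate limits the change to the members already restarted and leaves the rest of the cluster on the old configuration while the apiserver is investigated.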
What you expected to happen:
That etcd would just update its configuration and the apiserver would never stop working
How to reproduce it (as minimally and precisely as possible):
Anything else we need to know:
Just ask :)