Wait for pod to exist up to 60 seconds before erroring #155
Conversation
The ContainerExecDecorator is used as part of the [...] So, the only cases I can see for the pod not being there are: [...]
In both cases, waiting won't change things. Am I missing something here?
I am using the pipeline steps correctly: I am calling the container step within my node step, which is within my podTemplate step. This patch definitely fixes the issue I am experiencing; without it, my pipeline will not succeed on some of my clusters. The error I was getting is here: https://gist.github.com/chancez/bd80345f8e84b2bc79b70c824f2e7cea

I think you are missing that Kubernetes works asynchronously: after creating a pod, it may not be immediately visible. The creation is asynchronous, so if you create a pod and query for it right away, it may not have shown up yet. In the failing case, the pod is created, both containers start and are ready, but the plugin fails because it doesn't see the pod by the time it checks. The same pipeline worked on existing clusters running Kubernetes 1.4, but not on my newest cluster running Kubernetes 1.5.5. I can literally launch my job, exec into both pods, and watch the logs, but Jenkins will fail, and my pod will get killed by the plugin after the job fails.

With this PR, I see it query once and not find the pod, and then on the 2nd query the pod does exist. I can see this using the log statements I added in this PR. It generally takes only 2 tries before the pod is visible, so it's almost certainly just a race condition between create + get. This is my pipeline: https://gist.github.com/chancez/b033ea166e5dea7a3862beb1ff86eb7d
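For illustration, here is a minimal sketch of that create-then-get race against the fabric8 kubernetes-client (the client library the plugin is built on); the pod name, namespace, and image here are made up for the example, not taken from the plugin:

```java
import io.fabric8.kubernetes.api.model.Pod;
import io.fabric8.kubernetes.api.model.PodBuilder;
import io.fabric8.kubernetes.client.DefaultKubernetesClient;
import io.fabric8.kubernetes.client.KubernetesClient;

// Hypothetical standalone reproduction of the create-then-get race.
public class CreateGetRace {
    public static void main(String[] args) {
        try (KubernetesClient client = new DefaultKubernetesClient()) {
            Pod pod = new PodBuilder()
                    .withNewMetadata().withName("race-demo").endMetadata()
                    .withNewSpec()
                        .addNewContainer().withName("main").withImage("busybox").endContainer()
                    .endSpec()
                    .build();
            client.pods().inNamespace("default").create(pod);
            // A GET issued immediately after the create can come back null if
            // it lands on an API server whose view has not observed the write
            // yet, even though the create call itself succeeded.
            Pod seen = client.pods().inNamespace("default").withName("race-demo").get();
            System.out.println(seen == null ? "pod not visible yet" : "pod visible");
        }
    }
}
```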
If the pod is not up and running, I can't see how anything within the [...] @carlossg: Your thoughts?
You're right, I'm a bit unclear myself after looking at the actual slave provisioning code + the callback. It looks pretty much correct to me, but I'm definitely seeing that https://github.com/jenkinsci/kubernetes-plugin/pull/155/files#diff-8a1ab34452b01252e89ffeee0b2eda45R163 returns null the first time, basically always. One thing I noticed is that https://github.com/jenkinsci/kubernetes-plugin/blob/master/src/main/java/org/csanchez/jenkins/plugins/kubernetes/KubernetesCloud.java#L655-L658 sleeps before it even tries to query the pod, so I bet you would see a null pod there if you moved the sleep after the query. Not that it changes anything in this scenario, but I suspect that sleep is hiding a race in some cases, so I wouldn't be surprised if there were more races.
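To make the ordering point concrete, here is a hypothetical helper that queries before sleeping, so the first null result would actually be observed rather than masked by a leading delay (the names, retry count, and interval are illustrative, not the plugin's actual code):

```java
import io.fabric8.kubernetes.api.model.Pod;
import io.fabric8.kubernetes.client.KubernetesClient;

class PodLookup {
    // Query first, sleep only after a failed lookup. A fixed sleep *before*
    // the first query usually gives the API server enough time to make the
    // pod visible, which is how it can hide the race discussed above.
    static Pod waitForPod(KubernetesClient client, String ns, String name,
                          int retries) throws InterruptedException {
        for (int i = 0; i < retries; i++) {
            Pod pod = client.pods().inNamespace(ns).withName(name).get();
            if (pod != null) {
                return pod; // visible on the first try when there is no race
            }
            Thread.sleep(1000); // back off only after seeing a null result
        }
        return null;
    }
}
```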
I think it is reasonable to wait until the pod is not null, if k8s returns that.
When I use the Kubernetes cloud to build, it needs at least 8s, waiting 6s until k8s creates the pod.
I've learned a little more about how the API servers work and found more evidence of why this might be happening. Each API server does a lot of caching of results from etcd, and running multiple API servers makes this more likely to happen: a resource may exist after creation but not yet be in the cache of every API server, which causes delays in seeing resources despite them existing.
Most actions that create resources in Kubernetes are asynchronous, because the
controller-manager needs to update resources and perform extra work
asynchronously after an API call is made to create a given resource. It is
therefore possible to do a POST + GET and not see the resource just POSTed.

The current code incorrectly assumed that if the CreatePod API call succeeded,
the pod would exist by the time we wait for containers to become ready.
This pull request updates the logic to query for the pod multiple times before
deciding it doesn't exist.
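In outline, the behavior this pull request adds looks roughly like the following sketch (a 60-second deadline per the title; the method name and one-second poll interval are assumptions, not the exact diff):

```java
import java.util.concurrent.TimeUnit;

import io.fabric8.kubernetes.api.model.Pod;
import io.fabric8.kubernetes.client.KubernetesClient;

class PodExistenceCheck {
    // Poll for the pod until it becomes visible or the deadline passes,
    // instead of failing on the first null GET.
    static Pod waitUntilPodExists(KubernetesClient client, String ns, String name)
            throws InterruptedException {
        long deadline = System.nanoTime() + TimeUnit.SECONDS.toNanos(60);
        while (System.nanoTime() < deadline) {
            Pod pod = client.pods().inNamespace(ns).withName(name).get();
            if (pod != null) {
                return pod;
            }
            // The create was already accepted; give the API servers time
            // to converge before the next lookup.
            TimeUnit.SECONDS.sleep(1);
        }
        throw new IllegalStateException(
                "Pod " + ns + "/" + name + " not visible after 60 seconds");
    }
}
```

Given the observation above that the pod is usually visible on the second query, a bounded loop like this turns the transient null into at most a one-second delay while still failing clearly if the pod truly never appears.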