-
Notifications
You must be signed in to change notification settings - Fork 40k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Timeout on instances.NodeAddresses cloud provider request #62543
Timeout on instances.NodeAddresses cloud provider request #62543
Conversation
0b6b6ff
to
55e0b31
Compare
a call to the cloud provider wedging the kubelet is a problem. i wish the cloud provider apis had a defined timeout, but i recognize that is harder to push out. |
/kind bug |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you clarify on how 3m default was selected?
pkg/kubelet/kubelet.go
Outdated
klet.cloudproviderRequestParallelism = make(chan int, 1) | ||
klet.cloudproviderRequestSync = make(chan int) | ||
// TODO(jchaloup): Make it configurable via --cloud-provider-request-timeout | ||
klet.cloudproviderRequestTimeout = 3 * time.Minute |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how did you settle on 3m?
i feel like this should be less than the heartbeat status interval
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree, either 10s (status frequency) or 30s (10s less than the 40s timeout before the node is marked NotReady. I favor 10s.
There is also the fundamental issue of do we need to do this call on every status update. Seems like this information (hostame, internal ip, external ip) is pretty static.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how did you settle on 3m?
for testing purposes. I plan to make the timeout configurable. Or does it make more sense to hardcode it? 10s seems good.
select { | ||
case <-kl.cloudproviderRequestSync: | ||
case <-time.After(kl.cloudproviderRequestTimeout): | ||
err = fmt.Errorf("Timeout after %v", kl.cloudproviderRequestTimeout) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Might put this at V(2) and change to "timeout after %v trying to get instance information from cloud provider"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
glog.V(2).Infof("timeout after %v trying to get instance information from cloud provider", kl.cloudproviderRequestTimeout)
return nil
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks good
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
actually nevermind this whole thing. i didn't see the block after this and that we were just setting err here, not returning it.
55e0b31
to
478e3b8
Compare
@derekwaynecarr @sjenning changed the timeout to 10 seconds. |
@ingvagabund you might need to rebase this to resolve the test failures. The tests don't seem to be broken across all PRs, but they are repeatably broken for this PR. |
478e3b8
to
d34b50d
Compare
Just rebasing. I will check the failed tests afterwards. |
/test pull-kubernetes-bazel-test |
No idea why the bazel fails:
Any hints what to look for? |
Reproducible locally by running |
|
d34b50d
to
61efc29
Compare
/test pull-kubernetes-integration |
/test pull-kubernetes-local-e2e-containerized |
Most likely the |
@sjenning @derekwaynecarr PTAL |
@ingvagabund you can safely ignore the |
@ingvagabund: The following test failed, say
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here. |
@dims yeah, I checked the TestGrid and other PRs and it seems to be so true :) Thanks for verifying that. |
/lgtm |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: derekwaynecarr, ingvagabund The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
/test all [submit-queue is verifying that this PR is safe to merge] |
Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions here. |
On kubelet start-up some functions to set the node status are generated. One of those functions propagates the node addresses into the `Node` object the kubelet is responsible for (`.status.addresses`). The kube-apiserver uses these addresses to talk to the actual node. To identify the IP address of the node the kubelet communicates with the cloud provider. kubernetes/kubernetes#62543 introduced a timeout of 10s when trying to connect to the cloud. In case the IP cannot be determined within 10s, the `Node` object does not report an `InternalIP` address. Consequently, the kube-apiserver will never be able to talk to that node; particularly VPN won't work in case the vpn-shoot pod is scheduled on it. Once the connection failed, it is never retried, and only a kubelet process restart can trigger it again. Hence, our kubelet monitoring script will now do the same when it cannot find an `InternalIP` or an `ExternalIP` address on the `Node` object. closes #283
On kubelet start-up some functions to set the node status are generated. One of those functions propagates the node addresses into the `Node` object the kubelet is responsible for (`.status.addresses`). The kube-apiserver uses these addresses to talk to the actual node. To identify the IP address of the node the kubelet communicates with the cloud provider. kubernetes/kubernetes#62543 introduced a timeout of 10s when trying to connect to the cloud. In case the IP cannot be determined within 10s, the `Node` object does not report an `InternalIP` address. Consequently, the kube-apiserver will never be able to talk to that node; particularly VPN won't work in case the vpn-shoot pod is scheduled on it. Once the connection failed, it is never retried, and only a kubelet process restart can trigger it again. Hence, our kubelet monitoring script will now do the same when it cannot find an `InternalIP` or an `ExternalIP` address on the `Node` object. closes gardener#283
What this PR does / why we need it:
In cases the cloud provider does not respond before the node gets evicted.
Which issue(s) this PR fixes (optional, in
fixes #<issue number>(, fixes #<issue_number>, ...)
format, will close the issue(s) when PR gets merged):Fixes #
Special notes for your reviewer:
Release note: