Updating from v0.3.0 to v0.3.1 causing multi-node network issues #163
Can you distill this down? This is an absurdly large amount of logs, and we have no idea whatsoever what your test environment is like. What's the network config? Which CNI? What do the CNI logs say? It looks like only one pod is getting an address, but it's an absolute unknown what CNI you're using, whether it's finished reconciling, what (if anything) happened to the host OS, whether it's eBPF CNI or overlay or bridge, and so on. A brief overview of what the environment is supposed to look like is important.
hello @evol262
2- An app (busybox) is deployed to both nodes.
If you want to see the content of the "multinode-pod-dns-test.yaml", here it is:
3- The test then verifies that both nodes serve traffic.
The first problem is in this verification:
$ kubectl -- get pods -o jsonpath='{.items[*].metadata.name}'
$ kubectl -- exec busybox-6b86dd6d48-8vp4k -- nslookup kubernetes.default.svc.cluster.local: exit status 1 (187.571295ms)
-- /stdout --
I confirm this is causing the issue consistently since we upgraded cri-dockerd.
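For context, here is a rough Go sketch of the kind of check the failing test performs; the label selector (app=busybox) and pod names are illustrative assumptions, not the actual minikube test code:

```go
package main

import (
	"fmt"
	"os/exec"
	"strings"
)

// Sketch only: resolve the cluster DNS name from inside every busybox pod,
// the way the DeployApp2Nodes test does. Assumes kubectl is on PATH and
// already pointed at the multi-node cluster.
func main() {
	out, err := exec.Command("kubectl", "get", "pods", "-l", "app=busybox",
		"-o", "jsonpath={.items[*].metadata.name}").Output()
	if err != nil {
		panic(err)
	}
	for _, pod := range strings.Fields(string(out)) {
		// nslookup exits non-zero when the pod has no working network/DNS,
		// which is the failure reported above.
		cmd := exec.Command("kubectl", "exec", pod, "--",
			"nslookup", "kubernetes.default.svc.cluster.local")
		if err := cmd.Run(); err != nil {
			fmt.Printf("pod %s cannot resolve cluster DNS: %v\n", pod, err)
		} else {
			fmt.Printf("pod %s OK\n", pod)
		}
	}
}
```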
So, in summary, it seems like we deploy an app on two nodes, but only one node gets the app.
@medyagh Please actually verify this (leave the test environment up and actually check with the docker CLI or other). I have no idea what "default settings" means for a CNI. Presumably it is whatever the default is for minikube. We don't ship one. Do you know what CNI is in use? Failure to issue an address falls strictly in the domain of CNI IPAM, not cri-dockerd, and the fact that this is neither reproducible in testing nor reported by anyone using a version released over a month ago, other than in a "deploy with no CNI to assign addresses" test scenario, leads me to believe that the problem is somewhere there. I'm happy to help debug, but it really needs a lot more information.
@evol262 fair enough, the default CNI is actually kindnet (https://github.com/kubernetes-sigs/kind/tree/main/images/kindnetd/cmd/kindnetd) as seen in minikube
Again, please check the actual nodes. Or the CNI logs. Post back your findings, please. Pod IP assignment is strictly up to the CNI, and establishing whether the CNI thinks it's doing the right thing is crucial before we go ghost hunting to see whether a dep update for a CVE somehow broke kindnet and no other CNIs.
Sounds good. We will do some more investigation on our end to see why it happens on the Debian Linux CI machines, hopefully with more useful logs for you.
hello @evol262, we've investigated this issue further and can share our findings and details.
The context:
The problems we observed:
Please see the following screenshot capturing the problem: on the left-hand side, red rectangles mark Info messages from the cri-dockerd log, e.g.:
The corresponding kubelet/PLEG log around that same time (2023-03-19T20:53:16Z) is:
Similarly, on the right-hand side, the highlighted sections show that the pod has no IP address assigned (and no eth0 interface) yet, but it is marked as Ready, which implies it should be able to communicate, i.e., based on the Pod Lifecycle/Pod conditions:
According to the CRI Networking Specifications Requirements:
So, it looks like it is the CRI's responsibility to ensure that the pod has an IP address before marking it as Ready? I tried to dig a bit deeper, and I might be completely wrong, but sharing anyway; this code segment might be relevant: cri-dockerd/core/service_alpha.go Lines 33 to 42 in 6daf9ac
Here the v1Response is used before the error is checked, whereas ds.RunPodSandbox might return both a non-nil error and a non-nil response (perhaps here?). Also, the above Info log message comes from cri-dockerd's cri-dockerd/core/sandbox_helpers.go Lines 145 to 148 in 6daf9ac
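To make the suspected ordering concrete, here is a minimal, hypothetical Go sketch of that pattern; it is not the actual cri-dockerd code, and the type and function names are placeholders:

```go
package main

import (
	"errors"
	"fmt"
)

// Placeholder type standing in for the real CRI response type.
type runResponse struct{ PodSandboxID string }

// runPodSandbox stands in for ds.RunPodSandbox: it can return BOTH a
// non-nil response and a non-nil error (e.g. the sandbox exists but the
// CNI failed to assign an IP).
func runPodSandbox() (*runResponse, error) {
	return &runResponse{PodSandboxID: "abc123"}, errors.New("CNI: failed to assign pod IP")
}

func main() {
	resp, err := runPodSandbox()

	// Suspected pattern: the response is converted/used first...
	converted := fmt.Sprintf("alpha response for %s", resp.PodSandboxID)

	// ...and the error is only consulted afterwards (or not at all),
	// so the caller may treat the sandbox as successfully set up.
	if err != nil {
		fmt.Println("error reported late:", err)
	}
	fmt.Println(converted)
}
```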
For completeness, I'm attaching the logs from kubelet (log verbosity 7, to capture the PLEG logs) and cri-dockerd. I hope these more specific details help (happy to share more if needed). What are your thoughts, please?
I mean, the short answer here is "yes, but no". It's handed off to the CNI, which deals with IPAM after that, and the responsibility of the CRI is to raise any errors back up the stack, or cancel it after the timeout (220 seconds by default). If the plugin says "ok, done" and returns a result which is not an err, we believe it, and we have no reason not to.
Thanks, @evol262, for your quick reply! So if SetUpPod returns an error back to its caller, ds.RunPodSandbox, the caller might still return both a non-nil error and a non-nil response, and then the alpha RunPodSandbox might process the response before/without checking whether the error is non-nil (effectively ignoring the error)? cri-dockerd/core/service_alpha.go Lines 33 to 42 in 6daf9ac
Btw, our temporary workaround is to wait (retry) for the pod to get an IP address, which eventually happens, whereas we should ideally only need to rely on the Ready status. cri-dockerd/core/sandbox_helpers.go Lines 145 to 148 in 6daf9ac
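The workaround is roughly the following kind of polling loop; this is only a sketch (namespace, pod name, and timeout are illustrative, and it shells out to kubectl rather than reproducing the actual minikube test helper):

```go
package main

import (
	"fmt"
	"os/exec"
	"strings"
	"time"
)

// waitForPodIP polls until the pod reports an IP in its status or the
// timeout expires. Assumes kubectl is on PATH and configured for the cluster.
func waitForPodIP(namespace, pod string, timeout time.Duration) (string, error) {
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		out, err := exec.Command("kubectl", "-n", namespace, "get", "pod", pod,
			"-o", "jsonpath={.status.podIP}").Output()
		if ip := strings.TrimSpace(string(out)); err == nil && ip != "" {
			return ip, nil
		}
		time.Sleep(2 * time.Second)
	}
	return "", fmt.Errorf("pod %s/%s never got an IP within %v", namespace, pod, timeout)
}

func main() {
	ip, err := waitForPodIP("default", "busybox-6b86dd6d48-8vp4k", 3*time.Minute)
	if err != nil {
		panic(err)
	}
	fmt.Println("pod IP:", ip)
}
```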
Perhaps checking for IP[s] and [re]setting the status/state if needed between the two blocks here, or moving the IP check before the state is set, and then also checking it here?
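As a purely illustrative sketch of that suggestion (the type and helper below are hypothetical placeholders, not cri-dockerd's real internals): only report the sandbox as ready once the CNI has actually assigned at least one IP.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical stand-in for a sandbox status record.
type sandboxStatus struct {
	IPs   []string
	Ready bool
}

// finalizeSandboxStatus applies the suggested guard: if no IP was assigned,
// keep (or reset) the sandbox to not-ready and surface an error instead of
// silently marking it Ready.
func finalizeSandboxStatus(st *sandboxStatus) error {
	if len(st.IPs) == 0 {
		st.Ready = false
		return errors.New("sandbox has no IP address; not marking it ready")
	}
	st.Ready = true
	return nil
}

func main() {
	st := &sandboxStatus{} // the CNI has not assigned any IP yet
	if err := finalizeSandboxStatus(st); err != nil {
		fmt.Println(err)
	}
}
```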
Thanks! Merged. I'll cut a release early next week |
great, thanks! |
Reference kubernetes/minikube#15870
In kubernetes/minikube#15752 we updated cri-dockerd, which resulted in some consistently failing multi-node network tests, such as
DeployApp2Nodes: deploys an app to a multinode cluster and makes sure all nodes can serve traffic.
I see in the release notes that PR #147 updated a CNI dependency, which could be related to the problem.
Logs for the failing test can be seen here: https://storage.googleapis.com/minikube-builds/logs/15752/27907/Docker_Linux.html#fail_TestMultiNode/serial/DeployApp2Nodes