Updating from v0.3.0 to v0.3.1 causing multi-node network issues #163
Can you distill this down? This is an absurdly large amount of logs, and we have no idea whatsoever what your test environment is like. What's the network config? Which CNI? What do the CNI logs say? It looks like only one pod is getting an address, but it's an absolute unknown what CNI you're using, whether it's finished reconciling, what (if anything) happened to the host OS, whether it's eBPF CNI or overlay or bridge, and so on. A brief overview of what the environment is supposed to look like is important.
hello @evol262
2- An app (busybox) is deployed to both nodes.
If you want to see the content of the "multinode-pod-dns-test.yaml", here it is:
3- The test then verifies that both nodes serve traffic.
The first problem is in this verification:
$ kubectl -- get pods -o jsonpath='{.items[*].metadata.name}'
$ kubectl -- exec busybox-6b86dd6d48-8vp4k -- nslookup kubernetes.default.svc.cluster.local: exit status 1 (187.571295ms)
-- /stdout --
I confirm this is causing the issue consistently since we upgraded cri-dockerd.
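For context, here is a rough Go sketch of the kind of check the failing test performs; the label selector (app=busybox) and pod names are illustrative assumptions, not the actual minikube test code:

```go
package main

import (
	"fmt"
	"os/exec"
	"strings"
)

// Sketch only: resolve the cluster DNS name from inside every busybox pod,
// the way the DeployApp2Nodes test does. Assumes kubectl is on PATH and
// already pointed at the multi-node cluster.
func main() {
	out, err := exec.Command("kubectl", "get", "pods", "-l", "app=busybox",
		"-o", "jsonpath={.items[*].metadata.name}").Output()
	if err != nil {
		panic(err)
	}
	for _, pod := range strings.Fields(string(out)) {
		// nslookup exits non-zero when the pod has no working network/DNS,
		// which is the failure reported above.
		cmd := exec.Command("kubectl", "exec", pod, "--",
			"nslookup", "kubernetes.default.svc.cluster.local")
		if err := cmd.Run(); err != nil {
			fmt.Printf("pod %s cannot resolve cluster DNS: %v\n", pod, err)
		} else {
			fmt.Printf("pod %s OK\n", pod)
		}
	}
}
```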
So, in summary, it seems like we deploy an app on two nodes, but only one node gets the app.
@medyagh Please actually verify this (leave the test environment up and actually check with the docker CLI or other). I have no idea what "default settings" means for a CNI. Presumably it is whatever the default is for minikube. We don't ship one. Do you know what CNI is in use? Failure to issue an address falls strictly in the domain of CNI IPAM, not cri-dockerd, and the fact that this is neither reproducible in testing nor reported by anyone using a version released over a month ago, other than in a "deploy with no CNI to assign addresses" test scenario, leads me to believe that the problem is somewhere there. I'm happy to help debug, but it really needs a lot more information.
@evol262 fair enough, the default CNI is actually kindnet (https://github.com/kubernetes-sigs/kind/tree/main/images/kindnetd/cmd/kindnetd) as seen in minikube
Again, please check the actual nodes. Or the CNI logs. Post back your findings, please. Pod IP assignment is strictly up to the CNI, and establishing whether the CNI thinks it's doing the right thing is crucial before we go ghost hunting to see whether a dep update for a CVE somehow broke kindnet and no other CNIs.
Sounds good. We will do some more investigation on our end to see why it happens on the Debian Linux CI machines, hopefully with more useful logs for you.
hello @evol262, we've investigated this issue further and can share our findings and details.
The context:
The problems we observed:
Please see the following screenshot capturing the problem: on the left-hand side, red rectangles mark Info messages from the cri-dockerd log, e.g.:
The corresponding kubelet/PLEG log around that same time (2023-03-19T20:53:16Z) is:
Similarly, on the right-hand side, the highlighted sections show that the pod has no IP address assigned (and no eth0 interface) yet, but it is marked as Ready, which implies it should be able to communicate, i.e., based on the Pod Lifecycle/Pod conditions:
According to the CRI Networking Specifications Requirements:
So, it looks like it is the CRI's responsibility to ensure that the pod has an IP address before marking it as Ready? I tried to dig a bit deeper, and I might be completely wrong, but sharing anyway; this code segment might be relevant: cri-dockerd/core/service_alpha.go Lines 33 to 42 in 6daf9ac
Here the v1Response is used before the error is checked, whereas ds.RunPodSandbox might return both a non-nil error and a non-nil response (perhaps here?). Also, the above Info log message comes from cri-dockerd's cri-dockerd/core/sandbox_helpers.go Lines 145 to 148 in 6daf9ac
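To make the suspected ordering concrete, here is a minimal, hypothetical Go sketch of that pattern; it is not the actual cri-dockerd code, and the type and function names are placeholders:

```go
package main

import (
	"errors"
	"fmt"
)

// Placeholder type standing in for the real CRI response type.
type runResponse struct{ PodSandboxID string }

// runPodSandbox stands in for ds.RunPodSandbox: it can return BOTH a
// non-nil response and a non-nil error (e.g. the sandbox exists but the
// CNI failed to assign an IP).
func runPodSandbox() (*runResponse, error) {
	return &runResponse{PodSandboxID: "abc123"}, errors.New("CNI: failed to assign pod IP")
}

func main() {
	resp, err := runPodSandbox()

	// Suspected pattern: the response is converted/used first...
	converted := fmt.Sprintf("alpha response for %s", resp.PodSandboxID)

	// ...and the error is only consulted afterwards (or not at all),
	// so the caller may treat the sandbox as successfully set up.
	if err != nil {
		fmt.Println("error reported late:", err)
	}
	fmt.Println(converted)
}
```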
For completeness, I'm attaching the logs from kubelet (log verbosity 7, to capture the PLEG logs) and cri-dockerd. I hope these more specific details help (happy to share more if needed). What are your thoughts, please?
I mean, the short answer here is "yes, but no". It's handed off to the CNI, which deals with IPAM after that, and the responsibility of the CRI is to raise any errors back up the stack, or cancel it after the timeout (220 seconds by default). If the plugin says "ok, done" and returns a result which is not an err, we believe it, and we have no reason not to.
Thanks, @evol262, for your quick reply! So if SetUpPod returns an error back to its caller, ds.RunPodSandbox, the caller might still return both a non-nil error and a non-nil response, and then the alpha RunPodSandbox might process the response before/without checking whether the error is non-nil (effectively ignoring the error)? cri-dockerd/core/service_alpha.go Lines 33 to 42 in 6daf9ac
Btw, our temporary workaround is to wait (retry) for the pod to get an IP address, which eventually happens, whereas we should ideally only need to rely on the Ready status. cri-dockerd/core/sandbox_helpers.go Lines 145 to 148 in 6daf9ac
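The workaround is roughly the following kind of polling loop; this is only a sketch (namespace, pod name, and timeout are illustrative, and it shells out to kubectl rather than reproducing the actual minikube test helper):

```go
package main

import (
	"fmt"
	"os/exec"
	"strings"
	"time"
)

// waitForPodIP polls until the pod reports an IP in its status or the
// timeout expires. Assumes kubectl is on PATH and configured for the cluster.
func waitForPodIP(namespace, pod string, timeout time.Duration) (string, error) {
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		out, err := exec.Command("kubectl", "-n", namespace, "get", "pod", pod,
			"-o", "jsonpath={.status.podIP}").Output()
		if ip := strings.TrimSpace(string(out)); err == nil && ip != "" {
			return ip, nil
		}
		time.Sleep(2 * time.Second)
	}
	return "", fmt.Errorf("pod %s/%s never got an IP within %v", namespace, pod, timeout)
}

func main() {
	ip, err := waitForPodIP("default", "busybox-6b86dd6d48-8vp4k", 3*time.Minute)
	if err != nil {
		panic(err)
	}
	fmt.Println("pod IP:", ip)
}
```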
Perhaps checking for IP[s] and [re]setting the status/state if needed between the two blocks here, or moving the IP check before the state is set, and then also checking it here?
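As a purely illustrative sketch of that suggestion (the type and helper below are hypothetical placeholders, not cri-dockerd's real internals): only report the sandbox as ready once the CNI has actually assigned at least one IP.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical stand-in for a sandbox status record.
type sandboxStatus struct {
	IPs   []string
	Ready bool
}

// finalizeSandboxStatus applies the suggested guard: if no IP was assigned,
// keep (or reset) the sandbox to not-ready and surface an error instead of
// silently marking it Ready.
func finalizeSandboxStatus(st *sandboxStatus) error {
	if len(st.IPs) == 0 {
		st.Ready = false
		return errors.New("sandbox has no IP address; not marking it ready")
	}
	st.Ready = true
	return nil
}

func main() {
	st := &sandboxStatus{} // the CNI has not assigned any IP yet
	if err := finalizeSandboxStatus(st); err != nil {
		fmt.Println(err)
	}
}
```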
Thanks! Merged. I'll cut a release early next week |
great, thanks! |
Reference kubernetes/minikube#15870
In kubernetes/minikube#15752 we updated cri-dockerd, which resulted in some consistently failing multi-node network tests, such as
DeployApp2Nodes: deploys an app to a multinode cluster and makes sure all nodes can serve traffic.
I see in the release notes that PR #147 updated a CNI dependency, which could be related to the problem.
Logs for the failing test can be seen here: https://storage.googleapis.com/minikube-builds/logs/15752/27907/Docker_Linux.html#fail_TestMultiNode/serial/DeployApp2Nodes