
Weave Net Daemonset fails to restart pod due to existing dummy interface #3414

Open
0x2c7 opened this issue Sep 26, 2018 · 16 comments · May be fixed by #3459

@0x2c7

0x2c7 commented Sep 26, 2018

What you expected to happen?

The Weave Net DaemonSet, which controls the weave-net pod on each node, should be able to restart the pod on a node when it fails or stops for any reason.

What happened?

The weave-net pod goes into Error and CrashLoopBackOff status and is unable to function again until I terminate that node.

How to reproduce it?

SSH into a node and use the docker command to kill the weave-net container. Of course, this is just to reproduce the issue; in fact, on our production cluster we sometimes hit situations where weave-net crashes on a node and we don't know why.

The logs show that weave-net fails to create the dummy interface:

FATA: 2018/09/26 05:08:04.369497 creating dummy interface: file exists

I did a small investigation, and it looks like the bug comes from net/bridge.go, in the function initPrep, at these lines:

	dummy := &netlink.Dummy{LinkAttrs: netlink.NewLinkAttrs()}
	dummy.LinkAttrs.Name = "vethwedu"
	if err = netlink.LinkAdd(dummy); err != nil {
		return errors.Wrap(err, "creating dummy interface")
	}

During startup, weave-net creates this dummy interface, and when my pod restarts, the interface already exists, as checked with the ip link | grep vethwedu command:

96782: vethwedu: <BROADCAST,NOARP> mtu 1376 qdisc noop state DOWN mode DEFAULT group default qlen 1000

It looks like the previous weave-net session failed to delete this dummy interface, or was killed before deleting it. When I delete the dummy manually with ip link delete vethwedu, the pod runs smoothly and is back to normal.

Adding a small check that deletes the dummy if it already exists, before creating a new one, would solve this problem. Is that a good solution? If so, I'll open a PR.

	if existingDummy, err := netlink.LinkByName("vethwedu"); err == nil {
		if err := netlink.LinkDel(existingDummy); err != nil {
			return errors.Wrap(err, "deleting existing dummy interface")
		}
	}
	// ...

Anything else we need to know?

We run our Kubernetes cluster on AWS, using kops.

Versions:

$ weave version: 2.4.1

$ docker version
Client:
 Version:      17.03.2-ce
 API version:  1.27
 Go version:   go1.7.5
 Git commit:   f5ec1e2
 Built:        Tue Jun 27 02:09:56 2017
 OS/Arch:      linux/amd64

$ uname -a
Linux ip-172-50-52-229 4.4.121-k8s #1 SMP Sun Mar 11 19:39:47 UTC 2018 x86_64 GNU/Linux

$ kubectl version
Server Version: version.Info{Major:"1", Minor:"9", GitVersion:"v1.9.8", GitCommit:"c138b85178156011dc934c2c9f4837476876fb07", GitTreeState:"clean", BuildDate:"2018-05-21T18:53:18Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}

Logs:

$ kubectl logs -n kube-system <weave-net-pod> weave

DEBU: 2018/09/26 05:08:04.266053 [kube-peers] Checking peer "7a:1f:1c:b2:b7:7e" against list &{[{ea:43:25:46:a3:95 ip-172-50-61-167.ap-southeast-2.compute.internal} {16:ac:8c:71:55:20 ip-172-50-80-215.ap-southeast-2.compute.internal} {3e:af:c2:26:08:00 ip-172-50-110-136.ap-southeast-2.compute.internal} {2e:04:f7:c2:42:71 ip-172-50-32-17.ap-southeast-2.compute.internal} {32:4d:7c:65:31:8d ip-172-50-50-186.ap-southeast-2.compute.internal} {ae:92:06:fd:b3:e7 ip-172-50-93-2.ap-southeast-2.compute.internal} {7e:69:df:8f:8f:17 ip-172-50-41-197.ap-southeast-2.compute.internal} {62:aa:0a:e8:65:96 ip-172-50-109-73.ap-southeast-2.compute.internal} {f6:30:3e:1d:4b:8b ip-172-50-113-192.ap-southeast-2.compute.internal} {7a:1f:1c:b2:b7:7e ip-172-50-52-229.ap-southeast-2.compute.internal} {12:b7:c5:3d:f4:82 ip-172-50-67-61.ap-southeast-2.compute.internal} {aa:0d:6b:9c:56:9b ip-172-50-44-14.ap-southeast-2.compute.internal} {82:e9:f1:ce:c5:29 ip-172-50-58-155.ap-southeast-2.compute.internal} {26:a8:11:0d:76:e2 ip-172-50-33-242.ap-southeast-2.compute.internal}]}
INFO: 2018/09/26 05:08:04.297726 Command line options: map[ipalloc-range:100.96.0.0/11 port:6783 docker-api: expect-npc:true host-root:/host nickname:ip-172-50-52-229.ap-southeast-2.compute.internal conn-limit:100 db-prefix:/weavedb/weave-net http-addr:127.0.0.1:6784 metrics-addr:0.0.0.0:6782 name:7a:1f:1c:b2:b7:7e datapath:datapath ipalloc-init:consensus=14 no-dns:true]
INFO: 2018/09/26 05:08:04.297772 weave  2.3.0
FATA: 2018/09/26 05:08:04.369497 creating dummy interface: file exists
@murali-reddy
Contributor

Thanks @nguyenquangminh0711 for reporting this issue.

It's strange that your cluster ended up in this situation. As you may have seen in the code, the dummy interface is created and immediately deleted. Do you happen to see any reason in the weave pod logs why the dummy interface failed to get deleted?

It's a corner case, but it's good to handle it gracefully. Please submit the PR. Maybe you can simplify it to:

if err = netlink.LinkAdd(dummy); err != nil && err != syscall.EEXIST {
	return errors.Wrap(err, "creating dummy interface")
}

@0x2c7
Author

0x2c7 commented Sep 26, 2018

Hi @murali-reddy, thanks for your quick reply. I have just captured the logs of the previous session:

FATA: 2018/09/26 06:41:38.227346 setting dummy interface master: exchange full

It looks like it crashes at these lines, right before the dummy interface is deleted. That's why the interface is left behind for the next restart.

	if err := netlink.LinkSetMasterByIndex(dummy, b.bridge.Attrs().Index); err != nil {
		return errors.Wrap(err, "setting dummy interface master")
	}
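
For context, "exchange full" is the message for the EXFULL errno, which the bridge code returns when it has no free port numbers left for new attachments. As a rough illustration only (not Weave's actual code), assuming the same b and dummy variables as the snippet above and that the netlink library surfaces the raw errno, that case could be given a clearer message:

	if err := netlink.LinkSetMasterByIndex(dummy, b.bridge.Attrs().Index); err != nil {
		// EXFULL ("exchange full") means the bridge has run out of port slots,
		// i.e. too many veths are already attached to it.
		if err == syscall.EXFULL {
			return errors.Wrap(err, "weave bridge has no free ports left for the dummy interface")
		}
		return errors.Wrap(err, "setting dummy interface master")
	}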

@rade
Member

rade commented Sep 26, 2018

The LinkDel in that code really should be done in a defer right after creation. Though we still need to handle the presence of the interface gracefully in any case, since a tiny window will remain in which the interface would survive if the process is killed.
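
As an illustration of both points (tolerating a leftover interface and cleaning up via defer), here is a minimal sketch in the same initPrep context as the snippets above; this is not the eventual patch:

	// Remove any dummy left over from a previous run that crashed before cleanup.
	if old, err := netlink.LinkByName("vethwedu"); err == nil {
		if err := netlink.LinkDel(old); err != nil {
			return errors.Wrap(err, "deleting leftover dummy interface")
		}
	}
	dummy := &netlink.Dummy{LinkAttrs: netlink.NewLinkAttrs()}
	dummy.LinkAttrs.Name = "vethwedu"
	if err := netlink.LinkAdd(dummy); err != nil {
		return errors.Wrap(err, "creating dummy interface")
	}
	// Deleting in a defer means the dummy disappears even when a later step
	// (such as LinkSetMasterByIndex) fails, shrinking the leak window.
	defer func() {
		_ = netlink.LinkDel(dummy) // best-effort cleanup
	}()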

@rade
Member

rade commented Sep 26, 2018

Maybe you can simplify ...

Just call LinkAddIfNotExist instead.
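
(I have not checked the exact helper referred to here; presumably it wraps netlink.LinkAdd and treats "file exists" as success. A hypothetical sketch of that pattern, for illustration only:)

	// Hypothetical helper sketch: add a link, tolerating one that already exists
	// with the same name (netlink surfaces the kernel error as syscall.EEXIST).
	func linkAddIfNotExist(link netlink.Link) error {
		if err := netlink.LinkAdd(link); err != nil && err != syscall.EEXIST {
			return errors.Wrap(err, "creating link")
		}
		return nil
	}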

@brb
Contributor

brb commented Sep 26, 2018

FATA: 2018/09/26 06:41:38.227346 setting dummy interface master: exchange full

This indicates that there were 1024 veth pairs connected to the weave bridge. @nguyenquangminh0711 How many pods did you run on the failing node?
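
To see how close a node is to that limit, it is enough to count the interfaces whose master is the weave bridge. A small standalone sketch using the vishvananda/netlink package (the same library Weave uses), roughly equivalent to ip -o link | grep 'master weave', written here for illustration:

	package main

	import (
		"fmt"
		"log"

		"github.com/vishvananda/netlink"
	)

	func main() {
		// Count links attached to the "weave" bridge.
		bridge, err := netlink.LinkByName("weave")
		if err != nil {
			log.Fatalf("looking up weave bridge: %v", err)
		}
		links, err := netlink.LinkList()
		if err != nil {
			log.Fatalf("listing links: %v", err)
		}
		attached := 0
		for _, link := range links {
			if link.Attrs().MasterIndex == bridge.Attrs().Index {
				attached++
			}
		}
		fmt.Printf("%d interfaces attached to the weave bridge\n", attached)
	}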

@0x2c7
Author

0x2c7 commented Sep 26, 2018

@brb Not many, just 41 pods on that node. And most of the pods are stuck in ContainerCreating because they cannot be started.

@brb
Contributor

brb commented Sep 26, 2018

@nguyenquangminh0711 Can you run ip -o link from the node?

@0x2c7
Author

0x2c7 commented Sep 26, 2018

@brb When running your command, there are 1023 items. It looks like the count would exceed the limit of 1024 on a new restart. The details of the ip links:
https://gist.github.com/nguyenquangminh0711/4dcb3926b6b8e00255e34cf9c5c7bfa6

Btw, I'm noticing that pods on other nodes are starting to restart randomly, although they haven't collapsed completely like the node I'm investigating.

[Screenshot attached: screen shot 2018-09-26 at 22 07 57]

@brb
Contributor

brb commented Sep 27, 2018

@nguyenquangminh0711 Thanks. Can you paste the kubelet logs?

@0x2c7
Author

0x2c7 commented Sep 28, 2018

@brb I fetched the logs from the nodes, including kube-controller-manager, kube-proxy, and kube-scheduler of the failing node. That's all I can find in /var/log. Not sure whether that's what you are asking for? https://gist.github.com/nguyenquangminh0711/39835eedd752986a46dc8554832d2ffc

I observe a lot of failures when the node tries to read from etcd. I don't know whether that relates to the issue.

@brb
Contributor

brb commented Oct 1, 2018

@nguyenquangminh0711 Do you use systemd? If yes, can you run journalctl -r -u kubelet on the worker node with the 1023 interfaces? Otherwise, the logs should be located in /var/log/kubelet.log.

I'm interested to see whether there were any errors reported by kubelet when removing the interfaces via CNI.

@0x2c7
Author

0x2c7 commented Oct 4, 2018

Hi @brb, sorry for my late response. Our logs are centralized and not indexed properly, so I had to take some extra steps to extract them. Here are the logs of the failed node from startup until it was killed:

https://gist.githubusercontent.com/nguyenquangminh0711/39dceff835e0d39e365c29dc1d958118/raw/fee8fc835cfc71e9fb9e32936b7d0abe8afaf275/Kubelet%2520logs

It looks like before it goes into the CrashLoop situation, there are tons of log entries like this:

2018-09-25 16:49:43 +0700 kernel: [99825.596071] weave: port 379(vethwepla47d508) entered forwarding state
2018-09-25 16:49:43 +0700 kernel: [99826.012090] weave: port 380(vethwepl7c826e6) entered forwarding state
2018-09-25 16:49:46 +0700 kernel: [99827.706206] weave: port 391(vethwepl1646259) entered forwarding state
2018-09-25 16:49:46 +0700 kernel: [99827.713079] weave: port 391(vethwepl1646259) entered forwarding state
2018-09-25 16:49:46 +0700 kernel: [99827.706206] weave: port 391(vethwepl1646259) entered forwarding state
2018-09-25 16:49:46 +0700 kernel: [99827.713079] weave: port 391(vethwepl1646259) entered forwarding state
2018-09-25 16:49:46 +0700 kernel: [99827.706206] weave: port 391(vethwepl1646259) entered forwarding state
2018-09-25 16:49:46 +0700 kernel: [99827.713079] weave: port 391(vethwepl1646259) entered forwarding state
2018-09-25 16:49:48 +0700 kernel: [99829.180077] weave: port 381(vethweplcdc27c4) entered forwarding state
2018-09-25 16:49:48 +0700 kernel: [99829.180077] weave: port 381(vethweplcdc27c4) entered forwarding state
2018-09-25 16:49:48 +0700 kernel: [99829.180077] weave: port 381(vethweplcdc27c4) entered forwarding state
2018-09-25 16:49:49 +0700 kernel: [99830.044074] weave: port 382(vethwepl2ca4be2) entered forwarding state
2018-09-25 16:49:49 +0700 kernel: [99830.684073] weave: port 383(vethweple32abc0) entered forwarding state
2018-09-25 16:49:49 +0700 kernel: [99831.260076] weave: port 384(vethwepl9b74577) entered forwarding state
2018-09-25 16:49:49 +0700 kernel: [99831.900134] weave: port 385(vethwepl4101de7) entered forwarding state
2018-09-25 16:49:49 +0700 kernel: [99830.044074] weave: port 382(vethwepl2ca4be2) entered forwarding state
2018-09-25 16:49:49 +0700 kernel: [99830.684073] weave: port 383(vethweple32abc0) entered forwarding state
2018-09-25 16:49:49 +0700 kernel: [99831.260076] weave: port 384(vethwepl9b74577) entered forwarding state
2018-09-25 16:49:49 +0700 kernel: [99831.900134] weave: port 385(vethwepl4101de7) entered forwarding state
2018-09-25 16:49:49 +0700 kernel: [99830.044074] weave: port 382(vethwepl2ca4be2) entered forwarding state
2018-09-25 16:49:49 +0700 kernel: [99830.684073] weave: port 383(vethweple32abc0) entered forwarding state
2018-09-25 16:49:49 +0700 kernel: [99831.260076] weave: port 384(vethwepl9b74577) entered forwarding state
2018-09-25 16:49:49 +0700 kernel: [99831.900134] weave: port 385(vethwepl4101de7) entered forwarding state
2018-09-25 16:49:50 +0700 kernel: [99832.096068] weave: port 386(vethwepl8e2a2b2) entered forwarding state
2018-09-25 16:49:50 +0700 kernel: [99832.284074] weave: port 387(vethwepl2e24511) entered forwarding state
2018-09-25 16:49:50 +0700 kernel: [99832.096068] weave: port 386(vethwepl8e2a2b2) entered forwarding state
2018-09-25 16:49:50 +0700 kernel: [99832.284074] weave: port 387(vethwepl2e24511) entered forwarding state
2018-09-25 16:49:50 +0700 kernel: [99832.096068] weave: port 386(vethwepl8e2a2b2) entered forwarding state
2018-09-25 16:49:50 +0700 kernel: [99832.284074] weave: port 387(vethwepl2e24511) entered forwarding state
2018-09-25 16:49:51 +0700 kernel: [99832.540077] weave: port 388(vethwepl20f2d0e) entered forwarding state
2018-09-25 16:49:51 +0700 kernel: [99832.860073] weave: port 389(vethwepl9d3a07a) entered forwarding state
2018-09-25 16:49:51 +0700 kernel: [99833.020105] weave: port 390(vethweplb3ecd60) entered forwarding state
2018-09-25 16:49:51 +0700 kernel: [99832.540077] weave: port 388(vethwepl20f2d0e) entered forwarding state
2018-09-25 16:49:51 +0700 kernel: [99832.860073] weave: port 389(vethwepl9d3a07a) entered forwarding state
2018-09-25 16:49:51 +0700 kernel: [99833.020105] weave: port 390(vethweplb3ecd60) entered forwarding state
2018-09-25 16:49:51 +0700 kernel: [99832.540077] weave: port 388(vethwepl20f2d0e) entered forwarding state
2018-09-25 16:49:51 +0700 kernel: [99832.860073] weave: port 389(vethwepl9d3a07a) entered forwarding state
2018-09-25 16:49:51 +0700 kernel: [99833.020105] weave: port 390(vethweplb3ecd60) entered forwarding state
2018-09-25 16:49:54 +0700 kernel: [99834.764564] weave: port 392(vethwepl194c895) entered forwarding state
2018-09-25 16:49:54 +0700 kernel: [99834.771644] weave: port 392(vethwepl194c895) entered forwarding state

@brb
Contributor

brb commented Oct 8, 2018

@nguyenquangminh0711 Thanks.

I was hoping to see more weave-cni del errors which would explain why there were so many veths attached to the bridge.

Anyway, the entered forwarding state log entries at the bottom indicate that there were new containers starting. Each veth name consists of vethwepl + a container short id. Can you find those containers in your system?

@iAnomaly

iAnomaly commented Nov 21, 2018

@brb We're experiencing this on 2.4.1, although I think we saw it on 2.4.0 as well.

From one of the affected worker nodes:

$ ip -o link|wc -l
1032

I have attached the output of ip -o link and the last 15 minutes of sudo journalctl -ru kubelet from the affected node.

ip -o link.txt
kubelet.tar.gz

@murali-reddy
Contributor

@iAnomaly Looking at the ip -o link output you shared, I do see the dummy interface, which should have been deleted.

222911: vethwedu: <BROADCAST,NOARP> mtu 8912 qdisc noop state DOWN mode DEFAULT group default qlen 1000\    link/ether 02:9a:c1:2b:07:4a brd ff:ff:ff:ff:ff:ff

The kubelet logs indicate that weave is not running. I suspect the weave pod went into CrashLoopBackOff on the node where you saw the above behaviour?

Nov 21 16:28:04 ip-172-20-60-206 kubelet[10866]: E1121 16:28:04.542329   10866 cni.go:259] Error adding network: unable to allocate IP address: Post http://127.0.0.1:6784/ip/2c535c623770ff2b389523c3f5aed82d3d76b05e30b88f1bbf1e3284a124221b: dial tcp 127.0.0.1:6784: connect:         connection refused

This is only a side effect. The root cause is the dummy interface, which should have been deleted after it was created by Weave. Do you by any chance have old Weave pod logs that could indicate why the vethwedu interface was not deleted?

@murali-reddy
Contributor

murali-reddy commented Nov 22, 2018

As pointed out, the dummy interface deletion should have been done in a defer block in the code path below.

https://github.com/weaveworks/weave/blob/v2.5.0/net/bridge.go#L332-L343

Let me know if you are still interested in raising a PR, @nguyenquangminh0711. Otherwise I will raise one.
