
Multi-host Ignite VM networking based on Weave Net not working as expected (reopening issue #628) #642

Closed
mdundek opened this issue Jul 13, 2020 · 9 comments · Fixed by #645
Labels
area/networking (Issues related to networking), kind/support (Categorizes the issue as related to support questions)

Comments

mdundek commented Jul 13, 2020

Hello WeaveWorks team,

I am reopening issue #628: the applied fix made things slightly better, but it did not fix the underlying issue.
I have two hosts, and on both I run the Weave Net CNI Docker image as described in issue #628. I then installed Ignite on both hosts and started an Ignite VM on each host with the flag --network-plugin cni; a rough sketch of the commands follows.
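
For reference, a rough sketch of the setup on each host (the exact Weave Net launch commands are in issue #628; the peer IPs here are my two hosts and the VM image is just an example):

$ curl -L git.io/weave -o /usr/local/bin/weave && chmod +x /usr/local/bin/weave
$ weave launch 192.168.68.148 192.168.68.131   # peer list: both hosts
$ sudo ignite run weaveworks/ignite-ubuntu --network-plugin cni --ssh
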
On each host, the output of ifconfig is as follows (I filtered out the other network interfaces that are not relevant to this issue):

ignite0: flags=4419<UP,BROADCAST,RUNNING,PROMISC,MULTICAST>  mtu 1500
        inet 10.61.0.1  netmask 255.255.0.0  broadcast 10.61.255.255
        inet6 fe80::dca5:f0ff:fed9:7481  prefixlen 64  scopeid 0x20<link>
        ether de:a5:f0:d9:74:81  txqueuelen 1000  (Ethernet)
        RX packets 189  bytes 17839 (17.8 KB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 201  bytes 20068 (20.0 KB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

vethd0e92add: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet6 fe80::14e9:28ff:fed4:5cda  prefixlen 64  scopeid 0x20<link>
        ether 16:e9:28:d4:5c:da  txqueuelen 0  (Ethernet)
        RX packets 189  bytes 20485 (20.4 KB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 230  bytes 23476 (23.4 KB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

vethwe-bridge: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1376
        inet6 fe80::ac38:4ff:feac:c4f3  prefixlen 64  scopeid 0x20<link>
        ether ae:38:04:ac:c4:f3  txqueuelen 0  (Ethernet)
        RX packets 196  bytes 22470 (22.4 KB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 92  bytes 10557 (10.5 KB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

vethwe-datapath: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1376
        inet6 fe80::f8ae:8eff:feef:6077  prefixlen 64  scopeid 0x20<link>
        ether fa:ae:8e:ef:60:77  txqueuelen 0  (Ethernet)
        RX packets 92  bytes 10557 (10.5 KB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 196  bytes 22470 (22.4 KB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

vxlan-6784: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 65535
        inet6 fe80::f067:50ff:fe4f:45f8  prefixlen 64  scopeid 0x20<link>
        ether f2:67:50:4f:45:f8  txqueuelen 1000  (Ethernet)
        RX packets 240  bytes 163152 (163.1 KB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 174  bytes 155592 (155.5 KB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

weave: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1376
        inet 10.32.0.1  netmask 255.240.0.0  broadcast 10.47.255.255
        inet6 fe80::7cb0:f6ff:fe9e:1b0e  prefixlen 64  scopeid 0x20<link>
        ether 7e:b0:f6:9e:1b:0e  txqueuelen 1000  (Ethernet)
        RX packets 195  bytes 19650 (19.6 KB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 59  bytes 6808 (6.8 KB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

and

ignite0: flags=4419<UP,BROADCAST,RUNNING,PROMISC,MULTICAST>  mtu 1500
        inet 10.61.0.1  netmask 255.255.0.0  broadcast 10.61.255.255
        inet6 fe80::d8aa:29ff:fe1c:2e35  prefixlen 64  scopeid 0x20<link>
        ether da:aa:29:1c:2e:35  txqueuelen 1000  (Ethernet)
        RX packets 294  bytes 28890 (28.8 KB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 300  bytes 33022 (33.0 KB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536
        inet 127.0.0.1  netmask 255.0.0.0
        inet6 ::1  prefixlen 128  scopeid 0x10<host>
        loop  txqueuelen 1000  (Local Loopback)
        RX packets 552  bytes 48676 (48.6 KB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 552  bytes 48676 (48.6 KB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

veth5cf312db: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet6 fe80::39:f3ff:fe5e:2075  prefixlen 64  scopeid 0x20<link>
        ether 02:39:f3:5e:20:75  txqueuelen 0  (Ethernet)
        RX packets 294  bytes 33006 (33.0 KB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 330  bytes 36565 (36.5 KB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

vethwe-bridge: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1376
        inet6 fe80::d47a:82ff:fe1e:5807  prefixlen 64  scopeid 0x20<link>
        ether d6:7a:82:1e:58:07  txqueuelen 0  (Ethernet)
        RX packets 149  bytes 15746 (15.7 KB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 111  bytes 12216 (12.2 KB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

vethwe-datapath: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1376
        inet6 fe80::88f:cbff:fe49:a42b  prefixlen 64  scopeid 0x20<link>
        ether 0a:8f:cb:49:a4:2b  txqueuelen 0  (Ethernet)
        RX packets 111  bytes 12216 (12.2 KB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 149  bytes 15746 (15.7 KB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

vxlan-6784: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 65535
        inet6 fe80::a41b:37ff:fe63:e69e  prefixlen 64  scopeid 0x20<link>
        ether a6:1b:37:63:e6:9e  txqueuelen 1000  (Ethernet)
        RX packets 304  bytes 317620 (317.6 KB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 371  bytes 325090 (325.0 KB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

weave: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1376
        inet 10.40.0.0  netmask 255.240.0.0  broadcast 10.47.255.255
        inet6 fe80::d0e2:beff:fe0c:6c35  prefixlen 64  scopeid 0x20<link>
        ether d2:e2:be:0c:6c:35  txqueuelen 1000  (Ethernet)
        RX packets 148  bytes 13584 (13.5 KB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 77  bytes 8348 (8.3 KB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

I can ping 10.40.0.0 from 10.32.0.1 and vice versa, which tells me that the Weave Net CNI itself is working as expected.
But when I SSH into each VM, they both show the same output for ifconfig:

eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 10.61.0.2  netmask 255.255.0.0  broadcast 10.61.255.255
        inet6 fe80::3804:88ff:fec8:a6ef  prefixlen 64  scopeid 0x20<link>
        ether 3a:04:88:c8:a6:ef  txqueuelen 1000  (Ethernet)
        RX packets 303  bytes 34991 (34.9 KB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 215  bytes 25732 (25.7 KB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536
        inet 127.0.0.1  netmask 255.0.0.0
        inet6 ::1  prefixlen 128  scopeid 0x10<host>
        loop  txqueuelen 1000  (Local Loopback)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 0  bytes 0 (0.0 B)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

As you can see, they both have the same IP on the WeaveWorks CNI network (obviously not what was expected), rather than each having a dedicated IP. There is no more hanging or delay, so that issue is gone now, but I still can't do multi-host Ignite VM networking using the Weave Net CNI.
I also removed the file /etc/cni/net.d/10-ignite.conflist before running the ignite run command. The content of the newly created /etc/cni/net.d/10-ignite.conflist file is:

{
	"cniVersion": "0.4.0",
	"name": "ignite-cni-bridge",
	"plugins": [
		{
			"type": "bridge",
			"bridge": "ignite0",
			"isGateway": true,
			"isDefaultGateway": true,
			"promiscMode": true,
			"ipMasq": true,
			"ipam": {
				"type": "host-local",
				"subnet": "10.61.0.0/16"
			}
		},
		{
			"type": "portmap",
			"capabilities": {
				"portMappings": true
			}
		},
		{
			"type": "firewall"
		}
	]
}

Either the documentation is lacking some instructions here, or there is a bug somewhere. Could someone please have a look and help me solve this issue?

Thanks

twelho (Contributor) commented Jul 14, 2020

Hi @mdundek 👋

Your Weave Net setup looks correct, but it seems like the Ignite VMs are not attaching to it at all. The 10.61.0.0/16 address space is Ignite's default bridge; if the VMs have these addresses, they are each using the host-local CNI bridge, not Weave Net.

Have a look at what you have in /etc/cni/net.d/: there should be a file named 10-weave.conflist (created by Weave Net) on both hosts and nothing else. Ignite only creates its "fallback" default bridge config (the 10-ignite.conflist) if there is no other file in that directory, i.e. only if no external CNI config is defined.
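
As a quick check, each host's directory should then look something like this (a sketch):

$ ls /etc/cni/net.d/
10-weave.conflist
# if 10-ignite.conflist is also present, remove it so new VMs pick up Weave Net:
$ sudo rm /etc/cni/net.d/10-ignite.conflist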

For reference, this is what 10-weave.conflist should contain:

$ cat /etc/cni/net.d/10-weave.conflist 
{
    "cniVersion": "0.3.0",
    "name": "weave",
    "plugins": [
        {
            "name": "weave",
            "type": "weave-net",
            "hairpinMode": true
        },
        {
            "type": "portmap",
            "capabilities": {"portMappings": true},
            "snat": true
        }
    ]
}

@twelho added the area/networking and kind/support labels on Jul 14, 2020
mdundek (Author) commented Jul 14, 2020

Hello twelho,

Thanks for your quick response. I started from two fresh VMs to test this again, but this time I started the Weave Net CNI before installing Ignite and the Ignite default CNI plugin. This time I do have the file /etc/cni/net.d/10-weave.conflist instead of the 10-ignite.conflist one, so that is great news. Strange, because previously it recreated the 10-ignite.conflist every time I started an Ignite VM, even with the Weave Net CNI Docker container running. Maybe things have to be installed/created in a specific order...
That said, I do have IPs belonging to the Weave Net CNI now, but I can only ping Weave Net IPs on the same host, not the VM on the other host. So now I have the following IPs:

  • Host 1 has the IP 10.32.0.1, VM on host 1 has the IP 10.32.0.2
  • Host 2 has the IP 10.40.0.0, VM on host 2 has the IP 10.40.0.1

I can ping 10.40.0.0 from 10.32.0.1, but I cannot ping 10.40.0.1 from 10.32.0.1 (and the same the other way around). The ping command does not return an error, it simply hangs until I Ctrl-C it. In other words, I can ping between the hosts, but not from a host to the remote VM, nor between the VMs. I have no firewall running on the hosts.
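
To summarize the tests as commands (what I ran from host 1, 10.32.0.1):

$ ping -c 3 10.32.0.2   # VM 1, same host    -> replies
$ ping -c 3 10.40.0.0   # host 2 (weave IP)  -> replies
$ ping -c 3 10.40.0.1   # VM 2, other host   -> hangs until Ctrl-C
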
Any idea what could cause this blockage?

Thanks

twelho (Contributor) commented Jul 15, 2020

Could you post the Docker logs for the weave-kube containers on both hosts, i.e. the output of docker logs <container_name>? Both should state that Weave Net has discovered the other peer. I'll try to replicate this scenario on two separate hosts shortly.
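
In case the container name isn't obvious, something like this finds it (a sketch; the exact name depends on how Weave Net was launched):

$ docker ps --format '{{.Names}}' | grep -i weave
$ docker logs <container_name> 2>&1 | grep -i -e peer -e connection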

mdundek (Author) commented Jul 15, 2020

The IP of host 1 is 192.168.68.148 and the IP of host 2 is 192.168.68.131. Here is the log output from host 1:

INFO: 2020/07/14 18:02:31.195120 Command line options: map[ipalloc-init:consensus=2 port:6783 datapath:datapath db-prefix:/weavedb/weave-net http-addr:127.0.0.1:6784 ipalloc-range:10.32.0.0/12 metrics-addr:0.0.0.0:6782 nickname:multipaas-master no-dns:true conn-limit:100 host-root:/host name:7e:b0:f6:9e:1b:0e docker-api: expect-npc:true]
INFO: 2020/07/14 18:02:31.195208 weave  2.5.2
INFO: 2020/07/14 18:02:31.374779 Bridge type is bridged_fastdp
INFO: 2020/07/14 18:02:31.374797 Communication between peers is unencrypted.
INFO: 2020/07/14 18:02:31.377200 Our name is 7e:b0:f6:9e:1b:0e(multipaas-master)
INFO: 2020/07/14 18:02:31.377242 Launch detected - using supplied peer list: [192.168.68.148 192.168.68.131]
INFO: 2020/07/14 18:02:31.377257 Checking for pre-existing addresses on weave bridge
INFO: 2020/07/14 18:02:31.388423 [allocator 7e:b0:f6:9e:1b:0e] No valid persisted data
INFO: 2020/07/14 18:02:31.434477 [allocator 7e:b0:f6:9e:1b:0e] Initialising via deferred consensus
INFO: 2020/07/14 18:02:31.434550 Sniffing traffic on datapath (via ODP)
INFO: 2020/07/14 18:02:31.444148 Listening for HTTP control messages on 127.0.0.1:6784
INFO: 2020/07/14 18:02:31.445914 Listening for metrics requests on 0.0.0.0:6782
INFO: 2020/07/14 18:02:31.464666 ->[192.168.68.131:6783] attempting connection
INFO: 2020/07/14 18:02:31.464843 ->[192.168.68.148:6783] attempting connection
INFO: 2020/07/14 18:02:31.465849 ->[192.168.68.148:46049] connection accepted
INFO: 2020/07/14 18:02:31.466103 ->[192.168.68.148:46049|7e:b0:f6:9e:1b:0e(multipaas-master)]: connection shutting down due to error: cannot connect to ourself
INFO: 2020/07/14 18:02:31.466192 ->[192.168.68.148:6783|7e:b0:f6:9e:1b:0e(multipaas-master)]: connection shutting down due to error: cannot connect to ourself
INFO: 2020/07/14 18:02:31.467531 ->[192.168.68.131:6783] error during connection attempt: dial tcp4 :0->192.168.68.131:6783: connect: connection refused
INFO: 2020/07/14 18:02:31.986561 Weave version 2.6.2 is available; please update at https://github.com/weaveworks/weave/releases/download/v2.6.2/weave
FATA: 2020/07/14 18:02:32.148930 [kube-peers] Could not get cluster config: unable to load in-cluster configuration, KUBERNETES_SERVICE_HOST and KUBERNETES_SERVICE_PORT must be defined
INFO: 2020/07/14 18:02:34.128538 ->[192.168.68.131:51433] connection accepted
INFO: 2020/07/14 18:02:34.140039 ->[192.168.68.131:51433|d2:e2:be:0c:6c:35(multipaas-worker1)]: connection ready; using protocol version 2
INFO: 2020/07/14 18:02:34.140115 overlay_switch ->[d2:e2:be:0c:6c:35(multipaas-worker1)] using fastdp
INFO: 2020/07/14 18:02:34.140138 ->[192.168.68.131:51433|d2:e2:be:0c:6c:35(multipaas-worker1)]: connection added (new peer)
INFO: 2020/07/14 18:02:34.141702 ->[192.168.68.131:51433|d2:e2:be:0c:6c:35(multipaas-worker1)]: connection fully established
10.32.0.1
FATA: 2020/07/14 18:02:34.188881 [kube-peers] Could not get cluster config: unable to load in-cluster configuration, KUBERNETES_SERVICE_HOST and KUBERNETES_SERVICE_PORT must be defined
INFO: 2020/07/14 18:02:34.216691 Discovered remote MAC d2:e2:be:0c:6c:35 at d2:e2:be:0c:6c:35(multipaas-worker1)
INFO: 2020/07/14 18:02:34.252999 Discovered remote MAC 86:49:ef:9c:9f:6d at d2:e2:be:0c:6c:35(multipaas-worker1)
INFO: 2020/07/14 18:02:34.489779 Discovered remote MAC ee:7c:ea:6e:71:79 at d2:e2:be:0c:6c:35(multipaas-worker1)
INFO: 2020/07/14 18:02:34.644346 EMSGSIZE on send, expecting PMTU update (IP packet was 60028 bytes, payload was 60020 bytes)
INFO: 2020/07/14 18:02:34.646953 sleeve ->[192.168.68.131:6783|d2:e2:be:0c:6c:35(multipaas-worker1)]: Effective MTU verified at 1438
INFO: 2020/07/14 18:09:53.643847 Discovered remote MAC 56:6d:74:bc:e7:97 at d2:e2:be:0c:6c:35(multipaas-worker1)
INFO: 2020/07/14 18:09:54.548616 Discovered remote MAC 52:b9:61:e0:6a:e3 at d2:e2:be:0c:6c:35(multipaas-worker1)
INFO: 2020/07/14 18:09:55.889006 Discovered remote MAC 0e:76:6f:03:9e:85 at d2:e2:be:0c:6c:35(multipaas-worker1)
INFO: 2020/07/14 18:34:43.981410 Discovered remote MAC d2:e2:be:0c:6c:35 at d2:e2:be:0c:6c:35(multipaas-worker1)
INFO: 2020/07/14 18:36:22.296077 Discovered remote MAC ee:7c:ea:6e:71:79 at d2:e2:be:0c:6c:35(multipaas-worker1)
INFO: 2020/07/14 18:36:42.311492 Discovered remote MAC 86:49:ef:9c:9f:6d at d2:e2:be:0c:6c:35(multipaas-worker1)
INFO: 2020/07/14 18:37:15.106163 Discovered remote MAC 82:b9:cc:4e:da:07 at d2:e2:be:0c:6c:35(multipaas-worker1)
INFO: 2020/07/14 18:37:16.107863 Discovered remote MAC 1e:33:ee:14:f0:c9 at d2:e2:be:0c:6c:35(multipaas-worker1)
INFO: 2020/07/14 18:37:17.312412 Discovered remote MAC 3e:47:ab:fd:ed:06 at d2:e2:be:0c:6c:35(multipaas-worker1)
INFO: 2020/07/14 18:58:45.775030 Discovered remote MAC 82:b9:cc:4e:da:07 at d2:e2:be:0c:6c:35(multipaas-worker1)
INFO: 2020/07/14 18:58:53.333475 Discovered remote MAC d2:e2:be:0c:6c:35 at d2:e2:be:0c:6c:35(multipaas-worker1)
INFO: 2020/07/14 18:58:53.334382 Discovered remote MAC 86:49:ef:9c:9f:6d at d2:e2:be:0c:6c:35(multipaas-worker1)
INFO: 2020/07/14 18:58:53.334889 Discovered remote MAC ee:7c:ea:6e:71:79 at d2:e2:be:0c:6c:35(multipaas-worker1)
INFO: 2020/07/15 05:36:20.624290 Discovered remote MAC 1e:33:ee:14:f0:c9 at d2:e2:be:0c:6c:35(multipaas-worker1)
INFO: 2020/07/15 07:09:30.391728 Discovered remote MAC 82:b9:cc:4e:da:07 at d2:e2:be:0c:6c:35(multipaas-worker1)
INFO: 2020/07/15 07:34:02.070236 overlay_switch ->[d2:e2:be:0c:6c:35(multipaas-worker1)] sleeve write ip4 192.168.68.148->192.168.68.131: write: network is unreachable
INFO: 2020/07/15 07:34:26.009153 ->[192.168.68.131:51433|d2:e2:be:0c:6c:35(multipaas-worker1)]: connection shutting down due to error: read tcp4 192.168.68.148:6783->192.168.68.131:51433: i/o timeout
INFO: 2020/07/15 07:34:26.009300 ->[192.168.68.131:51433|d2:e2:be:0c:6c:35(multipaas-worker1)]: connection deleted
INFO: 2020/07/15 07:34:26.009320 Removed unreachable peer d2:e2:be:0c:6c:35(multipaas-worker1)
INFO: 2020/07/15 07:34:26.009757 ->[192.168.68.131:6783] attempting connection
INFO: 2020/07/15 07:34:26.009842 ->[192.168.68.131:6783] error during connection attempt: dial tcp4 :0->192.168.68.131:6783: connect: network is unreachable
INFO: 2020/07/15 07:34:30.475182 ->[192.168.68.131:6783] attempting connection
INFO: 2020/07/15 07:34:30.475289 ->[192.168.68.131:6783] error during connection attempt: dial tcp4 :0->192.168.68.131:6783: connect: network is unreachable
INFO: 2020/07/15 07:34:35.575761 ->[192.168.68.131:6783] attempting connection
INFO: 2020/07/15 07:34:35.575928 ->[192.168.68.131:6783] error during connection attempt: dial tcp4 :0->192.168.68.131:6783: connect: network is unreachable
INFO: 2020/07/15 07:34:45.322800 ->[192.168.68.131:6783] attempting connection
INFO: 2020/07/15 07:34:45.322979 ->[192.168.68.131:6783] error during connection attempt: dial tcp4 :0->192.168.68.131:6783: connect: network is unreachable
INFO: 2020/07/15 07:34:50.516495 ->[192.168.68.131:6783] attempting connection
INFO: 2020/07/15 07:34:50.516662 ->[192.168.68.131:6783] error during connection attempt: dial tcp4 :0->192.168.68.131:6783: connect: network is unreachable
INFO: 2020/07/15 07:35:11.585585 ->[192.168.68.131:6783] attempting connection
INFO: 2020/07/15 07:35:11.585728 ->[192.168.68.131:6783] error during connection attempt: dial tcp4 :0->192.168.68.131:6783: connect: network is unreachable
INFO: 2020/07/15 07:35:38.722405 ->[192.168.68.131:6783] attempting connection
INFO: 2020/07/15 07:36:42.406156 ->[192.168.68.131:6783|d2:e2:be:0c:6c:35(multipaas-worker1)]: connection ready; using protocol version 2
INFO: 2020/07/15 07:36:42.406947 overlay_switch ->[d2:e2:be:0c:6c:35(multipaas-worker1)] using fastdp
INFO: 2020/07/15 07:36:42.406999 ->[192.168.68.131:6783|d2:e2:be:0c:6c:35(multipaas-worker1)]: connection added (new peer)
INFO: 2020/07/15 07:36:42.409091 ->[192.168.68.131:6783|d2:e2:be:0c:6c:35(multipaas-worker1)]: connection fully established
INFO: 2020/07/15 07:36:42.412607 EMSGSIZE on send, expecting PMTU update (IP packet was 60028 bytes, payload was 60020 bytes)
INFO: 2020/07/15 07:36:42.413634 sleeve ->[192.168.68.131:6783|d2:e2:be:0c:6c:35(multipaas-worker1)]: Effective MTU verified at 1438
INFO: 2020/07/15 07:36:42.600665 Discovered remote MAC 3e:47:ab:fd:ed:06 at d2:e2:be:0c:6c:35(multipaas-worker1)
INFO: 2020/07/15 07:44:19.368873 Discovered remote MAC 1e:33:ee:14:f0:c9 at d2:e2:be:0c:6c:35(multipaas-worker1)
INFO: 2020/07/15 08:00:41.184428 Discovered remote MAC d2:e2:be:0c:6c:35 at d2:e2:be:0c:6c:35(multipaas-worker1)
INFO: 2020/07/15 08:00:41.184667 Discovered remote MAC 86:49:ef:9c:9f:6d at d2:e2:be:0c:6c:35(multipaas-worker1)

Host 1 & host 2 can talk to each other, host 1 can talk to VM 1 (running on host 1), but host 1 can not talk to VM 2 (running on host 2).

Thank you twelho for helping me out on this, much appreciated!

twelho (Contributor) commented Jul 15, 2020

Yes, from your log output it seems like the two Weave Net instances are able to connect to each other. I just replicated the same setup with two physical hosts on the same LAN, only to end up at the same conclusion – the Weave Net instances pair just fine and the VMs connect to their respective instances, but no traffic flows in the CNI network. The VMs reach their own CNI provider but cannot connect to the other host or the internet.

This is kind of a dead end for now, since Weave Net is not intended to be run "externally" like this (it would need more work to investigate why it doesn't work). However, I'm assembling some scripts to set up a Flannel-based CNI overlay with the aim of actually getting multi-node networking to work. I should have the scripts ready for testing tomorrow.

mdundek (Author) commented Jul 15, 2020

Thanks twelho, I appreciate your efforts in helping me out on this. I am very much looking forward to your Flannel-based solution; I believe it makes a lot of sense to get multi-host networking working with Ignite, especially when using Ignite with Kubernetes, where each node should run on its own host unless it is used purely for testing.
While I have your attention, could you tell me whether a VM's IP will persist across reboots in this scenario, since it does not seem possible to assign static IPs? I intend to use Ignite mainly for Kubernetes, and nodes cannot change IP once they have joined a cluster.

Looking forward to your scripts, thanks again!

twelho (Contributor) commented Jul 16, 2020

The scripts (and docs) are ready in #645, go ahead and give them a go; it would be good to know if they work on your end as well. Thanks for bringing up static IPs: I've added a chapter to the docs in that same PR, the tl;dr being that it is possible, but you need to configure it for your CNI provider. For Flannel it might be doable using leases and reservations, but I don't have the resources to implement full-blown CLI utilities for that right now. The current tools are quite simple since they mostly serve as examples in their current state, but contributions are definitely welcome if there is a need for more advanced configuration.

mdundek (Author) commented Jul 17, 2020

Hello twelho,

Thanks, I tested your scripts and they work great! I do have a couple of comments for you:

  • The script tries to create the file /etc/cni/net.d/10-flannel.conflist; on a fresh system the directory /etc/cni/net.d/ does not exist and this command fails. You should add the command sudo mkdir -p /etc/cni/net.d on lines 159 & 184.
  • In my tests, I managed to ping all IPs, including from host 1 to VM 2 running on host 2. Your documentation seems to indicate that this particular scenario is not possible, that only VM-to-VM communication or host-to-VM on the same host works. Did I misunderstand that part?
  • About the static IPs: the Flannel Leases and Reservations doc only talks about reusing the subnet (10.50.X.0) across reboots; in reservation mode the TTL of the reservation for the defined subnet is infinite, but this does not guarantee that the VM's final IP on that subnet will be static. I will do some testing here: reboot the machines a couple of times, wait a couple of days with the VMs off, and see what happens. If the IPs stick, then I guess Flannel has some mechanism that binds an IP to the MAC address of the VM to persist it, therefore making it permanent (see the sketch below). Setting IPs manually is a nice-to-have but not mandatory; on the other hand, once assigned, those VMs cannot change IP randomly on reboot, otherwise using a Kubernetes cluster is impossible outside of testing scenarios.
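
As a rough sketch of how I read that doc (this assumes Flannel's etcd v2 backend and its default key prefix; the subnet and value shown are illustrative), a lease becomes a permanent reservation by re-setting its subnet key without a TTL:

$ etcdctl get /coreos.com/network/subnets/10.50.14.0-24
{"PublicIP":"192.168.68.131"}
# re-setting the key with the same value but no --ttl makes it permanent:
$ etcdctl set /coreos.com/network/subnets/10.50.14.0-24 '{"PublicIP":"192.168.68.131"}'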

That's it. Great work, and thanks again for the support, it is very much appreciated! When will you merge your changes into the master branch?
I will now try to implement Ignite in my project. If you are curious, check it out here. I am using the VirtualBox CLI at the moment, but Firecracker and Ignite are a much better fit for this.

BR

twelho (Contributor) commented Jul 17, 2020

Hi,

  • Good catch on the /etc/cni/net.d directory, thanks! I've updated the PR.
  • Apparently I had my routes misconfigured from all the testing; I tried it now with a clean environment on both hosts, and it does indeed seem possible to cross-ping VMs on different hosts. Docs updated, thanks!
  • The static IPs would definitely need more investigation/work. Currently Ignite has no means to enable MAC address persistence, which is a requirement for implementing this properly (binding an IP to a specific MAC), and in addition, restarting stopped VMs is currently not possible (How to re-run stopped VMs? #504), so the MAC will always change. That said, there is now the alternative of specifying the static IP from inside the VM, as mentioned in the docs (see the sketch below).
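
As an illustration of that last alternative (a sketch only: the authoritative instructions are in the #645 docs, and the interface name and addresses here are assumptions), a systemd-networkd unit inside the VM can pin the address:

$ cat /etc/systemd/network/10-eth0.network
[Match]
Name=eth0

[Network]
Address=10.50.14.2/24
Gateway=10.50.14.1
$ sudo systemctl restart systemd-networkd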

The merge of #645 will most likely happen early next week. Good luck with your project, just note that we don't have a lot of free resources to put into developing/supporting Ignite currently, so there might be delays in support/bug fixing.
