This repository was archived by the owner on Jun 20, 2024. It is now read-only.
[SCALING] weaver connections to peers result in various errors at larger cluster sizes #3595
Closed
Description
What you expected to happen?
On an empty cluster with no workload (dataplane) traffic, only Weave Net control-plane traffic, the weave-net pods should scale to hundreds or even thousands of nodes.
What happened?
At a cluster size of 200 nodes, with sufficient memory and CPU requested for the weave-net pods (to avoid #3593), the weaver process does not crash, but it fails to connect to peers for various reasons.
/weave --local status connections
-> 172.20.33.99:6783 pending none 8a:cd:59:87:41:6c(ip-172-20-33-99.us-west-2.compute.internal)
-> 172.20.40.56:6783 pending fastdp e2:c2:65:63:da:3a(ip-172-20-40-56.us-west-2.compute.internal) mtu=8912
-> 172.20.81.32:6783 pending fastdp 56:71:f2:cc:2c:86(ip-172-20-81-32.us-west-2.compute.internal) mtu=8912
-> 172.20.50.215:6783 pending fastdp fa:7d:34:69:28:a7(ip-172-20-50-215.us-west-2.compute.internal) mtu=8912
<- 172.20.40.229:33689 pending fastdp 5e:37:6f:54:38:49(ip-172-20-40-229.us-west-2.compute.internal) mtu=8912
-> 172.20.65.70:6783 pending fastdp d6:a4:f5:d1:30:c9(ip-172-20-65-70.us-west-2.compute.internal) mtu=8912
-> 172.20.75.16:6783 pending none aa:8d:ef:6d:29:92(ip-172-20-75-16.us-west-2.compute.internal)
<- 172.20.69.126:36809 pending fastdp 7e:3b:25:46:24:04(ip-172-20-69-126.us-west-2.compute.internal) mtu=8912
-> 172.20.59.101:6783 pending none d6:7d:7e:c7:44:74(ip-172-20-59-101.us-west-2.compute.internal)
-> 172.20.93.123:6783 pending none 8a:0c:c3:37:d0:6f(ip-172-20-93-123.us-west-2.compute.internal)
-> 172.20.74.170:6783 pending fastdp 4a:94:72:d7:30:58(ip-172-20-74-170.us-west-2.compute.internal) mtu=8912
<- 172.20.45.166:43416 pending none 1a:dc:ba:6a:42:26(ip-172-20-45-166.us-west-2.compute.internal)
-> 172.20.69.190:6783 pending fastdp 76:e0:b4:17:d5:22(ip-172-20-69-190.us-west-2.compute.internal) mtu=8912
-> 172.20.88.20:6783 pending fastdp 12:9b:1a:7e:6a:71(ip-172-20-88-20.us-west-2.compute.internal) mtu=8912
-> 172.20.45.208:6783 pending none 4a:75:9f:b7:84:7e(ip-172-20-45-208.us-west-2.compute.internal)
-> 172.20.72.144:6783 pending fastdp 56:6d:a8:7a:af:03(ip-172-20-72-144.us-west-2.compute.internal) mtu=8912
-> 172.20.36.159:6783 pending none c2:d3:01:28:80:72(ip-172-20-36-159.us-west-2.compute.internal)
-> 172.20.64.32:6783 pending fastdp a6:6d:d9:2a:b5:79(ip-172-20-64-32.us-west-2.compute.internal) mtu=8912
-> 172.20.85.79:6783 pending fastdp e6:36:fd:17:aa:b7(ip-172-20-85-79.us-west-2.compute.internal) mtu=8912
-> 172.20.56.230:6783 pending fastdp 56:f8:f4:12:0b:38(ip-172-20-56-230.us-west-2.compute.internal) mtu=8912
-> 172.20.64.121:6783 pending none 22:ef:c3:40:98:58(ip-172-20-64-121.us-west-2.compute.internal)
-> 172.20.83.5:6783 pending fastdp e2:04:e3:3f:13:fc(ip-172-20-83-5.us-west-2.compute.internal) mtu=8912
-> 172.20.43.56:6783 pending fastdp ca:cc:05:e3:aa:87(ip-172-20-43-56.us-west-2.compute.internal) mtu=8912
-> 172.20.88.57:6783 pending none f6:21:b0:40:35:a9(ip-172-20-88-57.us-west-2.compute.internal)
-> 172.20.52.121:6783 pending fastdp 2e:92:d8:f1:fc:5b(ip-172-20-52-121.us-west-2.compute.internal) mtu=8912
-> 172.20.58.243:6783 pending fastdp a2:50:a5:8a:03:dc(ip-172-20-58-243.us-west-2.compute.internal) mtu=8912
<- 172.20.48.123:27466 established sleeve 9e:a2:5f:dc:a1:6d(ip-172-20-48-123.us-west-2.compute.internal) mtu=1437
-> 172.20.48.130:6783 pending fastdp de:92:89:48:4d:52(ip-172-20-48-130.us-west-2.compute.internal) mtu=8912
-> 172.20.45.127:6783 pending fastdp 42:05:d1:06:28:1e(ip-172-20-45-127.us-west-2.compute.internal) mtu=8912
-> 172.20.79.29:6783 pending fastdp ca:40:cb:58:5a:f0(ip-172-20-79-29.us-west-2.compute.internal) mtu=8912
-> 172.20.69.171:6783 pending none 2e:40:0b:84:0e:8c(ip-172-20-69-171.us-west-2.compute.internal)
-> 172.20.61.156:6783 pending fastdp 0a:47:05:10:4f:70(ip-172-20-61-156.us-west-2.compute.internal) mtu=8912
-> 172.20.81.74:6783 pending none 5e:23:ab:aa:d5:3c(ip-172-20-81-74.us-west-2.compute.internal)
-> 172.20.46.216:6783 pending none 42:5b:69:b4:24:2a(ip-172-20-46-216.us-west-2.compute.internal)
-> 172.20.75.225:6783 pending fastdp ce:96:a9:f7:15:88(ip-172-20-75-225.us-west-2.compute.internal) mtu=8912
-> 172.20.89.254:6783 pending fastdp 82:33:3d:60:1a:2b(ip-172-20-89-254.us-west-2.compute.internal) mtu=8912
-> 172.20.75.121:6783 pending fastdp 96:84:dc:c3:d1:f0(ip-172-20-75-121.us-west-2.compute.internal) mtu=8912
-> 172.20.75.26:6783 pending none 4e:0b:a9:76:b3:1d(ip-172-20-75-26.us-west-2.compute.internal)
-> 172.20.41.21:6783 pending none 4a:e2:c0:b7:5d:88(ip-172-20-41-21.us-west-2.compute.internal)
-> 172.20.65.249:6783 pending fastdp de:82:f3:f3:ab:f6(ip-172-20-65-249.us-west-2.compute.internal) mtu=8912
-> 172.20.46.204:6783 pending none 8e:86:e0:dd:07:18(ip-172-20-46-204.us-west-2.compute.internal)
-> 172.20.90.248:6783 established sleeve 42:a8:69:50:44:ca(ip-172-20-90-248.us-west-2.compute.internal) mtu=1438
-> 172.20.40.132:6783 pending fastdp d6:f2:39:10:c6:0e(ip-172-20-40-132.us-west-2.compute.internal) mtu=8912
-> 172.20.95.154:6783 pending fastdp 4e:61:6d:18:12:da(ip-172-20-95-154.us-west-2.compute.internal) mtu=8912
-> 172.20.46.243:6783 pending fastdp 86:1a:15:d8:8f:54(ip-172-20-46-243.us-west-2.compute.internal) mtu=8912
<- 172.20.51.54:10407 pending fastdp 9e:bf:f4:6b:0c:0d(ip-172-20-51-54.us-west-2.compute.internal) mtu=8912
-> 172.20.83.88:6783 pending fastdp 56:57:ec:b4:60:12(ip-172-20-83-88.us-west-2.compute.internal) mtu=8912
-> 172.20.63.88:6783 pending none b6:f4:35:d1:b8:19(ip-172-20-63-88.us-west-2.compute.internal)
-> 172.20.68.113:6783 pending fastdp 06:19:80:55:f3:ba(ip-172-20-68-113.us-west-2.compute.internal) mtu=8912
<- 172.20.88.40:30522 pending fastdp 4a:11:bf:1d:a0:a6(ip-172-20-88-40.us-west-2.compute.internal) mtu=8912
-> 172.20.54.141:6783 pending fastdp 32:ba:e2:0e:aa:59(ip-172-20-54-141.us-west-2.compute.internal) mtu=8912
-> 172.20.73.221:6783 established sleeve d6:0d:28:bb:9a:49(ip-172-20-73-221.us-west-2.compute.internal) mtu=1438
-> 172.20.38.39:6783 pending none ee:bf:a1:fd:91:de(ip-172-20-38-39.us-west-2.compute.internal)
<- 172.20.51.128:38633 pending fastdp 62:0a:6a:bb:ce:90(ip-172-20-51-128.us-west-2.compute.internal) mtu=8912
-> 172.20.84.9:6783 pending none ce:dd:52:c9:83:5d(ip-172-20-84-9.us-west-2.compute.internal)
-> 172.20.70.164:6783 pending fastdp 1e:01:c6:d7:cc:c7(ip-172-20-70-164.us-west-2.compute.internal) mtu=8912
-> 172.20.82.214:6783 pending fastdp 42:53:72:c1:f2:3c(ip-172-20-82-214.us-west-2.compute.internal) mtu=8912
-> 172.20.57.153:6783 pending fastdp 3a:c5:86:76:4d:94(ip-172-20-57-153.us-west-2.compute.internal) mtu=8912
<- 172.20.95.36:37232 pending fastdp 72:3d:d3:67:52:9d(ip-172-20-95-36.us-west-2.compute.internal) mtu=8912
-> 172.20.74.165:6783 pending none 9e:21:58:fb:c8:25(ip-172-20-74-165.us-west-2.compute.internal)
-> 172.20.62.156:6783 pending fastdp 26:bb:0f:03:a7:8e(ip-172-20-62-156.us-west-2.compute.internal) mtu=8912
<- 172.20.39.207:20149 pending fastdp 5e:26:06:43:af:f2(ip-172-20-39-207.us-west-2.compute.internal) mtu=8912
<- 172.20.55.102:28813 established sleeve fa:ab:a8:92:f1:5e(ip-172-20-55-102.us-west-2.compute.internal) mtu=1438
-> 172.20.93.234:6783 pending none 12:72:9e:fc:e2:b9(ip-172-20-93-234.us-west-2.compute.internal)
-> 172.20.65.177:6783 pending none 72:b6:f5:2a:14:46(ip-172-20-65-177.us-west-2.compute.internal)
-> 172.20.46.208:6783 pending none 5e:d6:97:db:64:49(ip-172-20-46-208.us-west-2.compute.internal)
-> 172.20.89.63:6783 pending none 6e:3c:09:6a:b7:86(ip-172-20-89-63.us-west-2.compute.internal)
-> 172.20.73.2:6783 established sleeve f2:51:2a:fc:5d:0c(ip-172-20-73-2.us-west-2.compute.internal) mtu=1438
-> 172.20.70.91:6783 pending fastdp fe:cb:15:e1:f2:e0(ip-172-20-70-91.us-west-2.compute.internal) mtu=8912
-> 172.20.88.40:6783 pending none 4a:11:bf:1d:a0:a6(ip-172-20-88-40.us-west-2.compute.internal)
<- 172.20.33.134:59783 pending none 92:bc:13:9e:7e:ad(ip-172-20-33-134.us-west-2.compute.internal)
-> 172.20.34.244:6783 pending none ce:e9:ac:27:93:3d(ip-172-20-34-244.us-west-2.compute.internal)
<- 172.20.84.9:36887 pending fastdp ce:dd:52:c9:83:5d(ip-172-20-84-9.us-west-2.compute.internal) mtu=8912
-> 172.20.56.9:6783 pending fastdp da:35:de:fc:40:00(ip-172-20-56-9.us-west-2.compute.internal) mtu=8912
<- 172.20.56.9:17141 pending fastdp da:35:de:fc:40:00(ip-172-20-56-9.us-west-2.compute.internal) mtu=8912
-> 172.20.69.231:6783 pending none e6:58:ad:01:f1:12(ip-172-20-69-231.us-west-2.compute.internal)
<- 172.20.43.11:49382 established sleeve 1a:5d:b4:d6:0f:c0(ip-172-20-43-11.us-west-2.compute.internal) mtu=1438
-> 172.20.72.9:6783 pending none 7a:0a:3d:a8:30:6b(ip-172-20-72-9.us-west-2.compute.internal)
-> 172.20.49.14:6783 pending fastdp c6:dd:8f:bb:13:e9(ip-172-20-49-14.us-west-2.compute.internal) mtu=8912
<- 172.20.85.147:27084 pending none da:b6:41:b8:30:a3(ip-172-20-85-147.us-west-2.compute.internal)
<- 172.20.74.165:41335 pending none 9e:21:58:fb:c8:25(ip-172-20-74-165.us-west-2.compute.internal)
<- 172.20.85.126:28186 established sleeve 2a:b6:1d:fc:83:77(ip-172-20-85-126.us-west-2.compute.internal) mtu=1438
-> 172.20.86.109:6783 pending fastdp 0a:f8:96:9e:41:e2(ip-172-20-86-109.us-west-2.compute.internal) mtu=8912
-> 172.20.43.55:6783 pending none 5a:74:54:93:4d:54(ip-172-20-43-55.us-west-2.compute.internal)
<- 172.20.72.14:43666 pending none 2a:13:fc:30:47:70(ip-172-20-72-14.us-west-2.compute.internal)
-> 172.20.67.185:6783 pending none be:a6:c3:02:c6:9a(ip-172-20-67-185.us-west-2.compute.internal)
-> 172.20.58.161:6783 pending none a6:c2:14:b8:5c:16(ip-172-20-58-161.us-west-2.compute.internal)
-> 172.20.38.65:6783 pending fastdp 2a:bf:31:9e:5f:f8(ip-172-20-38-65.us-west-2.compute.internal) mtu=8912
-> 172.20.91.217:6783 pending none d2:94:b3:b8:26:0f(ip-172-20-91-217.us-west-2.compute.internal)
-> 172.20.92.29:6783 pending fastdp d6:cf:a2:05:a9:36(ip-172-20-92-29.us-west-2.compute.internal) mtu=8912
-> 172.20.80.54:6783 pending none ae:35:8b:48:fa:66(ip-172-20-80-54.us-west-2.compute.internal)
<- 172.20.56.168:53665 established sleeve 36:df:1b:90:26:73(ip-172-20-56-168.us-west-2.compute.internal) mtu=772
-> 172.20.34.53:6783 pending none c6:c2:20:d3:ed:31(ip-172-20-34-53.us-west-2.compute.internal)
-> 172.20.66.194:6783 pending fastdp be:4e:67:bc:a4:f7(ip-172-20-66-194.us-west-2.compute.internal) mtu=8912
-> 172.20.51.54:6783 pending fastdp 9e:bf:f4:6b:0c:0d(ip-172-20-51-54.us-west-2.compute.internal) mtu=8912
-> 172.20.69.1:6783 pending fastdp 06:fa:8c:d2:4b:f8(ip-172-20-69-1.us-west-2.compute.internal) mtu=8912
<- 172.20.51.50:64848 established sleeve ce:1d:8b:ed:22:fd(ip-172-20-51-50.us-west-2.compute.internal) mtu=1431
<- 172.20.68.115:14661 pending fastdp 7a:78:54:4d:6f:e4(ip-172-20-68-115.us-west-2.compute.internal) mtu=8912
<- 172.20.33.174:27282 pending fastdp f2:55:2e:ae:16:7c(ip-172-20-33-174.us-west-2.compute.internal) mtu=8912
-> 172.20.40.229:6783 pending fastdp 5e:37:6f:54:38:49(ip-172-20-40-229.us-west-2.compute.internal) mtu=8912
-> 172.20.59.28:6783 pending none 3a:39:14:c8:37:d7(ip-172-20-59-28.us-west-2.compute.internal)
-> 172.20.37.249:6783 pending fastdp 12:5c:30:66:8d:04(ip-172-20-37-249.us-west-2.compute.internal) mtu=8912
-> 172.20.68.115:6783 pending none 7a:78:54:4d:6f:e4(ip-172-20-68-115.us-west-2.compute.internal)
-> 172.20.71.159:6783 pending none b6:af:8e:99:8a:40(ip-172-20-71-159.us-west-2.compute.internal)
<- 172.20.67.198:23705 pending fastdp 0a:49:a7:d4:a8:c0(ip-172-20-67-198.us-west-2.compute.internal) mtu=8912
-> 172.20.93.111:6783 pending none 2a:8f:3f:98:65:0a(ip-172-20-93-111.us-west-2.compute.internal)
-> 172.20.52.91:6783 pending none 82:c7:cd:95:ff:04(ip-172-20-52-91.us-west-2.compute.internal)
-> 172.20.32.244:6783 pending fastdp 4e:b1:b9:09:91:f9(ip-172-20-32-244.us-west-2.compute.internal) mtu=8912
-> 172.20.72.201:6783 pending fastdp 62:ed:5a:87:25:41(ip-172-20-72-201.us-west-2.compute.internal) mtu=8912
-> 172.20.45.20:6783 pending fastdp de:88:b1:a6:64:55(ip-172-20-45-20.us-west-2.compute.internal) mtu=8912
-> 172.20.37.58:6783 pending fastdp fa:80:e3:32:d1:4b(ip-172-20-37-58.us-west-2.compute.internal) mtu=8912
<- 172.20.73.125:33893 established sleeve c6:43:d5:5d:dd:06(ip-172-20-73-125.us-west-2.compute.internal) mtu=1438
-> 172.20.79.83:6783 pending none 96:f7:2d:fb:c7:06(ip-172-20-79-83.us-west-2.compute.internal)
-> 172.20.67.171:6783 pending fastdp 12:ed:9f:34:22:a2(ip-172-20-67-171.us-west-2.compute.internal) mtu=8912
-> 172.20.59.82:6783 pending fastdp fe:0f:f8:5d:57:b8(ip-172-20-59-82.us-west-2.compute.internal) mtu=8912
-> 172.20.65.153:6783 pending fastdp 96:88:8c:a1:9f:e7(ip-172-20-65-153.us-west-2.compute.internal) mtu=8912
-> 172.20.35.115:6783 pending fastdp 16:b8:84:52:3f:bf(ip-172-20-35-115.us-west-2.compute.internal) mtu=8912
-> 172.20.46.102:6783 pending fastdp ae:aa:1d:11:73:fa(ip-172-20-46-102.us-west-2.compute.internal) mtu=8912
-> 172.20.86.131:6783 pending none 32:f6:9f:d2:be:28(ip-172-20-86-131.us-west-2.compute.internal)
-> 172.20.77.20:6783 pending none 22:6f:ff:85:f3:f0(ip-172-20-77-20.us-west-2.compute.internal)
-> 172.20.43.222:6783 pending fastdp 12:02:67:09:34:4b(ip-172-20-43-222.us-west-2.compute.internal) mtu=8912
-> 172.20.39.201:6783 pending fastdp 7e:3a:93:14:5a:53(ip-172-20-39-201.us-west-2.compute.internal) mtu=8912
<- 172.20.37.42:59907 pending none 1a:ab:4d:3f:69:ae(ip-172-20-37-42.us-west-2.compute.internal)
-> 172.20.58.244:6783 pending none fe:09:43:ae:22:7f(ip-172-20-58-244.us-west-2.compute.internal)
<- 172.20.67.191:25665 pending none ea:98:66:1d:ec:4a(ip-172-20-67-191.us-west-2.compute.internal)
-> 172.20.64.9:6783 pending none 8e:ce:b7:0c:d7:20(ip-172-20-64-9.us-west-2.compute.internal)
-> 172.20.38.31:6783 established sleeve ea:02:5d:c9:ab:d6(ip-172-20-38-31.us-west-2.compute.internal) mtu=1438
-> 172.20.84.251:6783 pending none e6:f6:63:ff:68:8e(ip-172-20-84-251.us-west-2.compute.internal)
-> 172.20.36.181:6783 pending fastdp 4e:ed:6a:b6:29:4d(ip-172-20-36-181.us-west-2.compute.internal) mtu=8912
-> 172.20.35.131:6783 established sleeve 4e:4b:69:ef:6a:2c(ip-172-20-35-131.us-west-2.compute.internal) mtu=1438
-> 172.20.40.58:6783 pending fastdp 96:ea:bf:a7:58:1e(ip-172-20-40-58.us-west-2.compute.internal) mtu=8912
-> 172.20.62.224:6783 pending none ba:55:63:2d:fd:9d(ip-172-20-62-224.us-west-2.compute.internal)
<- 172.20.85.253:39844 pending fastdp ee:5a:bf:7b:67:40(ip-172-20-85-253.us-west-2.compute.internal) mtu=8912
-> 172.20.57.191:6783 pending fastdp a2:8a:5d:f0:1f:8b(ip-172-20-57-191.us-west-2.compute.internal) mtu=8912
-> 172.20.47.37:6783 pending fastdp f6:b2:35:a2:ab:82(ip-172-20-47-37.us-west-2.compute.internal) mtu=8912
-> 172.20.66.137:6783 pending fastdp 0e:b3:e8:62:b2:11(ip-172-20-66-137.us-west-2.compute.internal) mtu=8912
-> 172.20.77.93:6783 pending fastdp 6a:46:98:98:cd:06(ip-172-20-77-93.us-west-2.compute.internal) mtu=8912
<- 172.20.74.85:41471 established sleeve d2:1e:e9:12:2e:32(ip-172-20-74-85.us-west-2.compute.internal) mtu=1438
-> 172.20.92.116:6783 pending fastdp da:ff:b7:d2:c4:b2(ip-172-20-92-116.us-west-2.compute.internal) mtu=8912
-> 172.20.79.12:6783 pending fastdp ba:c1:4f:69:a5:a8(ip-172-20-79-12.us-west-2.compute.internal) mtu=8912
-> 172.20.36.4:6783 pending none 12:68:e0:d3:c3:5a(ip-172-20-36-4.us-west-2.compute.internal)
-> 172.20.78.127:6783 pending fastdp ca:d0:fb:c5:a9:bd(ip-172-20-78-127.us-west-2.compute.internal) mtu=8912
-> 172.20.59.223:6783 pending fastdp da:91:c1:04:0a:a0(ip-172-20-59-223.us-west-2.compute.internal) mtu=8912
<- 172.20.74.20:31849 pending none a6:85:f5:51:66:75(ip-172-20-74-20.us-west-2.compute.internal)
-> 172.20.53.199:6783 pending none c2:63:02:67:cf:8b(ip-172-20-53-199.us-west-2.compute.internal)
-> 172.20.45.113:6783 pending fastdp 22:b6:8e:b1:a0:e7(ip-172-20-45-113.us-west-2.compute.internal) mtu=8912
-> 172.20.32.39:6783 pending fastdp 3e:24:61:11:d3:f8(ip-172-20-32-39.us-west-2.compute.internal) mtu=8912
-> 172.20.52.46:6783 pending none 22:a6:78:7c:84:cc(ip-172-20-52-46.us-west-2.compute.internal)
-> 172.20.92.218:6783 pending fastdp da:c3:5d:e1:cf:9f(ip-172-20-92-218.us-west-2.compute.internal) mtu=8912
-> 172.20.45.164:6783 pending none 1e:e5:2a:25:7e:90(ip-172-20-45-164.us-west-2.compute.internal)
-> 172.20.73.47:6783 pending fastdp 0e:76:c7:72:d6:3d(ip-172-20-73-47.us-west-2.compute.internal) mtu=8912
<- 172.20.46.216:50066 pending fastdp 42:5b:69:b4:24:2a(ip-172-20-46-216.us-west-2.compute.internal) mtu=8912
-> 172.20.32.55:6783 pending none 3a:a5:fe:cc:8d:c2(ip-172-20-32-55.us-west-2.compute.internal)
-> 172.20.66.15:6783 pending none 7a:b6:f3:0a:f5:ba(ip-172-20-66-15.us-west-2.compute.internal)
-> 172.20.86.182:6783 pending fastdp 06:8c:3e:6f:7d:5f(ip-172-20-86-182.us-west-2.compute.internal) mtu=8912
-> 172.20.47.161:6783 pending none aa:4a:49:4f:31:31(ip-172-20-47-161.us-west-2.compute.internal)
-> 172.20.48.250:6783 pending none 9e:f1:26:40:ef:10(ip-172-20-48-250.us-west-2.compute.internal)
-> 172.20.33.174:6783 pending none f2:55:2e:ae:16:7c(ip-172-20-33-174.us-west-2.compute.internal)
-> 172.20.43.192:6783 pending none ca:37:ff:0d:bf:b3(ip-172-20-43-192.us-west-2.compute.internal)
-> 172.20.55.52:6783 pending none 46:a7:fc:4b:df:ff(ip-172-20-55-52.us-west-2.compute.internal)
-> 172.20.82.117:6783 pending none c2:9e:94:1a:90:fe(ip-172-20-82-117.us-west-2.compute.internal)
<- 172.20.47.226:58163 established sleeve 3a:f4:1d:b1:89:3f(ip-172-20-47-226.us-west-2.compute.internal) mtu=1438
-> 172.20.77.34:6783 pending fastdp 52:e5:80:09:dc:45(ip-172-20-77-34.us-west-2.compute.internal) mtu=8912
-> 172.20.70.25:6783 pending fastdp be:db:3c:81:7e:95(ip-172-20-70-25.us-west-2.compute.internal) mtu=8912
-> 172.20.95.63:6783 pending none 32:1e:9d:18:ec:f3(ip-172-20-95-63.us-west-2.compute.internal)
<- 172.20.48.130:23691 pending fastdp de:92:89:48:4d:52(ip-172-20-48-130.us-west-2.compute.internal) mtu=8912
-> 172.20.41.189:6783 pending none e6:45:59:5d:37:c7(ip-172-20-41-189.us-west-2.compute.internal)
-> 172.20.92.225:6783 pending none 56:4f:6f:fd:45:60(ip-172-20-92-225.us-west-2.compute.internal)
-> 172.20.84.48:6783 pending none ba:51:49:77:7f:6f(ip-172-20-84-48.us-west-2.compute.internal)
-> 172.20.90.60:6783 pending fastdp aa:fe:a6:2c:97:d5(ip-172-20-90-60.us-west-2.compute.internal) mtu=8912
-> 172.20.50.147:6783 pending fastdp be:0c:9e:f9:b1:ac(ip-172-20-50-147.us-west-2.compute.internal) mtu=8912
<- 172.20.41.74:53829 established sleeve 26:e0:af:cc:c9:b4(ip-172-20-41-74.us-west-2.compute.internal) mtu=1438
-> 172.20.57.134:6783 retrying no working forwarders to 26:27:46:6d:7f:05(ip-172-20-57-134.us-west-2.compute.internal)
-> 172.20.49.98:6783 retrying Multiple connections to 56:f2:7a:ba:82:ea(ip-172-20-49-98.us-west-2.compute.internal) added to 72:b6:4d:02:51:03(ip-172-20-83-113.us-west-2.compute.internal)
-> 172.20.77.159:6783 retrying no working forwarders to ae:7e:5a:46:f9:94(ip-172-20-77-159.us-west-2.compute.internal)
-> 172.20.84.170:6783 retrying read tcp4 172.20.83.113:36800->172.20.84.170:6783: i/o timeout
-> 172.20.57.88:6783 retrying read tcp4 172.20.83.113:44699->172.20.57.88:6783: i/o timeout
-> 172.20.61.185:6783 retrying no working forwarders to 26:db:c4:08:b4:b8(ip-172-20-61-185.us-west-2.compute.internal)
-> 172.20.74.37:6783 retrying no working forwarders to b6:58:9d:e7:c3:69(ip-172-20-74-37.us-west-2.compute.internal)
-> 172.20.46.140:6783 retrying no working forwarders to ca:10:cb:b1:bb:96(ip-172-20-46-140.us-west-2.compute.internal)
-> 172.20.42.113:6783 retrying read tcp4 172.20.83.113:24184->172.20.42.113:6783: i/o timeout
-> 172.20.92.199:6783 retrying no working forwarders to ca:1d:05:e3:52:31(ip-172-20-92-199.us-west-2.compute.internal)
-> 172.20.48.136:6783 retrying read tcp4 172.20.83.113:28665->172.20.48.136:6783: i/o timeout
-> 172.20.74.252:6783 retrying read tcp4 172.20.83.113:42920->172.20.74.252:6783: i/o timeout
-> 172.20.54.131:6783 retrying Multiple connections to ba:7b:10:ba:8e:96(ip-172-20-54-131.us-west-2.compute.internal) added to 72:b6:4d:02:51:03(ip-172-20-83-113.us-west-2.compute.internal)
-> 172.20.32.57:6783 retrying write tcp4 172.20.83.113:43773->172.20.32.57:6783: write: broken pipe
-> 172.20.39.242:6783 retrying Multiple connections to 36:a7:de:7e:10:31(ip-172-20-39-242.us-west-2.compute.internal) added to 72:b6:4d:02:51:03(ip-172-20-83-113.us-west-2.compute.internal)
-> 172.20.71.163:6783 retrying read tcp4 172.20.83.113:49690->172.20.71.163:6783: i/o timeout
-> 172.20.85.181:6783 retrying write tcp4 172.20.83.113:41822->172.20.85.181:6783: write: broken pipe
-> 172.20.94.126:6783 retrying no working forwarders to a2:45:ef:e4:8a:82(ip-172-20-94-126.us-west-2.compute.internal)
-> 172.20.63.252:6783 retrying write tcp4 172.20.83.113:36174->172.20.63.252:6783: write: broken pipe
-> 172.20.81.26:6783 retrying read tcp4 172.20.83.113:45127->172.20.81.26:6783: i/o timeout
-> 172.20.79.17:6783 retrying no working forwarders to de:65:f6:50:a6:40(ip-172-20-79-17.us-west-2.compute.internal)
-> 172.20.68.233:6783 retrying read tcp4 172.20.83.113:45218->172.20.68.233:6783: i/o timeout
-> 172.20.57.207:6783 failed dial tcp4 :0->172.20.57.207:6783: connect: connection refused, retry: 2019-02-06 11:56:40.990686544 +0000 UTC m=+496.669563883
-> 172.20.53.188:6783 retrying read tcp4 172.20.83.113:49380->172.20.53.188:6783: i/o timeout
-> 172.20.89.87:6783 retrying write tcp4 172.20.83.113:12262->172.20.89.87:6783: write: broken pipe
-> 172.20.56.52:6783 retrying no working forwarders to f2:cd:83:66:79:b4(ip-172-20-56-52.us-west-2.compute.internal)
-> 172.20.59.181:6783 retrying no working forwarders to 8e:13:07:ef:d3:2d(ip-172-20-59-181.us-west-2.compute.internal)
-> 172.20.56.168:6783 retrying write tcp4 172.20.83.113:44407->172.20.56.168:6783: write: broken pipe
-> 172.20.64.79:6783 retrying write tcp4 172.20.83.113:24116->172.20.64.79:6783: write: broken pipe
-> 172.20.62.110:6783 retrying read tcp4 172.20.83.113:64287->172.20.62.110:6783: i/o timeout
-> 172.20.49.107:6783 failed dial tcp4 :0->172.20.49.107:6783: connect: connection refused, retry: 2019-02-06 11:53:26.414333267 +0000 UTC m=+302.093210632
-> 172.20.56.187:6783 retrying read tcp4 172.20.83.113:53265->172.20.56.187:6783: i/o timeout
-> 172.20.35.55:6783 retrying no working forwarders to be:01:39:2b:26:65(ip-172-20-35-55.us-west-2.compute.internal)
-> 172.20.57.50:6783 retrying read tcp4 172.20.83.113:47644->172.20.57.50:6783: i/o timeout
-> 172.20.71.190:6783 retrying read tcp4 172.20.83.113:61972->172.20.71.190:6783: i/o timeout
-> 172.20.42.79:6783 retrying no working forwarders to 8a:4c:cb:ce:46:50(ip-172-20-42-79.us-west-2.compute.internal)
-> 172.20.64.212:6783 retrying read tcp4 172.20.83.113:25892->172.20.64.212:6783: i/o timeout
-> 172.20.80.76:6783 retrying no working forwarders to 4e:f7:73:dd:ed:3a(ip-172-20-80-76.us-west-2.compute.internal)
-> 172.20.32.84:6783 retrying no working forwarders to fe:29:2c:f9:ed:66(ip-172-20-32-84.us-west-2.compute.internal)
-> 172.20.95.36:6783 retrying read tcp4 172.20.83.113:31947->172.20.95.36:6783: read: connection reset by peer
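With this many rows, it helps to tally the connections by state and reason. The following is a minimal sketch, not part of Weave Net: the helper names `classify` and `summarize` are hypothetical, and the line layout (direction, address, state, free-text detail) is inferred from the output above rather than from any documented format.

```python
import re
from collections import Counter

# Layout assumed from the output above: direction, peer address, state, detail.
LINE_RE = re.compile(r"^(->|<-)\s+(\S+)\s+(\w+)\s*(.*)$")

# Substrings seen in the detail column of the "retrying"/"failed" rows above.
ERROR_MARKERS = (
    "i/o timeout",
    "broken pipe",
    "connection refused",
    "connection reset by peer",
    "no working forwarders",
    "Multiple connections",
)


def classify(detail: str) -> str:
    """Reduce the free-text detail column to a coarse bucket."""
    if detail.startswith(("fastdp", "sleeve", "none")):
        return detail.split()[0]  # overlay method column
    for marker in ERROR_MARKERS:
        if marker in detail:
            return marker
    return "other"


def summarize(text: str) -> Counter:
    """Count (state, bucket) pairs over every connection line in `text`."""
    counts = Counter()
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            _direction, _addr, state, detail = match.groups()
            counts[(state, classify(detail))] += 1
    return counts
```

Passing the captured output of `/weave --local status connections` to `summarize` gives one count per (state, reason) pair, which makes it easier to see whether timeouts, duplicate connections, or missing forwarders dominate.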
Logs indicate various sorts of errors as well.
As I narrow down the root cause based on the symptoms I will refine the bug accordingly.
How to reproduce it?
Increase the CPU and memory requests to 500m and 500 MB so that the pods do not crash or get OOMKilled, increase the connection limit (CONN_LIMIT), and provision more than 150 nodes.
Versions:
$ weave version
2.5.1
$ kubectl version
v1.10.8
Logs:
$ kubectl logs -n kube-system <weave-net-pod> weave
019/02/06 11:59:04.996723 ->[172.20.89.63:11787] connection accepted
INFO: 2019/02/06 11:59:04.997252 Removed unreachable peer 6a:de:31:cf:24:2f(ip-172-20-74-252.us-west-2.compute.internal)
INFO: 2019/02/06 11:59:05.208747 EMSGSIZE on send, expecting PMTU update (IP packet was 60028 bytes, payload was 60020 bytes)
gc 436 @640.676s 14%: 0.11+499+0.068 ms clock, 0.22+403/197/0+0.13 ms cpu, 165->172->85 MB, 174 MB goal, 2 P
INFO: 2019/02/06 11:59:05.509295 Removed unreachable peer 6a:de:31:cf:24:2f()
INFO: 2019/02/06 11:59:05.601341 Removed unreachable peer 6a:de:31:cf:24:2f()
INFO: 2019/02/06 11:59:05.611366 Removed unreachable peer 6a:de:31:cf:24:2f()
INFO: 2019/02/06 11:59:05.612293 Removed unreachable peer 6a:de:31:cf:24:2f()
INFO: 2019/02/06 11:59:05.612741 Removed unreachable peer 6a:de:31:cf:24:2f()
INFO: 2019/02/06 11:59:05.617464 Removed unreachable peer 6a:de:31:cf:24:2f()
INFO: 2019/02/06 11:59:05.618458 ->[172.20.89.254:6783|82:33:3d:60:1a:2b(ip-172-20-89-254.us-west-2.compute.internal)]: connection added (new peer)
INFO: 2019/02/06 11:59:05.619166 ->[172.20.54.141:6783|32:ba:e2:0e:aa:59(ip-172-20-54-141.us-west-2.compute.internal)]: connection deleted
INFO: 2019/02/06 11:59:05.696022 ->[172.20.81.32:41081] connection accepted
INFO: 2019/02/06 11:59:05.697484 ->[172.20.67.185:27325|be:a6:c3:02:c6:9a(ip-172-20-67-185.us-west-2.compute.internal)]: connection shutting down due to error: Multiple connections to be:a6:c3:02:c6:9a(ip-172-20-67-185.us-west-2.compute.internal) added to 72:b6:4d:02:51:03(ip-172-20-83-113.us-west-2.compute.internal)
INFO: 2019/02/06 11:59:05.697554 ->[172.20.86.131:27966|32:f6:9f:d2:be:28(ip-172-20-86-131.us-west-2.compute.internal)]: connection shutting down due to error: Multiple connections to 32:f6:9f:d2:be:28(ip-172-20-86-131.us-west-2.compute.internal) added to 72:b6:4d:02:51:03(ip-172-20-83-113.us-west-2.compute.internal)
INFO: 2019/02/06 11:59:05.702059 ->[172.20.32.55:6783|3a:a5:fe:cc:8d:c2(ip-172-20-32-55.us-west-2.compute.internal)]: connection ready; using protocol version 2
INFO: 2019/02/06 11:59:05.702147 overlay_switch ->[3a:a5:fe:cc:8d:c2(ip-172-20-32-55.us-west-2.compute.internal)] using fastdp
INFO: 2019/02/06 11:59:05.707288 ->[172.20.74.252:6783] attempting connection
INFO: 2019/02/06 11:59:05.710945 overlay_switch ->[62:0a:6a:bb:ce:90(ip-172-20-51-128.us-west-2.compute.internal)] using sleeve
INFO: 2019/02/06 11:59:05.711187 ->[172.20.51.128:62804|62:0a:6a:bb:ce:90(ip-172-20-51-128.us-west-2.compute.internal)]: connection shutting down due to error: read tcp4 172.20.83.113:6783->172.20.51.128:62804: i/o timeout
INFO: 2019/02/06 11:59:05.803459 overlay_switch ->[0e:b3:e8:62:b2:11(ip-172-20-66-137.us-west-2.compute.internal)] using sleeve
INFO: 2019/02/06 11:59:05.803517 ->[172.20.66.137:6783|0e:b3:e8:62:b2:11(ip-172-20-66-137.us-west-2.compute.internal)]: connection shutting down due to error: read tcp4 172.20.83.113:28846->172.20.66.137:6783: i/o timeout
INFO: 2019/02/06 11:59:05.803721 overlay_switch ->[62:0a:6a:bb:ce:90(ip-172-20-51-128.us-west-2.compute.internal)] sleeve timed out waiting for UDP heartbeat
INFO: 2019/02/06 11:59:05.803819 overlay_switch ->[0e:b3:e8:62:b2:11(ip-172-20-66-137.us-west-2.compute.internal)] sleeve timed out waiting for UDP heartbeat
INFO: 2019/02/06 11:59:05.904841 Removed unreachable peer 6a:de:31:cf:24:2f()
INFO: 2019/02/06 11:59:05.906845 Removed unreachable peer 6a:de:31:cf:24:2f()
INFO: 2019/02/06 11:59:05.907700 Removed unreachable peer 6a:de:31:cf:24:2f()
INFO: 2019/02/06 11:59:05.996835 Removed unreachable peer 6a:de:31:cf:24:2f()
INFO: 2019/02/06 11:59:06.006184 Removed unreachable peer 6a:de:31:cf:24:2f()
INFO: 2019/02/06 11:59:06.007340 Removed unreachable peer 6a:de:31:cf:24:2f()
INFO: 2019/02/06 11:59:06.016334 Removed unreachable peer 6a:de:31:cf:24:2f()
INFO: 2019/02/06 11:59:06.016894 Removed unreachable peer 6a:de:31:cf:24:2f()
INFO: 2019/02/06 11:59:06.099141 Removed unreachable peer 6a:de:31:cf:24:2f()
INFO: 2019/02/06 11:59:06.104001 Removed unreachable peer 6a:de:31:cf:24:2f()
INFO: 2019/02/06 11:59:06.104854 Removed unreachable peer 6a:de:31:cf:24:2f()
INFO: 2019/02/06 11:59:06.105366 Removed unreachable peer 6a:de:31:cf:24:2f()
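The shutdown reasons in a log like this can be grouped the same way. This is a rough sketch under the assumption that every relevant line contains the phrase `connection shutting down due to error:` as in the excerpt above; the function name `shutdown_reasons` is hypothetical.

```python
import re
from collections import Counter

# Phrase copied from the log excerpt above; everything after it is the error text.
SHUTDOWN_RE = re.compile(r"connection shutting down due to error: (.*)$")


def shutdown_reasons(log_text: str) -> Counter:
    """Count connection shutdowns in a weaver log, grouped by cause."""
    reasons = Counter()
    for line in log_text.splitlines():
        match = SHUTDOWN_RE.search(line)
        if not match:
            continue
        message = match.group(1)
        if message.startswith("Multiple connections"):
            reasons["multiple connections"] += 1
        else:
            # Network errors in the excerpt end with ": <cause>", so keep the tail.
            reasons[message.rsplit(": ", 1)[-1]] += 1
    return reasons
```

Feeding it the output of `kubectl logs -n kube-system <weave-net-pod> weave` would show how the shutdowns split between i/o timeouts and duplicate-connection errors.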