CNI chaining fails and does not self-heal in AKS #6499
Labels: kind/bug, reported-by/end-user
Describe the bug
When Antrea agent pods start up before the pods of AKS's default network stack, CNI chaining occasionally fails and pod creation is blocked on the affected nodes. Manually restarting the `install-cni` container after the AKS network pods have started successfully recovers the CNI chaining.

To Reproduce
The affected node reports:

```
container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni plugin not initialized
```

Exec into the `install-cni` container and print the content of the generated conflist:

```
cat /host/etc/cni/net.d/05-antrea.conflist
```

which shows that the file is empty.

Expected
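The emptiness check above can be sketched as a small shell helper. This is illustrative only: the helper name is made up, and it assumes `python3` is available for JSON validation; the file path and the `install-cni` container name are the ones from this report.

```shell
#!/bin/sh
# Sketch of the diagnosis described above: check whether the generated
# Antrea conflist is present, non-empty, and parseable JSON.
# conflist_ok is a hypothetical helper, not part of Antrea.

conflist_ok() {
  f="$1"
  # An empty or missing file is exactly the broken state seen in this issue.
  [ -s "$f" ] || { echo "EMPTY_OR_MISSING"; return 1; }
  # A populated conflist must at least be valid JSON.
  python3 -m json.tool "$f" >/dev/null 2>&1 || { echo "INVALID_JSON"; return 1; }
  echo "OK"
}

# On a live cluster, one would inspect the file inside the agent pod
# (pod name is illustrative):
#   kubectl exec -n kube-system antrea-agent-xxxxx -c install-cni -- \
#     cat /host/etc/cni/net.d/05-antrea.conflist
```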
Antrea should not leave the CNI plugin setup in an invalid state, but should resolve the issue automatically.
Actual behavior
The node is marked as NotReady until the AKS engine decides to replace it, due to it being stuck in the NotReady state. The issue also blocks AKS clusters from autoscaling properly, as the new nodes are not usable and cannot host pods (because they are NotReady).

Versions:
Antrea: 2.0.1
Kubernetes: 1.28.5
Container runtime: Containerd 1.7.15-1
Linux kernel version: 5.15.0-1061-azure
Additional context
I have applied the following workaround to this script and line number:

antrea/build/images/scripts/install_cni_chaining, line 90 in fc40157
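For illustration, the kind of self-healing suggested here could take the shape of a poll-and-retry loop: instead of proceeding (or failing permanently) when the host CNI config is absent, the chaining script would wait until the default AKS network plugin has written its conflist. This is a hypothetical sketch, not the actual `install_cni_chaining` code; the directory argument, timeout, and function name are all made up.

```shell
#!/bin/sh
# Hypothetical sketch: wait for the host network stack to publish a
# non-empty *.conflist before Antrea attempts CNI chaining.
# wait_for_host_conflist is illustrative, not an Antrea function.

wait_for_host_conflist() {
  dir="$1"; timeout="${2:-60}"; waited=0
  while [ "$waited" -lt "$timeout" ]; do
    # Any non-empty *.conflist from the default stack unblocks chaining.
    for f in "$dir"/*.conflist; do
      [ -s "$f" ] && { echo "found: $f"; return 0; }
    done
    sleep 1
    waited=$((waited + 1))
  done
  echo "timed out after ${timeout}s waiting for host conflist" >&2
  return 1
}
```

With such a loop, an agent pod that starts before the AKS network pods would simply block and then proceed once the base config appears, instead of leaving an empty 05-antrea.conflist behind.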