What steps did you take and what happened?
The steps below were done manually as part of a high availability test:
- Create a cluster with 5 control-plane nodes and 0 (zero) worker nodes
- Drop two of the control-plane nodes after they have finished reconciling and are fully running, e.g. by powering off and destroying their VMs
- A MachineHealthCheck is configured and acts fast, identifying the missing nodes
- Using vSphere client
KCP seems to start the reconciliation of the first node and then gets stuck: it never finishes, and it never starts the remediation of the second one.
Some related logs:
I0119 12:48:50.143423 1 remediation.go:424] "etcd cluster projected after remediation of test-cluster1-srskl" controller="kubeadmcontrolplane" controllerGroup="" controllerKind="KubeadmControlPlane" KubeadmControlPlane="default/test-cluster1" namespace="default" name="test-cluster1" reconcileID=1d51dec1-3547-42d1-bdec-a2ebad24afb2 Cluster="default/test-cluster1" healthyMembers=[test-cluster1-xcb42 (test-cluster1-xcb42) test-cluster1-vw754 (test-cluster1-vw754) test-cluster1-4rsnz (test-cluster1-4rsnz)] unhealthyMembers=[test-cluster1-cqd5t (test-cluster1-cqd5t)] targetTotalMembers=4 targetQuorum=3 targetUnhealthyMembers=1 canSafelyRemediate=true
I0119 12:49:39.391449 1 scale.go:204] "Waiting for control plane to pass preflight checks" controller="kubeadmcontrolplane" controllerGroup="" controllerKind="KubeadmControlPlane" KubeadmControlPlane="default/test-cluster1" namespace="default" name="test-cluster1" reconcileID=7a7a1216-0554-461c-b6a1-35704e964132 Cluster="default/test-cluster1" failures="[Machine test-cluster1-cqd5t reports APIServerPodHealthy condition is false (Error, Missing node), Machine test-cluster1-cqd5t reports ControllerManagerPodHealthy condition is false (Error, Missing node), Machine test-cluster1-cqd5t reports SchedulerPodHealthy condition is false (Error, Missing node), Machine test-cluster1-cqd5t reports EtcdPodHealthy condition is false (Error, Missing node), Machine test-cluster1-cqd5t reports EtcdMemberHealthy condition is unknown (Failed to connect to the etcd pod on the test-cluster1-cqd5t node: could not establish a connection to any etcd node: unable to create etcd client: context deadline exceeded)]"
I0119 12:49:56.036901 1 remediation.go:101] "Another remediation is already in progress. Skipping remediation." controller="kubeadmcontrolplane" controllerGroup="" controllerKind="KubeadmControlPlane" KubeadmControlPlane="default/test-cluster1" namespace="default" name="test-cluster1" reconcileID=e60ffc9a-6c94-4f20-9451-4967235caddf Cluster="default/test-cluster1" Machine="default/test-cluster1-cqd5
What did you expect to happen?
A cluster with 5 control-plane nodes can lose two of them and still keep etcd quorum. I expected the reconciliation to finish successfully and the cluster to recover.
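
For reference, here is a minimal sketch of the majority-quorum arithmetic behind this expectation. It is not KCP's actual code: `quorum` and `projectAfterRemediation` are hypothetical helpers written only for illustration, and the sketch assumes a plain majority quorum (floor(n/2) + 1) with exactly one unhealthy member removed per remediation. With 3 healthy and 2 unhealthy members, it reproduces the values from the `remediation.go:424` log line above.

```go
package main

import "fmt"

// quorum returns the majority quorum for a cluster with the given number
// of voting members: floor(n/2) + 1.
func quorum(members int) int {
	return members/2 + 1
}

// projection describes the cluster as it would look after one unhealthy
// member has been removed. The field names mirror the keys in the log line,
// but the type itself is hypothetical.
type projection struct {
	TargetTotalMembers     int
	TargetQuorum           int
	TargetUnhealthyMembers int
	CanSafelyRemediate     bool
}

// projectAfterRemediation is a hypothetical helper, not part of KCP. It
// assumes at least one unhealthy member and projects the cluster after that
// member is removed, then checks whether the healthy members still reach
// the quorum of the smaller cluster.
func projectAfterRemediation(healthy, unhealthy int) projection {
	total := healthy + unhealthy - 1
	q := quorum(total)
	return projection{
		TargetTotalMembers:     total,
		TargetQuorum:           q,
		TargetUnhealthyMembers: unhealthy - 1,
		CanSafelyRemediate:     healthy >= q,
	}
}

func main() {
	// A 5-member cluster needs 3 members for quorum, so it tolerates 2 failures.
	fmt.Println("quorum for 5 members:", quorum(5))

	// State in this report: 3 healthy members, 2 unhealthy members.
	p := projectAfterRemediation(3, 2)
	fmt.Printf("targetTotalMembers=%d targetQuorum=%d targetUnhealthyMembers=%d canSafelyRemediate=%t\n",
		p.TargetTotalMembers, p.TargetQuorum, p.TargetUnhealthyMembers, p.CanSafelyRemediate)
}
```

Under these assumptions the first remediation is safe, and the second one should also be safe afterwards: 3 healthy members out of a projected total of 3 still satisfy a quorum of 2.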
Cluster API version
Kubernetes version
Anything else you would like to add?
No response
Label(s) to be applied
/kind bug
/area control-plane