A 5 control-plane node cluster does not recover when losing 2 nodes at the same time #10125

Open

@jcmoraisjr

Description

What steps did you take and what happened?

The steps below were done manually as part of a high availability test:

  • Create a cluster with 5 control-plane nodes and 0 (zero) worker nodes
  • Drop two of them after they finish reconciling and are fully running - e.g. by powering off and destroying their VMs
  • A MachineHealthCheck is configured and acts fast, identifying the missing nodes (a sketch of this setup follows the list)
  • Using the vSphere client
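
For reference, a minimal sketch of this setup, using the names that appear in the logs below (test-cluster1 in the default namespace); the MachineHealthCheck name, selector, and timeout values are assumptions for illustration, not taken from the actual test:

# Scale the control plane to 5 replicas (KubeadmControlPlane exposes the scale subresource):
kubectl -n default scale kubeadmcontrolplane test-cluster1 --replicas=5

# A MachineHealthCheck that targets control-plane machines so missing nodes are flagged quickly:
cat <<'EOF' | kubectl apply -f -
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineHealthCheck
metadata:
  name: test-cluster1-control-plane-mhc  # hypothetical name
  namespace: default
spec:
  clusterName: test-cluster1
  selector:
    matchLabels:
      cluster.x-k8s.io/control-plane: ""
  unhealthyConditions:
    - type: Ready
      status: Unknown
      timeout: 300s
    - type: Ready
      status: "False"
      timeout: 300s
EOF

Once all five machines are fully running, two of the VMs are powered off and destroyed through the vSphere client, as described above.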

KCP seems to start remediating the first node and then gets stuck, neither finishing that remediation nor starting the one for the second node.

Some related logs:

I0119 12:48:50.143423       1 remediation.go:424] "etcd cluster projected after remediation of test-cluster1-srskl" controller="kubeadmcontrolplane" controllerGroup="controlplane.cluster.x-k8s.io" controllerKind="KubeadmControlPlane" KubeadmControlPlane="default/test-cluster1" namespace="default" name="test-cluster1" reconcileID=1d51dec1-3547-42d1-bdec-a2ebad24afb2 Cluster="default/test-cluster1" healthyMembers=[test-cluster1-xcb42 (test-cluster1-xcb42) test-cluster1-vw754 (test-cluster1-vw754) test-cluster1-4rsnz (test-cluster1-4rsnz)] unhealthyMembers=[test-cluster1-cqd5t (test-cluster1-cqd5t)] targetTotalMembers=4 targetQuorum=3 targetUnhealthyMembers=1 canSafelyRemediate=true
I0119 12:49:39.391449       1 scale.go:204] "Waiting for control plane to pass preflight checks" controller="kubeadmcontrolplane" controllerGroup="controlplane.cluster.x-k8s.io" controllerKind="KubeadmControlPlane" KubeadmControlPlane="default/test-cluster1" namespace="default" name="test-cluster1" reconcileID=7a7a1216-0554-461c-b6a1-35704e964132 Cluster="default/test-cluster1" failures="[Machine test-cluster1-cqd5t reports APIServerPodHealthy condition is false (Error, Missing node), Machine test-cluster1-cqd5t reports ControllerManagerPodHealthy condition is false (Error, Missing node), Machine test-cluster1-cqd5t reports SchedulerPodHealthy condition is false (Error, Missing node), Machine test-cluster1-cqd5t reports EtcdPodHealthy condition is false (Error, Missing node), Machine test-cluster1-cqd5t reports EtcdMemberHealthy condition is unknown (Failed to connect to the etcd pod on the test-cluster1-cqd5t node: could not establish a connection to any etcd node: unable to create etcd client: context deadline exceeded)]"
I0119 12:49:56.036901       1 remediation.go:101] "Another remediation is already in progress. Skipping remediation." controller="kubeadmcontrolplane" controllerGroup="controlplane.cluster.x-k8s.io" controllerKind="KubeadmControlPlane" KubeadmControlPlane="default/test-cluster1" namespace="default" name="test-cluster1" reconcileID=e60ffc9a-6c94-4f20-9451-4967235caddf Cluster="default/test-cluster1" Machine="default/test-cluster1-cqd5
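
The "Another remediation is already in progress" message suggests KCP still tracks an in-flight remediation and therefore skips the second machine. As a sketch for inspecting that state (the annotation key below is my reading of the KCP remediation code in v1.5 and may differ across versions):

# Show the annotation KCP sets on the KubeadmControlPlane while a remediation is in flight:
kubectl -n default get kubeadmcontrolplane test-cluster1 \
  -o jsonpath='{.metadata.annotations.controlplane\.cluster\.x-k8s\.io/remediation-in-progress}'

# Confirm both failed machines were flagged by the MachineHealthCheck:
kubectl -n default get machines
kubectl -n default get machinehealthcheck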

What did you expect to happen?

A cluster with 5 control-plane nodes can lose two of them. I expected the remediation to finish successfully and the cluster to be recovered.
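
For context, the standard etcd quorum arithmetic behind this expectation, which matches the targetTotalMembers=4 / targetQuorum=3 figures in the first log line:

quorum(n) = floor(n/2) + 1
quorum(5) = 3  ->  a 5-member etcd cluster tolerates 5 - 3 = 2 simultaneous failures
quorum(4) = 3  ->  after the first unhealthy member is removed, 3 of the 4 remaining members are healthy, so KCP itself computes canSafelyRemediate=true for the second machine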

Cluster API version

v1.5.4

Kubernetes version

v1.27.7

Anything else you would like to add?

No response

Label(s) to be applied

/kind bug
/area control-plane


Metadata

    Labels

      • area/control-plane: Issues or PRs related to control-plane lifecycle management
      • kind/bug: Categorizes issue or PR as related to a bug.
      • needs-triage: Indicates an issue or PR lacks a `triage/foo` label and requires one.
      • priority/important-soon: Must be staffed and worked on either currently, or very soon, ideally in time for the next release.
