A 5 control-plane node cluster does not recover when losing 2 nodes at the same time #10125

Open

@jcmoraisjr

Description

What steps did you take and what happened?

The steps below were done manually as part of a high availability test:

  • Create a cluster with 5 control-plane nodes and 0 (zero) worker nodes
  • Drop two of them after they finish reconciling and are fully running - e.g. by powering off and destroying their VMs
  • A MachineHealthCheck is configured and acts fast, identifying the missing nodes (a sketch of this setup follows the list)
  • Using the vSphere client
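
For reference, a minimal sketch of this setup, using the names that appear in the logs below (test-cluster1 in the default namespace); the MachineHealthCheck name, selector, and timeout values are assumptions for illustration, not taken from the actual test:

# Scale the control plane to 5 replicas (KubeadmControlPlane exposes the scale subresource):
kubectl -n default scale kubeadmcontrolplane test-cluster1 --replicas=5

# A MachineHealthCheck that targets control-plane machines so missing nodes are flagged quickly:
cat <<'EOF' | kubectl apply -f -
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineHealthCheck
metadata:
  name: test-cluster1-control-plane-mhc  # hypothetical name
  namespace: default
spec:
  clusterName: test-cluster1
  selector:
    matchLabels:
      cluster.x-k8s.io/control-plane: ""
  unhealthyConditions:
    - type: Ready
      status: Unknown
      timeout: 300s
    - type: Ready
      status: "False"
      timeout: 300s
EOF

Once all five machines are fully running, two of the VMs are powered off and destroyed through the vSphere client, as described above.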

KCP seems to start remediating the first node and then gets stuck, neither finishing that remediation nor starting the one for the second node.

Some related logs:

I0119 12:48:50.143423       1 remediation.go:424] "etcd cluster projected after remediation of test-cluster1-srskl" controller="kubeadmcontrolplane" controllerGroup="controlplane.cluster.x-k8s.io" controllerKind="KubeadmControlPlane" KubeadmControlPlane="default/test-cluster1" namespace="default" name="test-cluster1" reconcileID=1d51dec1-3547-42d1-bdec-a2ebad24afb2 Cluster="default/test-cluster1" healthyMembers=[test-cluster1-xcb42 (test-cluster1-xcb42) test-cluster1-vw754 (test-cluster1-vw754) test-cluster1-4rsnz (test-cluster1-4rsnz)] unhealthyMembers=[test-cluster1-cqd5t (test-cluster1-cqd5t)] targetTotalMembers=4 targetQuorum=3 targetUnhealthyMembers=1 canSafelyRemediate=true
I0119 12:49:39.391449       1 scale.go:204] "Waiting for control plane to pass preflight checks" controller="kubeadmcontrolplane" controllerGroup="controlplane.cluster.x-k8s.io" controllerKind="KubeadmControlPlane" KubeadmControlPlane="default/test-cluster1" namespace="default" name="test-cluster1" reconcileID=7a7a1216-0554-461c-b6a1-35704e964132 Cluster="default/test-cluster1" failures="[Machine test-cluster1-cqd5t reports APIServerPodHealthy condition is false (Error, Missing node), Machine test-cluster1-cqd5t reports ControllerManagerPodHealthy condition is false (Error, Missing node), Machine test-cluster1-cqd5t reports SchedulerPodHealthy condition is false (Error, Missing node), Machine test-cluster1-cqd5t reports EtcdPodHealthy condition is false (Error, Missing node), Machine test-cluster1-cqd5t reports EtcdMemberHealthy condition is unknown (Failed to connect to the etcd pod on the test-cluster1-cqd5t node: could not establish a connection to any etcd node: unable to create etcd client: context deadline exceeded)]"
I0119 12:49:56.036901       1 remediation.go:101] "Another remediation is already in progress. Skipping remediation." controller="kubeadmcontrolplane" controllerGroup="controlplane.cluster.x-k8s.io" controllerKind="KubeadmControlPlane" KubeadmControlPlane="default/test-cluster1" namespace="default" name="test-cluster1" reconcileID=e60ffc9a-6c94-4f20-9451-4967235caddf Cluster="default/test-cluster1" Machine="default/test-cluster1-cqd5
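
The "Another remediation is already in progress" message suggests KCP still tracks an in-flight remediation and therefore skips the second machine. As a sketch for inspecting that state (the annotation key below is my reading of the KCP remediation code in v1.5 and may differ across versions):

# Show the annotation KCP sets on the KubeadmControlPlane while a remediation is in flight:
kubectl -n default get kubeadmcontrolplane test-cluster1 \
  -o jsonpath='{.metadata.annotations.controlplane\.cluster\.x-k8s\.io/remediation-in-progress}'

# Confirm both failed machines were flagged by the MachineHealthCheck:
kubectl -n default get machines
kubectl -n default get machinehealthcheck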

What did you expect to happen?

A cluster with 5 control-plane nodes can lose two of them. I expected the remediation to finish successfully and the cluster to be recovered.
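
For context, the standard etcd quorum arithmetic behind this expectation, which matches the targetTotalMembers=4 / targetQuorum=3 figures in the first log line:

quorum(n) = floor(n/2) + 1
quorum(5) = 3  ->  a 5-member etcd cluster tolerates 5 - 3 = 2 simultaneous failures
quorum(4) = 3  ->  after the first unhealthy member is removed, 3 of the 4 remaining members are healthy, so KCP itself computes canSafelyRemediate=true for the second machine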

Cluster API version

v1.5.4

Kubernetes version

v1.27.7

Anything else you would like to add?

No response

Label(s) to be applied

/kind bug
/area control-plane


Metadata

    Labels

      • area/control-plane: Issues or PRs related to control-plane lifecycle management
      • kind/bug: Categorizes issue or PR as related to a bug.
      • needs-triage: Indicates an issue or PR lacks a `triage/foo` label and requires one.
      • priority/important-soon: Must be staffed and worked on either currently, or very soon, ideally in time for the next release.
