ETCD-328: automatic replacement of an unhealthy member machine (#947)
Conversation
/hold
ac83f08 force-pushed to e3bc749
Tested on a 4.12 cluster with this PR on top.
Scenario: logs from the etcd operator
	}
	if currentVotingMemberIPListSet.Len() <= desiredControlPlaneReplicasCount {
		klog.V(2).Infof("@Mustafa: currentVotingMemberIPListSet.Len() is %v", currentVotingMemberIPListSet.Len())
	if currentVotingMemberIPListSet.Len() < desiredControlPlaneReplicasCount {
Suggested change:
-if currentVotingMemberIPListSet.Len() < desiredControlPlaneReplicasCount {
+if liveVotingMembers, err := c.getAllVotingMembers(ctx); err == nil && len(liveVotingMembers) < desiredControlPlaneReplicasCount {
@hasbro17 I propose this change, since the etcd-endpoints configmap takes some time to remove a member's IP after it has been removed from the membership.
@tjungblu wdyt?
cc @JoelSpeed
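A minimal sketch of why sourcing from the live member list matters (names, IPs, and shapes here are illustrative, not the operator's actual code): the etcd-endpoints configmap can lag behind the real membership, so gating scale-down on it risks acting on stale data.

```go
package main

import "fmt"

// canScaleDown is a hypothetical gate (not the operator's real code):
// only attempt removing a member while we still have at least the
// desired number of voting members.
func canScaleDown(votingMembers []string, desiredReplicas int) bool {
	return len(votingMembers) >= desiredReplicas
}

func main() {
	desired := 3

	// Stale etcd-endpoints configmap view: a just-removed member's IP
	// may still be listed for up to a minute.
	cachedIPs := []string{"10.0.0.1", "10.0.0.2", "10.0.0.3"}
	// Live member list from the etcd cluster: the removal is visible.
	liveIPs := []string{"10.0.0.1", "10.0.0.2"}

	fmt.Println(canScaleDown(cachedIPs, desired)) // true: stale data would allow another removal
	fmt.Println(canScaleDown(liveIPs, desired))   // false: live data skips scale-down
}
```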
Just so I understand the change: is it to make sure we get the current voting members from the etcd cluster itself rather than from the configmap? If so, that seems like a sensible move to make sure we are not going to break quorum, so +1 from me.
@JoelSpeed yes, exactly, that is the reason. Also, all tests passed :D 🎉
I am testing now and will update with the result before cleaning up the logs and adding tests.
/retest-required
The test cases pass locally!! :/
/retest-required
	liveVotingMembers, err := c.getAllVotingMembers(ctx)
	if err != nil {
		return err
e7dcf51 force-pushed to 5d64b1c
/hold cancel
/label tide/merge-method-squash
/assign @hasbro17
/retest-required
1 similar comment
/retest-required
I've left some comments on the testing docs, but I think there are some assumptions in these test cases about how the CPMS works that aren't true. CPMS works on a concept of indexes: each index represents one desired control plane instance, and these indexes are typically 0, 1 and 2 in a cluster you've just created. So, for example, the installer creates a machine in index 0 called master-0. When the CPMS sees an additional machine in the cluster called, say, master-3, it thinks this is a 4th index. Because the CPMS by default is set to ...
If the CPMS were to see a machine called master-foo-0, then, because this has a 0 at the end, it would assume it is part of index 0. This is how we track replacing machines, btw. So it would say: ok, I now have 2 machines in index 0, one of them must be replacing the other. If you delete one of these, eventually the cluster should settle again and CPMS will be happy. We actually intend to delete those extra machines in the index, but that is currently blocked because the etcd e2e can't handle that behaviour yet.
I think in the testing here you've created extra indexes; then, when you delete the machines, the CPMS sees 3 machines in 3 indexes and goes back to an available state, which is why no more machines are created; it's happy once they've been deleted.
Note that horizontal scaling is not a feature of CPMS yet; we need to make sure etcd can horizontally scale before we introduce that, so you are stuck with 3 replicas (you can force it to 5 if you want, but this isn't technically supported; no customer is doing this).
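CPMS's actual index-matching lives in the control-plane-machine-set operator; purely as an illustration of the naming behaviour described above (indexOf is a hypothetical helper, not the real implementation), the trailing number of a machine name determines its index, so master-0 and master-foo-0 both land in index 0 while master-3 opens a new index:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
	"unicode"
)

// indexOf is an illustrative helper (not CPMS's real code): it treats
// the trailing number of a machine name as its index.
func indexOf(machineName string) (int, bool) {
	// Find the last rune that is not a digit; everything after it is
	// the numeric suffix.
	i := strings.LastIndexFunc(machineName, func(r rune) bool { return !unicode.IsDigit(r) })
	idx, err := strconv.Atoi(machineName[i+1:])
	if err != nil {
		return 0, false // no trailing digits, no index
	}
	return idx, true
}

func main() {
	for _, name := range []string{"master-0", "master-foo-0", "master-3", "master-a"} {
		idx, ok := indexOf(name)
		fmt.Printf("%s -> index %d (valid: %v)\n", name, idx, ok)
	}
}
```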
@JoelSpeed there's a customer that runs this, which is why we have the five-control-plane test in our PRs, so it's certainly supported and used.
tjungblu left a comment:
Left a couple of comments, lgtm otherwise. Great tests and love the logs.
	liveVotingMembers, err := c.getAllVotingMembers(ctx)
	if err != nil {
		// TODO: update status condition
		// TODO: should go degraded?
Nope, that's a transient issue and will be retried next recon cycle.
Hmmm, good point. So out of curiosity, when should we use Status.Conditions?
I learnt about them recently in k8s operator tutorials :D
When you want to tell the admin to fix something specifically. I'd personally still prefer logs, as they are a bit easier to understand over time.
Go Degraded only if we should block the upgrades, but my philosophy is also different from what's used across the codebase.
Thanks a lot, now I got it :)
		return fmt.Errorf("could not find master machines with deletion hook: %w", err)
	}

	var votingMembersMachines []*machinev1beta1.Machine
Just to make the logic below consistent, should currentVotingMemberIPListSet be sourced from liveVotingMembers too? Otherwise you have one set generated from the configmap and the other from the etcd member list.
I actually wanted to do this, but did not want to change the logic completely.
I had to use liveVotingMembers for deciding whether to attempt scaling down or skip. During testing, I noticed that the configmap takes time to reflect the cluster membership, and the CEO keeps spinning, waiting for a new CM to be created.
+1 to replacing currentVotingMemberIPListSet with liveVotingMembers, wdyt @hasbro17?
I had the opposite thought, actually. For the default case, when we're checking whether scale-down is allowed, i.e. votingMembers >= desiredControlPlaneReplicasCount, we don't want to make live calls to the etcd cluster.
The reason we use currentVotingMemberIPListSet (sourced from the etcd-endpoints configmap) is so we're not inducing unnecessary load in the default case, by using a cache instead of live calls to the cluster.
During testing, I noticed that the configmap takes time to reflect the cluster membership, and the CEO keeps spinning, waiting for a new CM to be created.
Not sure what CM here is (assuming machine), but other than the delay, @Elbehery, is there a reason why you can't use the currentVotingMemberIPListSet in L128-135 above? How much is the delay in practice? I would expect the etcd-endpoints controller to update within the minute once there is an actual membership change.
Thanks @hasbro17 for your review.
votingMembers >= desiredControlPlaneReplicasCount we don't want to make live calls to the etcd cluster.
The etcd-endpoints CM contains all the voting members (i.e. non-learners), whether healthy or unhealthy. In #937 (comment) we agreed to constrain the scaling-down invariant around healthyVotingMembers, iiuc.
we're not inducing unnecessary load in the default case by using a cache instead of live calls to cluster.
I think listing the cluster members is not much load; moreover, the recent changes to the etcdcli package reuse the connections :)
I would expect the etcd-endpoints controller to update within the minute once there is an actual membership change.
Yes, this is correct, it takes around a minute to update the CM. I can change it back, but may I ask why listing the healthy members directly from the etcd cluster is not correct?
The reason why we use currentVotingMemberIPListSet (sourced from the etcd-endpoints configmap) is so we're not inducing unnecessary load in the default case by using a cache instead of live calls to cluster.
That makes sense. I think we make lots of MemberList calls across the operator already (health probe, etc). The connection to etcd is already open, as Mustafa said, so getting the current list of members is not an expensive call.
I'm all for consolidating this listing into a single controller that queries it in the background every 5-10s, with all other controllers feeding from that.
How much is the delay in practice?
I don't have a direct answer for that, but let's go with the case I've added to the quorum guard here:
https://github.com/openshift/cluster-etcd-operator/blob/master/pkg/operator/etcdendpointscontroller/etcdendpointscontroller.go#L145
So when you're down to two nodes, we'll never update the CM until you add a new node. I realize now that this is stupid, but we have no other way to selectively tell the static pod controller not to roll out a revision based on that.
In that worst case, it takes as long as it takes the cluster admin to add a new node.
I'm all up for consolidating this listing into a single controller that queries this in the background every 5-10s and all other controllers just feed from that.
+1
we'll never update the CM until you add a new node.
In this case I think listing the members from the etcd client makes more sense, no?
Yeah, I forgot that we'll have client connections open from the client cache, so I guess having this controller make periodic live list calls might not add significant load. So I guess we can keep the live call here.
So when you're down to two nodes, we'll never update the CM until you add a new node.
Yeah, that's not ideal, especially since the apiserver is using the endpoints configmap and will end up talking to a non-existent endpoint if we have 2 members and the CM is outdated with 3. But it's a necessary trade-off that we've made, given that we don't want quorum loss mid-rollout.
Plus, with CPMS we should soon have a new machine whose member would be added and promoted right after the unhealthy one is removed, so this gap shouldn't be too large.
Let's keep the live list for now, and leave the consolidation of list calls into a better internal cache than the etcd-endpoints CM as a follow-up.
If we agree on sticking with the member listing, then I would definitely like to see currentVotingMemberIPListSet come from the member listing too. Otherwise this is going to be wildly inconsistent.
	return false
}

func minimumTolerableQuorum(desiredControlPlaneReplicasCount int) int {
We have something similar already in cluster-etcd-operator/pkg/etcdcli/health.go, lines 241 to 268 in 2ec6e4d.
Can we reuse that instead?
Yes, I tried, but the conditions used there are too strict, and this was the problem with scaling down.
I use totalMembers < desiredControlPlaneCount as a first check to decide whether we can attempt to scale down or skip:
	if len(liveVotingMembers) < desiredControlPlaneReplicasCount {
		klog.V(2).Infof("Ignoring scale-down since the number of etcd members (%d) < desired number of control-plane replicas (%d)", len(liveVotingMembers), desiredControlPlaneReplicasCount)
		return nil
	}
I use len(healthyLiveVotingMembers) < minimumTolerableQuorum(desiredControlPlaneReplicasCount) to decide whether we have a chance to scale down without losing quorum.
The conditions used in IsQuorumFaultTolerant() are too strict for scaling down.
That's fair, but can we consolidate the formula for computing this into a shared func instead? So not everybody has to re-invent this math and test it again; what parameters you feed the formula is up to the caller.
I am fine with that, but both methods have different args 🤔
@Elbehery You don't have to reuse the IsQuorumFaultTolerant function as-is or change its args, but you can consolidate the quorum computation into a new function and reuse that in both IsQuorumFaultTolerant and minimumTolerableQuorum, so we're not maintaining the same thing in different packages.
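The helper being discussed boils down to simple majority math; a stdlib-only sketch (function names follow the review thread, the merged code may differ): for n voting members, quorum is a strict majority, and the cluster tolerates n minus quorum failures.

```go
package main

import "fmt"

// minimumTolerableQuorum returns the strict-majority quorum for a
// cluster of the given number of voting members.
func minimumTolerableQuorum(members int) int {
	return members/2 + 1
}

// faultTolerance returns how many members can fail before quorum is lost.
func faultTolerance(members int) int {
	return members - minimumTolerableQuorum(members)
}

func main() {
	for _, n := range []int{1, 3, 5} {
		fmt.Printf("members=%d quorum=%d tolerated failures=%d\n",
			n, minimumTolerableQuorum(n), faultTolerance(n))
	}
}
```

This is why 3-member clusters tolerate one failure and 5-member clusters tolerate two.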
a71acc6 force-pushed to b2c4a83
/retest-required
pkg/etcdcli/health.go (outdated)
	}
}

func MinimumTolerableQuorum(members int) int {
Can you please add a unit test for that?
pkg/etcdcli/health_test.go (outdated)
	t.Run(scenario.name, func(t *testing.T) {
		actual, err := MinimumTolerableQuorum(scenario.input)

		if scenario.expErr != nil {
You can simplify all your if conditions below with a simple:
require.Equal(t, scenario.expErr, err)
require.Equal(t, scenario.exp, actual)
Thanks, I never used this :D
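The suggestion above collapses the per-scenario if/else chains into two unconditional equality checks (testify's require.Equal compares expected and actual and fails the test on mismatch). A stdlib-only sketch of the same table-driven shape, with a hypothetical version of the function under test (the real one lives in pkg/etcdcli/health.go):

```go
package main

import (
	"errors"
	"fmt"
)

// minimumTolerableQuorum is a hypothetical version of the function
// under test: strict majority, with an error for non-positive input.
func minimumTolerableQuorum(members int) (int, error) {
	if members <= 0 {
		return 0, errors.New("number of members must be positive")
	}
	return members/2 + 1, nil
}

func main() {
	scenarios := []struct {
		name   string
		input  int
		exp    int
		expErr error
	}{
		{name: "three members", input: 3, exp: 2},
		{name: "five members", input: 5, exp: 3},
		{name: "zero members", input: 0, expErr: errors.New("number of members must be positive")},
	}
	for _, s := range scenarios {
		actual, err := minimumTolerableQuorum(s.input)
		// These two checks are what require.Equal replaces, one line
		// each; comparing error strings keeps the sketch dependency-free.
		if fmt.Sprint(err) != fmt.Sprint(s.expErr) {
			panic(s.name + ": unexpected error")
		}
		if actual != s.exp {
			panic(s.name + ": unexpected result")
		}
		fmt.Println(s.name, "ok")
	}
}
```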
bdd1393 force-pushed to 73258ae
-func VotingMemberIPListSet(configMapLister corev1listers.ConfigMapNamespaceLister) (sets.String, error) {
-	etcdEndpointsConfigMap, err := configMapLister.Get("etcd-endpoints")
+func VotingMemberIPListSet(ctx context.Context, cli etcdcli.EtcdClient) (sets.String, error) {
+	members, err := cli.VotingMemberList(ctx)
just replicate what the member controller does in:
https://github.com/openshift/cluster-etcd-operator/blob/master/pkg/operator/etcdendpointscontroller/etcdendpointscontroller.go#L132-L136
pkg/operator/ceohelpers/common.go (outdated)
-	for _, votingMemberIP := range etcdEndpointsConfigMap.Data {
-		currentVotingMemberIPListSet.Insert(votingMemberIP)
+	for _, member := range members {
+		currentVotingMemberIPListSet.Insert(member.PeerURLs[0])
@tjungblu this is the member's IP, right?
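For context on that question: each etcd member advertises full peer URLs such as "https://10.0.0.7:2380", so PeerURLs[0] is a URL, and the member's IP is its host portion. A small sketch (peerIP is a hypothetical helper, not necessarily what the operator uses):

```go
package main

import (
	"fmt"
	"net/url"
)

// peerIP extracts the host part (the member's IP) from a peer URL like
// "https://10.0.0.7:2380".
func peerIP(peerURL string) (string, error) {
	u, err := url.Parse(peerURL)
	if err != nil {
		return "", err
	}
	// Hostname strips the port (and brackets for IPv6 literals).
	return u.Hostname(), nil
}

func main() {
	ip, err := peerIP("https://10.0.0.7:2380")
	if err != nil {
		panic(err)
	}
	fmt.Println(ip) // 10.0.0.7
}
```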
73258ae force-pushed to 24bbc13
24bbc13 force-pushed to fc6a345
/retest-required
2 similar comments
/retest-required
/retest-required
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: Elbehery, hasbro17, tjungblu. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Approvers can indicate their approval by writing /approve in a comment.
/retest-required
@Elbehery: all tests passed! Full PR test history. Your PR dashboard.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
@Elbehery: cannot checkout
/cherry-pick release-4.12
@Elbehery: new pull request created: #960
This PR allows replacing an unhealthy etcd member machine in the cluster, adding the replacement as a learner. It relaxes the scale-down invariant to the minimum tolerable quorum threshold in order to maintain quorum.