ETCD-328: automatic replacement of an unhealthy member machine#947

Merged
openshift-merge-robot merged 5 commits intoopenshift:masterfrom
Elbehery:automatic_replacement_unhealthy_member_machine
Oct 28, 2022

Conversation

@Elbehery
Contributor

This PR allows replacing an unhealthy etcd member of the cluster with a new learner member. It relaxes the scale-down invariant to the minimum tolerable threshold in order to maintain quorum.
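The relaxed invariant can be sketched as follows; the function names here are hypothetical stand-ins for illustration, not the PR's actual helpers:

```go
package main

import "fmt"

// quorum returns the minimum number of voting members required to
// maintain etcd quorum for a cluster of the given size.
func quorum(members int) int {
	return members/2 + 1
}

// canScaleDown reports whether removing one voting member still leaves
// the cluster at or above the quorum implied by the desired control
// plane replica count. Hypothetical helper illustrating the invariant.
func canScaleDown(currentVotingMembers, desiredReplicas int) bool {
	return currentVotingMembers-1 >= quorum(desiredReplicas)
}

func main() {
	// With 3 desired replicas, quorum is 2: a 3-member cluster can
	// drop one unhealthy member, but a 2-member cluster cannot.
	fmt.Println(canScaleDown(3, 3)) // true
	fmt.Println(canScaleDown(2, 3)) // false
}
```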

@openshift-ci openshift-ci bot requested review from EmilyM1 and tjungblu October 16, 2022 22:18
@Elbehery
Contributor Author

/hold

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 17, 2022
@Elbehery Elbehery force-pushed the automatic_replacement_unhealthy_member_machine branch 2 times, most recently from ac83f08 to e3bc749 Compare October 17, 2022 11:02
@Elbehery
Contributor Author

Tested on a 4.12 cluster with this PR applied on top.

Scenario:

  • kill the ec2 instance backing master-0, without pre-provisioning a replacement
  • the etcd cluster is unhealthy due to an unreachable voting member
  • run oc delete machine master-0
  • master-0 becomes Failed
  • CPMO creates a replacement master machine
  • the etcd voting member on master-0 is removed from the cluster, even though the cluster is unhealthy
  • the new etcd pod joins the etcd cluster as a learner and is promoted
  • the etcd cluster recovers and becomes healthy again with 3 voting members (i.e. `desiredControlPlaneReplicaCount`)

Logs from etcd operator

I1017 12:21:28.740923       1 quorumguardcleanupcontroller.go:134] 2/3 guard pods ready. Waiting until all new guard pods are ready
I1017 12:21:28.775380       1 clustermembercontroller.go:364] Skipping etcd-ip-10-0-168-9.us-west-2.compute.internal as the etcd container is in incorrect state, isEtcdContainerRunning = false, isEtcdContainerReady = false
I1017 12:21:29.005558       1 request.go:601] Waited for 1.392216647s due to client-side throttling, not priority and fairness, request: GET:https://172.30.0.1:443/api/v1/namespaces/openshift-etcd/pods/installer-10-ip-10-0-168-9.us-west-2.compute.internal
I1017 12:21:29.620351       1 event.go:285] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-etcd-operator", Name:"etcd-operator", UID:"3ede3c2a-bfce-4bb3-919e-045955c24ef2", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'PodCreated' Created Pod/etcd-guard-ip-10-0-168-9.us-west-2.compute.internal -n openshift-etcd because it was missing
I1017 12:21:29.622796       1 quorumguardcleanupcontroller.go:134] 2/3 guard pods ready. Waiting until all new guard pods are ready
I1017 12:21:29.633835       1 quorumguardcleanupcontroller.go:134] 2/3 guard pods ready. Waiting until all new guard pods are ready
I1017 12:21:29.643377       1 quorumguardcleanupcontroller.go:134] 2/3 guard pods ready. Waiting until all new guard pods are ready
I1017 12:21:29.675572       1 clustermembercontroller.go:364] Skipping etcd-ip-10-0-168-9.us-west-2.compute.internal as the etcd container is in incorrect state, isEtcdContainerRunning = false, isEtcdContainerReady = false
I1017 12:21:29.720747       1 clustermembercontroller.go:364] Skipping etcd-ip-10-0-168-9.us-west-2.compute.internal as the etcd container is in incorrect state, isEtcdContainerRunning = false, isEtcdContainerReady = false
I1017 12:21:29.761795       1 quorumguardcleanupcontroller.go:134] 2/3 guard pods ready. Waiting until all new guard pods are ready
I1017 12:21:29.810172       1 event.go:285] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-etcd-operator", Name:"etcd-operator", UID:"3ede3c2a-bfce-4bb3-919e-045955c24ef2", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'MemberAddAsLearner' successfully added new member https://10.0.168.9:2380
I1017 12:21:29.833832       1 clustermembercontroller.go:283] Not ready for promotion: etcd learner member (https://10.0.168.9:2380) is not yet in sync with leader's log 
I1017 12:21:30.205573       1 request.go:601] Waited for 1.192895381s due to client-side throttling, not priority and fairness, request: GET:https://172.30.0.1:443/api/v1/namespaces/openshift-etcd/pods/installer-10-ip-10-0-168-9.us-west-2.compute.internal
I1017 12:21:30.366933       1 quorumguardcleanupcontroller.go:134] 2/3 guard pods ready. Waiting until all new guard pods are ready
I1017 12:21:30.758816       1 quorumguardcleanupcontroller.go:134] 2/3 guard pods ready. Waiting until all new guard pods are ready
I1017 12:21:31.209179       1 installer_controller.go:512] "ip-10-0-168-9.us-west-2.compute.internal" is in transition to 10, but has not made progress because static pod is pending
I1017 12:21:32.896457       1 quorumguardcleanupcontroller.go:134] 2/3 guard pods ready. Waiting until all new guard pods are ready
I1017 12:21:33.230975       1 event.go:285] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-etcd-operator", Name:"etcd-operator", UID:"3ede3c2a-bfce-4bb3-919e-045955c24ef2", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'MemberPromote' successfully promoted learner member https://10.0.168.9:2380
I1017 12:21:33.565330       1 quorumguardcleanupcontroller.go:134] 2/3 guard pods ready. Waiting until all new guard pods are ready
I1017 12:21:33.623126       1 status_controller.go:211] clusteroperator/etcd diff {"status":{"conditions":[{"lastTransitionTime":"2022-10-17T12:19:41Z","message":"NodeControllerDegraded: All master nodes are ready\nEtcdMembersDegraded: No unhealthy members found","reason":"AsExpected","status":"False","type":"Degraded"},{"lastTransitionTime":"2022-10-17T12:19:43Z","message":"NodeInstallerProgressing: 1 nodes are at revision 0; 2 nodes are at revision 8; 0 nodes have achieved new revision 10","reason":"NodeInstaller","status":"True","type":"Progressing"},{"lastTransitionTime":"2022-10-17T11:32:56Z","message":"StaticPodsAvailable: 2 nodes are active; 1 nodes are at revision 0; 2 nodes are at revision 8; 0 nodes have achieved new revision 10\nEtcdMembersAvailable: 2 members are available","reason":"AsExpected","status":"True","type":"Available"},{"lastTransitionTime":"2022-10-17T11:29:55Z","message":"All is well","reason":"AsExpected","status":"True","type":"Upgradeable"},{"lastTransitionTime":"2022-10-17T11:29:28Z","message":"The etcd backup controller is starting, and will decide if recent backups are available or if a backup is required","reason":"ControllerStarted","status":"Unknown","type":"RecentBackup"}]}}
I1017 12:21:33.639634       1 event.go:285] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-etcd-operator", Name:"etcd-operator", UID:"3ede3c2a-bfce-4bb3-919e-045955c24ef2", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'OperatorStatusChanged' Status for clusteroperator/etcd changed: Degraded message changed from "NodeControllerDegraded: All master nodes are ready\nGuardControllerDegraded: Missing PodIP in operand etcd-ip-10-0-168-9.us-west-2.compute.internal on node ip-10-0-168-9.us-west-2.compute.internal\nEtcdMembersDegraded: No unhealthy members found" to "NodeControllerDegraded: All master nodes are ready\nEtcdMembersDegraded: No unhealthy members found"
I1017 12:21:33.639742       1 quorumguardcleanupcontroller.go:134] 2/3 guard pods ready. Waiting until all new guard pods are ready
I1017 12:21:33.649419       1 event.go:285] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-etcd-operator", Name:"etcd-operator", UID:"3ede3c2a-bfce-4bb3-919e-045955c24ef2", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'ConfigMapUpdated' Updated ConfigMap/etcd-endpoints -n openshift-etcd:
cause by changes in data.678c11558ac790e0
I1017 12:21:33.652004       1 clustermemberremovalcontroller.go:116] @Mustafa: enter attemptToScaleDown()
I1017 12:21:33.652017       1 clustermemberremovalcontroller.go:121] @Mustafa: currentVotingMemberIPListSet is map[10.0.130.100:{} 10.0.168.9:{} 10.0.253.42:{}]
I1017 12:21:33.652271       1 clustermemberremovalcontroller.go:126] @Mustafa: desiredControlPlaneReplicasCount is 3
I1017 12:21:33.652279       1 clustermemberremovalcontroller.go:130] @Mustafa: currentVotingMemberIPListSet.Len() is 3
I1017 12:21:33.652286       1 clustermemberremovalcontroller.go:141] @Mustafa: current master machines with DeletionHook [0xc0017df200 0xc0017df420 0xc001db98c0]
I1017 12:21:33.652297       1 clustermemberremovalcontroller.go:148] @Mustafa: votingMembersMachines [0xc0017df200 0xc0017df420 0xc001db98c0]
W1017 12:21:33.656180       1 etcdcli_pool.go:87] cached client detected change in endpoints [[https://10.0.130.100:2379 https://10.0.253.42:2379]] vs. [[https://10.0.130.100:2379 https://10.0.168.9:2379 https://10.0.253.42:2379]]
I1017 12:21:33.661582       1 event.go:285] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-etcd-operator", Name:"etcd-operator", UID:"3ede3c2a-bfce-4bb3-919e-045955c24ef2", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'RevisionTriggered' new revision 11 triggered by "configmap/etcd-endpoints has changed"
I1017 12:21:33.808887       1 installer_controller.go:512] "ip-10-0-168-9.us-west-2.compute.internal" is in transition to 10, but has not made progress because static pod is pending
W1017 12:21:33.843243       1 etcdcli_pool.go:87] cached client detected change in endpoints [[https://10.0.130.100:2379 https://10.0.253.42:2379]] vs. [[https://10.0.130.100:2379 https://10.0.168.9:2379 https://10.0.253.42:2379]]
W1017 12:21:33.846922       1 etcdcli_pool.go:87] cached client detected change in endpoints [[https://10.0.130.100:2379 https://10.0.253.42:2379]] vs. [[https://10.0.130.100:2379 https://10.0.168.9:2379 https://10.0.253.42:2379]]
W1017 12:21:33.872262       1 etcdcli_pool.go:87] cached client detected change in endpoints [[https://10.0.130.100:2379 https://10.0.253.42:2379]] vs. [[https://10.0.130.100:2379 https://10.0.168.9:2379 https://10.0.253.42:2379]]
W1017 12:21:33.885684       1 etcdcli_pool.go:87] cached client detected change in endpoints [[https://10.0.130.100:2379 https://10.0.253.42:2379]] vs. [[https://10.0.130.100:2379 https://10.0.168.9:2379 https://10.0.253.42:2379]]
I1017 12:21:33.923949       1 defragcontroller.go:289] etcd member "ip-10-0-168-9.us-west-2.compute.internal" backend store fragmented: 16.65 %, dbSize: 138178560
I1017 12:21:33.923965       1 defragcontroller.go:289] etcd member "ip-10-0-130-100.us-west-2.compute.internal" backend store fragmented: 4.26 %, dbSize: 75403264
I1017 12:21:33.923972       1 defragcontroller.go:289] etcd member "ip-10-0-253-42.us-west-2.compute.internal" backend store fragmented: 3.25 %, dbSize: 74633216
I1017 12:21:33.939388       1 quorumguardcleanupcontroller.go:134] 2/3 guard pods ready. Waiting until all new guard pods are ready
I1017 12:21:34.537466       1 quorumguardcleanupcontroller.go:134] 2/3 guard pods ready. Waiting until all new guard pods are ready
I1017 12:21:34.807733       1 request.go:601] Waited for 1.183291648s due to client-side throttling, not priority and fairness, request: GET:https://172.30.0.1:443/api/v1/namespaces/openshift-etcd/configmaps/etcd-pod
I1017 12:21:35.409752       1 event.go:285] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-etcd-operator", Name:"etcd-operator", UID:"3ede3c2a-bfce-4bb3-919e-045955c24ef2", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'ConfigMapCreated' Created ConfigMap/revision-status-11 -n openshift-etcd because it was missing
I1017 12:21:35.410709       1 clustermemberremovalcontroller.go:116] @Mustafa: enter attemptToScaleDown()
I1017 12:21:35.410730       1 clustermemberremovalcontroller.go:121] @Mustafa: currentVotingMemberIPListSet is map[10.0.130.100:{} 10.0.168.9:{} 10.0.253.42:{}]
I1017 12:21:35.411069       1 clustermemberremovalcontroller.go:126] @Mustafa: desiredControlPlaneReplicasCount is 3
I1017 12:21:35.411079       1 clustermemberremovalcontroller.go:130] @Mustafa: currentVotingMemberIPListSet.Len() is 3
I1017 12:21:35.411087       1 clustermemberremovalcontroller.go:141] @Mustafa: current master machines with DeletionHook [0xc001db98c0 0xc0017df200 0xc0017df420]
I1017 12:21:35.411096       1 clustermemberremovalcontroller.go:148] @Mustafa: votingMembersMachines [0xc0017df200 0xc0017df420 0xc001db98c0]

}
if currentVotingMemberIPListSet.Len() <= desiredControlPlaneReplicasCount {
klog.V(2).Infof("@Mustafa: currentVotingMemberIPListSet.Len() is %v", currentVotingMemberIPListSet.Len())
if currentVotingMemberIPListSet.Len() < desiredControlPlaneReplicasCount {
Contributor Author

Suggested change
if currentVotingMemberIPListSet.Len() < desiredControlPlaneReplicasCount {
if liveVotingMembers, err := c.getAllVotingMembers(ctx); err == nil && len(liveVotingMembers) < desiredControlPlaneReplicasCount {

@hasbro17 I propose this change, since the etcd-endpoints configmap takes some time to drop a member's IP after the member is removed from the membership

@tjungblu wdyt ?
cc @JoelSpeed
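The suggestion above swaps the configmap-derived count for a live member list. A self-contained sketch of that filtering, using a stand-in `Member` struct in place of etcd's `etcdserverpb.Member` (in the real controller the slice would come from a live `MemberList` call, and the actual helper in this PR may differ):

```go
package main

import "fmt"

// Member mirrors the fields of etcd's Member message that matter here;
// a stand-in for etcdserverpb.Member so the sketch stays self-contained.
type Member struct {
	Name      string
	IsLearner bool
}

// votingMembers filters learners out of a member list, mirroring what a
// getAllVotingMembers built on a live MemberList call would return.
func votingMembers(members []Member) []Member {
	var voting []Member
	for _, m := range members {
		if !m.IsLearner {
			voting = append(voting, m)
		}
	}
	return voting
}

func main() {
	members := []Member{
		{Name: "master-0", IsLearner: false},
		{Name: "master-1", IsLearner: false},
		{Name: "replacement", IsLearner: true},
	}
	fmt.Println(len(votingMembers(members))) // 2
}
```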

Contributor

Just so I understand the change: is it to make sure we are getting the current voting members from the etcd cluster itself rather than from the configmap? If that's the case, it seems like a sensible move to make sure we are not going to break quorum at all, so +1 from me

Contributor Author

@JoelSpeed yes exactly, this is the reason .. also all tests passed :D 🎉

I am testing now and will update with the results, before cleaning up the logs and adding tests

@Elbehery
Contributor Author

/retest-required

The test cases pass locally !! :/

@Elbehery
Contributor Author

/retest-required


liveVotingMembers, err := c.getAllVotingMembers(ctx)
if err != nil {
return err
Contributor Author

@Elbehery TODO: add log and error msg
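One possible shape for that TODO, with a failing stand-in for `getAllVotingMembers` and hypothetical error wording:

```go
package main

import (
	"errors"
	"fmt"
)

// getAllVotingMembers is a stand-in that always fails, to exercise the
// error path the TODO asks to annotate.
func getAllVotingMembers() ([]string, error) {
	return nil, errors.New("etcd unreachable")
}

// attemptToScaleDown sketches the wrapped error and message the TODO
// calls for; the wording is hypothetical, not the PR's final text.
func attemptToScaleDown() error {
	liveVotingMembers, err := getAllVotingMembers()
	if err != nil {
		return fmt.Errorf("attemptToScaleDown: failed to list live voting members: %w", err)
	}
	fmt.Println("live voting members:", len(liveVotingMembers))
	return nil
}

func main() {
	if err := attemptToScaleDown(); err != nil {
		fmt.Println(err)
	}
}
```

Wrapping with `%w` keeps the underlying error inspectable with `errors.Is`/`errors.As` while adding the context the log message needs.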

@Elbehery
Contributor Author

Elbehery commented Oct 18, 2022

Scenario Replace one failed machine from 3 master nodes

logic

   - Destroy 1 ec2 instance backing `master-0`
   - CPMS should update the machine's status to `Failed`.
   - Delete the `master-0` machine resource.

setup

    - Desired number of master nodes is `3`.
    - Initial cluster ControlPlane machines is `3`.

expected

    - CEO should scale down the unhealthy master-0.
    - CPMS should create a replacement.
    - CEO should allow the replacement to join the cluster.

actual

    - CEO scales down the unhealthy node once the machine is deleted by the admin.
    - CPMS creates a replacement.
    - CEO joins the replacement to the etcd cluster.

CEO logs

(identical to the operator logs posted above)

@Elbehery
Contributor Author

Elbehery commented Oct 18, 2022

Scenario Replace one failed machine from 5 master nodes

logic

   - Destroy 1 ec2 instance backing `master-0`
   - CPMS should update the machine's status to `Failed`.
   - Delete the `master-0` machine resource.

setup

    - Desired number of master nodes is `3`.
    - Initial cluster ControlPlane machines is `5`.

expected

    - CEO should scale down the unhealthy master-0.
    - CPMS should create a replacement.
    - CEO should allow the replacement to join the cluster.

actual

    - CEO scales down the unhealthy node once the machine is deleted by the admin.
    - No replacement is created by CPMS (possible bug 🐛 🚨)

CEO logs

I1018 18:50:49.143942       1 machinedeletionhooks.go:151] skip removing the deletion hook from machine ci-ln-zy61bgb-76ef8-j9wf7-master-3 since its member is still present with any of: [{InternalIP 10.0.227.168} {InternalDNS ip-10-0-227-168.us-west-2.compute.internal} {Hostname ip-10-0-227-168.us-west-2.compute.internal}]
I1018 18:50:50.418019       1 clustermemberremovalcontroller.go:184] @Mustafa: unhealthy members [ID:10374916192776508190 name:"ip-10-0-227-168.us-west-2.compute.internal" peerURLs:"https://10.0.227.168:2380" clientURLs:"https://10.0.227.168:2379" ]
I1018 18:50:50.418073       1 clustermemberremovalcontroller.go:187] @Mustafa: votingMemberMachinesPendingDeletionIndex map[10.0.227.168:0xc002a326c0]
I1018 18:50:50.418245       1 clustermemberremovalcontroller.go:209] found unhealthy voting members with machine pending deletion: [0xc002a326c0]
I1018 18:50:50.418257       1 clustermemberremovalcontroller.go:210] unhealthy members found: [https://10.0.227.168:2380]
I1018 18:50:50.418266       1 clustermemberremovalcontroller.go:216] @Mustafa: unhealthyVotingMemberMachinesPendingDeletion: [0xc002a326c0]
I1018 18:50:50.418272       1 clustermemberremovalcontroller.go:401] @Mustafa: entering attemptToRemoveMemberFor()
I1018 18:50:50.418301       1 clustermemberremovalcontroller.go:402] @Mustafa: members are [ID:7583016053260549916 name:"ip-10-0-237-158.us-west-2.compute.internal" peerURLs:"https://10.0.237.158:2380" clientURLs:"https://10.0.237.158:2379"  ID:8697460826291536016 name:"ip-10-0-209-73.us-west-2.compute.internal" peerURLs:"https://10.0.209.73:2380" clientURLs:"https://10.0.209.73:2379"  ID:10374916192776508190 name:"ip-10-0-227-168.us-west-2.compute.internal" peerURLs:"https://10.0.227.168:2380" clientURLs:"https://10.0.227.168:2379"  ID:10881545389327994471 name:"ip-10-0-241-79.us-west-2.compute.internal" peerURLs:"https://10.0.241.79:2380" clientURLs:"https://10.0.241.79:2379"  ID:13043031258985252455 name:"ip-10-0-174-177.us-west-2.compute.internal" peerURLs:"https://10.0.174.177:2380" clientURLs:"https://10.0.174.177:2379" ]
I1018 18:50:50.424876       1 clustermemberremovalcontroller.go:419] successfully removed member: [ url: 10.0.227.168, name: ip-10-0-227-168.us-west-2.compute.internal, id: 10374916192776508190 ] from the cluster
I1018 18:50:50.425081       1 event.go:285] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-etcd-operator", Name:"etcd-operator", UID:"314c904e-f673-4f8b-84aa-6ca8c129d9f3", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'ScaleDown' successfully removed member: [ url: 10.0.227.168, name: ip-10-0-227-168.us-west-2.compute.internal, id: 10374916192776508190 ] from the cluster
I1018 18:50:52.186699       1 machinedeletionhooks.go:160] successfully removed the deletion hook from machine ci-ln-zy61bgb-76ef8-j9wf7-master-3
I1018 18:50:52.187430       1 event.go:285] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-etcd-operator", Name:"etcd-operator", UID:"314c904e-f673-4f8b-84aa-6ca8c129d9f3", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'MachineDeletionHook' successfully removed the deletion hook from machine ci-ln-zy61bgb-76ef8-j9wf7-master-3 as it doesn't have a member
I1018 18:50:52.187577       1 clustermemberremovalcontroller.go:126] @Mustafa: desiredControlPlaneReplicasCount is 3
I1018 18:50:52.194636       1 clustermemberremovalcontroller.go:135] @Mustafa: liveVotingMembers is [ID:7583016053260549916 name:"ip-10-0-237-158.us-west-2.compute.internal" peerURLs:"https://10.0.237.158:2380" clientURLs:"https://10.0.237.158:2379"  ID:8697460826291536016 name:"ip-10-0-209-73.us-west-2.compute.internal" peerURLs:"https://10.0.209.73:2380" clientURLs:"https://10.0.209.73:2379"  ID:10881545389327994471 name:"ip-10-0-241-79.us-west-2.compute.internal" peerURLs:"https://10.0.241.79:2380" clientURLs:"https://10.0.241.79:2379"  ID:13043031258985252455 name:"ip-10-0-174-177.us-west-2.compute.internal" peerURLs:"https://10.0.174.177:2380" clientURLs:"https://10.0.174.177:2379" ]
I1018 18:50:52.194728       1 clustermemberremovalcontroller.go:147] @Mustafa: current master machines with DeletionHook [0xc00228e400 0xc00228e620 0xc00228e840 0xc0032b0240]
I1018 18:50:52.194741       1 clustermemberremovalcontroller.go:154] @Mustafa: votingMembersMachines [0xc00228e400 0xc00228e620 0xc00228e840 0xc0032b0240]
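The logs above show unhealthy members being matched to machines pending deletion through an index keyed by internal IP (`votingMemberMachinesPendingDeletionIndex`). A self-contained sketch of that matching, with stand-in `Machine` and `Member` types (assumed shapes, not the controller's real types):

```go
package main

import (
	"fmt"
	"strings"
)

// Machine is a stand-in for a machine-API Machine; only the fields
// needed for this sketch are kept.
type Machine struct {
	Name       string
	InternalIP string
}

// Member is a stand-in for etcdserverpb.Member.
type Member struct {
	Name    string
	PeerURL string // e.g. "https://10.0.227.168:2380"
}

// peerIP extracts the host part of a peer URL (assumed https://IP:PORT form).
func peerIP(peerURL string) string {
	hostPort := strings.TrimPrefix(peerURL, "https://")
	host, _, _ := strings.Cut(hostPort, ":")
	return host
}

// matchUnhealthyToPendingDeletion pairs unhealthy members with machines
// pending deletion by internal IP, mirroring the index seen in the logs.
func matchUnhealthyToPendingDeletion(unhealthy []Member, pendingDeletion []Machine) []Machine {
	index := map[string]Machine{}
	for _, m := range pendingDeletion {
		index[m.InternalIP] = m
	}
	var matched []Machine
	for _, u := range unhealthy {
		if machine, ok := index[peerIP(u.PeerURL)]; ok {
			matched = append(matched, machine)
		}
	}
	return matched
}

func main() {
	unhealthy := []Member{{Name: "ip-10-0-227-168.us-west-2.compute.internal", PeerURL: "https://10.0.227.168:2380"}}
	pending := []Machine{{Name: "ci-ln-zy61bgb-76ef8-j9wf7-master-3", InternalIP: "10.0.227.168"}}
	fmt.Println(len(matchUnhealthyToPendingDeletion(unhealthy, pending))) // 1
}
```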

@Elbehery
Contributor Author

Elbehery commented Oct 18, 2022

Scenario Replace one failed machine from 4 master nodes

logic

   - Destroy `1 `ec2 instance backing `master-0`
   - CPMS should update the machine’s status with `failed`.
   - Delete `master-0` machine’s resource.

setup

    - Desired number of master nodes is `3`.
    - Initial Cluster ControlPlane machines is `4`.

expected

    - CEO should scale down the unhealthy master-0.
    - CPMS should create a replacement.
    - CEO should allow the replacement to join the cluster.

actual

    - CEO scales down the unhealthy node once the machine is deleted by the admin.
    - No replacement has been created by CPMS.

CEO logs

I1018 22:47:50.152487       1 clustermemberremovalcontroller.go:116] @Mustafa: enter attemptToScaleDown()
I1018 22:47:50.152514       1 clustermemberremovalcontroller.go:121] @Mustafa: currentVotingMemberIPListSet is map[10.0.177.164:{} 10.0.181.165:{} 10.0.185.114:{} 10.0.231.247:{}]
I1018 22:47:50.152881       1 clustermemberremovalcontroller.go:126] @Mustafa: desiredControlPlaneReplicasCount is 3
I1018 22:47:50.170840       1 event.go:285] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-etcd-operator", Name:"etcd-operator", UID:"ef7fb330-6c7b-4a05-9faa-c8ebc95833c6", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Warning' reason: 'MasterNodeRemoved' Observed removal of master node ip-10-0-177-164.us-west-2.compute.internal
I1018 22:49:03.592953       1 clustermemberremovalcontroller.go:209] found unhealthy voting members with machine pending deletion: [0xc003a2dd40]
I1018 22:49:03.592989       1 clustermemberremovalcontroller.go:210] unhealthy members found: [https://10.0.177.164:2380]
I1018 22:49:03.593035       1 clustermemberremovalcontroller.go:216] @Mustafa: unhealthyVotingMemberMachinesPendingDeletion: [0xc003a2dd40]
I1018 22:49:03.593055       1 clustermemberremovalcontroller.go:401] @Mustafa: entering attemptToRemoveMemberFor()
I1018 22:49:03.593060       1 clustermemberremovalcontroller.go:402] @Mustafa: members are [ID:2193243551350031740 name:"ip-10-0-231-247.us-west-2.compute.internal" peerURLs:"https://10.0.231.247:2380" clientURLs:"https://10.0.231.247:2379"  ID:2203788137076361769 name:"ip-10-0-185-114.us-west-2.compute.internal" peerURLs:"https://10.0.185.114:2380" clientURLs:"https://10.0.185.114:2379"  ID:10711119892892765467 name:"ip-10-0-177-164.us-west-2.compute.internal" peerURLs:"https://10.0.177.164:2380" clientURLs:"https://10.0.177.164:2379"  ID:11242654706413721825 name:"ip-10-0-181-165.us-west-2.compute.internal" peerURLs:"https://10.0.181.165:2380" clientURLs:"https://10.0.181.165:2379" ]
I1018 22:49:03.601644       1 clustermemberremovalcontroller.go:419] successfully removed member: [ url: 10.0.177.164, name: ip-10-0-177-164.us-west-2.compute.internal, id: 10711119892892765467 ] from the cluster
I1018 22:49:04.232846       1 machinedeletionhooks.go:160] successfully removed the deletion hook from machine ci-ln-g177chb-76ef8-9grr8-master-0

@Elbehery Elbehery force-pushed the automatic_replacement_unhealthy_member_machine branch from e7dcf51 to 5d64b1c Compare October 18, 2022 22:02
@Elbehery
Contributor Author

/hold cancel

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 18, 2022
@Elbehery
Contributor Author

/label tide/merge-method-squash

@openshift-ci openshift-ci bot added the tide/merge-method-squash Denotes a PR that should be squashed by tide when it merges. label Oct 18, 2022
@Elbehery
Contributor Author

Elbehery commented Oct 18, 2022

Scenario: replace two failed machines from 5 master nodes

logic

   - Destroy the `2` EC2 instances backing `master-0` and `master-1`.
   - Also tried destroying `master-0` first, waiting until its machine became `Failed`, then destroying `master-1`.

setup

    - Desired number of master nodes is `3`.
    - Initial number of control plane machines is `5`.

expected

    - CEO should scale down the unhealthy master-0 and master-1.
    - CPMS should create a replacement.
    - CEO should allow the replacement to join the cluster.

actual

    - API Server is down 🆘 🚨
    - The cluster is completely lost.
    - The cluster becomes unreachable again for a duration of 12-15 minutes.
    - After ~15 minutes the cluster became reachable again.
    - Deleting the failed machines results in `1` replacement being created, which joins the etcd cluster.
    - Expected `2` replacements to be created; however, only `1` was created 🐛 🚨
    - The cluster recovers to the desired state of `3` control plane machines.
    - For a duration the cluster is down to a single node, and that node is not the leader.

ETCD Member logs

sh-4.4# etcdctl endpoint status -w table
{"level":"warn","ts":"2022-10-19T00:12:09.556Z","logger":"etcd-client","caller":"v3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0004a4000/10.0.171.204:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \"transport: Error while dialing dial tcp 10.0.181.165:2379: connect: connection refused\""}
Failed to get the status of endpoint https://10.0.181.165:2379 (context deadline exceeded)
+---------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|         ENDPOINT          |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+---------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| https://10.0.171.204:2379 | 6a345a2a4a7f60ad |   3.5.5 |  140 MB |     false |      false |        24 |     107019 |             107019 |        |
+---------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
sh-4.4# etcdctl member list -w table
+------------------+---------+--------------------------------------------+---------------------------+---------------------------+------------+
|        ID        | STATUS  |                    NAME                    |        PEER ADDRS         |       CLIENT ADDRS        | IS LEARNER |
+------------------+---------+--------------------------------------------+---------------------------+---------------------------+------------+
| 6a345a2a4a7f60ad | started | ip-10-0-171-204.us-west-2.compute.internal | https://10.0.171.204:2380 | https://10.0.171.204:2379 |      false |
| 9c05eee3ac29d8e1 | started | ip-10-0-181-165.us-west-2.compute.internal | https://10.0.181.165:2380 | https://10.0.181.165:2379 |      false |
| b7a40406e4344087 | started | ip-10-0-228-161.us-west-2.compute.internal | https://10.0.228.161:2380 | https://10.0.228.161:2379 |      false |
+------------------+---------+--------------------------------------------+---------------------------+---------------------------+------------+
sh-4.4# 
sh-4.4# 
sh-4.4# 
sh-4.4# etcdctl endpoint status -w table
{"level":"warn","ts":"2022-10-19T00:12:34.441Z","logger":"etcd-client","caller":"v3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc000328c40/10.0.171.204:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \"transport: Error while dialing dial tcp 10.0.181.165:2379: connect: connection refused\""}
Failed to get the status of endpoint https://10.0.181.165:2379 (context deadline exceeded)
+---------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|         ENDPOINT          |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+---------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| https://10.0.171.204:2379 | 6a345a2a4a7f60ad |   3.5.5 |  141 MB |     false |      false |        24 |     107283 |             107283 |        |
+---------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
sh-4.4# etcdctl member list -w table
+------------------+---------+--------------------------------------------+---------------------------+---------------------------+------------+
|        ID        | STATUS  |                    NAME                    |        PEER ADDRS         |       CLIENT ADDRS        | IS LEARNER |
+------------------+---------+--------------------------------------------+---------------------------+---------------------------+------------+
| 6a345a2a4a7f60ad | started | ip-10-0-171-204.us-west-2.compute.internal | https://10.0.171.204:2380 | https://10.0.171.204:2379 |      false |
| 9c05eee3ac29d8e1 | started | ip-10-0-181-165.us-west-2.compute.internal | https://10.0.181.165:2380 | https://10.0.181.165:2379 |      false |
| b7a40406e4344087 | started | ip-10-0-228-161.us-west-2.compute.internal | https://10.0.228.161:2380 | https://10.0.228.161:2379 |      false |
+------------------+---------+--------------------------------------------+---------------------------+---------------------------+------------+
sh-4.4# etcdctl endpoint status -w table
+---------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|         ENDPOINT          |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+---------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| https://10.0.171.204:2379 | 6a345a2a4a7f60ad |   3.5.5 |  141 MB |     false |      false |        24 |     107676 |             107676 |        |
| https://10.0.181.165:2379 | 9c05eee3ac29d8e1 |   3.5.5 |  141 MB |     false |      false |        24 |     107676 |             107676 |        |
+---------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
sh-4.4# etcdctl member list -w table
+------------------+---------+--------------------------------------------+---------------------------+---------------------------+------------+
|        ID        | STATUS  |                    NAME                    |        PEER ADDRS         |       CLIENT ADDRS        | IS LEARNER |
+------------------+---------+--------------------------------------------+---------------------------+---------------------------+------------+
| 6a345a2a4a7f60ad | started | ip-10-0-171-204.us-west-2.compute.internal | https://10.0.171.204:2380 | https://10.0.171.204:2379 |      false |
| 9c05eee3ac29d8e1 | started | ip-10-0-181-165.us-west-2.compute.internal | https://10.0.181.165:2380 | https://10.0.181.165:2379 |      false |
| b7a40406e4344087 | started | ip-10-0-228-161.us-west-2.compute.internal | https://10.0.228.161:2380 | https://10.0.228.161:2379 |      false |
+------------------+---------+--------------------------------------------+---------------------------+---------------------------+------------+

CEO logs

I1018 23:21:55.778813       1 quorumguardcleanupcontroller.go:134] 3/4 guard pods ready. Waiting until all new guard pods are ready
I1018 23:21:55.779020       1 event.go:285] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-etcd-operator", Name:"etcd-operator", UID:"ef7fb330-6c7b-4a05-9faa-c8ebc95833c6", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'OperatorStatusChanged' Status for clusteroperator/etcd changed: Degraded changed from False to True ("ClusterMemberControllerDegraded: unhealthy members found during reconciling members\nEtcdMembersDegraded: 4 of 5 members are available, ip-10-0-231-247.us-west-2.compute.internal is unhealthy")
I1018 23:21:56.469666       1 quorumguardcleanupcontroller.go:134] 3/4 guard pods ready. Waiting until all new guard pods are ready
I1018 23:21:56.567048       1 request.go:601] Waited for 1.393155531s due to client-side throttling, not priority and fairness, request: GET:https://172.30.0.1:443/api/v1/namespaces/openshift-etcd/pods?labelSelector=app%3Dinstaller
I1018 23:21:56.970214       1 clustermemberremovalcontroller.go:116] @Mustafa: enter attemptToScaleDown()
I1018 23:21:56.970306       1 clustermemberremovalcontroller.go:121] @Mustafa: currentVotingMemberIPListSet is map[10.0.167.49:{} 10.0.171.204:{} 10.0.181.165:{} 10.0.185.114:{} 10.0.231.247:{}]
I1018 23:21:56.970518       1 event.go:285] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-etcd-operator", Name:"etcd-operator", UID:"ef7fb330-6c7b-4a05-9faa-c8ebc95833c6", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'ConfigMapCreated' Created ConfigMap/etcd-endpoints-19 -n openshift-etcd because it was missing
I1018 23:21:56.970582       1 clustermemberremovalcontroller.go:126] @Mustafa: desiredControlPlaneReplicasCount is 3
I1018 23:21:56.976771       1 clustermemberremovalcontroller.go:135] @Mustafa: liveVotingMembers is [ID:2193243551350031740 name:"ip-10-0-231-247.us-west-2.compute.internal" peerURLs:"https://10.0.231.247:2380" clientURLs:"https://10.0.231.247:2379"  ID:2203788137076361769 name:"ip-10-0-185-114.us-west-2.compute.internal" peerURLs:"https://10.0.185.114:2380" clientURLs:"https://10.0.185.114:2379"  ID:6504313761527134357 name:"ip-10-0-167-49.us-west-2.compute.internal" peerURLs:"https://10.0.167.49:2380" clientURLs:"https://10.0.167.49:2379"  ID:7652840804494303405 name:"ip-10-0-171-204.us-west-2.compute.internal" peerURLs:"https://10.0.171.204:2380" clientURLs:"https://10.0.171.204:2379"  ID:11242654706413721825 name:"ip-10-0-181-165.us-west-2.compute.internal" peerURLs:"https://10.0.181.165:2380" clientURLs:"https://10.0.181.165:2379" ]
I1018 23:21:56.976817       1 clustermemberremovalcontroller.go:147] @Mustafa: current master machines with DeletionHook [0xc000978880 0xc0032366c0 0xc0009ee480 0xc0022ec000 0xc000978660]
I1018 23:21:56.976826       1 clustermemberremovalcontroller.go:154] @Mustafa: votingMembersMachines [0xc0022ec000 0xc000978660 0xc000978880 0xc0032366c0 0xc0009ee480]
I1018 23:21:57.766618       1 request.go:601] Waited for 1.397145453s due to client-side throttling, not priority and fairness, request: POST:https://172.30.0.1:443/api/v1/namespaces/openshift-etcd/pods
I1018 23:21:57.772173       1 event.go:285] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-etcd-operator", Name:"etcd-operator", UID:"ef7fb330-6c7b-4a05-9faa-c8ebc95833c6", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'PodCreated' Created Pod/revision-pruner-18-ip-10-0-171-204.us-west-2.compute.internal -n openshift-etcd because it was missing
I1018 23:21:57.773744       1 quorumguardcleanupcontroller.go:134] 3/4 guard pods ready. Waiting until all new guard pods are ready
I1018 23:21:57.782356       1 quorumguardcleanupcontroller.go:134] 3/4 guard pods ready. Waiting until all new guard pods are ready
I1018 23:21:57.790014       1 quorumguardcleanupcontroller.go:134] 3/4 guard pods ready. Waiting until all new guard pods are ready
I1018 23:21:58.169354       1 installer_controller.go:512] "ip-10-0-167-49.us-west-2.compute.internal" is in transition to 18, but has not made progress because installer is not finished, but in Running phase
I1018 23:21:58.335435       1 quorumguardcleanupcontroller.go:134] 3/4 guard pods ready. Waiting until all new guard pods are ready
I1018 23:21:58.372838       1 event.go:285] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-etcd-operator", Name:"etcd-operator", UID:"ef7fb330-6c7b-4a05-9faa-c8ebc95833c6", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'SecretCreated' Created Secret/etcd-all-certs-19 -n openshift-etcd because it was missing
I1018 23:21:58.390059       1 event.go:285] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-etcd-operator", Name:"etcd-operator", UID:"ef7fb330-6c7b-4a05-9faa-c8ebc95833c6", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'RevisionCreate' Revision 18 created because configmap/etcd-pod has changed
I1018 23:21:58.486270       1 quorumguardcleanupcontroller.go:134] 3/4 guard pods ready. Waiting until all new guard pods are ready
I1018 23:21:58.767369       1 request.go:601] Waited for 1.395191466s due to client-side throttling, not priority and fairness, request: GET:https://172.30.0.1:443/api/v1/namespaces/openshift-etcd/pods/etcd-guard-ip-10-0-181-165.us-west-2.compute.internal
I1018 23:21:59.614528       1 quorumguardcleanupcontroller.go:134] 3/4 guard pods ready. Waiting until all new guard pods are ready
I1018 23:21:59.966688       1 request.go:601] Waited for 1.58161678s due to client-side throttling, not priority and fairness, request: GET:https://172.30.0.1:443/api/v1/namespaces/openshift-etcd/configmaps/etcd-scripts
I1018 23:22:00.619542       1 quorumguardcleanupcontroller.go:134] 3/4 guard pods ready. Waiting until all new guard pods are ready
I1018 23:22:00.632450       1 quorumguardcleanupcontroller.go:134] 3/4 guard pods ready. Waiting until all new guard pods are ready
I1018 23:22:00.640662       1 quorumguardcleanupcontroller.go:134] 3/4 guard pods ready. Waiting until all new guard pods are ready
I1018 23:22:00.651968       1 quorumguardcleanupcontroller.go:134] 3/4 guard pods ready. Waiting until all new guard pods are ready
I1018 23:22:00.967153       1 request.go:601] Waited for 1.797613102s due to client-side throttling, not priority and fairness, request: GET:https://172.30.0.1:443/api/v1/namespaces/openshift-etcd/pods/revision-pruner-18-ip-10-0-181-165.us-west-2.compute.internal
I1018 23:22:01.369765       1 installer_controller.go:512] "ip-10-0-167-49.us-west-2.compute.internal" is in transition to 18, but has not made progress because installer is not finished, but in Running phase
I1018 23:22:01.381446       1 status_controller.go:211] clusteroperator/etcd diff {"status":{"conditions":[{"lastTransitionTime":"2022-10-18T21:49:37Z","message":"The etcd backup controller is starting, and will decide if recent backups are available or if a backup is required","reason":"ControllerStarted","status":"Unknown","type":"RecentBackup"},{"lastTransitionTime":"2022-10-18T23:21:55Z","message":"ClusterMemberControllerDegraded: unhealthy members found during reconciling members\nEtcdMembersDegraded: 4 of 5 members are available, ip-10-0-231-247.us-west-2.compute.internal is unhealthy","reason":"ClusterMemberController_SyncError::EtcdMembers_UnhealthyMembers","status":"True","type":"Degraded"},{"lastTransitionTime":"2022-10-18T22:49:12Z","message":"NodeInstallerProgressing: 2 nodes are at revision 15; 1 nodes are at revision 16; 1 nodes are at revision 17; 0 nodes have achieved new revision 19","reason":"NodeInstaller","status":"True","type":"Progressing"},{"lastTransitionTime":"2022-10-18T21:52:34Z","message":"StaticPodsAvailable: 4 nodes are active; 2 nodes are at revision 15; 1 nodes are at revision 16; 1 nodes are at revision 17; 0 nodes have achieved new revision 19\nEtcdMembersAvailable: 4 of 5 members are available, ip-10-0-231-247.us-west-2.compute.internal is unhealthy","reason":"AsExpected","status":"True","type":"Available"},{"lastTransitionTime":"2022-10-18T21:49:37Z","message":"All is well","reason":"AsExpected","status":"True","type":"Upgradeable"}]}}
I1018 23:22:01.390175       1 event.go:285] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-etcd-operator", Name:"etcd-operator", UID:"ef7fb330-6c7b-4a05-9faa-c8ebc95833c6", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'OperatorStatusChanged' Status for clusteroperator/etcd changed: Progressing message changed from "NodeInstallerProgressing: 2 nodes are at revision 15; 1 nodes are at revision 16; 1 nodes are at revision 17; 0 nodes have achieved new revision 18" to "NodeInstallerProgressing: 2 nodes are at revision 15; 1 nodes are at revision 16; 1 nodes are at revision 17; 0 nodes have achieved new revision 19",Available message changed from "StaticPodsAvailable: 4 nodes are active; 2 nodes are at revision 15; 1 nodes are at revision 16; 1 nodes are at revision 17; 0 nodes have achieved new revision 18\nEtcdMembersAvailable: 4 of 5 members are available, ip-10-0-231-247.us-west-2.compute.internal is unhealthy" to "StaticPodsAvailable: 4 nodes are active; 2 nodes are at revision 15; 1 nodes are at revision 16; 1 nodes are at revision 17; 0 nodes have achieved new revision 19\nEtcdMembersAvailable: 4 of 5 members are available, ip-10-0-231-247.us-west-2.compute.internal is unhealthy"
I1018 23:22:01.390292       1 quorumguardcleanupcontroller.go:134] 3/4 guard pods ready. Waiting until all new guard pods are ready
I1018 23:22:01.497855       1 quorumguardcleanupcontroller.go:134] 3/4 guard pods ready. Waiting until all new guard pods are ready
I1018 23:22:01.626514       1 quorumguardcleanupcontroller.go:134] 3/4 guard pods ready. Waiting until all new guard pods are ready
I1018 23:22:01.967254       1 request.go:601] Waited for 1.596638118s due to client-side throttling, not priority and fairness, request: GET:https://172.30.0.1:443/api/v1/namespaces/openshift-etcd/pods/etcd-ip-10-0-171-204.us-west-2.compute.internal
I1018 23:22:02.505556       1 quorumguardcleanupcontroller.go:134] 3/4 guard pods ready. Waiting until all new guard pods are ready
I1018 23:22:03.166535       1 request.go:601] Waited for 1.779795255s due to client-side throttling, not priority and fairness, request: GET:https://172.30.0.1:443/api/v1/namespaces/openshift-etcd/pods/installer-18-ip-10-0-167-49.us-west-2.compute.internal
I1018 23:22:03.511640       1 quorumguardcleanupcontroller.go:134] 3/4 guard pods ready. Waiting until all new guard pods are ready
I1018 23:22:04.167298       1 request.go:601] Waited for 1.797644382s due to client-side throttling, not priority and fairness, request: GET:https://172.30.0.1:443/api/v1/namespaces/openshift-etcd/configmaps/restore-etcd-pod
I1018 23:22:04.527204       1 quorumguardcleanupcontroller.go:134] 3/4 guard pods ready. Waiting until all new guard pods are ready
I1018 23:22:04.786237       1 installer_controller.go:500] "ip-10-0-167-49.us-west-2.compute.internal" moving to (v1.NodeStatus) {
 NodeName: (string) (len=41) "ip-10-0-167-49.us-west-2.compute.internal",
 CurrentRevision: (int32) 15,
 TargetRevision: (int32) 19,
 LastFailedRevision: (int32) 0,
 LastFailedTime: (*v1.Time)(<nil>),
 LastFailedReason: (string) "",
 LastFailedCount: (int) 0,
 LastFallbackCount: (int) 0,
 LastFailedRevisionErrors: ([]string) <nil>
}
 because new revision pending
I1018 23:22:04.789573       1 quorumguardcleanupcontroller.go:134] 3/4 guard pods ready. Waiting until all new guard pods are ready
I1018 23:22:05.366414       1 request.go:601] Waited for 1.596102379s due to client-side throttling, not priority and fairness, request: GET:https://172.30.0.1:443/api/v1/namespaces/openshift-etcd/pods/etcd-ip-10-0-181-165.us-west-2.compute.internal
E1018 23:22:05.740230       1 base_controller.go:272] FSyncController reconciliation failed: client_error: client error: 403
E1018 23:22:05.751986       1 etcdmemberscontroller.go:73] Unhealthy etcd member found: ip-10-0-231-247.us-west-2.compute.internal, took=, err=create client failure: failed to make etcd client for endpoints [https://10.0.231.247:2379]: context deadline exceeded
E1018 23:22:05.752072       1 etcdmemberscontroller.go:73] Unhealthy etcd member found: ip-10-0-167-49.us-west-2.compute.internal, took=, err=create client failure: failed to make etcd client for endpoints [https://10.0.167.49:2379]: context deadline exceeded
I1018 23:22:05.765549       1 status_controller.go:211] clusteroperator/etcd diff {"status":{"conditions":[{"lastTransitionTime":"2022-10-18T21:49:37Z","message":"The etcd backup controller is starting, and will decide if recent backups are available or if a backup is required","reason":"ControllerStarted","status":"Unknown","type":"RecentBackup"},{"lastTransitionTime":"2022-10-18T23:21:55Z","message":"ClusterMemberControllerDegraded: unhealthy members found during reconciling members\nEtcdMembersDegraded: 3 of 5 members are available, ip-10-0-231-247.us-west-2.compute.internal is unhealthy, ip-10-0-167-49.us-west-2.compute.internal is unhealthy","reason":"ClusterMemberController_SyncError::EtcdMembers_UnhealthyMembers","status":"True","type":"Degraded"},{"lastTransitionTime":"2022-10-18T22:49:12Z","message":"NodeInstallerProgressing: 2 nodes are at revision 15; 1 nodes are at revision 16; 1 nodes are at revision 17; 0 nodes have achieved new revision 19","reason":"NodeInstaller","status":"True","type":"Progressing"},{"lastTransitionTime":"2022-10-18T21:52:34Z","message":"StaticPodsAvailable: 4 nodes are active; 2 nodes are at revision 15; 1 nodes are at revision 16; 1 nodes are at revision 17; 0 nodes have achieved new revision 19\nEtcdMembersAvailable: 4 of 5 members are available, ip-10-0-231-247.us-west-2.compute.internal is unhealthy","reason":"AsExpected","status":"True","type":"Available"},{"lastTransitionTime":"2022-10-18T21:49:37Z","message":"All is well","reason":"AsExpected","status":"True","type":"Upgradeable"}]}}
I1018 23:22:05.781195       1 event.go:285] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-etcd-operator", Name:"etcd-operator", UID:"ef7fb330-6c7b-4a05-9faa-c8ebc95833c6", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'OperatorStatusChanged' Status for clusteroperator/etcd changed: Degraded message changed from "ClusterMemberControllerDegraded: unhealthy members found during reconciling members\nEtcdMembersDegraded: 4 of 5 members are available, ip-10-0-231-247.us-west-2.compute.internal is unhealthy" to "ClusterMemberControllerDegraded: unhealthy members found during reconciling members\nEtcdMembersDegraded: 3 of 5 members are available, ip-10-0-231-247.us-west-2.compute.internal is unhealthy, ip-10-0-167-49.us-west-2.compute.internal is unhealthy"
I1018 23:22:05.781446       1 quorumguardcleanupcontroller.go:134] 3/4 guard pods ready. Waiting until all new guard pods are ready
I1018 23:22:05.792876       1 status_controller.go:211] clusteroperator/etcd diff {"status":{"conditions":[{"lastTransitionTime":"2022-10-18T21:49:37Z","message":"The etcd backup controller is starting, and will decide if recent backups are available or if a backup is required","reason":"ControllerStarted","status":"Unknown","type":"RecentBackup"},{"lastTransitionTime":"2022-10-18T23:21:55Z","message":"ClusterMemberControllerDegraded: unhealthy members found during reconciling members\nEtcdMembersDegraded: 3 of 5 members are available, ip-10-0-231-247.us-west-2.compute.internal is unhealthy, ip-10-0-167-49.us-west-2.compute.internal is unhealthy","reason":"ClusterMemberController_SyncError::EtcdMembers_UnhealthyMembers","status":"True","type":"Degraded"},{"lastTransitionTime":"2022-10-18T22:49:12Z","message":"NodeInstallerProgressing: 2 nodes are at revision 15; 1 nodes are at revision 16; 1 nodes are at revision 17; 0 nodes have achieved new revision 19","reason":"NodeInstaller","status":"True","type":"Progressing"},{"lastTransitionTime":"2022-10-18T21:52:34Z","message":"StaticPodsAvailable: 4 nodes are active; 2 nodes are at revision 15; 1 nodes are at revision 16; 1 nodes are at revision 17; 0 nodes have achieved new revision 19\nEtcdMembersAvailable: 3 of 5 members are available, ip-10-0-231-247.us-west-2.compute.internal is unhealthy, ip-10-0-167-49.us-west-2.compute.internal is unhealthy","reason":"AsExpected","status":"True","type":"Available"},{"lastTransitionTime":"2022-10-18T21:49:37Z","message":"All is well","reason":"AsExpected","status":"True","type":"Upgradeable"}]}}
I1018 23:22:05.800703       1 quorumguardcleanupcontroller.go:134] 3/4 guard pods ready. Waiting until all new guard pods are ready
I1018 23:22:05.801375       1 event.go:285] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-etcd-operator", Name:"etcd-operator", UID:"ef7fb330-6c7b-4a05-9faa-c8ebc95833c6", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'OperatorStatusChanged' Status for clusteroperator/etcd changed: Available message changed from "StaticPodsAvailable: 4 nodes are active; 2 nodes are at revision 15; 1 nodes are at revision 16; 1 nodes are at revision 17; 0 nodes have achieved new revision 19\nEtcdMembersAvailable: 4 of 5 members are available, ip-10-0-231-247.us-west-2.compute.internal is unhealthy" to "StaticPodsAvailable: 4 nodes are active; 2 nodes are at revision 15; 1 nodes are at revision 16; 1 nodes are at revision 17; 0 nodes have achieved new revision 19\nEtcdMembersAvailable: 3 of 5 members are available, ip-10-0-231-247.us-west-2.compute.internal is unhealthy, ip-10-0-167-49.us-west-2.compute.internal is unhealthy"
I1018 23:22:06.366576       1 request.go:601] Waited for 1.564555295s due to client-side throttling, not priority and fairness, request: GET:https://172.30.0.1:443/api/v1/namespaces/openshift-etcd/pods/installer-18-ip-10-0-167-49.us-west-2.compute.internal
I1018 23:22:07.367001       1 request.go:601] Waited for 1.796474623s due to client-side throttling, not priority and fairness, request: GET:https://172.30.0.1:443/api/v1/namespaces/openshift-etcd/pods/etcd-guard-ip-10-0-167-49.us-west-2.compute.internal
I1018 23:22:07.775900       1 quorumguardcleanupcontroller.go:134] 3/4 guard pods ready. Waiting until all new guard pods are ready
I1018 23:22:07.778074       1 event.go:285] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-etcd-operator", Name:"etcd-operator", UID:"ef7fb330-6c7b-4a05-9faa-c8ebc95833c6", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'PodCreated' Created Pod/revision-pruner-19-ip-10-0-185-114.us-west-2.compute.internal -n openshift-etcd because it was missing
I1018 23:22:07.788797       1 quorumguardcleanupcontroller.go:134] 3/4 guard pods ready. Waiting until all new guard pods are ready
I1018 23:22:07.801432       1 quorumguardcleanupcontroller.go:134] 3/4 guard pods ready. Waiting until all new guard pods are ready
I1018 23:22:08.170061       1 installer_controller.go:500] "ip-10-0-167-49.us-west-2.compute.internal" moving to (v1.NodeStatus) {
 NodeName: (string) (len=41) "ip-10-0-167-49.us-west-2.compute.internal",
 CurrentRevision: (int32) 15,
 TargetRevision: (int32) 19,
 LastFailedRevision: (int32) 0,
 LastFailedTime: (*v1.Time)(<nil>),
 LastFailedReason: (string) "",
 LastFailedCount: (int) 0,
 LastFallbackCount: (int) 0,
 LastFailedRevisionErrors: ([]string) <nil>
}
 because new revision pending
I1018 23:22:08.367314       1 request.go:601] Waited for 1.795808831s due to client-side throttling, not priority and fairness, request: GET:https://172.30.0.1:443/api/v1/namespaces/openshift-etcd/serviceaccounts/installer-sa
I1018 23:22:08.389669       1 quorumguardcleanupcontroller.go:134] 3/4 guard pods ready. Waiting until all new guard pods are ready
I1018 23:22:09.567470       1 request.go:601] Waited for 1.789547613s due to client-side throttling, not priority and fairness, request: GET:https://172.30.0.1:443/api/v1/namespaces/openshift-etcd/pods/revision-pruner-19-ip-10-0-181-165.us-west-2.compute.internal
I1018 23:22:10.767184       1 request.go:601] Waited for 1.290127155s due to client-side throttling, not priority and fairness, request: GET:https://172.30.0.1:443/api/v1/namespaces/openshift-etcd/pods/etcd-guard-ip-10-0-171-204.us-west-2.compute.internal
E1018 23:22:18.175043       1 base_controller.go:272] PruneController reconciliation failed: etcdserver: request timed out
I1018 23:22:18.175270       1 event.go:285] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-etcd-operator", Name:"etcd-operator", UID:"ef7fb330-6c7b-4a05-9faa-c8ebc95833c6", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Warning' reason: 'PodCreateFailed' Failed to create Pod/revision-pruner-19-ip-10-0-181-165.us-west-2.compute.internal -n openshift-etcd: etcdserver: request timed out
E1018 23:22:25.260962       1 event.go:267] Server rejected event '&v1.Event{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:v1.ObjectMeta{Name:"etcd-operator.171f4d4333a124a1", GenerateName:"", Namespace:"openshift-etcd-operator", SelfLink:"", UID:"", ResourceVersion:"", Generation:0, CreationTimestamp:time.Date(1, time.January, 1, 0, 0, 0, 0, time.UTC), DeletionTimestamp:<nil>, DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Finalizers:[]string(nil), ManagedFields:[]v1.ManagedFieldsEntry(nil)}, InvolvedObject:v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-etcd-operator", Name:"etcd-operator", UID:"ef7fb330-6c7b-4a05-9faa-c8ebc95833c6", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}, Reason:"PodCreateFailed", Message:"Failed to create Pod/revision-pruner-19-ip-10-0-181-165.us-west-2.compute.internal -n openshift-etcd: etcdserver: request timed out", Source:v1.EventSource{Component:"openshift-cluster-etcd-operator-prunecontroller", Host:""}, FirstTimestamp:time.Date(2022, time.October, 18, 23, 22, 18, 174719137, time.Local), LastTimestamp:time.Date(2022, time.October, 18, 23, 22, 18, 174719137, time.Local), Count:1, Type:"Warning", EventTime:time.Date(1, time.January, 1, 0, 0, 0, 0, time.UTC), Series:(*v1.EventSeries)(nil), Action:"", Related:(*v1.ObjectReference)(nil), ReportingController:"", ReportingInstance:""}': 'rpc error: code = Unknown desc = context deadline exceeded' (will not retry!)
I1018 23:22:28.861289       1 quorumguardcleanupcontroller.go:134] 3/4 guard pods ready. Waiting until all new guard pods are ready
I1018 23:22:29.115944       1 quorumguardcleanupcontroller.go:134] 3/4 guard pods ready. Waiting until all new guard pods are ready
I1018 23:22:29.143537       1 quorumguardcleanupcontroller.go:134] 3/4 guard pods ready. Waiting until all new guard pods are ready
I1018 23:22:29.159093       1 event.go:285] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-etcd-operator", Name:"etcd-operator", UID:"ef7fb330-6c7b-4a05-9faa-c8ebc95833c6", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'PodCreated' Created Pod/installer-19-ip-10-0-167-49.us-west-2.compute.internal -n openshift-etcd because it was missing
I1018 23:22:29.176324       1 quorumguardcleanupcontroller.go:134] 3/4 guard pods ready. Waiting until all new guard pods are ready
E1018 23:22:29.640639       1 base_controller.go:272] ClusterMemberController reconciliation failed: unhealthy members found during reconciling members
I1018 23:22:29.641046       1 event.go:285] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-etcd-operator", Name:"etcd-operator", UID:"ef7fb330-6c7b-4a05-9faa-c8ebc95833c6", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Warning' reason: 'UnhealthyEtcdMember' unhealthy members: ip-10-0-231-247.us-west-2.compute.internal,ip-10-0-185-114.us-west-2.compute.internal
E1018 23:22:29.838223       1 base_controller.go:272] DefragController reconciliation failed: cluster is unhealthy: 3 of 5 members are available, ip-10-0-231-247.us-west-2.compute.internal is unhealthy, ip-10-0-185-114.us-west-2.compute.internal is unhealthy
I1018 23:22:29.883132       1 installer_controller.go:512] "ip-10-0-167-49.us-west-2.compute.internal" is in transition to 19, but has not made progress because installer is not finished, but in Pending phase
I1018 23:22:30.269496       1 request.go:601] Waited for 1.079151574s due to client-side throttling, not priority and fairness, request: GET:https://172.30.0.1:443/api/v1/namespaces/openshift-etcd/pods?labelSelector=app%3Dinstaller
I1018 23:22:30.491705       1 event.go:285] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-etcd-operator", Name:"etcd-operator", UID:"ef7fb330-6c7b-4a05-9faa-c8ebc95833c6", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'PodCreated' Created Pod/revision-pruner-19-ip-10-0-167-49.us-west-2.compute.internal -n openshift-etcd because it was missing
I1018 23:22:31.462224       1 request.go:601] Waited for 1.183273956s due to client-side throttling, not priority and fairness, request: GET:https://172.30.0.1:443/api/v1/namespaces/openshift-etcd/pods?labelSelector=app%3Dinstaller
I1018 23:22:32.264994       1 installer_controller.go:512] "ip-10-0-167-49.us-west-2.compute.internal" is in transition to 19, but has not made progress because installer is not finished, but in Running phase
I1018 23:22:32.661886       1 request.go:601] Waited for 1.193524113s due to client-side throttling, not priority and fairness, request: GET:https://172.30.0.1:443/api/v1/namespaces/openshift-etcd/pods?labelSelector=app%3Dinstaller
I1018 23:22:32.870222       1 event.go:285] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-etcd-operator", Name:"etcd-operator", UID:"ef7fb330-6c7b-4a05-9faa-c8ebc95833c6", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'PodCreated' Created Pod/revision-pruner-19-ip-10-0-171-204.us-west-2.compute.internal -n openshift-etcd because it was missing
I1018 23:22:33.862200       1 request.go:601] Waited for 1.393359552s due to client-side throttling, not priority and fairness, request: GET:https://172.30.0.1:443/api/v1/namespaces/openshift-etcd/pods/etcd-ip-10-0-167-49.us-west-2.compute.internal
I1018 23:22:35.061269       1 request.go:601] Waited for 1.393916742s due to client-side throttling, not priority and fairness, request: GET:https://172.30.0.1:443/api/v1/namespaces/openshift-etcd/pods/installer-19-ip-10-0-167-49.us-west-2.compute.internal
I1018 23:22:35.065083       1 installer_controller.go:512] "ip-10-0-167-49.us-west-2.compute.internal" is in transition to 19, but has not made progress because installer is not finished, but in Running phase
I1018 23:22:35.754795       1 event.go:285] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-etcd-operator", Name:"etcd-operator", UID:"ef7fb330-6c7b-4a05-9faa-c8ebc95833c6", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Warning' reason: 'EtcdLeaderChangeMetrics' Detected leader change increase of 3.1676198856137265 over 5 minutes on "AWS"; disk metrics are: etcd-ip-10-0-167-49.us-west-2.compute.internal=0.005774,etcd-ip-10-0-171-204.us-west-2.compute.internal=0.009568,etcd-ip-10-0-181-165.us-west-2.compute.internal=0.005619,etcd-ip-10-0-185-114.us-west-2.compute.internal=0.007122,etcd-ip-10-0-231-247.us-west-2.compute.internal=0.006827. Most often this is as a result of inadequate storage or sometimes due to networking issues.
I1018 23:22:36.061432       1 request.go:601] Waited for 1.197324728s due to client-side throttling, not priority and fairness, request: GET:https://172.30.0.1:443/api/v1/namespaces/openshift-etcd/configmaps/restore-etcd-pod
I1018 23:22:37.061608       1 request.go:601] Waited for 1.237975558s due to client-side throttling, not priority and fairness, request: GET:https://172.30.0.1:443/api/v1/namespaces/openshift-etcd
I1018 23:22:38.066517       1 installer_controller.go:512] "ip-10-0-167-49.us-west-2.compute.internal" is in transition to 19, but has not made progress because installer is not finished, but in Running phase
I1018 23:22:38.261412       1 request.go:601] Waited for 1.796605973s due to client-side throttling, not priority and fairness, request: GET:https://172.30.0.1:443/api/v1/namespaces/openshift-etcd/pods/etcd-ip-10-0-181-165.us-west-2.compute.internal
I1018 23:22:39.261833       1 request.go:601] Waited for 1.56636698s due to client-side throttling, not priority and fairness, request: GET:https://172.30.0.1:443/api/v1/namespaces/openshift-etcd/pods/etcd-guard-ip-10-0-167-49.us-west-2.compute.internal
I1018 23:22:40.262216       1 request.go:601] Waited for 1.396560818s due to client-side throttling, not priority and fairness, request: GET:https://172.30.0.1:443/api/v1/namespaces/openshift-etcd/services/etcd
I1018 23:22:40.864599       1 installer_controller.go:512] "ip-10-0-167-49.us-west-2.compute.internal" is in transition to 19, but has not made progress because installer is not finished, but in Running phase
E1018 23:22:59.675889       1 base_controller.go:272] ClusterMemberController reconciliation failed: unhealthy members found during reconciling members
I1018 23:22:59.676025       1 event.go:285] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-etcd-operator", Name:"etcd-operator", UID:"ef7fb330-6c7b-4a05-9faa-c8ebc95833c6", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Warning' reason: 'UnhealthyEtcdMember' unhealthy members: ip-10-0-231-247.us-west-2.compute.internal,ip-10-0-185-114.us-west-2.compute.internal
E1018 23:22:59.870257       1 base_controller.go:272] DefragController reconciliation failed: cluster is unhealthy: 3 of 5 members are available, ip-10-0-231-247.us-west-2.compute.internal is unhealthy, ip-10-0-185-114.us-west-2.compute.internal is unhealthy
E1018 23:23:35.842744       1 leaderelection.go:330] error retrieving resource lock openshift-etcd-operator/openshift-cluster-etcd-operator-lock: rpc error: code = Unknown desc = malformed header: missing HTTP content-type
W1018 23:23:36.139706       1 etcdcli_pool.go:94] cached client considered unhealthy after 0 tries, trying again. Err: error during cache client health connection check: rpc error: code = Unknown desc = malformed header: missing HTTP content-type
I1018 23:23:36.139723       1 etcdcli_pool.go:157] closing cached client
W1018 23:23:36.154468       1 etcdcli_pool.go:94] cached client considered unhealthy after 0 tries, trying again. Err: error during cache client health connection check: rpc error: code = Unknown desc = malformed header: missing HTTP content-type
I1018 23:23:36.154631       1 etcdcli_pool.go:157] closing cached client
I1018 23:23:38.154946       1 etcdcli_pool.go:70] creating a new cached client
I1018 23:23:42.417523       1 request.go:601] Waited for 1.195727347s, retries: 1, retry-after: 1s - retry-reason: due to retryable error, error: Get "https://172.30.0.1:443/api/v1/namespaces/openshift-etcd/pods/etcd-ip-10-0-185-114.us-west-2.compute.internal": unexpected EOF - request: GET:https://172.30.0.1:443/api/v1/namespaces/openshift-etcd/pods/etcd-ip-10-0-185-114.us-west-2.compute.internal
I1018 23:23:43.617177       1 request.go:601] Waited for 2.38428945s, retries: 1, retry-after: 1s - retry-reason: due to retryable error, error: Get "https://172.30.0.1:443/api/v1/namespaces/openshift-etcd/pods?allowWatchBookmarks=true&resourceVersion=71516&timeout=7m53s&timeoutSeconds=473&watch=true": unexpected EOF - request: GET:https://172.30.0.1:443/api/v1/namespaces/openshift-etcd/pods?allowWatchBookmarks=true&resourceVersion=71516&timeout=7m53s&timeoutSeconds=473&watch=true
E1018 23:23:51.139108       1 etcdmemberscontroller.go:73] Unhealthy etcd member found: ip-10-0-231-247.us-west-2.compute.internal, took=, err=create client failure: failed to make etcd client for endpoints [https://10.0.231.247:2379]: context deadline exceeded
E1018 23:23:51.139130       1 etcdmemberscontroller.go:73] Unhealthy etcd member found: ip-10-0-185-114.us-west-2.compute.internal, took=, err=create client failure: failed to make etcd client for endpoints [https://10.0.185.114:2379]: context deadline exceeded
E1018 23:23:51.139138       1 etcdmemberscontroller.go:73] Unhealthy etcd member found: ip-10-0-167-49.us-west-2.compute.internal, took=, err=create client failure: failed to make etcd client for endpoints [https://10.0.167.49:2379]: context deadline exceeded
E1018 23:23:51.139145       1 etcdmemberscontroller.go:73] Unhealthy etcd member found: ip-10-0-171-204.us-west-2.compute.internal, took=, err=create client failure: error during client connection check: rpc error: code = Unknown desc = malformed header: missing HTTP content-type
E1018 23:23:51.139152       1 etcdmemberscontroller.go:73] Unhealthy etcd member found: ip-10-0-181-165.us-west-2.compute.internal, took=, err=create client failure: failed to make etcd client for endpoints [https://10.0.181.165:2379]: context deadline exceeded
W1018 23:23:53.155501       1 etcdcli_pool.go:73] could not create a new cached client after 1 tries, trying again. Err: failed to make etcd client for endpoints [https://10.0.167.49:2379 https://10.0.171.204:2379 https://10.0.181.165:2379 https://10.0.185.114:2379 https://10.0.231.247:2379]: context deadline exceeded
I1018 23:24:14.687977       1 event.go:285] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-etcd-operator", Name:"etcd-operator", UID:"ef7fb330-6c7b-4a05-9faa-c8ebc95833c6", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Warning' reason: 'UnhealthyEtcdMember' unhealthy members: ip-10-0-231-247.us-west-2.compute.internal,ip-10-0-185-114.us-west-2.compute.internal,ip-10-0-167-49.us-west-2.compute.internal,ip-10-0-171-204.us-west-2.compute.internal,ip-10-0-181-165.us-west-2.compute.internal
E1018 23:24:14.688175       1 base_controller.go:272] ClusterMemberController reconciliation failed: unhealthy members found during reconciling members
E1018 23:24:14.878571       1 base_controller.go:272] DefragController reconciliation failed: cluster is unhealthy: 0 of 5 members are available, ip-10-0-231-247.us-west-2.compute.internal is unhealthy, ip-10-0-185-114.us-west-2.compute.internal is unhealthy, ip-10-0-167-49.us-west-2.compute.internal is unhealthy, ip-10-0-171-204.us-west-2.compute.internal is unhealthy, ip-10-0-181-165.us-west-2.compute.internal is unhealthy
I1018 23:24:25.142227       1 event.go:285] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-etcd-operator", Name:"etcd-operator", UID:"ef7fb330-6c7b-4a05-9faa-c8ebc95833c6", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Warning' reason: 'EtcdMembersErrorUpdatingStatus' Timeout: request did not complete within requested timeout - context deadline exceeded
E1018 23:24:41.379035       1 base_controller.go:272] InstallerStateController reconciliation failed: the server was unable to return a response in the time allotted, but may still be processing the request (get pods)
E1018 23:24:41.381346       1 base_controller.go:272] FSyncController reconciliation failed: the server was unable to return a response in the time allotted, but may still be processing the request (get infrastructures.config.openshift.io cluster)
E1018 23:24:48.691180       1 event.go:267] Server rejected event '&v1.Event{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:v1.ObjectMeta{Name:"etcd-operator.171f4d5e545a0f31", GenerateName:"", Namespace:"openshift-etcd-operator", SelfLink:"", UID:"", ResourceVersion:"", Generation:0, CreationTimestamp:time.Date(1, time.January, 1, 0, 0, 0, 0, time.UTC), DeletionTimestamp:<nil>, DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Finalizers:[]string(nil), ManagedFields:[]v1.ManagedFieldsEntry(nil)}, InvolvedObject:v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-etcd-operator", Name:"etcd-operator", UID:"ef7fb330-6c7b-4a05-9faa-c8ebc95833c6", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}, Reason:"UnhealthyEtcdMember", Message:"unhealthy members: ip-10-0-231-247.us-west-2.compute.internal,ip-10-0-185-114.us-west-2.compute.internal,ip-10-0-167-49.us-west-2.compute.internal,ip-10-0-171-204.us-west-2.compute.internal,ip-10-0-181-165.us-west-2.compute.internal", Source:v1.EventSource{Component:"openshift-cluster-etcd-operator-etcd-client", Host:""}, FirstTimestamp:time.Date(2022, time.October, 18, 23, 24, 14, 687825713, time.Local), LastTimestamp:time.Date(2022, time.October, 18, 23, 24, 14, 687825713, time.Local), Count:1, Type:"Warning", EventTime:time.Date(1, time.January, 1, 0, 0, 0, 0, time.UTC), Series:(*v1.EventSeries)(nil), Action:"", Related:(*v1.ObjectReference)(nil), ReportingController:"", ReportingInstance:""}': 'Timeout: request did not complete within requested timeout - context deadline exceeded' (will not retry!)
E1018 23:24:59.144341       1 event.go:267] Server rejected event '&v1.Event{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:v1.ObjectMeta{Name:"etcd-operator.171f4d60c3779bf9", GenerateName:"", Namespace:"openshift-etcd-operator", SelfLink:"", UID:"", ResourceVersion:"", Generation:0, CreationTimestamp:time.Date(1, time.January, 1, 0, 0, 0, 0, time.UTC), DeletionTimestamp:<nil>, DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Finalizers:[]string(nil), ManagedFields:[]v1.ManagedFieldsEntry(nil)}, InvolvedObject:v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-etcd-operator", Name:"etcd-operator", UID:"ef7fb330-6c7b-4a05-9faa-c8ebc95833c6", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}, Reason:"EtcdMembersErrorUpdatingStatus", Message:"Timeout: request did not complete within requested timeout - context deadline exceeded", Source:v1.EventSource{Component:"openshift-cluster-etcd-operator-member-observer-controller-etcdmemberscontroller", Host:""}, FirstTimestamp:time.Date(2022, time.October, 18, 23, 24, 25, 141967865, time.Local), LastTimestamp:time.Date(2022, time.October, 18, 23, 24, 25, 141967865, time.Local), Count:1, Type:"Warning", EventTime:time.Date(1, time.January, 1, 0, 0, 0, 0, time.UTC), Series:(*v1.EventSeries)(nil), Action:"", Related:(*v1.ObjectReference)(nil), ReportingController:"", ReportingInstance:""}': 'Timeout: request did not complete within requested timeout - context deadline exceeded' (will not retry!)
I1018 23:24:59.144449       1 event.go:285] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-etcd-operator", Name:"etcd-operator", UID:"ef7fb330-6c7b-4a05-9faa-c8ebc95833c6", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Warning' reason: 'EtcdMembersErrorUpdatingStatus' Timeout: request did not complete within requested timeout - context deadline exceeded
E1018 23:25:01.844558       1 leaderelection.go:330] error retrieving resource lock openshift-etcd-operator/openshift-cluster-etcd-operator-lock: the server was unable to return a response in the time allotted, but may still be processing the request (get configmaps openshift-cluster-etcd-operator-lock)
I1018 23:25:08.231328       1 leaderelection.go:283] failed to renew lease openshift-etcd-operator/openshift-cluster-etcd-operator-lock: timed out waiting for the condition
E1018 23:25:08.231418       1 leaderelection.go:306] Failed to release lock: resource name may not be empty
W1018 23:25:08.231450       1 leaderelection.go:82] leader election lost

@Elbehery
Contributor Author

Scenario: replace two failed machines from 4 master nodes

logic

    - Destroy the `2` ec2 instances backing `master-0` and `master-1`.
    - CPMS should update each machine's status to `failed`.
    - Delete the `master-0` machine resource.

setup

    - Desired number of master nodes is `3`.
    - Initial cluster control plane machine count is `4`.

expected

    - CEO should scale down the unhealthy `master-0` and `master-1`.
    - CPMS should create a replacement.
    - CEO should allow the replacement to join the cluster.

actual

    - After creating the extra 4th machine, another master machine moved to `deleting`.
    - I did not destroy or delete this machine; this is a potential bug 🐛 🚨 🆘
    - CEO logs show this machine being scaled down, which is not expected.
    - Killed the ec2 instance backing another master machine, taking the failure domain into account when picking the victim machine.
    - Once the corresponding machine resource became `failed`, deleted the machine resource.
    - CEO scales down the unhealthy member.
    - No replacement has been created. 🐛 🆘 🚨
    - The cluster is healthy, but with only `2` master nodes.
    - Initial setup: `4` master nodes.
    - DesiredControlPlaneReplicas: `3` master nodes.
    - Actual: `2` master nodes.

CEO logs

before killing master-1

I1019 00:27:53.269144       1 machinedeletionhooks.go:151] skip removing the deletion hook from machine ci-ln-g177chb-76ef8-9grr8-master-5 since its member is still present with any of: [{InternalIP 10.0.171.204} {InternalDNS ip-10-0-171-204.us-west-2.compute.internal} {Hostname ip-10-0-171-204.us-west-2.compute.internal}]
I1019 00:27:53.549257       1 clustermemberremovalcontroller.go:187] @Mustafa: votingMemberMachinesPendingDeletionIndex map[10.0.171.204:0xc001919d40]
I1019 00:27:53.559880       1 clustermemberremovalcontroller.go:419] successfully removed member: [ url: 10.0.171.204, name: ip-10-0-171-204.us-west-2.compute.internal, id: 7652840804494303405 ] from the cluster
I1019 00:27:53.560360       1 clustermemberremovalcontroller.go:116] @Mustafa: enter attemptToScaleDown()
I1019 00:27:53.560504       1 clustermemberremovalcontroller.go:121] @Mustafa: currentVotingMemberIPListSet is map[10.0.171.204:{} 10.0.181.165:{} 10.0.228.161:{}]
I1019 00:27:53.561501       1 clustermemberremovalcontroller.go:126] @Mustafa: desiredControlPlaneReplicasCount is 3
I1019 00:27:53.561992       1 event.go:285] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-etcd-operator", Name:"etcd-operator", UID:"ef7fb330-6c7b-4a05-9faa-c8ebc95833c6", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'MemberRemove' removed member with ID: 7652840804494303405
I1019 00:27:53.562055       1 event.go:285] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-etcd-operator", Name:"etcd-operator", UID:"ef7fb330-6c7b-4a05-9faa-c8ebc95833c6", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'ScaleDown' successfully removed member: [ url: 10.0.171.204, name: ip-10-0-171-204.us-west-2.compute.internal, id: 7652840804494303405 ] from the cluster
I1019 00:28:02.834926       1 clustermembercontroller.go:208] Ignoring node (ip-10-0-171-204.us-west-2.compute.internal) for scale-up since its machine (ci-ln-g177chb-76ef8-9grr8-master-5) is pending deletion
I1019 00:27:54.631741       1 machinedeletionhooks.go:160] successfully removed the deletion hook from machine ci-ln-g177chb-76ef8-9grr8-master-5
I1019 00:27:54.649939       1 event.go:285] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-etcd-operator", Name:"etcd-operator", UID:"ef7fb330-6c7b-4a05-9faa-c8ebc95833c6", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'MachineDeletionHook' successfully removed the deletion hook from machine ci-ln-g177chb-76ef8-9grr8-master-5 as it doesn't have a member

after killing master-1 , before deleting master-1

I1019 00:54:16.526417       1 clustermemberremovalcontroller.go:116] @Mustafa: enter attemptToScaleDown()
I1019 00:54:16.526592       1 clustermemberremovalcontroller.go:121] @Mustafa: currentVotingMemberIPListSet is map[10.0.181.165:{} 10.0.200.166:{} 10.0.228.161:{}]
I1019 00:54:16.526932       1 clustermemberremovalcontroller.go:126] @Mustafa: desiredControlPlaneReplicasCount is 3
I1019 00:54:16.538665       1 clustermemberremovalcontroller.go:135] @Mustafa: liveVotingMembers is [ID:11242654706413721825 name:"ip-10-0-181-165.us-west-2.compute.internal" peerURLs:"https://10.0.181.165:2380" clientURLs:"https://10.0.181.165:2379"  ID:13232706032766304391 name:"ip-10-0-228-161.us-west-2.compute.internal" peerURLs:"https://10.0.228.161:2380" clientURLs:"https://10.0.228.161:2379"  ID:15106259840278836149 name:"ip-10-0-200-166.us-west-2.compute.internal" peerURLs:"https://10.0.200.166:2380" clientURLs:"https://10.0.200.166:2379" ]
I1019 00:54:16.538799       1 clustermemberremovalcontroller.go:147] @Mustafa: current master machines with DeletionHook [0xc001c3e240 0xc003d93b00 0xc001f96fc0]
I1019 00:54:16.538812       1 clustermemberremovalcontroller.go:154] @Mustafa: votingMembersMachines [0xc001f96fc0 0xc001c3e240 0xc003d93b00]
E1019 00:54:19.314341       1 etcdmemberscontroller.go:73] Unhealthy etcd member found: ip-10-0-200-166.us-west-2.compute.internal, took=, err=create client failure: failed to make etcd client for endpoints [https://10.0.200.166:2379]: context deadline exceeded
I1019 00:54:20.907165       1 event.go:285] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-etcd-operator", Name:"etcd-operator", UID:"ef7fb330-6c7b-4a05-9faa-c8ebc95833c6", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Warning' reason: 'UnhealthyEtcdMember' unhealthy members: ip-10-0-200-166.us-west-2.compute.internal
E1019 00:54:20.907337       1 base_controller.go:272] ClusterMemberController reconciliation failed: unhealthy members found during reconciling members
I1019 00:27:48.968375       1 clustermemberremovalcontroller.go:147] @Mustafa: current master machines with DeletionHook [0xc001c3e240 0xc002532480 0xc001f96fc0 0xc003b50b40]
I1019 00:27:48.968415       1 clustermemberremovalcontroller.go:154] @Mustafa: votingMembersMachines [0xc001f96fc0 0xc003b50b40 0xc001c3e240]
I1019 00:27:52.013537       1 clustermemberremovalcontroller.go:116] @Mustafa: enter attemptToScaleDown()
I1019 00:27:52.013559       1 clustermemberremovalcontroller.go:121] @Mustafa: currentVotingMemberIPListSet is map[10.0.171.204:{} 10.0.181.165:{} 10.0.228.161:{}]
I1019 00:27:52.013847       1 clustermemberremovalcontroller.go:126] @Mustafa: desiredControlPlaneReplicasCount is 3

after deleting master-1

I1019 01:00:39.497872       1 machinedeletionhooks.go:137] current members [ID:11242654706413721825 name:"ip-10-0-181-165.us-west-2.compute.internal" peerURLs:"https://10.0.181.165:2380" clientURLs:"https://10.0.181.165:2379"  ID:13232706032766304391 name:"ip-10-0-228-161.us-west-2.compute.internal" peerURLs:"https://10.0.228.161:2380" clientURLs:"https://10.0.228.161:2379"  ID:15106259840278836149 name:"ip-10-0-200-166.us-west-2.compute.internal" peerURLs:"https://10.0.200.166:2380" clientURLs:"https://10.0.200.166:2379" ] with IPSet: map[10.0.181.165:{} 10.0.200.166:{} 10.0.228.161:{}]
I1019 01:00:39.497942       1 machinedeletionhooks.go:151] skip removing the deletion hook from machine ci-ln-g177chb-76ef8-9grr8-master-wjcfl-5 since its member is still present with any of: [{InternalIP 10.0.200.166} {InternalDNS ip-10-0-200-166.us-west-2.compute.internal} {Hostname ip-10-0-200-166.us-west-2.compute.internal}]
I1019 01:00:42.250785       1 clustermemberremovalcontroller.go:184] @Mustafa: unhealthy members [ID:15106259840278836149 name:"ip-10-0-200-166.us-west-2.compute.internal" peerURLs:"https://10.0.200.166:2380" clientURLs:"https://10.0.200.166:2379" ]
I1019 01:00:42.250841       1 clustermemberremovalcontroller.go:187] @Mustafa: votingMemberMachinesPendingDeletionIndex map[10.0.200.166:0xc001c218c0]
I1019 01:00:42.251027       1 clustermemberremovalcontroller.go:402] @Mustafa: members are [ID:11242654706413721825 name:"ip-10-0-181-165.us-west-2.compute.internal" peerURLs:"https://10.0.181.165:2380" clientURLs:"https://10.0.181.165:2379"  ID:13232706032766304391 name:"ip-10-0-228-161.us-west-2.compute.internal" peerURLs:"https://10.0.228.161:2380" clientURLs:"https://10.0.228.161:2379"  ID:15106259840278836149 name:"ip-10-0-200-166.us-west-2.compute.internal" peerURLs:"https://10.0.200.166:2380" clientURLs:"https://10.0.200.166:2379" ]
I1019 01:00:42.258269       1 clustermemberremovalcontroller.go:419] successfully removed member: [ url: 10.0.200.166, name: ip-10-0-200-166.us-west-2.compute.internal, id: 15106259840278836149 ] from the cluster
I1019 01:00:42.258454       1 event.go:285] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-etcd-operator", Name:"etcd-operator", UID:"ef7fb330-6c7b-4a05-9faa-c8ebc95833c6", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'MemberRemove' removed member with ID: 15106259840278836149
I1019 01:00:42.540436       1 machinedeletionhooks.go:160] successfully removed the deletion hook from machine ci-ln-g177chb-76ef8-9grr8-master-wjcfl-5

@Elbehery
Contributor Author

/assign @hasbro17
/assign @tjungblu
/assign @JoelSpeed

@Elbehery
Contributor Author

/retest-required


@JoelSpeed
Contributor

I've left some comments on the testing docs, but I think there are some assumptions in these test cases about how the CPMS works that aren't true.

CPMS works on a concept of indexes: each index represents one desired control plane instance, and these indexes are typically 0, 1, and 2 in a cluster you've just created. So, for example, the installer creates a machine in index 0 called master-0.

When the CPMS sees an additional machine in the cluster, say one called master-3, it treats this as a 4th index. Because the CPMS defaults to replicas: 3, this is an error, and the CPMS will degrade until the administrator removes the extraneous index.

If the CPMS were to see a machine called master-foo-0, then, because the name ends in 0, it would assume this is part of index 0. This is how we track replacement machines, btw. So it would say: ok, I now have 2 machines in index 0, one of them must be replacing the other. If you delete one of these, eventually the cluster should settle again and the CPMS will be happy.

We actually intend to delete those extra machines in the index, but that fix is currently blocked because the etcd e2e can't handle that behaviour yet.

I think in the testing here you've created extra indexes; then, when you delete the machines, the CPMS sees 3 machines in 3 indexes and goes back to an available state. That is why no more machines are created; it's happy once they've been deleted.

Note horizontal scaling is not a feature of the CPMS yet; we need to make sure etcd can horizontally scale before we introduce that, so you are stuck with 3 replicas (you can force it to 5 if you want, but this isn't technically supported; no customer is doing this).
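The grouping rule described above (take the trailing number of a machine's name as its index, so `master-0` and `master-foo-0` land in the same index) can be sketched as follows. This is a hypothetical helper for illustration, not the actual CPMS code:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// indexOf extracts the trailing numeric index from a machine name,
// mirroring the grouping rule described above: "master-0" and
// "master-foo-0" both belong to index 0. Returns false when the name
// has no trailing number.
func indexOf(name string) (int, bool) {
	i := strings.LastIndex(name, "-")
	if i < 0 {
		return 0, false
	}
	n, err := strconv.Atoi(name[i+1:])
	if err != nil {
		return 0, false
	}
	return n, true
}

func main() {
	for _, m := range []string{"master-0", "master-foo-0", "master-3"} {
		if idx, ok := indexOf(m); ok {
			fmt.Printf("%s -> index %d\n", m, idx)
		}
	}
}
```

Under this rule, a 4th distinct index (e.g. `master-3` in a replicas: 3 cluster) is flagged as extraneous, while a second machine in an existing index is treated as a replacement.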

@tjungblu
Contributor

(you can force it to 5 if you want but this isn't technically supported, no customer is doing this)

@JoelSpeed there's a customer that runs this, which is why we have the five control plane test in our PRs, so it's certainly supported and used.

Contributor

@tjungblu left a comment

left a couple of comments, lgtm otherwise. Great tests, and I love the logs.

liveVotingMembers, err := c.getAllVotingMembers(ctx)
if err != nil {
// TODO : update status condition
// TODO : should go degraded ?
Contributor

nope, that's a transient issue and will be retried next recon cycle

Contributor Author

hmmm, good point. So out of curiosity, when should we use Status.conditions?

I learnt about them recently in k8s operators tutorials :D

Contributor

when you want to tell the admin to fix something specifically. I'd personally still prefer logs as they are a bit easier to understand over time.

Go Degraded only if we should block the upgrades; but my philosophy is also different from what's used across the codebase.

Contributor Author

thanks a lot, now I got it :)

return fmt.Errorf("could not find master machines with deletion hook: %w", err)
}

var votingMembersMachines []*machinev1beta1.Machine
Contributor

just to make the logic below consistent, should currentVotingMemberIPListSet be sourced from liveVotingMembers too? Otherwise you have one set generated from the configmap and the other from the etcd member list.

Contributor Author

I actually wanted to do this, but did not want to change the logic completely.

I had to use liveVotingMembers for deciding whether to attempt scaling down or skip. During testing, I noticed that the configmap takes time to reflect the cluster membership, and the CEO keeps spinning, waiting for a new CM to be created.

+1 to replace currentVotingMemberIPListSet with liveVotingMembers, wdyt @hasbro17 ?

Contributor

I had the opposite thought, actually. For the default case, when we're checking if scale-down is allowed (i.e. votingMembers >= desiredControlPlaneReplicasCount), we don't want to make live calls to the etcd cluster.

The reason we use currentVotingMemberIPListSet (sourced from the etcd-endpoints configmap) is so we're not inducing unnecessary load in the default case, by using a cache instead of live calls to the cluster.

During testing, I noticed that the configmap takes time to reflect the cluster membership, and the CEO keep spinning waiting for a new CM to be created.

Not sure what CM here is (assuming machine), but other than the delay, @Elbehery is there a reason why you can't use currentVotingMemberIPListSet in L128-135 above? How much is the delay in practice? I would expect the etcd-endpoints controller to update within the minute once there is an actual membership change.

Contributor Author

Thanks @hasbro17 for your review

votingMembers >= desiredControlPlaneReplicasCount we don't want to make live calls to the etcd cluster.

the etcd-endpoints CM contains all the voting members (i.e. non-learners), whether healthy or unhealthy. In #937 (comment) we agreed to constrain the scaling-down invariant around healthyVotingMembers, iiuc

we're not inducing unnecessary load in the default case by using a cache instead of live calls to cluster.

I think listing the cluster members is not much of a load; moreover, the recent changes to the etcdcli package reuse the connections :)

I would expect the etcd-endpoints controller to update within the minute once there is an actual membership change.

Yes this is correct, it takes around a minute to update the CM. I can change it back, but may I ask why listing the healthy members directly from the etcd cluster is not correct?

Contributor

@tjungblu tjungblu Oct 21, 2022

The reason why we use currentVotingMemberIPListSet (sourced from the etcd-endpoints configmap) is so we're not inducing unnecessary load in the default case by using a cache instead of live calls to cluster.

That makes sense, I think we make lots of MemberList calls across the operator already (health probe, etc). The connection to etcd is already open as Mustafa said, so getting the current list of members is not an expensive call.

I'm all up for consolidating this listing into a single controller that queries this in the background every 5-10s and all other controllers just feed from that.

How much is the delay in practice?

I don't have a direct answer for that, but let's go with the case I've added to the quorum guard here:
https://github.com/openshift/cluster-etcd-operator/blob/master/pkg/operator/etcdendpointscontroller/etcdendpointscontroller.go#L145

So when you're down to two nodes, we'll never update the CM until you add a new node. I realize now that this is stupid, but we have no other way to selectively tell the static pod controller not to rollout a revision based on that.
In that worst case, it takes how long it takes the cluster admin to add a new node.

Contributor Author

I'm all up for consolidating this listing into a single controller that queries this in the background every 5-10s and all other controllers just feed from that.

+1

we'll never update the CM until you add a new node.

In this case I think listing the members from etcd client makes more sense, or ?

Contributor

Yeah I forgot that we'll have client connections open from the client cache so I guess having this controller make periodic live list calls might not add a significant load. So I guess we can keep the live call here.

So when you're down to two nodes, we'll never update the CM until you add a new node

Yeah, that's not ideal, especially since the apiserver is using the endpoints configmap and will end up talking to a non-existent endpoint if we have 2 members and the CM is outdated with 3. But it's a necessary trade-off that we've made, given that we don't want quorum loss mid-rollout.
Plus, with CPMS we should soon have a new machine whose member would be added and promoted right after the unhealthy one is removed, so this gap shouldn't be too large.

Let's keep the live list for now, and leave off the consolidation of list calls to a better internal cache than the etcd-endpoints CM as a follow up.

Contributor

if we agree on sticking with the member listing, then I would definitely like to see currentVotingMemberIPListSet come from the member listing. Otherwise this is going to be wildly inconsistent.

return false
}

func minimumTolerableQuorum(desiredControlPlaneReplicasCount int) int {
Contributor

we have something similar already in:

func IsQuorumFaultTolerant(memberHealth []healthCheck) bool {
totalMembers := len(memberHealth)
quorum := totalMembers/2 + 1
healthyMembers := len(GetHealthyMemberNames(memberHealth))
switch {
case totalMembers-quorum < 1:
klog.Errorf("etcd cluster has quorum of %d which is not fault tolerant: %+v", quorum, memberHealth)
return false
case healthyMembers-quorum < 1:
klog.Errorf("etcd cluster has quorum of %d and %d healthy members which is not fault tolerant: %+v", quorum, healthyMembers, memberHealth)
return false
}
return true
}
// IsQuorumFaultTolerantErr is the same as IsQuorumFaultTolerant but with an error return instead of the log
func IsQuorumFaultTolerantErr(memberHealth []healthCheck) error {
totalMembers := len(memberHealth)
quorum := totalMembers/2 + 1
healthyMembers := len(GetHealthyMemberNames(memberHealth))
switch {
case totalMembers-quorum < 1:
return fmt.Errorf("etcd cluster has quorum of %d which is not fault tolerant: %+v", quorum, memberHealth)
case healthyMembers-quorum < 1:
return fmt.Errorf("etcd cluster has quorum of %d and %d healthy members which is not fault tolerant: %+v", quorum, healthyMembers, memberHealth)
}
return nil
}

Can we reuse that instead?

Contributor Author

Yes, I tried, but the conditions used here are too strict, and that was the problem with scaling down.

I use totalMembers < desiredControlPlaneCount as a first check to decide if we can attempt to scale down, or skip.

if len(liveVotingMembers) < desiredControlPlaneReplicasCount {
		klog.V(2).Infof("Ignoring scale-down since the number of etcd members (%d) < desired number of control-plane replicas (%d) ", len(liveVotingMembers), desiredControlPlaneReplicasCount)
		return nil
}

I use len(healthyLiveVotingMembers) < minimumTolerableQuorum(desiredControlPlaneReplicasCount) to decide if we have a chance to scale down without losing Quorum

The conditions used in IsQuorumFaultTolerant() are too strict for scaling down.

Contributor

that's fair, but can we consolidate the formula for computing this into a shared func instead, so not everybody has to re-invent this math and test it again? What parameters you feed the formula is up to the caller.

Contributor Author

I am fine with that, but both methods have different args 🤔

Contributor

@Elbehery You don't have to reuse the IsQuorumFaultTolerant function as it is or change its args but you can consolidate the quorum computation to a new function and reuse that in both IsQuorumFaultTolerant and in minimumTolerableQuorum so we're not maintaining the same thing in different packages.

Contributor Author

@hasbro17 @tjungblu I tried refactoring minimumTolerableQuorum to common, but it resulted in a cyclic dependency. I added it to etcdcli as an exported function.

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 27, 2022
@Elbehery Elbehery force-pushed the automatic_replacement_unhealthy_member_machine branch from a71acc6 to b2c4a83 Compare October 27, 2022 04:15
@Elbehery
Contributor Author

/retest-required


}
}

func MinimumTolerableQuorum(members int) int {
Contributor

can you please add a unit test for that?

Contributor Author

sure 👍🏽

Contributor Author

added @tjungblu 👍🏽

t.Run(scenario.name, func(t *testing.T) {
actual, err := MinimumTolerableQuorum(scenario.input)

if scenario.expErr != nil {
Contributor

you can simplify all your if conditions below with a simple:

require.Equal(t, scenario.expErr, err)
require.Equal(t, scenario.exp, actual)

Contributor Author

thanks, I never used this :D

Contributor Author

is this testify?

Contributor

yes sir!

@Elbehery Elbehery force-pushed the automatic_replacement_unhealthy_member_machine branch 2 times, most recently from bdd1393 to 73258ae Compare October 27, 2022 14:41
-func VotingMemberIPListSet(configMapLister corev1listers.ConfigMapNamespaceLister) (sets.String, error) {
-	etcdEndpointsConfigMap, err := configMapLister.Get("etcd-endpoints")
+func VotingMemberIPListSet(ctx context.Context, cli etcdcli.EtcdClient) (sets.String, error) {
+	members, err := cli.VotingMemberList(ctx)
Contributor Author

@tjungblu is this correct?

Contributor

Contributor Author

good catch 👍🏽

-for _, votingMemberIP := range etcdEndpointsConfigMap.Data {
-	currentVotingMemberIPListSet.Insert(votingMemberIP)
+for _, member := range members {
+	currentVotingMemberIPListSet.Insert(member.PeerURLs[0])
Contributor Author

@tjungblu this is the member's IP, right?

@Elbehery Elbehery force-pushed the automatic_replacement_unhealthy_member_machine branch from 73258ae to 24bbc13 Compare October 27, 2022 14:49
@Elbehery Elbehery force-pushed the automatic_replacement_unhealthy_member_machine branch from 24bbc13 to fc6a345 Compare October 27, 2022 14:57
@Elbehery
Contributor Author

/retest-required

@Elbehery
Contributor Author

/retest-required

@Elbehery
Contributor Author

/retest-required

@tjungblu
Contributor

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Oct 28, 2022
@openshift-ci
Contributor

openshift-ci bot commented Oct 28, 2022

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Elbehery, hasbro17, tjungblu

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details: Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@Elbehery
Contributor Author

/retest-required

@openshift-ci
Contributor

openshift-ci bot commented Oct 28, 2022

@Elbehery: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@openshift-merge-robot openshift-merge-robot merged commit 125d913 into openshift:master Oct 28, 2022
@Elbehery Elbehery deleted the automatic_replacement_unhealthy_member_machine branch October 28, 2022 16:07
@openshift-cherrypick-robot

@Elbehery: cannot checkout 4.12: error checking out 4.12: exit status 1. output: error: pathspec '4.12' did not match any file(s) known to git

Details

In response to this:

/cherry-pick 4.12

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@Elbehery
Contributor Author

/cherry-pick release-4.12

@openshift-cherrypick-robot

@Elbehery: new pull request created: #960

Details

In response to this:

/cherry-pick release-4.12

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.


Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged. tide/merge-method-squash Denotes a PR that should be squashed by tide when it merges.


6 participants