[release-4.21] OCPBUGS-77097: Wait for revision stability before removing etcd members#1555
Previously, the ClusterMemberRemovalController would remove etcd members during revision rollouts, causing cluster degradation when simultaneously deleting multiple control plane machines with the OnDelete strategy.
During a revision rollout, etcd members can temporarily appear unhealthy while their pods are reinstalled to the latest revision. This is different from members being indefinitely unhealthy on a stable revision.
Additionally, the EtcdEndpointsController pauses during revision rollouts, so when a replacement machine is added and triggers a rollout, the etcd-endpoints configmap won't update. This causes API servers on the old revision to use removed member endpoints, leading to API unavailability.
This change adds a revision stability check before allowing member removal, ensuring we only remove members when revisions are stable and unhealthy members are truly unhealthy. This explicitly codifies the 4.17 behavior where the operator waited for all revisions to complete before removing members and lifecycle hooks.
Additionally, the ClusterMemberRemovalController now verifies that the live etcd membership matches the configmap before proceeding with member removal, preventing potential issues during rapid member deletion.
(cherry picked from commit 0168733)
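For readers who want a concrete picture of the two guards described above, here is a minimal Go sketch. It is not the operator's actual code: the `NodeRevision` type and the `revisionsStable`, `sameEndpoints`, and `canRemoveMembers` helpers are hypothetical names invented for illustration, and the real controller reads this state from the operator status and the etcd-endpoints configmap rather than from plain slices.

```go
package main

import (
	"fmt"
	"sort"
)

// NodeRevision is a hypothetical, simplified view of one control plane node's
// static pod revision status (an assumption, not the operator's real type).
type NodeRevision struct {
	NodeName        string
	CurrentRevision int32
	TargetRevision  int32 // non-zero while a rollout to that revision is in progress
}

// revisionsStable reports whether every node already runs the latest revision
// and no node is still progressing towards a different one.
func revisionsStable(latest int32, nodes []NodeRevision) bool {
	if len(nodes) == 0 {
		return false // nothing observed yet: treat as unstable and wait
	}
	for _, n := range nodes {
		if n.CurrentRevision != latest {
			return false
		}
		if n.TargetRevision != 0 && n.TargetRevision != n.CurrentRevision {
			return false
		}
	}
	return true
}

// sameEndpoints reports whether the live membership and the configmap list
// the same endpoints, ignoring order.
func sameEndpoints(live, configured []string) bool {
	if len(live) != len(configured) {
		return false
	}
	a := append([]string(nil), live...)
	b := append([]string(nil), configured...)
	sort.Strings(a)
	sort.Strings(b)
	for i := range a {
		if a[i] != b[i] {
			return false
		}
	}
	return true
}

// canRemoveMembers combines both guards: removal is allowed only when the
// revisions are stable and the configmap agrees with the live membership.
func canRemoveMembers(latest int32, nodes []NodeRevision, live, configured []string) (bool, string) {
	if !revisionsStable(latest, nodes) {
		return false, "revision rollout in progress"
	}
	if !sameEndpoints(live, configured) {
		return false, "etcd-endpoints configmap does not match live membership"
	}
	return true, ""
}

func main() {
	// One node is still moving from revision 6 to 7, so removal is deferred.
	nodes := []NodeRevision{
		{NodeName: "master-0", CurrentRevision: 7},
		{NodeName: "master-1", CurrentRevision: 6, TargetRevision: 7},
		{NodeName: "master-2", CurrentRevision: 7},
	}
	endpoints := []string{"10.0.0.1", "10.0.0.2", "10.0.0.3"}
	ok, reason := canRemoveMembers(7, nodes, endpoints, endpoints)
	fmt.Println(ok, reason)
}
```

The point of returning a reason alongside the boolean is that a controller following this pattern can skip the removal, log why, and simply retry on the next sync instead of treating a rollout as an error.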
|
@hasbro17: This pull request references Jira Issue OCPBUGS-77097, which is valid. 7 validation(s) were run on this bug
Requesting review from QA contact: The bug has been updated to refer to the pull request using the external bug tracker.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
/test ? |
|
/test e2e-aws-ovn-etcd-scaling
Not sure if this will pull in the updated test suite from 4.22 or the one from 4.21. |
|
/cherrypick release-4.20 release-4.19 release-4.18 |
|
@hasbro17: once the present PR merges, I will cherry-pick it on top of
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
/test e2e-aws-ovn-etcd-scaling |
|
/hold
Until we verify that the updated scaling test from openshift/origin#30802 passes. |
|
/lgtm |
|
/retest-required |
|
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: hasbro17, tjungblu |
|
Yeah I guess it makes sense that the Will wait for that and rerun. |
|
/test e2e-aws-ovn-etcd-scaling |
|
/retest-required |
|
@hasbro17: The following test failed, say
|
Seeing the DeleteAll test pass.
But also seeing a timeout on the other test:
Hopefully this is just a flake and I don't have to go back and tune the other tests further amidst a cherry pick. |
|
/payload-aggregate ? |
|
/payload-aggregate e2e-aws-ovn-etcd-scaling 10 |
|
/test e2e-aws-ovn-etcd-scaling |
|
@hasbro17: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/24f844f0-1339-11f1-8e62-5e17caac9539-0 |
|
/payload-aggregate periodic-ci-openshift-release-main-nightly-4.21-e2e-aws-ovn-etcd-scaling 5 |
|
@hasbro17: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/bed61200-1339-11f1-82ca-5bc4f2f24c51-0 |
Due to limited information, I'm not sure whether this is related to #1528 (which is a key difference between releases 4.21 and 4.22). The origin test always removes the machine at index 0: https://github.com/openshift/origin/blob/512dc6d4da2ed7e997108694b634fa89321e6742/test/extended/etcd/helpers/helpers.go#L226-L241 |
|
Somewhat flaky, but seeing some passing runs. Good enough. The other tests, which are not related to this change, sometimes time out waiting for master nodes to come up.
That should be okay, that's just the initial delete that the CPMSO will respond to.
But since #1528 is not backported that may well be related. In any case, our scaling e2e tests need some TLC to iron out these flakes but the change here seems good for 4.21. |
|
/unhold |
|
@hasbro17: This PR has been marked as verified by |
Merged commit b22c5df into openshift:release-4.21
|
@hasbro17: Jira Issue OCPBUGS-77097: All pull requests linked via external trackers have merged: Jira Issue OCPBUGS-77097 has been moved to the MODIFIED state. |
|
@hasbro17: new pull request created: #1559 |
Manual cherry-pick of #1540
/cherrypick release-4.20 release-4.19 release-4.18
(cherry picked from commit 0168733)