[release-4.20] OCPBUGS-77313: Wait for revision stability before removing etcd members#1559
Conversation
Previously, the ClusterMemberRemovalController would remove etcd members during revision rollouts, causing cluster degradation when simultaneously deleting multiple control plane machines with the OnDelete strategy. During a revision rollout, etcd members can temporarily appear unhealthy while their pods are reinstalled to the latest revision. This is different from members being indefinitely unhealthy on a stable revision. Additionally, the EtcdEndpointsController pauses during revision rollouts, so when a replacement machine is added and triggers a rollout, the etcd-endpoints configmap won't update. This causes API servers on the old revision to use removed member endpoints, leading to API unavailability. This change adds a revision stability check before allowing member removal, ensuring we only remove members when revisions are stable and unhealthy members are truly unhealthy. This explicitly codifies the 4.17 behavior where the operator waited for all revisions to complete before removing members and lifecycle hooks. Additionally, the ClusterMemberRemovalController now verifies that the live etcd membership matches the configmap before proceeding with member removal, preventing potential issues during rapid member deletion (cherry picked from commit 0168733)
|
@openshift-cherrypick-robot: Detected clone of Jira Issue OCPBUGS-77097 with correct target version. Will retitle the PR to link to the clone. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
Important Review skippedAuto reviews are disabled on base/target branches other than the default branch. Please check the settings in the CodeRabbit UI or the You can disable this status message by setting the Use the checkbox below for a quick retry:
✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
|
@openshift-cherrypick-robot: This pull request references Jira Issue OCPBUGS-77313, which is invalid:
Comment The bug has been updated to refer to the pull request using the external bug tracker. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: openshift-cherrypick-robot The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
|
/retest-required |
|
@openshift-cherrypick-robot: The following tests failed, say
Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
This is an automated cherry-pick of #1555
/assign hasbro17
/cherrypick release-4.19 release-4.18