
[release-4.21] OCPBUGS-77097: Wait for revision stability before removing etcd members #1555

Merged
openshift-merge-bot[bot] merged 1 commit into openshift:release-4.21 from hasbro17:release-4.21
Mar 2, 2026

Conversation

@hasbro17
Contributor

Manual cherry-pick of #1540

/cherrypick release-4.20 release-4.19 release-4.18

(cherry picked from commit 0168733)

Previously, the ClusterMemberRemovalController would remove etcd members
during revision rollouts, causing cluster degradation when simultaneously
deleting multiple control plane machines with the OnDelete strategy.

During a revision rollout, etcd members can temporarily appear unhealthy
while their pods are reinstalled to the latest revision. This is different
from members being indefinitely unhealthy on a stable revision.

Additionally, the EtcdEndpointsController pauses during revision rollouts,
so when a replacement machine is added and triggers a rollout, the
etcd-endpoints configmap won't update. This causes API servers on the old
revision to use removed member endpoints, leading to API unavailability.

This change adds a revision stability check before allowing member removal,
ensuring we only remove members when revisions are stable and unhealthy
members are truly unhealthy. This explicitly codifies the 4.17 behavior
where the operator waited for all revisions to complete before removing
members and lifecycle hooks.
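
The revision stability gate described above can be sketched roughly as follows. This is a minimal illustration and not the operator's actual code: the `nodeStatus` type and the `revisionsStable` function are hypothetical names, loosely modeled on the per-node current and target revisions that a static pod operator reports. It assumes a target revision of 0 means no rollout is in flight for that node.

```go
package main

import "fmt"

// nodeStatus is a hypothetical, simplified view of one control plane node's
// static pod rollout state (assumed fields, not the operator's real type).
type nodeStatus struct {
	NodeName        string
	CurrentRevision int32 // revision the node's etcd pod is running
	TargetRevision  int32 // revision being rolled out; 0 is assumed to mean none in flight
}

// revisionsStable reports whether every node has settled on the latest
// revision, i.e. no rollout is in progress. With no nodes reported it
// returns false, so that member removal is deferred rather than attempted.
func revisionsStable(latest int32, nodes []nodeStatus) bool {
	if len(nodes) == 0 {
		return false
	}
	for _, n := range nodes {
		if n.TargetRevision != 0 || n.CurrentRevision != latest {
			return false
		}
	}
	return true
}

func main() {
	rolling := []nodeStatus{
		{NodeName: "master-0", CurrentRevision: 7},
		{NodeName: "master-1", CurrentRevision: 6, TargetRevision: 7},
	}
	settled := []nodeStatus{
		{NodeName: "master-0", CurrentRevision: 7},
		{NodeName: "master-1", CurrentRevision: 7},
	}
	fmt.Println("stable during rollout:", revisionsStable(7, rolling))
	fmt.Println("stable when settled:", revisionsStable(7, settled))
}
```

In a sketch like this, a removal controller would skip its sync and retry later whenever `revisionsStable` returns false, so that a member which only looks unhealthy because its pod is being reinstalled is not removed.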

Additionally, the ClusterMemberRemovalController now verifies that the live
etcd membership matches the configmap before proceeding with member removal,
preventing potential issues during rapid member deletion.
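
As a rough sketch of the membership check described above, the comparison can be thought of as a set-equality test between the live member list and what the configmap records. The `membershipMatches` function and its inputs are hypothetical: both slices are assumed to be identifiers already extracted from the etcd member list and from the configmap, and the real controller may compare different fields.

```go
package main

import (
	"fmt"
	"sort"
)

// membershipMatches reports whether the identifiers seen in the live etcd
// member list equal those recorded in the configmap. Both inputs are
// hypothetical, pre-extracted lists; ordering is ignored.
func membershipMatches(liveMembers, configMapMembers []string) bool {
	if len(liveMembers) != len(configMapMembers) {
		return false
	}
	// Copy before sorting so the caller's slices are left untouched.
	live := append([]string(nil), liveMembers...)
	recorded := append([]string(nil), configMapMembers...)
	sort.Strings(live)
	sort.Strings(recorded)
	for i := range live {
		if live[i] != recorded[i] {
			return false
		}
	}
	return true
}

func main() {
	live := []string{"master-1", "master-2", "master-3"}
	inSync := []string{"master-3", "master-1", "master-2"}
	stale := []string{"master-0", "master-1", "master-2"}
	fmt.Println("in sync:", membershipMatches(live, inSync))
	fmt.Println("stale configmap:", membershipMatches(live, stale))
}
```

When the two views disagree, a controller following this approach would hold off on removing further members until the configmap has caught up with the live membership.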

(cherry picked from commit 0168733)
@openshift-ci-robot openshift-ci-robot added jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. labels Feb 24, 2026
@openshift-ci-robot

@hasbro17: This pull request references Jira Issue OCPBUGS-77097, which is valid.

7 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.21.z) matches configured target version for branch (4.21.z)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)
  • release note type set to "Release Note Not Required"
  • dependent bug Jira Issue OCPBUGS-74151 is in the state Verified, which is one of the valid states (VERIFIED, RELEASE PENDING, CLOSED (ERRATA), CLOSED (CURRENT RELEASE), CLOSED (DONE), CLOSED (DONE-ERRATA))
  • dependent Jira Issue OCPBUGS-74151 targets the "4.22.0" version, which is one of the valid target versions: 4.22.0
  • bug has dependents

Requesting review from QA contact:
/cc @geliu2016

The bug has been updated to refer to the pull request using the external bug tracker.

Details

In response to this:

Manual cherry-pick of #1540

/cherrypick release-4.20 release-4.19 release-4.18

(cherry picked from commit 0168733)

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@coderabbitai

coderabbitai bot commented Feb 24, 2026

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 24, 2026
@hasbro17
Contributor Author

/test ?

@hasbro17
Contributor Author

/test e2e-aws-ovn-etcd-scaling

Not sure if this will pull in the updated test suite from 4.22 or the one from 4.21.

@hasbro17
Contributor Author

/cherrypick release-4.20 release-4.19 release-4.18

@openshift-cherrypick-robot

@hasbro17: once the present PR merges, I will cherry-pick it on top of release-4.20 in a new PR and assign it to you.

Details

In response to this:

/cherrypick release-4.20 release-4.19 release-4.18

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@hasbro17
Contributor Author

/test e2e-aws-ovn-etcd-scaling

@hasbro17 hasbro17 changed the title OCPBUGS-77097: Wait for revision stability before removing etcd members [release-4.21] OCPBUGS-77097: Wait for revision stability before removing etcd members Feb 24, 2026
@hasbro17
Contributor Author

/hold

Until we verify the updated scaling test from openshift/origin#30802 passes

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Feb 24, 2026
@tjungblu
Contributor

/lgtm
/label backport-risk-assessed

@openshift-ci openshift-ci bot added the backport-risk-assessed Indicates a PR to a release branch has been evaluated and considered safe to accept. label Feb 25, 2026
@tjungblu
Contributor

/retest-required

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Feb 25, 2026
@openshift-ci
Contributor

openshift-ci bot commented Feb 25, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: hasbro17, tjungblu

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@hasbro17
Contributor Author

Yeah, I guess it makes sense that the "etcd is able to delete all masters with OnDelete strategy and wait for CPMSO to replace them" test is not present on 4.21 until openshift/origin#30802 merges.
https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_cluster-etcd-operator/1555/pull-ci-openshift-cluster-etcd-operator-release-4.21-e2e-aws-ovn-etcd-scaling/2026392677280387072

Will wait for that and rerun.

@hasbro17
Contributor Author

/test e2e-aws-ovn-etcd-scaling

@hasbro17
Contributor Author

/retest-required

@openshift-ci
Contributor

openshift-ci bot commented Feb 25, 2026

@hasbro17: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name | Commit | Details | Required | Rerun command
-- | -- | -- | -- | --
ci/prow/e2e-aws-ovn-etcd-scaling | b421578 | link | false | /test e2e-aws-ovn-etcd-scaling

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@hasbro17
Contributor Author

Seeing the DeleteAll test pass

passed: (1h4m1s) 2026-02-25T21:15:32 "[sig-etcd][Feature:EtcdVerticalScaling][Suite:openshift/etcd/scaling][Serial] etcd is able to delete all masters with OnDelete strategy and wait for CPMSO to replace them [Timeout:120m][apigroup:machine.openshift.io]"

But also seeing a timeout on the other test:


[sig-etcd][Feature:EtcdVerticalScaling][Suite:openshift/etcd/scaling][Serial] etcd is able to vertically scale up and down with a single node [Timeout:60m][apigroup:machine.openshift.io] | 31m19s
-- | --
{  fail [github.com/openshift/origin/test/extended/etcd/vertical_scaling.go:109]: Unexpected error:     <*errors.withStack \| 0xc002590d98>:      scale-down: timed out waiting for member (ip-10-0-33-53.us-west-2.compute.internal) to be removed: timed out waiting for the condition     {         error: <*errors.withMessage \| 0xc00793f9e0>{             cause: <wait.errInterrupted>{                 cause: <*errors.errorString \| 0xc00086b020>{                     s: "timed out waiting for the condition",                 },             },             msg: "scale-down: timed out waiting for member (ip-10-0-33-53.us-west-2.compute.internal) to be removed",         },         stack: [0x6db061c, 0x4940e33, 0x4955d1b, 0x2d9fe61],     } occurred}

Hopefully this is just a flake and I don't have to go back and tune the other tests further amidst a cherry pick.

@hasbro17
Contributor Author

/payload-aggregate ?

@openshift-ci
Contributor

openshift-ci bot commented Feb 26, 2026

@hasbro17: it appears that you have attempted to use some version of the payload command, but your comment was incorrectly formatted and cannot be acted upon. See the docs for usage info.

@hasbro17
Contributor Author

/payload-aggregate e2e-aws-ovn-etcd-scaling 10

@hasbro17
Contributor Author

/test e2e-aws-ovn-etcd-scaling

@openshift-ci
Contributor

openshift-ci bot commented Feb 26, 2026

@hasbro17: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • e2e-aws-ovn-etcd-scaling

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/24f844f0-1339-11f1-8e62-5e17caac9539-0

@hasbro17
Contributor Author

/payload-aggregate periodic-ci-openshift-release-main-nightly-4.21-e2e-aws-ovn-etcd-scaling 5

@openshift-ci
Contributor

openshift-ci bot commented Feb 26, 2026

@hasbro17: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-main-nightly-4.21-e2e-aws-ovn-etcd-scaling

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/bed61200-1339-11f1-82ca-5bc4f2f24c51-0

@lance5890
Contributor

lance5890 commented Feb 27, 2026

Seeing the DeleteAll test pass

passed: (1h4m1s) 2026-02-25T21:15:32 "[sig-etcd][Feature:EtcdVerticalScaling][Suite:openshift/etcd/scaling][Serial] etcd is able to delete all masters with OnDelete strategy and wait for CPMSO to replace them [Timeout:120m][apigroup:machine.openshift.io]"

But also seeing a timeout on the other test:


[sig-etcd][Feature:EtcdVerticalScaling][Suite:openshift/etcd/scaling][Serial] etcd is able to vertically scale up and down with a single node [Timeout:60m][apigroup:machine.openshift.io] | 31m19s
-- | --
{  fail [github.com/openshift/origin/test/extended/etcd/vertical_scaling.go:109]: Unexpected error:     <*errors.withStack \| 0xc002590d98>:      scale-down: timed out waiting for member (ip-10-0-33-53.us-west-2.compute.internal) to be removed: timed out waiting for the condition     {         error: <*errors.withMessage \| 0xc00793f9e0>{             cause: <wait.errInterrupted>{                 cause: <*errors.errorString \| 0xc00086b020>{                     s: "timed out waiting for the condition",                 },             },             msg: "scale-down: timed out waiting for member (ip-10-0-33-53.us-west-2.compute.internal) to be removed",         },         stack: [0x6db061c, 0x4940e33, 0x4955d1b, 0x2d9fe61],     } occurred}

Hopefully this is just a flake and I don't have to go back and tune the other tests further amidst a cherry pick.

Due to limited information, I'm not sure if this is related to #1528 (which is a key difference between releases 4.21 and 4.22).

The origin test always removes machine number 0: https://github.com/openshift/origin/blob/512dc6d4da2ed7e997108694b634fa89321e6742/test/extended/etcd/helpers/helpers.go#L226-L241

@hasbro17
Contributor Author

hasbro17 commented Mar 2, 2026

Somewhat flaky, but seeing some passing runs. Good enough.
https://prow.ci.openshift.org/view/gs/test-platform-results/logs/openshift-cluster-etcd-operator-1555-nightly-4.21-e2e-aws-ovn-etcd-scaling/2027075451607846912

[sig-etcd][Feature:EtcdVerticalScaling][Suite:openshift/etcd/scaling][Serial] etcd is able to delete all masters with OnDelete strategy and wait for CPMSO to replace them [Timeout:120m][apigroup:machine.openshift.io] | 1h4m42s
-- | --
[sig-etcd][Feature:EtcdVerticalScaling][Suite:openshift/etcd/scaling][Serial] etcd is able to vertically scale up and down when CPMS is disabled [apigroup:machine.openshift.io] | 35m48s
[sig-etcd][Feature:EtcdVerticalScaling][Suite:openshift/etcd/scaling][Serial] etcd is able to vertically scale up and down with a single node [Timeout:60m][apigroup:machine.openshift.io] | 40m42s

The other tests, which are not related to this change, sometimes time out waiting for master nodes to come up.

The origin test always removes machine number 0:

That should be okay, that's just the initial delete that the CPMSO will respond to.

Due to limited information, I'm not sure if this is related to #1528 (which is a key difference between releases 4.21 and 4.22).

But since #1528 is not backported, that may well be related.

In any case, our scaling e2e tests need some TLC to iron out these flakes but the change here seems good for 4.21.

@hasbro17
Contributor Author

hasbro17 commented Mar 2, 2026

/unhold
/verified by me

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 2, 2026
@openshift-ci-robot openshift-ci-robot added the verified Signifies that the PR passed pre-merge verification criteria label Mar 2, 2026
@openshift-ci-robot

@hasbro17: This PR has been marked as verified by me.

Details

In response to this:

/unhold
/verified by me

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-merge-bot openshift-merge-bot bot merged commit b22c5df into openshift:release-4.21 Mar 2, 2026
15 of 16 checks passed
@openshift-ci-robot

@hasbro17: Jira Issue OCPBUGS-77097: All pull requests linked via external trackers have merged:

Jira Issue OCPBUGS-77097 has been moved to the MODIFIED state.

Details

In response to this:

Manual cherry-pick of #1540

/cherrypick release-4.20 release-4.19 release-4.18

(cherry picked from commit 0168733)

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-cherrypick-robot

@hasbro17: new pull request created: #1559

Details

In response to this:

/cherrypick release-4.20 release-4.19 release-4.18

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. backport-risk-assessed Indicates a PR to a release branch has been evaluated and considered safe to accept. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. lgtm Indicates that a PR is ready to be merged. verified Signifies that the PR passed pre-merge verification criteria
