OCPBUGS-73857: Prefer to remove members where they have another healthy machine in the same failure domain index #1528
Conversation
Walkthrough: Introduced failure-domain distribution logic to the cluster member removal controller for determining which voting members to remove during scale-down operations. Added helper functions to compute failure-domain indices and counts, integrated sorting by failure-domain distribution, and added corresponding test coverage.
@JoelSpeed: This pull request references Jira Issue OCPBUGS-73857, which is invalid:
The bug has been updated to refer to the pull request using the external bug tracker.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
/jira refresh
@JoelSpeed: This pull request references Jira Issue OCPBUGS-73857, which is valid. The bug has been moved to the POST state. 3 validation(s) were run on this bug.

Requesting review from QA contact:
```go
func machineFailureDomainIndex(machine *machinev1beta1.Machine) int {
	index, err := strconv.Atoi(machine.ObjectMeta.Name[len(machine.ObjectMeta.Name)-1:])
```
no doubt this works, but wouldn't it be better to rely on the "machine.openshift.io/zone" label?
No, in this case it's better to rely on the numbers in the name.
In theory there could be multiple instances in the same zone, but CPMS always balances across indexes. It has its own logic internally to handle this, so that if you had two zones it would balance 2 nodes in one zone and the third in the other, and the names of the master machines are always indexed 0, 1 and 2. Those indexes should remain consistent with their zones as machines are replaced.
Now in the case we saw here, CPMS saw it had an excess index 0 node, and etcd operator removed the index 1 node. We need etcd operator to prioritise the index 0 nodes in this case, so prioritising based on the index is better on this occasion IMO
And yes, I realise this is a bit of a nasty tight coupling between etcd operator and how CPMS works, but I can't think of another reliable way to do this that will guarantee the correct ordering :(
I think 'failure domain' might be an overloaded and potentially unhelpful term here.
I believe the important point to understand is that CPMS gives each of its Machines an index, e.g. 0, 1, 2. It will try to maintain exactly one each of these, so if it's handling the create of a 0 it's also handling the delete of a 0, and vice versa. This will be true regardless of what criteria were chosen for the distribution of -0, -1, and -2, which will differ by cloud and deployment.
To me, what this issue highlights is the importance of etcd-operator and CPMS both having the same idea about what replaces what. I agree with Joel's characterisation of this as 'nasty tight coupling' and I think we should look into replacing it with a more discoverable and explicit communication channel between etcd-operator and CPMS. However, for now I think this is probably the least nasty tight coupling we could come up with.
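The mechanism discussed in this thread can be sketched as follows. This is an illustrative standalone sketch, not the controller's actual code: the helper names (`failureDomainIndex`, `sortByFailureDomainCount`) and the machine names in `main` are hypothetical, and the real implementation works on `machinev1beta1.Machine` objects rather than plain strings.

```go
package main

import (
	"fmt"
	"sort"
	"strconv"
	"strings"
)

// failureDomainIndex parses the trailing digit of a machine name,
// mirroring the convention described above (CPMS names control-plane
// machines with a trailing 0, 1, or 2). Returns -1 if the name does
// not end in a digit.
func failureDomainIndex(name string) int {
	idx, err := strconv.Atoi(name[len(name)-1:])
	if err != nil {
		return -1
	}
	return idx
}

// sortByFailureDomainCount orders removal candidates so that machines
// in indices holding more than one machine come first — i.e. the
// duplicates that scale-down should prefer to remove.
func sortByFailureDomainCount(names []string) {
	counts := map[int]int{}
	for _, n := range names {
		counts[failureDomainIndex(n)]++
	}
	sort.SliceStable(names, func(i, j int) bool {
		return counts[failureDomainIndex(names[i])] > counts[failureDomainIndex(names[j])]
	})
}

func main() {
	machines := []string{
		"cluster-master-1",
		"cluster-master-0",
		"cluster-replacement-master-0", // duplicate in index 0
		"cluster-master-2",
	}
	sortByFailureDomainCount(machines)
	fmt.Println(strings.Join(machines, "\n"))
	// the two index-0 machines sort to the front
}
```

A stable sort is used so that machines within the same index keep their original relative order, which keeps the removal choice deterministic.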
|
/lgtm Thanks :)
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: JoelSpeed, tjungblu. The full list of commands accepted by this bot can be found here. The pull request process is described here.
/retest
/verified by @JoelSpeed
@JoelSpeed: This PR has been marked as verified by @JoelSpeed.
@JoelSpeed: all tests passed! Full PR test history. Your PR dashboard.
@JoelSpeed: Jira Issue Verification Checks: Jira Issue OCPBUGS-73857 has been moved to the MODIFIED state and will move to the VERIFIED state when the change is available in an accepted nightly payload.
Fix included in accepted release 4.22.0-0.nightly-2026-01-28-225830 |
When using the RollingUpdate strategy, the CPMS will always start by trying to replace index 0. If a user calls `oc delete` on all control plane nodes at once, there is currently no guarantee that the etcd operator will try to remove the member from the index with duplicate instances. This can leave etcd imbalanced across failure domains.

What's worse (and I haven't quite worked out why), when this happens I'm observing that the etcd operator first removes, in this example, index 1, the cluster goes to 3 members, and then it also removes index 0 shortly after the index 1 member is completely gone. The cluster temporarily ends up with just 2 control plane members.

The clusters do recover when I do this, but this PR optimises the way we remove instances by first trying to remove instances that have duplicate machines in the same failure domain.
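The preference described above — remove from an index that holds duplicate machines — can be illustrated with a minimal sketch. `pickMemberToRemove` is a hypothetical helper written for this illustration, not part of the PR; the actual controller sorts its removal candidates rather than selecting a single one.

```go
package main

import (
	"fmt"
	"strconv"
)

// pickMemberToRemove returns the first machine whose trailing name index
// is shared with another machine — the duplicate that scale-down should
// prefer to remove. Returns "" when every index is unique.
func pickMemberToRemove(names []string) string {
	counts := map[string]int{}
	for _, n := range names {
		counts[n[len(n)-1:]]++
	}
	for _, n := range names {
		idx := n[len(n)-1:]
		// only trailing digits count as a failure-domain index
		if _, err := strconv.Atoi(idx); err == nil && counts[idx] > 1 {
			return n
		}
	}
	return ""
}

func main() {
	// index 0 is duplicated, so one of its machines is picked
	fmt.Println(pickMemberToRemove([]string{
		"cp-master-0", "cp-master-1", "cp-master-2", "cp-new-master-0",
	}))
}
```

With a balanced set of indexes (one machine each of 0, 1 and 2) the helper returns nothing, matching the invariant CPMS tries to maintain.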