OCPBUGS-73857: Prefer to remove members where they have another healthy machine in the same failure domain index#1528

Merged
openshift-merge-bot[bot] merged 1 commit into openshift:main from JoelSpeed:sort-pending-removals on Jan 27, 2026
Conversation

@JoelSpeed
Contributor

When using the RollingUpdate strategy, the CPMS will always start by trying to replace index 0. If a user decides to call oc delete on all nodes at once, there is currently no guarantee that the etcd operator will try to remove the member from the index that has duplicate instances. This can leave etcd imbalanced across failure domains.

What's worse (and I haven't quite worked out why this happens), when it does, I'm observing that the etcd operator first removes, in this example, index 1, so the cluster goes to 3 members, and then it also removes index 0 shortly after that member is completely gone. The cluster temporarily ends up with just 2 control plane members.

The clusters do recover when I do this, but this PR optimises the way we remove instances by trying first to remove instances where there are duplicate machines in the same failure domain.

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Jan 15, 2026
@coderabbitai

coderabbitai bot commented Jan 15, 2026

Walkthrough

Introduced failure-domain distribution logic to the cluster member removal controller for determining which voting members to remove during scale-down operations. Added helper functions to compute failure-domain indices and counts, integrated sorting by failure-domain distribution, and added corresponding test coverage.

Changes

Cohort / File(s) Summary
Cluster Member Removal Controller Implementation
pkg/operator/clustermemberremovalcontroller/clustermemberremovalcontroller.go
Added strconv import, two new helper functions (machineFailureDomainIndex, machineFailureDomainCountByIndex) to compute failure-domain distribution, and logic to sort pending-deletion voting members by failure-domain count in descending order before scale-down evaluation.
Cluster Member Removal Controller Tests
pkg/operator/clustermemberremovalcontroller/clustermemberremovalcontroller_test.go
Added new test scenario verifying that excessive voting members with multiple pending deletions are removed according to failure-domain distribution, including setup of extra voting member, multiple deletion timestamps, health state assertions, and member removal validation.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~22 minutes


🧹 Recent nitpick comments
pkg/operator/clustermemberremovalcontroller/clustermemberremovalcontroller_test.go (1)

1010-1055: Test correctly validates failure domain ordering; minor inconsistency with other tests.

The test setup correctly validates that member m-1 (in failure domain index 1 with count 2) is removed before m-2 or m-3 (each in domains with count 1).

However, this test is missing the indexerObjs field that other tests in this function include:

```diff
 			fakeEtcdClientOptions: etcdcli.WithFakeClusterHealth(&etcdcli.FakeMemberHealth{Healthy: 4, Unhealthy: 0}),
+			indexerObjs:           []interface{}{bootstrapComplete},
 			validateFn: func(t *testing.T, fakeEtcdClient etcdcli.EtcdClient) {
```

While this doesn't affect the test (since attemptToScaleDown doesn't check bootstrap status), adding it maintains consistency with other test scenarios.

pkg/operator/clustermemberremovalcontroller/clustermemberremovalcontroller.go (1)

500-506: Add comment documenting the CPMS machine naming assumption.

The function extracts only the last character of the machine name to determine the failure domain index. This works correctly for CPMS (Control Plane Machine Set) managed machines where names follow the pattern m-0 through m-9, supporting up to 10 failure domains. However, the coupling to this naming convention should be documented in the code to make the assumption explicit.

Add a comment explaining that this function relies on CPMS machine naming conventions where the final character encodes the failure domain index, and note that it returns -1 if the last character is not a digit.


📜 Recent review details


📥 Commits

Reviewing files that changed from the base of the PR and between 48fbbdb and 50ee4de.

📒 Files selected for processing (2)
  • pkg/operator/clustermemberremovalcontroller/clustermemberremovalcontroller.go
  • pkg/operator/clustermemberremovalcontroller/clustermemberremovalcontroller_test.go
🧰 Additional context used
📓 Path-based instructions (1)
**

⚙️ CodeRabbit configuration file

-Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity.

Files:

  • pkg/operator/clustermemberremovalcontroller/clustermemberremovalcontroller.go
  • pkg/operator/clustermemberremovalcontroller/clustermemberremovalcontroller_test.go
🧬 Code graph analysis (2)
pkg/operator/clustermemberremovalcontroller/clustermemberremovalcontroller.go (1)
pkg/operator/ceohelpers/common.go (1)
  • FilterMachinesPendingDeletion (108-116)
pkg/operator/clustermemberremovalcontroller/clustermemberremovalcontroller_test.go (2)
pkg/etcdcli/helpers.go (2)
  • WithFakeClusterHealth (241-246)
  • FakeMemberHealth (236-239)
pkg/etcdcli/interfaces.go (1)
  • EtcdClient (17-32)
🔇 Additional comments (2)
pkg/operator/clustermemberremovalcontroller/clustermemberremovalcontroller.go (2)

175-190: LGTM - Failure domain sorting logic is correct.

The approach of sorting by descending failure domain count ensures members from over-represented domains are removed first, which helps maintain balance across failure domains during scale-down.


508-515: LGTM!

The helper function correctly builds a count of machines per failure domain index, grouping any machines with unparseable names under index -1.


Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 golangci-lint (2.5.0)

Error: can't load config: unsupported version of the configuration: "". See https://golangci-lint.run/docs/product/migration-guide for migration instructions.


Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-ci-robot openshift-ci-robot added the jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. label Jan 15, 2026
@openshift-ci-robot

@JoelSpeed: This pull request references Jira Issue OCPBUGS-73857, which is invalid:

  • expected the bug to target the "4.22.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

Details

In response to this:

When using the RollingUpdate strategy, the CPMS will always start by trying to replace index 0. If a user decides to call oc delete on all nodes at once, there is currently no guarantee that the etcd operator will try to remove the member from the index with duplicate instances. This can then lead to etcd being imbalanced across failure domains.

What's worse, and I haven't quite worked out why this does this, but when this happens, I'm observing that the etcd operator first removes, for this example, index 1, the cluster goes to 3 members, and then it will also remove index 0 shortly after that's completely gone. The cluster temporarily ends up with just 2 control plane members.

The clusters are recovering when I do this, but this PR should optimise the way we remove the instances by trying first to remove instances where there are duplicate machines in the same failure domain.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@JoelSpeed
Contributor Author

/jira refresh

@openshift-ci-robot openshift-ci-robot added the jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. label Jan 16, 2026
@openshift-ci-robot

@JoelSpeed: This pull request references Jira Issue OCPBUGS-73857, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.22.0) matches configured target version for branch (4.22.0)
  • bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @geliu2016

Details

In response to this:

/jira refresh


@openshift-ci-robot openshift-ci-robot removed the jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. label Jan 16, 2026
@openshift-ci openshift-ci bot requested a review from geliu2016 January 16, 2026 11:18
```go
func machineFailureDomainIndex(machine *machinev1beta1.Machine) int {
	index, err := strconv.Atoi(machine.ObjectMeta.Name[len(machine.ObjectMeta.Name)-1:])
	if err != nil {
		return -1
	}
	return index
}
```
Contributor

no doubt this works, but wouldn't it be better to rely on the "machine.openshift.io/zone" label?

Contributor Author

No, in this case it's better to rely on the numbers in the name.

In theory there could be multiple instances in the same zone, but CPMS always balances across indexes. It has its own internal logic to handle this, so that with two zones it would place two nodes in one zone and the third in the other, and the names of the master machines are always indexed 0, 1, and 2. Those indexes should remain consistent with their zones as machines are replaced.

In the case we saw here, CPMS saw it had an excess index 0 node, but the etcd operator removed the index 1 node. We need the etcd operator to prioritise the index 0 nodes in this case, so prioritising based on the index is better on this occasion IMO.

And yes, I realise this is a bit of a nasty tight coupling between etcd operator and how CPMS works, but I can't think of another reliable way to do this that will guarantee the correct ordering :(


I think 'failure domain' might be an overloaded and potentially unhelpful term here.

I believe the important point to understand is that CPMS gives each of its Machines an index, e.g. 0, 1, 2. It will try to maintain exactly one each of these, so if it's handling the create of a 0 it's also handling the delete of a 0, and vice versa. This will be true regardless of what criteria were chosen for the distribution of -0, -1, and -2, which will differ by cloud and deployment.

To me, what this issue highlights is the importance of etcd-operator and CPMS both having the same idea about what replaces what. I agree with Joel's characterisation of this as 'nasty tight coupling' and I think we should look into replacing it with a more discoverable and explicit communication channel between etcd-operator and CPMS. However, for now I think this is probably the least nasty tight coupling we could come up with.

@tjungblu
Contributor

/lgtm

Thanks :)

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Jan 16, 2026
@openshift-ci
Contributor

openshift-ci bot commented Jan 16, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: JoelSpeed, tjungblu

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 16, 2026
@JoelSpeed
Contributor Author

/retest

/verified by @JoelSpeed

@openshift-ci-robot openshift-ci-robot added the verified Signifies that the PR passed pre-merge verification criteria label Jan 27, 2026
@openshift-ci-robot

@JoelSpeed: This PR has been marked as verified by @JoelSpeed.

Details

In response to this:

/retest

/verified by @JoelSpeed


@openshift-ci-robot

/retest-required

Remaining retests: 0 against base HEAD 790d456 and 2 for PR HEAD 50ee4de in total

@openshift-ci
Contributor

openshift-ci bot commented Jan 27, 2026

@JoelSpeed: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@openshift-merge-bot openshift-merge-bot bot merged commit ce1ee43 into openshift:main Jan 27, 2026
15 checks passed
@openshift-ci-robot

@JoelSpeed: Jira Issue Verification Checks: Jira Issue OCPBUGS-73857
✔️ This pull request was pre-merge verified.
✔️ All associated pull requests have merged.
✔️ All associated, merged pull requests were pre-merge verified.

Jira Issue OCPBUGS-73857 has been moved to the MODIFIED state and will move to the VERIFIED state when the change is available in an accepted nightly payload. 🕓

Details

In response to this:

When using the RollingUpdate strategy, the CPMS will always start by trying to replace index 0. If a user decides to call oc delete on all nodes at once, there is currently no guarantee that the etcd operator will try to remove the member from the index with duplicate instances. This can then lead to etcd being imbalanced across failure domains.

What's worse, and I haven't quite worked out why this does this, but when this happens, I'm observing that the etcd operator first removes, for this example, index 1, the cluster goes to 3 members, and then it will also remove index 0 shortly after that's completely gone. The cluster temporarily ends up with just 2 control plane members.

The clusters are recovering when I do this, but this PR should optimise the way we remove the instances by trying first to remove instances where there are duplicate machines in the same failure domain.


@openshift-merge-robot
Contributor

Fix included in accepted release 4.22.0-0.nightly-2026-01-28-225830

