OCPBUGS-73857: Prefer to remove members where they have another healthy machine in the same failure domain index #1528
Conversation
Walkthrough: Introduced failure-domain distribution logic to the cluster member removal controller for determining which voting members to remove during scale-down operations. Added helper functions to compute failure-domain indices and counts, integrated sorting by failure-domain distribution, and added corresponding test coverage.
@JoelSpeed: This pull request references Jira Issue OCPBUGS-73857, which is invalid:
The bug has been updated to refer to the pull request using the external bug tracker.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
/jira refresh
@JoelSpeed: This pull request references Jira Issue OCPBUGS-73857, which is valid. The bug has been moved to the POST state. 3 validation(s) were run on this bug.

Requesting review from QA contact:
```go
func machineFailureDomainIndex(machine *machinev1beta1.Machine) int {
	index, err := strconv.Atoi(machine.ObjectMeta.Name[len(machine.ObjectMeta.Name)-1:])
```
no doubt this works, but wouldn't it be better to rely on the "machine.openshift.io/zone" label?
No, in this case it's better to rely on the numbers in the name.
In theory there could be multiple instances in the same zone, but CPMS always balances across indexes. It has its own logic internally to handle this, so that if you had two zones it would balance 2 nodes in one zone and the third in the other, and the names of the master machines are always indexed 0, 1 and 2. Those indexes should remain consistent with their zones as machines are replaced.
Now in the case we saw here, CPMS saw it had an excess index 0 node, and etcd operator removed the index 1 node. We need etcd operator to prioritise the index 0 nodes in this case, so prioritising based on the index is better on this occasion IMO
And yes, I realise this is a bit of a nasty tight coupling between etcd operator and how CPMS works, but I can't think of another reliable way to do this that will guarantee the correct ordering :(
I think 'failure domain' might be an overloaded and potentially unhelpful term here.
I believe the important point to understand is that CPMS gives each of its Machines an index, e.g. 0, 1, 2. It will try to maintain exactly one each of these, so if it's handling the create of a 0 it's also handling the delete of a 0, and vice versa. This will be true regardless of what criteria were chosen for the distribution of -0, -1, and -2, which will differ by cloud and deployment.
To me, what this issue highlights is the importance of etcd-operator and CPMS both having the same idea about what replaces what. I agree with Joel's characterisation of this as 'nasty tight coupling' and I think we should look into replacing it with a more discoverable and explicit communication channel between etcd-operator and CPMS. However, for now I think this is probably the least nasty tight coupling we could come up with.
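The mechanism discussed in this thread can be sketched as follows. This is an illustrative standalone sketch, not the controller's actual code: the helper names (`failureDomainIndex`, `sortByFailureDomainCount`) and the machine names in `main` are hypothetical, and the real implementation works on `machinev1beta1.Machine` objects rather than plain strings.

```go
package main

import (
	"fmt"
	"sort"
	"strconv"
	"strings"
)

// failureDomainIndex parses the trailing digit of a machine name,
// mirroring the convention described above (CPMS names control-plane
// machines with a trailing 0, 1, or 2). Returns -1 if the name does
// not end in a digit.
func failureDomainIndex(name string) int {
	idx, err := strconv.Atoi(name[len(name)-1:])
	if err != nil {
		return -1
	}
	return idx
}

// sortByFailureDomainCount orders removal candidates so that machines
// in indices holding more than one machine come first — i.e. the
// duplicates that scale-down should prefer to remove.
func sortByFailureDomainCount(names []string) {
	counts := map[int]int{}
	for _, n := range names {
		counts[failureDomainIndex(n)]++
	}
	sort.SliceStable(names, func(i, j int) bool {
		return counts[failureDomainIndex(names[i])] > counts[failureDomainIndex(names[j])]
	})
}

func main() {
	machines := []string{
		"cluster-master-1",
		"cluster-master-0",
		"cluster-replacement-master-0", // duplicate in index 0
		"cluster-master-2",
	}
	sortByFailureDomainCount(machines)
	fmt.Println(strings.Join(machines, "\n"))
	// the two index-0 machines sort to the front
}
```

A stable sort is used so that machines within the same index keep their original relative order, which keeps the removal choice deterministic.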
|
/lgtm Thanks :)
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: JoelSpeed, tjungblu. The full list of commands accepted by this bot can be found here. The pull request process is described here.
/retest
/verified by @JoelSpeed
@JoelSpeed: This PR has been marked as verified by @JoelSpeed.
@JoelSpeed: all tests passed! Full PR test history. Your PR dashboard.
@JoelSpeed: Jira Issue Verification Checks: Jira Issue OCPBUGS-73857 has been moved to the MODIFIED state and will move to the VERIFIED state when the change is available in an accepted nightly payload.
Fix included in accepted release 4.22.0-0.nightly-2026-01-28-225830 |
When using the RollingUpdate strategy, the CPMS will always start by trying to replace index 0. If a user calls `oc delete` on all control plane nodes at once, there is currently no guarantee that the etcd operator will try to remove the member from the index with duplicate instances. This can leave etcd imbalanced across failure domains.

What's worse (and I haven't quite worked out why), when this happens I'm observing that the etcd operator first removes, in this example, index 1, the cluster goes to 3 members, and then it also removes index 0 shortly after the index 1 member is completely gone. The cluster temporarily ends up with just 2 control plane members.

The clusters do recover when I do this, but this PR optimises the way we remove instances by first trying to remove instances that have duplicate machines in the same failure domain.
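The preference described above — remove from an index that holds duplicate machines — can be illustrated with a minimal sketch. `pickMemberToRemove` is a hypothetical helper written for this illustration, not part of the PR; the actual controller sorts its removal candidates rather than selecting a single one.

```go
package main

import (
	"fmt"
	"strconv"
)

// pickMemberToRemove returns the first machine whose trailing name index
// is shared with another machine — the duplicate that scale-down should
// prefer to remove. Returns "" when every index is unique.
func pickMemberToRemove(names []string) string {
	counts := map[string]int{}
	for _, n := range names {
		counts[n[len(n)-1:]]++
	}
	for _, n := range names {
		idx := n[len(n)-1:]
		// only trailing digits count as a failure-domain index
		if _, err := strconv.Atoi(idx); err == nil && counts[idx] > 1 {
			return n
		}
	}
	return ""
}

func main() {
	// index 0 is duplicated, so one of its machines is picked
	fmt.Println(pickMemberToRemove([]string{
		"cp-master-0", "cp-master-1", "cp-master-2", "cp-new-master-0",
	}))
}
```

With a balanced set of indexes (one machine each of 0, 1 and 2) the helper returns nothing, matching the invariant CPMS tries to maintain.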