OCPBUGS-74151: Wait for revision stability before removing etcd members#1540
Conversation
|
/hold Need to add a test in the scaling suite to verify that we can delete all 3 machines and get replacements without the cluster hanging up |
|
@hasbro17: This pull request references Jira Issue OCPBUGS-74151, which is valid. The bug has been moved to the POST state. 3 validation(s) were run on this bug
Requesting review from QA contact: The bug has been updated to refer to the pull request using the external bug tracker. Details. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
Walkthrough: Adds a revision-stability gate and a pre-check that compares live etcd voting-member IPs to the …
|
/testwith openshift/cluster-control-plane-machine-set-operator/main/e2e-aws-periodic-pre openshift/cluster-control-plane-machine-set-operator#383 |
Review thread on this diff hunk:

    return nil
    }

    // TODO(haseeb): We should also explicitly put the brakes on the member removal so we don't
that's a good find. IIRC I've removed something along those lines to enable this "my machine is unhealthy, please delete it automatically" admin scenario.
With the static pod controller, it so happens that sometimes the revisions get stuck, so it would never be able to really remove the member and an admin has to go and manually delete it (or fix the node status somehow). Maybe you can find some time to add a test for this, if we don't already have one... even though it must be quite annoying to mock this in CI
I've removed something along those lines to enable this "my machine is unhealthy, please delete it automatically" admin scenario.
Sorry Thomas, it's been a while so my memory is hazy, but do you mean this?
#947
i.e. that's where we first allowed scaling down/member removal if an unhealthy machine is pending deletion.
sometimes the revisions get stuck
If a revision rollout is stuck, and the result is an unhealthy etcd member (e.g. the pod never comes up), and the intent is to remove the unhealthy machine to fix that, then yeah, this change would prevent the controller from scaling down and would require manual intervention from the admin (to manually delete the member or the deletion hook).
But that seems like a tradeoff we'll have to make given the issue in this bug.
add a test for this
Sorry do you mean a unit test to verify that we cannot scale down during a stuck revision rollout?
We have tests for the unhealthy member scale down but nothing around revision rollouts.
Although it seems like you've mocked tests with revision rollouts for the cert signer controller, so I could probably reuse that.
https://github.com/openshift/cluster-etcd-operator/blob/main/pkg/operator/etcdcertsigner/etcdcertsignercontroller_test.go#L74-L79
it seems this was #947 - even though 4.12 sounds like a long time ago.
But that seems like a tradeoff we'll have to make given the issue in this bug.
agreed
Sorry do you mean a unit test to verify that we cannot scale down during a stuck revision rollout?
I was thinking of another e2e test that makes a node unready and then ensures it can be replaced.
best suited in https://github.com/openshift/origin/blob/main/test/extended/etcd/vertical_scaling.go
Not sure this is covered by CPMS somehow, though...
Ah yes, the e2e test for making sure we can scale down an unhealthy member.
@jubittajohn already did a lot of work on that front in openshift/origin#29236
but I'll have to revisit why that was held up. I want to say it was stopping the kubelet to make the node unhealthy that made that whole test pretty disruptive and failed lots of other tests.
But yeah, it would be good to revive that so we don't regress on that in the future. I'll check with Jubitta to see what was up and if I can bring that back.
@JoelSpeed On a related note, do you know if there are any e2e tests on the CPMSO side that exercise scaling operations for an unhealthy cluster? Doesn't look like we do from a quick glance in https://github.com/openshift/cluster-control-plane-machine-set-operator/tree/main/test/e2e
To my knowledge no, we don't have any tests for unhealthy members. CPMS doesn't do any health monitoring so I guess it didn't make sense to us to test unhealthy members.
If you have a suggestion for how we can make a member unhealthy we could add that to our suite though
weren't there machine health checks or something along those lines?
If you have a suggestion for how we can make a member unhealthy we could add that to our suite though
the most reliable way is to just sudo systemctl stop kubelet.service
Yeah so machine health checks are a thing, but they're a separate thing to CPMS. They basically delete the machine when they detect something is bad.
So we could create a test that gets onto a node, runs sudo systemctl stop kubelet.service, then an MHC would delete the machine after some period of the machine being unhealthy (or we could do that ourselves?) and then the test would continue with CPMS replacing it
@huali9 Do you know of any examples in our existing suites where we exec into nodes we could crib off for this?
So we could create a test that gets onto a node, runs sudo systemctl stop kubelet.service, then an MHC would delete the machine after some period of the machine being unhealthy (or we could do that ourselves?) and then the test would continue with CPMS replacing it
@JoelSpeed We have this scenario implemented and validated in an E2E test here, as part of an unmerged PR that @hasbro17 was referring to. While the scenario itself was successfully tested, we had to put it on hold for a few reasons.
- The main reason is the invariant failures introduced by the new vertical scaling unhappy-path E2E tests (due to us intentionally stopping the kubelet). To address this, we would need to introduce a new disruptive test suite. TRT has given us the green light to proceed in this direction, which would allow us to isolate the monitor tests we expect to fail and explicitly skip them for these scenarios.
- Another minor reason was TRT's effort to add vertical scaling tests to component readiness, which is currently blocked by OCPBUGS-43379. This is because we still need to investigate and fix the invariant failures in the existing scaling tests that are significantly affecting pass rates. (And I guess the scaling tests don't even run anymore, since the optional jobs that covered them were pruned some time ago as part of CI cost-cutting efforts.)
Hi @JoelSpeed, on our end we've been doing manual testing where we SSH into the nodes directly - we haven't implemented automated node exec in our test suites yet due to stability concerns with these disruptive tests. As @jubittajohn mentioned, there's already an automated implementation in that E2E test, so we shouldn't need to add another one, right? We can just reference their existing work.
|
Multi PR test showed that the delete-3-masters scenario passed, but it timed out at 4 hours, so we will need to lengthen the timeout and run again. Looks promising though! |
|
I forgot multi PR tests are a thing now. That's quite useful. The e2e tests for our own scaling suite will take me a while to write and wire up as a presubmit, so the CPMSO test is a good signal in the meantime. Since the removal controller will now take its sweet time waiting for revisions to stabilize, it will take a while. Either way, I see the timeout has been bumped already. |
|
/testwith openshift/cluster-control-plane-machine-set-operator/main/e2e-aws-periodic-pre openshift/cluster-control-plane-machine-set-operator#383 |
|
We also had to bump the config in the release repo to 6h, which I think went in after you kicked that off, so we may need to trigger that again to get a complete run. But the first run did successfully complete the new test where we delete the three masters simultaneously. In case you weren't aware, that suite we are kicking off is part of the release-blocking payloads, so as soon as this is fixed we can add this in as release blocking. |
|
/testwith openshift/cluster-control-plane-machine-set-operator/main/e2e-aws-periodic-pre openshift/cluster-control-plane-machine-set-operator#383 |
|
|
/testwith openshift/cluster-control-plane-machine-set-operator/main/e2e-aws-periodic-pre openshift/cluster-control-plane-machine-set-operator#383 Image build failures should be fixed now |
|
/testwith openshift/cluster-control-plane-machine-set-operator/main/e2e-aws-periodic-pre openshift/cluster-control-plane-machine-set-operator#383 |
|
Writing up the e2e test for delete-all-masters on our side here: openshift/origin#30760, but I might leave out the unhealthy member case if that leads me down other rabbit holes on fixing/skipping invariants. Also, I still need to add unit tests here for verifying:
|
|
/testwith openshift/cluster-control-plane-machine-set-operator/main/e2e-aws-periodic-pre openshift/cluster-control-plane-machine-set-operator#383 Added the check for slowing down member removals while the configmap lags behind the live membership. Testing in openshift/origin#30760 (comment) |
|
/testwith openshift/cluster-control-plane-machine-set-operator/main/e2e-aws-periodic-pre openshift/cluster-control-plane-machine-set-operator#383 |
Previously, the ClusterMemberRemovalController would remove etcd members during revision rollouts, causing cluster degradation when simultaneously deleting multiple control plane machines with the OnDelete strategy.

During a revision rollout, etcd members can temporarily appear unhealthy while their pods are reinstalled to the latest revision. This is different from members being indefinitely unhealthy on a stable revision.

Additionally, the EtcdEndpointsController pauses during revision rollouts, so when a replacement machine is added and triggers a rollout, the etcd-endpoints configmap won't update. This causes API servers on the old revision to use removed member endpoints, leading to API unavailability.

This change adds a revision stability check before allowing member removal, ensuring we only remove members when revisions are stable and unhealthy members are truly unhealthy. This explicitly codifies the 4.17 behavior where the operator waited for all revisions to complete before removing members and lifecycle hooks.

Additionally, the ClusterMemberRemovalController now verifies that the live etcd membership matches the configmap before proceeding with member removal, preventing potential issues during rapid member deletion.
|
Already have a unit test for the revision stability helper, so not going to redo that. Added a unit test for skipping when we have a membership inconsistency between the live membership and the configmap. With all this, the e2e test over on openshift/origin#30760 passes (minus the usual invariants). Holding this until the test in origin merges so we can actually run the e2e-aws-ovn-etcd-scaling presubmit here to verify. On that note, I've changed my mind: this new test can stay in the scaling suite's workflow and doesn't need a new presubmit/job. Seems to run fine serially with the others and passed in <60m. /hold |
|
/lgtm thanks @hasbro17 :) |
|
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: hasbro17, tjungblu. The full list of commands accepted by this bot can be found here. The pull request process is described here. Details: Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
/testwith openshift/cluster-control-plane-machine-set-operator/main/e2e-aws-periodic-pre openshift/cluster-control-plane-machine-set-operator#383 |
|
Sorry this got held up, was out on PTO. I finally have the … Going to unhold this and merge. |
|
/unhold |
|
/retest-required |
1 similar comment
|
/retest-required |
|
doesn't actually look like something caused by this PR, overriding: |
|
@tjungblu: Overrode contexts on behalf of tjungblu: ci/prow/e2e-agnostic-ovn. Details. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
/label acknowledge-critical-fixes-only |
|
@hasbro17: This PR has been marked as verified. Details. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
/retest-required |
|
/retest |
|
/retest-required |
2 similar comments
|
/retest-required |
|
/retest-required |
|
@hasbro17: all tests passed! Full PR test history. Your PR dashboard. Details: Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
|
@hasbro17: Jira Issue Verification Checks: Jira Issue OCPBUGS-74151 has been moved to the MODIFIED state and will move to the VERIFIED state when the change is available in an accepted nightly payload. 🕓 Details. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
@hasbro17: #1540 failed to apply on top of branch "release-4.21". Details. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
Fix included in accepted release 4.22.0-0.nightly-2026-02-21-040517 |
TODO: Before merging this needs a corresponding test in the etcd vertical scaling suite to make sure we can validate this all the way back to 4.18.