🐛 set taint on managedcluster before starting spoke cleanup #76
Conversation
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: arturshadnik. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.
Walkthrough
Adds CleanupConfig.forceClusterDrain and two ManagedCluster taint constants; refactors hub/spoke reconcile requeue timings; converts several cleanup methods to return requeue semantics; introduces hub preflight/taint flows; updates CRD/Helm templates, webhook validation, dev config, and e2e Hub/Spoke tests.
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~50 minutes
Actionable comments posted: 4
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
fleetconfig-controller/internal/webhook/v1beta1/validation.go (1)
26-28: Error message out of date with allowed Spoke updates.
The message lists old fields; now the entire `spec.cleanupConfig` is allowed. Apply:

```diff
-errAllowedSpokeUpdate = "spoke contains changes which are not allowed; only changes to spec.klusterlet.annotations, spec.klusterlet.values, spec.klusterlet.valuesFrom, spec.kubeconfig, spec.addOns, spec.purgeAgentNamespace, spec.cleanupConfig.purgeKubeconfigSecret, spec.timeout, and spec.logVerbosity are allowed when updating a spoke"
+errAllowedSpokeUpdate = "spoke contains changes which are not allowed; only changes to spec.klusterlet.annotations, spec.klusterlet.values, spec.klusterlet.valuesFrom, spec.kubeconfig, spec.addOns, spec.cleanupConfig, spec.timeout, and spec.logVerbosity are allowed when updating a spoke"
```
🧹 Nitpick comments (2)
fleetconfig-controller/charts/fleetconfig-controller/values.yaml (1)
187-187: Good addition; clarify tainting sequence and scope.
Doc reads well. Consider adding one sentence noting that the terminating taint is applied later in deletion and is not tolerated by any placement/addon, while ForceClusterDrain only applies the workload-cleanup taint to deschedule non-essential workloads. This helps operators reason about order and effects.
fleetconfig-controller/internal/controller/v1beta1/spoke_controller.go (1)
301-328: Spoke-mode Hub watch misses Status.Phase transitions.
Parity with the Hub-mode predicate will improve responsiveness (e.g., during Hub deletion). Include phase change detection here as well.

```diff
 UpdateFunc: func(e event.UpdateEvent) bool {
 	oldHub, ok := e.ObjectOld.(*v1beta1.Hub)
 	if !ok {
 		return false
 	}
 	newHub, ok := e.ObjectNew.(*v1beta1.Hub)
 	if !ok {
 		return false
 	}
-	return sharedFieldsChanged(oldHub.Spec.DeepCopy(), newHub.Spec.DeepCopy())
+	return sharedFieldsChanged(oldHub.Spec.DeepCopy(), newHub.Spec.DeepCopy()) ||
+		oldHub.Status.Phase != newHub.Status.Phase
 },
```

Based on learnings
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (13)
- fleetconfig-controller/api/v1beta1/constants.go (1 hunks)
- fleetconfig-controller/api/v1beta1/spoke_types.go (2 hunks)
- fleetconfig-controller/charts/fleetconfig-controller/README.md (1 hunks)
- fleetconfig-controller/charts/fleetconfig-controller/crds/fleetconfig.open-cluster-management.io_spokes.yaml (1 hunks)
- fleetconfig-controller/charts/fleetconfig-controller/templates/fleetconfig.yaml (1 hunks)
- fleetconfig-controller/charts/fleetconfig-controller/values.yaml (2 hunks)
- fleetconfig-controller/devspace.yaml (3 hunks)
- fleetconfig-controller/internal/controller/v1beta1/constants.go (1 hunks)
- fleetconfig-controller/internal/controller/v1beta1/hub_controller.go (6 hunks)
- fleetconfig-controller/internal/controller/v1beta1/spoke_controller.go (8 hunks)
- fleetconfig-controller/internal/controller/v1beta1/spoke_handler.go (9 hunks)
- fleetconfig-controller/internal/webhook/v1beta1/validation.go (2 hunks)
- fleetconfig-controller/test/e2e/v1beta1_hub_spoke.go (2 hunks)
🧰 Additional context used
🧠 Learnings (5)
📚 Learning: 2025-09-25T23:18:41.573Z
Learnt from: arturshadnik
PR: open-cluster-management-io/lab#69
File: fleetconfig-controller/internal/controller/v1beta1/spoke_controller.go:125-133
Timestamp: 2025-09-25T23:18:41.573Z
Learning: In the fleetconfig-controller spoke deletion flow, SpokeCleanupFinalizer is always removed before HubCleanupFinalizer. This means that checking for the existence of HubCleanupFinalizer in the deletion logic is sufficient regardless of cluster type, as any SpokeCleanupFinalizer would have already been removed by the time the hub cleanup runs.
Applied to files:
- fleetconfig-controller/test/e2e/v1beta1_hub_spoke.go
- fleetconfig-controller/internal/controller/v1beta1/spoke_controller.go
- fleetconfig-controller/internal/controller/v1beta1/spoke_handler.go
📚 Learning: 2025-08-22T19:38:49.769Z
Learnt from: arturshadnik
PR: open-cluster-management-io/lab#52
File: fleetconfig-controller/internal/controller/v1beta1/spoke_controller_test.go:49-56
Timestamp: 2025-08-22T19:38:49.769Z
Learning: In the fleetconfig-controller project, the SpokeSpec and HubSpec structs in v1beta1 contain only optional fields (like Foo *string with omitempty tags), so creating these resources without populating the Spec field does not cause validation failures in tests.
Applied to files:
fleetconfig-controller/test/e2e/v1beta1_hub_spoke.go
📚 Learning: 2025-09-12T20:53:55.600Z
Learnt from: arturshadnik
PR: open-cluster-management-io/lab#59
File: fleetconfig-controller/internal/controller/v1beta1/addon.go:641-661
Timestamp: 2025-09-12T20:53:55.600Z
Learning: The waitForAddonManifestWorksCleanup function is called specifically during spoke cleanup/unjoin operations, only when addons were previously enabled, and after addon disable operations. In this controlled context, checking for zero total ManifestWorks is equivalent to checking for zero addon ManifestWorks since other ManifestWorks would have been handled earlier in the cleanup flow.
Applied to files:
fleetconfig-controller/internal/controller/v1beta1/spoke_handler.go
📚 Learning: 2025-09-22T19:16:34.109Z
Learnt from: arturshadnik
PR: open-cluster-management-io/lab#59
File: fleetconfig-controller/internal/webhook/v1beta1/validation.go:103-121
Timestamp: 2025-09-22T19:16:34.109Z
Learning: In the open-cluster-management-io/lab repository's fleetconfig-controller v1beta1 API, the Klusterlet field in SpokeSpec is defined as a struct value (Klusterlet Klusterlet), not a pointer (*Klusterlet), so direct field access like Klusterlet.Annotations is safe without nil checks. The Klusterlet struct does not contain a Source field.
Applied to files:
fleetconfig-controller/charts/fleetconfig-controller/README.md
📚 Learning: 2025-09-25T23:31:11.630Z
Learnt from: arturshadnik
PR: open-cluster-management-io/lab#69
File: fleetconfig-controller/charts/fleetconfig-controller/templates/ocm/fcc-addon/addon-template.yaml:110-112
Timestamp: 2025-09-25T23:31:11.630Z
Learning: The fleetconfig-controller-manager spoke agent requires create/update/patch/delete permissions on CustomResourceDefinitions because `clusteradm upgrade klusterlet` operations need create/update permissions and cleanup operations require delete permissions for proper lifecycle management.
Applied to files:
fleetconfig-controller/charts/fleetconfig-controller/README.md
🧬 Code graph analysis (5)
fleetconfig-controller/test/e2e/v1beta1_hub_spoke.go (3)
- fleetconfig-controller/test/utils/utils.go (2): WarnError (72-75), AssertConditions (236-253)
- fleetconfig-controller/api/v1beta1/constants.go (6): Deleting (63-63), SpokeJoined (28-28), CleanupFailed (25-25), AddonsConfigured (22-22), KlusterletSynced (34-34), PivotComplete (31-31)
- fleetconfig-controller/api/v1beta1/common.go (1): Condition (126-129)

fleetconfig-controller/internal/webhook/v1beta1/validation.go (1)
- fleetconfig-controller/api/v1beta1/spoke_types.go (1): CleanupConfig (91-115)

fleetconfig-controller/internal/controller/v1beta1/spoke_controller.go (2)
- fleetconfig-controller/api/v1beta1/constants.go (4): HubCleanupFinalizer (10-10), CleanupFailed (25-25), SpokeJoining (54-54), SpokeRunning (57-57)
- fleetconfig-controller/api/v1beta1/common.go (1): NewCondition (112-123)

fleetconfig-controller/internal/controller/v1beta1/hub_controller.go (4)
- fleetconfig-controller/api/v1beta1/constants.go (5): HubCleanupFinalizer (10-10), CleanupFailed (25-25), Unhealthy (60-60), HubStarting (48-48), HubRunning (51-51)
- fleetconfig-controller/api/v1beta1/common.go (1): NewCondition (112-123)
- fleetconfig-controller/api/v1beta1/spoke_types.go (1): SpokeList (336-340)
- fleetconfig-controller/pkg/common/common.go (1): AddOnClient (59-69)

fleetconfig-controller/internal/controller/v1beta1/spoke_handler.go (5)
- fleetconfig-controller/api/v1beta1/constants.go (10): ManagedClusterWorkloadCleanupTaint (135-135), ManagedClusterTerminatingTaint (139-139), InstanceTypeManager (83-83), PivotComplete (31-31), InstanceTypeUnified (89-89), HubCleanupPreflightFinalizer (7-7), SpokeCleanupFinalizer (13-13), HubCleanupFinalizer (10-10), AddonsConfigured (22-22), InstanceTypeAgent (86-86)
- fleetconfig-controller/internal/controller/v1beta1/spoke_controller.go (1): SpokeReconciler (52-58)
- fleetconfig-controller/api/v1beta1/spoke_types.go (2): Spoke (317-331), CleanupConfig (91-115)
- fleetconfig-controller/pkg/common/common.go (5): ClusterClient (20-30), WorkClient (46-56), AddOnClient (59-69), UpdateManagedCluster (84-89), OperatorClient (33-43)
- fleetconfig-controller/internal/kube/kube.go (1): RestConfigFromKubeconfig (22-32)
🪛 markdownlint-cli2 (0.18.1)
fleetconfig-controller/charts/fleetconfig-controller/README.md
95-95: Table column count
Expected: 3; Actual: 5; Too many cells, extra data will be missing
(MD056, table-column-count)
128-128: Table column count
Expected: 3; Actual: 4; Too many cells, extra data will be missing
(MD056, table-column-count)
139-139: Table column count
Expected: 3; Actual: 5; Too many cells, extra data will be missing
(MD056, table-column-count)
🔇 Additional comments (31)
fleetconfig-controller/charts/fleetconfig-controller/crds/fleetconfig.open-cluster-management.io_spokes.yaml (1)
76-83: New cleanupConfig.forceClusterDrain: looks correct. Type, default, and description align with API and chart usage.
Also applies to: 86-87
fleetconfig-controller/internal/webhook/v1beta1/validation.go (1)
107-107: Allowing updates to the entire CleanupConfig: LGTM. Zeroing CleanupConfig in the DeepCopy comparison correctly whitelists any cleanup changes.
Also applies to: 126-128
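For illustration, a minimal sketch of the whitelist-by-zeroing comparison described above, assuming value-typed fields and an illustrative helper name (not the controller's actual code; the module path is also assumed):

```go
package validation

import (
	"k8s.io/apimachinery/pkg/api/equality"

	// Module path assumed for illustration.
	v1beta1 "open-cluster-management.io/lab/fleetconfig-controller/api/v1beta1"
)

// onlyAllowedSpokeChanges reports whether two SpokeSpecs differ only in
// whitelisted fields, by zeroing those fields on deep copies and comparing
// what remains. Hypothetical helper shown to illustrate the pattern.
func onlyAllowedSpokeChanges(oldSpec, newSpec *v1beta1.SpokeSpec) bool {
	o, n := oldSpec.DeepCopy(), newSpec.DeepCopy()

	// The entire cleanup config is now mutable, so zero it on both sides.
	// (Assumes CleanupConfig is a struct value; other allowed fields such as
	// klusterlet values, kubeconfig, addOns, timeout, and logVerbosity would
	// be zeroed the same way.)
	o.CleanupConfig = v1beta1.CleanupConfig{}
	n.CleanupConfig = v1beta1.CleanupConfig{}

	// Any remaining difference is a disallowed update.
	return equality.Semantic.DeepEqual(o, n)
}
```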
fleetconfig-controller/api/v1beta1/spoke_types.go (1)
92-115: New CleanupConfig.ForceClusterDrain: approved. Matches CRD, chart, and intended taint-driven cleanup behavior. Based on learnings.
fleetconfig-controller/api/v1beta1/constants.go (1)
133-140: Taint constants added: good public surface. Names/values align with docs and cleanup flow.
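As a rough sketch of how such constants pair with OCM's taint API: the constant names below come from this PR, but the key strings and helper functions are placeholders, not the repository's actual values.

```go
package v1beta1

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	clusterv1 "open-cluster-management.io/api/cluster/v1"
)

// Taint keys: names come from this PR; the string values here are assumed.
const (
	ManagedClusterWorkloadCleanupTaint = "fleetconfig.open-cluster-management.io/workload-cleanup" // assumed value
	ManagedClusterTerminatingTaint     = "fleetconfig.open-cluster-management.io/terminating"      // assumed value
)

// WorkloadCleanupTaint drains non-essential workloads; addon Placements that
// must survive cleanup are expected to tolerate it. Illustrative helper.
func WorkloadCleanupTaint() clusterv1.Taint {
	return clusterv1.Taint{
		Key:       ManagedClusterWorkloadCleanupTaint,
		Effect:    clusterv1.TaintEffectNoSelect,
		TimeAdded: metav1.Now(),
	}
}

// TerminatingTaint is tolerated by nothing, so it evicts all remaining
// addon-managed ManifestWorks before the ManagedCluster is deleted.
func TerminatingTaint() clusterv1.Taint {
	return clusterv1.Taint{
		Key:       ManagedClusterTerminatingTaint,
		Effect:    clusterv1.TaintEffectNoSelect,
		TimeAdded: metav1.Now(),
	}
}
```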
fleetconfig-controller/devspace.yaml (1)
120-121: Cert-manager disabled in chart while installed separately: verified.
The version variable `CERT_MANAGER_VERSION` is properly sourced from `hack/.versions.env` (v1.18.1) and injected via DevSpace's `DEVSPACE_ENV_FILE` configuration at deploy time. Infrastructure is correct across all three deployments (lines 120-121, 136-137, 153-154).

fleetconfig-controller/internal/controller/v1beta1/constants.go (1)
12-17: Requeue constants: properly encapsulated internal implementation. Verification confirms the constants are unexported and used only within their internal package. Both hub and spoke controllers correctly reference them in reconciliation requeue logic. No public API exposure, and appropriate interval values for controller workloads.
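A minimal sketch of the pattern, with assumed names and durations (the real constants live in internal/controller/v1beta1/constants.go):

```go
package controller

import (
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
)

// Assumed names and durations; not the repository's actual values.
const (
	requeueShort = 10 * time.Second // fast follow-up while work is in flight
	requeueLong  = 1 * time.Minute  // steady-state re-check interval
)

// requeue packages an interval into a ctrl.Result for Reconcile to return.
func requeue(after time.Duration) (ctrl.Result, error) {
	return ctrl.Result{RequeueAfter: after}, nil
}
```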
fleetconfig-controller/internal/controller/v1beta1/spoke_controller.go (9)
112-113: Requeue-after on first-finalizer add is correct.
127-128: Agent path: early finalizer add + requeue is correct.
136-139: Deletion phase transition + timed requeue looks right.
182-183: Initial phase/conditions requeue aligns with new constants.
196-197: Requeue until conditions reach WantStatus is appropriate.
203-204: Post-join requeue constant usage LGTM.
260-261: Watching Hub phase transitions in Hub-mode is good.
280-280: Watcher interval constant use is appropriate.
143-153: Cleanup kubeconfig nil-path verified; confirm deployment assumptions.
The nil-fallback mechanism is correctly implemented: RestConfigFromKubeconfig(nil) falls back to ctrl.GetConfig(). However, the code flow suggests nil-kubeconfig during cleanup may be limited to InstanceTypeAgent, where it is not actually used (cleanup routes to doSpokeCleanup only). For Manager/Unified instances, kubeconfig is populated via hub.Spec.Kubeconfig unless a loading error occurs, which prevents cleanup from executing.
Please confirm:
- Whether nil-kubeconfig can actually occur in cleanup() for Manager/Unified deployments
- That in-cluster config (when used) resolves to the Hub cluster in your manager/unified deployment topology
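For reference, a hedged sketch of the nil-fallback behavior described above; the actual helper in internal/kube/kube.go may differ in signature and error handling:

```go
package kube

import (
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/clientcmd"
	ctrl "sigs.k8s.io/controller-runtime"
)

// RestConfigFromKubeconfig builds a *rest.Config from raw kubeconfig bytes and
// falls back to the controller's ambient config when none is supplied.
// Sketch only; not the repository's actual implementation.
func RestConfigFromKubeconfig(raw []byte) (*rest.Config, error) {
	if len(raw) == 0 {
		// Nil/empty kubeconfig: use the controller's own config
		// (in-cluster config when running as a pod).
		return ctrl.GetConfig()
	}
	return clientcmd.RESTConfigFromKubeConfig(raw)
}
```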
fleetconfig-controller/internal/controller/v1beta1/hub_controller.go (7)
97-98: Requeue-after on first-finalizer add is correct.
109-110: Deletion phase transition + timed requeue looks right.
113-123: Cleanup now returns (requeue, err); usage is correct.
150-151: Initial requeue constant usage LGTM.
159-161: Pre-init requeue until HubInitialized is true is correct.
167-169: Requeue on unmet WantStatus is appropriate.
175-176: Post-init requeue constant usage LGTM.

fleetconfig-controller/internal/controller/v1beta1/spoke_handler.go (7)
43-55: Taints defined clearly with NoSelect effect. Good centralization; keys match API constants.
476-546: Hub cleanup flow is coherent; preflight + handoff is nice. Removing the preflight finalizer and waiting for SpokeCleanupFinalizer is a clean contract.
547-558: Edge: ManagedCluster not found short-circuit is good. Also removing both hub finalizers here prevents leaks.
562-574: Workload-cleanup taint application under ForceClusterDrain is correct. Good idempotency guard and update via common.UpdateManagedCluster.
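A sketch of an idempotent taint application like the one described, using the OCM cluster clientset directly; the real code goes through common.UpdateManagedCluster, and the taint key is passed in here purely for illustration.

```go
package controller

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	clusterclient "open-cluster-management.io/api/client/cluster/clientset/versioned"
	clusterv1 "open-cluster-management.io/api/cluster/v1"
)

// ensureTaint adds a NoSelect taint to the ManagedCluster if it is not already
// present, then persists the change. Illustrative only.
func ensureTaint(ctx context.Context, cc clusterclient.Interface, mc *clusterv1.ManagedCluster, key string) error {
	for _, t := range mc.Spec.Taints {
		if t.Key == key {
			return nil // idempotency guard: taint already set
		}
	}
	mc.Spec.Taints = append(mc.Spec.Taints, clusterv1.Taint{
		Key:       key,
		Effect:    clusterv1.TaintEffectNoSelect,
		TimeAdded: metav1.Now(),
	})
	_, err := cc.ClusterV1().ManagedClusters().Update(ctx, mc, metav1.UpdateOptions{})
	return err
}
```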
686-760: Spoke cleanup ordering is solid; respects hub preflight and unjoin. Finalizer removals and optional operator purge look correct; the requeue needed on the hub side for final cleanup is handled.
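To make the finalizer handoff concrete, a small sketch of the removal step using controller-runtime helpers; the surrounding function and module path are illustrative, only the finalizer ordering note reflects this PR.

```go
package controller

import (
	"context"

	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/controller/controllerutil"

	// Module path assumed for illustration.
	v1beta1 "open-cluster-management.io/lab/fleetconfig-controller/api/v1beta1"
)

// removeFinalizer drops a finalizer from the Spoke and persists the change,
// doing nothing if the finalizer is already absent. Illustrative sketch; in
// the controller, SpokeCleanupFinalizer is removed before HubCleanupFinalizer.
func removeFinalizer(ctx context.Context, c client.Client, spoke *v1beta1.Spoke, finalizer string) error {
	if !controllerutil.ContainsFinalizer(spoke, finalizer) {
		return nil
	}
	controllerutil.RemoveFinalizer(spoke, finalizer)
	return c.Update(ctx, spoke)
}
```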
581-586: Verify that blocking deletion on non-addon ManifestWorks is the intended UX.
Returning an error from Reconcile triggers retry with exponential backoff in controller-runtime, and the codebase consistently uses this pattern: cleanup errors set CleanupFailed=true and return an error to block reconciliation. When cleanup fails for any reason it blocks resource deletion, a tradeoff worth weighing against how crucial the cleanup is. The e2e test in v1beta1_hub_spoke.go:362 expects CleanupFailed=True during cleanup scenarios, confirming this is intentional. The current behavior (returning an error, which sets CleanupFailed=true) correctly prevents deletion of the managed cluster while active non-addon ManifestWorks exist, preventing orphaned resources; this is the desired UX for cluster cleanup safety.
510-521: CSR cleanup logic verified: HasLabels selector is supported.
HasLabels was introduced in controller-runtime v0.5.0 and is confirmed available in this repository's vendored controller-runtime. The code at line 511 correctly uses `client.HasLabels{"open-cluster-management.io/cluster-name"}` and no changes are required.
220-220: Value properly propagated through all three layers.
Verified:
- CRD schema default (crds/fleetconfig.open-cluster-management.io_spokes.yaml:77): `default: false`
- Helm template (templates/fleetconfig.yaml:81): `{{ .forceClusterDrain | default false }}`
- Controller logic (internal/controller/v1beta1/spoke_handler.go:562): read and honored in the `hubCleanupPreflight` function, which applies the workload-cleanup taint when enabled

Default false is appropriate; cluster drain should be opt-in for safety.
fleetconfig-controller/test/e2e/v1beta1_hub_spoke.go (1)
335-372: Test condition set correctly reflects deletion blocking semantics.
The test properly verifies that deletion is blocked while ManifestWorks exist by checking all five conditions: the joined/synced/configured states persist from normal operation, while `CleanupFailed=True` indicates the cleanup is currently blocked by active ManifestWorks. This aligns with the controller's deletion logic, where `doHubCleanup()` returns an error when ManifestWorks are present, triggering the `CleanupFailed` condition.
/lgtm
Merged 5a6ec6b into open-cluster-management-io:main
This change resolves an issue where AddOns that use `installStrategy.type: Placements` would block Spoke deletion due to ManifestWorks not being cleaned up.

The solution is to set 2 taints on the ManagedCluster. The first is expected to be tolerated by any AddOn's Placement that is needed during the cleanup process; its purpose is to drain all scheduled workloads. The second taint should not be tolerated by anything, including AddOns. This taint removes all remaining ManifestWorks (except those created manually) from the spoke's namespace. The first taint is gated behind a new `Spoke.Spec.CleanupConfig.ForceClusterDrain` field, which defaults to false for backwards compatibility. The second taint (the one that removes addons) is always applied.

This also resolves an issue where, if a spoke failed to join and a ManagedCluster was never created, the Spoke resource could not be deleted due to an early exit and finalizers not being removed. Both Hub finalizers will now be removed if a ManagedCluster does not exist during cleanup.
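To illustrate what "tolerated by any AddOn's Placement" means in practice, a hedged OCM Placement sketch whose toleration matches a workload-cleanup taint; the taint key shown is a placeholder, not necessarily the key this controller sets.

```go
package example

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	clusterv1 "open-cluster-management.io/api/cluster/v1"
	clusterv1beta1 "open-cluster-management.io/api/cluster/v1beta1"
)

// drainTolerantPlacement returns a Placement whose toleration lets its addon
// stay scheduled while the workload-cleanup taint drains everything else.
// The taint key is a placeholder for whatever key the controller applies.
func drainTolerantPlacement() *clusterv1beta1.Placement {
	return &clusterv1beta1.Placement{
		ObjectMeta: metav1.ObjectMeta{Name: "cleanup-critical-addon", Namespace: "default"},
		Spec: clusterv1beta1.PlacementSpec{
			Tolerations: []clusterv1beta1.Toleration{{
				Key:      "fleetconfig.open-cluster-management.io/workload-cleanup", // assumed key
				Operator: clusterv1beta1.TolerationOpExists,
				Effect:   clusterv1.TaintEffectNoSelect,
			}},
		},
	}
}
```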
During cleanup, the Hub controller will no longer return errors while waiting for Spokes to be deleted. Instead it will log how many Spokes remain and requeue after a short, consistent period. The errors ended up flooding the logs with non-actionable spam. Spokes will still return an error if non-addon ManifestWorks remain that need to be manually cleaned up; that error is actionable and blocks deletion unless acted upon, so it stays.
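A rough sketch of the log-and-requeue behavior described here, with assumed types, module path, and interval (not the controller's exact code):

```go
package controller

import (
	"context"
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/log"

	// Module path assumed for illustration.
	v1beta1 "open-cluster-management.io/lab/fleetconfig-controller/api/v1beta1"
)

// waitForSpokes logs how many Spokes remain and requeues instead of returning
// an error, so hub cleanup no longer floods the logs. Illustrative sketch.
func waitForSpokes(ctx context.Context, c client.Client) (ctrl.Result, error) {
	logger := log.FromContext(ctx)

	var spokes v1beta1.SpokeList
	if err := c.List(ctx, &spokes); err != nil {
		return ctrl.Result{}, err
	}
	if n := len(spokes.Items); n > 0 {
		logger.Info("waiting for spokes to be deleted before hub cleanup", "remaining", n)
		return ctrl.Result{RequeueAfter: 10 * time.Second}, nil // short, consistent requeue
	}
	return ctrl.Result{}, nil
}
```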
Small change to E2E assertions to check that the Spoke deletion is being blocked by a manually created ManifestWork.