Fix cool down status condition to trigger scale down #7954
Conversation
Hi @abdelrahman882. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with `/ok-to-test`. Once the patch is verified, the new status will be reflected by the `ok-to-test` label.

I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
Force-pushed 26d3d3e to a114cf0.
```diff
 	metrics.UpdateDurationFromStart(metrics.FindUnneeded, unneededStart)

-	scaleDownInCooldown := a.isScaleDownInCooldown(currentTime, scaleDownCandidates)
+	scaleDownInCooldown := a.isScaleDownInCooldown(currentTime)
```
I think instead of changing this logic we should move the taint cleanup out of the nesting it is currently in. The following block should be executed regardless of whether scale-down is in cooldown or what the scaleDownStatus is:
```go
if a.AutoscalingContext.AutoscalingOptions.MaxBulkSoftTaintCount != 0 {
	taintableNodes := a.scaleDownPlanner.UnneededNodes()
	// Make sure we are only cleaning taints from selected node groups.
	selectedNodes := filterNodesFromSelectedGroups(a.CloudProvider, allNodes...)
	// This is a sanity check to make sure `taintableNodes` only includes
	// nodes from selected nodes.
	taintableNodes = intersectNodes(selectedNodes, taintableNodes)
	untaintableNodes := subtractNodes(selectedNodes, taintableNodes)
	actuation.UpdateSoftDeletionTaints(a.AutoscalingContext, taintableNodes, untaintableNodes)
}
```
Sure, but calling UpdateSoftDeletionTaints this way will lead to the following:

- We might add soft taints to nodes even though scale-down is in cooldown (actuation.UpdateSoftDeletionTaints cleans soft taints off needed nodes and adds them to unneeded nodes).
- We will execute this update regardless of whether any node deletion happened (I mention this because the current implementation only updates if `scaleDownStatus.Result == scaledownstatus.ScaleDownNoNodeDeleted`).

I don't see any big risk in these two points, except that marking nodes with soft taints while scale-down is in cooldown is somewhat unexpected behaviour.

One suggestion would be to split actuation.UpdateSoftDeletionTaints into cleanUpSoftDeletionTaints and markSoftDeletionTaints, doing the cleanup unconditionally and the marking only during scale-down (sketched below).

I updated the fix with your suggestion; please let me know if you want to consider the one mentioned above.
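A minimal sketch of the proposed split, purely for illustration: cleanUpSoftDeletionTaints and markSoftDeletionTaints are the hypothetical names from the comment above, taint handling is simplified to in-memory edits, and the real actuation code patches nodes through the Kubernetes API instead.

```go
package taintsketch

import (
	apiv1 "k8s.io/api/core/v1"
)

// DeletionCandidateTaint is the soft taint CA puts on scale-down candidates.
const DeletionCandidateTaint = "DeletionCandidateOfClusterAutoscaler"

// hasTaint reports whether the node carries a taint with the given key.
func hasTaint(node *apiv1.Node, key string) bool {
	for _, t := range node.Spec.Taints {
		if t.Key == key {
			return true
		}
	}
	return false
}

// cleanUpSoftDeletionTaints (hypothetical) strips the soft taint from nodes
// that are no longer candidates. Under the split it would run every loop,
// including while scale-down is in cooldown.
func cleanUpSoftDeletionTaints(untaintableNodes []*apiv1.Node) {
	for _, node := range untaintableNodes {
		kept := node.Spec.Taints[:0]
		for _, t := range node.Spec.Taints {
			if t.Key != DeletionCandidateTaint {
				kept = append(kept, t)
			}
		}
		node.Spec.Taints = kept
	}
}

// markSoftDeletionTaints (hypothetical) adds the soft taint to current
// candidates. Under the split it would run only outside of cooldown.
func markSoftDeletionTaints(taintableNodes []*apiv1.Node) {
	for _, node := range taintableNodes {
		if hasTaint(node, DeletionCandidateTaint) {
			continue
		}
		node.Spec.Taints = append(node.Spec.Taints, apiv1.Taint{
			Key:    DeletionCandidateTaint,
			Effect: apiv1.TaintEffectPreferNoSchedule,
		})
	}
}
```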
I think the current solution should be good enough:

- If scale-down is in cooldown, we still want to persist the candidate taints (if there are any) and remove the non-candidate taints. This way we ensure that once cooldown ends we will have the same cluster state as during the last run.
- I don't believe this is an issue. Honestly, I am surprised that this check (no deletion) was in place. I suspect it is a relic of the times when scale-down was not parallel, so it did not make sense to add annotations if we were supposed to spend the loop deleting some node.

@x13n, could you take a look as well? From my side it LGTM, but I would love to have an additional pair of eyes on this change.
Force-pushed a114cf0 to 89b9a0c, then 89b9a0c to 2bbe859.
```diff
-		actuation.UpdateSoftDeletionTaints(a.AutoscalingContext, taintableNodes, untaintableNodes)
-	}
+	a.updateSoftDeletionTaints(allNodes)
```
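Presumably the new helper just wraps the block quoted in the earlier review thread; a sketch of what it plausibly looks like (the actual method in the PR may differ in detail):

```go
// Sketch only, inferred from the block quoted earlier in this review.
func (a *StaticAutoscaler) updateSoftDeletionTaints(allNodes []*apiv1.Node) {
	if a.AutoscalingContext.AutoscalingOptions.MaxBulkSoftTaintCount != 0 {
		taintableNodes := a.scaleDownPlanner.UnneededNodes()
		// Make sure we are only cleaning taints from selected node groups.
		selectedNodes := filterNodesFromSelectedGroups(a.CloudProvider, allNodes...)
		// Sanity check: taintableNodes must only include selected nodes.
		taintableNodes = intersectNodes(selectedNodes, taintableNodes)
		untaintableNodes := subtractNodes(selectedNodes, taintableNodes)
		actuation.UpdateSoftDeletionTaints(a.AutoscalingContext, taintableNodes, untaintableNodes)
	}
}
```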
We could just call updateSoftDeletionTaints before the `if scaleDownInCooldown` check; this way we won't have to call it in two different places.
In RunOnce we do the following, in order:

1. updateClusterState updates the nodes, filtering out those with the toBeDeleted taint.
2. Scale down, tainting the chosen nodes with toBeDeleted.
3. Add the soft taint DeletionCandidate to unneeded nodes, ignoring those with the hard taint toBeDeleted.

If we put that update before `if scaleDownInCooldown`, we might add the soft taint DeletionCandidate to nodes that will scale down in the same loop, just before scaling down and adding the hard taint toBeDeleted. I believe that is fine and doesn't cause any issues I am aware of (see the sketch after this comment).

Waiting for @x13n's review; I will update with the suggestion if there is no objection.
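To make step 3 concrete, here is a hypothetical guard reusing hasTaint and DeletionCandidateTaint from the earlier sketch; ToBeDeletedByClusterAutoscaler is CA's hard taint key. It illustrates why soft-tainting a node that gets hard-tainted in the same loop is harmless: step 3 skips such nodes anyway.

```go
// ToBeDeletedTaint is the hard taint CA puts on nodes picked for deletion.
const ToBeDeletedTaint = "ToBeDeletedByClusterAutoscaler"

// shouldSoftTaint (hypothetical) mirrors the step-3 filtering described
// above: a node already hard-tainted in this loop is never soft-tainted.
func shouldSoftTaint(node *apiv1.Node) bool {
	return !hasTaint(node, ToBeDeletedTaint) && !hasTaint(node, DeletionCandidateTaint)
}
```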
I would keep the existing order, honestly. The reason is that soft tainting is not instant, and it is better to start actuation as soon as CA makes up its mind about removing nodes than to wait and increase the risk of race conditions with the scheduler. If you want to call the function just once, it can be done by putting this whole `if cooldown { ... } else { ... }` block into yet another function.
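A rough sketch of that extraction; the function name and branch bodies here are hypothetical, the point being a single call site for updateSoftDeletionTaints while soft tainting still happens after the scale-down decision:

```go
// Hypothetical extraction, not the code actually merged in this PR.
func (a *StaticAutoscaler) runScaleDownPhase(allNodes []*apiv1.Node, inCooldown bool) {
	if inCooldown {
		// Record the ScaleDownInCooldown status and skip actuation.
	} else {
		// Plan and actuate scale-down as soon as the decision is made.
	}
	// Single call site, still after actuation has been kicked off.
	a.updateSoftDeletionTaints(allNodes)
}
```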
Okay, LGTM then.
How confident are we that this made any actionable difference? The fact that (1) existing UT passed without any changes and (2) no new UT scenarios were added maybe suggests this had no effect?
A simple test could be (sketched below):

- start RunOnce() at min count with some tainted nodes,
- assert we go into cooldown,
- assert that the taints are released.

As far as I can tell, we currently do not have any unit tests that exercise the ScaleDownInCooldown status?
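A self-contained check in the spirit of that outline, exercising the hypothetical split helpers from the earlier sketch instead of the real RunOnce (which needs the full test harness):

```go
package taintsketch

import (
	"testing"

	apiv1 "k8s.io/api/core/v1"
)

// Even while scale-down is in cooldown, cleanup still runs and strips the
// stale soft taint; this mirrors what the suggested test would assert.
func TestCooldownStillReleasesSoftTaints(t *testing.T) {
	node := &apiv1.Node{}
	node.Spec.Taints = []apiv1.Taint{{
		Key:    DeletionCandidateTaint,
		Effect: apiv1.TaintEffectPreferNoSchedule,
	}}
	cleanUpSoftDeletionTaints([]*apiv1.Node{node})
	if hasTaint(node, DeletionCandidateTaint) {
		t.Errorf("expected stale DeletionCandidate taint to be removed during cooldown")
	}
}
```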
Ping @BigDarkClown, should we do a follow-up workstream to add UT cases?
This is a good point. @abdelrahman882, can you add unit tests in a separate PR?
Sure thing, will add those tomorrow
As @rakechill mentioned, this code path was not covered in unit tests.

Added #7995. @jackfrancis, the unit test there covers this case and would fail if we didn't call updateSoftDeletionTaints when scaleDownInCooldown is true, which should address your concerns.

cc: @x13n
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: abdelrahman882, BigDarkClown

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Approvers can indicate their approval by writing `/approve` in a comment.
/cherry-pick cluster-autoscaler-release-1.32
@jackfrancis: new pull request created: #8098
/cherry-pick cluster-autoscaler-release-1.31
@jackfrancis: #7954 failed to apply on top of branch "cluster-autoscaler-release-1.31".
What type of PR is this?
/kind bug
What this PR does / why we need it:
CA sets the soft taint DeletionCandidateOfClusterAutoscaler some time before proceeding to scale down. If during this time the nodes carrying these taints are excluded from being candidates and the number of scale-down candidates drops to 0, then scale-down enters cooldown and never executes UpdateSoftDeletionTaints, which removes these no-longer-needed taints.
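In rough pseudocode, the failure mode looks like this (simplified; names follow the diff above, branch bodies are illustrative):

```go
// Before the fix: cooldown was derived from the candidate count, and the
// soft-taint cleanup lived only on the non-cooldown path.
scaleDownInCooldown := a.isScaleDownInCooldown(currentTime, scaleDownCandidates)
if scaleDownInCooldown { // true as soon as len(scaleDownCandidates) == 0
	// Status becomes ScaleDownInCooldown; UpdateSoftDeletionTaints is never
	// reached, so stale DeletionCandidate taints linger indefinitely.
} else {
	// ... scale down; soft-taint cleanup only ever ran on this path ...
}
```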