Skip to content
Merged
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
32 changes: 17 additions & 15 deletions cluster-autoscaler/core/static_autoscaler.go
Original file line number Diff line number Diff line change
Expand Up @@ -627,6 +627,7 @@ func (a *StaticAutoscaler) RunOnce(currentTime time.Time) caerrors.AutoscalerErr

if scaleDownInCooldown {
scaleDownStatus.Result = scaledownstatus.ScaleDownInCooldown
a.updateSoftDeletionTaints(allNodes)
} else {
klog.V(4).Infof("Starting scale down")

Expand All @@ -645,21 +646,7 @@ func (a *StaticAutoscaler) RunOnce(currentTime time.Time) caerrors.AutoscalerErr
a.lastScaleDownDeleteTime = currentTime
a.clusterStateRegistry.Recalculate()
}

if scaleDownStatus.Result == scaledownstatus.ScaleDownNoNodeDeleted &&
a.AutoscalingContext.AutoscalingOptions.MaxBulkSoftTaintCount != 0 {
taintableNodes := a.scaleDownPlanner.UnneededNodes()

// Make sure we are only cleaning taints from selected node groups.
selectedNodes := filterNodesFromSelectedGroups(a.CloudProvider, allNodes...)

// This is a sanity check to make sure `taintableNodes` only includes
// nodes from selected nodes.
taintableNodes = intersectNodes(selectedNodes, taintableNodes)
untaintableNodes := subtractNodes(selectedNodes, taintableNodes)
actuation.UpdateSoftDeletionTaints(a.AutoscalingContext, taintableNodes, untaintableNodes)
}

a.updateSoftDeletionTaints(allNodes)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can just call update soft deletion taints before the if scaleDownInCooldown, this way we won't have to call it in 2 different places.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In runOnce we do the following order in

  1. updateClusterState updates the nodes filtering out those with toBeDeleted taint
  2. scale down and taint nodes with toBeDeleted
  3. add soft taints DeletionCandidate to unneeded nodes ignoring those with hard taint toBeDeleted

If we put that update before if scaleDownInCooldown we might add soft taints DeletionCandidate for the nodes that will scale down in the same loop just before scaling down and adding the hard taint toBeDeleted which I believe is fine and doesn't have any issues that i am aware of.

waiting for @x13n review and will update with the suggestion if there is no objection

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would keep the existing order honestly. The reason is that soft tainting is not instant and it is better to start actuation as soon as CA makes up its mind on removing rather than to wait & increase risk of race conditions with scheduler. If you want to call the function just once, it can be done by putting this whole if cooldown { ... } else { ... } block into yet another function.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, LGTM then.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How confident are we that this made any actionable difference? The fact that (1) existing UT passed without any changes and (2) no new UT scenarios were added maybe suggests this had no effect?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A simple test could be:

  • starting RunOnce() at min count w/ some tainted nodes,
  • asserting we go into cool down
  • asserting that the taints are released

As far as I can tell, we currently do not have any unit tests that exercise ScaleDownInCooldown status?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ping @BigDarkClown, should we do a follow up work stream to add UT cases?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a good point, @abdelrahman882 can you add unit tests in a separate PR?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure thing, will add those tomorrow

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As @rakechill mentioned, this code path was not covered in unit tests.

Added #7995, @jackfrancis The unit test there covers this case and it would fail in case we don't call updateSoftDeletionTaints in case of scaleDownInCooldown is true which should address you concerns.

cc: @x13n

if typedErr != nil {
klog.Errorf("Failed to scale down: %v", typedErr)
a.lastScaleDownFailTime = currentTime
Expand All @@ -679,6 +666,21 @@ func (a *StaticAutoscaler) RunOnce(currentTime time.Time) caerrors.AutoscalerErr
return nil
}

func (a *StaticAutoscaler) updateSoftDeletionTaints(allNodes []*apiv1.Node) {
if a.AutoscalingContext.AutoscalingOptions.MaxBulkSoftTaintCount != 0 {
taintableNodes := a.scaleDownPlanner.UnneededNodes()

// Make sure we are only cleaning taints from selected node groups.
selectedNodes := filterNodesFromSelectedGroups(a.CloudProvider, allNodes...)

// This is a sanity check to make sure `taintableNodes` only includes
// nodes from selected nodes.
taintableNodes = intersectNodes(selectedNodes, taintableNodes)
untaintableNodes := subtractNodes(selectedNodes, taintableNodes)
actuation.UpdateSoftDeletionTaints(a.AutoscalingContext, taintableNodes, untaintableNodes)
}
}

func (a *StaticAutoscaler) addUpcomingNodesToClusterSnapshot(upcomingCounts map[string]int, nodeInfosForGroups map[string]*framework.NodeInfo) error {
nodeGroups := a.nodeGroupsById()
upcomingNodeGroups := make(map[string]int)
Expand Down
Loading