Do not block scaling up due to pending/not yet complete node deletion #4051
Comments
One mitigation we use is to have multiple VMSSes with identical nodes. This helps constrain VMSS-scoped issues to a subset of the cluster and gives CA an alternative to scale out. This had a large impact on Azure API usage in older versions of Kubernetes, but 1.18+ reduced the impact considerably.
We have done that, but at our scale it still runs into problems and tends to introduce other scaling issues. (We end up in Azure rate-limit "jail" during scale-down in the evenings, since those operations are often one VM at a time. Scale-up frequently adds 50 to 100 VMs at once, but work trickles off the nodes more slowly.)
+1. We just had a similar case, which we only found out about in retrospect, where hundreds of pods were stuck unschedulable because a single Azure VMSS instance took almost 2 hours to delete.
Note that our current process, when things look stuck because a delete has not completed, is to restart the cluster-autoscaler pod. After the restart it notices the pending scale-up before it rediscovers the needed delete and gets stuck retrying it. It is a harsh hack, but it helps mitigate the problem temporarily.
+1
It's important to call out that this concerns unregistered node deletions, which is an error-recovery loop, not your usual VM deletes. If it were due to regular scale-downs there might be an issue in the Azure provider, but none of that provider code is blocking in the sense that it would block the CA main loop or scale-ups. This method issues delete requests and blocks until those get cleared up. That will happen in only two cases:
One option would be for the Azure provider to take the instance out of its cache once the delete call is issued. However, the next cache refresh (5 minutes) would bring the instance back, so I am not sure how much that would improve the situation. The option I prefer is for the autoscaler not to block on this at all. The logic would have to change to something like "remove the old unregistered nodes and then clear the unregistered list", so that it can keep making progress.
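A minimal sketch of that non-blocking logic, assuming hypothetical types and method names (the real cluster-autoscaler and Azure provider interfaces differ): deletion of long-unregistered instances is requested asynchronously, the instances are dropped from the tracked set, and the same loop iteration still performs the scale-up.

```go
// Illustrative sketch only, not the real cluster-autoscaler code. It shows the
// ordering change proposed above: request deletes, forget the instances, and
// let scale-up proceed in the same iteration instead of waiting.
package main

import (
	"fmt"
	"time"
)

// UnregisteredNode is a cloud instance that never joined the cluster.
type UnregisteredNode struct {
	ProviderID string
	Since      time.Time
}

// NodeGroup is a stand-in for a VMSS-backed node group.
type NodeGroup interface {
	DeleteInstance(providerID string) error // returns once the request is accepted, not when it completes
	IncreaseSize(delta int) error
}

// reconcileUnregistered requests deletion of instances unregistered longer
// than maxAge and returns the ones still within the grace period. It never
// waits for the cloud provider to finish the delete.
func reconcileUnregistered(group NodeGroup, unregistered []UnregisteredNode, maxAge time.Duration) []UnregisteredNode {
	var remaining []UnregisteredNode
	for _, n := range unregistered {
		if time.Since(n.Since) < maxAge {
			remaining = append(remaining, n)
			continue
		}
		if err := group.DeleteInstance(n.ProviderID); err != nil {
			fmt.Printf("delete request for %s failed: %v (will retry next loop)\n", n.ProviderID, err)
			remaining = append(remaining, n)
		}
		// On success the instance is forgotten; the next cloud-provider
		// refresh re-adds it if the delete is still pending.
	}
	return remaining
}

// runLoop cleans up unregistered instances first, but a pending delete never
// short-circuits the scale-up decision.
func runLoop(group NodeGroup, unregistered []UnregisteredNode, neededNodes int) {
	_ = reconcileUnregistered(group, unregistered, 15*time.Minute)
	if neededNodes > 0 {
		if err := group.IncreaseSize(neededNodes); err != nil {
			fmt.Printf("scale-up failed: %v\n", err)
		}
	}
}

type fakeGroup struct{}

func (fakeGroup) DeleteInstance(id string) error { fmt.Println("requested delete of", id); return nil }
func (fakeGroup) IncreaseSize(d int) error       { fmt.Println("scaling up by", d); return nil }

func main() {
	stuck := []UnregisteredNode{{ProviderID: "azure:///.../virtualMachines/3", Since: time.Now().Add(-2 * time.Hour)}}
	runLoop(fakeGroup{}, stuck, 50)
}
```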
Yes, it is an error-recovery loop, but it happens far more often than one might expect. VM deletes or VM scale-ins sometimes fail, and the instances then become unregistered nodes: they are listed in the VMSS but are not part of the Kubernetes cluster (never joined). Under heavy scaling (adding or removing tens or hundreds of nodes) it is possible for one or more instances to fail this way and end up as unregistered nodes that are attempted to be deleted again (and again, and again, if they take a long time). During that time, if scaling needs to add more nodes, the autoscaler will not add them, because it is waiting for these unregistered nodes to finish deleting. Restarting the autoscaler resets its internal state; it then does a single scale-up, notices the unregistered nodes again, and gets stuck waiting for them to delete before scaling up further. I think management of unregistered nodes should be kept separate from the question of scaling requests.
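For clarity, "unregistered" here means an instance the VMSS still reports but for which no matching Kubernetes node exists. Roughly, with made-up types rather than the actual autoscaler code:

```go
// Illustrative only: identifies cloud instances that never joined the cluster
// by comparing provider-reported instance IDs against registered nodes.
package main

import "fmt"

type Node struct {
	Name       string
	ProviderID string
}

// findUnregistered returns cloud instance IDs that have no matching Node.
func findUnregistered(cloudInstanceIDs []string, nodes []Node) []string {
	registered := make(map[string]bool, len(nodes))
	for _, n := range nodes {
		registered[n.ProviderID] = true
	}
	var unregistered []string
	for _, id := range cloudInstanceIDs {
		if !registered[id] {
			unregistered = append(unregistered, id)
		}
	}
	return unregistered
}

func main() {
	vmssInstances := []string{"vmss/vm-0", "vmss/vm-1", "vmss/vm-2"}
	joined := []Node{{Name: "node-0", ProviderID: "vmss/vm-0"}, {Name: "node-1", ProviderID: "vmss/vm-1"}}
	fmt.Println(findUnregistered(vmssInstances, joined)) // [vmss/vm-2]: never joined, so the autoscaler keeps retrying its deletion
}
```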
Looks like scaling activity is no longer blocked by deletion of unregistered nodes since #4810.
@k8s-triage-robot: Closing this issue, marking it as "Not Planned".
Which component are you using?: Cluster Autoscaler
Is your feature request designed to solve a problem? If so describe the problem this feature should solve.:
When the autoscaler deletes instances in an Azure VMSS, it will not scale up until those delete operations are seen to have completed. There have been times when deletes took over an hour, and during that window many additional nodes were needed. In one case, over 200 new nodes were needed but were not added because the autoscaler would not attempt a scale-up while the deletes were incomplete.
It turns out that if we restart the autoscaler under these conditions, it performs one scale-up operation and then gets stuck again as soon as it notices the deleting/to-be-deleted nodes and retries the deletes, after which it will not scale any further. During the business-hours ramp-up this can cause significant problems: our clusters often add hundreds of nodes in that window, and if a stuck deleting node remains from earlier, the clusters run short of compute resources and a service outage follows.
In one case a small scale-up (3 new nodes) somehow failed within the VMSS and the instances timed out. When the autoscaler then tried to delete those "unregistered" instances, they did not delete right away, which prevented further scale-up attempts. This had a significant negative impact on the operation of the autoscaler; again, restarting the autoscaler got it to scale up some nodes before it noticed the "unregistered" instances and tried to delete them again. Another restart bought us one more scale-up before it again tried to delete the "unregistered" instances.
Describe the solution you'd like.:
If the autoscaler would simply continue its normal scale-up behavior while waiting for deletes to complete, that would address situations like this. In short, do not block scaling up because of pending deletes: when new nodes are needed, the slow-to-delete nodes are not going to help in any way and should not be part of the consideration.
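As a rough illustration of "not part of the consideration" (hypothetical types and numbers, not the autoscaler's real sizing logic): instances that are being deleted contribute no capacity, so they are simply excluded when computing the new target size, and a pending delete never gates the increase.

```go
// Toy model only: deleting instances are excluded from the capacity count
// while the scale-up calculation proceeds regardless of pending deletes.
package main

import "fmt"

type Instance struct {
	ID       string
	Deleting bool
}

// newTargetSize computes a new node-group target for the pending pods,
// counting only instances that can actually run workloads.
func newTargetSize(instances []Instance, podsPerNode, pendingPods, maxSize int) int {
	current := 0
	for _, in := range instances {
		if !in.Deleting {
			current++ // a stuck delete adds no capacity and is not counted
		}
	}
	needed := (pendingPods + podsPerNode - 1) / podsPerNode // ceil(pendingPods / podsPerNode)
	target := current + needed
	if target > maxSize {
		target = maxSize
	}
	return target
}

func main() {
	instances := []Instance{
		{ID: "vm-0"},
		{ID: "vm-1"},
		{ID: "vm-2", Deleting: true}, // slow delete: ignored, does not block the calculation
	}
	fmt.Println(newTargetSize(instances, 30, 200, 600)) // 2 usable + ceil(200/30)=7 -> target 9
}
```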
Describe any alternative solutions you've considered.:
We are looking at simply restarting the autoscaler any time we see repeated attempts to delete the same instance. This is very inefficient, but it does address the issue without deploying a new autoscaler. It is a hack, but manual restarts have shown that it works as a remediation of the problem.
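A rough sketch of such a watchdog, assuming a log pattern that may not match any particular cluster-autoscaler version (adjust the regex to whatever your CA actually logs); the restart itself, e.g. `kubectl -n kube-system rollout restart deployment/cluster-autoscaler`, is left to the operator:

```go
// Sketch of a log watchdog: count deletion attempts per instance in the
// cluster-autoscaler logs and flag instances whose deletes keep being retried.
// The log pattern below is an assumption, not the exact autoscaler message.
package main

import (
	"bufio"
	"fmt"
	"os"
	"regexp"
)

func main() {
	const threshold = 3 // same instance deleted this many times suggests a stuck delete
	// Hypothetical pattern; match it against your own autoscaler logs.
	deleteAttempt := regexp.MustCompile(`[Dd]eleting unregistered (?:node|instance) (\S+)`)

	attempts := map[string]int{}
	scanner := bufio.NewScanner(os.Stdin) // e.g. piped from `kubectl logs -f` of the autoscaler pod
	for scanner.Scan() {
		m := deleteAttempt.FindStringSubmatch(scanner.Text())
		if m == nil {
			continue
		}
		attempts[m[1]]++
		if attempts[m[1]] == threshold {
			fmt.Printf("instance %s has %d delete attempts; consider restarting the autoscaler\n", m[1], threshold)
		}
	}
	if err := scanner.Err(); err != nil {
		fmt.Fprintln(os.Stderr, "reading logs:", err)
	}
}
```

Piping the autoscaler pod's logs into this program would surface instances whose deletes keep being retried, which is the signal we currently act on by hand.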
Additional context.:
This is on large, very dynamic clusters in Azure that regularly scale from, for example, 100 nodes to 600 nodes and back down again due to usage patterns.