[cluster-autoscaler][AWS] Massive scale-out when using composed topologySpreadConstraints #4129

Closed
alexcristi opened this issue Jun 9, 2021 · 12 comments · Fixed by #4970
Labels: area/cluster-autoscaler, kind/bug

alexcristi commented Jun 9, 2021

Which component are you using?:

cluster-autoscaler

What version of the component are you using?:

Component version:
1.20.0

What k8s version are you using (kubectl version)?:

kubectl version Output
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"16", GitVersion:"v1.16.2", GitCommit:"c97fe5036ef3df2967d086711e6c0c405941e14b", GitTreeState:"clean", BuildDate:"2019-10-15T23:41:55Z", GoVersion:"go1.12.10", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"19", GitVersion:"v1.19.7", GitCommit:"1dd5338295409edcfff11505e7bb246f0d325d15", GitTreeState:"clean", BuildDate:"2021-01-13T13:15:20Z", GoVersion:"go1.15.5", Compiler:"gc", Platform:"linux/amd64"}

What environment is this in?:

AWS

What did you expect to happen?:

If I request 50 pods, in the worst case I expect at most 50 new nodes to be provisioned. A small delta is acceptable.

What happened instead?:

A deployment scaled from 3 pods to 50 pods, and the cluster-autoscaler provisioned 124 new nodes (roughly 3 times more than needed).

How to reproduce it (as minimally and precisely as possible):

  • Have a Kubernetes cluster in AWS with the ASGs split by AZ (one ASG per availability zone, with the balance-similar-node-groups flag enabled)
  • Have a deployment with composed topologySpreadConstraints:
      topologySpreadConstraints:
        - labelSelector:
            matchLabels:
              app: sample
          maxSkew: 1
          topologyKey: failure-domain.beta.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule
        - labelSelector:
            matchLabels:
              app: sample
          maxSkew: 1
          topologyKey: kubernetes.io/hostname
          whenUnsatisfiable: ScheduleAnyway
  • Trigger a scale-out

Anything else we need to know?:

I0609 12:23:05.391344       1 scale_up.go:288] Pod sample-deployment-6567d494d-msmqx can't be scheduled on yaldo3-sbx-va6-k8s-compute-1-worker3AutoScalingGroup-SI93SIX7YS99, predicate checking error: node(s) didn't match pod topology spread constraints; predicateName=PodTopologySpread; reasons: node(s) didn't match pod topology spread constraints; debugInfo=
I0609 12:23:05.391358       1 scale_up.go:290] 38 other pods similar to sample-deployment-6567d494d-msmqx can't be scheduled on yaldo3-sbx-va6-k8s-compute-1-worker3AutoScalingGroup-SI93SIX7YS99


doalexan-macOS:~ doalexan$ ks logs cluster-autoscaler-757bc688c7-ctfgw -c cluster-autoscaler | grep Estimated
I0609 12:22:44.484873       1 scale_up.go:460] Estimated 43 nodes needed in yaldo3-sbx-va6-k8s-compute-1-worker1AutoScalingGroup-IEVC83OH6WBI
I0609 12:22:54.780549       1 scale_up.go:460] Estimated 41 nodes needed in yaldo3-sbx-va6-k8s-compute-1-worker3AutoScalingGroup-SI93SIX7YS99
I0609 12:23:05.391416       1 scale_up.go:460] Estimated 38 nodes needed in yaldo3-sbx-va6-k8s-compute-1-worker2AutoScalingGroup-14C4MNCP75I8W
doalexan-macOS:~ doalexan$ ks logs cluster-autoscaler-757bc688c7-ctfgw -c cluster-autoscaler | grep "Best option to resize"
I0609 12:22:44.484866       1 scale_up.go:456] Best option to resize: yaldo3-sbx-va6-k8s-compute-1-worker1AutoScalingGroup-IEVC83OH6WBI
I0609 12:22:54.780542       1 scale_up.go:456] Best option to resize: yaldo3-sbx-va6-k8s-compute-1-worker3AutoScalingGroup-SI93SIX7YS99
I0609 12:23:05.391402       1 scale_up.go:456] Best option to resize: yaldo3-sbx-va6-k8s-compute-1-worker2AutoScalingGroup-14C4MNCP75I8W
doalexan-macOS:~ doalexan$ ks logs cluster-autoscaler-757bc688c7-ctfgw -c cluster-autoscaler | grep "Final"
I0609 12:22:44.484915       1 scale_up.go:574] Final scale-up plan: [{yaldo3-sbx-va6-k8s-compute-1-worker1AutoScalingGroup-IEVC83OH6WBI 9->52 (max: 1000)}]
I0609 12:22:54.780601       1 scale_up.go:574] Final scale-up plan: [{yaldo3-sbx-va6-k8s-compute-1-worker3AutoScalingGroup-SI93SIX7YS99 5->46 (max: 1000)}]
I0609 12:23:05.391476       1 scale_up.go:574] Final scale-up plan: [{yaldo3-sbx-va6-k8s-compute-1-worker2AutoScalingGroup-14C4MNCP75I8W 14->52 (max: 1000)}]
@alexcristi added the kind/bug label on Jun 9, 2021
@MartinEmrich

I can confirm it with a single topologySpreadConstraint:

      topologySpreadConstraints:
        - topologyKey: "topology.kubernetes.io/zone"
          maxSkew: 1
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchExpressions:
            - key: app
              operator: In
              values:
              - myApp

After scaling a deployment from just 2 to 30 replicas (which should have fit easily on a few nodes), CA started scaling up all node groups to their maximum within a few seconds.

(CA 1.20.0, EKS 1.20, 1 ASG per AZ)

Might be related to #4099?

nshekhar221 commented Jul 27, 2021

Observing the same behaviour after testing with v1.21:

  • In an AWS environment with the ASGs split by AZ (one ASG per availability zone, with the
    balance-similar-node-groups flag enabled)
  • For a deployment with a failure-domain.beta.kubernetes.io/zone topologySpreadConstraint.
root@a4381d640386:/infrastructure# kubectl  get pods cluster-autoscaler-596fd6869f-l2wj8 -n kube-system -o yaml| grep -i "v1.21.0"
    ...
    image: us.gcr.io/k8s-artifacts-prod/autoscaling/cluster-autoscaler:v1.21.0
    ...

root@a4381d640386:/infrastructure# kubectl logs cluster-autoscaler-596fd6869f-l2wj8 -n kube-system cluster-autoscaler | grep Estimated
I0726 07:40:22.515606       1 scale_up.go:472] Estimated 2 nodes needed in nshkr-sbx-va6-k8s-compute-0-worker1AutoScalingGroup-1TJLUML3GSRP6
I0726 07:43:43.782579       1 scale_up.go:472] Estimated 44 nodes needed in nshkr-sbx-va6-k8s-compute-0-worker3AutoScalingGroup-10MYRA8D9IV6A
I0726 07:43:54.011754       1 scale_up.go:472] Estimated 41 nodes needed in nshkr-sbx-va6-k8s-compute-0-worker2AutoScalingGroup-1ORI890DK7IJM

Taking it a bit further, I tried the changes suggested in #4099 (i.e. adding a predicateChecker.CheckPredicates call after adding a new node to the snapshot in binpacking_estimator.go, to check whether the pod can actually be scheduled on that new node).

(cluster-autoscaler-release-1.21...nshekhar221:cluster-autoscaler-1.21.0-with-fix)
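
For illustration, here is a rough, simplified sketch of that idea (not the code from the branch above; ClusterSnapshot, PredicateChecker and addNewNode are stand-ins for the real cluster-autoscaler types, and the signatures are assumptions): after binpacking adds a fresh node for a pod that did not fit the nodes added so far, the predicates are re-checked against that node before it is counted, so a pod blocked by a zone-level spread constraint no longer produces an endless stream of empty nodes.

package estimator

import apiv1 "k8s.io/api/core/v1"

// Simplified stand-ins for the real cluster-autoscaler interfaces.
type ClusterSnapshot interface {
    AddPod(pod *apiv1.Pod, nodeName string) error
    RemoveNode(nodeName string) error
}

type PredicateChecker interface {
    // CheckPredicates returns nil when the pod fits the named node in the snapshot.
    CheckPredicates(snapshot ClusterSnapshot, pod *apiv1.Pod, nodeName string) error
}

// estimateNodes runs a simplified binpacking loop: each pending pod is tried
// against the nodes added so far; if none fits, one new node is added from the
// node-group template and the predicates are re-checked before it is counted.
func estimateNodes(pods []*apiv1.Pod, snapshot ClusterSnapshot, checker PredicateChecker,
    addNewNode func() (string, error)) int {

    var newNodes []string
    for _, pod := range pods {
        placed := false
        for _, node := range newNodes {
            if checker.CheckPredicates(snapshot, pod, node) == nil {
                _ = snapshot.AddPod(pod, node)
                placed = true
                break
            }
        }
        if placed {
            continue
        }
        nodeName, err := addNewNode()
        if err != nil {
            break
        }
        // The proposed change: verify the pod actually fits the freshly added
        // node. With a zone-level DoNotSchedule spread constraint it often does
        // not, and without this check the estimator kept adding empty nodes.
        if checker.CheckPredicates(snapshot, pod, nodeName) != nil {
            _ = snapshot.RemoveNode(nodeName)
            continue
        }
        _ = snapshot.AddPod(pod, nodeName)
        newNodes = append(newNodes, nodeName)
    }
    return len(newNodes)
}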

Testing with the above change resulted in the following output:

root@a4381d640386:/infrastructure# kubectl logs cluster-autoscaler-cc4699b74-wkjmb -n kube-system cluster-autoscaler | grep Estimated
I0726 06:21:23.947747       1 scale_up.go:472] Estimated 1 nodes needed in nshkr-sbx-va6-k8s-compute-0-worker1AutoScalingGroup-1TJLUML3GSRP6
I0726 06:21:34.222356       1 scale_up.go:472] Estimated 1 nodes needed in nshkr-sbx-va6-k8s-compute-0-worker3AutoScalingGroup-10MYRA8D9IV6A
I0726 06:21:44.483110       1 scale_up.go:472] Estimated 1 nodes needed in nshkr-sbx-va6-k8s-compute-0-worker2AutoScalingGroup-1ORI890DK7IJM
I0726 06:24:05.726519       1 scale_up.go:472] Estimated 1 nodes needed in nshkr-sbx-va6-k8s-compute-0-worker1AutoScalingGroup-1TJLUML3GSRP6
I0726 06:24:15.871400       1 scale_up.go:472] Estimated 1 nodes needed in nshkr-sbx-va6-k8s-compute-0-worker3AutoScalingGroup-10MYRA8D9IV6A
I0726 06:24:26.126278       1 scale_up.go:472] Estimated 1 nodes needed in nshkr-sbx-va6-k8s-compute-0-worker2AutoScalingGroup-1ORI890DK7IJM
I0726 06:27:27.507255       1 scale_up.go:472] Estimated 1 nodes needed in nshkr-sbx-va6-k8s-compute-0-worker1AutoScalingGroup-1TJLUML3GSRP6
I0726 06:27:37.780775       1 scale_up.go:472] Estimated 1 nodes needed in nshkr-sbx-va6-k8s-compute-0-worker3AutoScalingGroup-10MYRA8D9IV6A
I0726 06:27:48.065048       1 scale_up.go:472] Estimated 1 nodes needed in nshkr-sbx-va6-k8s-compute-0-worker2AutoScalingGroup-1ORI890DK7IJM
I0726 06:30:29.295872       1 scale_up.go:472] Estimated 1 nodes needed in nshkr-sbx-va6-k8s-compute-0-worker1AutoScalingGroup-1TJLUML3GSRP6
I0726 06:30:39.459837       1 scale_up.go:472] Estimated 1 nodes needed in nshkr-sbx-va6-k8s-compute-0-worker3AutoScalingGroup-10MYRA8D9IV6A
I0726 06:30:59.718187       1 scale_up.go:472] Estimated 1 nodes needed in nshkr-sbx-va6-k8s-compute-0-worker2AutoScalingGroup-1ORI890DK7IJM
I0726 06:32:50.762266       1 scale_up.go:472] Estimated 1 nodes needed in nshkr-sbx-va6-k8s-compute-0-worker1AutoScalingGroup-1TJLUML3GSRP6
I0726 06:33:21.291955       1 scale_up.go:472] Estimated 1 nodes needed in nshkr-sbx-va6-k8s-compute-0-worker2AutoScalingGroup-1ORI890DK7IJM
I0726 06:33:41.542018       1 scale_up.go:472] Estimated 1 nodes needed in nshkr-sbx-va6-k8s-compute-0-worker3AutoScalingGroup-10MYRA8D9IV6A
I0726 06:35:32.500733       1 scale_up.go:472] Estimated 1 nodes needed in nshkr-sbx-va6-k8s-compute-0-worker1AutoScalingGroup-1TJLUML3GSRP6
I0726 06:37:13.391653       1 scale_up.go:472] Estimated 1 nodes needed in nshkr-sbx-va6-k8s-compute-0-worker3AutoScalingGroup-10MYRA8D9IV6A
root@a4381d640386:/infrastructure# kubectl logs cluster-autoscaler-cc4699b74-wkjmb -n kube-system cluster-autoscaler | grep Final
I0726 06:21:23.947802       1 scale_up.go:586] Final scale-up plan: [{nshkr-sbx-va6-k8s-compute-0-worker1AutoScalingGroup-1TJLUML3GSRP6 1->2 (max: 10)}]
I0726 06:21:34.222404       1 scale_up.go:586] Final scale-up plan: [{nshkr-sbx-va6-k8s-compute-0-worker3AutoScalingGroup-10MYRA8D9IV6A 2->3 (max: 10)}]
I0726 06:21:44.483167       1 scale_up.go:586] Final scale-up plan: [{nshkr-sbx-va6-k8s-compute-0-worker2AutoScalingGroup-1ORI890DK7IJM 2->3 (max: 10)}]
I0726 06:24:05.726596       1 scale_up.go:586] Final scale-up plan: [{nshkr-sbx-va6-k8s-compute-0-worker1AutoScalingGroup-1TJLUML3GSRP6 2->3 (max: 10)}]
I0726 06:24:15.871448       1 scale_up.go:586] Final scale-up plan: [{nshkr-sbx-va6-k8s-compute-0-worker3AutoScalingGroup-10MYRA8D9IV6A 3->4 (max: 10)}]
I0726 06:24:26.126330       1 scale_up.go:586] Final scale-up plan: [{nshkr-sbx-va6-k8s-compute-0-worker2AutoScalingGroup-1ORI890DK7IJM 3->4 (max: 10)}]
I0726 06:27:27.507310       1 scale_up.go:586] Final scale-up plan: [{nshkr-sbx-va6-k8s-compute-0-worker1AutoScalingGroup-1TJLUML3GSRP6 3->4 (max: 10)}]
I0726 06:27:37.780841       1 scale_up.go:586] Final scale-up plan: [{nshkr-sbx-va6-k8s-compute-0-worker3AutoScalingGroup-10MYRA8D9IV6A 4->5 (max: 10)}]
I0726 06:27:48.065102       1 scale_up.go:586] Final scale-up plan: [{nshkr-sbx-va6-k8s-compute-0-worker2AutoScalingGroup-1ORI890DK7IJM 4->5 (max: 10)}]
I0726 06:30:29.295935       1 scale_up.go:586] Final scale-up plan: [{nshkr-sbx-va6-k8s-compute-0-worker1AutoScalingGroup-1TJLUML3GSRP6 4->5 (max: 10)}]
I0726 06:30:39.459895       1 scale_up.go:586] Final scale-up plan: [{nshkr-sbx-va6-k8s-compute-0-worker3AutoScalingGroup-10MYRA8D9IV6A 5->6 (max: 10)}]
I0726 06:30:59.718238       1 scale_up.go:586] Final scale-up plan: [{nshkr-sbx-va6-k8s-compute-0-worker2AutoScalingGroup-1ORI890DK7IJM 5->6 (max: 10)}]
I0726 06:32:50.762324       1 scale_up.go:586] Final scale-up plan: [{nshkr-sbx-va6-k8s-compute-0-worker1AutoScalingGroup-1TJLUML3GSRP6 5->6 (max: 10)}]
I0726 06:33:21.292028       1 scale_up.go:586] Final scale-up plan: [{nshkr-sbx-va6-k8s-compute-0-worker2AutoScalingGroup-1ORI890DK7IJM 6->7 (max: 10)}]
I0726 06:33:41.542072       1 scale_up.go:586] Final scale-up plan: [{nshkr-sbx-va6-k8s-compute-0-worker3AutoScalingGroup-10MYRA8D9IV6A 6->7 (max: 10)}]
I0726 06:35:32.500807       1 scale_up.go:586] Final scale-up plan: [{nshkr-sbx-va6-k8s-compute-0-worker1AutoScalingGroup-1TJLUML3GSRP6 6->7 (max: 10)}]
I0726 06:37:13.391702       1 scale_up.go:586] Final scale-up plan: [{nshkr-sbx-va6-k8s-compute-0-worker3AutoScalingGroup-10MYRA8D9IV6A 7->8 (max: 10)}]

Results/Analysis:
With the fix,

  • The CA no longer scales out massively when scaling a deployment with a failure-domain.beta.kubernetes.io/zone topologySpreadConstraint defined.
  • The distribution of newly scaled nodes across AZs is balanced.

Any feedback or suggestions around this would help, as we are observing this issue frequently.

nshekhar221 commented Aug 4, 2021

@MaciekPytel Do the changes in cluster-autoscaler-release-1.21...nshekhar221:cluster-autoscaler-1.21.0-with-fix look like a viable solution for this issue?

The initial testing logs (shared above) suggest that they help with the massive scale-out when using failure-domain.beta.kubernetes.io/zone topologySpreadConstraints.

Please also let us know if there are any concerns around them.

Happy to raise a PR if the suggested changes look fine.

@MaciekPytel
Contributor

The changes make a lot of sense and I agree they could help with this issue. One comment:

ExpansionOption also has a list of pods that will be helped by the scale-up. This fix changes the estimated node number, but it doesn't modify that list of pods. That means the expander (the heuristic that selects between available scale-up options) will act as if all those pending pods could be scheduled on a very small number of nodes.
I think the best way to fix this would be to keep track of which pods were actually "scheduled" in the Estimator and override ExpansionOption.Pods based on that. Since Estimator only has a single implementation now, I don't see any problem with changing the interface so that this information can be returned.
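
To illustrate the suggested direction (this is only a sketch, not the actual cluster-autoscaler interface; the real one lives in the estimator package and may differ, and NodeTemplate is a hypothetical stand-in for the node-group template), the key change is that Estimate would report both the node count and the pods it actually managed to place, so ExpansionOption.Pods can be overridden accordingly:

package estimator

import apiv1 "k8s.io/api/core/v1"

// NodeTemplate is a stand-in for the node-group template that the estimator
// receives for the node group being considered.
type NodeTemplate struct {
    Name string
}

// Estimator is a simplified version of the interface with the suggested change:
// Estimate returns not only how many nodes are needed but also which pods it
// actually managed to place during simulated binpacking. Pods that could not be
// placed (e.g. blocked by a topology spread constraint) are excluded, so
// ExpansionOption.Pods reflects what the scale-up would really help with.
type Estimator interface {
    Estimate(pods []*apiv1.Pod, template *NodeTemplate) (nodeCount int, scheduledPods []*apiv1.Pod)
}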

Also, for future reference only: removing a node from the snapshot is an expensive operation, as it drops internal caches. I suspect that with a lot of pending pods using topology spreading one may run into scalability problems with binpacking (which is obviously still a major improvement on the current state).

  • This could be optimized by not removing the node if CheckPredicates() fails and just remembering that it's empty, so we don't add another empty node for the next pod and don't count it towards the result if it remains empty at the end (see the sketch after this list).
  • I think it would be premature and needlessly complex to add this optimization now; it's just something to keep in mind if we run into scalability issues with this later on.
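
A rough sketch of that optimization, reusing the stand-in types from the earlier sketch (again an assumption-laden illustration, not actual cluster-autoscaler code): a node that fails the predicate check is kept in the snapshot, remembered as empty, offered to later pods, and excluded from the final count if it stays empty.

// estimateNodesKeepEmpty is a variant of estimateNodes above: a node whose
// predicate check fails is kept in the snapshot (removal drops internal
// caches), remembered as empty so no further empty node is added while one
// already exists, and only counted if a pod eventually lands on it.
func estimateNodesKeepEmpty(pods []*apiv1.Pod, snapshot ClusterSnapshot, checker PredicateChecker,
    addNewNode func() (string, error)) int {

    podsOnNode := map[string]int{}
    var newNodes []string
    emptyNode := "" // a previously added node that is still empty, if any

    for _, pod := range pods {
        placed := false
        for _, node := range newNodes {
            if checker.CheckPredicates(snapshot, pod, node) == nil {
                _ = snapshot.AddPod(pod, node)
                podsOnNode[node]++
                if node == emptyNode {
                    emptyNode = ""
                }
                placed = true
                break
            }
        }
        if placed || emptyNode != "" {
            // Either the pod was placed, or an identical empty node already
            // rejected it, so adding another empty node cannot help.
            continue
        }
        nodeName, err := addNewNode()
        if err != nil {
            break
        }
        newNodes = append(newNodes, nodeName)
        if checker.CheckPredicates(snapshot, pod, nodeName) == nil {
            _ = snapshot.AddPod(pod, nodeName)
            podsOnNode[nodeName]++
        } else {
            emptyNode = nodeName // keep it around instead of removing it
        }
    }

    // Count only the nodes that ended up hosting at least one simulated pod.
    needed := 0
    for _, node := range newNodes {
        if podsOnNode[node] > 0 {
            needed++
        }
    }
    return needed
}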

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label on Dec 14, 2021
@pierluigilenoci
Contributor

/remove-lifecycle stale

@k8s-ci-robot removed the lifecycle/stale label on Dec 14, 2021
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label on Mar 14, 2022
@pierluigilenoci
Contributor

/remove-lifecycle stale

@k8s-ci-robot removed the lifecycle/stale label on Mar 14, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label on Jun 12, 2022
@pierluigilenoci
Contributor

/remove-lifecycle stale

@k8s-ci-robot removed the lifecycle/stale label on Jun 13, 2022
@MartinEmrich

I just tried topologySpreadConstraints again, and it still happens for me.

I then found out that the latest CA version for EKS 1.22 is indeed 1.22.1 (which I used), from 2021, so the fix cannot be in there.
As far as I can tell, the only release containing the fix is 1.25 so far.

What options are there if one is stuck with Kubernetes 1.22 or 1.23 (e.g. AWS EKS)? The docs clearly state that the versions should match up....

@MartinEmrich

Just started evaluating EKS 1.24, and tried CA 1.25 with it.

It seems to work just fine, and the fix for this issue is included, too. So no more scale-out explosions with topology spread constraints.
