[cluster-autoscaler][AWS] Massive scale-out when using composed topologySpreadConstraints #4129

Closed
alexcristi opened this issue Jun 9, 2021 · 12 comments · Fixed by #4970
Labels: area/cluster-autoscaler, kind/bug

alexcristi commented Jun 9, 2021

Which component are you using?:

cluster-autoscaler

What version of the component are you using?:

Component version:
1.20.0

What k8s version are you using (kubectl version)?:

kubectl version Output
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"16", GitVersion:"v1.16.2", GitCommit:"c97fe5036ef3df2967d086711e6c0c405941e14b", GitTreeState:"clean", BuildDate:"2019-10-15T23:41:55Z", GoVersion:"go1.12.10", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"19", GitVersion:"v1.19.7", GitCommit:"1dd5338295409edcfff11505e7bb246f0d325d15", GitTreeState:"clean", BuildDate:"2021-01-13T13:15:20Z", GoVersion:"go1.15.5", Compiler:"gc", Platform:"linux/amd64"}

What environment is this in?:

AWS

What did you expect to happen?:

If I request 50 pods, in the worst case I expect at most 50 new nodes to be provisioned. A small delta is acceptable.

What happened instead?:

A deployment scaled from 3 pods to 50 pods, and the cluster-autoscaler provisioned 124 new nodes (roughly 3 times more than needed).

How to reproduce it (as minimally and precisely as possible):

  • Have a Kubernetes cluster in AWS with the ASGs split by AZ (one ASG per availability zone, with the balance-similar-node-groups flag enabled)
  • Have a deployment with composed topologySpreadConstraints:
      topologySpreadConstraints:
        - labelSelector:
            matchLabels:
              app: sample
          maxSkew: 1
          topologyKey: failure-domain.beta.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule
        - labelSelector:
            matchLabels:
              app: sample
          maxSkew: 1
          topologyKey: kubernetes.io/hostname
          whenUnsatisfiable: ScheduleAnyway
  • Trigger a scale-out

Anything else we need to know?:

I0609 12:23:05.391344       1 scale_up.go:288] Pod sample-deployment-6567d494d-msmqx can't be scheduled on yaldo3-sbx-va6-k8s-compute-1-worker3AutoScalingGroup-SI93SIX7YS99, predicate checking error: node(s) didn't match pod topology spread constraints; predicateName=PodTopologySpread; reasons: node(s) didn't match pod topology spread constraints; debugInfo=
I0609 12:23:05.391358       1 scale_up.go:290] 38 other pods similar to sample-deployment-6567d494d-msmqx can't be scheduled on yaldo3-sbx-va6-k8s-compute-1-worker3AutoScalingGroup-SI93SIX7YS99


doalexan-macOS:~ doalexan$ ks logs cluster-autoscaler-757bc688c7-ctfgw -c cluster-autoscaler | grep Estimated
I0609 12:22:44.484873       1 scale_up.go:460] Estimated 43 nodes needed in yaldo3-sbx-va6-k8s-compute-1-worker1AutoScalingGroup-IEVC83OH6WBI
I0609 12:22:54.780549       1 scale_up.go:460] Estimated 41 nodes needed in yaldo3-sbx-va6-k8s-compute-1-worker3AutoScalingGroup-SI93SIX7YS99
I0609 12:23:05.391416       1 scale_up.go:460] Estimated 38 nodes needed in yaldo3-sbx-va6-k8s-compute-1-worker2AutoScalingGroup-14C4MNCP75I8W
doalexan-macOS:~ doalexan$ ks logs cluster-autoscaler-757bc688c7-ctfgw -c cluster-autoscaler | grep "Best option to resize"
I0609 12:22:44.484866       1 scale_up.go:456] Best option to resize: yaldo3-sbx-va6-k8s-compute-1-worker1AutoScalingGroup-IEVC83OH6WBI
I0609 12:22:54.780542       1 scale_up.go:456] Best option to resize: yaldo3-sbx-va6-k8s-compute-1-worker3AutoScalingGroup-SI93SIX7YS99
I0609 12:23:05.391402       1 scale_up.go:456] Best option to resize: yaldo3-sbx-va6-k8s-compute-1-worker2AutoScalingGroup-14C4MNCP75I8W
doalexan-macOS:~ doalexan$ ks logs cluster-autoscaler-757bc688c7-ctfgw -c cluster-autoscaler | grep "Final"
I0609 12:22:44.484915       1 scale_up.go:574] Final scale-up plan: [{yaldo3-sbx-va6-k8s-compute-1-worker1AutoScalingGroup-IEVC83OH6WBI 9->52 (max: 1000)}]
I0609 12:22:54.780601       1 scale_up.go:574] Final scale-up plan: [{yaldo3-sbx-va6-k8s-compute-1-worker3AutoScalingGroup-SI93SIX7YS99 5->46 (max: 1000)}]
I0609 12:23:05.391476       1 scale_up.go:574] Final scale-up plan: [{yaldo3-sbx-va6-k8s-compute-1-worker2AutoScalingGroup-14C4MNCP75I8W 14->52 (max: 1000)}]
@alexcristi added the kind/bug label on Jun 9, 2021
@MartinEmrich

I can confirm it with a single topologySpreadConstraint:

      topologySpreadConstraints:
        - topologyKey: "topology.kubernetes.io/zone"
          maxSkew: 1
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchExpressions:
            - key: app
              operator: In
              values:
              - myApp

After scaling a deployment from just 2 to 30 replicas (which should have fit easily on a few nodes), CA started scaling up all node groups to their maximum within a few seconds.

(CA 1.20.0, EKS 1.20, 1 ASG per AZ)

Might be related to #4099?

nshekhar221 commented Jul 27, 2021

Observing the same behaviour after testing with v1.21:

  • In an AWS environment with the ASGs split by AZ (one ASG per availability zone, with the
    balance-similar-node-groups flag enabled)
  • For a deployment with a failure-domain.beta.kubernetes.io/zone topologySpreadConstraint.
root@a4381d640386:/infrastructure# kubectl  get pods cluster-autoscaler-596fd6869f-l2wj8 -n kube-system -o yaml| grep -i "v1.21.0"
    ...
    image: us.gcr.io/k8s-artifacts-prod/autoscaling/cluster-autoscaler:v1.21.0
    ...

root@a4381d640386:/infrastructure# kubectl logs cluster-autoscaler-596fd6869f-l2wj8 -n kube-system cluster-autoscaler | grep Estimated
I0726 07:40:22.515606       1 scale_up.go:472] Estimated 2 nodes needed in nshkr-sbx-va6-k8s-compute-0-worker1AutoScalingGroup-1TJLUML3GSRP6
I0726 07:43:43.782579       1 scale_up.go:472] Estimated 44 nodes needed in nshkr-sbx-va6-k8s-compute-0-worker3AutoScalingGroup-10MYRA8D9IV6A
I0726 07:43:54.011754       1 scale_up.go:472] Estimated 41 nodes needed in nshkr-sbx-va6-k8s-compute-0-worker2AutoScalingGroup-1ORI890DK7IJM

Taking it a bit further, I tried the changes suggested in #4099 (i.e. adding a predicateChecker.CheckPredicates call after adding a new node to the snapshot in binpacking_estimator.go, to check whether the pod can actually be scheduled on that new node).

(cluster-autoscaler-release-1.21...nshekhar221:cluster-autoscaler-1.21.0-with-fix)
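
For illustration, here is a rough, simplified sketch of that idea (not the code from the branch above; ClusterSnapshot, PredicateChecker and addNewNode are stand-ins for the real cluster-autoscaler types, and the signatures are assumptions): after binpacking adds a fresh node for a pod that did not fit the nodes added so far, the predicates are re-checked against that node before it is counted, so a pod blocked by a zone-level spread constraint no longer produces an endless stream of empty nodes.

package estimator

import apiv1 "k8s.io/api/core/v1"

// Simplified stand-ins for the real cluster-autoscaler interfaces.
type ClusterSnapshot interface {
    AddPod(pod *apiv1.Pod, nodeName string) error
    RemoveNode(nodeName string) error
}

type PredicateChecker interface {
    // CheckPredicates returns nil when the pod fits the named node in the snapshot.
    CheckPredicates(snapshot ClusterSnapshot, pod *apiv1.Pod, nodeName string) error
}

// estimateNodes runs a simplified binpacking loop: each pending pod is tried
// against the nodes added so far; if none fits, one new node is added from the
// node-group template and the predicates are re-checked before it is counted.
func estimateNodes(pods []*apiv1.Pod, snapshot ClusterSnapshot, checker PredicateChecker,
    addNewNode func() (string, error)) int {

    var newNodes []string
    for _, pod := range pods {
        placed := false
        for _, node := range newNodes {
            if checker.CheckPredicates(snapshot, pod, node) == nil {
                _ = snapshot.AddPod(pod, node)
                placed = true
                break
            }
        }
        if placed {
            continue
        }
        nodeName, err := addNewNode()
        if err != nil {
            break
        }
        // The proposed change: verify the pod actually fits the freshly added
        // node. With a zone-level DoNotSchedule spread constraint it often does
        // not, and without this check the estimator kept adding empty nodes.
        if checker.CheckPredicates(snapshot, pod, nodeName) != nil {
            _ = snapshot.RemoveNode(nodeName)
            continue
        }
        _ = snapshot.AddPod(pod, nodeName)
        newNodes = append(newNodes, nodeName)
    }
    return len(newNodes)
}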

Testing with the above change resulted in the following output:

root@a4381d640386:/infrastructure# kubectl logs cluster-autoscaler-cc4699b74-wkjmb -n kube-system cluster-autoscaler | grep Estimated
I0726 06:21:23.947747       1 scale_up.go:472] Estimated 1 nodes needed in nshkr-sbx-va6-k8s-compute-0-worker1AutoScalingGroup-1TJLUML3GSRP6
I0726 06:21:34.222356       1 scale_up.go:472] Estimated 1 nodes needed in nshkr-sbx-va6-k8s-compute-0-worker3AutoScalingGroup-10MYRA8D9IV6A
I0726 06:21:44.483110       1 scale_up.go:472] Estimated 1 nodes needed in nshkr-sbx-va6-k8s-compute-0-worker2AutoScalingGroup-1ORI890DK7IJM
I0726 06:24:05.726519       1 scale_up.go:472] Estimated 1 nodes needed in nshkr-sbx-va6-k8s-compute-0-worker1AutoScalingGroup-1TJLUML3GSRP6
I0726 06:24:15.871400       1 scale_up.go:472] Estimated 1 nodes needed in nshkr-sbx-va6-k8s-compute-0-worker3AutoScalingGroup-10MYRA8D9IV6A
I0726 06:24:26.126278       1 scale_up.go:472] Estimated 1 nodes needed in nshkr-sbx-va6-k8s-compute-0-worker2AutoScalingGroup-1ORI890DK7IJM
I0726 06:27:27.507255       1 scale_up.go:472] Estimated 1 nodes needed in nshkr-sbx-va6-k8s-compute-0-worker1AutoScalingGroup-1TJLUML3GSRP6
I0726 06:27:37.780775       1 scale_up.go:472] Estimated 1 nodes needed in nshkr-sbx-va6-k8s-compute-0-worker3AutoScalingGroup-10MYRA8D9IV6A
I0726 06:27:48.065048       1 scale_up.go:472] Estimated 1 nodes needed in nshkr-sbx-va6-k8s-compute-0-worker2AutoScalingGroup-1ORI890DK7IJM
I0726 06:30:29.295872       1 scale_up.go:472] Estimated 1 nodes needed in nshkr-sbx-va6-k8s-compute-0-worker1AutoScalingGroup-1TJLUML3GSRP6
I0726 06:30:39.459837       1 scale_up.go:472] Estimated 1 nodes needed in nshkr-sbx-va6-k8s-compute-0-worker3AutoScalingGroup-10MYRA8D9IV6A
I0726 06:30:59.718187       1 scale_up.go:472] Estimated 1 nodes needed in nshkr-sbx-va6-k8s-compute-0-worker2AutoScalingGroup-1ORI890DK7IJM
I0726 06:32:50.762266       1 scale_up.go:472] Estimated 1 nodes needed in nshkr-sbx-va6-k8s-compute-0-worker1AutoScalingGroup-1TJLUML3GSRP6
I0726 06:33:21.291955       1 scale_up.go:472] Estimated 1 nodes needed in nshkr-sbx-va6-k8s-compute-0-worker2AutoScalingGroup-1ORI890DK7IJM
I0726 06:33:41.542018       1 scale_up.go:472] Estimated 1 nodes needed in nshkr-sbx-va6-k8s-compute-0-worker3AutoScalingGroup-10MYRA8D9IV6A
I0726 06:35:32.500733       1 scale_up.go:472] Estimated 1 nodes needed in nshkr-sbx-va6-k8s-compute-0-worker1AutoScalingGroup-1TJLUML3GSRP6
I0726 06:37:13.391653       1 scale_up.go:472] Estimated 1 nodes needed in nshkr-sbx-va6-k8s-compute-0-worker3AutoScalingGroup-10MYRA8D9IV6A
root@a4381d640386:/infrastructure# kubectl logs cluster-autoscaler-cc4699b74-wkjmb -n kube-system cluster-autoscaler | grep Final
I0726 06:21:23.947802       1 scale_up.go:586] Final scale-up plan: [{nshkr-sbx-va6-k8s-compute-0-worker1AutoScalingGroup-1TJLUML3GSRP6 1->2 (max: 10)}]
I0726 06:21:34.222404       1 scale_up.go:586] Final scale-up plan: [{nshkr-sbx-va6-k8s-compute-0-worker3AutoScalingGroup-10MYRA8D9IV6A 2->3 (max: 10)}]
I0726 06:21:44.483167       1 scale_up.go:586] Final scale-up plan: [{nshkr-sbx-va6-k8s-compute-0-worker2AutoScalingGroup-1ORI890DK7IJM 2->3 (max: 10)}]
I0726 06:24:05.726596       1 scale_up.go:586] Final scale-up plan: [{nshkr-sbx-va6-k8s-compute-0-worker1AutoScalingGroup-1TJLUML3GSRP6 2->3 (max: 10)}]
I0726 06:24:15.871448       1 scale_up.go:586] Final scale-up plan: [{nshkr-sbx-va6-k8s-compute-0-worker3AutoScalingGroup-10MYRA8D9IV6A 3->4 (max: 10)}]
I0726 06:24:26.126330       1 scale_up.go:586] Final scale-up plan: [{nshkr-sbx-va6-k8s-compute-0-worker2AutoScalingGroup-1ORI890DK7IJM 3->4 (max: 10)}]
I0726 06:27:27.507310       1 scale_up.go:586] Final scale-up plan: [{nshkr-sbx-va6-k8s-compute-0-worker1AutoScalingGroup-1TJLUML3GSRP6 3->4 (max: 10)}]
I0726 06:27:37.780841       1 scale_up.go:586] Final scale-up plan: [{nshkr-sbx-va6-k8s-compute-0-worker3AutoScalingGroup-10MYRA8D9IV6A 4->5 (max: 10)}]
I0726 06:27:48.065102       1 scale_up.go:586] Final scale-up plan: [{nshkr-sbx-va6-k8s-compute-0-worker2AutoScalingGroup-1ORI890DK7IJM 4->5 (max: 10)}]
I0726 06:30:29.295935       1 scale_up.go:586] Final scale-up plan: [{nshkr-sbx-va6-k8s-compute-0-worker1AutoScalingGroup-1TJLUML3GSRP6 4->5 (max: 10)}]
I0726 06:30:39.459895       1 scale_up.go:586] Final scale-up plan: [{nshkr-sbx-va6-k8s-compute-0-worker3AutoScalingGroup-10MYRA8D9IV6A 5->6 (max: 10)}]
I0726 06:30:59.718238       1 scale_up.go:586] Final scale-up plan: [{nshkr-sbx-va6-k8s-compute-0-worker2AutoScalingGroup-1ORI890DK7IJM 5->6 (max: 10)}]
I0726 06:32:50.762324       1 scale_up.go:586] Final scale-up plan: [{nshkr-sbx-va6-k8s-compute-0-worker1AutoScalingGroup-1TJLUML3GSRP6 5->6 (max: 10)}]
I0726 06:33:21.292028       1 scale_up.go:586] Final scale-up plan: [{nshkr-sbx-va6-k8s-compute-0-worker2AutoScalingGroup-1ORI890DK7IJM 6->7 (max: 10)}]
I0726 06:33:41.542072       1 scale_up.go:586] Final scale-up plan: [{nshkr-sbx-va6-k8s-compute-0-worker3AutoScalingGroup-10MYRA8D9IV6A 6->7 (max: 10)}]
I0726 06:35:32.500807       1 scale_up.go:586] Final scale-up plan: [{nshkr-sbx-va6-k8s-compute-0-worker1AutoScalingGroup-1TJLUML3GSRP6 6->7 (max: 10)}]
I0726 06:37:13.391702       1 scale_up.go:586] Final scale-up plan: [{nshkr-sbx-va6-k8s-compute-0-worker3AutoScalingGroup-10MYRA8D9IV6A 7->8 (max: 10)}]

Results/Analysis:
With the fix,

  • The CA no longer scales out massively when scaling a deployment with a failure-domain.beta.kubernetes.io/zone topologySpreadConstraint defined.
  • The distribution of newly scaled nodes across AZs is balanced.

Any feedback or suggestions around this would help, as we are observing this issue frequently.

nshekhar221 commented Aug 4, 2021

@MaciekPytel Do the changes in cluster-autoscaler-release-1.21...nshekhar221:cluster-autoscaler-1.21.0-with-fix look like a viable solution for this issue?

The initial testing logs (shared above) suggest that they help with the massive scale-out when using failure-domain.beta.kubernetes.io/zone topologySpreadConstraints.

Please also let us know if there are any concerns around them.

Happy to raise a PR if the suggested changes look fine.

@MaciekPytel
Contributor

The changes make a lot of sense and I agree they could help with this issue. One comment:

ExpansionOption also has a list of pods that will be helped by the scale-up. This fix changes the estimated node number, but it doesn't modify that list of pods. That means the expander (the heuristic that selects between available scale-up options) will act as if all those pending pods could be scheduled on a very small number of nodes.
I think the best way to fix this would be to keep track of which pods were actually "scheduled" in the Estimator and override ExpansionOption.Pods based on that. Since Estimator only has a single implementation now, I don't see any problem with changing the interface so that this information can be returned.
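
To illustrate the suggested direction (this is only a sketch, not the actual cluster-autoscaler interface; the real one lives in the estimator package and may differ, and NodeTemplate is a hypothetical stand-in for the node-group template), the key change is that Estimate would report both the node count and the pods it actually managed to place, so ExpansionOption.Pods can be overridden accordingly:

package estimator

import apiv1 "k8s.io/api/core/v1"

// NodeTemplate is a stand-in for the node-group template that the estimator
// receives for the node group being considered.
type NodeTemplate struct {
    Name string
}

// Estimator is a simplified version of the interface with the suggested change:
// Estimate returns not only how many nodes are needed but also which pods it
// actually managed to place during simulated binpacking. Pods that could not be
// placed (e.g. blocked by a topology spread constraint) are excluded, so
// ExpansionOption.Pods reflects what the scale-up would really help with.
type Estimator interface {
    Estimate(pods []*apiv1.Pod, template *NodeTemplate) (nodeCount int, scheduledPods []*apiv1.Pod)
}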

Also, for future reference only: removing a node from the snapshot is an expensive operation, as it drops internal caches. I suspect that with a lot of pending pods using topology spreading one may run into scalability problems with binpacking (which is obviously still a major improvement on the current state).

  • This could be optimized by not removing the node if CheckPredicates() fails and just remembering that it's empty, so we don't add another empty node for the next pod and don't count it towards the result if it remains empty at the end (see the sketch after this list).
  • I think it would be premature and needlessly complex to add this optimization now; it's just something to keep in mind if we run into scalability issues with this later on.
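
A rough sketch of that optimization, reusing the stand-in types from the earlier sketch (again an assumption-laden illustration, not actual cluster-autoscaler code): a node that fails the predicate check is kept in the snapshot, remembered as empty, offered to later pods, and excluded from the final count if it stays empty.

// estimateNodesKeepEmpty is a variant of estimateNodes above: a node whose
// predicate check fails is kept in the snapshot (removal drops internal
// caches), remembered as empty so no further empty node is added while one
// already exists, and only counted if a pod eventually lands on it.
func estimateNodesKeepEmpty(pods []*apiv1.Pod, snapshot ClusterSnapshot, checker PredicateChecker,
    addNewNode func() (string, error)) int {

    podsOnNode := map[string]int{}
    var newNodes []string
    emptyNode := "" // a previously added node that is still empty, if any

    for _, pod := range pods {
        placed := false
        for _, node := range newNodes {
            if checker.CheckPredicates(snapshot, pod, node) == nil {
                _ = snapshot.AddPod(pod, node)
                podsOnNode[node]++
                if node == emptyNode {
                    emptyNode = ""
                }
                placed = true
                break
            }
        }
        if placed || emptyNode != "" {
            // Either the pod was placed, or an identical empty node already
            // rejected it, so adding another empty node cannot help.
            continue
        }
        nodeName, err := addNewNode()
        if err != nil {
            break
        }
        newNodes = append(newNodes, nodeName)
        if checker.CheckPredicates(snapshot, pod, nodeName) == nil {
            _ = snapshot.AddPod(pod, nodeName)
            podsOnNode[nodeName]++
        } else {
            emptyNode = nodeName // keep it around instead of removing it
        }
    }

    // Count only the nodes that ended up hosting at least one simulated pod.
    needed := 0
    for _, node := range newNodes {
        if podsOnNode[node] > 0 {
            needed++
        }
    }
    return needed
}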

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label on Dec 14, 2021
@pierluigilenoci
Contributor

/remove-lifecycle stale

@k8s-ci-robot removed the lifecycle/stale label on Dec 14, 2021
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label on Mar 14, 2022
@pierluigilenoci
Contributor

/remove-lifecycle stale

@k8s-ci-robot removed the lifecycle/stale label on Mar 14, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label on Jun 12, 2022
@pierluigilenoci
Contributor

/remove-lifecycle stale

@k8s-ci-robot removed the lifecycle/stale label on Jun 13, 2022
@MartinEmrich

I just tried topologySpreadConstraints again, and it still happens for me.

I then found out that the latest CA version for EKS 1.22 is indeed 1.22.1 (which I used), from 2021, so the fix cannot be in there.
As far as I can tell, the only release containing the fix is 1.25 so far.

What options are there if one is stuck with Kubernetes 1.22 or 1.23 (e.g. AWS EKS)? The docs clearly state that the versions should match up....

@MartinEmrich

Just started evaluating EKS 1.24, and tried CA 1.25 with it.

It seems to work just fine, and the fix for this issue is included, too. So no more scale-out explosions with topology spread constraints.
