Binpacking can exit without packing all the pods #4970

Merged
merged 3 commits into kubernetes:master on Jun 20, 2022

Conversation

@MaciekPytel (Contributor) commented Jun 14, 2022

Which component does this PR apply to?

cluster-autoscaler

What type of PR is this?

/kind bug

What this PR does / why we need it:

This PR updates the Estimator interface to allow a binpacking result that includes only some of the pending pods. This is used to address two different problems:

  • Fix an issue where CA would drastically overestimate (and, as a result, overshoot the scale-up) the number of nodes needed for pods using zonal constraints (PodTopologySpreading or PodAntiAffinity on zonal topology) ([cluster-autoscaler][AWS] Massive scale-out when using composed topologySpreadConstraints #4129).
  • Limit the maximum size of a single binpacking via either a node count or a maximum binpacking time. Binpacking is ~O(#pending_pods * #nodes_delta) and a very large scale-up can take an extremely long time to calculate. Limiting binpacking size/duration protects against this issue; instead of calculating a single enormous scale-up, CA will be able to calculate a sequence of smaller ones over multiple loops.
    • This behavior is controlled by the newly introduced flags --max-nodes-per-scaleup and --max-nodegroup-binpacking-duration.
    • On the implementation side, the limiting decision is delegated to a new EstimationLimiter. This is intended to allow smarter customization (e.g. cutting binpacking early if an external quota would limit the scale-up anyway); see the sketch after this list.
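
To make the EstimationLimiter idea above concrete, here is a minimal sketch of what such a limiter could look like. The names and signatures below (estimationLimiter, thresholdLimiter, maxNodes, maxDuration) are illustrative assumptions, not the exact cluster-autoscaler API; in particular, the real interface also receives context such as the NodeGroup being estimated (see the review discussion below), which is omitted here for brevity.

package estimator

import "time"

// estimationLimiter sketches an interface in the spirit of the
// EstimationLimiter described above (illustrative, not the real API).
type estimationLimiter interface {
    StartEstimation()          // called before binpacking of a node group begins
    EndEstimation()            // called after binpacking of a node group ends
    PermissionToAddNode() bool // may binpacking add one more node?
}

// thresholdLimiter stops binpacking after maxNodes added nodes or after
// maxDuration of wall-clock time, whichever comes first; a zero value
// disables the corresponding limit.
type thresholdLimiter struct {
    maxNodes    int
    maxDuration time.Duration
    added       int
    start       time.Time
}

var _ estimationLimiter = &thresholdLimiter{}

func (l *thresholdLimiter) StartEstimation() {
    l.added = 0
    l.start = time.Now()
}

func (l *thresholdLimiter) EndEstimation() {}

func (l *thresholdLimiter) PermissionToAddNode() bool {
    if l.maxNodes > 0 && l.added >= l.maxNodes {
        return false
    }
    if l.maxDuration > 0 && time.Since(l.start) > l.maxDuration {
        return false
    }
    l.added++
    return true
}

With a limiter of this shape, --max-nodes-per-scaleup and --max-nodegroup-binpacking-duration would simply populate maxNodes and maxDuration, and setting either flag to 0 would disable that particular limit.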

Which issue(s) this PR fixes:

Fixes #4129

Does this PR introduce a user-facing change?

 * Fix an issue where CA could drastically overshoot scale-up for pods using zonal scheduling constraints (PodTopologySpreading or PodAntiAffinity on zonal topology).
 * Limit the maximum duration of the binpacking simulation to prevent CA from becoming unresponsive in huge scale-up scenarios. Introduce --max-nodes-per-scaleup and --max-nodegroup-binpacking-duration flags that can be used to control this behavior (note: those flags are only meant for fine-tuning scale-up calculation latency; they're not intended for rate-limiting scale-up).

Additional notes for reviewer

The unit test "zonal topology spreading with maxSkew=2 only allows 2 pods to schedule" reproduces #4129. It fails without the binpacking_estimator changes in the last commit and passes with them.
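
For context on why maxSkew=2 caps scheduling at two pods: with a zonal topology key and all of the new capacity coming from a single zone (while the other eligible zones stay at zero matching pods), placing a third pod in that zone would push the skew above 2. An illustrative constraint of roughly that shape is sketched below; the label selector and values are assumptions, not copied from the test.

package example

import (
    corev1 "k8s.io/api/core/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// zonalSpread is an illustrative zonal topology spread constraint of the
// kind involved in #4129; maxSkew=2 means at most two more pods in the
// most-loaded zone than in the least-loaded eligible zone.
var zonalSpread = corev1.TopologySpreadConstraint{
    MaxSkew:           2,
    TopologyKey:       "topology.kubernetes.io/zone",
    WhenUnsatisfiable: corev1.DoNotSchedule,
    LabelSelector: &metav1.LabelSelector{
        MatchLabels: map[string]string{"app": "example"}, // assumed labels
    },
}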

@k8s-ci-robot added the do-not-merge/work-in-progress, kind/bug, cncf-cla: yes and size/L labels on Jun 14, 2022.
@MaciekPytel (Contributor Author)

/hold
For additional manual testing

@k8s-ci-robot added the do-not-merge/hold label on Jun 14, 2022.
@MaciekPytel (Contributor Author)

/assign @x13n

@k8s-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: MaciekPytel

The full list of commands accepted by this bot can be found here.

The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot added the approved label on Jun 14, 2022.
}

if !found {
    // Stop binpacking if we reach the limit of nodes we can add.
    // We return the result of the binpacking that we already performed.
    if !e.limiter.PermissionToAddNodes(1) {
Member

Do you foresee any estimator changes soon that would require the num nodes argument to be anything other than 1?

Contributor Author

I don't expect us to add a new estimator anytime soon. It is technically a pluggable interface (similar to expander), so one could make an argument for trying to be generic. My thought process was that the previous estimator implementation (which we deprecated and dropped a while ago) may have used it.

That being said, you may be right and having the node count as a parameter is probably overkill. Do you want me to drop it?

Member

Yeah, let's have the simplest interface that works for the existing estimator. We can always extend it in the future in a backwards-compatible way.

Contributor Author

done

        break
    }

// If the last node we've added is empty and the pod couldn't schedule on it, it wouldn't be able to schedule
Member

nit: Limiting binpacking size/time and checking whether any pod was scheduled to a new node are separate optimizations; they could've been separate PRs.

Contributor Author

I think this optimization is necessary for fixing #4129 without affecting performance. Previously we'd never have empty nodes in binpacking, so that wasn't a concern.

Fair point that the scalability improvements and the fix for #4129 should have been separate PRs. They're separate commits, and if you want I can easily split them into separate PRs. But I'm guessing you mean it more as feedback for future PRs?

Member

Oh, I didn't actually notice they are separate commits; that would have made my reviewing easier... 🤦 Given the low number of my comments, I don't think splitting this PR into two is worth the effort, so I guess that's feedback for future PRs.

@x13n (Member) commented Jun 15, 2022

Looks good overall, just need to address the test failure. I'm also wondering if we are going to need such a generic API.

@MaciekPytel (Contributor Author)

"I'm also wondering if we are going to need such a generic API."

Do you mean the specific function signatures in EstimationLimiter or the interface itself? I think there are multiple custom optimizations that could be made to the limiter, including platform-specific ones (like the quota example in the PR description, or cutting the binpacking based on max nodepool size). I think passing NodeGroup to the limiter is a prerequisite for most of those, so I kinda like that part.

Passing a node count to PermissionToAddNodes may have been overkill; I can remove it if you think it complicates things too much.

@x13n (Member) commented Jun 15, 2022

I just meant the node count, apologies for a vague comment!

@MaciekPytel changed the title from "WIP: Binpacking can exit without packing all the pods" to "Binpacking can exit without packing all the pods" on Jun 17, 2022.
@k8s-ci-robot removed the do-not-merge/work-in-progress label on Jun 17, 2022.
@MaciekPytel (Contributor Author)

I did scalability testing with a scale-up of a few thousand nodes and confirmed that CA scale-up loops remained short and that the large number of pods no longer caused CA to crashloop.
I did see some very long scale-down loops, but those happened before this change as well and they're a separate issue.
Removing the hold for testing.
/hold cancel

@k8s-ci-robot removed the do-not-merge/hold label on Jun 17, 2022.
name: "no limiting happens",
maxNodes: 20,
operations: []limiterRequest{
permissionRequest{false},
Member

nit: I think instead of using the pattern of applying operations here, it would be more readable to just have a list of func() that are invoked, so that they have meaningful names in each test case, like so:

operations: []operation{
  expectDeny,
  restartLimiter,
  expectAllow,
}

Contributor Author

You're right, that's much better. Updated the test.

@x13n (Member) commented Jun 20, 2022

/lgtm
/hold

One minor nit, otherwise lgtm. Feel free to cancel the hold if you disagree.

@k8s-ci-robot added the do-not-merge/hold and lgtm labels on Jun 20, 2022.
The binpacking algorithm is O(#pending_pods * #new_nodes) and
calculating a very large scale-up can get stuck for minutes or even
hours, leading to CA failing its healthcheck and going down.
The new limiting prevents this scenario by stopping binpacking after
reaching a specified threshold. Any pods that remain pending as a result
of the shorter binpacking will be processed in the next autoscaler loop.

The thresholds used can be controlled with the newly introduced flags:
--max-nodes-per-scaleup and --max-nodegroup-binpacking-duration. The
limiting can be disabled by setting both flags to 0 (not recommended,
especially for --max-nodegroup-binpacking-duration).
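
For a rough sense of scale (illustrative numbers, not taken from this PR): packing 5,000 pending pods onto a scale-up of 2,000 new nodes means on the order of 5,000 * 2,000 = 10,000,000 pod-on-node scheduling simulations within a single loop, which is why capping either the number of new nodes or the wall-clock time keeps each loop bounded.
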
Previously we just assumed a pod would always fit on a newly added node
during binpacking, because we'd already checked that the pod fits on an
empty template node earlier in the scale-up logic.
This assumption is incorrect, as it doesn't take into account the potential
impact of other scheduling decisions made during binpacking. For pods using
zonal Filters (such as PodTopologySpreading with a zonal topology key), the
pod may no longer be able to schedule even on an empty node as a result
of earlier decisions made in binpacking.
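
A hedged sketch of that check (the types, helper names, and the limiter stand-in below are illustrative assumptions, not the actual binpacking_estimator code):

package estimator

// Illustrative sketch of the binpacking loop described above; not the
// actual cluster-autoscaler implementation.

type pod struct{ name string }
type node struct{ pods []pod }

// fits stands in for a full scheduler-framework Filter run; with zonal
// constraints it can fail even for an empty node.
type fits func(p pod, n *node) bool

// binpack places pods onto existing or newly added template nodes, up to
// maxNewNodes. Two early exits mirror the behavior added in this PR:
// if the most recently added node is still empty and the pod doesn't fit
// on it, another identical empty node wouldn't help either, so the pod is
// skipped instead of adding nodes indefinitely; and if the node limit is
// reached, binpacking stops and the remaining pods are left for the next
// autoscaler loop.
func binpack(pods []pod, canSchedule fits, maxNewNodes int) (nodes []*node, skipped []pod) {
    lastNodeEmpty := false
    for i, p := range pods {
        placed := false
        for _, n := range nodes {
            if canSchedule(p, n) {
                n.pods = append(n.pods, p)
                placed = true
                break
            }
        }
        if placed {
            lastNodeEmpty = false
            continue
        }
        if lastNodeEmpty {
            // The last node we added is empty and this pod couldn't schedule
            // on it, so a new empty node wouldn't help either.
            skipped = append(skipped, p)
            continue
        }
        if len(nodes) >= maxNewNodes {
            // Stand-in for limiter.PermissionToAddNode(): return what we have.
            skipped = append(skipped, pods[i:]...)
            break
        }
        fresh := &node{}
        nodes = append(nodes, fresh)
        if canSchedule(p, fresh) {
            fresh.pods = append(fresh.pods, p)
        } else {
            lastNodeEmpty = true
            skipped = append(skipped, p)
        }
    }
    return nodes, skipped
}

The skipped pods are exactly the binpacking result that "only includes some of the pending pods" from the PR description: they stay pending and are reconsidered in the next autoscaler loop.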
@k8s-ci-robot removed the lgtm label on Jun 20, 2022.
@MaciekPytel (Contributor Author)

/hold cancel

@k8s-ci-robot removed the do-not-merge/hold label on Jun 20, 2022.
@towca (Collaborator) commented Jun 20, 2022

/lgtm

@k8s-ci-robot added the lgtm label on Jun 20, 2022.
@k8s-ci-robot merged commit 34dfd9a into kubernetes:master on Jun 20, 2022.
Between Aug and Dec 2022, commits referencing this pull request ("Binpacking can exit without packing all the pods") were pushed to the airbnb/autoscaler, navinjoy/autoscaler and lrouquette/autoscaler forks.
@qianlei90 (Contributor)

Hi @MaciekPytel, is there any plan to backport this PR to release-1.20?

lrouquette pushed another commit referencing this pull request to lrouquette/autoscaler on Jan 26, 2023.
@fawadkhaliq

@MaciekPytel are you planning to backport this to 1.24? I'm happy to help out if you don't have cycles.

Labels
  • approved - Indicates a PR has been approved by an approver from all required OWNERS files.
  • area/cluster-autoscaler
  • cncf-cla: yes - Indicates the PR's author has signed the CNCF CLA.
  • kind/bug - Categorizes issue or PR as related to a bug.
  • lgtm - "Looks good to me", indicates that a PR is ready to be merged.
  • size/L - Denotes a PR that changes 100-499 lines, ignoring generated files.
Successfully merging this pull request may close these issues.

[cluster-autoscaler][AWS] Massive scale-out when using composed topologySpreadConstraints
7 participants