Binpacking can exit without packing all the pods #4970
Conversation
/hold
/assign @x13n
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: MaciekPytel

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
}

if !found {
	// Stop binpacking if we reach the limit of nodes we can add.
	// We return the result of the binpacking that we already performed.
	if !e.limiter.PermissionToAddNodes(1) {
Do you foresee any estimator changes soon that would require the num nodes argument to be anything other than 1?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't expect us to add a new estimator anytime soon. It is technically a pluggable interface (similar to expander), so one could make an argument for trying to be generic. My thought process was that the previous implementation of the estimator (which we deprecated and dropped a while ago) may have used it.
That being said, you may be right: having the node count as a parameter is probably overkill. Do you want me to drop it?
Yeah, let's have the simplest interface that works for the existing estimator. We can always extend it in the future in a backwards-compatible way.
done
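For reference, a minimal sketch of what the simplified interface could look like after dropping the node-count parameter; the names and exact signatures here are a reading of this thread, not necessarily the merged code:

```go
package estimator

import (
	apiv1 "k8s.io/api/core/v1"
	"k8s.io/autoscaler/cluster-autoscaler/cloudprovider"
)

// EstimationLimiter controls how many nodes the estimator may add during
// binpacking. Sketch only; the merged interface may differ in details.
type EstimationLimiter interface {
	// StartEstimation is called before binpacking for a given node group.
	// Passing the NodeGroup allows platform-specific limiters, e.g. ones
	// aware of quota or of the node group's maximum size.
	StartEstimation([]*apiv1.Pod, cloudprovider.NodeGroup)
	// EndEstimation is called once binpacking for the node group finishes.
	EndEstimation()
	// PermissionToAddNode returns false when no further simulated node may
	// be added to the binpacking result.
	PermissionToAddNode() bool
}
```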
		break
	}

// If the last node we've added is empty and the pod couldn't schedule on it, it wouldn't be able to schedule on a new node either.
nit: Limiting binpacking size/time and checking if any pod was scheduled to a new node are separate optimizations, could've been separate PRs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this optimization is necessary for fixing #4129 without affecting performance. Previously we'd never have empty nodes in binpacking, so that wasn't a concern.
Fair point that the scalability improvements and the fix for #4129 should have been separate PRs. They're separate commits, and if you want I can easily split them into separate PRs. But I'm guessing you mean it more as feedback for future PRs?
Oh, I didn't actually notice they are separate commits; that'd make my reviewing easier... 🤦 Given the low number of my comments, I don't think splitting this PR into two is worth the effort, so I guess that's feedback for future PRs.
Looks good overall, just need to address the test failure. I'm also wondering if we are going to need such a generic API.
Do you mean specific function signatures in EstimationLimiter or the interface itself? I think there are multiple custom optimizations that could be made to the limiter, including platform-specific ones (like the quota example in the PR description, or cutting the binpacking based on max nodepool size). I think passing NodeGroup to the limiter is a prerequisite for most of those, so I kinda like that part. Passing node count to PermissionToAddNodes may have been overkill; I can remove it if you think it complicates things too much.
I just meant the node count, apologies for a vague comment!
Force-pushed from 5dc2c07 to d5fc1b7.
I did scalability testing with a few-thousand-node scale-up and confirmed that CA scale-up loops remained short and that the large number of pods no longer caused CA to crashloop.
name: "no limiting happens",
maxNodes: 20,
operations: []limiterRequest{
	permissionRequest{false},
nit: I think instead of the pattern of applying operations here, it would be more readable to just have a list of func()s that are invoked, so that they have meaningful names in each test case, like so:
operations: []operation{
expectDeny,
restartLimiter,
expectAllow,
}
You're right, that's much better. Updated the test.
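The suggested pattern might look like the following sketch; limiterUnderTest and the operation names are illustrative stand-ins for the types in the actual test file:

```go
package estimator

import "testing"

// limiterUnderTest is a stand-in for the limiter type exercised by the test.
type limiterUnderTest interface {
	StartEstimation()
	PermissionToAddNode() bool
}

// operation gives each test step a meaningful, self-describing name.
type operation func(t *testing.T, l limiterUnderTest)

func expectAllow(t *testing.T, l limiterUnderTest) {
	if !l.PermissionToAddNode() {
		t.Error("expected permission to add a node, got a denial")
	}
}

func expectDeny(t *testing.T, l limiterUnderTest) {
	if l.PermissionToAddNode() {
		t.Error("expected a denial, got permission to add a node")
	}
}

func restartLimiter(t *testing.T, l limiterUnderTest) {
	l.StartEstimation()
}
```

Each test case then just iterates its operations: for _, op := range tc.operations { op(t, limiter) }.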
/lgtm

One minor nit, otherwise lgtm. Feel free to cancel the hold if you disagree.
The binpacking algorithm is O(#pending_pods * #new_nodes), and calculating a very large scale-up can get stuck for minutes or even hours, leading to CA failing its healthcheck and going down. The new limiting prevents this scenario by stopping binpacking after reaching a specified threshold. Any pods that remain pending as a result of shorter binpacking will be processed in the next autoscaler loop. The thresholds can be controlled with the newly introduced flags --max-nodes-per-scaleup and --max-nodegroup-binpacking-duration. The limiting can be disabled by setting both flags to 0 (not recommended, especially for --max-nodegroup-binpacking-duration).
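A minimal Go sketch of a limiter enforcing these thresholds (illustrative names, with StartEstimation simplified to take no arguments; the merged implementation may differ):

```go
package estimator

import "time"

// thresholdBasedLimiter stops binpacking once either a node-count or a
// duration threshold is reached; a zero value disables that limit, mirroring
// the flag semantics described above.
type thresholdBasedLimiter struct {
	maxNodes    int           // e.g. the value of --max-nodes-per-scaleup
	maxDuration time.Duration // e.g. the value of --max-nodegroup-binpacking-duration
	nodes       int
	start       time.Time
}

func (l *thresholdBasedLimiter) StartEstimation() {
	l.nodes = 0
	l.start = time.Now()
}

func (l *thresholdBasedLimiter) EndEstimation() {}

// PermissionToAddNode returns false once either threshold is exceeded.
func (l *thresholdBasedLimiter) PermissionToAddNode() bool {
	if l.maxNodes > 0 && l.nodes >= l.maxNodes {
		return false
	}
	if l.maxDuration > 0 && time.Since(l.start) > l.maxDuration {
		return false
	}
	l.nodes++
	return true
}
```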
Previously we just assumed a pod will always fit on a newly added node during binpacking, because we'd already checked that the pod fits on an empty template node earlier in the scale-up logic. This assumption is incorrect, as it doesn't take into account the potential impact of other scheduling decisions we've made during binpacking. For pods using zonal Filters (such as pod topology spreading with a zonal topology key), the pod may no longer be able to schedule even on an empty node as a result of earlier decisions made in binpacking.
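A toy sketch of the amended binpacking loop tying both changes together; pod, node, fitsOn and the limiter interface below are illustrative stand-ins, not cluster-autoscaler types:

```go
package binpacking

type pod struct{ name string }
type node struct{ pods []pod }

// fitsOn stands in for running scheduler predicates against a simulated node.
func fitsOn(p pod, n *node) bool { return len(n.pods) < 2 } // toy capacity

type limiter interface{ PermissionToAddNode() bool }

// packPods returns the simulated nodes plus any pods left unpacked.
func packPods(pods []pod, lim limiter) (nodes []node, leftover []pod) {
	for i, p := range pods {
		placed := false
		for j := range nodes {
			if fitsOn(p, &nodes[j]) {
				nodes[j].pods = append(nodes[j].pods, p)
				placed = true
				break
			}
		}
		if placed {
			continue
		}
		// Fix for #4129: if the most recently added node is still empty,
		// this pod already failed to schedule on a brand new node, so one
		// more identical node cannot help; skip the pod instead.
		if len(nodes) > 0 && len(nodes[len(nodes)-1].pods) == 0 {
			leftover = append(leftover, p)
			continue
		}
		// New limiter check: stop binpacking once the limit is reached and
		// return the partial result; remaining pods are retried next loop.
		if !lim.PermissionToAddNode() {
			leftover = append(leftover, pods[i:]...)
			break
		}
		n := node{}
		if fitsOn(p, &n) {
			n.pods = append(n.pods, p)
		}
		nodes = append(nodes, n)
	}
	return nodes, leftover
}
```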
Force-pushed from d5fc1b7 to 5342f18.
/hold cancel
/lgtm
Hi @MaciekPytel, is there any plan to backport this PR to release-1.20?
@MaciekPytel are you planning to backport this to 1.24? I'm happy to help out if you don't have cycles.
Which component this PR applies to?
cluster-autoscaler
What type of PR is this?
/kind bug
What this PR does / why we need it:
This PR updates the Estimator interface to allow a binpacking result that only includes some of the pending pods. This is used to address 2 different problems:

1. A very large scale-up calculation can take minutes or even hours, so binpacking can now be cut short once it hits a size or duration threshold, controlled with the new flags --max-nodes-per-scaleup and --max-nodegroup-binpacking-duration.
2. A pod that can no longer schedule even on an empty new node (e.g. due to zonal topology spreading) is now skipped during binpacking instead of breaking the scale-up (#4129).

Which issue(s) this PR fixes:

Fixes #4129
Does this PR introduce a user-facing change?
Additional notes for reviewer
The unit test "zonal topology spreading with maxSkew=2 only allows 2 pods to schedule" reproduces #4129. It fails without the binpacking_estimator changes in the last commit and passes with those changes.
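For illustration, the kind of constraint such a test might construct (a sketch; the selector and values here are hypothetical, not copied from the test): with a zonal topology key and all new nodes landing in one zone, a third matching pod exceeds MaxSkew=2 (assuming the other zones stay empty) and becomes unschedulable even on an empty new node.

```go
package example

import (
	apiv1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// zonalSpread sketches a zonal topology spreading constraint with maxSkew=2.
var zonalSpread = apiv1.TopologySpreadConstraint{
	MaxSkew:           2,
	TopologyKey:       "topology.kubernetes.io/zone",
	WhenUnsatisfiable: apiv1.DoNotSchedule,
	LabelSelector: &metav1.LabelSelector{
		// Hypothetical selector matching the test pods.
		MatchLabels: map[string]string{"app": "test"},
	},
}
```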