Binpacking can exit without packing all the pods #4970

Merged
merged 3 commits into kubernetes:master on Jun 20, 2022

Conversation

@MaciekPytel (Contributor) commented Jun 14, 2022

Which component does this PR apply to?

cluster-autoscaler

What type of PR is this?

/kind bug

What this PR does / why we need it:

This PR updates the Estimator interface to allow a binpacking result that includes only some of the pending pods. This is used to address two different problems:

  • Fix an issue where CA would drastically overestimate (and, as a result, overshoot the scale-up) the number of nodes needed for pods using zonal constraints (PodTopologySpreading or PodAntiAffinity on zonal topology) ([cluster-autoscaler][AWS] Massive scale-out when using composed topologySpreadConstraints #4129).
  • Limit the maximum size of a single binpacking via either a node count or a maximum binpacking time. Binpacking is ~O(#pending_pods * #nodes_delta) and a very large scale-up can take an extremely long time to calculate. Limiting binpacking size/duration protects against this issue; instead of calculating a single enormous scale-up, CA will be able to calculate a sequence of smaller ones over multiple loops.
    • This behavior is controlled by the newly introduced flags --max-nodes-per-scaleup and --max-nodegroup-binpacking-duration.
    • On the implementation side, the limiting decision is delegated to a new EstimationLimiter. This is intended to allow smarter customization (e.g. cutting binpacking early if an external quota would limit the scale-up anyway); see the sketch after this list.
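
To make the EstimationLimiter idea above concrete, here is a minimal sketch of what such a limiter could look like. The names and signatures below (estimationLimiter, thresholdLimiter, maxNodes, maxDuration) are illustrative assumptions, not the exact cluster-autoscaler API; in particular, the real interface also receives context such as the NodeGroup being estimated (see the review discussion below), which is omitted here for brevity.

package estimator

import "time"

// estimationLimiter sketches an interface in the spirit of the
// EstimationLimiter described above (illustrative, not the real API).
type estimationLimiter interface {
    StartEstimation()          // called before binpacking of a node group begins
    EndEstimation()            // called after binpacking of a node group ends
    PermissionToAddNode() bool // may binpacking add one more node?
}

// thresholdLimiter stops binpacking after maxNodes added nodes or after
// maxDuration of wall-clock time, whichever comes first; a zero value
// disables the corresponding limit.
type thresholdLimiter struct {
    maxNodes    int
    maxDuration time.Duration
    added       int
    start       time.Time
}

var _ estimationLimiter = &thresholdLimiter{}

func (l *thresholdLimiter) StartEstimation() {
    l.added = 0
    l.start = time.Now()
}

func (l *thresholdLimiter) EndEstimation() {}

func (l *thresholdLimiter) PermissionToAddNode() bool {
    if l.maxNodes > 0 && l.added >= l.maxNodes {
        return false
    }
    if l.maxDuration > 0 && time.Since(l.start) > l.maxDuration {
        return false
    }
    l.added++
    return true
}

With a limiter of this shape, --max-nodes-per-scaleup and --max-nodegroup-binpacking-duration would simply populate maxNodes and maxDuration, and setting either flag to 0 would disable that particular limit.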

Which issue(s) this PR fixes:

Fixes #4129

Does this PR introduce a user-facing change?

 * Fix an issue where CA could drastically overshoot scale-up for pods using zonal scheduling constraints (PodTopologySpreading or PodAntiAffinity on zonal topology).
 * Limit the maximum duration of the binpacking simulation to prevent CA from becoming unresponsive in huge scale-up scenarios. Introduce --max-nodes-per-scaleup and --max-nodegroup-binpacking-duration flags that can be used to control this behavior (note: those flags are only meant for fine-tuning scale-up calculation latency; they're not intended for rate-limiting scale-up).

Additional notes for reviewer

The unit test "zonal topology spreading with maxSkew=2 only allows 2 pods to schedule" reproduces #4129. It fails without the binpacking_estimator changes in the last commit and passes with them.
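
For context on why maxSkew=2 caps scheduling at two pods: with a zonal topology key and all of the new capacity coming from a single zone (while the other eligible zones stay at zero matching pods), placing a third pod in that zone would push the skew above 2. An illustrative constraint of roughly that shape is sketched below; the label selector and values are assumptions, not copied from the test.

package example

import (
    corev1 "k8s.io/api/core/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// zonalSpread is an illustrative zonal topology spread constraint of the
// kind involved in #4129; maxSkew=2 means at most two more pods in the
// most-loaded zone than in the least-loaded eligible zone.
var zonalSpread = corev1.TopologySpreadConstraint{
    MaxSkew:           2,
    TopologyKey:       "topology.kubernetes.io/zone",
    WhenUnsatisfiable: corev1.DoNotSchedule,
    LabelSelector: &metav1.LabelSelector{
        MatchLabels: map[string]string{"app": "example"}, // assumed labels
    },
}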

@k8s-ci-robot added the do-not-merge/work-in-progress, kind/bug, cncf-cla: yes and size/L labels on Jun 14, 2022.
@MaciekPytel (Contributor Author)

/hold
For additional manual testing

@k8s-ci-robot added the do-not-merge/hold label on Jun 14, 2022.
@MaciekPytel (Contributor Author)

/assign @x13n

@k8s-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: MaciekPytel

The full list of commands accepted by this bot can be found here.

The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot added the approved label on Jun 14, 2022.
}

if !found {
    // Stop binpacking if we reach the limit of nodes we can add.
    // We return the result of the binpacking that we already performed.
    if !e.limiter.PermissionToAddNodes(1) {
Member

Do you foresee any estimator changes soon that would require the num nodes argument to be anything other than 1?

Contributor Author

I don't expect us to add a new estimator anytime soon. It is technically a pluggable interface (similar to expander), so one could make an argument for trying to be generic. My thought process was that the previous estimator implementation (which we deprecated and dropped a while ago) may have used it.

That being said, you may be right and having the node count as a parameter is probably overkill. Do you want me to drop it?

Member

Yeah, let's have the simplest interface that works for the existing estimator. We can always extend it in the future in a backwards-compatible way.

Contributor Author

done

        break
    }

// If the last node we've added is empty and the pod couldn't schedule on it, it wouldn't be able to schedule
Member

nit: Limiting binpacking size/time and checking whether any pod was scheduled to a new node are separate optimizations; they could've been separate PRs.

Contributor Author

I think this optimization is necessary for fixing #4129 without affecting performance. Previously we'd never have empty nodes in binpacking, so that wasn't a concern.

Fair point that the scalability improvements and the fix for #4129 should have been separate PRs. They're separate commits, and if you want I can easily split them into separate PRs. But I'm guessing you mean it more as feedback for future PRs?

Member

Oh, I didn't actually notice they are separate commits; that would have made my reviewing easier... 🤦 Given the low number of my comments, I don't think splitting this PR into two is worth the effort, so I guess that's feedback for future PRs.

@x13n (Member) commented Jun 15, 2022

Looks good overall, just need to address the test failure. I'm also wondering if we are going to need such a generic API.

@MaciekPytel (Contributor Author)

"I'm also wondering if we are going to need such a generic API."

Do you mean the specific function signatures in EstimationLimiter or the interface itself? I think there are multiple custom optimizations that could be made to the limiter, including platform-specific ones (like the quota example in the PR description, or cutting the binpacking based on max nodepool size). I think passing NodeGroup to the limiter is a prerequisite for most of those, so I kinda like that part.

Passing a node count to PermissionToAddNodes may have been overkill; I can remove it if you think it complicates things too much.

@x13n (Member) commented Jun 15, 2022

I just meant the node count, apologies for a vague comment!

@MaciekPytel changed the title from "WIP: Binpacking can exit without packing all the pods" to "Binpacking can exit without packing all the pods" on Jun 17, 2022.
@k8s-ci-robot removed the do-not-merge/work-in-progress label on Jun 17, 2022.
@MaciekPytel (Contributor Author)

I did scalability testing with a scale-up of a few thousand nodes and confirmed that CA scale-up loops remained short and that the large number of pods no longer caused CA to crashloop.
I did see some very long scale-down loops, but those happened before this change as well and they're a separate issue.
Removing the hold for testing.
/hold cancel

@k8s-ci-robot removed the do-not-merge/hold label on Jun 17, 2022.
name: "no limiting happens",
maxNodes: 20,
operations: []limiterRequest{
permissionRequest{false},
Member

nit: I think instead of using the pattern of applying operations here, it would be more readable to just have a list of func() that are invoked, so that they have meaningful names in each test case, like so:

operations: []operation{
  expectDeny,
  restartLimiter,
  expectAllow,
}

Contributor Author

You're right, that's much better. Updated the test.

@x13n (Member) commented Jun 20, 2022

/lgtm
/hold

One minor nit, otherwise lgtm. Feel free to cancel the hold if you disagree.

@k8s-ci-robot added the do-not-merge/hold and lgtm labels on Jun 20, 2022.
The binpacking algorithm is O(#pending_pods * #new_nodes) and
calculating a very large scale-up can get stuck for minutes or even
hours, leading to CA failing its healthcheck and going down.
The new limiting prevents this scenario by stopping binpacking after
reaching a specified threshold. Any pods that remain pending as a result
of the shorter binpacking will be processed in the next autoscaler loop.

The thresholds used can be controlled with the newly introduced flags:
--max-nodes-per-scaleup and --max-nodegroup-binpacking-duration. The
limiting can be disabled by setting both flags to 0 (not recommended,
especially for --max-nodegroup-binpacking-duration).
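
For a rough sense of scale (illustrative numbers, not taken from this PR): packing 5,000 pending pods onto a scale-up of 2,000 new nodes means on the order of 5,000 * 2,000 = 10,000,000 pod-on-node scheduling simulations within a single loop, which is why capping either the number of new nodes or the wall-clock time keeps each loop bounded.
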
Previously we just assumed a pod would always fit on a newly added node
during binpacking, because we'd already checked that the pod fits on an
empty template node earlier in the scale-up logic.
This assumption is incorrect, as it doesn't take into account the potential
impact of other scheduling decisions made during binpacking. For pods using
zonal Filters (such as PodTopologySpreading with a zonal topology key), the
pod may no longer be able to schedule even on an empty node as a result
of earlier decisions made in binpacking.
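
A hedged sketch of that check (the types, helper names, and the limiter stand-in below are illustrative assumptions, not the actual binpacking_estimator code):

package estimator

// Illustrative sketch of the binpacking loop described above; not the
// actual cluster-autoscaler implementation.

type pod struct{ name string }
type node struct{ pods []pod }

// fits stands in for a full scheduler-framework Filter run; with zonal
// constraints it can fail even for an empty node.
type fits func(p pod, n *node) bool

// binpack places pods onto existing or newly added template nodes, up to
// maxNewNodes. Two early exits mirror the behavior added in this PR:
// if the most recently added node is still empty and the pod doesn't fit
// on it, another identical empty node wouldn't help either, so the pod is
// skipped instead of adding nodes indefinitely; and if the node limit is
// reached, binpacking stops and the remaining pods are left for the next
// autoscaler loop.
func binpack(pods []pod, canSchedule fits, maxNewNodes int) (nodes []*node, skipped []pod) {
    lastNodeEmpty := false
    for i, p := range pods {
        placed := false
        for _, n := range nodes {
            if canSchedule(p, n) {
                n.pods = append(n.pods, p)
                placed = true
                break
            }
        }
        if placed {
            lastNodeEmpty = false
            continue
        }
        if lastNodeEmpty {
            // The last node we added is empty and this pod couldn't schedule
            // on it, so a new empty node wouldn't help either.
            skipped = append(skipped, p)
            continue
        }
        if len(nodes) >= maxNewNodes {
            // Stand-in for limiter.PermissionToAddNode(): return what we have.
            skipped = append(skipped, pods[i:]...)
            break
        }
        fresh := &node{}
        nodes = append(nodes, fresh)
        if canSchedule(p, fresh) {
            fresh.pods = append(fresh.pods, p)
        } else {
            lastNodeEmpty = true
            skipped = append(skipped, p)
        }
    }
    return nodes, skipped
}

The skipped pods are exactly the binpacking result that "only includes some of the pending pods" from the PR description: they stay pending and are reconsidered in the next autoscaler loop.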
@k8s-ci-robot removed the lgtm label on Jun 20, 2022.
@MaciekPytel (Contributor Author)

/hold cancel

@k8s-ci-robot removed the do-not-merge/hold label on Jun 20, 2022.
@towca (Collaborator) commented Jun 20, 2022

/lgtm

@k8s-ci-robot added the lgtm label on Jun 20, 2022.
@k8s-ci-robot merged commit 34dfd9a into kubernetes:master on Jun 20, 2022.
Between Aug and Dec 2022, commits referencing this pull request ("Binpacking can exit without packing all the pods") were pushed to the airbnb/autoscaler, navinjoy/autoscaler and lrouquette/autoscaler forks.
@qianlei90 (Contributor)

Hi @MaciekPytel, is there any plan to backport this PR to release-1.20?

lrouquette pushed another commit referencing this pull request to lrouquette/autoscaler on Jan 26, 2023.
@fawadkhaliq

@MaciekPytel are you planning to backport this to 1.24? I'm happy to help out if you don't have cycles.

Labels
  • approved - Indicates a PR has been approved by an approver from all required OWNERS files.
  • area/cluster-autoscaler
  • cncf-cla: yes - Indicates the PR's author has signed the CNCF CLA.
  • kind/bug - Categorizes issue or PR as related to a bug.
  • lgtm - "Looks good to me", indicates that a PR is ready to be merged.
  • size/L - Denotes a PR that changes 100-499 lines, ignoring generated files.
Successfully merging this pull request may close these issues.

[cluster-autoscaler][AWS] Massive scale-out when using composed topologySpreadConstraints
7 participants