
perf: Implement timeouts and Max-batch size for Consolidation #472

Merged: 24 commits, Aug 16, 2023

Conversation

@njtran (Contributor) commented Aug 15, 2023

Fixes #370

Description
I ran some tests with KWOK and a cost-monitoring tool, scaling clusters up to 2000, 1500, 1000, 500, and 100 nodes. I tracked metrics like cumulative cost, cumulative pending-pod seconds, and overall test time.

The test scaled the clusters down incrementally by 25% (100% -> 75% -> 50% -> 25%), waiting for Consolidation to fully scale down to each level and exiting once the node count reached 25% of the peak. With a 1-minute timeout for Multi-Machine Consolidation, a 3-minute timeout for Single-Machine Consolidation, and a max batch size of 100 machines for Multi-Machine, Karpenter was able to take more actions more quickly, reducing the total test time to 33 minutes.

  • Includes a new metric counter that tracks how many times the timeout has been reached, for both Multi-Machine and Single-Machine Consolidation.
  • Also adds larger bucket sizes for Prometheus histogram metrics, since some Deprovisioning histogram observations can exceed 60 seconds.
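The two metric changes above can be sketched as follows. This is an illustrative, dependency-free stand-in: Karpenter uses the Prometheus client library, but here a map plays the role of a labeled counter, and the bucket boundaries are hypothetical (the point is only that they extend past 60 seconds).

```go
package main

import "fmt"

// Extended histogram buckets: 60s is no longer the largest upper bound,
// since some deprovisioning actions run longer. Values are illustrative.
var durationBuckets = []float64{1, 5, 10, 30, 60, 120, 300, 600}

// consolidationTimeouts is a stand-in for a Prometheus counter vector keyed
// by consolidation type ("multi-machine" or "single-machine").
var consolidationTimeouts = map[string]int{}

// recordTimeout mimics counter.WithLabelValues(kind).Inc().
func recordTimeout(kind string) {
	consolidationTimeouts[kind]++
}

// bucketFor returns the index of the first bucket whose upper bound covers
// the observation; observations past the last bound land in the +Inf bucket.
func bucketFor(seconds float64) int {
	for i, ub := range durationBuckets {
		if seconds <= ub {
			return i
		}
	}
	return len(durationBuckets) // +Inf bucket
}

func main() {
	recordTimeout("multi-machine")
	recordTimeout("multi-machine")
	recordTimeout("single-machine")
	fmt.Println(consolidationTimeouts["multi-machine"]) // 2
	fmt.Println(bucketFor(90))                          // 5: covered by the 120s bucket
}
```

With the old 60-second ceiling, a 90-second observation would have fallen straight into the +Inf bucket, losing resolution on exactly the slow cases this PR is concerned with.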

How was this change tested?

  • Tested with KWOK, and with make presubmit

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

njtran and others added 16 commits August 15, 2023 09:54
Allocatable() is called a lot. Memoize this using a sync.Map
so we can avoid the expensive calculation.

name               old time/op   new time/op   delta
Scheduling1-12       566µs ± 9%    362µs ±11%  -36.06%  (p=0.004 n=5+6)
Scheduling50-12     26.2ms ± 4%   18.2ms ± 7%  -30.43%  (p=0.004 n=5+6)
Scheduling100-12    53.1ms ± 8%   31.5ms ± 4%  -40.76%  (p=0.008 n=5+5)
Scheduling500-12     272ms ± 5%    157ms ± 9%  -42.23%  (p=0.004 n=5+6)
Scheduling1000-12    596ms ±11%    396ms ±19%  -33.63%  (p=0.004 n=5+6)
Scheduling2000-12    1.08s ± 4%    0.82s ±10%  -24.47%  (p=0.004 n=5+6)
Scheduling5000-12    2.83s ± 8%    1.88s ± 6%  -33.46%  (p=0.008 n=5+5)
This saves a lot of lookups and is measurably better.
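The memoization described in that commit message can be sketched as below. This is a hypothetical, self-contained illustration under assumed types: the real Allocatable() operates on node objects, and the "expensive" computation here is faked; only the sync.Map caching pattern is the point.

```go
package main

import (
	"fmt"
	"sync"
)

// Resources is a stand-in for a resource list (cpu, memory, ...).
type Resources map[string]int64

var allocatableCache sync.Map // node name -> Resources

var computeCount int // demo-only: counts how often the expensive path runs

// expensiveAllocatable stands in for the original costly calculation.
func expensiveAllocatable(node string) Resources {
	computeCount++
	return Resources{"cpu": 4000, "memory": 16 << 30}
}

// Allocatable returns the cached value if present, computing it at most
// once per node even under concurrent callers.
func Allocatable(node string) Resources {
	if v, ok := allocatableCache.Load(node); ok {
		return v.(Resources)
	}
	r := expensiveAllocatable(node)
	// LoadOrStore keeps the first stored value if two goroutines race here.
	actual, _ := allocatableCache.LoadOrStore(node, r)
	return actual.(Resources)
}

func main() {
	for i := 0; i < 1000; i++ {
		Allocatable("node-a")
	}
	fmt.Println(computeCount) // 1: the expensive path ran only once
}
```

sync.Map fits this shape well: the cache is write-once-per-key and read-heavy, which is one of the two use cases the standard library documents it for.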
@njtran njtran requested a review from a team as a code owner August 15, 2023 18:18
@njtran njtran requested a review from tzneal August 15, 2023 18:18
@coveralls

coveralls commented Aug 16, 2023

Pull Request Test Coverage Report for Build 5874881988

  • 25 of 31 (80.65%) changed or added relevant lines in 5 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage increased (+0.009%) to 79.59%

File                                                          Covered Lines  Changed/Added Lines  %
pkg/controllers/deprovisioning/multimachineconsolidation.go   12             14                   85.71%
pkg/controllers/deprovisioning/singlemachineconsolidation.go  6              10                   60.0%

Totals: change from base Build 5826932860: +0.009%. Covered lines: 8115. Relevant lines: 10196.

💛 - Coveralls

@tzneal (Contributor) left a comment:


lgtm


tzneal commented Aug 16, 2023

Merging so we can get some test runs.

@tzneal tzneal merged commit da2c2f5 into kubernetes-sigs:main Aug 16, 2023
6 checks passed
@njtran njtran deleted the timeouts branch December 5, 2023 23:27
Successfully merging this pull request may close these issues.

Mega Issue: Improve the Performance of Deprovisioning
4 participants