
perf: Implement timeouts and Max-batch size for Consolidation #472

Merged: 24 commits, Aug 16, 2023

Conversation

@njtran (Contributor) commented Aug 15, 2023

Fixes #370

Description
I ran some tests with KWOK and a cost-monitoring tool, scaling clusters up to 2000, 1500, 1000, 500, and 100 nodes. I tracked metrics like cumulative cost, cumulative pending-pod seconds, and overall test time.

The test scaled the clusters down incrementally by 25% (100% -> 75% -> 50% -> 25%), waiting for Consolidation to fully scale down to each level and exiting once the node count reached 25% of the peak. With a 1-minute timeout for Multi-Machine Consolidation, a 3-minute timeout for Single-Machine Consolidation, and a max batch size of 100 machines for Multi-Machine, Karpenter was able to take more actions more quickly, reducing the total test time to 33 minutes.

  • Includes a new metric counter that tracks how many times the timeout has been reached, for both Multi-Machine and Single-Machine Consolidation.
  • Also adds larger bucket sizes for Prometheus histogram metrics, since some Deprovisioning histogram observations can exceed 60 seconds.
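The two metric changes above can be sketched as follows. This is an illustrative, dependency-free stand-in: Karpenter uses the Prometheus client library, but here a map plays the role of a labeled counter, and the bucket boundaries are hypothetical (the point is only that they extend past 60 seconds).

```go
package main

import "fmt"

// Extended histogram buckets: 60s is no longer the largest upper bound,
// since some deprovisioning actions run longer. Values are illustrative.
var durationBuckets = []float64{1, 5, 10, 30, 60, 120, 300, 600}

// consolidationTimeouts is a stand-in for a Prometheus counter vector keyed
// by consolidation type ("multi-machine" or "single-machine").
var consolidationTimeouts = map[string]int{}

// recordTimeout mimics counter.WithLabelValues(kind).Inc().
func recordTimeout(kind string) {
	consolidationTimeouts[kind]++
}

// bucketFor returns the index of the first bucket whose upper bound covers
// the observation; observations past the last bound land in the +Inf bucket.
func bucketFor(seconds float64) int {
	for i, ub := range durationBuckets {
		if seconds <= ub {
			return i
		}
	}
	return len(durationBuckets) // +Inf bucket
}

func main() {
	recordTimeout("multi-machine")
	recordTimeout("multi-machine")
	recordTimeout("single-machine")
	fmt.Println(consolidationTimeouts["multi-machine"]) // 2
	fmt.Println(bucketFor(90))                          // 5: covered by the 120s bucket
}
```

With the old 60-second ceiling, a 90-second observation would have fallen straight into the +Inf bucket, losing resolution on exactly the slow cases this PR is concerned with.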

How was this change tested?

  • Tested with KWOK, and with make presubmit

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

njtran and others added 16 commits August 15, 2023 09:54
Allocatable() is called a lot. Memoize this using a sync.Map
so we can avoid the expensive calculation.

name               old time/op   new time/op   delta
Scheduling1-12       566µs ± 9%    362µs ±11%  -36.06%  (p=0.004 n=5+6)
Scheduling50-12     26.2ms ± 4%   18.2ms ± 7%  -30.43%  (p=0.004 n=5+6)
Scheduling100-12    53.1ms ± 8%   31.5ms ± 4%  -40.76%  (p=0.008 n=5+5)
Scheduling500-12     272ms ± 5%    157ms ± 9%  -42.23%  (p=0.004 n=5+6)
Scheduling1000-12    596ms ±11%    396ms ±19%  -33.63%  (p=0.004 n=5+6)
Scheduling2000-12    1.08s ± 4%    0.82s ±10%  -24.47%  (p=0.004 n=5+6)
Scheduling5000-12    2.83s ± 8%    1.88s ± 6%  -33.46%  (p=0.008 n=5+5)
This saves a lot of lookups and is measurably better.
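The memoization described in that commit message can be sketched as below. This is a hypothetical, self-contained illustration under assumed types: the real Allocatable() operates on node objects, and the "expensive" computation here is faked; only the sync.Map caching pattern is the point.

```go
package main

import (
	"fmt"
	"sync"
)

// Resources is a stand-in for a resource list (cpu, memory, ...).
type Resources map[string]int64

var allocatableCache sync.Map // node name -> Resources

var computeCount int // demo-only: counts how often the expensive path runs

// expensiveAllocatable stands in for the original costly calculation.
func expensiveAllocatable(node string) Resources {
	computeCount++
	return Resources{"cpu": 4000, "memory": 16 << 30}
}

// Allocatable returns the cached value if present, computing it at most
// once per node even under concurrent callers.
func Allocatable(node string) Resources {
	if v, ok := allocatableCache.Load(node); ok {
		return v.(Resources)
	}
	r := expensiveAllocatable(node)
	// LoadOrStore keeps the first stored value if two goroutines race here.
	actual, _ := allocatableCache.LoadOrStore(node, r)
	return actual.(Resources)
}

func main() {
	for i := 0; i < 1000; i++ {
		Allocatable("node-a")
	}
	fmt.Println(computeCount) // 1: the expensive path ran only once
}
```

sync.Map fits this shape well: the cache is write-once-per-key and read-heavy, which is one of the two use cases the standard library documents it for.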
@njtran njtran requested a review from a team as a code owner August 15, 2023 18:18
@njtran njtran requested a review from tzneal August 15, 2023 18:18
@coveralls

coveralls commented Aug 16, 2023

Pull Request Test Coverage Report for Build 5874881988

  • 25 of 31 (80.65%) changed or added relevant lines in 5 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage increased (+0.009%) to 79.59%

File                                                          Covered Lines  Changed/Added Lines  %
pkg/controllers/deprovisioning/multimachineconsolidation.go   12             14                   85.71%
pkg/controllers/deprovisioning/singlemachineconsolidation.go  6              10                   60.0%

Totals: change from base Build 5826932860: +0.009%. Covered lines: 8115. Relevant lines: 10196.

💛 - Coveralls

@tzneal (Contributor) left a comment:


lgtm


tzneal commented Aug 16, 2023

Merging so we can get some test runs.

@tzneal tzneal merged commit da2c2f5 into kubernetes-sigs:main Aug 16, 2023
6 checks passed
@njtran njtran deleted the timeouts branch December 5, 2023 23:27
Successfully merging this pull request may close these issues.

Mega Issue: Improve the Performance of Deprovisioning
4 participants