
Mega Issue: Improve the Performance of Deprovisioning #370

Closed
5 tasks done
bwagner5 opened this issue Jun 15, 2023 · 7 comments · Fixed by #472
Labels

  • deprovisioning: Issues related to node deprovisioning
  • kind/feature: Categorizes issue or PR as related to a new feature.
  • lifecycle/rotten: Denotes an issue or PR that has aged beyond stale and will be auto-closed.
  • performance: Issues relating to performance (memory usage, cpu usage, timing)
  • v1: Issues requiring resolution by the v1 milestone

Comments

@bwagner5
Contributor

bwagner5 commented Jun 15, 2023

Tell us about your request

Improve the performance of deprovisioning workflows for large and busy clusters (i.e., clusters where pods are frequently coming and going).

Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard?

In large and busy clusters, consolidation can take 10 - 30 minutes to evaluate. If the cluster is busy enough that the consolidation decision is no longer valid after the 15-second validation TTL, then little progress can be made toward consolidation. There are several optimization fronts that I could see being worked on independently:

  • Optimize the number of nodes that multi-node consolidation should take action on. This could be a dynamic calculation, starting with all nodes as eligible candidates and reducing that number on each validation failure, OR it could be statically capped at a 5 or 10 node max. I would lean toward making this dynamic, with an algorithm that halves the number of nodes on a validation failure and then builds back up at some rate (a rough sketch follows this list).
  • Reducing the time spent computing consolidation actions should also be evaluated. It may be feasible to parallelize the computation. This is mainly to accommodate large clusters, so scale testing could be performed using 4 - 8 CPUs for the Karpenter controller.
  • Factor out the waits for replacement nodes and terminations
  • Factor out the TTL validation wait on Consolidation so that other deprovisioning can occur during the 15 second sleep
  • Configurable number of Deprovisioning actions: Allow Karpenter to Deprovision Expired/Drifted/Consolidatable nodes in parallel #255
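
A minimal sketch of the dynamic cap idea from the first bullet, assuming an AIMD-style policy (halve the cap on a validation failure, grow it back additively on success). The `candidateCap` type and its fields are hypothetical illustrations, not Karpenter's actual API:

```go
package disruption

// candidateCap tracks how many nodes multi-node consolidation may act on at once.
// The names here are illustrative only.
type candidateCap struct {
	current int // candidates the next multi-node consolidation attempt may include
	max     int // upper bound, e.g. the number of eligible nodes in the cluster
}

// onValidationFailure halves the cap (multiplicative decrease) so the next
// attempt proposes a smaller action that is more likely to still be valid.
func (c *candidateCap) onValidationFailure() {
	c.current /= 2
	if c.current < 1 {
		c.current = 1
	}
}

// onSuccess grows the cap back up at a fixed rate (additive increase),
// never exceeding the configured maximum.
func (c *candidateCap) onSuccess(step int) {
	c.current += step
	if c.current > c.max {
		c.current = c.max
	}
}
```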

Are you currently working around this issue?

N/A

Additional Context

No response

Attachments

No response

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment
@bwagner5 added the kind/feature label Jun 15, 2023
@njtran added the deprovisioning and performance labels Jun 15, 2023
@billrayburn added the v1 label Jun 15, 2023
@ellistarn changed the title from "Improve the Performance of Deprovisioning" to "Mega Issue: Improve the Performance of Deprovisioning" Jul 31, 2023
@njtran
Contributor

njtran commented Sep 7, 2023

#472 implemented the first two items listed in this issue and is included in v0.30.0.

@garvinp-stripe
Contributor

garvinp-stripe commented Dec 18, 2023

I have a similar issue to #670, but not exactly the same. Our cluster is extremely active and is managed by both Karpenter and another autoscaler. We will almost always have some level of pending pods, which causes consolidation to essentially never take place.
https://github.com/kubernetes-sigs/karpenter/blob/main/pkg/controllers/disruption/consolidation.go#L130

Wondering if there is something we can do about this? (I can also cut a separate issue for it.)

@garvinp-stripe
Contributor

garvinp-stripe commented Feb 24, 2024

https://github.com/kubernetes-sigs/karpenter/blob/main/pkg/controllers/disruption/validation.go#L86 This validation code seems like overkill, right? GetCandidates calls cluster.Nodes(), which makes a deep copy of the nodes.

First, can we perform validation by just re-checking the existing candidates, rather than getting all candidates and making sure each existing candidate matches the new set?

Second, why are we abandoning the consolidation attempt if there is a mismatch (https://github.com/kubernetes-sigs/karpenter/blob/main/pkg/controllers/disruption/validation.go#L92)? Why can't we just consolidate the candidates that are still valid? We are abandoning a lot of work, and if Karpenter is already struggling with a large set of nodes, this would at least chip away at part of the problem.

Third, why do we need a deep copy of nodes (https://github.com/kubernetes-sigs/karpenter/blob/main/pkg/controllers/state/cluster.go#L169)? In most cases this doesn't seem necessary, especially since the node object carries every pod and all of its information(?). There could be a shallow version of Nodes that contains only the information needed for scheduling and requirements, with pods disassociated from the node object. I suspect that looking pods up in a separate nodeToPods map when pod information is needed, rather than keeping pod info on the nodes, would not be much slower, and it would make the node object much lighter (a rough sketch is below). In addition, the pod objects could be stripped of everything except the information needed for scheduling. The node object by itself should be relatively light, so it's probably not worth stripping it further.
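
To illustrate the shallow-node idea, here is a rough sketch assuming a separate node-to-pods index. All type and field names (ShallowNode, PodInfo, ClusterSnapshot, PodsFor) are hypothetical, not the actual pkg/controllers/state types:

```go
package state

import (
	v1 "k8s.io/api/core/v1"
)

// ShallowNode carries only what scheduling and requirements checks need,
// with no embedded pod objects.
type ShallowNode struct {
	Name        string
	Labels      map[string]string
	Taints      []v1.Taint
	Allocatable v1.ResourceList
}

// PodInfo is a pod stripped down to scheduling-relevant fields.
type PodInfo struct {
	Name     string
	Requests v1.ResourceList
}

// ClusterSnapshot pairs the light node list with a separate node->pods index,
// so copying node state no longer deep-copies every pod.
type ClusterSnapshot struct {
	Nodes      []ShallowNode
	NodeToPods map[string][]PodInfo
}

// PodsFor looks up the stripped-down pods bound to a node on demand.
func (c *ClusterSnapshot) PodsFor(nodeName string) []PodInfo {
	return c.NodeToPods[nodeName]
}
```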

Lastly, I can't tell for sure, but are we tracking nodes that aren't managed by Karpenter? From what I can tell we are; is there a reason why we do this?

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label May 26, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label Jun 25, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

@k8s-ci-robot
Contributor

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to the /close not-planned command in the comment above.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot closed this as not planned (won't fix, can't repro, duplicate, stale) Jul 25, 2024