Mega Issue: Improve the Performance of Deprovisioning #370
Comments
#472 implemented the first two items linked in the issue and is included in v0.30.0.
I have a similar issue to #670, but not exactly the same. Specifically, our cluster is extremely active and is managed by both Karpenter and another autoscaler. We will pretty much always have some level of "Pending pods", which causes consolidation to pretty much never take place. Wondering if there is something we can do about this? (I can also cut a separate issue for this.)
This validation code (https://github.com/kubernetes-sigs/karpenter/blob/main/pkg/controllers/disruption/validation.go#L86) seems pretty overkill, right?

First, since GetCandidates fetches all candidates again, can we perform validation by just re-checking the existing candidates, rather than getting all candidates and making sure each existing candidate matches the new set?

Second, why are we abandoning the whole consolidation attempt if there is a mismatch (https://github.com/kubernetes-sigs/karpenter/blob/main/pkg/controllers/disruption/validation.go#L92)? Why can't we just consolidate the candidates that are still valid? We are abandoning a lot of work, and if Karpenter is already struggling with a large set of nodes, this would at least chip away at part of the problem.

Third, why do we need a deep copy of nodes (https://github.com/kubernetes-sigs/karpenter/blob/main/pkg/controllers/state/cluster.go#L169)? In most cases this doesn't seem necessary, especially since the node object carries every pod and all of their information. There should be a shallow version of Nodes that contains only the information needed for scheduling and requirements, with pods disassociated from the node object. I suspect that if pod information were looked up through a separate nodeToPods map rather than kept on each node, it wouldn't be much slower when pod data is actually needed, but it would make the node object much lighter (a rough sketch of this idea follows below). In addition, the pod objects could be stripped of everything except the information needed for scheduling; the node object by itself should already be relatively light, so stripping it further is probably not worth the effort.

Lastly, I can't tell, but are we tracking nodes that aren't managed by Karpenter? From what I can tell we are, and is there a reason why we do this?
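A minimal sketch of the shallow-node idea above, in Go. The type and field names (`ShallowNode`, `Cluster`, `nodeToPods`, etc.) are hypothetical stand-ins rather than Karpenter's actual state package; the point is just to show node snapshots carrying only scheduling-relevant data, with pod lookups going through a separate map:

```go
// Sketch only: illustrative types, not Karpenter's actual state package.
// The idea is to keep the per-node snapshot small and move pod data into
// a separate lookup keyed by node name.

package state

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/types"
)

// ShallowNode carries only what scheduling needs: labels, taints, and
// resource totals. It deliberately does not embed pod objects.
type ShallowNode struct {
	Name        string
	Labels      map[string]string
	Taints      []corev1.Taint
	Allocatable corev1.ResourceList
	Capacity    corev1.ResourceList
	Managed     bool // whether the node is owned by Karpenter
}

// Cluster keeps pod references in a separate index, so copying node
// snapshots does not drag every pod object along with them.
type Cluster struct {
	nodes      map[string]*ShallowNode
	nodeToPods map[string][]types.NamespacedName // pod keys only, not full objects
}

// Nodes returns shallow copies that are cheap enough to hand to the
// scheduler without a deep copy of pod data.
func (c *Cluster) Nodes() []*ShallowNode {
	out := make([]*ShallowNode, 0, len(c.nodes))
	for _, n := range c.nodes {
		cp := *n
		out = append(out, &cp)
	}
	return out
}

// PodsFor looks up pod keys only when a caller actually needs them.
func (c *Cluster) PodsFor(nodeName string) []types.NamespacedName {
	return c.nodeToPods[nodeName]
}
```

With this shape, handing node snapshots to the scheduler is a cheap struct copy, and full pod objects are only fetched for the nodes where they are actually needed.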
The Kubernetes project currently lacks enough contributors to adequately respond to all issues. This bot triages un-triaged issues according to the following rules:

- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues. This bot triages un-triaged issues according to the following rules:

- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

- Mark this issue as fresh with /remove-lifecycle rotten
- Close this issue with /close
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs. This bot triages issues according to the following rules:

- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

- Reopen this issue with /reopen
- Mark this issue as fresh with /remove-lifecycle rotten
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned
@k8s-triage-robot: Closing this issue, marking it as "Not Planned". In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
Tell us about your request
Improve the performance of deprovisioning workflows for large and busy clusters (pods frequently coming and going).
Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard?
In large and busy clusters, consolidation can take a while to evaluate (10–30 minutes). If the cluster is busy enough that the consolidation decision is no longer valid after the 15-second TTL, then little progress can be made towards consolidation. There are two optimization fronts that I could see working on independently:
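For context, a simplified sketch of the compute-then-validate flow described above, in Go. The helper names (`computeCandidates`, `stillValid`, `executeConsolidation`) and the overall structure are illustrative assumptions, not Karpenter's actual code; it only shows how a single mismatch after the 15-second wait discards the whole attempt, which is why a busy cluster makes little progress:

```go
// Illustrative sketch of the consolidation flow described above; not the
// real Karpenter implementation. All helpers are hypothetical stand-ins.

package main

import (
	"context"
	"fmt"
	"time"
)

type Candidate struct{ NodeName string }

const validationTTL = 15 * time.Second // decisions older than this must be revalidated

func consolidateOnce(ctx context.Context) error {
	candidates := computeCandidates() // expensive on large clusters
	if len(candidates) == 0 {
		return nil
	}

	// Wait out the TTL so the decision can be checked against current state.
	select {
	case <-time.After(validationTTL):
	case <-ctx.Done():
		return ctx.Err()
	}

	// If anything changed in the meantime (e.g. new pending pods appeared),
	// the entire attempt is abandoned and the earlier work is discarded.
	for _, c := range candidates {
		if !stillValid(c) {
			return fmt.Errorf("candidate %s no longer valid, abandoning attempt", c.NodeName)
		}
	}
	return executeConsolidation(candidates)
}

// Stubs so the sketch compiles; real logic would query cluster state.
func computeCandidates() []Candidate         { return nil }
func stillValid(Candidate) bool              { return true }
func executeConsolidation([]Candidate) error { return nil }

func main() { _ = consolidateOnce(context.Background()) }
```

Either re-checking only the affected candidates or keeping the ones that still validate, instead of abandoning everything, would reduce how much of this work gets thrown away on busy clusters, as discussed in the comments above.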
Are you currently working around this issue?
N/A
Additional Context
No response
Attachments
No response
Community Note