Cluster Autoscaler resets unneeded since time to 0s #5618
/assign @vadasambar
A snippet of logs around the problematic behavior might be helpful. Here's my understanding (feel free to correct me):
autoscaler/cluster-autoscaler/core/scale_down.go Lines 593 to 604 in cb24873
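For readers following along, here is a minimal, self-contained sketch of the kind of bookkeeping those lines perform, assuming a simple map from node name to its "unneeded since" timestamp. The function and variable names below are illustrative, not the actual cluster-autoscaler identifiers:

```go
package main

import (
	"fmt"
	"time"
)

// updateUnneededSince is a simplified sketch (not the real CA source): a node
// that was already present in the previous map keeps its original "unneeded
// since" timestamp; any node not present there is stamped with the current time.
func updateUnneededSince(prev map[string]time.Time, stillUnneeded []string, now time.Time) map[string]time.Time {
	next := make(map[string]time.Time, len(stillUnneeded))
	for _, name := range stillUnneeded {
		if since, ok := prev[name]; ok {
			next[name] = since // seen before: keep the old timestamp
		} else {
			next[name] = now // new (or previously dropped) entry: reset to "now"
		}
	}
	return next
}

func main() {
	start := time.Now().Add(-10 * time.Minute)
	prev := map[string]time.Time{"node-c": start}

	// Node stays in the map between iterations: its timestamp is preserved.
	next := updateUnneededSince(prev, []string{"node-c"}, time.Now())
	fmt.Println("kept:", next["node-c"].Equal(start)) // true

	// If something deleted node-c from the previous map in between, it comes
	// back with a fresh timestamp, i.e. "unneeded for 0s" in the logs.
	delete(prev, "node-c")
	next = updateUnneededSince(prev, []string{"node-c"}, time.Now())
	fmt.Println("reset:", next["node-c"].Equal(start)) // false
}
```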
Thank you @vadasambar. I have too many tabs open and that one wasn't from 1.21; sorry for the confusion. Thanks for finding the right place. Here are all the logs having to do with one particular node that this happened on. I'm filtering them because we have hundreds of nodes and there are a LOT of logs, but the filter is just a simple text search of this node name.
It looks like something is getting reset in between the 6:00:14 scrape and the 6:00:24 scrape. The only thing that looks like a reset is this line:
During the 6:00:24 scrape there were quite a few other nodes marked as unneeded for 0s; some were new, but some display the same behavior as this node.
It seems like something is removing the node from the map maintained here:
autoscaler/cluster-autoscaler/core/scale_down.go Lines 593 to 604 in cb24873
I think CA might be removing some nodes from the unneeded node map. Because of this, during the next iteration of CA (cluster-autoscaler), the code goes into:
autoscaler/cluster-autoscaler/core/scale_down.go Lines 975 to 987 in fcd0433
To look a little closer at the usage tracker:
autoscaler/cluster-autoscaler/simulator/tracker.go Lines 134 to 161 in e028312
Here's my understanding of how it works.
My earlier understanding was that the usage tracker would remove node B from the unneeded node map. I believe node C is the node we care about here. It is getting removed from the unneeded node map, and I am not sure why node B is not removed as well.
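To make the node A/B/C interaction concrete, here is a rough sketch of a usage-tracker-style cleanup: when a node is deleted, every node whose simulated scale-down relied on the same target node is purged from the unneeded map and therefore loses its timestamp. The types and function names below are hypothetical, not the real simulator/tracker.go API:

```go
package main

import (
	"fmt"
	"time"
)

// usageTracker is a stripped-down sketch of the idea behind the usage tracker
// (names and structure are illustrative): using[A] is the set of nodes that A's
// pods would be moved onto, and usedBy[B] is the set of nodes whose pods would
// be moved onto B during the scale-down simulation.
type usageTracker struct {
	using  map[string]map[string]bool
	usedBy map[string]map[string]bool
}

func newUsageTracker() *usageTracker {
	return &usageTracker{using: map[string]map[string]bool{}, usedBy: map[string]map[string]bool{}}
}

func (t *usageTracker) registerUsage(usingNode, usedNode string) {
	if t.using[usingNode] == nil {
		t.using[usingNode] = map[string]bool{}
	}
	if t.usedBy[usedNode] == nil {
		t.usedBy[usedNode] = map[string]bool{}
	}
	t.using[usingNode][usedNode] = true
	t.usedBy[usedNode][usingNode] = true
}

// removeNode drops the removed node from the unneeded map, and also drops every
// other node whose simulated scale-down depended on the same target nodes;
// those entries lose their "unneeded since" timestamp and start over next loop.
func (t *usageTracker) removeNode(node string, unneeded map[string]time.Time) {
	for usedNode := range t.using[node] {
		for dependent := range t.usedBy[usedNode] {
			delete(unneeded, dependent)
		}
	}
	delete(unneeded, node)
	delete(t.using, node)
}

func main() {
	unneeded := map[string]time.Time{
		"node-a": time.Now().Add(-8 * time.Minute),
		"node-c": time.Now().Add(-8 * time.Minute),
	}
	t := newUsageTracker()
	t.registerUsage("node-a", "node-b") // A's pods would land on B
	t.registerUsage("node-c", "node-b") // C's pods would also land on B

	t.removeNode("node-a", unneeded) // deleting A evicts C from the unneeded map too
	_, cStillTracked := unneeded["node-c"]
	fmt.Println("node-c still has a timestamp:", cStillTracked) // false
}
```

Under these assumptions, node C loses its entry simply because it shared a simulation target with the deleted node, which would match the "unneeded for 0s" resets seen in the logs.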
/label cluster-autoscaler
@vadasambar: The label(s) could not be applied. In response to this: /label cluster-autoscaler
Hi, has there been any progress on this? In large-scale deployments this is actually taking us hours to fully scale down, with the resulting cost of unutilised machines.
This is also happening in version 1.25.
/label cluster-autoscaler
@kevinkrp93: The label(s) could not be applied. In response to this: /label cluster-autoscaler
/assign Bryce-Soghigian
Not sure this is entirely related, but I found the same bug showing up in my testing for this fix, and it goes away with the fix.
I think we can close this issue as it was addressed in my PR, and I have the cherry-picks posted (#5962) to fix this for all supported k8s versions.
The Kubernetes project currently lacks enough contributors to adequately respond to all issues. This bot triages un-triaged issues according to the following rules: after 90 days of inactivity the issue is marked stale. You can mark this issue as fresh with /remove-lifecycle stale, close it with /close, or offer to help out with Issue Triage. Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues. This bot triages un-triaged issues according to the following rules: after 30 days of inactivity since being marked stale, the issue is marked rotten. You can mark this issue as fresh with /remove-lifecycle rotten, close it with /close, or offer to help out with Issue Triage. Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs. This bot triages issues according to the following rules: after 30 days of inactivity since being marked rotten, the issue is closed. You can reopen this issue with /reopen, mark it as fresh with /remove-lifecycle rotten, or offer to help out with Issue Triage. Please send feedback to sig-contributor-experience at kubernetes/community.
/close not-planned
@k8s-triage-robot: Closing this issue, marking it as "Not Planned". In response to this: /close not-planned
Which component are you using?:
cluster-autoscaler
What version of the component are you using?:
Component version: 1.21.3
What k8s version are you using (kubectl version)?:

What environment is this in?:
AWS, EKS
What did you expect to happen?:
The unneeded since time would continue to increase until the node is either removed or becomes needed.
What happened instead?:
unneeded since dropped to 0 for many nodes at once (but not all), even though they were never determined to be needed per the logs, causing the --scale-down-unneeded-time timer to be reset.
How to reproduce it (as minimally and precisely as possible):
I'm unsure what causes this, but I know we have a fairly high churn rate on our cluster (around 300 nodes), mostly default settings with CA, and I did see a Watch on ReplicaSets close in the loop that this happened in, if that matters.
Anything else we need to know?:
I'm happy to answer more questions, but I'm unsure what else to put here. The logs are far too verbose to copy in their entirety, but I'll say this is the piece of code I'm looking at that I think might possibly be lying:
autoscaler/cluster-autoscaler/core/scaledown/unneeded/nodes.go
Line 77 in 2f1c895
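For context on why the reset matters operationally, here is a small illustrative check (not the actual CA implementation) of how --scale-down-unneeded-time gates removal; any reset of the unneeded-since timestamp restarts this wait. The 10-minute value below is the flag's documented default:

```go
package main

import (
	"fmt"
	"time"
)

// eligibleForScaleDown is an illustrative check, not the real CA code: a node
// is only a scale-down candidate once it has been continuously unneeded for at
// least scaleDownUnneededTime. Resetting unneededSince to "now" restarts the wait.
func eligibleForScaleDown(unneededSince, now time.Time, scaleDownUnneededTime time.Duration) bool {
	return now.Sub(unneededSince) >= scaleDownUnneededTime
}

func main() {
	const scaleDownUnneededTime = 10 * time.Minute // default for --scale-down-unneeded-time
	now := time.Now()

	fmt.Println(eligibleForScaleDown(now.Add(-12*time.Minute), now, scaleDownUnneededTime)) // true
	// After a spurious reset the same node has to wait the full window again.
	fmt.Println(eligibleForScaleDown(now, now, scaleDownUnneededTime)) // false
}
```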