AWS DescribeAutoScalingGroups requests too aggressive - API limits reached #252
Comments
My initial guess is that it may be related to the node-group-auto-discovery feature. @kylegato, can you try running without it (configuring your node groups statically with --nodes)? @mumoshu, I suspect the PollingAutoscaler is doing a DescribeAutoScalingGroups API call for each node group on each loop (https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/core/polling_autoscaler.go#L81). Does that sound right? How can we limit this?
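What that per-loop pattern would look like, as a minimal standalone sketch (the ASG names, loop structure, and interval here are assumptions for illustration, not the actual PollingAutoscaler code):

```go
// Sketch of the suspected call pattern: one DescribeAutoScalingGroups
// request per node group on every polling interval, which adds up quickly
// against AWS API limits when there are many ASGs and a short scan interval.
package main

import (
	"fmt"
	"time"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/autoscaling"
)

func main() {
	svc := autoscaling.New(session.Must(session.NewSession()))
	asgNames := []string{"nodes-a", "nodes-b", "nodes-c"} // placeholder ASG names

	for range time.Tick(10 * time.Second) { // roughly --scan-interval
		for _, name := range asgNames {
			// One API call per node group, per loop.
			out, err := svc.DescribeAutoScalingGroups(&autoscaling.DescribeAutoScalingGroupsInput{
				AutoScalingGroupNames: []*string{aws.String(name)},
			})
			if err != nil {
				fmt.Println("describe failed:", err)
				continue
			}
			fmt.Println(name, "groups returned:", len(out.AutoScalingGroups))
		}
	}
}
```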
@MaciekPytel Confirmed that using static node group definitions seems to work; I'm no longer hitting the limits and my container is not crashing. I'll keep an eye on this issue to determine when/if it's safe to switch back to auto-discovery. Thanks!
@MaciekPytel Thanks for your support. Hmm, this wasn't the expected behavior, but I believe you're basically correct. Just to sync up with you first, CA with automatic discovery is expected to work like this:
So, I'd say that something wrong is happening in step 2. AFAIK, CA calls It could be done with an implementation of what we discussed in the original PR of the automatic discovery feature.
Btw, increasing
@mumoshu I may be missing something, but it seems to me that the PollingAutoscaler only replaces the existing autoscaler if the config has changed. So we'd only do all the extra calls due to the lost cache if the config actually changed (which should be very rare, so we probably don't care too much). I don't think the autoscaler was actually replaced recently in @kylegato's log, as there are nodes that were unneeded for 2+ minutes and the timers would reset if the autoscaler were replaced. So my intuition is that those calls somehow come from building a new autoscaler instance in L81 (the obvious culprit would be recreating the cloudprovider, but it may well be something else in there). The follow-up action would be to determine where those calls are coming from and whether we can do something about them. Perhaps we could try to get a new list of node group ids without building an autoscaler object? Finally, I don't like increasing
@kylegato Btw, in case you're using kube-aws 0.9.7 with CA updated to 0.6 by hand while enabling the node draining feature, I'd recommend upgrading to 0.9.8-rc.1. Please see kubernetes-retired/kube-aws#782 for more information if that's the case.
Yes, and that's done by step 1 described in my previous comment.
Although I'm still not sure what the cause could be, I agree with your suggestion.
Sorry if I'm not following you correctly, but AFAICS, rebuilding an autoscaler object won't trigger Anyway, your suggestion certainly seems feasible to me: extract the ASG auto-discovery logic from the AWS cloud provider implementation to somewhere else, and then call it from the PollingAutoscaler.
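A rough sketch of that idea, using assumed helper names rather than the real cluster-autoscaler code: discover the ASG names by tag with a single DescribeTags call and compare the result with the previously discovered list, so the PollingAutoscaler could detect node-group changes without constructing a whole new autoscaler object.

```go
// Sketch: tag-based ASG discovery separated from the full autoscaler build.
package main

import (
	"fmt"
	"reflect"
	"sort"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/autoscaling"
)

// discoverASGNames returns the names of all ASGs carrying the given tag key.
func discoverASGNames(svc *autoscaling.AutoScaling, tagKey string) ([]string, error) {
	out, err := svc.DescribeTags(&autoscaling.DescribeTagsInput{
		Filters: []*autoscaling.Filter{
			{Name: aws.String("key"), Values: []*string{aws.String(tagKey)}},
		},
	})
	if err != nil {
		return nil, err
	}
	names := []string{}
	for _, t := range out.Tags {
		names = append(names, aws.StringValue(t.ResourceId))
	}
	sort.Strings(names)
	return names, nil
}

func main() {
	svc := autoscaling.New(session.Must(session.NewSession()))
	previous := []string{} // names discovered on the last poll

	// Placeholder tag key; the real discovery tag is configurable.
	current, err := discoverASGNames(svc, "k8s.io/cluster-autoscaler/enabled")
	if err != nil {
		fmt.Println("discovery failed:", err)
		return
	}
	if !reflect.DeepEqual(previous, current) {
		fmt.Println("node groups changed, rebuild the autoscaler:", current)
	} else {
		fmt.Println("no change, keep the existing autoscaler")
	}
}
```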
@kylegato Btw, are you sure that a considerable portion of the 1,509 calls is made by CA?
@MaciekPytel Regarding the
I'm running kube-aws from Git master, version 65cb8efbf769aa3c5fc9459f4ed37fefa11e354a. I'm assuming the calls come from CA because I don't really have much else going on in this AWS account. I can look into it in a bit and report back; I'll need to pull my logs from S3 and parse them.
I've also seen API rate limits exceeded in the AWS web console, especially during periods of rapid scaling. This is without auto-discovery enabled, but with the scan interval set to 15 seconds.
@mumoshu @MaciekPytel I'm also seeing this issue with automatic discovery enabled. Here's what I'm seeing in my CloudTrail logs for DescribeAutoScalingGroups requests in a 24-hour period: During the same period, here are the counts of "Regenerating ASG information for" messages in the CA log: I noticed the issue starts an hour after starting the CA, which makes me think the problem is with this cache refresh code. Here's a snippet from my CA logs right before the hour expires:
Commit f04113d has definitely helped with this, but there is still some throttling of API calls by AWS.
I've closed #400 after @bismarck pointed out it's a duplicate of this. I'm reasonably confident that cache-refresh goroutines leaked by the PollingAutoscaler are the root cause here. It explains the gradual buildup of API calls observed by @bismarck very well, and reading the code seems to confirm it. @mumoshu Do you want to take a shot at fixing this?
@MaciekPytel Thanks for reaching out, but I'm afraid I can't work on it right away. However, please assign it to me if you're also out of time; I'll try to manage it.
I don't use AWS at all, so I can't fix it. I can review and provide any other help if someone else wants to take a shot at this.
Looking at this with @bismarck, we noted that the autoscaler (including the AWS manager and auto_scaling_groups with its goroutine) is being recreated on every poll. This results in a goroutine (that runs forever) being created every 10 seconds, since our scan interval is 10 seconds. The goroutine is there to refresh the cached ASG information every hour.
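A minimal sketch of that leak pattern under assumed names (illustrative only, not the actual cluster-autoscaler code): every poll rebuilds a cache whose constructor starts an hourly refresh goroutine that is never stopped, so the goroutines and their hourly API calls pile up over time.

```go
// Sketch of the goroutine leak described above.
package main

import (
	"fmt"
	"time"
)

type asgCache struct{ id int }

// newASGCache mimics a constructor that spawns an hourly refresh goroutine
// with no way to stop it.
func newASGCache(id int) *asgCache {
	c := &asgCache{id: id}
	go func() {
		for range time.Tick(time.Hour) {
			// In the real provider this would re-describe all ASGs.
			fmt.Printf("cache %d: regenerating ASG information\n", c.id)
		}
	}()
	return c
}

func main() {
	// The polling loop recreates the whole autoscaler (and thus the cache)
	// every scan interval, leaking one refresh goroutine per iteration.
	for i := 0; ; i++ {
		_ = newASGCache(i)
		time.Sleep(10 * time.Second) // roughly --scan-interval
	}
}
```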
The cache itself is an implementation detail of a given cloudprovider and I'd rather not have static_autoscaler know about it. However, I'm thinking about adding some sort of general-purpose refresh method to CloudProvider that would be called once per loop. The implementation of such a method could maintain an internal counter and refresh the cache every N loops, or after M minutes have elapsed since the last refresh, or whatever other internal mechanism makes sense in this case.

PollingAutoscaler was implemented a while ago, when StaticAutoscaler couldn't handle adding or removing NodeGroups. So it recreates the whole StaticAutoscaler when the config changes (unfortunately, it also creates a temporary one to check whether the config has changed). However, we've recently updated StaticAutoscaler and it's almost ready to handle NodeGroups being dynamically added or removed at the cloudprovider level (this is how a GKE equivalent of PollingAutoscaler will work). I think the best long-term solution is to reimplement the PollingAutoscaler functionality using the same approach, but I would wait 2 more weeks or so until we finish all the necessary prerequisites. Since my long-term idea would require a large-ish refactor, I'm OK to review and merge any reasonable short-term fix for this specific issue.
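A sketch of that refresh hook, with assumed method and field names rather than the real CloudProvider interface: the main loop calls Refresh() once per iteration, and the provider decides internally whether enough time has passed to rebuild its ASG cache.

```go
// Sketch of a per-loop Refresh hook with a time-based internal cache policy.
package main

import (
	"fmt"
	"time"
)

type CloudProvider interface {
	// Refresh is called once per autoscaler loop; implementations may no-op.
	Refresh() error
}

type awsProvider struct {
	lastRefresh time.Time
	interval    time.Duration
}

func (p *awsProvider) Refresh() error {
	if time.Since(p.lastRefresh) < p.interval {
		return nil // cache still fresh, no API calls this loop
	}
	p.lastRefresh = time.Now()
	fmt.Println("regenerating ASG information (one batch of describe calls)")
	return nil
}

func main() {
	var provider CloudProvider = &awsProvider{interval: time.Hour}
	for range time.Tick(10 * time.Second) { // main autoscaler loop
		if err := provider.Refresh(); err != nil {
			fmt.Println("refresh failed:", err)
		}
		// ...rest of the scale-up/scale-down logic...
	}
}
```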
In case you missed it, a fix is in progress at #422 by @mmerrill3!
We are running v1.0.2 and are still seeing various autoscaler failures and crashes due to
@johanneswuerbach Would you mind sharing with us: (1) the number of ASGs you have in your cluster At this point, the problematic part left in the code base seems to me to be: only. In case you're already calling the AWS API a lot from other systems, perhaps CA should be improved to just retry the API call with back-off when rate limited.
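A minimal sketch of that retry-with-back-off idea (assumed function names, not the actual cluster-autoscaler code): wrap the describe call and back off exponentially on throttling errors instead of letting the process crash.

```go
// Sketch: exponential back-off around a throttled AWS call.
package main

import (
	"fmt"
	"strings"
	"time"

	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/autoscaling"
)

func describeWithBackoff(svc *autoscaling.AutoScaling, in *autoscaling.DescribeAutoScalingGroupsInput) (*autoscaling.DescribeAutoScalingGroupsOutput, error) {
	backoff := time.Second
	for attempt := 0; attempt < 5; attempt++ {
		out, err := svc.DescribeAutoScalingGroups(in)
		if err == nil {
			return out, nil
		}
		// "Throttling: Rate exceeded" is the error reported in this issue.
		if !strings.Contains(err.Error(), "Throttling") {
			return nil, err
		}
		time.Sleep(backoff)
		backoff *= 2 // 1s, 2s, 4s, 8s, ...
	}
	return nil, fmt.Errorf("still throttled after retries")
}

func main() {
	svc := autoscaling.New(session.Must(session.NewSession()))
	out, err := describeWithBackoff(svc, &autoscaling.DescribeAutoScalingGroupsInput{})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("ASGs:", len(out.AutoScalingGroups))
}
```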
We currently have 5 clusters with 6 ASGs each (3 for HA masters, 3 for nodes in each AZ), so 30 Auto Scaling Groups in total. Each of those clusters is running an autoscaler instance. This is the smaller of our two accounts (we see crashes in both) and it generally doesn't have any scaling activity (it's our staging env). Crashes seem to mostly occur with:

```
Throttling: Rate exceeded
```
We currently don't feed AWS API calls into our log aggregation system, but I will try to get those numbers.
I just checked CA 1.0.2 on our cluster (which has a similar setup to @johanneswuerbach's) and we haven't seen a crash yet. Here are the counts of "Regenerating ASG information for" messages in the CA log for a 24-hour period: This looks like what was expected after the fix. I still need to check my CloudTrail logs for DescribeAutoScalingGroups requests.
@johanneswuerbach Thanks for sharing the details! It seems like I'm unable to figure out anything right now. Would you mind sharing with me the full error message and what you are passing to At least in theory, the describe-tags calls shouldn't be numerous enough to trigger a rate limit, so I suspect there's some pattern or combination of tags that triggers it.
@mumoshu I'm passing
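The exact flag value was not preserved when this thread was exported. Purely as an illustration (the tag keys below are placeholders, not necessarily what was used here), the auto-discovery flag on AWS generally takes a list of ASG tag keys in this form:

```
--node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/<cluster-name>
```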
I'm seeing the following two kinds of stack traces:
followed by a crash:
Could it be the
Our cluster-autoscaler pods are dying frequently because AWS is rate-limiting the API calls.
In a 5-minute period, we logged over 1,509 calls in CloudTrail for "DescribeAutoScalingGroups".
Here's the error that causes cluster-autoscaler to crash: https://gist.github.com/kylegato/9e2a183eca549572ce0e0082c4381dab
This also prevents us from loading the ASG page on the console.
It's also important to note that I run two clusters with this setup.
Node count (Cluster 1): 5 nodes
Node count (Cluster 2): 50 nodes
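For anyone wanting to reproduce that measurement, here is one way to count the events. This is a standalone sketch using the CloudTrail LookupEvents API, not part of cluster-autoscaler:

```go
// Sketch: count DescribeAutoScalingGroups events in CloudTrail for the last
// 5 minutes, to verify how much API traffic the autoscaler is generating.
package main

import (
	"fmt"
	"time"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/cloudtrail"
)

func main() {
	svc := cloudtrail.New(session.Must(session.NewSession()))
	end := time.Now()
	start := end.Add(-5 * time.Minute)

	count := 0
	input := &cloudtrail.LookupEventsInput{
		StartTime: aws.Time(start),
		EndTime:   aws.Time(end),
		LookupAttributes: []*cloudtrail.LookupAttribute{{
			AttributeKey:   aws.String("EventName"),
			AttributeValue: aws.String("DescribeAutoScalingGroups"),
		}},
	}
	// Page through all matching events and count them.
	err := svc.LookupEventsPages(input, func(page *cloudtrail.LookupEventsOutput, last bool) bool {
		count += len(page.Events)
		return true
	})
	if err != nil {
		fmt.Println("lookup failed:", err)
		return
	}
	fmt.Printf("DescribeAutoScalingGroups events in the last 5 minutes: %d\n", count)
}
```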