
AWS DescribeAutoScalingGroups requests too aggressive - API limits reached #252

Closed
kylegato opened this issue Aug 23, 2017 · 26 comments
Labels: area/cluster-autoscaler, area/provider/aws, help wanted

Comments

@kylegato

kylegato commented Aug 23, 2017

Our cluster-autoscaler pods are dying frequently due to the fact that AWS is rate limiting the API calls.

In a 5 minute period, we logged over 1,509 calls in CloudTrail for "DescribeAutoScalingGroups"

Here's the error that causes cluster-autoscaler to crash: https://gist.github.com/kylegato/9e2a183eca549572ce0e0082c4381dab

This also prevents us from loading the ASG page on the console.

      - command:
        - ./cluster-autoscaler
        - --v=4
        - --stderrthreshold=info
        - --cloud-provider=aws
        - --skip-nodes-with-local-storage=false
        - --expander=least-waste
        - --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,kubernetes.io/cluster/s2-us-west-2
        env:
        - name: AWS_REGION
          value: us-west-2
        image: gcr.io/google_containers/cluster-autoscaler:v0.6.0

It's also important to note that I run two clusters with this.

Node Count (Cluster 1): 5 nodes
Node Count (Cluster 2): 50 Nodes

@MaciekPytel
Contributor

My initial guess is that this may be related to the node-group-auto-discovery feature. @kylegato, can you try running without it (configuring your node groups statically with --nodes)?

@mumoshu I suspect the PollingAutoscaler is probably making a DescribeAutoScalingGroups API call for each node group on each loop (https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/core/polling_autoscaler.go#L81). Does that sound right? How can we limit this?
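
For reference, a statically configured container spec would look roughly like the one above with the auto-discovery flag swapped for one --nodes=<min>:<max>:<asg-name> flag per node group; the ASG names and the 1:10 bounds below are placeholders:

      - command:
        - ./cluster-autoscaler
        - --v=4
        - --stderrthreshold=info
        - --cloud-provider=aws
        - --skip-nodes-with-local-storage=false
        - --expander=least-waste
        - --nodes=1:10:example-worker-asg-us-west-2a
        - --nodes=1:10:example-worker-asg-us-west-2b
        env:
        - name: AWS_REGION
          value: us-west-2
        image: gcr.io/google_containers/cluster-autoscaler:v0.6.0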

@kylegato
Author

@MaciekPytel Confirmed that using static node group definitions seems to work, I'm no longer hitting the limits and my container is not crashing.

I'll keep an eye on this issue to determine when/if it's safe to switch back to auto discovery.

Thanks!

@mumoshu
Contributor

mumoshu commented Aug 24, 2017

@MaciekPytel Thanks for your support.

Hmm, that wasn't the expected behavior, but basically I think you're correct.
We'd probably need to retain part of the whole CA state - the cache of ASGs - across CA loops.


To start by syncing up with you: CA with automatic discovery is expected to work like this:

  1. Detect the target ASGs once per scan interval.

    • This is done by calling DescribeAutoScalingGroups once per maxRecordsReturnedByAPI=100 ASGs; in other words, the API calls are batched (see the sketch after this list). Therefore, if @kylegato had fewer than 100 ASGs in total, the number of additional API calls made by the automatic discovery feature over those 5 minutes should have been 300 s / 10 s per scan * 2 clusters = 60, which is far lower than 1,509.
  2. ...And the rest of the process is the same as for CA without automatic discovery.
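
As a minimal sketch of the batched call in step 1 (illustrative only, assuming the aws-sdk-go v1 autoscaling client rather than the actual CA code):

package main

import (
	"fmt"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/autoscaling"
)

// describeAllASGs lists every ASG with one paginated request sequence: each
// page covers up to 100 groups (maxRecordsReturnedByAPI), so a cluster with
// fewer than 100 ASGs costs exactly one API call per scan.
func describeAllASGs(svc *autoscaling.AutoScaling) ([]*autoscaling.Group, error) {
	var groups []*autoscaling.Group
	input := &autoscaling.DescribeAutoScalingGroupsInput{MaxRecords: aws.Int64(100)}
	err := svc.DescribeAutoScalingGroupsPages(input,
		func(page *autoscaling.DescribeAutoScalingGroupsOutput, lastPage bool) bool {
			groups = append(groups, page.AutoScalingGroups...)
			return true // keep following NextToken
		})
	return groups, err
}

func main() {
	sess := session.Must(session.NewSession(&aws.Config{Region: aws.String("us-west-2")}))
	groups, err := describeAllASGs(autoscaling.New(sess))
	if err != nil {
		panic(err)
	}
	fmt.Printf("discovered %d ASGs\n", len(groups))
}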

So I'd say something is going wrong in step 2.

AFAIK, CA calls DescribeAutoScalingGroups multiple times per ASG per loop unless the ASGs are cached. PollingAutoscaler, on the other hand, recreates the whole CA state, throwing away the ASG cache. That would indeed result in a huge number of API calls.

This could be addressed by implementing what we discussed in the original PR for the automatic discovery feature:
basically, we somehow retain the cache of ASGs - awsCloudProvider.awsManager.autoScalingGroups - across iterations of PollingAutoscaler.
We would also need to evict old ASGs from the cache; otherwise, because the cache contains each ASG's min and max size, CA could end up unable to scale in/out against an ASG whose cached entry is outdated.
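
Very roughly, the retained cache could look like the sketch below; the type and field names are made up for illustration and are not the actual awsManager structures:

package main

import "fmt"

// cachedASG keeps the per-group limits that CA needs between loops.
type cachedASG struct {
	MinSize, MaxSize int
}

// asgCache survives across PollingAutoscaler iterations instead of being
// rebuilt from scratch on every loop.
type asgCache struct {
	groups map[string]cachedASG // keyed by ASG name
}

// refresh merges the latest discovery result into the cache: new ASGs are
// added, changed min/max sizes are picked up, and groups that no longer match
// the discovery tags are evicted so that stale limits cannot block scaling.
func (c *asgCache) refresh(discovered map[string]cachedASG) {
	for name := range c.groups {
		if _, still := discovered[name]; !still {
			delete(c.groups, name)
		}
	}
	for name, g := range discovered {
		c.groups[name] = g
	}
}

func main() {
	c := &asgCache{groups: map[string]cachedASG{"old-asg": {MinSize: 1, MaxSize: 3}}}
	c.refresh(map[string]cachedASG{"new-asg": {MinSize: 1, MaxSize: 10}})
	fmt.Println(c.groups) // map[new-asg:{1 10}]
}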

@mumoshu
Contributor

mumoshu commented Aug 24, 2017

Btw, increasing --scan-interval will also alleviate the problem.

@MaciekPytel
Contributor

@mumoshu I may be missing something, but it seems to me that the PollingAutoscaler only replaces the existing autoscaler if the config has changed. So we'd only do all the extra calls due to lost cache if the config actually changed (which should be very rare, so we probably don't care too much). I don't think the autoscaler was actually replaced recently in @kylegato's log, as there are nodes that were unneeded for 2+ minutes and the timers would reset if the autoscaler was replaced.

So my intuition would be that those calls somehow come from building a new autoscaler instance in L81 (the obvious culprit would be recreating the cloudprovider, but it may well be something else in there). So the follow-up action would be to determine where those calls are coming from and whether we can do something about them. Perhaps we could try to get a new list of node group ids without building an autoscaler object?

Finally, I don't like increasing --scan-interval as a solution to this problem as it hurts autoscaler's reaction time badly. Maybe we can just limit the rate of polling by doing it once every few loops as opposed to every loop? Not sure about that, WDYT?
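
For illustration, the rate-limited polling could be as simple as the sketch below (hypothetical names, not the actual PollingAutoscaler fields):

package main

import "fmt"

// throttledPoller runs the cheap autoscaling loop every scan interval but only
// re-discovers node groups on every Nth iteration.
type throttledPoller struct {
	pollEvery int // e.g. 6: with a 10s scan interval, discovery runs roughly once a minute
	loopCount int
}

func (p *throttledPoller) shouldPoll() bool {
	p.loopCount++
	return p.loopCount%p.pollEvery == 0
}

func main() {
	p := &throttledPoller{pollEvery: 6}
	for i := 1; i <= 12; i++ {
		if p.shouldPoll() {
			fmt.Printf("loop %d: re-discovering node groups\n", i)
		}
	}
}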

@mumoshu
Contributor

mumoshu commented Aug 25, 2017

@kylegato Btw, in case you're using kube-aws 0.9.7 with CA updated to 0.6 by hand while enabling the node draining feature, I'd recommend upgrading to 0.9.8-rc.1.
There is an improvement in the node draining feature which reduces the number of DescribeAutoScalingGroups calls.

Please see kubernetes-retired/kube-aws#782 for more information if that's the case.

@mumoshu
Contributor

mumoshu commented Aug 25, 2017

@MaciekPytel

I may be missing something, but it seems to me that the PollingAutoscaler only replaces the existing autoscaler if the config has changed

Yes, and that's done by step 1 described in my previous comment.
Step 1 happens before a replacement actually happens, in order to detect changes in the ASGs.
So I believe it is still one DescribeAutoScalingGroups call per interval in the normal case.

So the followup action would be to determine where those calls are coming from and if we can do something about them.

Although I'm still not sure what the cause could be, I agree with your suggestion.
How about logging every DescribeAutoScalingGroups call, perhaps with glog.V(5).Infof - a log level one higher than the one used for logging AWS API errors?
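
Concretely, such a logging wrapper could be as small as the sketch below (illustrative only; it assumes the aws-sdk-go autoscaling client and github.com/golang/glog, and the exact verbosity levels are an assumption):

package awsdebug

import (
	"github.com/aws/aws-sdk-go/service/autoscaling"
	"github.com/golang/glog"
)

// describeASGsLogged wraps the SDK call so that every DescribeAutoScalingGroups
// request is visible at verbosity 5, one level above the level assumed to be
// used for logging AWS API errors.
func describeASGsLogged(svc *autoscaling.AutoScaling,
	input *autoscaling.DescribeAutoScalingGroupsInput) (*autoscaling.DescribeAutoScalingGroupsOutput, error) {
	glog.V(5).Infof("DescribeAutoScalingGroups called for %d named groups", len(input.AutoScalingGroupNames))
	return svc.DescribeAutoScalingGroups(input)
}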

Perhaps we could try to get a new list of node group ids without building an autoscaler object?

Sorry if I'm not following you correctly, but AFAICS, rebuilding an autoscaler object won't trigger more DescribeAutoScalingGroups calls than described in step 1 above.

Anyway, your suggestion certainly seems feasible to me: extract the ASG auto-discovery logic from the AWS cloud provider implementation to somewhere else, and then call it from PollingAutoscaler.

@mumoshu
Contributor

mumoshu commented Aug 25, 2017

@kylegato Btw, are you sure that a considerable portion of the 1,509 calls was made by CA?
I'm still wondering whether most of these calls might have been made by the node draining feature of kube-aws, as mentioned in my previous comment.

@mumoshu
Contributor

mumoshu commented Aug 25, 2017

@MaciekPytel Regarding --scan-interval, I agree with you. Increasing it should only be done as a last resort.

@kylegato
Author

I'm running kube-aws GIT (from master) version 65cb8efbf769aa3c5fc9459f4ed37fefa11e354a

I'm assuming the calls are from CA because I don't really have much else going on in this AWS account. I can look into it in a bit and report back; I'll need to pull my logs from S3 and parse them.

@mtcode
Contributor

mtcode commented Aug 25, 2017

I've also seen quota limits exceeded on API calls on the AWS web console, especially during periods of rapid scaling. This is without auto-discovery enabled, but the scan interval is set to 15 seconds.

@mwielgus added the area/provider/aws and area/cluster-autoscaler labels on Aug 30, 2017
@bismarck

bismarck commented Sep 18, 2017

@mumoshu @MaciekPytel I'm also seeing this issue with automatic discovery enabled.

Here's what I'm seeing in my CloudTrail logs for DescribeAutoScalingGroups requests over a 24-hour period:

[screenshot: DescribeAutoScalingGroups request counts]

During the same period, here are the counts of "Regenerating ASG information for" messages in the CA log:

[screenshot: "Regenerating ASG information for" message counts]

I noticed the issue starts about an hour after starting CA, which makes me think the problem is in this cache refresh code.

Here's a snippet from my CA logs right before the hour expires:
CA-Issue.txt

@mtcode
Contributor

mtcode commented Sep 18, 2017

Commit f04113d has definitely helped with this, but there is still some throttling of API calls by AWS.

@MaciekPytel added the help wanted label on Oct 18, 2017
@MaciekPytel
Contributor

I've closed #400 after @bismarck pointed out that it's a duplicate of this one. I'm reasonably confident that cache-refresh goroutines leaked by PollingAutoscaler are the root cause here. It explains the gradual buildup of API calls observed by @bismarck very well, and reading the code seems to confirm it.

@mumoshu Do you want to take a shot at fixing this?

@mumoshu
Contributor

mumoshu commented Oct 20, 2017

@MaciekPytel Thanks for reaching out, but I'm afraid I can't work on it anytime soon. However, please assign it to me if you're also short on time; I'll try to manage it.

@MaciekPytel
Contributor

I don't use aws at all, so I can't fix it. I can review and provide any other help if someone else wants to take a shot at this.

@mmerrill3

mmerrill3 commented Oct 20, 2017

Looking at this with @bismarck, we noted that the autoscaler (including the AWS manager and auto_scaling_groups with its goroutine) is being recreated on every poll. This results in a goroutine that runs forever being created every 10 seconds, since our scan interval is 10 seconds. The goroutine is there to refresh the cached ASG information every hour.
I think the goroutine's function (creating a fresh view of the ASGs in AWS) should be exposed as a separate method in auto_scaling_groups.go. Users of auto_scaling_groups could then decide when to refresh the cache, and polling_autoscaler.go wouldn't have to create a new autoscaler every poll interval; it could just refresh the cache.
The static_autoscaler would then have to keep track of when to call the new cache refresh method, probably in its own long-running goroutine.
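
A stripped-down sketch of the leak and of the proposed change (the names below are illustrative, not the real auto_scaling_groups.go API):

package main

import (
	"fmt"
	"time"
)

// asgRegistry stands in for the cached ASG state in auto_scaling_groups.go.
type asgRegistry struct{ lastRefresh time.Time }

func (r *asgRegistry) regenerate() { r.lastRefresh = time.Now() }

// Current pattern: every rebuild of the autoscaler constructs a new registry
// and starts another hourly refresh goroutine that is never stopped. With a
// 10s scan interval that is 360 new goroutines per hour, each making its own
// DescribeAutoScalingGroups calls.
func newLeakyRegistry() *asgRegistry {
	r := &asgRegistry{}
	go func() {
		for range time.Tick(time.Hour) {
			r.regenerate()
		}
	}()
	return r
}

// Proposed shape: no background goroutine; callers such as polling_autoscaler
// or static_autoscaler decide when to refresh.
func (r *asgRegistry) Refresh(maxAge time.Duration) {
	if time.Since(r.lastRefresh) > maxAge {
		r.regenerate()
	}
}

func main() {
	r := &asgRegistry{}  // the goroutine-free form
	r.Refresh(time.Hour) // first call refreshes immediately (zero lastRefresh)
	fmt.Println("last refresh:", r.lastRefresh)
}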

@MaciekPytel
Contributor

The cache itself is an implementation detail of a given cloudprovider and I'd rather not have static_autoscaler know about it. However, I'm thinking about adding some sort of a general purpose refresh method to CloudProvider that would be called once per loop. The implementation of such method could maintain some internal counter and refresh the cache every N loops or after M minutes have elapsed since last refresh or whatever other internal mechanism makes sense in this case.
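
A rough sketch of that shape, with illustrative names rather than the actual CloudProvider interface:

package main

import "fmt"

// Refresher is the general-purpose hook: the core loop calls Refresh() once per
// iteration and stays ignorant of any provider-side caching.
type Refresher interface {
	Refresh() error
}

// awsProviderSketch decides internally how often to actually hit the AWS API,
// here with a simple loop counter (a time-since-last-refresh check would work
// just as well).
type awsProviderSketch struct {
	loopsUntilRefresh int
	refreshEveryLoops int // e.g. 360 loops ≈ one hour at a 10s scan interval
}

func (p *awsProviderSketch) Refresh() error {
	if p.loopsUntilRefresh > 0 {
		p.loopsUntilRefresh--
		return nil // called every loop, but a no-op most of the time
	}
	p.loopsUntilRefresh = p.refreshEveryLoops
	// ...re-discover ASGs and rebuild the internal cache here...
	return nil
}

func main() {
	var r Refresher = &awsProviderSketch{refreshEveryLoops: 360}
	fmt.Println(r.Refresh()) // first call refreshes; the next 360 are no-ops
}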

PollingAutoscaler was implemented a while ago, when StaticAutoscaler couldn't handle NodeGroups being added or removed. So it recreates the whole StaticAutoscaler when the config changes (unfortunately it also creates a temporary one to check whether the config has changed). However, we've recently updated StaticAutoscaler and it's almost ready to handle NodeGroups being dynamically added or removed at the cloudprovider level (this is how a GKE equivalent of PollingAutoscaler will work). I think the best long-term solution is to reimplement the PollingAutoscaler functionality using the same approach, but I would wait 2 more weeks or so until we finish all the necessary prerequisites.

Since my long-term idea would require a large-ish refactor I'm ok to review and merge any reasonable short-term fix for this specific issue.

@mumoshu
Contributor

mumoshu commented Oct 31, 2017

In case you missed it, the fix is ongoing in #422 by @mmerrill3!

@mwielgus
Contributor

mwielgus commented Nov 3, 2017

Fixed in #422 and cherrypicked to 1.0.x in #445. Closing.

@johanneswuerbach
Contributor

We are running v1.0.2 and are still seeing various autoscaler failures and crashes due to Throttling: Rate exceeded status code: 400.

@mumoshu
Contributor

mumoshu commented Nov 16, 2017

@johanneswuerbach Would you mind sharing with us:

(1) The number of ASGs in your cluster
(2) The number of ASGs in your AWS account
(3) When CA seems to crash more often - scaling up, scaling down, etc.
(4) Whether the number of API calls still increases gradually, as we've seen before

At this point, as far as I can see, there is only one problematic part left in the code base.

In case you're already calling the AWS API a lot from other systems, perhaps CA should be improved to simply retry the API call with back-off when rate limited.
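
For illustration, a sketch of that kind of back-off (the five-attempt doubling policy is an arbitrary example; in practice the SDK's built-in retryer, tuned via aws.Config.MaxRetries, does something similar):

package awsretry

import (
	"time"

	"github.com/aws/aws-sdk-go/aws/awserr"
	"github.com/aws/aws-sdk-go/service/autoscaling"
)

// describeWithBackoff retries a throttled DescribeAutoScalingGroups call with
// exponential back-off instead of letting the caller crash on it.
func describeWithBackoff(svc *autoscaling.AutoScaling,
	input *autoscaling.DescribeAutoScalingGroupsInput) (*autoscaling.DescribeAutoScalingGroupsOutput, error) {
	delay := time.Second
	for attempt := 0; ; attempt++ {
		out, err := svc.DescribeAutoScalingGroups(input)
		if err == nil || attempt >= 4 {
			return out, err
		}
		// Only back off on throttling ("Throttling: Rate exceeded"); surface
		// anything else immediately.
		if aerr, ok := err.(awserr.Error); !ok || aerr.Code() != "Throttling" {
			return out, err
		}
		time.Sleep(delay)
		delay *= 2
	}
}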

@johanneswuerbach
Contributor

johanneswuerbach commented Nov 16, 2017

We currently have 5 clusters with 6 ASGs each (3 for HA masters, 3 for nodes in each AZ) so 30 Auto Scaling Groups in total.

Each of those clusters is running an autoscaler instance.

This is the smaller of our two accounts (we see crashes in both) and it generally doesn't have any scaling activity (it's our staging env).

Crashes mostly seem to occur at this line, which fails with Throttling: Rate exceeded:

return nil, fmt.Errorf("Failed to get ASGs: %v", err)

We currently don't feed AWS api calls into our log aggregation system, but I will try to get those numbers.

@bismarck

I just checked CA 1.0.2 on our cluster (which has a similar setup to @johanneswuerbach's) and we haven't seen a crash yet. Here are the counts of "Regenerating ASG information for" messages in the CA log over a 24-hour period:

[screenshot: "Regenerating ASG information for" message counts on CA 1.0.2]

This looks like what was expected after the fix. I still need to check my CloudTrail logs for DescribeAutoScalingGroups requests.

@mumoshu
Contributor

mumoshu commented Nov 19, 2017

@johanneswuerbach Thanks for sharing the details! Unfortunately, I can't figure anything out from them right now.

Would you mind sharing the full error message and what you are passing to --node-group-auto-discovery?

At least in theory, the describe-tags calls shouldn't be numerous enough to trigger a rate limit, so I suspect some pattern or combination of tags is triggering it.

@johanneswuerbach
Contributor

@mumoshu I'm passing --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,kubernetes.io/cluster/<name> and full command:

            - ./cluster-autoscaler
            - --stderrthreshold=info
            - --cloud-provider=aws
            - --skip-nodes-with-local-storage=false
            - --expander=least-waste
            - --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,kubernetes.io/cluster/<name>
            - --balance-similar-node-groups
            - --scale-down-utilization-threshold=0.7

I'm seeing the following two kinds of errors:

E1120 06:39:32.367506       1 static_autoscaler.go:128] Failed to update node registry: Throttling: Rate exceeded
	status code: 400, request id: 90f99c8a-cdbd-11e7-891a-590162651f15

followed by a crash:

I1120 06:09:02.367838       1 leaderelection.go:174] attempting to acquire leader lease...
I1120 06:09:02.384670       1 leaderelection.go:184] successfully acquired lease kube-system/cluster-autoscaler
F1120 06:33:06.092320       1 cloud_provider_builder.go:111] Failed to create AWS cloud provider: Failed to get ASGs: Throttling: Rate exceeded
	status code: 400, request id: aac62ecf-cdbc-11e7-ab1b-8be915c8bac1
goroutine 77 [running]:
k8s.io/autoscaler/cluster-autoscaler/vendor/github.com/golang/glog.stacks(0xc420ab4300, 0xc420f64780, 0xd4, 0x1e0)
	/gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/github.com/golang/glog/glog.go:766 +0xa7
k8s.io/autoscaler/cluster-autoscaler/vendor/github.com/golang/glog.(*loggingT).output(0x518d4e0, 0xc400000003, 0xc420c1e000, 0x4e0ea3f, 0x19, 0x6f, 0x0)
	/gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/github.com/golang/glog/glog.go:717 +0x348
k8s.io/autoscaler/cluster-autoscaler/vendor/github.com/golang/glog.(*loggingT).printf(0x518d4e0, 0x3, 0x3434a38, 0x27, 0xc4218a1080, 0x1, 0x1)
	/gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/github.com/golang/glog/glog.go:655 +0x14f
k8s.io/autoscaler/cluster-autoscaler/vendor/github.com/golang/glog.Fatalf(0x3434a38, 0x27, 0xc4218a1080, 0x1, 0x1)
	/gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/github.com/golang/glog/glog.go:1145 +0x67
k8s.io/autoscaler/cluster-autoscaler/cloudprovider/builder.CloudProviderBuilder.Build(0x7ffc1e022686, 0x3, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/gopath/src/k8s.io/autoscaler/cluster-autoscaler/cloudprovider/builder/cloud_provider_builder.go:111 +0x756
k8s.io/autoscaler/cluster-autoscaler/core.NewAutoscalingContext(0xa, 0x3fe6666666666666, 0x8bb2c97000, 0x1176592e000, 0x0, 0x4e200, 0x0, 0x186a00000, 0x0, 0x7ffc1e0226e3, ...)
	/gopath/src/k8s.io/autoscaler/cluster-autoscaler/core/autoscaling_context.go:147 +0x466
k8s.io/autoscaler/cluster-autoscaler/core.NewStaticAutoscaler(0xa, 0x3fe6666666666666, 0x8bb2c97000, 0x1176592e000, 0x0, 0x4e200, 0x0, 0x186a00000, 0x0, 0x7ffc1e0226e3, ...)
	/gopath/src/k8s.io/autoscaler/cluster-autoscaler/core/static_autoscaler.go:55 +0x14d
k8s.io/autoscaler/cluster-autoscaler/core.(*AutoscalerBuilderImpl).Build(0xc42090cb60, 0x0, 0x0, 0xc4218a1b38, 0x1107e60)
	/gopath/src/k8s.io/autoscaler/cluster-autoscaler/core/autoscaler_builder.go:71 +0x10e
k8s.io/autoscaler/cluster-autoscaler/core.(*PollingAutoscaler).Poll(0xc4219e28a0, 0x4, 0xed1a46e1d)
	/gopath/src/k8s.io/autoscaler/cluster-autoscaler/core/polling_autoscaler.go:81 +0x5a
k8s.io/autoscaler/cluster-autoscaler/core.(*PollingAutoscaler).RunOnce(0xc4219e28a0, 0xed1a46e1d, 0xe2fd62e5e, 0x518d0a0, 0x518d0a0, 0x0)
	/gopath/src/k8s.io/autoscaler/cluster-autoscaler/core/polling_autoscaler.go:68 +0x76
main.run(0xc420a02870)
	/gopath/src/k8s.io/autoscaler/cluster-autoscaler/main.go:262 +0x46f
main.main.func2(0xc420bd4540)
	/gopath/src/k8s.io/autoscaler/cluster-autoscaler/main.go:344 +0x2a
created by k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/leaderelection.(*LeaderElector).Run
	/gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/leaderelection/leaderelection.go:145 +0x97

Could it be the k8s.io/cluster-autoscaler/enabled tag, which is shared by all clusters?
