Retry logic around describe-cluster doesn't handle rate-limiting #999

cartermckinnon · 2022-08-18T20:16:59Z

(relayed from an internal ticket)

What happened:

aws eks wait cluster-active may get rate-limited (TooManyRequestsException) and cause the bootstrap script to terminate, instead of falling back to the retry logic around aws eks describe-cluster.

What you expected to happen:

The describe-cluster call should be retried the desired number of times, despite rate-limiting errors.
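A minimal sketch of the expected behavior, assuming a generic retry wrapper. The function name `retry_with_backoff`, its arguments, and the attempt and delay values are hypothetical and are not taken from the bootstrap script. The wrapper retries any failing command, so a rate-limited call is retried like any other error instead of ending the script.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from bootstrap.sh): retry a command with exponential backoff.
# Usage: retry_with_backoff MAX_ATTEMPTS INITIAL_DELAY_SECONDS COMMAND [ARGS...]
retry_with_backoff() {
  local max_attempts="$1"
  local delay="$2"
  shift 2
  local attempt=1
  while true; do
    # Running the command as an `if` condition keeps `set -o errexit`
    # from terminating the script when the command fails.
    if "$@"; then
      return 0
    fi
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "giving up after ${attempt} attempts: $*" >&2
      return 1
    fi
    echo "attempt ${attempt} failed, retrying in ${delay}s: $*" >&2
    sleep "$delay"
    attempt=$(( attempt + 1 ))
    delay=$(( delay * 2 ))
  done
}

# Example (CLUSTER_NAME is an assumed variable name):
# retry_with_backoff 5 2 aws eks wait cluster-active --name "$CLUSTER_NAME"
```

This sketch retries on every non-zero exit status. A stricter version could inspect stderr and retry only on TooManyRequestsException or connection timeouts.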

orirawlings · 2024-03-20T16:11:59Z

@cartermckinnon Whatever happened with #1004?

We're facing a similar issue where aws eks wait cluster-active fails due to a transient timeout with the AWS API and then a node gets stuck without joining the cluster (which has other knock-on effects, wedging cluster-autoscaler).

2024-03-20T15:00:46+0000 [eks-bootstrap] INFO: --b64-cluster-ca or --apiserver-endpoint is not defined, describing cluster...

Connect timeout on endpoint URL: "https://eks.us-west-2.amazonaws.com/clusters/eks-prod-us-west-2"
Exited with error on line 358

It seems like the patch in #1004 would fix our problem, but it appears it was closed after sitting for a long time.

cartermckinnon · 2024-06-21T19:04:53Z

The best thing to do here is to pass --apiserver-endpoint and --b64-cluster-ca and avoid the DescribeCluster call entirely. This fallback mechanism has been removed in our AL2023 AMIs.

I'll see if we can reboot the PR, in any case.
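For illustration, the recommendation above to pass both flags might look like the following in node user data. This is a sketch under stated assumptions: the cluster name, endpoint, and CA values are placeholders, and the script path is the usual location on the AL2 EKS AMI, so verify it for your image.

```shell
#!/usr/bin/env bash
# Placeholder values (assumptions); substitute the real values for your cluster.
CLUSTER_NAME="my-cluster"
APISERVER_ENDPOINT="https://EXAMPLE.gr7.us-west-2.eks.amazonaws.com"
B64_CLUSTER_CA="BASE64_ENCODED_CA_CERT"

# Assumed location of the bootstrap script on the AL2 EKS AMI.
BOOTSTRAP_SCRIPT="/etc/eks/bootstrap.sh"

# With both flags supplied, the script has no need to describe the cluster.
set -- "$CLUSTER_NAME" \
  --apiserver-endpoint "$APISERVER_ENDPOINT" \
  --b64-cluster-ca "$B64_CLUSTER_CA"

if [ -x "$BOOTSTRAP_SCRIPT" ]; then
  "$BOOTSTRAP_SCRIPT" "$@"
else
  # Outside an EKS node, only show what would be executed.
  echo "would run: $BOOTSTRAP_SCRIPT $*"
fi
```

Both values can be looked up once at provisioning time, for example with aws eks describe-cluster and the query paths cluster.endpoint and cluster.certificateAuthority.data, and then templated into the user data so that nodes make no EKS API call at boot.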

cartermckinnon self-assigned this Aug 18, 2022

cartermckinnon added the bug Something isn't working label Aug 18, 2022

cartermckinnon mentioned this issue Aug 23, 2022

Handle errors while waiting for cluster to be active. #1004

Closed
