AWS rate limiting causes tasks to fail #23475
Thanks for opening your first issue here! Be sure to follow the issue template!
Feel free! Assigned you.
As a workaround, set environment variables for the Airflow service so boto3 uses a more "pessimistic" retry strategy rather than the default one.
The main benefit of this approach is that it affects all AWS operators, secrets backends, hooks, and anything else that uses boto3, unless a retry configuration is explicitly set on a specific AWS connection.
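A minimal sketch of that workaround, assuming the standard boto3 environment variables (`AWS_RETRY_MODE`, `AWS_MAX_ATTEMPTS`); the values here are illustrative, not recommended defaults:

```python
# Hedged sketch: boto3 reads these standard environment variables, so setting
# them in the environment of the Airflow service process changes the retry
# behaviour for every hook/operator that uses boto3.
import os

os.environ["AWS_RETRY_MODE"] = "adaptive"  # one of: legacy, standard, adaptive
os.environ["AWS_MAX_ATTEMPTS"] = "10"      # raise the retry ceiling (illustrative value)
```

In a real deployment these would be exported in the service's environment (e.g. the scheduler/worker unit file or container spec) rather than set from Python.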
I think that's a good general solution that solves the problem (and it does not need any other fix except maybe updating the docs). Thanks @Taragolis.
@SamLynnEvans would you like to make a PR to the AWS docs to mention that as a solution?
Yes, that seems simpler, thank you!
@potiuk I could mention the built-in backoff retry in boto3 and how to configure it in the docs. Even though this task is closed, I also want to check and change the default poll_interval / sleep_interval for some of the AWS operators, because in most cases 5-6 seconds is too optimistic and causes API throttling.
No problem. Feel free to add a PR. There is no need to open issues for such changes; a PR is enough.
Apache Airflow Provider(s)
amazon
Versions of Apache Airflow Providers
All versions
Apache Airflow version
2.3.0 (latest released)
Operating System
linux
Deployment
Astronomer
Deployment details
No response
What happened
When using the AWS glue crawler operator, if the sensing stage gets a 400 error with "Rate exceeded" message then the glue crawler operator fails.
The problem is that this can occur with as few as 10 crawler sensors running concurrently with a poll_interval of 60s.
You can set retries and exponential_backoff on the operators; however, if you retry the glue crawler operator it will then fail because the crawler has already been started. In essence, this means that if you use this operator and get rate limited, you cannot retry, and you end up with brittle pipelines.
This rate limiting issue happens with all other AWS operators I have worked with, too.
What you think should happen instead
This is the kind of error that the operator receives when it pings AWS for a status update:
"An error occurred (ThrottlingException) when calling the GetCrawler operation (reached max retries: 4): Rate exceeded"
I think this error should be handled instead of it causing the task to fail.
I have implemented a custom operator which catches this particular exception. I would be happy to try and submit a PR myself for the issue.
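A minimal sketch of that idea (an assumed shape, not the reporter's actual custom operator): retry the throttled status poll with exponential backoff instead of letting the task fail. Here `get_state` stands in for a boto3 call such as `get_crawler`, and `RuntimeError` stands in for botocore's `ClientError`.

```python
# Illustrative only: back off and re-poll on "Rate exceeded" instead of failing.
import time

def poll_with_backoff(get_state, max_retries=5, base_delay=1.0):
    """Call get_state(), backing off exponentially on throttling errors."""
    for attempt in range(max_retries):
        try:
            return get_state()
        except RuntimeError as e:  # stand-in for botocore.exceptions.ClientError
            # Re-raise anything that is not throttling, or the final attempt.
            if "Rate exceeded" not in str(e) or attempt == max_retries - 1:
                raise
            time.sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
```

A real implementation would inspect `e.response["Error"]["Code"]` for `ThrottlingException` rather than matching on the message string.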
How to reproduce
As this is a problem of being rate limited by AWS, you will need an AWS account set up with some crawlers.
Create a DAG that uses the glue crawler operator linked to above. Have somewhere between 10 and 20 of these set up to run glue crawlers in an AWS environment.
Anything else
This problem occurs without fail every time we try to run 12 crawler operators at once (each with a poll_interval of 60s).
Are you willing to submit PR?
Code of Conduct