AWS rate limiting causes tasks to fail #23475
Thanks for opening your first issue here! Be sure to follow the issue template!
Feel free! Assigned you.
As a workaround, set environment variables for the Airflow service so boto3 uses a more "pessimistic" retry strategy rather than the default one.
The main benefit of this approach is that it affects all AWS operators, secrets backends, hooks, and anything else that uses boto3, unless a retry configuration is explicitly set on a specific AWS connection.
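A minimal sketch of that workaround, assuming the standard boto3 environment variables (`AWS_RETRY_MODE`, `AWS_MAX_ATTEMPTS`); the values here are illustrative, not recommended defaults:

```python
# Hedged sketch: boto3 reads these standard environment variables, so setting
# them in the environment of the Airflow service process changes the retry
# behaviour for every hook/operator that uses boto3.
import os

os.environ["AWS_RETRY_MODE"] = "adaptive"  # one of: legacy, standard, adaptive
os.environ["AWS_MAX_ATTEMPTS"] = "10"      # raise the retry ceiling (illustrative value)
```

In a real deployment these would be exported in the service's environment (e.g. the scheduler/worker unit file or container spec) rather than set from Python.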
I think that's a good general solution that solves the problem (and it does not need any other fix except maybe updating the docs). Thanks @Taragolis.
@SamLynnEvans would you like to make a PR to the AWS docs to mention that as a solution?
Yes, that seems simpler, thank you!
@potiuk I could mention the built-in backoff retry in boto3 and how to configure it in the docs. Even though this task is closed, I also want to check and change the default poll_interval / sleep_interval for some of the AWS operators, because in most cases 5-6 seconds is too optimistic and causes API throttling.
No problem. Feel free to add a PR. There is no need to open issues for such changes; a PR is enough.
Apache Airflow Provider(s)
amazon
Versions of Apache Airflow Providers
All versions
Apache Airflow version
2.3.0 (latest released)
Operating System
linux
Deployment
Astronomer
Deployment details
No response
What happened
When using the AWS glue crawler operator, if the sensing stage gets a 400 error with "Rate exceeded" message then the glue crawler operator fails.
The problem is that this can occur with as few as 10 crawler sensors running concurrently with a poll_interval of 60s.
You can set retries and exponential_backoff on the operators; however, if you retry the glue crawler operator it will then fail because the crawler has already been started. In essence, this means that if you use this operator and get rate limited, you cannot retry, and you end up with brittle pipelines.
This rate limiting issue happens with all other AWS operators I have worked with, too.
What you think should happen instead
This is the kind of error that the operator receives when it pings AWS for a status update:
"An error occurred (ThrottlingException) when calling the GetCrawler operation (reached max retries: 4): Rate exceeded"
I think this error should be handled instead of it causing the task to fail.
I have implemented a custom operator which catches this particular exception. I would be happy to try and submit a PR myself for the issue.
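A minimal sketch of that idea (an assumed shape, not the reporter's actual custom operator): retry the throttled status poll with exponential backoff instead of letting the task fail. Here `get_state` stands in for a boto3 call such as `get_crawler`, and `RuntimeError` stands in for botocore's `ClientError`.

```python
# Illustrative only: back off and re-poll on "Rate exceeded" instead of failing.
import time

def poll_with_backoff(get_state, max_retries=5, base_delay=1.0):
    """Call get_state(), backing off exponentially on throttling errors."""
    for attempt in range(max_retries):
        try:
            return get_state()
        except RuntimeError as e:  # stand-in for botocore.exceptions.ClientError
            # Re-raise anything that is not throttling, or the final attempt.
            if "Rate exceeded" not in str(e) or attempt == max_retries - 1:
                raise
            time.sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
```

A real implementation would inspect `e.response["Error"]["Code"]` for `ThrottlingException` rather than matching on the message string.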
How to reproduce
As this is a problem of being rate limited by AWS, you will need an AWS account set up with some crawlers.
Create a DAG that uses the glue crawler operator linked to above. Have somewhere between 10 and 20 of these set up to run glue crawlers in an AWS environment.
Anything else
This problem occurs without fail every time we try to run 12 crawler operators at once (each with a poll_interval of 60s).
Are you willing to submit PR?
Code of Conduct