
AWS rate limiting causes tasks to fail #23475

Closed
2 tasks done
SamLynnEvans opened this issue May 4, 2022 · 8 comments
Assignees
Labels
area:providers, good first issue, kind:bug, provider:amazon-aws

Comments

SamLynnEvans commented May 4, 2022

Apache Airflow Provider(s)

amazon

Versions of Apache Airflow Providers

All versions

Apache Airflow version

2.3.0 (latest released)

Operating System

linux

Deployment

Astronomer

Deployment details

No response

What happened

When using the AWS Glue crawler operator, a 400 error with a "Rate exceeded" message during the sensing stage causes the operator to fail.

The problem is that this can occur with as few as 10 crawler sensors running concurrently with a poll_interval of 60s.

You can set retries and exponential backoff on the operators; however, if you retry the Glue crawler operator it fails anyway, because the crawler has already been started. In essence, if you use this operator and get rate limited, you cannot retry, and you end up with brittle pipelines.

This rate limiting issue happens with all the other AWS operators I have worked with, too.
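
For reference, this is roughly what those retry settings look like on the operator (a sketch only: the task id and crawler name are placeholders, and retries / retry_exponential_backoff are standard BaseOperator arguments). As described above, they do not help here, because the retried task fails once the crawler is already running.

from airflow.providers.amazon.aws.operators.glue_crawler import GlueCrawlerOperator

# Placeholder names; in practice this would sit inside a DAG definition.
crawl = GlueCrawlerOperator(
    task_id="crawl_sales_data",
    config={"Name": "sales-data-crawler"},
    poll_interval=60,
    retries=3,                       # retry the task up to 3 times
    retry_exponential_backoff=True,  # increase the delay between retries
)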

What you think should happen instead

This is the kind of error that the operator receives when it pings AWS for a status update:
"An error occurred (ThrottlingException) when calling the GetCrawler operation (reached max retries: 4): Rate exceeded"

I think this error should be handled rather than causing the task to fail.

I have implemented a custom operator that catches this particular exception. I would be happy to try to submit a PR for the issue myself.
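
To make the intent concrete, here is a minimal sketch of that kind of tolerant polling (not the actual custom operator): it swallows a bounded number of throttling errors from GetCrawler instead of failing the task. GlueCrawlerHook comes from the Amazon provider; the wait_for_crawler helper, the MAX_THROTTLE_RETRIES cap, and the assumption that get_crawler returns the crawler description dict are illustrative.

import time

from botocore.exceptions import ClientError

from airflow.providers.amazon.aws.hooks.glue_crawler import GlueCrawlerHook

MAX_THROTTLE_RETRIES = 5  # hypothetical cap on consecutive throttled polls


def wait_for_crawler(crawler_name: str, poll_interval: int = 60) -> None:
    """Poll the crawler state, tolerating AWS throttling instead of failing."""
    hook = GlueCrawlerHook(aws_conn_id="aws_default")
    throttled = 0
    while True:
        try:
            state = hook.get_crawler(crawler_name)["State"]
            throttled = 0
        except ClientError as err:
            if err.response["Error"]["Code"] == "ThrottlingException" and throttled < MAX_THROTTLE_RETRIES:
                throttled += 1  # rate limited: wait and poll again
                time.sleep(poll_interval)
                continue
            raise
        if state == "READY":  # the crawler returns to READY when a run finishes
            return
        time.sleep(poll_interval)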

How to reproduce

As this is a problem of being rate limited by AWS, you will need an AWS account set up with some crawlers.

Create a DAG that uses the Glue crawler operator mentioned above. Set up somewhere between 10 and 20 of these tasks to run Glue crawlers in an AWS environment, as sketched below.
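
A hedged sketch of such a reproduction DAG (the dag_id, start date, and crawler_0 … crawler_11 names are placeholders for crawlers that already exist in the target account):

from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.operators.glue_crawler import GlueCrawlerOperator

with DAG(
    dag_id="glue_crawler_throttling_repro",
    start_date=datetime(2022, 5, 1),
    schedule_interval=None,
    catchup=False,
):
    # A dozen crawler tasks polling concurrently is enough to trigger throttling.
    for i in range(12):
        GlueCrawlerOperator(
            task_id=f"crawl_{i}",
            config={"Name": f"crawler_{i}"},  # placeholder crawler names
            poll_interval=60,
        )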

Anything else

This problem occurs without fail every time we try to run 12 crawler operators at once (each with a poll_interval of 60s).

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct

  • I agree to follow this project's Code of Conduct

SamLynnEvans added the area:providers and kind:bug labels May 4, 2022
boring-cyborg bot commented May 4, 2022

Thanks for opening your first issue here! Be sure to follow the issue template!

potiuk (Member) commented May 4, 2022

Feel free! Assigned you.

eladkal added the provider:amazon-aws and good first issue labels May 20, 2022
Taragolis (Contributor) commented

As a workaround, set environment variables for the Airflow service so that boto3 uses a more "pessimistic" retry strategy than the default one:

AWS_RETRY_MODE=standard
AWS_MAX_ATTEMPTS=10

The main benefit of this approach is that it affects all AWS operators, secrets backends, hooks, and everything else that uses boto3, unless retry settings are explicitly set for a specific AWS connection in the extra field under config_kwargs.
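
For illustration, assuming the Amazon provider's config_kwargs connection extra is used, the extra would carry something like {"config_kwargs": {"retries": {"mode": "standard", "max_attempts": 10}}}. At the boto3 level that corresponds to a client Config such as this sketch (not provider code):

from botocore.config import Config

# Equivalent of AWS_RETRY_MODE=standard / AWS_MAX_ATTEMPTS=10, expressed as a
# botocore client Config; boto3 clients built with it retry throttled calls.
retry_config = Config(retries={"mode": "standard", "max_attempts": 10})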

potiuk (Member) commented May 24, 2022

I think that's a good general solution that solves the problem (and it does not need any other fix except maybe updating the docs). Thanks @Taragolis.

potiuk (Member) commented May 24, 2022

@SamLynnEvans would you like to make a PR to the AWS docs to mention that as a solution?

SamLynnEvans (Author) commented

Yes that seems simpler, thank you!

Taragolis (Contributor) commented

@potiuk I could mention the built-in backoff retry in boto3 and how to configure it in the docs.

Even though this task is closed, I also want to review and change the default poll_interval / sleep_interval for some of the AWS operators, because in most cases 5-6 seconds is too optimistic and causes API throttling.

potiuk (Member) commented May 24, 2022

No problem. Feel free to add a PR. There is no need to have issues for such changes; a PR is enough.
