fix(transliterator): excessive throttling from ECS during reprocessing #711
During the reprocessing workflow, Step Functions tries to start a burst of 60,000 ECS tasks (the current number of package versions). Since our account limit is only 1000 parallel tasks, we need to apply a retry policy so the throttled tasks don't end up in the DLQ.
Currently, our retry policy allows for a total wait time of roughly 2.5 hours. Let's do some math to see if this is enough.
Since tasks also have boot time, we don't really run 1000 in parallel. In practice what we normally see is:
So for simplicity's sake, let's assume 500 parallel tasks. If every task takes about 2 minutes (empirically, and somewhat based on `jsii-docgen` test timeouts), we are able to process 1000 tasks in 4 minutes.
This means that in order to process 60,000 tasks, we need 4 hours. The current retry policy of 2.5 hours allows us to process only about 35,000 tasks. And indeed, the most recent execution of the workflow resulted in the remaining 25,000 tasks being sent to the DLQ.
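The estimate above can be reproduced with a small sketch. It assumes the simplified model from this description (a constant 500 tasks in flight, 2 minutes per task); the function and constant names are made up for illustration and are not part of the codebase. Under this model a 2.5 hour budget covers 37,500 tasks, which is in the same ballpark as the roughly 35,000 quoted above.

```typescript
// Back-of-the-envelope throughput model using the figures from this description.
// All names here are illustrative only.
const TOTAL_TASKS = 60_000;        // current number of package versions
const EFFECTIVE_PARALLELISM = 500; // assumed; boot time keeps us below the 1000-task limit
const TASK_MINUTES = 2;            // approximate duration of a single task

/** Minutes needed to drain `tasks` when `parallel` tasks run at the same time. */
function minutesToProcess(tasks: number, parallel: number, taskMinutes: number): number {
  return Math.ceil(tasks / parallel) * taskMinutes;
}

/** Number of tasks that can complete within `budgetMinutes`. */
function tasksWithin(budgetMinutes: number, parallel: number, taskMinutes: number): number {
  return Math.floor(budgetMinutes / taskMinutes) * parallel;
}

// Hours needed to process everything, and how far each retry budget gets us.
const hoursNeeded = minutesToProcess(TOTAL_TASKS, EFFECTIVE_PARALLELISM, TASK_MINUTES) / 60;
const processedIn2_5Hours = tasksWithin(2.5 * 60, EFFECTIVE_PARALLELISM, TASK_MINUTES);
const processedIn7Hours = tasksWithin(7 * 60, EFFECTIVE_PARALLELISM, TASK_MINUTES);

console.log({ hoursNeeded, processedIn2_5Hours, processedIn7Hours });
```

With a 7 hour budget the same model leaves headroom well beyond 60,000 tasks, which helps if the real parallelism or task duration turns out worse than assumed.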
The retry policy implemented in this PR gives us 7 hours.
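For reference, here is a minimal sketch of how the total wait time of a Step Functions style retry policy can be estimated. It assumes the usual exponential backoff semantics (the first retry waits the initial interval, and each subsequent wait is multiplied by the backoff rate) and ignores any delay cap and the time tasks spend running between retries. The `RetryPolicy` shape and the example values are hypothetical and are not the values used in this PR.

```typescript
// Hypothetical shape mirroring the three knobs of a Step Functions retrier.
interface RetryPolicy {
  intervalSeconds: number; // wait before the first retry
  backoffRate: number;     // multiplier applied to the wait after each retry
  maxAttempts: number;     // maximum number of retries
}

/** Sum of all waits if every retry attempt is used up. */
function totalWaitSeconds(policy: RetryPolicy): number {
  let total = 0;
  let delay = policy.intervalSeconds;
  for (let attempt = 0; attempt < policy.maxAttempts; attempt++) {
    total += delay;
    delay *= policy.backoffRate;
  }
  return total;
}

/** Smallest number of retries whose cumulative wait reaches `targetSeconds`. */
function attemptsNeeded(intervalSeconds: number, backoffRate: number, targetSeconds: number): number {
  if (intervalSeconds <= 0 || backoffRate < 1) {
    throw new Error("intervalSeconds must be positive and backoffRate must be >= 1");
  }
  let attempts = 0;
  let total = 0;
  let delay = intervalSeconds;
  while (total < targetSeconds) {
    total += delay;
    delay *= backoffRate;
    attempts++;
  }
  return attempts;
}

// Example values only: 1 minute initial wait, doubling each time, 5 retries.
const examplePolicy: RetryPolicy = { intervalSeconds: 60, backoffRate: 2, maxAttempts: 5 };
const exampleWaitMinutes = totalWaitSeconds(examplePolicy) / 60;

// Retries needed for the same interval and rate to cover a 7 hour budget.
const attemptsForSevenHours = attemptsNeeded(60, 2, 7 * 60 * 60);

console.log({ exampleWaitMinutes, attemptsForSevenHours });
```

A helper like this makes it easy to check that a proposed interval, backoff rate, and attempt count add up to the intended budget before deploying.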
TODO
`jsii-docgen` improvements did make it better, but not enough to make a significant dent. I've updated the PR to give us 7 hours.
Fixes #708
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache-2.0 license