Closed
Description
Whenever our reprocess workflow is triggered, we see a bunch of DLQ messages that say:
Rate exceeded (Service: AmazonECS; Status Code: 400; Error Code: ThrottlingException; Request ID: 1db00da3-2470-4c18-b97a-ffca0c6aa602; Proxy: null)
This happens because reprocessing means spawning around 40,000 ECS tasks, while our current service limit stands at 1,000.
I think we need a two-pronged solution here:
- Increase service limits to make reprocessing faster.
- Tweak the retry policy or some other mechanism to prevent this from happening.
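
For the second point, one common approach is to retry throttled calls with exponential backoff and jitter, so the retries spread out instead of hitting the API again all at once. This is only a minimal sketch in Python, not our actual workflow code: `ThrottlingError`, `run_with_backoff`, the `launch` callable, and the `base`/`cap` defaults are all hypothetical placeholders, and the real retry settings would depend on the SDK and orchestration we use.

```python
import random
import time


class ThrottlingError(Exception):
    """Hypothetical stand-in for the ECS ThrottlingException."""


def backoff_delays(max_retries, base=0.5, cap=30.0, rng=random.random):
    """Yield one "full jitter" delay (in seconds) per allowed retry.

    The upper bound doubles on every retry and is clamped to `cap`;
    the actual delay is a random fraction of that bound.
    """
    for attempt in range(max_retries):
        yield rng() * min(cap, base * 2 ** attempt)


def run_with_backoff(launch, max_retries=8, sleep=time.sleep):
    """Call launch() and retry on ThrottlingError, sleeping between tries.

    `launch` is a hypothetical zero-argument callable that starts one task.
    After `max_retries` throttled attempts, one last attempt is made and
    any error from it propagates to the caller (e.g. to the DLQ).
    """
    for delay in backoff_delays(max_retries):
        try:
            return launch()
        except ThrottlingError:
            sleep(delay)
    return launch()
```

Backoff alone only smooths out bursts. With roughly 40,000 tasks against a limit of 1,000, we would probably still need to cap how many launches are in flight at once, or raise the limit as suggested in the first point.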