Summary
The current retry backoff is purely exponential, which risks arithmetic overflow at high attempt counts and synchronized retries ("stampedes"). We should harden the strategy by capping the delay, adding a small amount of jitter, and using overflow-safe arithmetic. Retry logs should also include the event ID for traceability.
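As a rough illustration of the intended behavior, here is a minimal Python sketch of capped exponential backoff with jitter. The language, the function names, and the constants (`BASE_DELAY_S`, `MAX_DELAY_S`, `JITTER_FRACTION`) are all assumptions made for illustration and are not taken from the existing code. The doubling loop stops as soon as the next doubling would exceed the cap, so the intermediate value never grows past the maximum. The same pattern should carry over to languages with fixed-width integers.

```python
import random

# Hypothetical defaults, chosen only for illustration.
BASE_DELAY_S = 0.5      # delay before the first retry
MAX_DELAY_S = 60.0      # upper bound on any delay
JITTER_FRACTION = 0.1   # shave off up to 10% of the delay at random


def capped_delay(attempt, base=BASE_DELAY_S, max_delay=MAX_DELAY_S):
    """Return min(base * 2**attempt, max_delay) without computing a huge power."""
    if attempt < 0:
        raise ValueError("attempt must be non-negative")
    delay = min(base, max_delay)
    for _ in range(attempt):
        # If doubling would exceed the cap, return the cap immediately.
        # The loop therefore runs at most about log2(max_delay / base) times.
        if delay > max_delay / 2:
            return max_delay
        delay *= 2
    return delay


def backoff_delay(attempt, rng=random, jitter=JITTER_FRACTION):
    """Return the capped delay reduced by a random fraction in [0, jitter]."""
    delay = capped_delay(attempt)
    # Subtractive jitter keeps the result at or below the cap while still
    # spreading out retries that would otherwise fire at the same instant.
    return delay * (1.0 - rng.uniform(0.0, jitter))
```

Subtractive jitter is one possible design choice: symmetric jitter followed by a second clamp would also satisfy the cap, but it would leave many capped retries at exactly the maximum delay.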
Context
Scope
Acceptance criteria
- Exponential backoff uses overflow-safe arithmetic and clamps to a sensible maximum delay.
- A small per-event jitter is added to avoid thundering-herd effects.
- Retry warnings/logs include the event ID for traceability.
- Unit tests cover delay capping, jitter bounds, and absence of overflow at high attempts.
- Consider making the max delay and jitter percentage configurable (as constants or config values).
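For the logging criterion above, a retry warning might look like the following Python sketch. The logger name, the function names, and the message layout are hypothetical and should be adapted to the project's existing logging conventions.

```python
import logging

logger = logging.getLogger("retry")  # hypothetical logger name


def format_retry_warning(event_id, attempt, delay_s, error):
    """Build a retry message that always carries the event ID."""
    return "retry scheduled: event_id=%s attempt=%d delay_s=%.3f error=%s" % (
        event_id,
        attempt,
        delay_s,
        error,
    )


def log_retry(event_id, attempt, delay_s, error):
    """Emit the retry warning through the standard logging module."""
    logger.warning(format_retry_warning(event_id, attempt, delay_s, error))
```

Keeping the message formatting in its own function makes it straightforward to assert in a unit test that the event ID is present.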
Notes
Please link the implementing PR back to this issue upon submission.