Description
Problem
Async retries use an asyncio.wait_for
line to break out early when the deadline is hit. The problem is, in practice, retries typically wrap gapic grpc calls, which implement their own timeout logic. This means there is a natural race condition when the timeout is reached, where one of two things can happen:
- the
wait_for
line in the wrapper notices the timeout first:- the wrapper emitts a
RetryError
- a CancelledError is sent down to the wrapped function, which the wrapped function may have to account for properly
- the wrapper emitts a
- The wrapped grpc call times out on its own:
- a
DeadlineExceeded
error is passed to the wrapper- This may result in the wrapper raising a
RetryError
, if DeadlineExceeded is in the list of retryable errors, otherwise it will be raised directly through the wrapper.
- This may result in the wrapper raising a
- a
On top of the race condition, there are a couple other issues with wait_for:
wait_for
creates a new task, which can add significant performance overhead- This behaviour differs from the sync retry function, which only checks for timeouts after the wrapped function attempt is complete
Solutions:
We can add a new interrupt_on_timeout
argument to the async retry function, and only enable wait_for when the flag is True. To avoid breaking changes, we can have the flag default to True
But personally, I think we should consider just removing the wait_for entirely
Streaming Retries:
Streaming retries add an extra complication, since there's the added possibility to check for expired timeouts after each yield in the stream, on top of the option to break the stream mid-yield. We we could add another flag to control for to control these timeout options.
But adding any extra checks in between each yield can significantly harm performance, especially if we add a wait_for to interrupt mid-yield, like async_retry currently does. For this reason. I suggest that we keep these timeout options out of streaming retries for now, and only keep the async_retries wait_for for compatibility reasons.
In general, I think the retry wrappers should expect the wrapped function to handle timeout intrruptions on their own, and the wrappers should only handle timeouts between attempts