Skip to content

async retries timeout race condition #528

Closed
@daniel-sanche

Description

@daniel-sanche

Problem

Async retries use an asyncio.wait_for line to break out early when the deadline is hit. The problem is, in practice, retries typically wrap gapic grpc calls, which implement their own timeout logic. This means there is a natural race condition when the timeout is reached, where one of two things can happen:

  • the wait_for line in the wrapper notices the timeout first:
    • the wrapper emitts a RetryError
    • a CancelledError is sent down to the wrapped function, which the wrapped function may have to account for properly
  • The wrapped grpc call times out on its own:
    • a DeadlineExceeded error is passed to the wrapper
      • This may result in the wrapper raising a RetryError, if DeadlineExceeded is in the list of retryable errors, otherwise it will be raised directly through the wrapper.

On top of the race condition, there are a couple other issues with wait_for:

  • wait_for creates a new task, which can add significant performance overhead
  • This behaviour differs from the sync retry function, which only checks for timeouts after the wrapped function attempt is complete

Solutions:

We can add a new interrupt_on_timeout argument to the async retry function, and only enable wait_for when the flag is True. To avoid breaking changes, we can have the flag default to True

But personally, I think we should consider just removing the wait_for entirely


Streaming Retries:

Streaming retries add an extra complication, since there's the added possibility to check for expired timeouts after each yield in the stream, on top of the option to break the stream mid-yield. We we could add another flag to control for to control these timeout options.

But adding any extra checks in between each yield can significantly harm performance, especially if we add a wait_for to interrupt mid-yield, like async_retry currently does. For this reason. I suggest that we keep these timeout options out of streaming retries for now, and only keep the async_retries wait_for for compatibility reasons.

In general, I think the retry wrappers should expect the wrapped function to handle timeout intrruptions on their own, and the wrappers should only handle timeouts between attempts

Metadata

Metadata

Assignees

Labels

priority: p2Moderately-important priority. Fix may not be included in next release.type: bugError or flaw in code with unintended results or allowing sub-optimal usage patterns.

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions