Skip to content

RUST-1433 Propagate original error for some labeled retry errors #903

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 7 commits into from
Jun 29, 2023

Conversation

abr-egn
Copy link
Contributor

@abr-egn abr-egn commented Jun 26, 2023

RUST-1433

The actual functional change here is just the added && !err.contains_label("NoWritesPerformed") clause, everything else is in service of testing. The core issue here is that the test calls for adding a new failpoint in response to a CommandSucceededEvent, and the Rust driver's CommandEventHandler doesn't permit async code. Options I considered:

  • try to cheat the async in there via a channel to an independent task
  • add a test-only way to run async event processing
  • test the code in isolation against some kind of fake server

I'm pretty sure the cheat option wouldn't work, or at best would be flaky, because it would end up having to block the Tokio executor thread to wait on the channel in the sync event handler. The code also is pretty hostile to being tested in isolation, since we don't have any kind of abstraction for those layers. So, I ended up going with the test-only async events as the least-bad route.

Attempting to implement that as a direct callback (Fn(&CommandEvent) -> BoxFuture<...>) pretty rapidly ran into gnarly lifetime issues, so I switched that over to using a channel with AcknowledgedMessages, which gives an equivalent effect ("notify external code of an event, wait for possibly-async code before continuing"). That worked a lot better, although to prevent deadlock due to the failpoint itself triggering events it still ended up being messy; I've left comments to try to clarify what's going on but let me know if it needs more.

Copy link
Contributor

@drshika drshika left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm!

@isabelatkinson
Copy link
Contributor

isabelatkinson commented Jun 28, 2023

I spent some time thinking about this yesterday and came up with the following idea that passes locally with a rough POC I wrote. The gist is:

  • define an AsyncEventHandler type that stores a handle to a client and the async runtime
  • implement CommandEventHandler for AsyncEventHandler and call spawn on the stored runtime in the definition of handle_command_succeeded_event to configure the failpoint
  • create an AsyncEventHandler in the async test by calling tokio::runtime::Handler::current to grab the runtime handle and configure it on the client used to run the insert

i.e.

struct AsyncEventHandler {
    runtime_handle: tokio::runtime::Handle,
    client: TestClient,
}

impl CommandEventHandler for AsyncEventHandler {
    fn handle_command_succeeded_event(&self, event: CommandSucceededEvent) {
        if event matches {
            runtime_handle.spawn(/* configure failpoint on client */);
        }
    }
}

Caveats: I'm not sure what the async-std equivalent here would be (although I don't think it'd be a huge deal if we only ran this test on tokio), and I'm not sure how safe this is against a race between the failpoint being set and the retry occurring. WDYT? Happy to defer to whatever you think the best approach is here.

@abr-egn
Copy link
Contributor Author

abr-egn commented Jun 28, 2023

I hadn't thought of that approach, but I'm pretty sure that'll be flaky due to the race condition you called out. Once the task to set the failpoint is spawned it's down to the scheduler whether it'll execute before or after the retry, which even if it happened to work doesn't seem good to rely on.

@abr-egn abr-egn force-pushed the RUST-1433/retry-propagate-errors branch from 3849b58 to 8a31022 Compare June 28, 2023 19:39
Copy link
Contributor

@isabelatkinson isabelatkinson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm pretty sure that'll be flaky due to the race condition you called out.

Yeah agreed, I think it makes sense to stick with your strategy then. LGTM! (I filed RUST-1694 to update this test to use successive failpoints if/when the server supports that.)

@abr-egn abr-egn merged commit 3ea196f into mongodb:main Jun 29, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants