Skip to content

Race condition in Pool #1293

Closed
Closed
@madadam

Description

@madadam

I think I found a bug in Pool which sometimes causes unexpected PoolTimedOut errors. Here is a minimal reproducible test case:

#[tokio::test(flavor = "multi_thread")]
async fn race_condition_reproduction_case() {
    let pool: sqlx::Pool<Sqlite> = sqlx::pool::PoolOptions::new()
        .max_connections(1)
        .connect_timeout(std::time::Duration::from_secs(1))
        .connect(":memory:")
        .await
        .unwrap();

    sqlx::query("CREATE TABLE foo (name TEXT, value TEXT)")
        .execute(&pool)
        .await
        .unwrap();

    sqlx::query("CREATE TABLE bar (name TEXT, value TEXT)")
        .execute(&pool)
        .await
        .unwrap();
}

The failure is not deterministic - the test might need to be run several times before the failure is observed.

Note: I decreased the connect timeout to make the test fail faster, but the failure still occurs even with the default timeout. Also, this failure is most likely when max_connections is 1, but I think as long as the number of subsequent queries is higher than max_connections then the failure can still occur.

What happens I think is that there is a race condition. The first query acquires a connection and the releases it. But the release happens in a spawned task, so the second query might call acquire first. It gets to this point before a context switch happens and the release of the first query finishes, waking a Waiter if there is one (there are none in this case). Then context gets switched back to the acquire and it runs the poll_fn which pushes a new waiter to the queue. But it never gets woken up because the waking up already happened and eventually the call timeouts.

A simple fix which seems to work for me is to check whether idle_conns is still empty before pushing the waiter:

                future::poll_fn(|cx| -> Poll<()> {
                    if !self.idle_conns.is_empty() {
                        return Poll::Ready(());
                    }

                    let waiter = waiter.get_or_insert_with(|| Waiter::push_new(cx, &self.waiters));

                    // ...
                }),

EDIT: after a bit more testing, this doesn't seem to be a sufficient fix. The only synchronization between idle_conns.push and idle_conns.is_empty is via the waiters, but in cases like I described there are no waiters yet, so depending on timing, the idle_conns.is_empty() call might still return true and the problem persists ☹️

EDIT 2: seems that moving the idle_conns.is_empty() check after the let waiter = waiter.get_or_insert(... line does the trick. That seems to eliminate the last race condition - namely when the is_empty check runs first, then release runs and notifies the waiters and finally the Waiter::push_new is called.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions