Race condition in Pool

I think I found a bug in `Pool` which sometimes causes unexpected `PoolTimedOut` errors. Here is a minimal reproducible test case:

```rust
#[tokio::test(flavor = "multi_thread")]
async fn race_condition_reproduction_case() {
    let pool: sqlx::Pool<Sqlite> = sqlx::pool::PoolOptions::new()
        .max_connections(1)
        .connect_timeout(std::time::Duration::from_secs(1))
        .connect(":memory:")
        .await
        .unwrap();

    sqlx::query("CREATE TABLE foo (name TEXT, value TEXT)")
        .execute(&pool)
        .await
        .unwrap();

    sqlx::query("CREATE TABLE bar (name TEXT, value TEXT)")
        .execute(&pool)
        .await
        .unwrap();
}
```
The failure is not deterministic - the test might need to be run several times before the failure is observed. 

Note: I decreased the connect timeout to make the test fail faster, but the failure still occurs even with the default timeout. Also, this failure is most likely when `max_connections` is 1, but I think as long as the number of subsequent queries is higher than `max_connections` then the failure can still occur.

What happens I think is that there is a race condition. The first query acquires a connection and the releases it. But the release happens in a spawned task, so the second query might call `acquire` first. It gets to [this point](https://github.com/launchbadge/sqlx/blob/13f7c7124a79dbe2ba3ff2ea3a1fb4f686b0a2fa/sqlx-core/src/pool/inner.rs#L180) before a context switch happens and the release of the first query finishes, waking a `Waiter` if there is one (there are none in this case). Then context gets switched back to the acquire and it runs the `poll_fn` which pushes a new waiter to the queue. But it never gets woken up because the waking up already happened and eventually the call timeouts.

A simple fix which seems to work for me is to check whether `idle_conns` is still empty before pushing the waiter:

```rust
                future::poll_fn(|cx| -> Poll<()> {
                    if !self.idle_conns.is_empty() {
                        return Poll::Ready(());
                    }

                    let waiter = waiter.get_or_insert_with(|| Waiter::push_new(cx, &self.waiters));

                    // ...
                }),
``` 
EDIT: after a bit more testing, this doesn't seem to be a sufficient fix. The only synchronization between `idle_conns.push` and `idle_conns.is_empty` is via the waiters, but in cases like I described there are no waiters yet, so depending on timing, the `idle_conns.is_empty()` call might still return `true` and the problem persists :frowning_face: 

EDIT 2: seems that moving the `idle_conns.is_empty()` check after the `let waiter = waiter.get_or_insert(...` line does the trick. That seems to eliminate the last race condition - namely when the `is_empty` check runs first, then `release` runs and notifies the waiters and finally the `Waiter::push_new` is called. 

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Race condition in Pool #1293

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Race condition in Pool #1293

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions