Description
I think I found a bug in `Pool` which sometimes causes unexpected `PoolTimedOut` errors. Here is a minimal reproducible test case:
```rust
#[tokio::test(flavor = "multi_thread")]
async fn race_condition_reproduction_case() {
    // A pool with a single connection and a short timeout so the
    // failure surfaces quickly.
    let pool: sqlx::Pool<sqlx::Sqlite> = sqlx::pool::PoolOptions::new()
        .max_connections(1)
        .connect_timeout(std::time::Duration::from_secs(1))
        .connect(":memory:")
        .await
        .unwrap();

    // The first query acquires the pool's only connection and releases
    // it when it completes.
    sqlx::query("CREATE TABLE foo (name TEXT, value TEXT)")
        .execute(&pool)
        .await
        .unwrap();

    // The second query's acquire races with the spawned release of the
    // first connection and can fail with PoolTimedOut.
    sqlx::query("CREATE TABLE bar (name TEXT, value TEXT)")
        .execute(&pool)
        .await
        .unwrap();
}
```
The failure is not deterministic; the test may need to be run several times before the failure is observed.
Note: I decreased the connect timeout to make the test fail faster, but the failure still occurs even with the default timeout. Also, the failure is most likely when `max_connections` is 1, but I think it can still occur whenever the number of sequential queries exceeds `max_connections`.
What I think happens is a race condition. The first query acquires a connection and then releases it. But the release happens in a spawned task, so the second query might call `acquire` first. It gets to this point before a context switch happens and the release of the first query finishes, waking a `Waiter` if there is one (there are none in this case). Then context switches back to `acquire`, which runs the `poll_fn` that pushes a new waiter onto the queue. But that waiter never gets woken, because the wakeup already happened, and eventually the call times out with `PoolTimedOut`.
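To make that interleaving concrete, here is a minimal compilable sketch of the two code paths as I understand them. The names (`SharedPool`, `Connection`, `poll_acquire`) are hypothetical and only approximate the sqlx internals; this is not the actual source.

```rust
use std::sync::Mutex;
use std::task::{Context, Poll, Waker};

struct Connection;

// Hypothetical stand-in for the pool's shared state.
struct SharedPool {
    idle_conns: Mutex<Vec<Connection>>,
    waiters: Mutex<Vec<Waker>>,
}

impl SharedPool {
    // Runs in a spawned task after the first query completes.
    fn release(&self, conn: Connection) {
        // (3) the connection is returned to the idle queue...
        self.idle_conns.lock().unwrap().push(conn);
        // (4) ...and a waiter is woken, if any. In the racy
        // interleaving, acquire() has already seen idle_conns empty
        // but has not yet registered its waiter, so this wakes nobody
        // and the wakeup is lost.
        if let Some(waker) = self.waiters.lock().unwrap().pop() {
            waker.wake();
        }
    }

    // The waiter-registration half of acquire(), simplified.
    fn poll_acquire(&self, cx: &mut Context<'_>) -> Poll<Connection> {
        // (1) the second query sees no idle connection...
        if let Some(conn) = self.idle_conns.lock().unwrap().pop() {
            return Poll::Ready(conn);
        }
        // (2) ...and is preempted right here; release() then runs
        // steps (3) and (4) in full.
        // (5) this waiter is registered after the wakeup pass and
        // sleeps until the connect timeout fires.
        self.waiters.lock().unwrap().push(cx.waker().clone());
        Poll::Pending
    }
}
```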
A simple fix which seems to work for me is to check whether `idle_conns` is still empty before pushing the waiter:
```rust
future::poll_fn(|cx| -> Poll<()> {
    // Bail out early if a connection became idle in the meantime.
    if !self.idle_conns.is_empty() {
        return Poll::Ready(());
    }
    let waiter = waiter.get_or_insert_with(|| Waiter::push_new(cx, &self.waiters));
    // ...
}),
```
EDIT: after a bit more testing, this doesn't seem to be a sufficient fix. The only synchronization between `idle_conns.push` and `idle_conns.is_empty` is via the waiters, but in cases like the one described above there are no waiters yet, so depending on timing the `idle_conns.is_empty()` call might still return `true` and the problem persists.
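Concretely, the remaining window in the patched fragment would look something like this (annotations mine, same sketch-level caveats as above):

```rust
future::poll_fn(|cx| -> Poll<()> {
    if !self.idle_conns.is_empty() {  // (1) acquire: still empty here
        return Poll::Ready(());
    }
    // (2) release() runs to completion in between: it pushes the
    // connection and wakes the waiter queue, which is still empty.
    let waiter = waiter.get_or_insert_with(|| Waiter::push_new(cx, &self.waiters));
    // (3) the waiter registered here is never woken; it is the same
    // lost wakeup as before, just through a narrower window.
    // ...
}),
```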
EDIT 2: it seems that moving the `idle_conns.is_empty()` check after the `let waiter = waiter.get_or_insert(...` line does the trick. That seems to eliminate the last race condition, namely the one where the `is_empty` check runs first, then `release` runs and notifies the waiters, and finally `Waiter::push_new` is called.
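For clarity, the reordered fragment would look roughly like this (again a sketch rather than a tested patch, and assuming the waiter queue provides the necessary synchronization):

```rust
future::poll_fn(|cx| -> Poll<()> {
    // Register the waiter *before* checking for an idle connection.
    let waiter = waiter.get_or_insert_with(|| Waiter::push_new(cx, &self.waiters));
    // Now either release() runs after the registration and wakes this
    // waiter, or it ran before and the check below sees the idle
    // connection it pushed. Neither ordering loses the wakeup.
    if !self.idle_conns.is_empty() {
        return Poll::Ready(());
    }
    // ...
}),
```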