-
Notifications
You must be signed in to change notification settings - Fork 1.6k
Description
I have found these related issues/pull requests
None
Description
This is sort of two bugs in one. But the causes are very related (partially the same) and the simpler error-handling bug actually helps demonstrate the problem while being much simpler and more reliable to trigger. So I have decided to report them together. I am also reporting them from the perspective of sqlite because that is where we encountered them and the code that we have explored most thoroughly. However the root cause is in the database independent Transaction object and likely affects all databases, and if databases are not affected it is likely more by chance than by intention.
Drop Safety
With very unfortunate drop-timing or an error during transaction setup code. sqlx_sqlite::SqliteTransactionManager::begin can leave a connection with an open transaction. If this connection is not specially cleaned up it can result in future calls that try to open a transaction such as SqlitePool::begin to fail with:
attempted to call begin_with at non-zero transaction depth
For many people using sqlite this is a major issue. Because sqlite doesn't support concurrent writers it is common to use a single write connection. If this connection is "corrupted" by being left with an open transaction all attempts to use a write transaction in the application will fail. We were lucky in that we noticed fairly quickly, if this happened at night we likely would have had hours of downtime.
sqlx/sqlx-sqlite/src/transaction.rs
Lines 14 to 25 in f7ef1ed
| async fn begin(conn: &mut SqliteConnection, statement: Option<SqlStr>) -> Result<(), Error> { | |
| let is_custom_statement = statement.is_some(); | |
| conn.worker.begin(statement).await?; | |
| if is_custom_statement { | |
| // Check that custom statement actually put the connection into a transaction. | |
| let mut handle = conn.lock_handle().await?; | |
| if !handle.in_transaction() { | |
| return Err(Error::BeginFailed); | |
| } | |
| } | |
| Ok(()) | |
| } |
In SqliteTransactionManager::begin with a custom statement it is possible that the future is dropped or an error is raised between when the transaction is started (at some point during the execution of line 16) and when line 19 completes (at which point it will be wrapped in a Transaction object that cleans it up). In this case the Transaction object would never be constructed and the transaction would never be rolled back.
sqlx/sqlx-core/src/transaction.rs
Lines 106 to 111 in f7ef1ed
| DB::TransactionManager::begin(&mut conn, statement).await?; | |
| Ok(Self { | |
| connection: conn, | |
| open: true, | |
| }) |
It is also quite possible that this same error could occur without a custom statement, but I have not yet looked into the implementation of ConnectionWorker::begin to see if it has any suspension points after the transaction is started. Either way the code structure in Transaction is very error prone. If the future is Droped between the transaction being started (somewhere inside DB::TransactionManager::begin) and the Transaction object being constructed the connection will unexpectedly be left in an "in-transaction" state. This could very likely affect all database implementations as the main "bug" is in the database independent Transaction object. Our use case is just triggering it via the SqliteTransactionManager implementation.
I can think of two simple ways to resolve this:
- Build the
Transactionobject immediately, then callDB::TransactionManager::beginon the connection inside of the transaction. Theimpl Drop for Transactionwill then need to handle the case where the transaction was not actually started. (This could be possibly be done by changingdoneto be a try state where it starts "unknown" then switches to "open" after the transaction is started then moves to "closed" after rollback or commit.) - Create a new
StartingTransactionobject that has the same "close transaction if open" logic described above, but onceDB::TransactionManager::beginmove the connection from that object into theTransactionobject (in a Drop and panic-safe way). This avoidsTransactionneeding to handle the non-transaction case which should be very exceptional and would be good to avoid extra database calls for (especially for networked databases).
Error Safety
A bug with basically the same cause is error safety. Any error raised during transaction setup after the connection is put into in a transaction state (whether an actual underlying sqlite transaction or a ConnectionWorker transaction_depth > 0 state) will not result in the connection being closed.
Some errors are handled, for example if the default or custom statement fails the internal transaction_depth counter is not incremented. Since this error also means a transaction was not started this leaves sqlx in a good "no transaction" state.
sqlx/sqlx-sqlite/src/connection/worker.rs
Lines 226 to 231 in f7ef1ed
| let res = | |
| conn.handle | |
| .exec(statement.as_str()) | |
| .map(|_| { | |
| shared.transaction_depth.fetch_add(1, Ordering::Release); | |
| }); |
However other errors are not properly handled. For example there is an explicit check to ensure that the provided custom statement creates a transaction.
sqlx/sqlx-sqlite/src/transaction.rs
Lines 18 to 22 in f7ef1ed
| // Check that custom statement actually put the connection into a transaction. | |
| let mut handle = conn.lock_handle().await?; | |
| if !handle.in_transaction() { | |
| return Err(Error::BeginFailed); | |
| } |
If this check fails it raises an error that avoids the Transaction object being created. This then means that the transaction would not be closed. In theory this isn't an error because there is no transaction. However for sqlite this is an issue because the ConnectionWorker still believes it is in one and will fail future attempts at opening transactions.
I can think of two improvements. Probably it is necessary to do both of them:
- Move the "is in connection" check to
ConnectionWorkerso that it can avoid incrementingtransaction_depthif the command succeeded but did not start a transaction. However this approach is fragile as it does not other possible errors which could occur after a transaction was opened. - Basically the same solution as the drop problem, build some object with a
Dropthat will clean up the transaction no matter why it is done, including any error states. In this case theConnectionWorkermay still need to be updated to tracktransaction_depthproperly but it should also be robust to other errors.
Fixing
I can contribute a patch if we decide on the approach. DB::TransactionManager::get_transaction_depth should provided the required information for determining if the transaction needs to be rolled back without needing database-specific logic. Additionally the ConnectionWorker will need to be re-thought to avoid losing sync when the statement doesn't actually open a transaction.
Reproduction steps
Unfortunately I don't have a concrete way to reproduce the drop case. (It can probably be done by figuring out the exact number of times to poll the future, but I didn't feel like doing that as the reproducer would be very fragile) but I can show the error case which is a very similar error path.
#[sqlx::test]
async fn write_tx_recover(pool: SqlitePool) -> anyhow::Result<()> {
let mut conn = pool.acquire().await?;
dbg!(
Transaction::begin(
MaybePoolConnection::<Sqlite>::Connection(&mut conn),
Some("SELECT 1".into()),
)
.await
);
dbg!(
Transaction::begin(
MaybePoolConnection::<Sqlite>::Connection(&mut conn),
Some("THIS IS NEVER RUN".into()),
)
.await
)?;
Ok(())
}Output:
---- db::sqlite::write_tx_recover stdout ----
[api-server/src/db/sqlite.rs:291:5] Transaction::begin(MaybePoolConnection::<Sqlite>::Connection(&mut conn),
Some("SELECT 1".into()),).await = Err(
BeginFailed,
)
[api-server/src/db/sqlite.rs:299:5] Transaction::begin(MaybePoolConnection::<Sqlite>::Connection(&mut conn),
Some("adsjlfjadslfjaslfejl".into()),).await = Err(
InvalidSavePointStatement,
)
Error: attempted to call begin_with at non-zero transaction depth
This is a bit of a strange reproducer because the SELECT 1 doesn't actually start an sqlite transaction. However it exercises the same code and bug as the drop case. In this case the ConnectionWorker increments transaction_depth, but then an error is returned before the Transaction object is created. The transaction is never cleaned up (because that was the responsibility of the Transaction which was never constructed). The second call then fails because ConnectionWorker still believes that it is in a transaction. The second command is garbage because it is never run, ConnectionWorker returns an error before even getting to sqlite.
This could be modified to trigger the Drop bug.
- Update the commands to be
BEGIN IMMEDIATE. - Instead of
await-ing the first future poll itNtimes then drop it. WhereNis enough times to start the transaction in sqlite (and/orConnectionWorker) but not enough to complete the future (and build theTransactionobject.
SQLx version
0.8.5 (and code checked against current main branch)
Enabled SQLx features
"sqlite", "chrono", "uuid", "runtime-tokio-rustls", "macros", "migrate"
Database server and version
3.48.0 2025-01-14 11:05:00 d2fe6b05f38d9d7cd78c5d252e99ac59f1aea071d669830c1ffe4e8966e84010 (64-bit)
Operating system
Linux
Rust version
1.87.0