Stop spawning long-lasting threads for libsql connections #693
Conversation
Force-pushed from 07b658a to 6a99215
Nice, I really like the idea, but there's one more thing to consider. The previous transaction time limit was also an implicit backpressure/guardrails mechanism, so that e.g. if somebody opened a single transaction and then kept writing for ages, they'd just get a timeout and learn that they need to write in smaller bits. After the patch, users could technically keep writing, with their data not getting checkpointed (because of the write lock), and not backed up I guess, since it's not a full transaction. It's not very critical, because it's just an antipattern to have super-long-standing transactions, but maybe we should impose some kind of high limit anyway? So that people that hold a transaction for a full minute know that they're very likely doing something wrong.
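The guardrail being suggested could be as simple as a generous hard ceiling on write-transaction age, checked regardless of lock contention. A minimal sketch; the constant, function name, and 60-second value are hypothetical, not from the patch:

```rust
use std::time::{Duration, Instant};

// Hypothetical hard ceiling on how long a write transaction may live,
// independent of whether another connection wants the lock.
const TXN_HARD_LIMIT: Duration = Duration::from_secs(60);

// Returns true when a transaction started at `started_at` should be
// rolled back at time `now`, even with no contenders for the lock.
fn should_force_rollback(started_at: Instant, now: Instant) -> bool {
    now.duration_since(started_at) > TXN_HARD_LIMIT
}

fn main() {
    let start = Instant::now();
    // A transaction checked immediately is within the limit...
    assert!(!should_force_rollback(start, start));
    // ...but one that has (notionally) run for over a minute is not.
    assert!(should_force_rollback(start, start + Duration::from_secs(61)));
    println!("ok");
}
```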
/// This is done by calling rollback on the slot's txn, and marking the slot as stolen.
/// - When a connection notices that its slot has been stolen, it returns a timed-out error to the next request.
unsafe extern "C" fn busy_handler<W: WalHook>(state: *mut c_void, _retries: c_int) -> c_int {
    let state = &*(state as *mut TxnState<W>);
Sounds like we can have a fast path here where timeout_at is already in the past, that doesn't need to run within tokio::block_on or use the runtime in any way, right? And we can just steal the lock immediately. It's technically just an optimization, but I suggest we add it early, if we have an opportunity to run sync code without blocking on an async reactor.
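The fast path being proposed reduces to a pure time comparison that can run before touching the runtime at all. A sketch with illustrative names (`deadline_passed` is not the real function):

```rust
use std::time::{Duration, Instant};

// Illustrative fast-path check: if the slot's deadline is already in the
// past, the caller can steal the lock synchronously and skip block_on.
fn deadline_passed(now: Instant, timeout_at: Instant) -> bool {
    now >= timeout_at
}

fn main() {
    let now = Instant::now();
    // Deadline already reached: steal immediately, no async wait needed.
    assert!(deadline_passed(now, now));
    // Deadline in the future: fall through to the slow (waiting) path.
    assert!(!deadline_passed(now, now + Duration::from_secs(60)));
    println!("ok");
}
```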
let mut lock = this.lock();

if let Some(slot) = &lock.slot {
    if slot.is_stolen.load(Ordering::Relaxed) || Instant::now() > slot.timeout_at {
Ugh, I just wrote an elaborate paragraph on why I think we may need acquire/release on the load/store of is_stolen respectively, only to read your comment about the connection lock above. Nvm 😇
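The conclusion above, that Relaxed suffices because every load and store of is_stolen happens while holding the connection's lock, can be illustrated with a small stand-alone sketch (types simplified, not the actual sqld code):

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::{Arc, Mutex};

// Simplified slot: only the stolen flag matters for this illustration.
struct Slot {
    is_stolen: AtomicBool,
}

// Both the stealer and the owner take the same mutex before touching the
// flag, so the mutex's own acquire/release already orders the accesses
// and Relaxed atomics are enough.
fn steal(conn: &Mutex<Option<Arc<Slot>>>) {
    let guard = conn.lock().unwrap();
    if let Some(slot) = guard.as_ref() {
        slot.is_stolen.store(true, Ordering::Relaxed);
    }
}

fn is_stolen(conn: &Mutex<Option<Arc<Slot>>>) -> bool {
    let guard = conn.lock().unwrap();
    guard
        .as_ref()
        .map(|slot| slot.is_stolen.load(Ordering::Relaxed))
        .unwrap_or(false)
}

fn main() {
    let conn = Mutex::new(Some(Arc::new(Slot {
        is_stolen: AtomicBool::new(false),
    })));
    assert!(!is_stolen(&conn));
    steal(&conn);
    assert!(is_stolen(&conn));
    println!("ok");
}
```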
That's a relatively easy fix I think; we just have to check that we haven't timed out the current transaction in the connection. For checkpointing, we can also have it steal the lock when necessary. I'll do that :) @psarna
Btw, our checkpointing connection is not preemptible, since it needs to finish, right? In that case we should clearly log and measure how much time a checkpoint takes, since it's likely to be a source of SQLITE_BUSY for users.
This is an argument for checkpointing often.
Force-pushed from bba1948 to 1c6754a
we already do that :)
Force-pushed from 7c6a4fc to edd7161
Looks good, ship it, @MarinPostma! 🚀
Force-pushed from edd7161 to 3cfa05b
This PR removes the need to spawn long-lasting threads for libsql connections. This reduces the number of threads required by sqld, since it can now reuse threads from the thread pool for connections.
Lock stealing
Previously, every connection spawned a blocking thread that kept track of the transaction state, timing out and rolling back the transaction if it didn't complete in time.
Avoiding spawning a new thread means that we need a new mechanism to time out a transaction. This PR introduces a lock-stealing mechanism.
The lock-stealing algorithm only monitors write locks, meaning read-only transactions can continue indefinitely. Another change is that a lock is stolen only if another connection claims it. Therefore, a write connection can go on indefinitely until another connection steals the lock, and it is guaranteed at least TXN_TIMEOUT_DURATION to perform work.
Here is a detailed description of the algorithm:
- Each connection has a TxnState that contains a TxnSlot.
- When a connection acquires the write lock, a new TxnSlot is placed in the TxnState shared with other connections to this database, while another clone is kept in the transaction state. The TxnSlot contains the instant the txn should time out, an is_stolen flag, and a pointer to the connection currently holding the lock.
- When another connection attempts to acquire the lock, the busy_handler callback will be called. The callback is passed the TxnState for the connection. The handler looks at the current slot to determine when the current txn will time out and waits for that instant before retrying. The waiting handler can also be notified that the transaction has finished early.
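The steps above can be condensed into a synchronous sketch. Field and function names follow the description, but this is a simplified model, not the actual sqld implementation (the real handler waits on the async runtime rather than sleeping a thread, and the real slot also points at the owning connection so its txn can be rolled back):

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::thread;
use std::time::{Duration, Instant};

// Simplified model of the slot described above: the instant the txn
// should time out, plus the stolen flag.
struct TxnSlot {
    timeout_at: Instant,
    is_stolen: AtomicBool,
}

// Simplified busy handler: wait until the slot's deadline, then try to
// mark it stolen. Returns true if this caller won the steal.
fn wait_then_steal(slot: &TxnSlot) -> bool {
    let now = Instant::now();
    if now < slot.timeout_at {
        thread::sleep(slot.timeout_at - now);
    }
    // Only one contender may steal; the owner will observe the flag and
    // return a timed-out error on its next request.
    slot.is_stolen
        .compare_exchange(false, true, Ordering::Relaxed, Ordering::Relaxed)
        .is_ok()
}

fn main() {
    let slot = TxnSlot {
        timeout_at: Instant::now() + Duration::from_millis(10),
        is_stolen: AtomicBool::new(false),
    };
    // First contender waits out the deadline and steals the slot...
    assert!(wait_then_steal(&slot));
    // ...a second contender finds it already stolen.
    assert!(!wait_then_steal(&slot));
    println!("ok");
}
```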