This repository has been archived by the owner on Oct 18, 2023. It is now read-only.

Stop spawning long-lasting threads for libsql connections #693

Merged 5 commits into main on Oct 3, 2023

Conversation

MarinPostma
Collaborator

@MarinPostma MarinPostma commented Sep 20, 2023

This PR removes the need to spawn long-lasting threads for libsql connections. This reduces the number of threads required by sqld, since it can now reuse threads from the thread pool for connections.

Lock stealing

Previously, every connection spawned a blocking thread that tracked the transaction state and rolled back the transaction if it didn't complete in time.

Avoiding spawning a new thread means that we need a new mechanism to time out a transaction. This PR introduces a lock-stealing mechanism.
The lock-stealing algorithm only monitors write locks, meaning read-only transactions can continue indefinitely. Another change is that a lock is stolen only if another connection claims it. Therefore, a write connection can hold the lock indefinitely until another connection steals it, and is guaranteed at least TXN_TIMEOUT_DURATION to perform its work.

Here is a detailed description of the algorithm:

  • all connections to a database share a TxnState, that contains a TxnSlot
  • when a connection acquires a write lock to the database, this is detected by monitoring the state of the connection before and after the call thanks to sqlite3_txn_state()
  • if the connection acquired a write lock (txn state none/read -> write), a new txn slot is created. A clone of the TxnSlot is placed in the TxnState shared with other connections to this database, while another clone is kept in the transaction state. The TxnSlot contains the instant the txn should timeout, a is_stolen flag, and a pointer to the connection currently holding the lock.
  • when another connection attempts to acquire the lock, the busy_handler callback is called. The callback is passed the TxnState for the connection. The handler looks at the current slot to determine when the current txn will time out and waits for that instant before retrying. The waiting handler can also be notified that the transaction finished early.
  • If the handler waits until the txn timeout and isn't notified of the termination of the txn, it will attempt to steal the lock. This is done by calling rollback on the slot's txn, and marking the slot as stolen.
  • When a connection notices that its slot has been stolen, it returns a timed-out error to the subsequent request.
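The data structures and steal logic described above can be sketched roughly as follows. `TxnSlot` and `TxnState` are the names from the PR description, but the exact fields and methods here are illustrative assumptions, not the actual sqld implementation:

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::{Arc, Mutex};
use std::time::{Duration, Instant};

/// Slot describing the transaction currently holding the write lock.
struct TxnSlot {
    /// Instant after which the transaction is eligible for lock stealing.
    timeout_at: Instant,
    /// Set by a stealer; the holder checks it on its next request.
    is_stolen: AtomicBool,
}

/// State shared by all connections to the same database.
struct TxnState {
    slot: Mutex<Option<Arc<TxnSlot>>>,
}

impl TxnState {
    /// Called when a connection transitions none/read -> write: publish a
    /// fresh slot so waiters know when the lock becomes stealable.
    fn acquire_write(&self, timeout: Duration) -> Arc<TxnSlot> {
        let slot = Arc::new(TxnSlot {
            timeout_at: Instant::now() + timeout,
            is_stolen: AtomicBool::new(false),
        });
        *self.slot.lock().unwrap() = Some(Arc::clone(&slot));
        slot
    }

    /// Called by a waiter whose wait expired without a termination
    /// notification: mark the slot stolen and clear it. (In sqld, the
    /// stealer would also roll back the slot's transaction here.)
    fn try_steal(&self) -> bool {
        let mut guard = self.slot.lock().unwrap();
        match guard.as_ref() {
            Some(slot) if Instant::now() >= slot.timeout_at => {
                slot.is_stolen.store(true, Ordering::Relaxed);
                *guard = None;
                true
            }
            _ => false,
        }
    }
}
```

A waiter that times out calls `try_steal`; a holder whose `is_stolen` flag is set reports a timed-out error on its next request instead of continuing.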

@MarinPostma MarinPostma force-pushed the async-connection branch 6 times, most recently from 07b658a to 6a99215 Compare September 22, 2023 12:57
@MarinPostma MarinPostma marked this pull request as ready for review September 22, 2023 12:59
@MarinPostma MarinPostma changed the title remove long lasting connection blocking thread Stop spawning long-lasting threads for libsql connections Sep 22, 2023
Contributor

@psarna psarna left a comment


Nice, I really like the idea, but there's one more thing to consider. Previous transaction time limit was also an implicit backpressure/guardrails mechanism, so that e.g. if somebody opens a single transaction and then keeps writing for ages, they'd just get a timeout and learn that they need to write in smaller bits. After the patch, users could technically keep writing, with their data not getting checkpointed (because of the write lock), and not backed up I guess, since it's not a full transaction. It's not very critical, because it's just an antipattern to have super-long-standing transactions, but maybe we should impose some kind of high limit anyway? So that people that hold a transaction for a full minute know that they're very likely doing something wrong.
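The hard upper limit suggested here could be as simple as the following sketch. The constant name and its value are hypothetical, not from the PR:

```rust
use std::time::{Duration, Instant};

// Hypothetical hard cap: even with no competing connection, roll back any
// write transaction older than this, so pathological long writers get an
// error instead of holding the write lock (and blocking checkpoints)
// indefinitely.
const MAX_TXN_DURATION: Duration = Duration::from_secs(60);

/// Returns true if a transaction started at `started_at` should be
/// rolled back regardless of lock contention.
fn exceeded_hard_limit(started_at: Instant) -> bool {
    started_at.elapsed() > MAX_TXN_DURATION
}
```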

/// This is done by calling rollback on the slot's txn, and marking the slot as stolen.
/// - When a connection notices that its slot has been stolen, it returns a timed-out error to the next request.
unsafe extern "C" fn busy_handler<W: WalHook>(state: *mut c_void, _retries: c_int) -> c_int {
let state = &*(state as *mut TxnState<W>);
Contributor


Sounds like we can have a fast path here where timeout_at is already in the past, that doesn't need to run within tokio::block_on and use the runtime in any way, right? And we can just steal the lock immediately. It's technically just an optimization, but I suggest we add it early, if we have an opportunity to run sync code without blocking on an async reactor
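The fast path suggested above could look like this sketch. The `TxnSlot` fields mirror the PR description, but the function and its placement are assumptions:

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::time::Instant;

// Minimal stand-in for the PR's TxnSlot; fields are assumptions.
struct TxnSlot {
    timeout_at: Instant,
    is_stolen: AtomicBool,
}

/// Fast path: if the slot is already stolen or its deadline has already
/// passed, the busy handler can decide to steal synchronously, without
/// entering `tokio::block_on` to wait for a notification.
fn can_steal_immediately(slot: &TxnSlot) -> bool {
    slot.is_stolen.load(Ordering::Relaxed) || Instant::now() >= slot.timeout_at
}
```

Only when this returns false does the handler need to fall back to the async wait for the slot's deadline or an early-termination notification.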

let mut lock = this.lock();

if let Some(slot) = &lock.slot {
if slot.is_stolen.load(Ordering::Relaxed) || Instant::now() > slot.timeout_at {
Contributor


Ugh, I just wrote an elaborate paragraph on why I think we may need acquire/release on load/store of is_stolen respectively, only to read your comment about the connection lock above. Nvm 😇

@MarinPostma
Collaborator Author

That's a relatively easy fix, I think: we just have to check that we haven't timed out the current transaction in the connection. For checkpointing, we can also have it steal the lock when necessary. I'll do that :) @psarna

@psarna
Contributor

psarna commented Oct 1, 2023

Btw, our checkpointing connection is not preemptible, since it needs to finish, right? In that case we should clearly log and measure how much time a checkpoint takes, since it's likely to be a source of SQLITE_BUSY for users

@MarinPostma
Collaborator Author

This is an argument for checkpointing often

@MarinPostma
Collaborator Author

MarinPostma commented Oct 3, 2023

> In that case, we should clearly log and measure

we already do that :)

@MarinPostma MarinPostma force-pushed the async-connection branch 2 times, most recently from 7c6a4fc to edd7161 Compare October 3, 2023 13:43
@penberg
Contributor

penberg commented Oct 3, 2023

Looks good, ship it, @MarinPostma! 🚀

@MarinPostma MarinPostma enabled auto-merge October 3, 2023 14:27
@MarinPostma MarinPostma added this pull request to the merge queue Oct 3, 2023
Merged via the queue into main with commit 99ff0c9 Oct 3, 2023
9 checks passed
@MarinPostma MarinPostma deleted the async-connection branch October 3, 2023 15:06