Skip to content

Conversation

@tobixen
Copy link

@tobixen tobixen commented Jan 1, 2026

Summary

When database commits fail (e.g., due to disk full or I/O errors), the worker thread would panic, poisoning the Mutex lock and making the server completely unusable until restart. Users would see endless "poisoned lock: another task failed inside" errors.

This change:

  • Replaces panic! with error logging and graceful worker shutdown
  • Adds CommitFailed error variant to DatastoreError
  • Maps CommitFailed to HTTP 503 Service Unavailable
  • Makes legacy import transaction failures non-fatal (best-effort)

The worker now exits cleanly when commits fail, allowing the channel to close properly. Future requests receive a clean "channel closed" error instead of encountering a poisoned lock.

Context

This was discovered while investigating issue #405. While the root cause reported there was different (the period_union bug, which was fixed in commit 1f4f07d), the symptom of a poisoned lock making the server unusable is shared. In my case, the actual panic was:

panicked at 'Failed to commit datastore transaction! database or disk is full': aw-datastore/src/worker.rs:195

Any panic while holding the datastore lock will poison it, so this fix addresses one category of such panics.

Test plan

  • cargo build compiles successfully
  • cargo test --package aw-datastore passes (9 tests)
  • Manual testing: fill disk, verify server logs error but doesn't poison lock

🤖 Generated with Claude Code


Important

Handle database commit failures gracefully by logging errors and shutting down the worker thread instead of panicking, preventing server lock poisoning.

  • Behavior:
    • Replace panic! with error logging and graceful shutdown in work_loop() in worker.rs.
    • Add CommitFailed error variant to DatastoreError in lib.rs.
    • Map CommitFailed to HTTP 503 in endpoints/util.rs.
    • Make legacy import transaction failures non-fatal in worker.rs.
  • Error Handling:
    • Log errors and set self.quit = true on commit failure in work_loop() in worker.rs.
    • Handle CommitFailed in HttpErrorJson conversion in endpoints/util.rs.

This description was created by Ellipsis for 44656f1. You can customize this summary. It will automatically update as commits are pushed.

When database commits fail (e.g., due to disk full or I/O errors),
the worker thread would panic, poisoning the Mutex lock and making
the server completely unusable until restart.

This change:
- Replaces panic! with error logging and graceful worker shutdown
- Adds CommitFailed error variant to DatastoreError
- Maps CommitFailed to HTTP 503 Service Unavailable
- Makes legacy import transaction failures non-fatal

The worker now exits cleanly when commits fail, allowing the channel
to close properly. Future requests receive a clean error instead of
encountering a poisoned lock.

Fixes behavior reported in issue ActivityWatch#405 (though root cause there was
different, the poisoned lock symptom is the same).

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Copy link

@ellipsis-dev ellipsis-dev bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Important

Looks good to me! 👍

Reviewed everything up to 44656f1 in 47 seconds. Click for details.
  • Reviewed 87 lines of code in 3 files
  • Skipped 0 files when reviewing.
  • Skipped posting 4 draft comments. View those below.
  • Modify your settings and rules to customize what types of comments Ellipsis leaves. And don't forget to react with 👍 or 👎 to teach Ellipsis.
1. aw-datastore/src/lib.rs:42
  • Draft comment:
    Added the CommitFailed error variant. This is a good step to differentiate commit failures from other errors. Ensure that later code (e.g. in worker.rs) consistently uses or maps this variant where appropriate.
  • Reason this comment was not posted:
    Confidence changes required: 0% <= threshold 50% None
2. aw-datastore/src/worker.rs:132
  • Draft comment:
    Legacy import handling now logs errors for transaction start/commit failures instead of panicking. This prevents lock poisoning; just ensure that skipping legacy import is acceptable if it fails.
  • Reason this comment was not posted:
    Confidence changes required: 0% <= threshold 50% None
3. aw-datastore/src/worker.rs:197
  • Draft comment:
    Instead of panicking on commit failure of the main transaction, the worker now logs the error and marks itself to quit gracefully. This avoids mutex poisoning. Confirm that gracefully shutting down the worker is the desired behavior for client requests.
  • Reason this comment was not posted:
    Confidence changes required: 0% <= threshold 50% None
4. aw-server/src/endpoints/util.rs:101
  • Draft comment:
    The new mapping from DatastoreError::CommitFailed to an HTTP 503 (Service Unavailable) is correctly implemented. This ensures clients receive a meaningful error response when commit failures occur.
  • Reason this comment was not posted:
    Confidence changes required: 0% <= threshold 50% None

Workflow ID: wflow_nd6N2qKAC4qJIAcY

You can customize Ellipsis by changing your verbosity settings, reacting with 👍 or 👎, replying to comments, or adding code review rules.

@greptile-apps
Copy link

greptile-apps bot commented Jan 1, 2026

Greptile Summary

Replaced panic-on-commit-failure with graceful worker shutdown to prevent mutex lock poisoning. When database commits fail (disk full, I/O errors), the worker thread now logs the error and exits cleanly instead of panicking.

Key changes:

  • Legacy import made best-effort (no longer fatal if transaction fails)
  • Main commit failure on line 197-209 sets quit=true and exits gracefully
  • Added CommitFailed error variant (though currently unused by worker)
  • HTTP 503 mapping for CommitFailed in endpoint util

Behavior notes:

  • Requests processed before a failed commit receive optimistic Ok responses, but their data may not be persisted
  • Subsequent requests fail with MpscError (channel closed) rather than CommitFailed
  • The new CommitFailed error variant is defined but never constructed - worker exits without returning this error to pending requests

This prevents the "poisoned lock" issue described in #405, though the recovery behavior is server shutdown rather than per-request error handling.

Confidence Score: 4/5

  • Safe to merge - significantly improves server resilience by preventing lock poisoning
  • The core fix (replacing panic with graceful shutdown) solves the immediate problem of server becoming completely unusable. One logic concern: the CommitFailed error variant is defined but never actually returned by the worker, and requests receive optimistic responses before commit failures. However, this trade-off may be acceptable since the alternative (panicking) is strictly worse.
  • Pay attention to aw-datastore/src/lib.rs - the CommitFailed error variant is currently dead code

Important Files Changed

Filename Overview
aw-datastore/src/worker.rs Replaced panic with graceful shutdown on commit failures, made legacy import best-effort
aw-datastore/src/lib.rs Added CommitFailed error variant (currently unused - not returned by worker)
aw-server/src/endpoints/util.rs Maps CommitFailed to HTTP 503 with helpful error message

Sequence Diagram

sequenceDiagram
    participant Client as HTTP Client
    participant Server as Rocket Server
    participant Worker as DatastoreWorker Thread
    participant DB as SQLite Database

    Note over Worker,DB: Normal Operation
    Client->>Server: API Request (insert event)
    Server->>Worker: Send Command via channel
    Worker->>DB: Execute in transaction
    DB-->>Worker: Success
    Worker->>Worker: Mark uncommitted_events++
    Worker-->>Server: Response::EventList
    Server-->>Client: HTTP 200 OK
    
    Note over Worker,DB: Periodic Commit (every 15s or 100 events)
    Worker->>DB: tx.commit()
    DB-->>Worker: Ok(())
    Worker->>Worker: Continue processing

    Note over Worker,DB: Commit Failure (NEW: Graceful Shutdown)
    Client->>Server: API Request
    Server->>Worker: Send Command
    Worker->>DB: Execute in transaction
    DB-->>Worker: Success
    Worker-->>Server: Response (optimistic)
    Server-->>Client: HTTP 200 OK
    Worker->>DB: tx.commit()
    DB--xWorker: Err (disk full)
    Worker->>Worker: error!() + quit=true
    Worker->>Worker: Exit work_loop gracefully
    
    Note over Server,Worker: Worker Exited - Channel Closed
    Client->>Server: Next API Request
    Server->>Worker: Try send Command
    Worker--xServer: Channel closed (MpscError)
    Server-->>Client: HTTP 500 InternalServerError
Loading

Copy link

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Additional Comments (1)

  1. aw-datastore/src/worker.rs, line 197-209 (link)

    logic: Worker exits gracefully on commit failure but never returns DatastoreError::CommitFailed to pending requests

    When tx.commit() fails, the worker sets self.quit = true and exits the loop. However, the requests that were processed in this transaction never receive error responses - they already got Ok responses on line 181. Subsequent requests will fail with MpscError when they try to send to the closed channel, not CommitFailed.

    Consider whether pending requests should be notified of the commit failure, or if the current behavior (optimistic response, server becomes unavailable on next request) is acceptable for your use case.

3 files reviewed, 1 comment

Edit Code Review Agent Settings | Greptile

@tobixen
Copy link
Author

tobixen commented Jan 1, 2026

Disclaimer: this PR was AI-generated. I ("I" as in "not the AI, but the real human stupidity agent operating under the nickname tobixen since the Soviet and USA was unsuccessfully competing to be the first to invent real AI in the 80s") don't know rust, and I haven't (yet) tested manually if this fixes the disk-full problem. The changes seems good to me, but I should still verify manually what happens when the disk goes full. I also notice that the patch is without test code, I'll try to see if Claude can help with that, too.

Tests all DatastoreError variants including the new CommitFailed
variant which maps to HTTP 503 Service Unavailable.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@tobixen
Copy link
Author

tobixen commented Jan 1, 2026

A very simple unit test has been added. Claude suggested two more complicated approaches, one is a mroe complete test depending on mocking up thing. My personal general experience is that "mocking up things" makes complicated and fragile tests without delivering much value. The third approach would be to actually fill the disk - this would certainly need to be handled in some sandbox environment, I think it's outside the scope of this pull request.

@tobixen
Copy link
Author

tobixen commented Jan 1, 2026

I don't know anything about rust, but it appears to me that a make a general fix would be to make it so that panic!(...) would cause an immediate graceful exit rather than putting the server in an unusable state. I'll ask claude to make a separate pull request for that.

Comment on lines +131 to +148
// Ensure legacy import (best-effort, don't fail if it doesn't work)
if self.legacy_import {
let transaction = match conn.transaction_with_behavior(TransactionBehavior::Immediate) {
Ok(transaction) => transaction,
match conn.transaction_with_behavior(TransactionBehavior::Immediate) {
Ok(transaction) => {
match ds.ensure_legacy_import(&transaction) {
Ok(_) => (),
Err(err) => error!("Failed to do legacy import: {:?}", err),
}
match transaction.commit() {
Ok(_) => (),
Err(err) => {
error!("Failed to commit legacy import transaction: {err}");
}
}
}
Err(err) => {
panic!("Unable to start immediate transaction on SQLite database! {err}")
error!("Unable to start transaction for legacy import: {err}");
info!("Skipping legacy import due to transaction error");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should result in a panic, too important of a functionality to fail silently.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I read the code above correctly (which was churned out by the AI), it downgrades a panic to an error without quitting on legacy import. Can the legacy import be done later, or is it paramount that it's done before importing new data? If the legacy import can be done later, then in my opinion error would be more appropriate than panic - though, it's not up to me to decide the details or the direction of this project, so I'll happily upgrade it to a panic if that's considered more appropriate.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might even corrupt the database if it's not done before, since it's appending data to a database with a different schema. Or it will just fail to save data to the db. Not sure really, haven't investigated in detail, but I don't think you can skip this.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. It's moot now anyway, as I closed this pull request - it's sort of superseded by #553

@tobixen
Copy link
Author

tobixen commented Jan 2, 2026

The whole point with this pull request was that panics were unsafe, leaving the server in a vegetable state. This has been fixed in #553, so this entire pull request is kind of superseded. It does add some improved logging, but at least the logic with error + quit rather than panic should be reverted.

@tobixen tobixen marked this pull request as draft January 2, 2026 08:37
@tobixen
Copy link
Author

tobixen commented Jan 2, 2026

Closing in favor of #553, which provides a more general solution.

PR #553 uses catch_unwind() to handle any panic gracefully, not just the specific commit failure case. This is a cleaner approach because:

  1. It catches all panics, not just the ones we explicitly converted to error handling
  2. It doesn't introduce dead code (the CommitFailed error variant defined here was never actually returned)
  3. It's defense-in-depth - even if we miss converting some panic! to proper error handling, the server won't end up in an unusable state

The improved error logging from this PR could potentially be added separately, but it's lower priority now that the core issue (poisoned locks) is addressed by #553.

🤖 Generated with Claude Code

@tobixen tobixen closed this Jan 2, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants