fix: handle commit failures gracefully instead of panicking #552

tobixen · 2026-01-01T23:06:04Z

Summary

When database commits fail (e.g., due to disk full or I/O errors), the worker thread would panic, poisoning the Mutex lock and making the server completely unusable until restart. Users would see endless "poisoned lock: another task failed inside" errors.

This change:

Replaces panic! with error logging and graceful worker shutdown
Adds CommitFailed error variant to DatastoreError
Maps CommitFailed to HTTP 503 Service Unavailable
Makes legacy import transaction failures non-fatal (best-effort)

The worker now exits cleanly when commits fail, allowing the channel to close properly. Future requests receive a clean "channel closed" error instead of encountering a poisoned lock.

Context

This was discovered while investigating issue #405. While the root cause reported there was different (the period_union bug, which was fixed in commit 1f4f07d), the symptom of a poisoned lock making the server unusable is shared. In my case, the actual panic was:

panicked at 'Failed to commit datastore transaction! database or disk is full': aw-datastore/src/worker.rs:195

Any panic while holding the datastore lock will poison it, so this fix addresses one category of such panics.

Test plan

cargo build compiles successfully
cargo test --package aw-datastore passes (9 tests)
Manual testing: fill disk, verify server logs error but doesn't poison lock

🤖 Generated with Claude Code

Important

Handle database commit failures gracefully by logging errors and shutting down the worker thread instead of panicking, preventing server lock poisoning.

Behavior:
- Replace panic! with error logging and graceful shutdown in work_loop() in worker.rs.
- Add CommitFailed error variant to DatastoreError in lib.rs.
- Map CommitFailed to HTTP 503 in endpoints/util.rs.
- Make legacy import transaction failures non-fatal in worker.rs.
Error Handling:
- Log errors and set self.quit = true on commit failure in work_loop() in worker.rs.
- Handle CommitFailed in HttpErrorJson conversion in endpoints/util.rs.

^{This description was created by}^{for 44656f1. You can customize this summary. It will automatically update as commits are pushed.}

When database commits fail (e.g., due to disk full or I/O errors), the worker thread would panic, poisoning the Mutex lock and making the server completely unusable until restart. This change: - Replaces panic! with error logging and graceful worker shutdown - Adds CommitFailed error variant to DatastoreError - Maps CommitFailed to HTTP 503 Service Unavailable - Makes legacy import transaction failures non-fatal The worker now exits cleanly when commits fail, allowing the channel to close properly. Future requests receive a clean error instead of encountering a poisoned lock. Fixes behavior reported in issue ActivityWatch#405 (though root cause there was different, the poisoned lock symptom is the same). 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

ellipsis-dev

Important

Looks good to me! 👍

Reviewed everything up to 44656f1 in 47 seconds. Click for details.

Reviewed 87 lines of code in 3 files
Skipped 0 files when reviewing.
Skipped posting 4 draft comments. View those below.
Modify your settings and rules to customize what types of comments Ellipsis leaves. And don't forget to react with 👍 or 👎 to teach Ellipsis.

1. aw-datastore/src/lib.rs:42

Draft comment:
Added the CommitFailed error variant. This is a good step to differentiate commit failures from other errors. Ensure that later code (e.g. in worker.rs) consistently uses or maps this variant where appropriate.
Reason this comment was not posted:
Confidence changes required: 0% <= threshold 50% None

2. aw-datastore/src/worker.rs:132

Draft comment:
Legacy import handling now logs errors for transaction start/commit failures instead of panicking. This prevents lock poisoning; just ensure that skipping legacy import is acceptable if it fails.
Reason this comment was not posted:
Confidence changes required: 0% <= threshold 50% None

3. aw-datastore/src/worker.rs:197

Draft comment:
Instead of panicking on commit failure of the main transaction, the worker now logs the error and marks itself to quit gracefully. This avoids mutex poisoning. Confirm that gracefully shutting down the worker is the desired behavior for client requests.
Reason this comment was not posted:
Confidence changes required: 0% <= threshold 50% None

4. aw-server/src/endpoints/util.rs:101

Draft comment:
The new mapping from DatastoreError::CommitFailed to an HTTP 503 (Service Unavailable) is correctly implemented. This ensures clients receive a meaningful error response when commit failures occur.
Reason this comment was not posted:
Confidence changes required: 0% <= threshold 50% None

Workflow ID: wflow_nd6N2qKAC4qJIAcY

^{You can customize}^{by changing your verbosity settings, reacting with 👍 or 👎, replying to comments, or adding code review rules.}

greptile-apps · 2026-01-01T23:08:50Z

Greptile Summary

Replaced panic-on-commit-failure with graceful worker shutdown to prevent mutex lock poisoning. When database commits fail (disk full, I/O errors), the worker thread now logs the error and exits cleanly instead of panicking.

Key changes:

Legacy import made best-effort (no longer fatal if transaction fails)
Main commit failure on line 197-209 sets quit=true and exits gracefully
Added CommitFailed error variant (though currently unused by worker)
HTTP 503 mapping for CommitFailed in endpoint util

Behavior notes:

Requests processed before a failed commit receive optimistic Ok responses, but their data may not be persisted
Subsequent requests fail with MpscError (channel closed) rather than CommitFailed
The new CommitFailed error variant is defined but never constructed - worker exits without returning this error to pending requests

This prevents the "poisoned lock" issue described in #405, though the recovery behavior is server shutdown rather than per-request error handling.

Confidence Score: 4/5

Safe to merge - significantly improves server resilience by preventing lock poisoning
The core fix (replacing panic with graceful shutdown) solves the immediate problem of server becoming completely unusable. One logic concern: the CommitFailed error variant is defined but never actually returned by the worker, and requests receive optimistic responses before commit failures. However, this trade-off may be acceptable since the alternative (panicking) is strictly worse.
Pay attention to aw-datastore/src/lib.rs - the CommitFailed error variant is currently dead code

Important Files Changed

Filename	Overview
aw-datastore/src/worker.rs	Replaced panic with graceful shutdown on commit failures, made legacy import best-effort
aw-datastore/src/lib.rs	Added `CommitFailed` error variant (currently unused - not returned by worker)
aw-server/src/endpoints/util.rs	Maps `CommitFailed` to HTTP 503 with helpful error message

Sequence Diagram

sequenceDiagram
    participant Client as HTTP Client
    participant Server as Rocket Server
    participant Worker as DatastoreWorker Thread
    participant DB as SQLite Database

    Note over Worker,DB: Normal Operation
    Client->>Server: API Request (insert event)
    Server->>Worker: Send Command via channel
    Worker->>DB: Execute in transaction
    DB-->>Worker: Success
    Worker->>Worker: Mark uncommitted_events++
    Worker-->>Server: Response::EventList
    Server-->>Client: HTTP 200 OK
    
    Note over Worker,DB: Periodic Commit (every 15s or 100 events)
    Worker->>DB: tx.commit()
    DB-->>Worker: Ok(())
    Worker->>Worker: Continue processing

    Note over Worker,DB: Commit Failure (NEW: Graceful Shutdown)
    Client->>Server: API Request
    Server->>Worker: Send Command
    Worker->>DB: Execute in transaction
    DB-->>Worker: Success
    Worker-->>Server: Response (optimistic)
    Server-->>Client: HTTP 200 OK
    Worker->>DB: tx.commit()
    DB--xWorker: Err (disk full)
    Worker->>Worker: error!() + quit=true
    Worker->>Worker: Exit work_loop gracefully
    
    Note over Server,Worker: Worker Exited - Channel Closed
    Client->>Server: Next API Request
    Server->>Worker: Try send Command
    Worker--xServer: Channel closed (MpscError)
    Server-->>Client: HTTP 500 InternalServerError

greptile-apps

Additional Comments (1)

aw-datastore/src/worker.rs, line 197-209 (link)

logic: Worker exits gracefully on commit failure but never returns DatastoreError::CommitFailed to pending requests

When tx.commit() fails, the worker sets self.quit = true and exits the loop. However, the requests that were processed in this transaction never receive error responses - they already got Ok responses on line 181. Subsequent requests will fail with MpscError when they try to send to the closed channel, not CommitFailed.

Consider whether pending requests should be notified of the commit failure, or if the current behavior (optimistic response, server becomes unavailable on next request) is acceptable for your use case.

_{3 files reviewed, 1 comment}

_{Edit Code Review Agent Settings | Greptile}

tobixen · 2026-01-01T23:11:31Z

Disclaimer: this PR was AI-generated. I ("I" as in "not the AI, but the real human stupidity agent operating under the nickname tobixen since the Soviet and USA was unsuccessfully competing to be the first to invent real AI in the 80s") don't know rust, and I haven't (yet) tested manually if this fixes the disk-full problem. The changes seems good to me, but I should still verify manually what happens when the disk goes full. I also notice that the patch is without test code, I'll try to see if Claude can help with that, too.

Tests all DatastoreError variants including the new CommitFailed variant which maps to HTTP 503 Service Unavailable. 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

tobixen · 2026-01-01T23:35:38Z

A very simple unit test has been added. Claude suggested two more complicated approaches, one is a mroe complete test depending on mocking up thing. My personal general experience is that "mocking up things" makes complicated and fragile tests without delivering much value. The third approach would be to actually fill the disk - this would certainly need to be handled in some sandbox environment, I think it's outside the scope of this pull request.

tobixen · 2026-01-01T23:38:07Z

I don't know anything about rust, but it appears to me that a make a general fix would be to make it so that panic!(...) would cause an immediate graceful exit rather than putting the server in an unusable state. I'll ask claude to make a separate pull request for that.

0xbrayo · 2026-01-02T07:55:53Z

aw-datastore/src/worker.rs

+        // Ensure legacy import (best-effort, don't fail if it doesn't work)
        if self.legacy_import {
-            let transaction = match conn.transaction_with_behavior(TransactionBehavior::Immediate) {
-                Ok(transaction) => transaction,
+            match conn.transaction_with_behavior(TransactionBehavior::Immediate) {
+                Ok(transaction) => {
+                    match ds.ensure_legacy_import(&transaction) {
+                        Ok(_) => (),
+                        Err(err) => error!("Failed to do legacy import: {:?}", err),
+                    }
+                    match transaction.commit() {
+                        Ok(_) => (),
+                        Err(err) => {
+                            error!("Failed to commit legacy import transaction: {err}");
+                        }
+                    }
+                }
                Err(err) => {
-                    panic!("Unable to start immediate transaction on SQLite database! {err}")
+                    error!("Unable to start transaction for legacy import: {err}");
+                    info!("Skipping legacy import due to transaction error");


I think this should result in a panic, too important of a functionality to fail silently.

If I read the code above correctly (which was churned out by the AI), it downgrades a panic to an error without quitting on legacy import. Can the legacy import be done later, or is it paramount that it's done before importing new data? If the legacy import can be done later, then in my opinion error would be more appropriate than panic - though, it's not up to me to decide the details or the direction of this project, so I'll happily upgrade it to a panic if that's considered more appropriate.

It might even corrupt the database if it's not done before, since it's appending data to a database with a different schema. Or it will just fail to save data to the db. Not sure really, haven't investigated in detail, but I don't think you can skip this.

Ok. It's moot now anyway, as I closed this pull request - it's sort of superseded by #553

tobixen · 2026-01-02T08:35:59Z

The whole point with this pull request was that panics were unsafe, leaving the server in a vegetable state. This has been fixed in #553, so this entire pull request is kind of superseded. It does add some improved logging, but at least the logic with error + quit rather than panic should be reverted.

tobixen · 2026-01-02T09:28:34Z

Closing in favor of #553, which provides a more general solution.

PR #553 uses catch_unwind() to handle any panic gracefully, not just the specific commit failure case. This is a cleaner approach because:

It catches all panics, not just the ones we explicitly converted to error handling
It doesn't introduce dead code (the CommitFailed error variant defined here was never actually returned)
It's defense-in-depth - even if we miss converting some panic! to proper error handling, the server won't end up in an unusable state

The improved error logging from this PR could potentially be added separately, but it's lower priority now that the core issue (poisoned locks) is addressed by #553.

🤖 Generated with Claude Code

ellipsis-dev bot reviewed Jan 1, 2026

View reviewed changes

greptile-apps bot reviewed Jan 1, 2026

View reviewed changes

0xbrayo reviewed Jan 2, 2026

View reviewed changes

tobixen marked this pull request as draft January 2, 2026 08:37

tobixen closed this Jan 2, 2026

Uh oh!

fix: handle commit failures gracefully instead of panicking #552

fix: handle commit failures gracefully instead of panicking #552

Uh oh!

Conversation

tobixen commented Jan 1, 2026 • edited by ellipsis-dev bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Context

Test plan

Uh oh!

ellipsis-dev bot left a comment

Choose a reason for hiding this comment

Uh oh!

greptile-apps bot commented Jan 1, 2026

Greptile Summary

Confidence Score: 4/5

Important Files Changed

Sequence Diagram

Uh oh!

greptile-apps bot left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Additional Comments (1)

Uh oh!

tobixen commented Jan 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

tobixen commented Jan 1, 2026

Uh oh!

tobixen commented Jan 1, 2026

Uh oh!

0xbrayo Jan 2, 2026

Choose a reason for hiding this comment

Uh oh!

tobixen Jan 2, 2026

Choose a reason for hiding this comment

Uh oh!

0xbrayo Jan 2, 2026

Choose a reason for hiding this comment

Uh oh!

tobixen Jan 2, 2026

Choose a reason for hiding this comment

Uh oh!

tobixen commented Jan 2, 2026

Uh oh!

tobixen commented Jan 2, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

tobixen commented Jan 1, 2026 •

edited by ellipsis-dev bot

Loading

greptile-apps bot left a comment •

edited

Loading

tobixen commented Jan 1, 2026 •

edited

Loading