-
-
Notifications
You must be signed in to change notification settings - Fork 73
fix: handle commit failures gracefully instead of panicking #552
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: handle commit failures gracefully instead of panicking #552
Conversation
When database commits fail (e.g., due to disk full or I/O errors), the worker thread would panic, poisoning the Mutex lock and making the server completely unusable until restart. This change: - Replaces panic! with error logging and graceful worker shutdown - Adds CommitFailed error variant to DatastoreError - Maps CommitFailed to HTTP 503 Service Unavailable - Makes legacy import transaction failures non-fatal The worker now exits cleanly when commits fail, allowing the channel to close properly. Future requests receive a clean error instead of encountering a poisoned lock. Fixes behavior reported in issue ActivityWatch#405 (though root cause there was different, the poisoned lock symptom is the same). 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Important
Looks good to me! 👍
Reviewed everything up to 44656f1 in 47 seconds. Click for details.
- Reviewed
87lines of code in3files - Skipped
0files when reviewing. - Skipped posting
4draft comments. View those below. - Modify your settings and rules to customize what types of comments Ellipsis leaves. And don't forget to react with 👍 or 👎 to teach Ellipsis.
1. aw-datastore/src/lib.rs:42
- Draft comment:
Added the CommitFailed error variant. This is a good step to differentiate commit failures from other errors. Ensure that later code (e.g. in worker.rs) consistently uses or maps this variant where appropriate. - Reason this comment was not posted:
Confidence changes required:0%<= threshold50%None
2. aw-datastore/src/worker.rs:132
- Draft comment:
Legacy import handling now logs errors for transaction start/commit failures instead of panicking. This prevents lock poisoning; just ensure that skipping legacy import is acceptable if it fails. - Reason this comment was not posted:
Confidence changes required:0%<= threshold50%None
3. aw-datastore/src/worker.rs:197
- Draft comment:
Instead of panicking on commit failure of the main transaction, the worker now logs the error and marks itself to quit gracefully. This avoids mutex poisoning. Confirm that gracefully shutting down the worker is the desired behavior for client requests. - Reason this comment was not posted:
Confidence changes required:0%<= threshold50%None
4. aw-server/src/endpoints/util.rs:101
- Draft comment:
The new mapping from DatastoreError::CommitFailed to an HTTP 503 (Service Unavailable) is correctly implemented. This ensures clients receive a meaningful error response when commit failures occur. - Reason this comment was not posted:
Confidence changes required:0%<= threshold50%None
Workflow ID: wflow_nd6N2qKAC4qJIAcY
You can customize by changing your verbosity settings, reacting with 👍 or 👎, replying to comments, or adding code review rules.
Greptile SummaryReplaced panic-on-commit-failure with graceful worker shutdown to prevent mutex lock poisoning. When database commits fail (disk full, I/O errors), the worker thread now logs the error and exits cleanly instead of panicking. Key changes:
Behavior notes:
This prevents the "poisoned lock" issue described in #405, though the recovery behavior is server shutdown rather than per-request error handling. Confidence Score: 4/5
Important Files Changed
Sequence DiagramsequenceDiagram
participant Client as HTTP Client
participant Server as Rocket Server
participant Worker as DatastoreWorker Thread
participant DB as SQLite Database
Note over Worker,DB: Normal Operation
Client->>Server: API Request (insert event)
Server->>Worker: Send Command via channel
Worker->>DB: Execute in transaction
DB-->>Worker: Success
Worker->>Worker: Mark uncommitted_events++
Worker-->>Server: Response::EventList
Server-->>Client: HTTP 200 OK
Note over Worker,DB: Periodic Commit (every 15s or 100 events)
Worker->>DB: tx.commit()
DB-->>Worker: Ok(())
Worker->>Worker: Continue processing
Note over Worker,DB: Commit Failure (NEW: Graceful Shutdown)
Client->>Server: API Request
Server->>Worker: Send Command
Worker->>DB: Execute in transaction
DB-->>Worker: Success
Worker-->>Server: Response (optimistic)
Server-->>Client: HTTP 200 OK
Worker->>DB: tx.commit()
DB--xWorker: Err (disk full)
Worker->>Worker: error!() + quit=true
Worker->>Worker: Exit work_loop gracefully
Note over Server,Worker: Worker Exited - Channel Closed
Client->>Server: Next API Request
Server->>Worker: Try send Command
Worker--xServer: Channel closed (MpscError)
Server-->>Client: HTTP 500 InternalServerError
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Additional Comments (1)
-
aw-datastore/src/worker.rs, line 197-209 (link)logic: Worker exits gracefully on commit failure but never returns
DatastoreError::CommitFailedto pending requestsWhen
tx.commit()fails, the worker setsself.quit = trueand exits the loop. However, the requests that were processed in this transaction never receive error responses - they already gotOkresponses on line 181. Subsequent requests will fail withMpscErrorwhen they try to send to the closed channel, notCommitFailed.Consider whether pending requests should be notified of the commit failure, or if the current behavior (optimistic response, server becomes unavailable on next request) is acceptable for your use case.
3 files reviewed, 1 comment
|
Disclaimer: this PR was AI-generated. I ("I" as in "not the AI, but the real human stupidity agent operating under the nickname tobixen since the Soviet and USA was unsuccessfully competing to be the first to invent real AI in the 80s") don't know rust, and I haven't (yet) tested manually if this fixes the disk-full problem. The changes seems good to me, but I should still verify manually what happens when the disk goes full. I also notice that the patch is without test code, I'll try to see if Claude can help with that, too. |
Tests all DatastoreError variants including the new CommitFailed variant which maps to HTTP 503 Service Unavailable. 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
|
A very simple unit test has been added. Claude suggested two more complicated approaches, one is a mroe complete test depending on mocking up thing. My personal general experience is that "mocking up things" makes complicated and fragile tests without delivering much value. The third approach would be to actually fill the disk - this would certainly need to be handled in some sandbox environment, I think it's outside the scope of this pull request. |
|
I don't know anything about rust, but it appears to me that a make a general fix would be to make it so that panic!(...) would cause an immediate graceful exit rather than putting the server in an unusable state. I'll ask claude to make a separate pull request for that. |
| // Ensure legacy import (best-effort, don't fail if it doesn't work) | ||
| if self.legacy_import { | ||
| let transaction = match conn.transaction_with_behavior(TransactionBehavior::Immediate) { | ||
| Ok(transaction) => transaction, | ||
| match conn.transaction_with_behavior(TransactionBehavior::Immediate) { | ||
| Ok(transaction) => { | ||
| match ds.ensure_legacy_import(&transaction) { | ||
| Ok(_) => (), | ||
| Err(err) => error!("Failed to do legacy import: {:?}", err), | ||
| } | ||
| match transaction.commit() { | ||
| Ok(_) => (), | ||
| Err(err) => { | ||
| error!("Failed to commit legacy import transaction: {err}"); | ||
| } | ||
| } | ||
| } | ||
| Err(err) => { | ||
| panic!("Unable to start immediate transaction on SQLite database! {err}") | ||
| error!("Unable to start transaction for legacy import: {err}"); | ||
| info!("Skipping legacy import due to transaction error"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this should result in a panic, too important of a functionality to fail silently.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I read the code above correctly (which was churned out by the AI), it downgrades a panic to an error without quitting on legacy import. Can the legacy import be done later, or is it paramount that it's done before importing new data? If the legacy import can be done later, then in my opinion error would be more appropriate than panic - though, it's not up to me to decide the details or the direction of this project, so I'll happily upgrade it to a panic if that's considered more appropriate.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It might even corrupt the database if it's not done before, since it's appending data to a database with a different schema. Or it will just fail to save data to the db. Not sure really, haven't investigated in detail, but I don't think you can skip this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok. It's moot now anyway, as I closed this pull request - it's sort of superseded by #553
|
The whole point with this pull request was that panics were unsafe, leaving the server in a vegetable state. This has been fixed in #553, so this entire pull request is kind of superseded. It does add some improved logging, but at least the logic with error + quit rather than panic should be reverted. |
|
Closing in favor of #553, which provides a more general solution. PR #553 uses
The improved error logging from this PR could potentially be added separately, but it's lower priority now that the core issue (poisoned locks) is addressed by #553. 🤖 Generated with Claude Code |
Summary
When database commits fail (e.g., due to disk full or I/O errors), the worker thread would panic, poisoning the Mutex lock and making the server completely unusable until restart. Users would see endless "poisoned lock: another task failed inside" errors.
This change:
panic!with error logging and graceful worker shutdownCommitFailederror variant toDatastoreErrorCommitFailedto HTTP 503 Service UnavailableThe worker now exits cleanly when commits fail, allowing the channel to close properly. Future requests receive a clean "channel closed" error instead of encountering a poisoned lock.
Context
This was discovered while investigating issue #405. While the root cause reported there was different (the
period_unionbug, which was fixed in commit 1f4f07d), the symptom of a poisoned lock making the server unusable is shared. In my case, the actual panic was:Any panic while holding the datastore lock will poison it, so this fix addresses one category of such panics.
Test plan
cargo buildcompiles successfullycargo test --package aw-datastorepasses (9 tests)🤖 Generated with Claude Code
Important
Handle database commit failures gracefully by logging errors and shutting down the worker thread instead of panicking, preventing server lock poisoning.
panic!with error logging and graceful shutdown inwork_loop()inworker.rs.CommitFailederror variant toDatastoreErrorinlib.rs.CommitFailedto HTTP 503 inendpoints/util.rs.worker.rs.self.quit = trueon commit failure inwork_loop()inworker.rs.CommitFailedinHttpErrorJsonconversion inendpoints/util.rs.This description was created by
for 44656f1. You can customize this summary. It will automatically update as commits are pushed.