[BUG] Resolve deadlock in system crate? #5283
Conversation
Summary from Claude, guided by me:

```
I've identified a critical deadlock pattern in the dispatcher-worker thread
communication system. This is a classic bounded-buffer deadlock (not livelock
or starvation).

The Deadlock Pattern:
1. WorkerThread (rust/system/src/execution/worker_thread.rs:58-61):
   - After processing a task, sends TaskRequestMessage to the dispatcher
   - This send operation blocks if the dispatcher's channel is full
2. Dispatcher (rust/system/src/execution/dispatcher.rs:199-204):
   - When receiving tasks, if no workers are waiting, tries to send the task to a worker
   - This send operation blocks if the worker's channel is full

Precise Computer Science Classification:
This is a circular-wait deadlock with the following characteristics:
- Resource type: Bounded channel buffer space
- Deadlock condition: All four Coffman conditions are met:
  a. Mutual exclusion: Channel slots are exclusively owned
  b. Hold and wait: The worker holds its channel while waiting on the dispatcher's channel
  c. No preemption: Messages cannot be forcibly removed from channels
  d. Circular wait: Worker→Dispatcher→Worker circular dependency

Specific Deadlock Scenario:
1. The dispatcher's channel reaches capacity (dispatcher_queue_size limit)
2. A worker completes a task and tries to send TaskRequestMessage at worker_thread.rs:61
3. The worker blocks because the dispatcher's channel is full
4. The dispatcher tries to send a new task to the worker at dispatcher.rs:199
5. The dispatcher blocks because the worker's channel is full (worker_queue_size limit)
6. DEADLOCK: Both components are blocked waiting for each other

Critical Code Locations:
- Worker blocking point: rust/system/src/execution/worker_thread.rs:61
- Dispatcher blocking point: rust/system/src/execution/dispatcher.rs:199
- Channel creation: rust/system/src/system.rs:39 (bounded channel with queue_size())
- Queue limits: Configured via DispatcherConfig with dispatcher_queue_size and worker_queue_size

This is not a livelock (no active spinning) or starvation (not a fairness issue),
but a true deadlock where progress is impossible once both channels are full and
each component is trying to send to the other.
```

The fix is to make the send operation return an error instead of blocking, which breaks the deadlock and fails the task. If this works on staging we'll test it, make it robust, etc.
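To make the failure mode concrete, here is a minimal, self-contained sketch (not the actual crate code; it stands in for the dispatcher/worker channels using tokio's bounded mpsc channels, assuming the `full` feature set) that reproduces the circular wait: once both bounded buffers are full and each side tries to send to the other, neither `send` can complete.

```rust
use std::time::Duration;
use tokio::sync::mpsc;
use tokio::time::timeout;

#[tokio::main]
async fn main() {
    // Stand-ins for the dispatcher and worker queues, both bounded (capacity 1).
    let (to_dispatcher, mut dispatcher_rx) = mpsc::channel::<u32>(1);
    let (to_worker, mut worker_rx) = mpsc::channel::<u32>(1);

    // Fill both buffers so the next send on either side must wait for a recv.
    to_dispatcher.send(0).await.unwrap();
    to_worker.send(0).await.unwrap();

    let worker = tokio::spawn(async move {
        // Worker finished a task and asks for more work; blocks: dispatcher queue is full.
        to_dispatcher.send(1).await.unwrap();
        worker_rx.recv().await // never reached
    });

    let dispatcher = tokio::spawn(async move {
        // Dispatcher hands out a new task; blocks: worker queue is full.
        to_worker.send(1).await.unwrap();
        dispatcher_rx.recv().await // never reached
    });

    // Both tasks are parked inside `send`, so neither finishes and this times out.
    let stuck = timeout(Duration::from_secs(1), async {
        let _ = worker.await;
        let _ = dispatcher.await;
    })
    .await
    .is_err();
    println!("deadlocked: {stuck}"); // prints "deadlocked: true"
}
```

Replacing either blocking `send` with a fail-fast attempt (as this PR does) breaks the cycle: one side errors out instead of waiting forever, at the cost of failing that task.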
Reviewer Checklist
Please leverage this checklist to ensure your code review is thorough before approving.
Testing, Bugs, Errors, Logs, Documentation
System Compatibility
Quality
Fix Bounded-Buffer Deadlock in Dispatcher-Worker Communication

This PR addresses a critical deadlock in the Rust system crate's dispatcher-worker architecture: a bounded-buffer deadlock where both the dispatcher and worker channels could become full, causing a circular wait between the components. The main fix replaces blocking channel send operations with non-blocking (fail-fast) behavior, so tasks fail if channel buffers are full instead of waiting indefinitely. Additional tweaks in companion modules and tests reflect these changes, including minor test-logic adjustments to accommodate the new failure behavior.

Key Changes
• Replaced blocking `send()` with non-blocking `try_send()` in the channel delivery path

Affected Areas
• Component channel delivery logic (types.rs)

This summary was automatically generated by @propel-code-bot
  self.sender
-     .send(WrappedMessage::new(message, None, tracing_context))
-     .await
+     .try_send(WrappedMessage::new(message, None, tracing_context))
[BestPractice]
The change from `send()` to `try_send()` effectively breaks the deadlock by making the dispatcher fail fast instead of blocking when worker channels are full. However, this approach has some concerns:
- Silent task failure: Failed sends will only return `ChannelError::SendError` without specific context about which task failed or why
- No retry mechanism: Tasks that fail to send will be permanently lost
- Resource waste: The spawned tokio task and associated resources are consumed even when the message cannot be delivered

Consider adding logging when `try_send` fails to help with observability:
Suggested replacement for the `try_send` call above:

    self.sender
        .try_send(WrappedMessage::new(message, None, tracing_context))
        .map_err(|e| {
            tracing::warn!("Failed to send scheduled message: {:?}", e);
            ChannelError::SendError
        })
This aligns with your stated plan to "watch staging not deadlock" by providing visibility into when the deadlock prevention mechanism activates.
⚡ Committable suggestion
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation.
This reverts commit 7690866.
    pub(crate) result: Result<Output, TaskError<Error>>,
    pub(crate) task_id: Uuid,
[BestPractice]
Adding `pub(crate)` visibility to the `TaskResult` fields enables the deadlock fix by allowing the dispatcher to access task results directly. However, this breaks encapsulation. Consider:
- Adding a constructor method that takes the result components
- Adding getter methods for controlled access to these fields
- Documenting why these fields need crate-level visibility

Example:
Example:
impl<Output, Error> TaskResult<Output, Error> {
pub(crate) fn new(result: Result<Output, TaskError<Error>>, task_id: Uuid) -> Self {
Self { result, task_id }
}
pub(crate) fn result(&self) -> &Result<Output, TaskError<Error>> {
&self.result
}
}
@@ -294,8 +294,7 @@ mod tests {
         message: TaskResult<(), TestError>,
         _ctx: &ComponentContext<Self>,
     ) -> Self::Result {
-        // We expect these to be cancelled, so we ignore the results
-        let _ = message;
+        message.result.unwrap();
[CriticalError]
Calling `unwrap()` here will cause a panic when a task is cancelled, as `message.result` will be `Err(TaskError::Aborted)`. The `test_operator_cancellation` test is designed to test cancellation by using a timeout with a long-running `SleepingOperator`, so we expect tasks to be cancelled.
This panic within the handler will be caught, causing `orchestrator.run()` to complete with an error. Consequently, the `timeout` in the test will return `Ok(Err(...))` instead of timing out and returning `Err(_)`. This will trigger the `panic!("Orchestrator should have timed out")` assertion, causing the test to fail.
The previous implementation correctly ignored the result, which is appropriate for tasks that are expected to be cancelled. Was this change intentional?
If you want to assert that the task was indeed aborted, you could do so explicitly (see the sketch after the suggestion below). However, simply ignoring it seems sufficient. I'd recommend reverting to the previous implementation to fix the test.
Suggested change:

-        message.result.unwrap();
+        // We expect these to be cancelled, so we ignore the results
+        let _ = message;
⚡ Committable suggestion
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation.
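If you do want the explicit abort assertion instead of reverting, a minimal sketch of the handler body could look like the following (this assumes `TaskError::Aborted` is a unit variant, as the error value above suggests, and that `TaskError` implements `Debug`; the other names come from the surrounding test):

```rust
// Accept the expected outcomes (cancellation or success) without panicking,
// but still fail loudly on any other error.
match message.result {
    Ok(()) | Err(TaskError::Aborted) => {}
    Err(other) => panic!("unexpected task error: {other:?}"),
}
```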
Description of changes
Summary from Claude, guided by me:
The fix is to make the send operation return an error instead of blocking, which breaks the deadlock. This will fail the task. If this works on staging we'll test it, make it robust, etc.
Test plan
CI
Migration plan
N/A
Observability plan
Watch staging not deadlock.
Documentation Changes
N/A