[BUG] Resolve deadlock in system crate? #5283

Merged · 4 commits into main · Aug 15, 2025

Conversation

rescrv
Contributor

@rescrv commented Aug 15, 2025

Description of changes

Summary from Claude, guided by me:

I've identified a critical deadlock pattern in the dispatcher-worker thread communication system. This is a classic bounded-buffer deadlock (not livelock or starvation).

The Deadlock Pattern:

1. WorkerThread (rust/system/src/execution/worker_thread.rs:58-61):
  - After processing a task, sends TaskRequestMessage to dispatcher
  - This send operation blocks if dispatcher's channel is full
2. Dispatcher (rust/system/src/execution/dispatcher.rs:199-204):
  - When receiving tasks, if no workers are waiting, tries to send task to a worker
  - This send operation blocks if worker's channel is full

Precise Computer Science Classification:

This is a circular wait deadlock with the following characteristics:

- Resource type: Bounded channel buffer space
- Deadlock condition: All four Coffman conditions are met:
  a. Mutual exclusion: Channel slots are exclusively owned
  b. Hold and wait: Worker holds its channel while waiting on dispatcher's channel
  c. No preemption: Messages cannot be forcibly removed from channels
  d. Circular wait: Worker→Dispatcher→Worker circular dependency

Specific Deadlock Scenario:

1. Dispatcher's channel reaches capacity (dispatcher_queue_size limit)
2. Worker completes task and tries to send TaskRequestMessage at line worker_thread.rs:61
3. Worker blocks because dispatcher's channel is full
4. Dispatcher tries to send new task to worker at line dispatcher.rs:199
5. Dispatcher blocks because worker's channel is full (worker_queue_size limit)
6. DEADLOCK: Both components are blocked waiting for each other
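
The cycle can be reproduced outside the crate. Below is a minimal, self-contained sketch (hypothetical names and channel sizes, not the actual dispatcher/worker code) that uses two bounded tokio channels of capacity 1: each side fills the other's channel and then blocks on a further send, so neither ever reaches its receive and the steps above play out.

```rust
use std::time::Duration;
use tokio::sync::mpsc;

#[tokio::main]
async fn main() {
    // Capacity-1 bounded channels standing in for the dispatcher and worker queues.
    let (to_dispatcher, mut dispatcher_rx) = mpsc::channel::<u32>(1);
    let (to_worker, mut worker_rx) = mpsc::channel::<u32>(1);

    // Step 1: both channels are already at capacity.
    to_dispatcher.send(0).await.unwrap();
    to_worker.send(0).await.unwrap();

    // Steps 2-3: the "worker" requests more work and blocks on the full dispatcher channel.
    let worker = tokio::spawn(async move {
        to_dispatcher.send(1).await.unwrap();
        worker_rx.recv().await
    });

    // Steps 4-5: the "dispatcher" hands out a task and blocks on the full worker channel.
    let dispatcher = tokio::spawn(async move {
        to_worker.send(1).await.unwrap();
        dispatcher_rx.recv().await
    });

    // Step 6: neither side ever reaches its recv(), so the pair never completes.
    let both = async { (worker.await, dispatcher.await) };
    assert!(tokio::time::timeout(Duration::from_secs(1), both).await.is_err());
    println!("circular wait: both sends are still blocked after 1s");
}
```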

Critical Code Locations:

- Worker blocking point: rust/system/src/execution/worker_thread.rs:61
- Dispatcher blocking point: rust/system/src/execution/dispatcher.rs:199
- Channel creation: rust/system/src/system.rs:39 (bounded channel with queue_size())
- Queue limits: Configured via DispatcherConfig with dispatcher_queue_size and worker_queue_size

This is not a livelock (no active spinning) or starvation (not a fairness issue), but a true deadlock where progress is impossible once both channels are full and each component is trying to send to the other.

The fix is to make the send return an error instead of blocking, which breaks the deadlock by failing the task. If this works on staging we'll test it further and make it robust.
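
For reference, the shape of the fail-fast send is roughly the sketch below. A `ChannelError::SendError` like the crate's is re-declared here only to keep the sketch self-contained; the helper name `send_or_fail` and the rest are illustrative stand-ins, not the crate's actual API.

```rust
use tokio::sync::mpsc::{error::TrySendError, Sender};

#[derive(Debug)]
pub enum ChannelError {
    SendError,
}

// Illustrative stand-in: try_send returns immediately when the receiver's buffer is
// full, so the caller fails the task instead of blocking and completing the cycle.
pub fn send_or_fail<M>(sender: &Sender<M>, message: M) -> Result<(), ChannelError> {
    sender.try_send(message).map_err(|e| {
        match &e {
            TrySendError::Full(_) => tracing::warn!("channel full; failing task"),
            TrySendError::Closed(_) => tracing::warn!("channel closed; failing task"),
        }
        ChannelError::SendError
    })
}
```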

Test plan

CI

Migration plan

N/A

Observability plan

Watch staging not deadlock.

Documentation Changes

N/A


Reviewer Checklist

Please leverage this checklist to ensure your code review is thorough before approving

Testing, Bugs, Errors, Logs, Documentation

  • Can you think of any use case in which the code does not behave as intended? Have they been tested?
  • Can you think of any inputs or external events that could break the code? Is user input validated and safe? Have they been tested?
  • If appropriate, are there adequate property based tests?
  • If appropriate, are there adequate unit tests?
  • Should any logging, debugging, tracing information be added or removed?
  • Are error messages user-friendly?
  • Have all documentation changes needed been made?
  • Have all non-obvious changes been commented?

System Compatibility

  • Are there any potential impacts on other parts of the system or backward compatibility?
  • Does this change intersect with any items on our roadmap, and if so, is there a plan for fitting them together?

Quality

  • Is this code of an unexpectedly high quality (Readability, Modularity, Intuitiveness)?

@rescrv requested a review from codetheweb August 15, 2025 18:14
Contributor

propel-code-bot bot commented Aug 15, 2025

Fix Bounded-Buffer Deadlock in Dispatcher-Worker Communication

This PR addresses a critical deadlock in the Rust system crate's dispatcher-worker architecture, specifically a bounded-buffer deadlock where both dispatcher and worker channels could become full, causing a circular wait between components. The main fix replaces blocking channel send operations with non-blocking (fail-fast) behavior, causing tasks to fail if channel buffers are full instead of waiting indefinitely. Additional tweaks in companion modules and tests reflect these changes, and include minor test logic adjustments to accommodate the new failure behavior.

Key Changes

• Replaced blocking .send().await with non-blocking .try_send() for message passing in rust/system/src/types.rs to prevent deadlocks when channel buffers are full.
• Adjusted field visibility of TaskResult in rust/system/src/execution/operator.rs to pub(crate) for use in error-handling code.
• Small documentation and logic tweaks in test orchestrator logic (rust/system/src/execution/orchestrator.rs) to correctly check results and handle test operator configuration.
• Refined scheduler comment in rust/system/src/scheduler.rs to clarify locking semantics, but no logic change.
• Unit tests updated to reflect new non-blocking send/fail-fast behavior.

Affected Areas

• Component channel delivery logic (types.rs)
• Task result data structure (operator.rs)
• Executor and orchestrator test logic (orchestrator.rs)
• Scheduler internal documentation (scheduler.rs)

This summary was automatically generated by @propel-code-bot

Comment on lines 178 to +179

```diff
 self.sender
-    .send(WrappedMessage::new(message, None, tracing_context))
-    .await
+    .try_send(WrappedMessage::new(message, None, tracing_context))
```

[BestPractice]

The change from send() to try_send() effectively breaks the deadlock by making the dispatcher fail fast instead of blocking when worker channels are full. However, this approach has some concerns:

  1. Silent task failure: Failed sends will only return ChannelError::SendError without specific context about which task failed or why
  2. No retry mechanism: Tasks that fail to send will be permanently lost
  3. Resource waste: The spawned tokio task and associated resources are consumed even when the message cannot be delivered

Consider adding logging when try_send fails to help with observability:

Suggested change

```diff
 self.sender
-    .send(WrappedMessage::new(message, None, tracing_context))
-    .await
-    .try_send(WrappedMessage::new(message, None, tracing_context))
+    .try_send(WrappedMessage::new(message, None, tracing_context))
+    .map_err(|e| {
+        tracing::warn!("Failed to send scheduled message: {:?}", e);
+        ChannelError::SendError
+    })
```

This aligns with your stated plan to "watch staging not deadlock" by providing visibility into when the deadlock prevention mechanism activates.


Comment on lines +79 to +80

```rust
    pub(crate) result: Result<Output, TaskError<Error>>,
    pub(crate) task_id: Uuid,
```

[BestPractice]

Adding pub(crate) visibility to the TaskResult fields enables the deadlock fix by allowing the dispatcher to access task results directly. However, this breaks encapsulation. Consider:

  1. Adding a constructor method that takes the result components
  2. Adding getter methods for controlled access to these fields
  3. Documenting why these fields need crate-level visibility

Example:

```rust
impl<Output, Error> TaskResult<Output, Error> {
    pub(crate) fn new(result: Result<Output, TaskError<Error>>, task_id: Uuid) -> Self {
        Self { result, task_id }
    }

    pub(crate) fn result(&self) -> &Result<Output, TaskError<Error>> {
        &self.result
    }
}
```

```diff
@@ -294,8 +294,7 @@ mod tests {
             message: TaskResult<(), TestError>,
             _ctx: &ComponentContext<Self>,
         ) -> Self::Result {
-            // We expect these to be cancelled, so we ignore the results
-            let _ = message;
+            message.result.unwrap();
```

[CriticalError]

Calling unwrap() here will cause a panic when a task is cancelled, as message.result will be Err(TaskError::Aborted). The test_operator_cancellation test is designed to test cancellation by using a timeout with a long-running SleepingOperator, so we expect tasks to be cancelled.

This panic within the handler will be caught, causing orchestrator.run() to complete with an error. Consequently, the timeout in the test will return an Ok(Err(...)) instead of timing out and returning Err(_). This will trigger the panic!("Orchestrator should have timed out") assertion, causing the test to fail.

The previous implementation correctly ignored the result, which is appropriate for tasks that are expected to be cancelled. Was this change intentional?

If you want to assert that the task was indeed aborted, you could do so explicitly. However, simply ignoring it seems sufficient. I'd recommend reverting to the previous implementation to fix the test.
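
If an explicit assertion were preferred instead, a fragment along these lines could replace the `unwrap()` in the handler body (hypothetical, and it assumes `TaskError::Aborted` is a plain unit variant as described above):

```rust
// Hypothetical fragment for the handler: assert the task was cancelled rather than
// unwrapping the result (assumes TaskError::Aborted carries no payload).
assert!(
    matches!(message.result, Err(TaskError::Aborted)),
    "expected the sleeping task to be aborted by the timeout"
);
```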

Suggested change

```diff
-            message.result.unwrap();
+            // We expect these to be cancelled, so we ignore the results
+            let _ = message;
```


@blacksmith-sh bot deleted a comment from rescrv Aug 15, 2025
@rescrv merged commit be5873d into main Aug 15, 2025
115 of 117 checks passed
rescrv added a commit that referenced this pull request Aug 15, 2025