[BUG] Resolve deadlock in system crate? #5283

Merged · 4 commits into main · Aug 15, 2025

Conversation

rescrv
Contributor

@rescrv commented Aug 15, 2025

Description of changes

Summary from Claude, guided by me:

I've identified a critical deadlock pattern in the dispatcher-worker thread communication system. This is a classic bounded-buffer deadlock (not livelock or starvation).

The Deadlock Pattern:

1. WorkerThread (rust/system/src/execution/worker_thread.rs:58-61):
  - After processing a task, sends TaskRequestMessage to dispatcher
  - This send operation blocks if dispatcher's channel is full
2. Dispatcher (rust/system/src/execution/dispatcher.rs:199-204):
  - When receiving tasks, if no workers are waiting, tries to send task to a worker
  - This send operation blocks if worker's channel is full

Precise Computer Science Classification:

This is a circular wait deadlock with the following characteristics:

- Resource type: Bounded channel buffer space
- Deadlock condition: All four Coffman conditions are met:
  a. Mutual exclusion: Channel slots are exclusively owned
  b. Hold and wait: Worker holds its channel while waiting on dispatcher's channel
  c. No preemption: Messages cannot be forcibly removed from channels
  d. Circular wait: Worker→Dispatcher→Worker circular dependency

Specific Deadlock Scenario:

1. Dispatcher's channel reaches capacity (dispatcher_queue_size limit)
2. Worker completes task and tries to send TaskRequestMessage at line worker_thread.rs:61
3. Worker blocks because dispatcher's channel is full
4. Dispatcher tries to send new task to worker at line dispatcher.rs:199
5. Dispatcher blocks because worker's channel is full (worker_queue_size limit)
6. DEADLOCK: Both components are blocked waiting for each other
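
The cycle can be reproduced outside the crate. Below is a minimal, self-contained sketch (hypothetical names and channel sizes, not the actual dispatcher/worker code) that uses two bounded tokio channels of capacity 1: each side fills the other's channel and then blocks on a further send, so neither ever reaches its receive and the steps above play out.

```rust
use std::time::Duration;
use tokio::sync::mpsc;

#[tokio::main]
async fn main() {
    // Capacity-1 bounded channels standing in for the dispatcher and worker queues.
    let (to_dispatcher, mut dispatcher_rx) = mpsc::channel::<u32>(1);
    let (to_worker, mut worker_rx) = mpsc::channel::<u32>(1);

    // Step 1: both channels are already at capacity.
    to_dispatcher.send(0).await.unwrap();
    to_worker.send(0).await.unwrap();

    // Steps 2-3: the "worker" requests more work and blocks on the full dispatcher channel.
    let worker = tokio::spawn(async move {
        to_dispatcher.send(1).await.unwrap();
        worker_rx.recv().await
    });

    // Steps 4-5: the "dispatcher" hands out a task and blocks on the full worker channel.
    let dispatcher = tokio::spawn(async move {
        to_worker.send(1).await.unwrap();
        dispatcher_rx.recv().await
    });

    // Step 6: neither side ever reaches its recv(), so the pair never completes.
    let both = async { (worker.await, dispatcher.await) };
    assert!(tokio::time::timeout(Duration::from_secs(1), both).await.is_err());
    println!("circular wait: both sends are still blocked after 1s");
}
```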

Critical Code Locations:

- Worker blocking point: rust/system/src/execution/worker_thread.rs:61
- Dispatcher blocking point: rust/system/src/execution/dispatcher.rs:199
- Channel creation: rust/system/src/system.rs:39 (bounded channel with queue_size())
- Queue limits: Configured via DispatcherConfig with dispatcher_queue_size and worker_queue_size

This is not a livelock (no active spinning) or starvation (not a fairness issue), but a true deadlock where progress is impossible once both channels are full and each component is trying to send to the other.

The fix is to make the send return an error instead of blocking, which breaks the deadlock by failing the task. If this works on staging we'll test it further and make it robust.
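
For reference, the shape of the fail-fast send is roughly the sketch below. A `ChannelError::SendError` like the crate's is re-declared here only to keep the sketch self-contained; the helper name `send_or_fail` and the rest are illustrative stand-ins, not the crate's actual API.

```rust
use tokio::sync::mpsc::{error::TrySendError, Sender};

#[derive(Debug)]
pub enum ChannelError {
    SendError,
}

// Illustrative stand-in: try_send returns immediately when the receiver's buffer is
// full, so the caller fails the task instead of blocking and completing the cycle.
pub fn send_or_fail<M>(sender: &Sender<M>, message: M) -> Result<(), ChannelError> {
    sender.try_send(message).map_err(|e| {
        match &e {
            TrySendError::Full(_) => tracing::warn!("channel full; failing task"),
            TrySendError::Closed(_) => tracing::warn!("channel closed; failing task"),
        }
        ChannelError::SendError
    })
}
```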

Test plan

CI

Migration plan

N/A

Observability plan

Watch staging not deadlock.

Documentation Changes

N/A


Reviewer Checklist

Please leverage this checklist to ensure your code review is thorough before approving

Testing, Bugs, Errors, Logs, Documentation

  • Can you think of any use case in which the code does not behave as intended? Have they been tested?
  • Can you think of any inputs or external events that could break the code? Is user input validated and safe? Have they been tested?
  • If appropriate, are there adequate property based tests?
  • If appropriate, are there adequate unit tests?
  • Should any logging, debugging, tracing information be added or removed?
  • Are error messages user-friendly?
  • Have all documentation changes needed been made?
  • Have all non-obvious changes been commented?

System Compatibility

  • Are there any potential impacts on other parts of the system or backward compatibility?
  • Does this change intersect with any items on our roadmap, and if so, is there a plan for fitting them together?

Quality

  • Is this code of an unexpectedly high quality (Readability, Modularity, Intuitiveness)?

@rescrv requested a review from codetheweb August 15, 2025 18:14
Contributor

propel-code-bot bot commented Aug 15, 2025

Fix Bounded-Buffer Deadlock in Dispatcher-Worker Communication

This PR addresses a critical deadlock in the Rust system crate's dispatcher-worker architecture, specifically a bounded-buffer deadlock where both dispatcher and worker channels could become full, causing a circular wait between components. The main fix replaces blocking channel send operations with non-blocking (fail-fast) behavior, causing tasks to fail if channel buffers are full instead of waiting indefinitely. Additional tweaks in companion modules and tests reflect these changes, and include minor test logic adjustments to accommodate the new failure behavior.

Key Changes

• Replaced blocking .send().await with non-blocking .try_send() for message passing in rust/system/src/types.rs to prevent deadlocks when channel buffers are full.
• Adjusted field visibility of TaskResult in rust/system/src/execution/operator.rs to pub(crate) for use in error-handling code.
• Small documentation and logic tweaks in test orchestrator logic (rust/system/src/execution/orchestrator.rs) to correctly check results and handle test operator configuration.
• Refined scheduler comment in rust/system/src/scheduler.rs to clarify locking semantics, but no logic change.
• Unit tests updated to reflect new non-blocking send/fail-fast behavior.

Affected Areas

• Component channel delivery logic (types.rs)
• Task result data structure (operator.rs)
• Executor and orchestrator test logic (orchestrator.rs)
• Scheduler internal documentation (scheduler.rs)

This summary was automatically generated by @propel-code-bot

Comment on lines 178 to +179

```diff
 self.sender
-    .send(WrappedMessage::new(message, None, tracing_context))
-    .await
+    .try_send(WrappedMessage::new(message, None, tracing_context))
```

[BestPractice]

The change from send() to try_send() effectively breaks the deadlock by making the dispatcher fail fast instead of blocking when worker channels are full. However, this approach has some concerns:

  1. Silent task failure: Failed sends will only return ChannelError::SendError without specific context about which task failed or why
  2. No retry mechanism: Tasks that fail to send will be permanently lost
  3. Resource waste: The spawned tokio task and associated resources are consumed even when the message cannot be delivered

Consider adding logging when try_send fails to help with observability:

Suggested change

```diff
 self.sender
-    .send(WrappedMessage::new(message, None, tracing_context))
-    .await
-    .try_send(WrappedMessage::new(message, None, tracing_context))
+    .try_send(WrappedMessage::new(message, None, tracing_context))
+    .map_err(|e| {
+        tracing::warn!("Failed to send scheduled message: {:?}", e);
+        ChannelError::SendError
+    })
```

This aligns with your stated plan to "watch staging not deadlock" by providing visibility into when the deadlock prevention mechanism activates.


Comment on lines +79 to +80

```rust
    pub(crate) result: Result<Output, TaskError<Error>>,
    pub(crate) task_id: Uuid,
```

[BestPractice]

Adding pub(crate) visibility to the TaskResult fields enables the deadlock fix by allowing the dispatcher to access task results directly. However, this breaks encapsulation. Consider:

  1. Adding a constructor method that takes the result components
  2. Adding getter methods for controlled access to these fields
  3. Documenting why these fields need crate-level visibility

Example:

```rust
impl<Output, Error> TaskResult<Output, Error> {
    pub(crate) fn new(result: Result<Output, TaskError<Error>>, task_id: Uuid) -> Self {
        Self { result, task_id }
    }

    pub(crate) fn result(&self) -> &Result<Output, TaskError<Error>> {
        &self.result
    }
}
```

```diff
@@ -294,8 +294,7 @@ mod tests {
             message: TaskResult<(), TestError>,
             _ctx: &ComponentContext<Self>,
         ) -> Self::Result {
-            // We expect these to be cancelled, so we ignore the results
-            let _ = message;
+            message.result.unwrap();
```

[CriticalError]

Calling unwrap() here will cause a panic when a task is cancelled, as message.result will be Err(TaskError::Aborted). The test_operator_cancellation test is designed to test cancellation by using a timeout with a long-running SleepingOperator, so we expect tasks to be cancelled.

This panic within the handler will be caught, causing orchestrator.run() to complete with an error. Consequently, the timeout in the test will return an Ok(Err(...)) instead of timing out and returning Err(_). This will trigger the panic!("Orchestrator should have timed out") assertion, causing the test to fail.

The previous implementation correctly ignored the result, which is appropriate for tasks that are expected to be cancelled. Was this change intentional?

If you want to assert that the task was indeed aborted, you could do so explicitly. However, simply ignoring it seems sufficient. I'd recommend reverting to the previous implementation to fix the test.
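
If an explicit assertion were preferred instead, a fragment along these lines could replace the `unwrap()` in the handler body (hypothetical, and it assumes `TaskError::Aborted` is a plain unit variant as described above):

```rust
// Hypothetical fragment for the handler: assert the task was cancelled rather than
// unwrapping the result (assumes TaskError::Aborted carries no payload).
assert!(
    matches!(message.result, Err(TaskError::Aborted)),
    "expected the sleeping task to be aborted by the timeout"
);
```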

Suggested change

```diff
-            message.result.unwrap();
+            // We expect these to be cancelled, so we ignore the results
+            let _ = message;
```


@blacksmith-sh bot deleted a comment from rescrv Aug 15, 2025
@rescrv merged commit be5873d into main Aug 15, 2025
115 of 117 checks passed
rescrv added a commit that referenced this pull request Aug 15, 2025