Fix multi-agent bridge bugs and watchdog timeout for workers#195

Merged
PureWeen merged 4 commits into main from MultiReviewTest on Feb 23, 2026
@PureWeen
Owner

Summary

Fixes 7 bugs discovered while testing multi-agent PR review orchestration on desktop and mobile.

Bridge fixes (mobile)

  • Unblock WebSocket message loop: The SendMessage handler awaited SendPromptAsync, which blocks until the full response completes, so no other client messages could be processed in the meantime. The call is now fire-and-forget via Task.Run.
  • Prevent history overwrite: On TurnEnd, SyncRemoteSessions overwrote the incrementally-built streaming history with a stale cache. The bridge now requests fresh history before clearing the streaming guard.
  • Stop IsProcessing race: SyncRemoteSessions unconditionally overwrote IsProcessing from the periodic sessions list, racing with event-driven TurnStart/TurnEnd. It now skips processing-state updates for actively streaming sessions.
  • Broadcast organization state: OnStateChanged only sent SessionsList, not OrganizationState, so mobile never saw group/role changes. It now sends both.
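
The first bridge fix can be illustrated with a minimal sketch. This is Python with asyncio rather than the project's .NET code, and every name in it (`send_prompt`, `read_loop`, the `"send"`/`"abort"` message kinds) is a hypothetical stand-in, not the actual WsBridgeServer API. It only shows the general idea: the read loop schedules the long-running prompt instead of awaiting it, so a later message is handled before the response finishes.

```python
import asyncio


async def send_prompt(text: str, log: list[str]) -> None:
    # Stand-in for SendPromptAsync: returns only once the full response is done.
    await asyncio.sleep(0.05)
    log.append(f"response:{text}")


async def read_loop(messages: list[tuple[str, str]], log: list[str]) -> None:
    # Hypothetical per-client read loop; message kinds are illustrative only.
    pending: set[asyncio.Task] = set()
    for kind, payload in messages:
        if kind == "send":
            # Fire-and-forget: schedule the prompt and keep reading messages.
            task = asyncio.create_task(send_prompt(payload, log))
            pending.add(task)  # keep a reference so the task is not collected
            task.add_done_callback(pending.discard)
        else:
            log.append(f"handled:{kind}")
    # Demo only: wait for outstanding prompts so the script exits cleanly.
    await asyncio.gather(*pending)


def run_demo() -> list[str]:
    log: list[str] = []
    asyncio.run(read_loop([("send", "review PR"), ("abort", "")], log))
    return log
```

In this sketch the abort message is logged before the response, which is the behavior the fix is after. Awaiting `send_prompt` directly inside the loop would reverse that order.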

Multi-agent orchestration fixes

  • Worker dispatch regex: ParseTaskAssignments regex (\S+) only captured first word of worker names with spaces (e.g. "PR Review Squad-worker-1"). Changed to ([^\n]+?).
  • Preset re-creation: Role/group/model assignment was inside the same try block as CreateSessionAsync, so recreating an existing Squad skipped all assignments.
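
The regex change can be sketched as follows. Only the two capture groups, `(\S+)` and `([^\n]+?)`, come from this PR. The surrounding pattern, the `@worker:` and `@end` delimiters, and the `parse` helper are assumptions made for illustration, and the sketch uses Python's `re` module rather than .NET's Regex.

```python
import re

# Hypothetical assignment block; the real orchestrator format may differ.
TEXT = (
    "@worker:PR Review Squad-worker-1\n"
    "Review the diff for thread-safety issues.\n"
    "@end\n"
)

# Old capture group: stops at the first whitespace in the worker name.
OLD = re.compile(r"@worker:(\S+)\s*(.*?)@end", re.DOTALL)
# New capture group: takes everything up to the end of the line.
NEW = re.compile(r"@worker:([^\n]+?)[ \t]*\n(.*?)@end", re.DOTALL)


def parse(pattern: re.Pattern, text: str) -> list[tuple[str, str]]:
    # Return (worker name, task text) pairs found in the text.
    return [(m.group(1), m.group(2).strip()) for m in pattern.finditer(text)]
```

With the old group, the name comes back as just `PR` and the rest of the name leaks into the task text. With the new group, the full name `PR Review Squad-worker-1` is captured.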

Watchdog timeout fix

  • Multi-agent workers killed prematurely: The 120s inactivity timeout fired before text-heavy workers (PR reviews, no tools) completed. Their responses were lost because CompleteResponse skips its work when IsProcessing is already false. The fix caches IsMultiAgentSession on SessionState at send time, so the watchdog can read it thread-safely, and applies the 600s timeout to multi-agent sessions.

Tests

  • 6 new tests for worker name parsing and multi-agent watchdog behavior
  • All 1,187 tests pass

Review

Fix reviewed by Opus 4.6, Sonnet 4.5, and GPT-5.2 — all agreed on the thread-safety fix (cache multi-agent flag at send time vs. reading Organization lists from background thread).

PureWeen and others added 4 commits February 23, 2026 03:32
…top IsProcessing race

Three fixes for mobile bridge reliability:

1. WsBridgeServer: Fire-and-forget SendPromptAsync in SendMessage handler.
   The handler was awaiting ResponseCompletion which blocks for the entire
   response duration (minutes), preventing abort/switch/new messages from
   being processed by the per-client WebSocket read loop.

2. CopilotService.Bridge: On TurnEnd, request fresh history before clearing
   the streaming guard. Previously, removing from _remoteStreamingSessions
   immediately allowed SyncRemoteSessions to overwrite incrementally-built
   history with a stale SessionHistories cache, losing the last message.

3. CopilotService.Bridge: Skip IsProcessing updates from SessionsList for
   sessions that are actively streaming. The periodic sessions list could
   race with event-driven TurnStart/TurnEnd, causing stop button flicker.

Also fixes: ParseTaskAssignments regex now captures worker names with spaces
(e.g. 'PR Review Squad-worker-1') instead of only the first word.
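
Fixes 2 and 3 above can be sketched together. This is a simplified, synchronous Python model, not the CopilotService.Bridge code: the `Session` class, both function signatures, and the `fetch_history` callback are hypothetical, and the real history request is asynchronous. The sketch shows the ordering that matters: the periodic sync leaves streaming sessions alone, and TurnEnd installs fresh history before dropping the guard.

```python
from dataclasses import dataclass, field


@dataclass
class Session:
    name: str
    is_processing: bool = False
    history: list[str] = field(default_factory=list)


def sync_remote_sessions(local, snapshot, cached_histories, streaming):
    """Apply a periodic sessions-list snapshot, skipping streaming sessions."""
    for name, remote_processing in snapshot.items():
        if name in streaming:
            # Turn events own this session's state while it streams.
            continue
        session = local[name]
        session.is_processing = remote_processing
        if name in cached_histories:
            session.history = list(cached_histories[name])


def on_turn_end(local, name, fetch_history, streaming):
    """Install fresh history first, then clear the streaming guard."""
    session = local[name]
    session.history = list(fetch_history(name))
    session.is_processing = False
    streaming.discard(name)
```

Clearing the guard before the fresh history arrives would reopen the window in which a stale cache can replace the streamed messages.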

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

OnStateChanged only broadcast SessionsList, not OrganizationState.
This caused mobile to have stale group assignments — sessions moved
between groups on desktop wouldn't update on mobile until a specific
org-triggering operation occurred.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

CreateGroupFromPresetAsync had role/group/model assignment inside the
same try block as CreateSessionAsync. If the session already existed
(e.g. recreating the same Squad team), CreateSessionAsync threw and
the orchestrator lost its Orchestrator role, workers lost their group
assignment and system prompts.

Move assignment outside the try so it runs regardless of whether
session creation succeeded or was skipped.

Also adds 3 tests for ParseTaskAssignments with worker names containing
spaces (the regex fix from the prior commit).
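
The restructuring described above can be sketched in Python. The names here (`SessionExistsError`, `create_session`, `create_from_preset`, and the dictionary used as a session store) are hypothetical and stand in for CreateSessionAsync and CreateGroupFromPresetAsync, whose real signatures are not shown in this PR.

```python
class SessionExistsError(Exception):
    """Raised when a session with the requested name already exists."""


def create_session(sessions: dict, name: str) -> None:
    # Stand-in for CreateSessionAsync: fails if the session is already there.
    if name in sessions:
        raise SessionExistsError(name)
    sessions[name] = {}


def create_from_preset(sessions: dict, name: str, role: str, group: str) -> None:
    try:
        create_session(sessions, name)
    except SessionExistsError:
        pass  # Session already exists; reuse it.
    # Outside the try: runs whether the session was created or already existed.
    sessions[name].update(role=role, group=group)
```

If the `update` call sat inside the `try`, an existing session would skip it and keep whatever role and group it had before.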

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Workers doing text-heavy tasks (e.g., PR reviews) can take 2-4 minutes
without tool calls. The 120s inactivity timeout was killing workers
prematurely — the watchdog cleared IsProcessing and added a 'stuck'
warning, then the actual response arrived but CompleteResponse skipped
because IsProcessing was already false, losing the response.

Now sessions in multi-agent groups use the 600s tool-execution timeout.
The multi-agent flag is cached on SessionState at send time (UI thread)
so the watchdog can read it safely from its background thread without
accessing the Organization lists (plain List<T>, UI-thread-only).

The orchestration loop already has its own 10-minute per-worker timeout
via CancelAfter, so the watchdog is a safety net, not the primary guard.
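
A rough Python sketch of the timeout selection follows. The 120s and 600s values and the idea of caching the multi-agent flag at send time come from this commit message. The field names, the `has_active_tool` flag, and the function signatures are assumptions, and real thread marshalling is omitted.

```python
from dataclasses import dataclass

INACTIVITY_TIMEOUT_S = 120      # default inactivity timeout
TOOL_EXECUTION_TIMEOUT_S = 600  # longer timeout, now also used for workers


@dataclass
class SessionState:
    is_processing: bool = False
    is_multi_agent_session: bool = False  # cached at send time
    has_active_tool: bool = False         # hypothetical flag for illustration


def begin_send(state: SessionState, name: str, multi_agent_members: set) -> None:
    # Runs on the UI thread, the only place group membership may be read.
    state.is_multi_agent_session = name in multi_agent_members
    state.is_processing = True


def watchdog_timeout(state: SessionState) -> int:
    # Runs on the background thread: reads only the cached fields.
    if state.has_active_tool or state.is_multi_agent_session:
        return TOOL_EXECUTION_TIMEOUT_S
    return INACTIVITY_TIMEOUT_S


def is_stuck(state: SessionState, idle_seconds: float) -> bool:
    return state.is_processing and idle_seconds >= watchdog_timeout(state)
```

Under these assumptions, a worker that has been idle for 180 seconds is no longer flagged as stuck, while a session outside any multi-agent group still is.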

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@PureWeen PureWeen merged commit e2d6c70 into main Feb 23, 2026
@PureWeen PureWeen deleted the MultiReviewTest branch February 23, 2026 15:08