Fix multi-agent bridge bugs and watchdog timeout for workers #195
Merged
…top IsProcessing race

Three fixes for mobile bridge reliability:

1. WsBridgeServer: Fire-and-forget SendPromptAsync in the SendMessage handler. The handler was awaiting ResponseCompletion, which blocks for the entire response duration (minutes), preventing abort/switch/new messages from being processed by the per-client WebSocket read loop.
2. CopilotService.Bridge: On TurnEnd, request fresh history before clearing the streaming guard. Previously, removing the session from _remoteStreamingSessions immediately allowed SyncRemoteSessions to overwrite incrementally-built history with a stale SessionHistories cache, losing the last message.
3. CopilotService.Bridge: Skip IsProcessing updates from SessionsList for sessions that are actively streaming. The periodic sessions list could race with event-driven TurnStart/TurnEnd, causing stop button flicker.

Also fixes: the ParseTaskAssignments regex now captures worker names with spaces (e.g. 'PR Review Squad-worker-1') instead of only the first word.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
OnStateChanged only broadcast SessionsList, not OrganizationState. This left mobile with stale group assignments: sessions moved between groups on desktop wouldn't update on mobile until a specific org-triggering operation occurred.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
CreateGroupFromPresetAsync had role/group/model assignment inside the same try block as CreateSessionAsync. If the session already existed (e.g. when recreating the same Squad team), CreateSessionAsync threw, so the orchestrator lost its Orchestrator role and workers lost their group assignment and system prompts. Move assignment outside the try so it runs regardless of whether session creation succeeded or was skipped.

Also adds 3 tests for ParseTaskAssignments with worker names containing spaces (the regex fix from the prior commit).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Workers doing text-heavy tasks (e.g., PR reviews) can take 2-4 minutes without tool calls. The 120s inactivity timeout was killing workers prematurely: the watchdog cleared IsProcessing and added a 'stuck' warning, then the actual response arrived but CompleteResponse skipped it because IsProcessing was already false, so the response was lost.

Now sessions in multi-agent groups use the 600s tool-execution timeout. The multi-agent flag is cached on SessionState at send time (UI thread) so the watchdog can read it safely from its background thread without accessing the Organization lists (plain List<T>, UI-thread-only).

The orchestration loop already has its own 10-minute per-worker timeout via CancelAfter, so the watchdog is a safety net, not the primary guard.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
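The timeout selection described in the last commit can be sketched as follows. This is a minimal Python illustration, not the PR's C# code: only the 120s and 600s values and the idea of caching the multi-agent flag on SessionState at send time come from the commit message. The function names, the `grouped_sessions` set, and the session name "scratch" are hypothetical, and the sketch ignores the separate case where a tool call is in progress.

```python
from dataclasses import dataclass

INACTIVITY_TIMEOUT_S = 120      # default inactivity timeout named in the commit
TOOL_EXECUTION_TIMEOUT_S = 600  # longer timeout, now also used for multi-agent sessions


@dataclass
class SessionState:
    is_processing: bool = False
    # Cached copy of "is this session in a multi-agent group?" so the
    # watchdog never has to read the UI-thread-only organization lists.
    is_multi_agent_session: bool = False


def on_send(state: SessionState, session_name: str, grouped_sessions: set[str]) -> None:
    """Hypothetical send hook: runs on the UI thread, where reading group membership is safe."""
    state.is_multi_agent_session = session_name in grouped_sessions
    state.is_processing = True


def watchdog_timeout_s(state: SessionState) -> int:
    """Pick the timeout using only the cached flag (safe to call from a background thread)."""
    return TOOL_EXECUTION_TIMEOUT_S if state.is_multi_agent_session else INACTIVITY_TIMEOUT_S


def watchdog_should_fire(state: SessionState, idle_s: float) -> bool:
    """True when a processing session has been idle for at least its timeout."""
    return state.is_processing and idle_s >= watchdog_timeout_s(state)


# A worker in a multi-agent group and a standalone session, both idle for 3 minutes.
worker = SessionState()
on_send(worker, "PR Review Squad-worker-1", {"PR Review Squad-worker-1"})
solo = SessionState()
on_send(solo, "scratch", {"PR Review Squad-worker-1"})
```

Under these assumptions, a worker that has been silent for 180 seconds is left alone, while a standalone session with the same idle time would still trip the watchdog.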
Summary
Fixes 6 bugs discovered while testing multi-agent PR review orchestration on desktop and mobile.
Bridge fixes (mobile)
- SendMessage handler was awaiting SendPromptAsync (blocks until full response), preventing all other client messages. Now fire-and-forget via Task.Run.
- SyncRemoteSessions was overwriting incrementally-built streaming history with stale cache on TurnEnd. Now requests fresh history before clearing the streaming guard.
- SyncRemoteSessions unconditionally overwrote IsProcessing from the periodic sessions list, racing with event-driven TurnStart/TurnEnd. Now skips processing state updates for actively streaming sessions.
- OnStateChanged only sent SessionsList, not OrganizationState, so mobile never saw group/role changes.

Multi-agent orchestration fixes
- ParseTaskAssignments regex (\S+) only captured the first word of worker names with spaces (e.g. "PR Review Squad-worker-1"). Changed to ([^\n]+?).
- CreateGroupFromPresetAsync did role/group/model assignment inside the same try block as CreateSessionAsync, so recreating an existing Squad skipped all assignments. Assignment now runs outside the try.

Watchdog timeout fix
- The 120s inactivity timeout was killing multi-agent workers prematurely; when the real response arrived, CompleteResponse skipped it because IsProcessing was already false. Now caches IsMultiAgentSession on SessionState at send time (thread-safe) and uses the 600s timeout.

Tests
- 3 tests for ParseTaskAssignments with worker names containing spaces.
Review
Fix reviewed by Opus 4.6, Sonnet 4.5, and GPT-5.2; all agreed on the thread-safety fix (caching the multi-agent flag at send time rather than reading Organization lists from a background thread).
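To illustrate the ParseTaskAssignments regex change from the summary, here is a small Python sketch comparing the two capture groups. Only the groups (\S+) and ([^\n]+?) and the example worker name come from the PR. The "@worker:" prefix, the end-of-line anchor, and the sample text are assumptions made for the demo, since the full pattern is not shown in the source, and the PR's code uses .NET regexes rather than Python's `re`.

```python
import re

# Hypothetical assignment format (not from the PR): "@worker:<name>" on its own line.
OLD_PATTERN = re.compile(r"^@worker:(\S+)", re.MULTILINE)
# The lazy group needs something after it to stop at; here it is the end of the line.
NEW_PATTERN = re.compile(r"^@worker:([^\n]+?)[ \t]*$", re.MULTILINE)

text = "@worker:PR Review Squad-worker-1\nReview the bridge changes.\n"

# \S+ stops at the first whitespace, so only the first word is captured.
old_name = OLD_PATTERN.search(text).group(1)
# [^\n]+? can cross spaces but never a newline, so the whole name is captured.
new_name = NEW_PATTERN.search(text).group(1)

print(repr(old_name))
print(repr(new_name))
```

With the old group the captured name is just "PR", which would not match any worker; with the new group the full "PR Review Squad-worker-1" is captured.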