Fix stuck sessions: watchdog recovery, model labels, and thread safety#148
Merged
Fix stuck sessions: watchdog recovery, model labels, and thread safety#148
Conversation
Add ProcessingStartedAt timestamp to AgentSessionInfo. Display elapsed time (e.g. '(45s)', '(2m 30s)') next to the Thinking indicator so users can see the session is alive during long tool executions. Only shows after 5 seconds to avoid flicker on quick responses. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Replace Blazor-render-dependent elapsed label with a global JS setInterval that ticks every second. Uses data-start attribute on .live-timer spans so the timer keeps counting even when the CLI goes silent and no Blazor re-renders occur. One global timer handles all sessions with zero re-render cost. Also wire ProcessingStartedAt to SessionCard (compact grid view) which was missing the parameter. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add Model property to ChatMessage, set from state.Info.Model in CompleteResponse and abort flush. Display as '10:30 AM · gpt-4.1' next to the timestamp on each assistant message. Remove the elapsed timer (JS setInterval, data-start attributes, GetElapsedLabel) as it provided no actionable information. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Set Model on FlushCurrentResponse assistant messages (mid-turn flush before tool calls was missing it) - Set Model on bridge OnTurnEnd for remote mode messages - Add Model column to ChatMessageEntity and map in both directions so model name survives DB round-trips across app restarts - Remove dead ProcessingStartedAt property (unused after timer removal) - Add 4 tests for ChatMessage.Model property Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add null-coalescing fallbacks for JsonElement.GetString() calls on Command, Cwd, and Type properties to silence CS8601 warnings. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Marshal IsProcessing mutation and ResponseCompletion.TrySetException to the UI thread via InvokeOnUI(), matching the pattern used by the processing watchdog. Combines the error notification, state mutation, and state-changed callback into a single UI-thread dispatch to prevent race conditions with Blazor rendering. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Change system message to actionable warning with resend hint - Add LastPrompt field to ActiveSessionEntry for persistence - Save last user message when session is processing during app exit - Show truncated last prompt in recovery message so user knows what was lost - Fire OnError to show error banner on interrupted turn recovery - Add [INTERRUPTED] debug tag for diagnostic traceability Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
6 new tests (670 total): - ActiveSessionEntry_LastPrompt_RoundTrips - ActiveSessionEntry_LastPrompt_NullByDefault - MergeSessionEntries_PreservesLastPrompt - InterruptedTurn_SystemMessage_ContainsWarning - InterruptedTurn_SystemMessage_IncludesLastPrompt - InterruptedTurn_SystemMessage_TruncatesLongPrompt Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The 10-second hardcoded timeout in ResumeSessionAsync was prematurely clearing IsProcessing on sessions that were still actively working. Tool calls (dotnet build, git push, etc.) can easily go 30-60 seconds between events, causing the resume logic to declare the turn dead. Changes: - Remove the 10-second resume timeout entirely — the processing watchdog (120s inactivity / 600s tool execution) already handles stuck sessions properly - Move event handler subscription (copilotSession.On) BEFORE the watchdog setup to fix a race where events arriving immediately after SDK resume were missed because the handler wasn't wired yet Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The processing watchdog was incorrectly using the 120s inactivity timeout even when the session was actively running multi-turn tool calls. This happened because AssistantTurnStartEvent resets ActiveToolCallCount to 0 between tool rounds, making the model's 'thinking' gap between tools look like inactivity. Added HasUsedToolsThisTurn flag that stays true for the entire processing cycle once any tool executes. The watchdog now uses the 600s tool timeout when: a tool is actively running (hasActiveTool), the session was resumed mid-turn (IsResumed), or tools have been used this turn (HasUsedToolsThisTurn). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add 7 tests to ProcessingWatchdogTests covering the 3-way condition that selects between inactivity (120s) and tool execution (600s) timeouts: - HasUsedToolsThisTurn: defaults false, can be set, reset by CompleteResponse - Timeout selection: no tools → 120s, active tool → 600s, resumed → 600s, has-used-tools → 600s SessionState is private, so tests replicate the watchdog's decision logic inline using local variables matching CopilotService.Events.cs. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…tterns Add to copilot-instructions.md: - Processing Watchdog section (120s/600s timeout tiers, 3-way condition) - Diagnostic Log Tags reference ([SEND], [COMPLETE], [ERROR], etc.) - Thread Safety: IsProcessing Mutations (InvokeOnUI pattern) - LastPrompt in session persistence - 11 missing test files in Test Coverage section Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ompleteResponse FlushCurrentResponse was creating ghost messages containing only newlines when the CLI sent formatting whitespace before tool calls. Changed both FlushCurrentResponse and CompleteResponse to use IsNullOrWhiteSpace instead of IsNullOrEmpty. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…med scoping 1. Wire lastPrompt into reconnect system message when session was mid-turn — shows truncated last user message so user knows what was interrupted. 2. Reset HasUsedToolsThisTurn in SendPromptAsync and AbortSessionAsync alongside ActiveToolCallCount — prevents stale tool flag from inflating watchdog timeout on subsequent turns. 3. Scope IsResumed to mid-turn resumes only (set when isStillProcessing is true, not unconditionally). Clear it in CompleteResponse after first successful turn. Changed from init to set property. 3 new tests (680 total). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Fleet review found 2 issues: 1. IsResumed not cleared on abort/error/watchdog paths (GPT-5.3 + Gemini). Only CompleteResponse cleared it, so if a resumed session ended via abort, error, or watchdog timeout, all subsequent turns permanently used the 600s tool timeout instead of 120s inactivity timeout. Fix: Clear IsResumed in abort, error, and watchdog paths. 2. HasUsedToolsThisTurn missing volatile barrier (Sonnet). Written on SDK background thread, read on watchdog timer thread. ARM weak memory model could see stale false, causing premature 120s timeout during tool execution. Fix: Use Volatile.Write on set, Volatile.Read on watchdog check — consistent with HasReceivedEventsSinceResume pattern. 4 new tests (684 total). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
PureWeen
added a commit
that referenced
this pull request
Feb 21, 2026
Invariant audit found 4 pre-existing gaps where IsProcessing was cleared without resetting all companion fields: 1. CompleteResponse: missing ActiveToolCallCount reset 2. SendAsync reconnect+retry failure: missing IsResumed, HasUsedToolsThisTurn, ActiveToolCallCount, FlushCurrentResponse 3. SendAsync initial failure: same missing fields 4. Remote mode abort: missing IsResumed These are the exact same class of bug that caused regressions in PRs #148, #158, and #164. Every path that sets IsProcessing=false must also clear: IsResumed, HasUsedToolsThisTurn, ActiveToolCallCount, ProcessingStartedAt, ToolCallCount, ProcessingPhase, and FlushCurrentResponse. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
PureWeen
added a commit
that referenced
this pull request
Feb 21, 2026
…edge Add comprehensive documentation of the recurring stuck-session bug pattern (7 PRs, 16 fix/regression cycles) to copilot-instructions.md: - Full cleanup checklist for all IsProcessing=false paths - Table of all 7 paths with locations - 7 common mistakes with PR references where each occurred - Staleness check and IsResumed clearing documentation - Cross-thread volatile field requirements - ProcessingGeneration guard explanation - Watchdog diagnostic log tag additions This knowledge was hard-won across PRs #141, #147, #148, #153, #158, #163, #164 and should prevent future regressions by making the invariants explicit and discoverable. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
PureWeen
added a commit
that referenced
this pull request
Feb 21, 2026
Address issues found during INV validation: - INV-4: Add comments explaining why ResumeGracePeriodSeconds (30s) is safe and differs from PR #148's regression (unconditional 10s timeout). Our grace period only fires when HasReceivedEventsSinceResume=false. - INV-7: Use Volatile.Write for HasUsedToolsThisTurn in watchdog fallback (ARM weak memory model visibility). - INV-2/INV-1: Document intentional violations in watchdog fallback — bypasses InvokeOnUI because Blazor circuit is dead, skips FlushCurrentResponse because render pipeline is unreachable. - INV-1 (Path 5): Remote abort was missing ActiveToolCallCount, HasUsedToolsThisTurn cleanup and diagnostic log. - INV-1 (Path 8): Bridge OnTurnEnd was missing ActiveToolCallCount and HasUsedToolsThisTurn cleanup. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
PureWeen
added a commit
that referenced
this pull request
Feb 25, 2026
Three models (Opus 4.6, Sonnet 4.6, Codex 5.3) independently identified a critical bug: seeding LastEventAtTicks from events.jsonl file write time caused the 30s quiescence timeout to fire on the FIRST 15s watchdog check for any session where the file was >15s old. This is exactly the PR #148 regression (killing active sessions with a too-short timeout). Fix: seed from DateTime.UtcNow (not file time) so the 30s quiescence measures from app restart, not from last SDK event before crash. This gives the SDK a full 30s to reconnect and start streaming events. Added 16 new regression guard tests: - ResumeQuiescenceTimeout_IsReasonable (20-120s range, >= 2x check interval) - ResumeQuiescenceTimeout_IsLessThanInactivityTimeout - ResumeQuiescence_OnlyTriggersWhenResumedAndNoEvents (exhaustive combos) - ResumeQuiescence_NotResumed_NeverTriggersQuiescence - ResumeQuiescence_TransitionsToToolTimeout_WhenEventsArrive - ResumeQuiescence_TransitionsToInactivity_AfterIsResumedCleared - WatchdogTimeoutSelection_ExhaustiveMatrix (10 Theory rows) - SeedTime_MustNotCauseImmediateKill_RegressionGuard (PR #148 guard) - SeedTime_FromStaleFile_WouldCauseImmediateKill_DocumentsRisk - QuiescenceTimeout_EscapesOnFirstEvent - QuiescenceTimeout_DoesNotAffect_NormalSendPromptPath - WatchdogResumeQuiescence_Constant_MatchesExpectedValue - AllThreeTimeoutTiers_AreDistinct Also updated existing tests to use ComputeEffectiveTimeout helper mirroring the production 3-tier formula, preventing test-vs-production logic drift. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
PureWeen
added a commit
that referenced
this pull request
Feb 25, 2026
…ime param Three-model review (Opus 4.6, Sonnet 4.6, Codex 5.3) found: 1. Error message shows '0 minute(s)' for 30s quiescence timeout due to integer division (30/60=0). Fixed: format as seconds when < 60s. 2. StartProcessingWatchdog had a seedTime parameter with a comment describing the OPPOSITE of correct usage (said 'seed from file time' when file-time seeding causes PR #148 regression). Removed the dead parameter entirely and replaced comment with a warning. 3. Stale comment at resume site said 'no separate short timeout needed' which contradicts the 30s quiescence this PR adds. Updated to list all three timeout tiers. Added 5 Theory tests for timeout message formatting (30s/59s/60s/120s/600s). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
PureWeen
added a commit
that referenced
this pull request
Feb 25, 2026
## Problem After app restart, resumed sessions that were mid-turn show **Thinking...** with a Stop button. The user must manually click Stop every time. The existing watchdog waited 600s (10 min!) before clearing stuck IsProcessing. ## Solution Add a **30s resume quiescence timeout** for sessions that receive zero SDK events after restart. If no events flow within 30s of app start, the session is cleared as stuck. ### Key design decisions (informed by 4-model consultation: Opus 4.6, Sonnet 4.6, Codex 5.3, GPT-5.1): 1. **30s quiescence** — short enough users don't wait, long enough for SDK reconnect (~5s typical, 6x safety margin) 2. **Event-gated** — only fires when \HasReceivedEventsSinceResume == false\. Once events start flowing, transitions to normal 120s/600s timeout tiers 3. **Seed from DateTime.UtcNow, NOT file time** — all 3 models independently flagged that seeding from events.jsonl would cause immediate kills for sessions >15s old (exact PR #148 regression pattern) 4. **Reuses existing watchdog fire path** — no new IsProcessing cleanup code, all 8 invariants preserved ### Timeout tiers (3-tier, was 2-tier): | Condition | Timeout | |-----------|---------| | Resumed, zero events since restart | **30s** (NEW) | | Normal processing, no tools | 120s | | Active tools / resumed with events / multi-agent | 600s | ## Tests - **16 new regression guard tests** covering quiescence edge cases, seed time safety, exhaustive timeout matrix - Updated existing tests to use \ComputeEffectiveTimeout\ helper mirroring production 3-tier formula - **108 total watchdog+recovery tests pass** ✅ ## Regression history context This code has been through 7 PRs of fix/regression cycles (PRs #141→#147→#148→#153→#158→#163→#164). The most relevant precedent: PR #148 added a 10s resume timeout that killed active sessions. Our 30s timeout avoids this by being event-gated and seeded from UtcNow. Fixes the 'click Stop on every restart' UX issue. --------- Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Fix sessions getting stuck on "Thinking" after app relaunch or during long tool executions. Add model labels on assistant messages and comprehensive diagnostic logging.
Bug
Sessions appeared permanently stuck on "Thinking" after PolyPilot was relaunched while sessions were mid-turn. The processing watchdog had several issues that prevented proper recovery.
Changes
Watchdog & Recovery
HasUsedToolsThisTurnstays true for the entire processing cycle, preventing false 120s timeouts between tool rounds whenActiveToolCallCountresetsIsResumedscoping — only set when session was mid-turn at restart, cleared after first successful turn and on abort/error/watchdog pathsHasUsedToolsThisTurnleak — reset inSendPromptAsyncandAbortSessionAsyncso stale flags don't inflate subsequent turn timeoutsThread Safety
SessionErrorEventusesInvokeOnUI()— allIsProcessingmutations now marshaled to UI threadHasUsedToolsThisTurnusesVolatile.Write/Read— ARM memory barrier for cross-thread safety, consistent withHasReceivedEventsSinceResumeModel Labels
ChatMessage.Modelin all 4 creation paths: CompleteResponse, FlushCurrentResponse, abort flush, bridge OnTurnEndGhost Messages
FlushCurrentResponseandCompleteResponse(IsNullOrWhiteSpaceinstead ofIsNullOrEmpty)Interrupted Turn Recovery
LastPromptinactive-sessions.jsonwhen session is mid-turnIsProcessing=falsepaths:[COMPLETE],[ERROR],[ABORT],[BRIDGE-COMPLETE],[INTERRUPTED]Other
ParseMcpServerConfigTests
17 new tests (684 total passing):
Files Changed
CopilotService.Events.cs— watchdog logic, CompleteResponse, FlushCurrentResponse, SessionErrorEventCopilotService.cs— SendPromptAsync, AbortSessionAsync, ResumeSessionAsync, LastPromptCopilotService.Persistence.cs— LastPrompt save during persistenceCopilotService.Bridge.cs— Model on bridge turn endAgentSessionInfo.cs— IsResumed init to setChatMessage.cs— Model propertyChatDatabase.cs— Model columnChatMessageItem.razor— Model display.github/copilot-instructions.md— Watchdog, diagnostic, thread safety docsProcessingWatchdogTests.cs,ChatMessageTests.cs,SessionPersistenceTests.cs