Fix stuck sessions: watchdog recovery, model labels, and thread safety by PureWeen · Pull Request #148 · PureWeen/PolyPilot

PureWeen · 2026-02-19T19:08:21Z

Summary

Fix sessions getting stuck on "Thinking" after app relaunch or during long tool executions. Add model labels on assistant messages and comprehensive diagnostic logging.

Bug

Sessions appeared permanently stuck on "Thinking" after PolyPilot was relaunched while sessions were mid-turn. The processing watchdog had several issues that prevented proper recovery.

Changes

Watchdog & Recovery

Remove 10s resume timeout — watchdog handles all stuck detection (cherry-picked from PR feat: Multi-agent orchestration with Squad integration #104)
Fix watchdog timeout during tool loops — HasUsedToolsThisTurn stays true for the entire processing cycle, preventing false 120s timeouts between tool rounds when ActiveToolCallCount resets
Wire event handler before watchdog start — prevents missing early events after resume
Fix IsResumed scoping — only set when session was mid-turn at restart, cleared after first successful turn and on abort/error/watchdog paths
Fix HasUsedToolsThisTurn leak — reset in SendPromptAsync and AbortSessionAsync so stale flags don't inflate subsequent turn timeouts

Thread Safety

SessionErrorEvent uses InvokeOnUI() — all IsProcessing mutations now marshaled to UI thread
HasUsedToolsThisTurn uses Volatile.Write/Read — ARM memory barrier for cross-thread safety, consistent with HasReceivedEventsSinceResume

Model Labels

Show model name next to timestamp on assistant messages (e.g., "10:30 AM · gpt-4.1")
Set ChatMessage.Model in all 4 creation paths: CompleteResponse, FlushCurrentResponse, abort flush, bridge OnTurnEnd
Persist Model column in ChatDatabase (sqlite-net auto-migrates)

Ghost Messages

Skip whitespace-only assistant messages in FlushCurrentResponse and CompleteResponse (IsNullOrWhiteSpace instead of IsNullOrEmpty)

Interrupted Turn Recovery

Persist LastPrompt in active-sessions.json when session is mid-turn
Show actionable system message on reconnect with truncated last prompt
Diagnostic logging on all IsProcessing=false paths: [COMPLETE], [ERROR], [ABORT], [BRIDGE-COMPLETE], [INTERRUPTED]

Other

Fix nullable reference warnings in ParseMcpServerConfig

Tests

17 new tests (684 total passing):

11 watchdog timeout selection tests (timeout tiers, reset on send/abort, IsResumed scoping/clearing, volatile consistency)
4 model property tests + ghost message prevention
3 LastPrompt persistence round-trip/merge tests
3 interrupted turn message tests

Files Changed

CopilotService.Events.cs — watchdog logic, CompleteResponse, FlushCurrentResponse, SessionErrorEvent
CopilotService.cs — SendPromptAsync, AbortSessionAsync, ResumeSessionAsync, LastPrompt
CopilotService.Persistence.cs — LastPrompt save during persistence
CopilotService.Bridge.cs — Model on bridge turn end
AgentSessionInfo.cs — IsResumed init to set
ChatMessage.cs — Model property
ChatDatabase.cs — Model column
ChatMessageItem.razor — Model display
.github/copilot-instructions.md — Watchdog, diagnostic, thread safety docs
ProcessingWatchdogTests.cs, ChatMessageTests.cs, SessionPersistenceTests.cs

Add ProcessingStartedAt timestamp to AgentSessionInfo. Display elapsed time (e.g. '(45s)', '(2m 30s)') next to the Thinking indicator so users can see the session is alive during long tool executions. Only shows after 5 seconds to avoid flicker on quick responses. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Replace Blazor-render-dependent elapsed label with a global JS setInterval that ticks every second. Uses data-start attribute on .live-timer spans so the timer keeps counting even when the CLI goes silent and no Blazor re-renders occur. One global timer handles all sessions with zero re-render cost. Also wire ProcessingStartedAt to SessionCard (compact grid view) which was missing the parameter. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Add Model property to ChatMessage, set from state.Info.Model in CompleteResponse and abort flush. Display as '10:30 AM · gpt-4.1' next to the timestamp on each assistant message. Remove the elapsed timer (JS setInterval, data-start attributes, GetElapsedLabel) as it provided no actionable information. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

- Set Model on FlushCurrentResponse assistant messages (mid-turn flush before tool calls was missing it) - Set Model on bridge OnTurnEnd for remote mode messages - Add Model column to ChatMessageEntity and map in both directions so model name survives DB round-trips across app restarts - Remove dead ProcessingStartedAt property (unused after timer removal) - Add 4 tests for ChatMessage.Model property Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Add null-coalescing fallbacks for JsonElement.GetString() calls on Command, Cwd, and Type properties to silence CS8601 warnings. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Marshal IsProcessing mutation and ResponseCompletion.TrySetException to the UI thread via InvokeOnUI(), matching the pattern used by the processing watchdog. Combines the error notification, state mutation, and state-changed callback into a single UI-thread dispatch to prevent race conditions with Blazor rendering. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

- Change system message to actionable warning with resend hint - Add LastPrompt field to ActiveSessionEntry for persistence - Save last user message when session is processing during app exit - Show truncated last prompt in recovery message so user knows what was lost - Fire OnError to show error banner on interrupted turn recovery - Add [INTERRUPTED] debug tag for diagnostic traceability Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

6 new tests (670 total): - ActiveSessionEntry_LastPrompt_RoundTrips - ActiveSessionEntry_LastPrompt_NullByDefault - MergeSessionEntries_PreservesLastPrompt - InterruptedTurn_SystemMessage_ContainsWarning - InterruptedTurn_SystemMessage_IncludesLastPrompt - InterruptedTurn_SystemMessage_TruncatesLongPrompt Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

The 10-second hardcoded timeout in ResumeSessionAsync was prematurely clearing IsProcessing on sessions that were still actively working. Tool calls (dotnet build, git push, etc.) can easily go 30-60 seconds between events, causing the resume logic to declare the turn dead. Changes: - Remove the 10-second resume timeout entirely — the processing watchdog (120s inactivity / 600s tool execution) already handles stuck sessions properly - Move event handler subscription (copilotSession.On) BEFORE the watchdog setup to fix a race where events arriving immediately after SDK resume were missed because the handler wasn't wired yet Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

The processing watchdog was incorrectly using the 120s inactivity timeout even when the session was actively running multi-turn tool calls. This happened because AssistantTurnStartEvent resets ActiveToolCallCount to 0 between tool rounds, making the model's 'thinking' gap between tools look like inactivity. Added HasUsedToolsThisTurn flag that stays true for the entire processing cycle once any tool executes. The watchdog now uses the 600s tool timeout when: a tool is actively running (hasActiveTool), the session was resumed mid-turn (IsResumed), or tools have been used this turn (HasUsedToolsThisTurn). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Add 7 tests to ProcessingWatchdogTests covering the 3-way condition that selects between inactivity (120s) and tool execution (600s) timeouts: - HasUsedToolsThisTurn: defaults false, can be set, reset by CompleteResponse - Timeout selection: no tools → 120s, active tool → 600s, resumed → 600s, has-used-tools → 600s SessionState is private, so tests replicate the watchdog's decision logic inline using local variables matching CopilotService.Events.cs. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…tterns Add to copilot-instructions.md: - Processing Watchdog section (120s/600s timeout tiers, 3-way condition) - Diagnostic Log Tags reference ([SEND], [COMPLETE], [ERROR], etc.) - Thread Safety: IsProcessing Mutations (InvokeOnUI pattern) - LastPrompt in session persistence - 11 missing test files in Test Coverage section Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…ompleteResponse FlushCurrentResponse was creating ghost messages containing only newlines when the CLI sent formatting whitespace before tool calls. Changed both FlushCurrentResponse and CompleteResponse to use IsNullOrWhiteSpace instead of IsNullOrEmpty. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…med scoping 1. Wire lastPrompt into reconnect system message when session was mid-turn — shows truncated last user message so user knows what was interrupted. 2. Reset HasUsedToolsThisTurn in SendPromptAsync and AbortSessionAsync alongside ActiveToolCallCount — prevents stale tool flag from inflating watchdog timeout on subsequent turns. 3. Scope IsResumed to mid-turn resumes only (set when isStillProcessing is true, not unconditionally). Clear it in CompleteResponse after first successful turn. Changed from init to set property. 3 new tests (680 total). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Fleet review found 2 issues: 1. IsResumed not cleared on abort/error/watchdog paths (GPT-5.3 + Gemini). Only CompleteResponse cleared it, so if a resumed session ended via abort, error, or watchdog timeout, all subsequent turns permanently used the 600s tool timeout instead of 120s inactivity timeout. Fix: Clear IsResumed in abort, error, and watchdog paths. 2. HasUsedToolsThisTurn missing volatile barrier (Sonnet). Written on SDK background thread, read on watchdog timer thread. ARM weak memory model could see stale false, causing premature 120s timeout during tool execution. Fix: Use Volatile.Write on set, Volatile.Read on watchdog check — consistent with HasReceivedEventsSinceResume pattern. 4 new tests (684 total). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Invariant audit found 4 pre-existing gaps where IsProcessing was cleared without resetting all companion fields: 1. CompleteResponse: missing ActiveToolCallCount reset 2. SendAsync reconnect+retry failure: missing IsResumed, HasUsedToolsThisTurn, ActiveToolCallCount, FlushCurrentResponse 3. SendAsync initial failure: same missing fields 4. Remote mode abort: missing IsResumed These are the exact same class of bug that caused regressions in PRs #148, #158, and #164. Every path that sets IsProcessing=false must also clear: IsResumed, HasUsedToolsThisTurn, ActiveToolCallCount, ProcessingStartedAt, ToolCallCount, ProcessingPhase, and FlushCurrentResponse. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…edge Add comprehensive documentation of the recurring stuck-session bug pattern (7 PRs, 16 fix/regression cycles) to copilot-instructions.md: - Full cleanup checklist for all IsProcessing=false paths - Table of all 7 paths with locations - 7 common mistakes with PR references where each occurred - Staleness check and IsResumed clearing documentation - Cross-thread volatile field requirements - ProcessingGeneration guard explanation - Watchdog diagnostic log tag additions This knowledge was hard-won across PRs #141, #147, #148, #153, #158, #163, #164 and should prevent future regressions by making the invariants explicit and discoverable. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Address issues found during INV validation: - INV-4: Add comments explaining why ResumeGracePeriodSeconds (30s) is safe and differs from PR #148's regression (unconditional 10s timeout). Our grace period only fires when HasReceivedEventsSinceResume=false. - INV-7: Use Volatile.Write for HasUsedToolsThisTurn in watchdog fallback (ARM weak memory model visibility). - INV-2/INV-1: Document intentional violations in watchdog fallback — bypasses InvokeOnUI because Blazor circuit is dead, skips FlushCurrentResponse because render pipeline is unreachable. - INV-1 (Path 5): Remote abort was missing ActiveToolCallCount, HasUsedToolsThisTurn cleanup and diagnostic log. - INV-1 (Path 8): Bridge OnTurnEnd was missing ActiveToolCallCount and HasUsedToolsThisTurn cleanup. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Three models (Opus 4.6, Sonnet 4.6, Codex 5.3) independently identified a critical bug: seeding LastEventAtTicks from events.jsonl file write time caused the 30s quiescence timeout to fire on the FIRST 15s watchdog check for any session where the file was >15s old. This is exactly the PR #148 regression (killing active sessions with a too-short timeout). Fix: seed from DateTime.UtcNow (not file time) so the 30s quiescence measures from app restart, not from last SDK event before crash. This gives the SDK a full 30s to reconnect and start streaming events. Added 16 new regression guard tests: - ResumeQuiescenceTimeout_IsReasonable (20-120s range, >= 2x check interval) - ResumeQuiescenceTimeout_IsLessThanInactivityTimeout - ResumeQuiescence_OnlyTriggersWhenResumedAndNoEvents (exhaustive combos) - ResumeQuiescence_NotResumed_NeverTriggersQuiescence - ResumeQuiescence_TransitionsToToolTimeout_WhenEventsArrive - ResumeQuiescence_TransitionsToInactivity_AfterIsResumedCleared - WatchdogTimeoutSelection_ExhaustiveMatrix (10 Theory rows) - SeedTime_MustNotCauseImmediateKill_RegressionGuard (PR #148 guard) - SeedTime_FromStaleFile_WouldCauseImmediateKill_DocumentsRisk - QuiescenceTimeout_EscapesOnFirstEvent - QuiescenceTimeout_DoesNotAffect_NormalSendPromptPath - WatchdogResumeQuiescence_Constant_MatchesExpectedValue - AllThreeTimeoutTiers_AreDistinct Also updated existing tests to use ComputeEffectiveTimeout helper mirroring the production 3-tier formula, preventing test-vs-production logic drift. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…ime param Three-model review (Opus 4.6, Sonnet 4.6, Codex 5.3) found: 1. Error message shows '0 minute(s)' for 30s quiescence timeout due to integer division (30/60=0). Fixed: format as seconds when < 60s. 2. StartProcessingWatchdog had a seedTime parameter with a comment describing the OPPOSITE of correct usage (said 'seed from file time' when file-time seeding causes PR #148 regression). Removed the dead parameter entirely and replaced comment with a warning. 3. Stale comment at resume site said 'no separate short timeout needed' which contradicts the 30s quiescence this PR adds. Updated to list all three timeout tiers. Added 5 Theory tests for timeout message formatting (30s/59s/60s/120s/600s). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

## Problem After app restart, resumed sessions that were mid-turn show **Thinking...** with a Stop button. The user must manually click Stop every time. The existing watchdog waited 600s (10 min!) before clearing stuck IsProcessing. ## Solution Add a **30s resume quiescence timeout** for sessions that receive zero SDK events after restart. If no events flow within 30s of app start, the session is cleared as stuck. ### Key design decisions (informed by 4-model consultation: Opus 4.6, Sonnet 4.6, Codex 5.3, GPT-5.1): 1. **30s quiescence** — short enough users don't wait, long enough for SDK reconnect (~5s typical, 6x safety margin) 2. **Event-gated** — only fires when \HasReceivedEventsSinceResume == false\. Once events start flowing, transitions to normal 120s/600s timeout tiers 3. **Seed from DateTime.UtcNow, NOT file time** — all 3 models independently flagged that seeding from events.jsonl would cause immediate kills for sessions >15s old (exact PR #148 regression pattern) 4. **Reuses existing watchdog fire path** — no new IsProcessing cleanup code, all 8 invariants preserved ### Timeout tiers (3-tier, was 2-tier): | Condition | Timeout | |-----------|---------| | Resumed, zero events since restart | **30s** (NEW) | | Normal processing, no tools | 120s | | Active tools / resumed with events / multi-agent | 600s | ## Tests - **16 new regression guard tests** covering quiescence edge cases, seed time safety, exhaustive timeout matrix - Updated existing tests to use \ComputeEffectiveTimeout\ helper mirroring production 3-tier formula - **108 total watchdog+recovery tests pass** ✅ ## Regression history context This code has been through 7 PRs of fix/regression cycles (PRs #141→#147→#148→#153→#158→#163→#164). The most relevant precedent: PR #148 added a 10s resume timeout that killed active sessions. Our 30s timeout avoids this by being event-gated and seeded from UtcNow. Fixes the 'click Stop on every restart' UX issue. --------- Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

PureWeen and others added 4 commits February 19, 2026 13:08

PureWeen changed the title ~~Show elapsed time on Thinking indicator~~ Show model name on each assistant message Feb 19, 2026

PureWeen and others added 11 commits February 19, 2026 15:03

Fix nullable reference warnings in ParseMcpServerConfig

6f66791

Add null-coalescing fallbacks for JsonElement.GetString() calls on Command, Cwd, and Type properties to silence CS8601 warnings. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

PureWeen changed the title ~~Show model name on each assistant message~~ Fix stuck sessions: watchdog recovery, model labels, and thread safety Feb 19, 2026

PureWeen merged commit 7594e20 into main Feb 19, 2026
6 checks passed

PureWeen deleted the feat/elapsed-thinking-timer branch February 22, 2026 00:15

PureWeen mentioned this pull request Feb 25, 2026

Fix stuck Thinking after restart: 30s quiescence timeout #211

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix stuck sessions: watchdog recovery, model labels, and thread safety#148

Fix stuck sessions: watchdog recovery, model labels, and thread safety#148
PureWeen merged 15 commits intomainfrom
feat/elapsed-thinking-timer

PureWeen commented Feb 19, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

PureWeen commented Feb 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Bug

Changes

Watchdog & Recovery

Thread Safety

Model Labels

Ghost Messages

Interrupted Turn Recovery

Other

Tests

Files Changed

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

PureWeen commented Feb 19, 2026 •

edited

Loading