Skip to content

Fix stuck sessions: watchdog recovery, model labels, and thread safety#148

Merged
PureWeen merged 15 commits intomainfrom
feat/elapsed-thinking-timer
Feb 19, 2026
Merged

Fix stuck sessions: watchdog recovery, model labels, and thread safety#148
PureWeen merged 15 commits intomainfrom
feat/elapsed-thinking-timer

Conversation

@PureWeen
Copy link
Owner

@PureWeen PureWeen commented Feb 19, 2026

Summary

Fix sessions getting stuck on "Thinking" after app relaunch or during long tool executions. Add model labels on assistant messages and comprehensive diagnostic logging.

Bug

Sessions appeared permanently stuck on "Thinking" after PolyPilot was relaunched while sessions were mid-turn. The processing watchdog had several issues that prevented proper recovery.

Changes

Watchdog & Recovery

  • Remove 10s resume timeout — watchdog handles all stuck detection (cherry-picked from PR feat: Multi-agent orchestration with Squad integration #104)
  • Fix watchdog timeout during tool loopsHasUsedToolsThisTurn stays true for the entire processing cycle, preventing false 120s timeouts between tool rounds when ActiveToolCallCount resets
  • Wire event handler before watchdog start — prevents missing early events after resume
  • Fix IsResumed scoping — only set when session was mid-turn at restart, cleared after first successful turn and on abort/error/watchdog paths
  • Fix HasUsedToolsThisTurn leak — reset in SendPromptAsync and AbortSessionAsync so stale flags don't inflate subsequent turn timeouts

Thread Safety

  • SessionErrorEvent uses InvokeOnUI() — all IsProcessing mutations now marshaled to UI thread
  • HasUsedToolsThisTurn uses Volatile.Write/Read — ARM memory barrier for cross-thread safety, consistent with HasReceivedEventsSinceResume

Model Labels

  • Show model name next to timestamp on assistant messages (e.g., "10:30 AM · gpt-4.1")
  • Set ChatMessage.Model in all 4 creation paths: CompleteResponse, FlushCurrentResponse, abort flush, bridge OnTurnEnd
  • Persist Model column in ChatDatabase (sqlite-net auto-migrates)

Ghost Messages

  • Skip whitespace-only assistant messages in FlushCurrentResponse and CompleteResponse (IsNullOrWhiteSpace instead of IsNullOrEmpty)

Interrupted Turn Recovery

  • Persist LastPrompt in active-sessions.json when session is mid-turn
  • Show actionable system message on reconnect with truncated last prompt
  • Diagnostic logging on all IsProcessing=false paths: [COMPLETE], [ERROR], [ABORT], [BRIDGE-COMPLETE], [INTERRUPTED]

Other

  • Fix nullable reference warnings in ParseMcpServerConfig

Tests

17 new tests (684 total passing):

  • 11 watchdog timeout selection tests (timeout tiers, reset on send/abort, IsResumed scoping/clearing, volatile consistency)
  • 4 model property tests + ghost message prevention
  • 3 LastPrompt persistence round-trip/merge tests
  • 3 interrupted turn message tests

Files Changed

  • CopilotService.Events.cs — watchdog logic, CompleteResponse, FlushCurrentResponse, SessionErrorEvent
  • CopilotService.cs — SendPromptAsync, AbortSessionAsync, ResumeSessionAsync, LastPrompt
  • CopilotService.Persistence.cs — LastPrompt save during persistence
  • CopilotService.Bridge.cs — Model on bridge turn end
  • AgentSessionInfo.cs — IsResumed init to set
  • ChatMessage.cs — Model property
  • ChatDatabase.cs — Model column
  • ChatMessageItem.razor — Model display
  • .github/copilot-instructions.md — Watchdog, diagnostic, thread safety docs
  • ProcessingWatchdogTests.cs, ChatMessageTests.cs, SessionPersistenceTests.cs

PureWeen and others added 4 commits February 19, 2026 13:08
Add ProcessingStartedAt timestamp to AgentSessionInfo. Display elapsed
time (e.g. '(45s)', '(2m 30s)') next to the Thinking indicator so users
can see the session is alive during long tool executions. Only shows
after 5 seconds to avoid flicker on quick responses.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Replace Blazor-render-dependent elapsed label with a global JS
setInterval that ticks every second. Uses data-start attribute on
.live-timer spans so the timer keeps counting even when the CLI goes
silent and no Blazor re-renders occur. One global timer handles all
sessions with zero re-render cost.

Also wire ProcessingStartedAt to SessionCard (compact grid view) which
was missing the parameter.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add Model property to ChatMessage, set from state.Info.Model in
CompleteResponse and abort flush. Display as '10:30 AM · gpt-4.1'
next to the timestamp on each assistant message.

Remove the elapsed timer (JS setInterval, data-start attributes,
GetElapsedLabel) as it provided no actionable information.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Set Model on FlushCurrentResponse assistant messages (mid-turn flush
  before tool calls was missing it)
- Set Model on bridge OnTurnEnd for remote mode messages
- Add Model column to ChatMessageEntity and map in both directions
  so model name survives DB round-trips across app restarts
- Remove dead ProcessingStartedAt property (unused after timer removal)
- Add 4 tests for ChatMessage.Model property

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@PureWeen PureWeen changed the title Show elapsed time on Thinking indicator Show model name on each assistant message Feb 19, 2026
PureWeen and others added 11 commits February 19, 2026 15:03
Add null-coalescing fallbacks for JsonElement.GetString() calls on
Command, Cwd, and Type properties to silence CS8601 warnings.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Marshal IsProcessing mutation and ResponseCompletion.TrySetException to
the UI thread via InvokeOnUI(), matching the pattern used by the
processing watchdog. Combines the error notification, state mutation,
and state-changed callback into a single UI-thread dispatch to prevent
race conditions with Blazor rendering.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Change system message to actionable warning with resend hint
- Add LastPrompt field to ActiveSessionEntry for persistence
- Save last user message when session is processing during app exit
- Show truncated last prompt in recovery message so user knows what was lost
- Fire OnError to show error banner on interrupted turn recovery
- Add [INTERRUPTED] debug tag for diagnostic traceability

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
6 new tests (670 total):
- ActiveSessionEntry_LastPrompt_RoundTrips
- ActiveSessionEntry_LastPrompt_NullByDefault
- MergeSessionEntries_PreservesLastPrompt
- InterruptedTurn_SystemMessage_ContainsWarning
- InterruptedTurn_SystemMessage_IncludesLastPrompt
- InterruptedTurn_SystemMessage_TruncatesLongPrompt

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The 10-second hardcoded timeout in ResumeSessionAsync was prematurely
clearing IsProcessing on sessions that were still actively working.
Tool calls (dotnet build, git push, etc.) can easily go 30-60 seconds
between events, causing the resume logic to declare the turn dead.

Changes:
- Remove the 10-second resume timeout entirely — the processing
  watchdog (120s inactivity / 600s tool execution) already handles
  stuck sessions properly
- Move event handler subscription (copilotSession.On) BEFORE the
  watchdog setup to fix a race where events arriving immediately
  after SDK resume were missed because the handler wasn't wired yet

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The processing watchdog was incorrectly using the 120s inactivity timeout
even when the session was actively running multi-turn tool calls. This
happened because AssistantTurnStartEvent resets ActiveToolCallCount to 0
between tool rounds, making the model's 'thinking' gap between tools
look like inactivity.

Added HasUsedToolsThisTurn flag that stays true for the entire processing
cycle once any tool executes. The watchdog now uses the 600s tool timeout
when: a tool is actively running (hasActiveTool), the session was resumed
mid-turn (IsResumed), or tools have been used this turn (HasUsedToolsThisTurn).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add 7 tests to ProcessingWatchdogTests covering the 3-way condition
that selects between inactivity (120s) and tool execution (600s) timeouts:

- HasUsedToolsThisTurn: defaults false, can be set, reset by CompleteResponse
- Timeout selection: no tools → 120s, active tool → 600s, resumed → 600s,
  has-used-tools → 600s

SessionState is private, so tests replicate the watchdog's decision logic
inline using local variables matching CopilotService.Events.cs.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…tterns

Add to copilot-instructions.md:
- Processing Watchdog section (120s/600s timeout tiers, 3-way condition)
- Diagnostic Log Tags reference ([SEND], [COMPLETE], [ERROR], etc.)
- Thread Safety: IsProcessing Mutations (InvokeOnUI pattern)
- LastPrompt in session persistence
- 11 missing test files in Test Coverage section

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ompleteResponse

FlushCurrentResponse was creating ghost messages containing only newlines
when the CLI sent formatting whitespace before tool calls. Changed both
FlushCurrentResponse and CompleteResponse to use IsNullOrWhiteSpace
instead of IsNullOrEmpty.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…med scoping

1. Wire lastPrompt into reconnect system message when session was
   mid-turn — shows truncated last user message so user knows what
   was interrupted.

2. Reset HasUsedToolsThisTurn in SendPromptAsync and AbortSessionAsync
   alongside ActiveToolCallCount — prevents stale tool flag from
   inflating watchdog timeout on subsequent turns.

3. Scope IsResumed to mid-turn resumes only (set when isStillProcessing
   is true, not unconditionally). Clear it in CompleteResponse after
   first successful turn. Changed from init to set property.

3 new tests (680 total).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Fleet review found 2 issues:

1. IsResumed not cleared on abort/error/watchdog paths (GPT-5.3 + Gemini).
   Only CompleteResponse cleared it, so if a resumed session ended via
   abort, error, or watchdog timeout, all subsequent turns permanently
   used the 600s tool timeout instead of 120s inactivity timeout.
   Fix: Clear IsResumed in abort, error, and watchdog paths.

2. HasUsedToolsThisTurn missing volatile barrier (Sonnet). Written on
   SDK background thread, read on watchdog timer thread. ARM weak memory
   model could see stale false, causing premature 120s timeout during
   tool execution. Fix: Use Volatile.Write on set, Volatile.Read on
   watchdog check — consistent with HasReceivedEventsSinceResume pattern.

4 new tests (684 total).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@PureWeen PureWeen changed the title Show model name on each assistant message Fix stuck sessions: watchdog recovery, model labels, and thread safety Feb 19, 2026
@PureWeen PureWeen merged commit 7594e20 into main Feb 19, 2026
6 checks passed
PureWeen added a commit that referenced this pull request Feb 21, 2026
Invariant audit found 4 pre-existing gaps where IsProcessing was cleared
without resetting all companion fields:

1. CompleteResponse: missing ActiveToolCallCount reset
2. SendAsync reconnect+retry failure: missing IsResumed, HasUsedToolsThisTurn,
   ActiveToolCallCount, FlushCurrentResponse
3. SendAsync initial failure: same missing fields
4. Remote mode abort: missing IsResumed

These are the exact same class of bug that caused regressions in PRs #148,
#158, and #164. Every path that sets IsProcessing=false must also clear:
IsResumed, HasUsedToolsThisTurn, ActiveToolCallCount, ProcessingStartedAt,
ToolCallCount, ProcessingPhase, and FlushCurrentResponse.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
PureWeen added a commit that referenced this pull request Feb 21, 2026
…edge

Add comprehensive documentation of the recurring stuck-session bug pattern
(7 PRs, 16 fix/regression cycles) to copilot-instructions.md:

- Full cleanup checklist for all IsProcessing=false paths
- Table of all 7 paths with locations
- 7 common mistakes with PR references where each occurred
- Staleness check and IsResumed clearing documentation
- Cross-thread volatile field requirements
- ProcessingGeneration guard explanation
- Watchdog diagnostic log tag additions

This knowledge was hard-won across PRs #141, #147, #148, #153, #158,
#163, #164 and should prevent future regressions by making the invariants
explicit and discoverable.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
PureWeen added a commit that referenced this pull request Feb 21, 2026
Address issues found during INV validation:

- INV-4: Add comments explaining why ResumeGracePeriodSeconds (30s) is
  safe and differs from PR #148's regression (unconditional 10s timeout).
  Our grace period only fires when HasReceivedEventsSinceResume=false.

- INV-7: Use Volatile.Write for HasUsedToolsThisTurn in watchdog fallback
  (ARM weak memory model visibility).

- INV-2/INV-1: Document intentional violations in watchdog fallback —
  bypasses InvokeOnUI because Blazor circuit is dead, skips
  FlushCurrentResponse because render pipeline is unreachable.

- INV-1 (Path 5): Remote abort was missing ActiveToolCallCount,
  HasUsedToolsThisTurn cleanup and diagnostic log.

- INV-1 (Path 8): Bridge OnTurnEnd was missing ActiveToolCallCount
  and HasUsedToolsThisTurn cleanup.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@PureWeen PureWeen deleted the feat/elapsed-thinking-timer branch February 22, 2026 00:15
PureWeen added a commit that referenced this pull request Feb 25, 2026
Three models (Opus 4.6, Sonnet 4.6, Codex 5.3) independently identified a
critical bug: seeding LastEventAtTicks from events.jsonl file write time
caused the 30s quiescence timeout to fire on the FIRST 15s watchdog check
for any session where the file was >15s old. This is exactly the PR #148
regression (killing active sessions with a too-short timeout).

Fix: seed from DateTime.UtcNow (not file time) so the 30s quiescence
measures from app restart, not from last SDK event before crash. This gives
the SDK a full 30s to reconnect and start streaming events.

Added 16 new regression guard tests:
- ResumeQuiescenceTimeout_IsReasonable (20-120s range, >= 2x check interval)
- ResumeQuiescenceTimeout_IsLessThanInactivityTimeout
- ResumeQuiescence_OnlyTriggersWhenResumedAndNoEvents (exhaustive combos)
- ResumeQuiescence_NotResumed_NeverTriggersQuiescence
- ResumeQuiescence_TransitionsToToolTimeout_WhenEventsArrive
- ResumeQuiescence_TransitionsToInactivity_AfterIsResumedCleared
- WatchdogTimeoutSelection_ExhaustiveMatrix (10 Theory rows)
- SeedTime_MustNotCauseImmediateKill_RegressionGuard (PR #148 guard)
- SeedTime_FromStaleFile_WouldCauseImmediateKill_DocumentsRisk
- QuiescenceTimeout_EscapesOnFirstEvent
- QuiescenceTimeout_DoesNotAffect_NormalSendPromptPath
- WatchdogResumeQuiescence_Constant_MatchesExpectedValue
- AllThreeTimeoutTiers_AreDistinct

Also updated existing tests to use ComputeEffectiveTimeout helper mirroring
the production 3-tier formula, preventing test-vs-production logic drift.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
PureWeen added a commit that referenced this pull request Feb 25, 2026
…ime param

Three-model review (Opus 4.6, Sonnet 4.6, Codex 5.3) found:

1. Error message shows '0 minute(s)' for 30s quiescence timeout due to
   integer division (30/60=0). Fixed: format as seconds when < 60s.

2. StartProcessingWatchdog had a seedTime parameter with a comment
   describing the OPPOSITE of correct usage (said 'seed from file time'
   when file-time seeding causes PR #148 regression). Removed the dead
   parameter entirely and replaced comment with a warning.

3. Stale comment at resume site said 'no separate short timeout needed'
   which contradicts the 30s quiescence this PR adds. Updated to list
   all three timeout tiers.

Added 5 Theory tests for timeout message formatting (30s/59s/60s/120s/600s).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
PureWeen added a commit that referenced this pull request Feb 25, 2026
## Problem
After app restart, resumed sessions that were mid-turn show
**Thinking...** with a Stop button. The user must manually click Stop
every time. The existing watchdog waited 600s (10 min!) before clearing
stuck IsProcessing.

## Solution
Add a **30s resume quiescence timeout** for sessions that receive zero
SDK events after restart. If no events flow within 30s of app start, the
session is cleared as stuck.

### Key design decisions (informed by 4-model consultation: Opus 4.6,
Sonnet 4.6, Codex 5.3, GPT-5.1):

1. **30s quiescence** — short enough users don't wait, long enough for
SDK reconnect (~5s typical, 6x safety margin)
2. **Event-gated** — only fires when \HasReceivedEventsSinceResume ==
false\. Once events start flowing, transitions to normal 120s/600s
timeout tiers
3. **Seed from DateTime.UtcNow, NOT file time** — all 3 models
independently flagged that seeding from events.jsonl would cause
immediate kills for sessions >15s old (exact PR #148 regression pattern)
4. **Reuses existing watchdog fire path** — no new IsProcessing cleanup
code, all 8 invariants preserved

### Timeout tiers (3-tier, was 2-tier):
| Condition | Timeout |
|-----------|---------|
| Resumed, zero events since restart | **30s** (NEW) |
| Normal processing, no tools | 120s |
| Active tools / resumed with events / multi-agent | 600s |

## Tests
- **16 new regression guard tests** covering quiescence edge cases, seed
time safety, exhaustive timeout matrix
- Updated existing tests to use \ComputeEffectiveTimeout\ helper
mirroring production 3-tier formula
- **108 total watchdog+recovery tests pass** ✅

## Regression history context
This code has been through 7 PRs of fix/regression cycles (PRs
#141#147#148#153#158#163#164). The most relevant precedent: PR
#148 added a 10s resume timeout that killed active sessions. Our 30s
timeout avoids this by being event-gated and seeded from UtcNow.

Fixes the 'click Stop on every restart' UX issue.

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant