Skip to content

Fix sessions stuck in IsProcessing=true after app restart#163

Merged
PureWeen merged 8 commits intomainfrom
fix/session-multiagentsupport-can-you-tell-w-20260220-2220
Feb 21, 2026
Merged

Fix sessions stuck in IsProcessing=true after app restart#163
PureWeen merged 8 commits intomainfrom
fix/session-multiagentsupport-can-you-tell-w-20260220-2220

Conversation

@PureWeen
Copy link
Owner

Bug

Sessions restored from disk could get permanently stuck showing 'Thinking' (IsProcessing=true) when the CLI backend had already finished processing. The debug info showed:

  • \IsProcessing: True\ with \LastUpdatedAt\ far in the past
  • Session appeared stuck with no visible progress

Root Cause

Two issues allowed sessions to stay in stuck 'processing' state:

  1. No staleness check on restore: \IsSessionStillProcessing\ checked the last event TYPE in events.jsonl but not its AGE. If the CLI finished processing hours ago but the last event was an 'active' type, the session was incorrectly marked as still processing.

  2. IsResumed kept watchdog timeout at 600s too long: Resumed sessions used the tool-execution timeout (600s) until the first \CompleteResponse. If the CLI sent non-terminal events but never \SessionIdleEvent, the session stayed in the long-timeout mode indefinitely.

Fix

  1. Staleness check: If \�vents.jsonl\ was last modified >600s ago, treat the session as idle regardless of last event type.
  2. Clear IsResumed after events flow: Once events arrive on a resumed session, clear \IsResumed\ so the watchdog transitions to the shorter 120s timeout. \HasUsedToolsThisTurn\ still preserves the 600s timeout for active tool executions.

Tests

  • 21 new tests in \StuckSessionRecoveryTests.cs\ covering staleness detection, active/inactive event types, corrupt file handling, and IsResumed clearing logic
  • New UI scenario in \mode-switch-scenarios.json\ for stale session recovery verification
  • All 836 existing tests continue to pass (4 pre-existing DiffParser failures on Windows due to \r\n)

PureWeen and others added 2 commits February 20, 2026 19:20
Two fixes prevent sessions from permanently showing 'Thinking' state:

1. Staleness check in IsSessionStillProcessing: If events.jsonl hasn't
   been modified in over 600 seconds (watchdog tool timeout), the session
   is treated as idle regardless of the last event type. This handles the
   case where the CLI finished processing hours ago but the events.jsonl
   last event was an 'active' type (e.g. assistant.message_delta).

2. Clear IsResumed in watchdog after events flow: On resumed sessions,
   IsResumed=true forces the 600s tool-execution timeout. After events
   start arriving (HasReceivedEventsSinceResume), the watchdog clears
   IsResumed so the shorter 120s inactivity timeout applies. The
   HasUsedToolsThisTurn flag still preserves the 600s timeout when
   tools are actively executing.

Includes 21 new tests covering staleness detection, active/inactive
event type classification, corrupt file handling, and IsResumed logic.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…tests

- HIGH: Don't clear IsResumed when tools are active or have been used this
  turn. The deduplication path in ToolExecutionStartEvent skips the
  ActiveToolCallCount increment, so a resumed mid-tool session could have
  its 600s timeout incorrectly downgraded to 120s.
- MEDIUM: Wrap IsResumed=false in InvokeOnUI consistent with all other
  state.Info mutations in the watchdog.
- MEDIUM: Replace 2 self-fulfilling tests with 4 tests that cover all
  combinations of the watchdog guard condition (events+tools, events+no
  tools, no events, events+hasUsedToolsThisTurn).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@PureWeen PureWeen force-pushed the fix/session-multiagentsupport-can-you-tell-w-20260220-2220 branch from 9f68e11 to 1686ceb Compare February 21, 2026 01:39
PureWeen and others added 6 commits February 20, 2026 20:57
Invariant audit found 4 pre-existing gaps where IsProcessing was cleared
without resetting all companion fields:

1. CompleteResponse: missing ActiveToolCallCount reset
2. SendAsync reconnect+retry failure: missing IsResumed, HasUsedToolsThisTurn,
   ActiveToolCallCount, FlushCurrentResponse
3. SendAsync initial failure: same missing fields
4. Remote mode abort: missing IsResumed

These are the exact same class of bug that caused regressions in PRs #148,
#158, and #164. Every path that sets IsProcessing=false must also clear:
IsResumed, HasUsedToolsThisTurn, ActiveToolCallCount, ProcessingStartedAt,
ToolCallCount, ProcessingPhase, and FlushCurrentResponse.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…edge

Add comprehensive documentation of the recurring stuck-session bug pattern
(7 PRs, 16 fix/regression cycles) to copilot-instructions.md:

- Full cleanup checklist for all IsProcessing=false paths
- Table of all 7 paths with locations
- 7 common mistakes with PR references where each occurred
- Staleness check and IsResumed clearing documentation
- Cross-thread volatile field requirements
- ProcessingGeneration guard explanation
- Watchdog diagnostic log tag additions

This knowledge was hard-won across PRs #141, #147, #148, #153, #158,
#163, #164 and should prevent future regressions by making the invariants
explicit and discoverable.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The IsProcessing cleanup invariant documentation was bloating
copilot-instructions.md (loaded into every session). Move the
detailed knowledge — 7 invariants, regression history table,
common mistakes with PR references, thread safety rules — into
.claude/skills/processing-state-safety/SKILL.md.

The skill has front matter so it's loaded on-demand when working
on processing/watchdog/resume code. The main instructions keep a
short pointer + essentials (344→286 lines, 25KB→20KB).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Follow skill-creator guidelines:
- SKILL.md is now a concise procedural workflow (74 lines, not a
  knowledge dump) with 4 clear steps: identify path, apply checklist,
  verify all 7 paths, check thread safety
- Moved regression history, common mistakes, and thread safety details
  to references/regression-history.md (loaded only when debugging)
- Progressive disclosure: SKILL.md has the checklist and rules,
  references/ has the context and history

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
1. Bridge OnTurnEnd (8th path): Was missing IsResumed, ProcessingStartedAt,
   ToolCallCount, ProcessingPhase cleanup. Also moved all state mutations
   inside InvokeOnUI to fix thread safety (Opus finding).

2. Corrected factually wrong comment: The dedup path does NOT skip
   ActiveToolCallCount++ — the increment is at line 277, before the dedup
   break at 291. The real reason HasUsedToolsThisTurn is needed is that
   AssistantTurnStartEvent resets ActiveToolCallCount to 0 between rounds.
   (Sonnet + GPT 5.3 consensus finding)

3. Updated skill docs: Added 8th path to table, noted CompleteResponse
   saves response inline (not via FlushCurrentResponse), corrected the
   dedup claim in both SKILL.md and regression-history.md.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@PureWeen PureWeen merged commit a52e81f into main Feb 21, 2026
6 checks passed
@PureWeen PureWeen deleted the fix/session-multiagentsupport-can-you-tell-w-20260220-2220 branch February 22, 2026 00:15
PureWeen added a commit that referenced this pull request Feb 25, 2026
## Problem
After app restart, resumed sessions that were mid-turn show
**Thinking...** with a Stop button. The user must manually click Stop
every time. The existing watchdog waited 600s (10 min!) before clearing
stuck IsProcessing.

## Solution
Add a **30s resume quiescence timeout** for sessions that receive zero
SDK events after restart. If no events flow within 30s of app start, the
session is cleared as stuck.

### Key design decisions (informed by 4-model consultation: Opus 4.6,
Sonnet 4.6, Codex 5.3, GPT-5.1):

1. **30s quiescence** — short enough users don't wait, long enough for
SDK reconnect (~5s typical, 6x safety margin)
2. **Event-gated** — only fires when \HasReceivedEventsSinceResume ==
false\. Once events start flowing, transitions to normal 120s/600s
timeout tiers
3. **Seed from DateTime.UtcNow, NOT file time** — all 3 models
independently flagged that seeding from events.jsonl would cause
immediate kills for sessions >15s old (exact PR #148 regression pattern)
4. **Reuses existing watchdog fire path** — no new IsProcessing cleanup
code, all 8 invariants preserved

### Timeout tiers (3-tier, was 2-tier):
| Condition | Timeout |
|-----------|---------|
| Resumed, zero events since restart | **30s** (NEW) |
| Normal processing, no tools | 120s |
| Active tools / resumed with events / multi-agent | 600s |

## Tests
- **16 new regression guard tests** covering quiescence edge cases, seed
time safety, exhaustive timeout matrix
- Updated existing tests to use \ComputeEffectiveTimeout\ helper
mirroring production 3-tier formula
- **108 total watchdog+recovery tests pass** ✅

## Regression history context
This code has been through 7 PRs of fix/regression cycles (PRs
#141#147#148#153#158#163#164). The most relevant precedent: PR
#148 added a 10s resume timeout that killed active sessions. Our 30s
timeout avoids this by being event-gated and seeded from UtcNow.

Fixes the 'click Stop on every restart' UX issue.

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant