Conversation
Two fixes prevent sessions from permanently showing 'Thinking' state: 1. Staleness check in IsSessionStillProcessing: If events.jsonl hasn't been modified in over 600 seconds (watchdog tool timeout), the session is treated as idle regardless of the last event type. This handles the case where the CLI finished processing hours ago but the events.jsonl last event was an 'active' type (e.g. assistant.message_delta). 2. Clear IsResumed in watchdog after events flow: On resumed sessions, IsResumed=true forces the 600s tool-execution timeout. After events start arriving (HasReceivedEventsSinceResume), the watchdog clears IsResumed so the shorter 120s inactivity timeout applies. The HasUsedToolsThisTurn flag still preserves the 600s timeout when tools are actively executing. Includes 21 new tests covering staleness detection, active/inactive event type classification, corrupt file handling, and IsResumed logic. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…tests - HIGH: Don't clear IsResumed when tools are active or have been used this turn. The deduplication path in ToolExecutionStartEvent skips the ActiveToolCallCount increment, so a resumed mid-tool session could have its 600s timeout incorrectly downgraded to 120s. - MEDIUM: Wrap IsResumed=false in InvokeOnUI consistent with all other state.Info mutations in the watchdog. - MEDIUM: Replace 2 self-fulfilling tests with 4 tests that cover all combinations of the watchdog guard condition (events+tools, events+no tools, no events, events+hasUsedToolsThisTurn). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
9f68e11 to
1686ceb
Compare
Invariant audit found 4 pre-existing gaps where IsProcessing was cleared without resetting all companion fields: 1. CompleteResponse: missing ActiveToolCallCount reset 2. SendAsync reconnect+retry failure: missing IsResumed, HasUsedToolsThisTurn, ActiveToolCallCount, FlushCurrentResponse 3. SendAsync initial failure: same missing fields 4. Remote mode abort: missing IsResumed These are the exact same class of bug that caused regressions in PRs #148, #158, and #164. Every path that sets IsProcessing=false must also clear: IsResumed, HasUsedToolsThisTurn, ActiveToolCallCount, ProcessingStartedAt, ToolCallCount, ProcessingPhase, and FlushCurrentResponse. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…edge Add comprehensive documentation of the recurring stuck-session bug pattern (7 PRs, 16 fix/regression cycles) to copilot-instructions.md: - Full cleanup checklist for all IsProcessing=false paths - Table of all 7 paths with locations - 7 common mistakes with PR references where each occurred - Staleness check and IsResumed clearing documentation - Cross-thread volatile field requirements - ProcessingGeneration guard explanation - Watchdog diagnostic log tag additions This knowledge was hard-won across PRs #141, #147, #148, #153, #158, #163, #164 and should prevent future regressions by making the invariants explicit and discoverable. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The IsProcessing cleanup invariant documentation was bloating copilot-instructions.md (loaded into every session). Move the detailed knowledge — 7 invariants, regression history table, common mistakes with PR references, thread safety rules — into .claude/skills/processing-state-safety/SKILL.md. The skill has front matter so it's loaded on-demand when working on processing/watchdog/resume code. The main instructions keep a short pointer + essentials (344→286 lines, 25KB→20KB). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Follow skill-creator guidelines: - SKILL.md is now a concise procedural workflow (74 lines, not a knowledge dump) with 4 clear steps: identify path, apply checklist, verify all 7 paths, check thread safety - Moved regression history, common mistakes, and thread safety details to references/regression-history.md (loaded only when debugging) - Progressive disclosure: SKILL.md has the checklist and rules, references/ has the context and history Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
1. Bridge OnTurnEnd (8th path): Was missing IsResumed, ProcessingStartedAt, ToolCallCount, ProcessingPhase cleanup. Also moved all state mutations inside InvokeOnUI to fix thread safety (Opus finding). 2. Corrected factually wrong comment: The dedup path does NOT skip ActiveToolCallCount++ — the increment is at line 277, before the dedup break at 291. The real reason HasUsedToolsThisTurn is needed is that AssistantTurnStartEvent resets ActiveToolCallCount to 0 between rounds. (Sonnet + GPT 5.3 consensus finding) 3. Updated skill docs: Added 8th path to table, noted CompleteResponse saves response inline (not via FlushCurrentResponse), corrected the dedup claim in both SKILL.md and regression-history.md. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
PureWeen
added a commit
that referenced
this pull request
Feb 25, 2026
## Problem After app restart, resumed sessions that were mid-turn show **Thinking...** with a Stop button. The user must manually click Stop every time. The existing watchdog waited 600s (10 min!) before clearing stuck IsProcessing. ## Solution Add a **30s resume quiescence timeout** for sessions that receive zero SDK events after restart. If no events flow within 30s of app start, the session is cleared as stuck. ### Key design decisions (informed by 4-model consultation: Opus 4.6, Sonnet 4.6, Codex 5.3, GPT-5.1): 1. **30s quiescence** — short enough users don't wait, long enough for SDK reconnect (~5s typical, 6x safety margin) 2. **Event-gated** — only fires when \HasReceivedEventsSinceResume == false\. Once events start flowing, transitions to normal 120s/600s timeout tiers 3. **Seed from DateTime.UtcNow, NOT file time** — all 3 models independently flagged that seeding from events.jsonl would cause immediate kills for sessions >15s old (exact PR #148 regression pattern) 4. **Reuses existing watchdog fire path** — no new IsProcessing cleanup code, all 8 invariants preserved ### Timeout tiers (3-tier, was 2-tier): | Condition | Timeout | |-----------|---------| | Resumed, zero events since restart | **30s** (NEW) | | Normal processing, no tools | 120s | | Active tools / resumed with events / multi-agent | 600s | ## Tests - **16 new regression guard tests** covering quiescence edge cases, seed time safety, exhaustive timeout matrix - Updated existing tests to use \ComputeEffectiveTimeout\ helper mirroring production 3-tier formula - **108 total watchdog+recovery tests pass** ✅ ## Regression history context This code has been through 7 PRs of fix/regression cycles (PRs #141→#147→#148→#153→#158→#163→#164). The most relevant precedent: PR #148 added a 10s resume timeout that killed active sessions. Our 30s timeout avoids this by being event-gated and seeded from UtcNow. Fixes the 'click Stop on every restart' UX issue. --------- Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Bug
Sessions restored from disk could get permanently stuck showing 'Thinking' (IsProcessing=true) when the CLI backend had already finished processing. The debug info showed:
Root Cause
Two issues allowed sessions to stay in stuck 'processing' state:
No staleness check on restore: \IsSessionStillProcessing\ checked the last event TYPE in events.jsonl but not its AGE. If the CLI finished processing hours ago but the last event was an 'active' type, the session was incorrectly marked as still processing.
IsResumed kept watchdog timeout at 600s too long: Resumed sessions used the tool-execution timeout (600s) until the first \CompleteResponse. If the CLI sent non-terminal events but never \SessionIdleEvent, the session stayed in the long-timeout mode indefinitely.
Fix
Tests