When the persistent Copilot server dies mid-turn, sessions would get permanently stuck in IsProcessing=true because no SessionIdleEvent arrives to trigger CompleteResponse(). The ResponseCompletion task would also never complete, blocking the caller indefinitely.

Add a processing watchdog that monitors event flow during active turns:
- Tracks LastEventAt timestamp on every SDK event received
- Starts a background watchdog when IsProcessing is set in SendPromptAsync
- Checks every 15s; if no events arrive for 120s, clears the stuck state
- Adds a system message to history so the user knows what happened
- Cancels cleanly on normal completion (SessionIdleEvent/SessionErrorEvent)

The watchdog is also reset when reconnection succeeds, and cancelled in all error paths (send failure, reconnect failure).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Observed during live testing: after relaunch.sh deploys a new build while an old copilot server is running, the app can silently fail to restore sessions, showing 0 sessions despite being 'Connected'.

Add tests covering:
- Persistent mode failed init sets NeedsConfiguration
- No sessions stuck after failed init
- IsInitialized correct across mode switches
- Reconnect clears stuck processing from previous mode
- OnStateChanged fires during reconnect
- UI scenario for relaunch-with-stale-server

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Marshal watchdog timeout mutations to the UI thread via InvokeOnUI to avoid racing with CompleteResponse and concurrent History.Add() on the List<ChatMessage> (not thread-safe). Re-check IsProcessing on the UI thread to handle late normal completion.
- Use Interlocked for LastEventAtTicks (long) instead of a DateTime property to guarantee atomic reads/writes across threads.
- Dispose the CancellationTokenSource in CancelProcessingWatchdog to prevent kernel object leaks across many prompts.
- Cancel the old watchdog BEFORE creating the new SessionState on the reconnect path: old and new states share Info/ResponseCompletion, so the old watchdog could clear IsProcessing mid-retry.
- Cancel all watchdogs in ReconnectAsync and DisposeAsync before clearing sessions to prevent orphaned watchdog tasks.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
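The Interlocked timestamp and the dispose-on-cancel points above could look roughly like the following. This is a minimal sketch, not the code from this PR: `SessionStateSketch`, `MarkEventReceived`, `TimeSinceLastEvent`, and `StartWatchdogToken` are hypothetical names chosen for illustration; only `LastEventAtTicks` and `CancelProcessingWatchdog` are named in the commit message.

```csharp
#nullable enable
using System;
using System.Threading;

// Hypothetical stand-in for the per-session state; only watchdog-related members are shown.
public sealed class SessionStateSketch
{
    // Stored as ticks (long) so reads and writes can go through Interlocked.
    private long _lastEventAtTicks = DateTime.UtcNow.Ticks;
    private CancellationTokenSource? _watchdogCts;

    // Assumed to be called from the SDK event callback on every event.
    public void MarkEventReceived() =>
        Interlocked.Exchange(ref _lastEventAtTicks, DateTime.UtcNow.Ticks);

    // Assumed to be called from the watchdog task; Interlocked.Read avoids torn 64-bit reads.
    public TimeSpan TimeSinceLastEvent(DateTime utcNow) =>
        utcNow - new DateTime(Interlocked.Read(ref _lastEventAtTicks), DateTimeKind.Utc);

    // Hands out a token for a new watchdog, cancelling any previous one first.
    public CancellationToken StartWatchdogToken()
    {
        CancelProcessingWatchdog();
        var cts = new CancellationTokenSource();
        _watchdogCts = cts;
        return cts.Token;
    }

    // Cancels AND disposes the source so repeated prompts do not accumulate undisposed sources.
    public void CancelProcessingWatchdog()
    {
        var cts = Interlocked.Exchange(ref _watchdogCts, null);
        if (cts is null) return;
        cts.Cancel();
        cts.Dispose();
    }
}
```

Swapping the field to null before cancelling means a second caller racing into `CancelProcessingWatchdog` sees null and returns, so the source is cancelled and disposed exactly once.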
…all, dynamic timeout string

- Cancel watchdog in AbortSessionAsync (both local and remote paths)
- Add catch-all exception handler in RunProcessingWatchdogAsync
- Make timeout message dynamic based on WatchdogInactivityTimeoutSeconds

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
PureWeen added a commit that referenced this pull request Feb 21, 2026
…edge

Add comprehensive documentation of the recurring stuck-session bug pattern (7 PRs, 16 fix/regression cycles) to copilot-instructions.md:
- Full cleanup checklist for all IsProcessing=false paths
- Table of all 7 paths with locations
- 7 common mistakes with PR references where each occurred
- Staleness check and IsResumed clearing documentation
- Cross-thread volatile field requirements
- ProcessingGeneration guard explanation
- Watchdog diagnostic log tag additions

This knowledge was hard-won across PRs #141, #147, #148, #153, #158, #163, #164 and should prevent future regressions by making the invariants explicit and discoverable.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
PureWeen added a commit that referenced this pull request Feb 25, 2026
## Problem

After app restart, resumed sessions that were mid-turn show **Thinking...** with a Stop button. The user must manually click Stop every time. The existing watchdog waited 600s (10 min!) before clearing stuck IsProcessing.

## Solution

Add a **30s resume quiescence timeout** for sessions that receive zero SDK events after restart. If no events flow within 30s of app start, the session is cleared as stuck.

### Key design decisions (informed by 4-model consultation: Opus 4.6, Sonnet 4.6, Codex 5.3, GPT-5.1):

1. **30s quiescence**: short enough users don't wait, long enough for SDK reconnect (~5s typical, 6x safety margin)
2. **Event-gated**: only fires when `HasReceivedEventsSinceResume == false`. Once events start flowing, transitions to normal 120s/600s timeout tiers
3. **Seed from DateTime.UtcNow, NOT file time**: all 3 models independently flagged that seeding from events.jsonl would cause immediate kills for sessions >15s old (exact PR #148 regression pattern)
4. **Reuses existing watchdog fire path**: no new IsProcessing cleanup code, all 8 invariants preserved

### Timeout tiers (3-tier, was 2-tier):

| Condition | Timeout |
|-----------|---------|
| Resumed, zero events since restart | **30s** (NEW) |
| Normal processing, no tools | 120s |
| Active tools / resumed with events / multi-agent | 600s |

## Tests

- **16 new regression guard tests** covering quiescence edge cases, seed time safety, exhaustive timeout matrix
- Updated existing tests to use `ComputeEffectiveTimeout` helper mirroring production 3-tier formula
- **108 total watchdog+recovery tests pass** ✅

## Regression history context

This code has been through 7 PRs of fix/regression cycles (PRs #141→#147→#148→#153→#158→#163→#164). The most relevant precedent: PR #148 added a 10s resume timeout that killed active sessions. Our 30s timeout avoids this by being event-gated and seeded from UtcNow.

Fixes the 'click Stop on every restart' UX issue.

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
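The 3-tier table in the description above can be read as a small decision function. The sketch below is an illustration only: the name `ComputeEffectiveTimeout` comes from the description, but its parameters, the constant names, and the precedence (the 30s quiescence tier is checked first) are assumptions, not the production formula.

```csharp
using System;

public static class WatchdogTimeoutSketch
{
    // Tier values taken from the timeout table; the constant names are hypothetical.
    public const int ResumeQuiescenceSeconds = 30;
    public const int InactivitySeconds = 120;
    public const int ExtendedSeconds = 600;

    // Assumed signature: the real helper may take a session object instead of flags.
    public static int ComputeEffectiveTimeout(
        bool isResumed,
        bool hasReceivedEventsSinceResume,
        bool hasActiveTools,
        bool isMultiAgent)
    {
        // Event-gated tier: resumed after restart and no SDK event seen yet.
        if (isResumed && !hasReceivedEventsSinceResume)
            return ResumeQuiescenceSeconds;

        // Long tier: active tools, a resumed session that has events flowing, or multi-agent work.
        if (hasActiveTools || isResumed || isMultiAgent)
            return ExtendedSeconds;

        // Default tier: normal processing without tools.
        return InactivitySeconds;
    }
}
```

Under these assumptions, a resumed session moves from the 30s tier to the 600s tier as soon as its first event arrives, which matches the "event-gated" decision above.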
Problem
When the persistent Copilot server dies mid-turn (while processing a response), sessions get permanently stuck showing "Thinking..." with no way to recover. This happens because:
- IsProcessing is set to true when SendAsync is called
- The server dies, so no completion event arrives (SessionIdleEvent)
- CompleteResponse() is never called → IsProcessing stays true forever
- ResponseCompletion.Task never completes → the caller blocks indefinitely

Root Cause
There was a 10-second timeout for resumed sessions that were mid-turn (protecting against stale state), but no equivalent watchdog for active sessions where the server dies mid-turn.
Fix
Add a processing watchdog that monitors SDK event flow during active turns:
- LastEventAt timestamp on SessionState, updated on every SDK event
- StartProcessingWatchdog() launches a background task when IsProcessing is set
- Checks every 15s; if no events arrive for 120s, clears IsProcessing, completes ResponseCompletion, adds a system message to chat history
- Cancelled on normal completion (SessionIdleEvent, SessionErrorEvent, error paths)

The 120s timeout is generous enough for legitimate long-running tool executions (which still emit progress events) but short enough to recover stuck sessions in reasonable time.
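The check-every-15s, fire-after-120s behaviour described in this section can be sketched as a cancellable polling loop. This is a simplified illustration under stated assumptions, not the PR's `RunProcessingWatchdogAsync`: the class name, the delegate parameters, and the boolean return value are hypothetical, and in the app the `onTimeout` callback would be the place to marshal to the UI thread, re-check IsProcessing, clear it, complete ResponseCompletion, and add the system message.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class ProcessingWatchdogSketch
{
    // Returns true if the watchdog fired on inactivity, false if it was cancelled
    // because the turn completed normally (or an error path cancelled it).
    public static async Task<bool> RunAsync(
        Func<TimeSpan> timeSinceLastEvent, // hypothetical: reads the last-event timestamp
        Action onTimeout,                  // hypothetical: clears the stuck state
        TimeSpan checkInterval,            // 15s in the description
        TimeSpan inactivityTimeout,        // 120s in the description
        CancellationToken ct)
    {
        try
        {
            while (true)
            {
                // Task.Delay throws OperationCanceledException when ct is cancelled.
                await Task.Delay(checkInterval, ct);

                if (timeSinceLastEvent() >= inactivityTimeout)
                {
                    onTimeout();
                    return true;
                }
            }
        }
        catch (OperationCanceledException)
        {
            // Normal completion path: the watchdog is no longer needed.
            return false;
        }
    }
}
```

Passing the interval and timeout as parameters keeps the loop testable with millisecond values instead of waiting for the real 15s/120s settings.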
Tests
ProcessingWatchdogTests.cs covering:

UI Scenario
Added stuck-session-recovery-after-server-disconnect scenario to mode-switch-scenarios.json for manual verification with MauiDevFlow.