Fix sessions stuck in 'Thinking' state after server disconnect #141

Merged
PureWeen merged 4 commits into main from fix/the-current-running-instance-of-polypilo-20260218-1424 on Feb 18, 2026
@PureWeen (Owner)

Problem

When the persistent Copilot server dies mid-turn (while processing a response), sessions get permanently stuck showing "Thinking..." with no way to recover. This happens because:

  1. IsProcessing is set to true when SendAsync is called
  2. The server dies, so no more SDK events arrive (no SessionIdleEvent)
  3. CompleteResponse() is never called → IsProcessing stays true forever
  4. ResponseCompletion.Task never completes → the caller blocks indefinitely
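
The four steps above can be reproduced in isolation. The following is a minimal sketch, not the project's code: `StuckSessionDemo` and `ResponseArrivedWithin` are hypothetical names, and only `IsProcessing`, `ResponseCompletion`, and `CompleteResponse` mirror identifiers from this description.

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical stand-in for the session state described above.
public sealed class StuckSessionDemo
{
    public bool IsProcessing { get; private set; }

    public TaskCompletionSource<string> ResponseCompletion { get; } =
        new(TaskCreationOptions.RunContinuationsAsynchronously);

    // Step 1: sending marks the session as busy.
    public void Send() => IsProcessing = true;

    // Steps 2-3: this is the only place the flag is cleared, and it runs
    // only when the server reports that the turn has finished.
    public void CompleteResponse(string text)
    {
        IsProcessing = false;
        ResponseCompletion.TrySetResult(text);
    }

    // Step 4: returns true only if the response arrived within the window.
    public async Task<bool> ResponseArrivedWithin(TimeSpan window)
    {
        Task finished = await Task.WhenAny(ResponseCompletion.Task, Task.Delay(window));
        return finished == ResponseCompletion.Task;
    }
}
```

If `Send()` is called and `CompleteResponse` never runs, `IsProcessing` stays true and a plain `await ResponseCompletion.Task` never returns, which is the hang described here.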

Root Cause

There was a 10-second timeout for resumed sessions that were mid-turn (protecting against stale state), but no equivalent watchdog for active sessions where the server dies mid-turn.

Fix

Add a processing watchdog that monitors SDK event flow during active turns:

  • LastEventAt timestamp on SessionState, updated on every SDK event
  • StartProcessingWatchdog() launches a background task when IsProcessing is set
  • Checks every 15 seconds; if no events arrive for 120 seconds, declares the connection dead
  • Clears IsProcessing, completes ResponseCompletion, adds a system message to chat history
  • Cancels cleanly on normal completion (SessionIdleEvent, SessionErrorEvent, error paths)

The 120s timeout is generous enough for legitimate long-running tool executions (which still emit progress events) but short enough to recover stuck sessions in reasonable time.
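
A rough sketch of the watchdog loop described above, under stated assumptions: the class name, constructor parameters, and `SystemNotice` property are hypothetical, the intervals are injectable so the sketch can be exercised quickly (the PR uses 15 s and 120 s), and UI-thread marshaling is left out for brevity.

```csharp
#nullable enable
using System;
using System.Threading;
using System.Threading.Tasks;

public sealed class WatchdogSketch
{
    private readonly TimeSpan _checkInterval;      // 15 s in the PR
    private readonly TimeSpan _inactivityTimeout;  // 120 s in the PR
    private long _lastEventAtTicks;
    private CancellationTokenSource? _watchdogCts;

    public WatchdogSketch(TimeSpan checkInterval, TimeSpan inactivityTimeout)
    {
        _checkInterval = checkInterval;
        _inactivityTimeout = inactivityTimeout;
    }

    public bool IsProcessing { get; private set; }
    public string? SystemNotice { get; private set; } // stands in for the chat-history message

    // Result is true for normal completion, false when the watchdog fired.
    public TaskCompletionSource<bool> ResponseCompletion { get; } =
        new(TaskCreationOptions.RunContinuationsAsynchronously);

    // Call from every SDK event handler.
    public void OnSdkEvent() =>
        Interlocked.Exchange(ref _lastEventAtTicks, DateTime.UtcNow.Ticks);

    public void BeginTurn()
    {
        IsProcessing = true;
        OnSdkEvent(); // seed the timestamp so the first check has a baseline
        _watchdogCts = new CancellationTokenSource();
        _ = RunWatchdogAsync(_watchdogCts.Token);
    }

    // Normal completion path: stop the watchdog, then clear the state.
    public void CompleteTurn()
    {
        _watchdogCts?.Cancel();
        _watchdogCts?.Dispose();
        _watchdogCts = null;
        IsProcessing = false;
        ResponseCompletion.TrySetResult(true);
    }

    private async Task RunWatchdogAsync(CancellationToken token)
    {
        try
        {
            while (true)
            {
                await Task.Delay(_checkInterval, token);
                var last = new DateTime(Interlocked.Read(ref _lastEventAtTicks), DateTimeKind.Utc);
                if (DateTime.UtcNow - last < _inactivityTimeout) continue;

                // No events for the whole window: treat the connection as dead.
                SystemNotice = $"No response for {_inactivityTimeout.TotalSeconds:0}s; the turn was cancelled.";
                IsProcessing = false;
                ResponseCompletion.TrySetResult(false);
                return;
            }
        }
        catch (OperationCanceledException)
        {
            // Normal completion cancelled the watchdog; nothing to clean up.
        }
    }
}
```

Because each progress event calls `OnSdkEvent()`, a long tool run that keeps emitting events never reaches the inactivity threshold, while a silent connection does.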

Tests

  • 11 new tests in ProcessingWatchdogTests.cs covering:
    • Watchdog constant validation (reasonable ranges)
    • Demo mode sessions don't get stuck
    • System message content and format
    • Recovery flow (IsProcessing cleared → can send new messages)
    • Cross-reference scenario test
  • All 625 tests pass ✅
  • Mac Catalyst build succeeds ✅

UI Scenario

Added stuck-session-recovery-after-server-disconnect scenario to mode-switch-scenarios.json for manual verification with MauiDevFlow.

PureWeen and others added 4 commits February 18, 2026 08:36
When the persistent Copilot server dies mid-turn, sessions would get
permanently stuck in IsProcessing=true because no SessionIdleEvent
arrives to trigger CompleteResponse(). The ResponseCompletion task
would also never complete, blocking the caller indefinitely.

Add a processing watchdog that monitors event flow during active turns:
- Tracks LastEventAt timestamp on every SDK event received
- Starts a background watchdog when IsProcessing is set in SendPromptAsync
- Checks every 15s; if no events arrive for 120s, clears the stuck state
- Adds a system message to history so the user knows what happened
- Cancels cleanly on normal completion (SessionIdleEvent/SessionErrorEvent)

The watchdog is also reset when reconnection succeeds, and cancelled in
all error paths (send failure, reconnect failure).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Observed during live testing: after relaunch.sh deploys a new build
while an old copilot server is running, the app can silently fail to
restore sessions, showing 0 sessions despite being 'Connected'.

Add tests covering:
- Persistent mode failed init sets NeedsConfiguration
- No sessions stuck after failed init
- IsInitialized correct across mode switches
- Reconnect clears stuck processing from previous mode
- OnStateChanged fires during reconnect
- UI scenario for relaunch-with-stale-server

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

- Marshal watchdog timeout mutations to UI thread via InvokeOnUI to
  avoid racing with CompleteResponse and concurrent History.Add() on
  the List<ChatMessage> (not thread-safe). Re-check IsProcessing on
  UI thread to handle late normal-completion.

- Use Interlocked for LastEventAtTicks (long) instead of DateTime
  property to guarantee atomic reads/writes across threads.

- Dispose CancellationTokenSource in CancelProcessingWatchdog to
  prevent kernel object leaks across many prompts.

- Cancel old watchdog BEFORE creating new SessionState on reconnect
  path — old and new states share Info/ResponseCompletion, so the
  old watchdog could clear IsProcessing mid-retry.

- Cancel all watchdogs in ReconnectAsync and DisposeAsync before
  clearing sessions to prevent orphaned watchdog tasks.
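
The thread-safety points in this commit can be sketched roughly as follows. This is an illustration under assumptions, not the merged code: the class name is hypothetical, the injected `Action<Action>` delegate stands in for `InvokeOnUI`, and `History` holds strings instead of `ChatMessage`.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Threading;

public sealed class ThreadSafeWatchdogState
{
    private readonly Action<Action> _invokeOnUi; // hypothetical stand-in for InvokeOnUI
    private long _lastEventAtTicks = DateTime.UtcNow.Ticks;
    private CancellationTokenSource? _watchdogCts;

    public ThreadSafeWatchdogState(Action<Action> invokeOnUi) => _invokeOnUi = invokeOnUi;

    public bool IsProcessing { get; set; }
    public List<string> History { get; } = new(); // not thread-safe: touch on the UI thread only

    // Storing ticks in a long and going through Interlocked keeps reads and
    // writes atomic across threads.
    public DateTime LastEventAt =>
        new DateTime(Interlocked.Read(ref _lastEventAtTicks), DateTimeKind.Utc);

    public void TouchLastEvent() =>
        Interlocked.Exchange(ref _lastEventAtTicks, DateTime.UtcNow.Ticks);

    public CancellationToken StartWatchdog()
    {
        CancelWatchdog(); // never leave an older watchdog running
        var cts = new CancellationTokenSource();
        _watchdogCts = cts;
        return cts.Token;
    }

    // Cancel and dispose so each prompt does not leak a CancellationTokenSource.
    public void CancelWatchdog()
    {
        var cts = Interlocked.Exchange(ref _watchdogCts, null);
        if (cts is null) return;
        cts.Cancel();
        cts.Dispose();
    }

    // Called by the watchdog task once the inactivity timeout has elapsed.
    public void OnWatchdogTimeout(string message)
    {
        _invokeOnUi(() =>
        {
            // Re-check on the UI thread: the turn may have completed normally
            // between the timeout decision and this callback.
            if (!IsProcessing) return;
            IsProcessing = false;
            History.Add(message);
        });
    }
}
```

The re-check inside the marshaled callback is what makes a late normal completion harmless: whichever path runs first on the UI thread wins, and the other becomes a no-op.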

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…all, dynamic timeout string

- Cancel watchdog in AbortSessionAsync (both local and remote paths)
- Add catch-all exception handler in RunProcessingWatchdogAsync
- Make timeout message dynamic based on WatchdogInactivityTimeoutSeconds

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
PureWeen merged commit 8c7910a into main on Feb 18, 2026
5 checks passed
PureWeen added a commit that referenced this pull request Feb 21, 2026
…edge

Add comprehensive documentation of the recurring stuck-session bug pattern
(7 PRs, 16 fix/regression cycles) to copilot-instructions.md:

- Full cleanup checklist for all IsProcessing=false paths
- Table of all 7 paths with locations
- 7 common mistakes with PR references where each occurred
- Staleness check and IsResumed clearing documentation
- Cross-thread volatile field requirements
- ProcessingGeneration guard explanation
- Watchdog diagnostic log tag additions

This knowledge was hard-won across PRs #141, #147, #148, #153, #158,
#163, #164 and should prevent future regressions by making the invariants
explicit and discoverable.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
PureWeen deleted the fix/the-current-running-instance-of-polypilo-20260218-1424 branch on February 22, 2026 00:15
PureWeen added a commit that referenced this pull request Feb 25, 2026
## Problem
After app restart, resumed sessions that were mid-turn show
**Thinking...** with a Stop button. The user must manually click Stop
every time. The existing watchdog waited 600s (10 min!) before clearing
stuck IsProcessing.

## Solution
Add a **30s resume quiescence timeout** for sessions that receive zero
SDK events after restart. If no events flow within 30s of app start, the
session is cleared as stuck.

### Key design decisions (informed by 4-model consultation: Opus 4.6,
Sonnet 4.6, Codex 5.3, GPT-5.1):

1. **30s quiescence**: short enough that users don't wait, long enough for
SDK reconnect (~5s typical, 6x safety margin)
2. **Event-gated**: only fires when `HasReceivedEventsSinceResume == false`.
Once events start flowing, transitions to normal 120s/600s timeout tiers
3. **Seed from DateTime.UtcNow, NOT file time**: all 3 models
independently flagged that seeding from events.jsonl would cause
immediate kills for sessions >15s old (exact PR #148 regression pattern)
4. **Reuses existing watchdog fire path**: no new IsProcessing cleanup
code, all 8 invariants preserved
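
To make decisions 2 and 3 concrete, here is a small sketch of the quiescence check. The class and method names are hypothetical and the real check lives inside the existing watchdog; only the 30s value and the event gate come from this description.

```csharp
using System;

public static class ResumeQuiescence
{
    public static readonly TimeSpan QuiescenceTimeout = TimeSpan.FromSeconds(30);

    // True when a resumed session that has seen no SDK events since restart
    // should be cleared as stuck.
    public static bool ShouldClear(DateTime seedUtc, DateTime nowUtc, bool hasReceivedEventsSinceResume)
        => !hasReceivedEventsSinceResume && nowUtc - seedUtc >= QuiescenceTimeout;
}
```

Suppose the session last wrote to events.jsonl 45 seconds before the app restarted. Seeded from that file time, `ShouldClear` is already true at the very first check, before the SDK has had a chance to reconnect. Seeded from `DateTime.UtcNow` at startup, it stays false until a full 30 seconds have passed with no events.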

### Timeout tiers (3-tier, was 2-tier):
| Condition | Timeout |
|-----------|---------|
| Resumed, zero events since restart | **30s** (NEW) |
| Normal processing, no tools | 120s |
| Active tools / resumed with events / multi-agent | 600s |
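
The table can be read as a small decision function. The sketch below borrows the `ComputeEffectiveTimeout` name from the test helper mentioned in this commit, but the parameter list, the constant names, and the order of the checks are assumptions made for illustration.

```csharp
using System;

public static class WatchdogTimeouts
{
    // Constant names are hypothetical; the values come from the table above.
    public const int ResumeQuiescenceSeconds = 30;
    public const int InactivitySeconds = 120;
    public const int ExtendedSeconds = 600;

    public static int ComputeEffectiveTimeout(
        bool isResumed,
        bool hasReceivedEventsSinceResume,
        bool hasActiveTools,
        bool isMultiAgent)
    {
        // Tier 1: resumed session with zero events since restart.
        if (isResumed && !hasReceivedEventsSinceResume)
            return ResumeQuiescenceSeconds;

        // Tier 3: active tools, resumed with events flowing, or multi-agent.
        if (hasActiveTools || isResumed || isMultiAgent)
            return ExtendedSeconds;

        // Tier 2: normal processing with no tools.
        return InactivitySeconds;
    }
}
```

Checking the quiescence tier first means a resumed session moves from 30s to 600s as soon as its first event arrives, which matches the event-gated behaviour described in the design decisions.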

## Tests
- **16 new regression guard tests** covering quiescence edge cases, seed
time safety, exhaustive timeout matrix
- Updated existing tests to use `ComputeEffectiveTimeout` helper
mirroring production 3-tier formula
- **108 total watchdog+recovery tests pass** ✅

## Regression history context
This code has been through 7 PRs of fix/regression cycles (PRs
#141, #147, #148, #153, #158, #163, #164). The most relevant precedent: PR
#148 added a 10s resume timeout that killed active sessions. Our 30s
timeout avoids this by being event-gated and seeded from UtcNow.

Fixes the 'click Stop on every restart' UX issue.

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>