Skip to content

Fix: session response lost when IsProcessing cleared prematurely#158

Merged
PureWeen merged 2 commits intoPureWeen:mainfrom
vitek-karas:fix/session-network-files-access-the-session-20260220-1917
Feb 20, 2026
Merged

Fix: session response lost when IsProcessing cleared prematurely#158
PureWeen merged 2 commits intoPureWeen:mainfrom
vitek-karas:fix/session-network-files-access-the-session-20260220-1917

Conversation

@vitek-karas
Copy link
Contributor

@vitek-karas vitek-karas commented Feb 20, 2026

Bug

Session shows "Thinking" with no response in the UI, but the SDK/CLI side completed the work and is ready for the next message. IsProcessing is false but the response content was never flushed to the message history.

Root Cause

When IsProcessing is cleared prematurely (by watchdog timeout, SessionErrorEvent, or reconnect race condition), accumulated delta content in CurrentResponse (StringBuilder) was silently lost. The subsequent SessionIdleEventCompleteResponse call was skipped because IsProcessing was already false.

Three code paths could trigger this:

  1. Watchdog timeout — fires after 120s/600s of inactivity, clears IsProcessing but doesn't flush CurrentResponse
  2. SessionErrorEvent — clears IsProcessing without flushing accumulated partial response
  3. Reconnect race — new SessionState has ProcessingGeneration=0, replayed SessionIdleEvent from old turn passes generation check (0==0) and clears IsProcessing before retry turn completes

Fix (Commit 1: response flush)

  • CompleteResponse: Flush CurrentResponse even when IsProcessing is already false (events can arrive after premature clearing)
  • Watchdog timeout: Call FlushCurrentResponse() before setting IsProcessing = false
  • SessionErrorEvent: Call FlushCurrentResponse() before setting IsProcessing = false
  • Reconnect path: Increment ProcessingGeneration on new SessionState before retry SendAsync, so replayed SessionIdleEvent has stale generation and is skipped

Fix (Commit 2: branch ancestry)

The "Fix with Copilot" feature created branches from the current HEAD instead of main, causing PRs to include unrelated commits when the repo happened to be on a feature branch.

  • LaunchFixIt fallback: Detect upstream default branch (upstream/main, origin/main, etc.) and base the new branch on it
  • BuildCopilotPrompt: Added step instructing Copilot to verify branch ancestry before pushing
  • copilot-instructions.md: Added "verify branch is based on main" rule to Git Workflow section

Tests

Added ResponseFlushTests.cs with 15 tests covering:

  • Watchdog flush preserves history
  • Error event flush preserves partial response
  • CompleteResponse flush-on-skip behavior
  • Reconnect clears processing state
  • High message count session recovery
  • History integrity across multiple state transitions
  • OnStateChanged fires during abort recovery

When IsProcessing is cleared prematurely (by watchdog timeout, SessionErrorEvent,
or reconnect race), accumulated delta content in CurrentResponse was silently lost.
The next SessionIdleEvent would skip CompleteResponse because IsProcessing was
already false, leaving the UI with no response despite the SDK completing the work.

Fixes:
- CompleteResponse: flush CurrentResponse even when IsProcessing is already false,
  so late-arriving SessionIdleEvent still preserves accumulated content
- Watchdog timeout: call FlushCurrentResponse before clearing IsProcessing
- SessionErrorEvent: call FlushCurrentResponse before clearing IsProcessing
- Reconnect path in SendPromptAsync: increment ProcessingGeneration on the new
  SessionState before retrying SendAsync, preventing replayed SessionIdleEvent
  from the old turn from interfering with the retry turn's completion

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@vitek-karas vitek-karas force-pushed the fix/session-network-files-access-the-session-20260220-1917 branch from 4a7f256 to 99ed9b1 Compare February 20, 2026 19:38
The fallback path in LaunchFixIt() used 'git checkout -b <branch>' which
creates the new branch from whatever commit HEAD points to. If the repo
was on a feature branch, the fix branch would inherit unrelated commits,
polluting the PR.

Changes:
- LaunchFixIt fallback: detect upstream default branch (upstream/main,
  origin/main, etc.) and base the new branch on it
- BuildCopilotPrompt: add step 8 instructing Copilot to verify branch
  ancestry before pushing
- copilot-instructions.md: add 'verify branch is based on main' rule

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@vitek-karas vitek-karas force-pushed the fix/session-network-files-access-the-session-20260220-1917 branch from f28ef76 to b38e6e5 Compare February 20, 2026 19:47
@PureWeen PureWeen merged commit 32c2a0d into PureWeen:main Feb 20, 2026
6 checks passed
PureWeen added a commit that referenced this pull request Feb 21, 2026
Invariant audit found 4 pre-existing gaps where IsProcessing was cleared
without resetting all companion fields:

1. CompleteResponse: missing ActiveToolCallCount reset
2. SendAsync reconnect+retry failure: missing IsResumed, HasUsedToolsThisTurn,
   ActiveToolCallCount, FlushCurrentResponse
3. SendAsync initial failure: same missing fields
4. Remote mode abort: missing IsResumed

These are the exact same class of bug that caused regressions in PRs #148,
#158, and #164. Every path that sets IsProcessing=false must also clear:
IsResumed, HasUsedToolsThisTurn, ActiveToolCallCount, ProcessingStartedAt,
ToolCallCount, ProcessingPhase, and FlushCurrentResponse.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
PureWeen added a commit that referenced this pull request Feb 21, 2026
…edge

Add comprehensive documentation of the recurring stuck-session bug pattern
(7 PRs, 16 fix/regression cycles) to copilot-instructions.md:

- Full cleanup checklist for all IsProcessing=false paths
- Table of all 7 paths with locations
- 7 common mistakes with PR references where each occurred
- Staleness check and IsResumed clearing documentation
- Cross-thread volatile field requirements
- ProcessingGeneration guard explanation
- Watchdog diagnostic log tag additions

This knowledge was hard-won across PRs #141, #147, #148, #153, #158,
#163, #164 and should prevent future regressions by making the invariants
explicit and discoverable.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
PureWeen added a commit that referenced this pull request Feb 25, 2026
## Problem
After app restart, resumed sessions that were mid-turn show
**Thinking...** with a Stop button. The user must manually click Stop
every time. The existing watchdog waited 600s (10 min!) before clearing
stuck IsProcessing.

## Solution
Add a **30s resume quiescence timeout** for sessions that receive zero
SDK events after restart. If no events flow within 30s of app start, the
session is cleared as stuck.

### Key design decisions (informed by 4-model consultation: Opus 4.6,
Sonnet 4.6, Codex 5.3, GPT-5.1):

1. **30s quiescence** — short enough users don't wait, long enough for
SDK reconnect (~5s typical, 6x safety margin)
2. **Event-gated** — only fires when \HasReceivedEventsSinceResume ==
false\. Once events start flowing, transitions to normal 120s/600s
timeout tiers
3. **Seed from DateTime.UtcNow, NOT file time** — all 3 models
independently flagged that seeding from events.jsonl would cause
immediate kills for sessions >15s old (exact PR #148 regression pattern)
4. **Reuses existing watchdog fire path** — no new IsProcessing cleanup
code, all 8 invariants preserved

### Timeout tiers (3-tier, was 2-tier):
| Condition | Timeout |
|-----------|---------|
| Resumed, zero events since restart | **30s** (NEW) |
| Normal processing, no tools | 120s |
| Active tools / resumed with events / multi-agent | 600s |

## Tests
- **16 new regression guard tests** covering quiescence edge cases, seed
time safety, exhaustive timeout matrix
- Updated existing tests to use \ComputeEffectiveTimeout\ helper
mirroring production 3-tier formula
- **108 total watchdog+recovery tests pass** ✅

## Regression history context
This code has been through 7 PRs of fix/regression cycles (PRs
#141#147#148#153#158#163#164). The most relevant precedent: PR
#148 added a 10s resume timeout that killed active sessions. Our 30s
timeout avoids this by being event-gated and seeded from UtcNow.

Fixes the 'click Stop on every restart' UX issue.

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants