Skip to content
75 changes: 75 additions & 0 deletions .claude/skills/processing-state-safety/SKILL.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,75 @@
---
name: processing-state-safety
description: >
Safety guide for modifying IsProcessing, watchdog, session resume, or abort code paths in CopilotService.
Use when: (1) Modifying any code that sets IsProcessing to true or false, (2) Changing watchdog timeout
logic or adding new timeout paths, (3) Touching session resume/restore logic, (4) Modifying
AbortSessionAsync or CompleteResponse, (5) Adding new processing-related fields to AgentSessionInfo
or SessionState, (6) Debugging sessions stuck in "Thinking" state, (7) Reviewing PRs that touch
CopilotService.Events.cs, CopilotService.cs, or CopilotService.Utilities.cs processing paths.
---

# Processing State Safety Guide

Modifying processing state code involves these steps:

1. Identify which of the 7 cleanup paths you're touching
2. Apply the cleanup checklist to your change
3. Verify all 7 paths still satisfy the checklist
4. Ensure thread safety rules are followed

If debugging a stuck session, see [references/regression-history.md](references/regression-history.md)
for the 7 common mistakes and full regression timeline across 7 PRs.

## The Cleanup Checklist

Every code path that sets `IsProcessing = false` MUST perform ALL of these:

```csharp
FlushCurrentResponse(state); // BEFORE clearing — saves accumulated response
CancelProcessingWatchdog(state);
Interlocked.Exchange(ref state.ActiveToolCallCount, 0);
state.HasUsedToolsThisTurn = false;
state.Info.IsResumed = false;
state.Info.IsProcessing = false;
state.Info.ProcessingStartedAt = null;
state.Info.ToolCallCount = 0;
state.Info.ProcessingPhase = 0;
```

Skip any field not applicable to the path (e.g., remote mode has no `ActiveToolCallCount`).

## The 7 Cleanup Paths

| # | Path | Location | Notes |
|---|------|----------|-------|
| 1 | CompleteResponse | Events.cs ~L699 | Normal completion via SessionIdleEvent (saves response inline, not via FlushCurrentResponse) |
| 2 | SessionErrorEvent | Events.cs ~L517 | SDK error — wrapped in InvokeOnUI |
| 3 | Watchdog timeout | Events.cs ~L1192 | InvokeOnUI + generation guard |
| 4 | AbortSessionAsync (local) | CopilotService.cs ~L1681 | User clicks Stop |
| 5 | AbortSessionAsync (remote) | CopilotService.cs ~L1638 | Remote mode optimistic clear |
| 6 | SendAsync reconnect failure | CopilotService.cs ~L1600 | Reconnect+retry failed |
| 7 | SendAsync initial failure | CopilotService.cs ~L1613 | First send attempt failed |
| 8 | Bridge OnTurnEnd | Bridge.cs ~L127 | Remote mode normal turn completion — InvokeOnUI |

**When adding a new field to AgentSessionInfo or SessionState**, add its reset to ALL 8 paths.
**When adding a new cleanup path**, copy the full checklist from an existing path (path 3 is the most complete).

## Key Watchdog Rules

- **Two timeout tiers**: 120s inactivity, 600s tool execution
- **600s triggers when**: `ActiveToolCallCount > 0` OR `IsResumed` OR `HasUsedToolsThisTurn`
- **Never add timeouts shorter than 120s** for resume — tool calls gap 30-60s between events
- **`ActiveToolCallCount` returns to 0 between tool rounds** — `AssistantTurnStartEvent` resets it to 0 (line ~365). Between rounds the model reasons about the next tool call, so `hasActiveTool` is 0 even though the session is actively working. Always check `HasUsedToolsThisTurn` too
- **IsResumed clearing** must guard on `!hasActiveTool && !HasUsedToolsThisTurn`
- **Staleness check**: `IsSessionStillProcessing` uses `File.GetLastWriteTimeUtc` >600s = idle

## Thread Safety Rules

- All `state.Info.*` mutations from background threads → `InvokeOnUI()`
- `HasUsedToolsThisTurn`, `HasReceivedEventsSinceResume` → `Volatile.Write`/`Read`
- `ActiveToolCallCount` → `Interlocked` operations only
- Capture `ProcessingGeneration` before `SyncContext.Post`, verify inside callback

For detailed thread safety patterns and the full regression history, see
[references/regression-history.md](references/regression-history.md).
Original file line number Diff line number Diff line change
@@ -0,0 +1,67 @@
# Regression History & Common Mistakes

## 7 Mistakes That Keep Recurring

### 1. Forgetting companion fields on error paths
**What happens**: Clear `IsProcessing` and `ProcessingPhase` but forget `IsResumed` or
`HasUsedToolsThisTurn`. Next turn inherits stale 600s timeout or stale tool state.
**PRs where this happened**: #148, #158, #164

### 2. Missing FlushCurrentResponse before clearing
**What happens**: Accumulated `CurrentResponse` (StringBuilder) content is silently lost.
User sees "Thinking..." then nothing — the partial response vanishes.
**PRs where this happened**: #158

### 3. Using ActiveToolCallCount alone as tool signal
**What happens**: `AssistantTurnStartEvent` resets `ActiveToolCallCount` to 0 between tool rounds
(line ~365 in Events.cs). Between rounds, the model reasons about the next tool call, so
`hasActiveTool` is 0 even though the session is actively working.
`HasUsedToolsThisTurn` persists across tool rounds and is the reliable signal.
**PRs where this happened**: #148, #163

### 4. Adding hardcoded short timeouts for resume
**What happens**: A 10s timeout kills sessions that are legitimately processing (tool calls
take 30-60s between events). The watchdog's tiered 120s/600s approach is the correct mechanism.
**PRs where this happened**: #148

### 5. Mutating state on background threads
**What happens**: SDK events arrive on worker threads. `IsProcessing` write on a background
thread races with Blazor rendering on the UI thread. Use `InvokeOnUI()` for all `state.Info.*`
mutations from background code.
**PRs where this happened**: #147, #148, #163

### 6. Clearing IsResumed without checking tool activity
**What happens**: After resume, the dedup path leaves `ActiveToolCallCount` at 0, so
`hasActiveTool` is false. If you clear `IsResumed` based only on `HasReceivedEventsSinceResume`,
the 600s timeout drops to 120s and kills resumed mid-tool sessions.
**Guard condition**: `!hasActiveTool && !HasUsedToolsThisTurn`
**PRs where this happened**: #163

### 7. InvokeAsync in HandleComplete (Dashboard.razor)
**What happens**: `HandleComplete` is called from `CompleteResponse` via
`Invoke(SyncContext.Post)` — already on UI thread. Wrapping in `InvokeAsync` defers
`StateHasChanged()` to next render cycle, causing stale "Thinking" indicators.
**PRs where this happened**: #153

## Full Regression Timeline

| PR | What broke | Root cause |
|----|-----------|------------|
| #141 | 120s timeout killed tool executions | Single timeout tier too aggressive |
| #147 | Stale IDLE killed new turns | Missing ProcessingGeneration guard |
| #148 | 10s resume timeout killed active sessions | Hardcoded short timeout |
| #148 | 120s during tool loops | ActiveToolCallCount reset between rounds |
| #148 | IsResumed leaked → permanent 600s | Not cleared on abort/error/watchdog |
| #153 | Stale "Thinking" renders | InvokeAsync deferred StateHasChanged |
| #158 | Response content silently lost | No FlushCurrentResponse before clearing |
| #163 | Resumed mid-tool killed at 120s | IsResumed cleared without tool guard |
| #164 | Processing fields not reset on error | New fields added to only some paths |

## Thread Safety Details

1. **All `state.Info.*` mutations from background threads** → `InvokeOnUI()`
2. **`HasUsedToolsThisTurn`, `HasReceivedEventsSinceResume`** → `Volatile.Write` on set, `Volatile.Read` on check (ARM memory model)
3. **`ActiveToolCallCount`** → `Interlocked.Increment`/`Decrement`/`Exchange` (concurrent tool starts/completions)
4. **`LastEventAtTicks`** → `Interlocked.Exchange`/`Read` (long requires atomic ops)
5. **`ProcessingGeneration`** → `Interlocked.Increment` on send, `Interlocked.Read` on check
6. **`ProcessingGeneration` guard**: `SyncContext.Post` is async — a new `SendPromptAsync` can execute between the Post() and the callback. Capture generation before posting, verify inside callback.
15 changes: 5 additions & 10 deletions .github/copilot-instructions.md
Original file line number Diff line number Diff line change
Expand Up @@ -157,14 +157,10 @@ All three are reset in `SendPromptAsync` (new turn) and cleared in `CompleteResp
The UI shows: "Sending…" → "Server connected…" → "Thinking…" → "Working · Xm Xs · N tool calls…".

### Abort Behavior
`AbortSessionAsync` must clear ALL processing state:
- `IsProcessing = false`, `IsResumed = false`
- `ProcessingStartedAt = null`, `ToolCallCount = 0`, `ProcessingPhase = 0`
- `MessageQueue.Clear()` — prevents queued messages from auto-sending after abort
- `_queuedImagePaths.TryRemove()` — clears associated image attachments
- `CancelProcessingWatchdog()` and `ResponseCompletion.TrySetCanceled()`
`AbortSessionAsync` must clear ALL processing state — see `.claude/skills/processing-state-safety/SKILL.md` for the full cleanup checklist and the 7 paths that clear `IsProcessing`.

In remote mode, the mobile client optimistically clears all fields and delegates to the bridge server.
### ⚠️ IsProcessing Cleanup Invariant
**CRITICAL**: Every code path that sets `IsProcessing = false` must clear 9 companion fields and call `FlushCurrentResponse`. This is the most recurring bug category (7 PRs, 16 fix/regression cycles). **Read `.claude/skills/processing-state-safety/SKILL.md` before modifying ANY processing path.** There are 8 such paths across CopilotService.cs, Events.cs, and Bridge.cs.

### Processing Watchdog
The processing watchdog (`RunProcessingWatchdogAsync` in `CopilotService.Events.cs`) detects stuck sessions by checking how long since the last SDK event. It checks every 15 seconds and has two timeout tiers:
Expand All @@ -174,9 +170,7 @@ The processing watchdog (`RunProcessingWatchdogAsync` in `CopilotService.Events.
- The session was resumed mid-turn after app restart (`IsResumed`)
- Tools have been used this turn (`HasUsedToolsThisTurn`) — even between tool rounds when the model is thinking

The 10-second resume timeout was removed — the watchdog handles all stuck-session detection.

When the watchdog fires, it marshals state mutations to the UI thread via `InvokeOnUI()` and adds a system warning message. All code paths that set `IsProcessing = false` must go through the UI thread.
When the watchdog fires, it marshals state mutations to the UI thread via `InvokeOnUI()` and adds a system warning message.

### Diagnostic Log Tags
The event diagnostics log (`~/.polypilot/event-diagnostics.log`) uses these tags:
Expand All @@ -189,6 +183,7 @@ The event diagnostics log (`~/.polypilot/event-diagnostics.log`) uses these tags
- `[ABORT]` — user-initiated abort cleared IsProcessing
- `[BRIDGE-COMPLETE]` — bridge OnTurnEnd cleared IsProcessing
- `[INTERRUPTED]` — app restart detected interrupted turn (watchdog timeout after resume)
- `[WATCHDOG]` — watchdog clearing IsResumed or timing out a stuck session

Every code path that sets `IsProcessing = false` MUST have a diagnostic log entry. This is critical for debugging stuck-session issues.

Expand Down
28 changes: 28 additions & 0 deletions PolyPilot.Tests/Scenarios/mode-switch-scenarios.json
Original file line number Diff line number Diff line change
Expand Up @@ -602,6 +602,34 @@
"note": "Processing status text (elapsed time / tool rounds / waiting) should be visible"
}
]
},
{
"id": "stale-session-not-marked-processing-on-restore",
"name": "Stale sessions are NOT marked as processing on restore",
"description": "When a session's events.jsonl hasn't been modified in over 10 minutes, the session should not be restored with IsProcessing=true, even if the last event was an 'active' type. This prevents hours-old sessions from showing a permanent 'Thinking' indicator.",
"unitTestCoverage": [
"StuckSessionRecoveryTests.IsSessionStillProcessing_StaleFile_ReturnsFalse",
"StuckSessionRecoveryTests.IsSessionStillProcessing_RecentFile_ActiveEvent_ReturnsTrue",
"StuckSessionRecoveryTests.IsSessionStillProcessing_RecentFile_IdleEvent_ReturnsFalse",
"StuckSessionRecoveryTests.StalenessThreshold_UsesWatchdogToolExecutionTimeout"
],
"steps": [
{
"action": "evaluate",
"script": "document.querySelector('.status')?.className",
"expect": { "contains": "connected" },
"note": "App should be connected"
},
{
"action": "evaluate",
"script": "Array.from(document.querySelectorAll('.session-card.processing, .session-item .processing-dot')).length",
"note": "Count sessions showing processing state — stale sessions should not be here"
},
{
"action": "note",
"text": "Manual verification: check debug info for any sessions with IsProcessing=true where LastUpdatedAt is over 10 minutes old. Such sessions should have been detected as stale during restore."
}
]
}
]
}
Loading
Loading