setCheckpoint overwrites concurrent job state updates — completed plan jobs stuck as running

## Description

When a plan has multiple jobs running and one job fails, the other jobs that have already sent `mc_report(status: "completed")` remain stuck as `running` in the plan status. The plan shows 1 failed and N running, even though the other jobs completed successfully.

This is a race condition in `Orchestrator.setCheckpoint()` → `savePlan()` that overwrites concurrent `updatePlanJob()` updates with a stale plan snapshot. There is no recovery path — once overwritten, the monitor won't re-check those jobs.

## Steps to Reproduce

1. Create a plan with 3 jobs (no dependencies between them)
2. Have one job fail (e.g. exit with a non-zero code)
3. Have the other two jobs call `mc_report(status: "completed")` at or around the same time
4. Run `mc_plan_status`

## Expected Behavior

Plan status shows 1 failed, 2 completed. The plan pauses at the `on_error` checkpoint with the correct job states.

## Actual Behavior

Plan status shows 1 failed, 2 running. The two completed jobs are permanently stuck as `running` despite having reported completion.

## Root Cause

When `handleJobFailed` fires in `orchestrator.ts`, it:

1. Calls `updatePlanJob(planId, failedJob, { status: 'failed' })` without awaiting it (fire-and-forget)
2. Calls `loadPlan()` **outside the mutex**, which returns a snapshot where sibling jobs are still `running`
3. Passes that snapshot to `setCheckpoint()`, which calls `savePlan(plan)` and writes the **entire** plan object back to disk
4. By the time that write lands the snapshot is stale, so it overwrites the `updatePlanJob()` changes that `handleJobComplete` made concurrently for the other jobs

```
Race sequence:
  a. handleJobFailed: loadPlan() → snapshot {A=running, B=running, C=running}
  b. updatePlanJob(A, 'failed')    → disk: {A=failed, B=running, C=running}
  c. updatePlanJob(B, 'completed') → disk: {A=failed, B=completed, C=running}
  d. updatePlanJob(C, 'completed') → disk: {A=failed, B=completed, C=completed}
  e. savePlan(staleSnapshot)       → disk: {A=failed, B=running, C=running} ← OVERWRITES
```

The `planMutex` in `savePlan` doesn't prevent this because it writes the passed-in object, not a freshly-read one. The monitor marks jobs `completed` in `jobs.json` (removing them from `getRunningJobs()`), so they are never re-polled.
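The lost update can be modelled in a few lines. This is a minimal sketch, not the project's code: the in-memory `disk`, the promise-chain `withPlanMutex`, and the single-plan signatures (no `planId`) are all assumptions made for illustration. The failed job's write is awaited here only to make the ordering deterministic; in the real code it is fire-and-forget.

```typescript
// Hypothetical, simplified model of plan-state.ts. Names mirror the issue,
// but every body below is an assumption, not the project's implementation.
type JobStatus = "running" | "completed" | "failed";
interface Plan {
  checkpoint: string | null;
  jobs: Record<string, JobStatus>;
}

const clone = <T>(value: T): T => JSON.parse(JSON.stringify(value)) as T;

// In-memory stand-in for the plan file on disk.
let disk: Plan = {
  checkpoint: null,
  jobs: { A: "running", B: "running", C: "running" },
};

// Minimal promise-chain mutex: callbacks run one at a time, in call order.
let tail: Promise<unknown> = Promise.resolve();
function withPlanMutex<T>(fn: () => T): Promise<T> {
  const next = tail.then(fn);
  tail = next.catch(() => undefined);
  return next;
}

const loadPlan = (): Plan => clone(disk);

// Read-modify-write of a single job, entirely inside the mutex.
function updatePlanJob(job: string, status: JobStatus): Promise<void> {
  return withPlanMutex(() => {
    const plan = loadPlan();
    plan.jobs[job] = status;
    disk = plan;
  });
}

// Blind overwrite: holds the mutex, but writes whatever object it was handed.
function savePlan(plan: Plan): Promise<void> {
  return withPlanMutex(() => {
    disk = clone(plan);
  });
}

async function reproduceRace(): Promise<Plan> {
  await updatePlanJob("A", "failed");    // awaited only for a deterministic demo
  const stale = loadPlan();              // snapshot taken outside the mutex
  void updatePlanJob("B", "completed");  // concurrent completion, queued
  void updatePlanJob("C", "completed");  // concurrent completion, queued
  stale.checkpoint = "on_error";
  await savePlan(stale);                 // queued last, writes stale job states
  return loadPlan();
}
```

In this model every write is serialized by the mutex, yet B and C still end up as `running`, because the data being written was read before the mutex was acquired.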

**Affected code:** `setCheckpoint()`, `clearCheckpoint()`, and `_doReconcile()` `savePlan` calls in `src/lib/orchestrator.ts`; `savePlan()` blind overwrite in `src/lib/plan-state.ts`.

## Environment

- OS: macOS
- Discovered during normal plan execution with 3 parallel jobs

## Additional Context

**Proposed fix:**

1. Add `updatePlanFields()` to `plan-state.ts` — atomic read-modify-write for plan-level fields only (status, checkpoint, completedAt, prUrl) inside the mutex, preserving job states
2. Replace `savePlan(staleSnapshot)` in `setCheckpoint` / `clearCheckpoint` / `_doReconcile` with `updatePlanFields()`
3. Add a reconciliation safety net in `_doReconcile`: cross-reference `jobs.json` to find plan jobs still marked `running` that have already completed, and correct them
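One possible shape for the proposed fix is sketched below. It is illustrative only: the `Plan` shape, the in-memory `disk`, the promise-chain mutex, the `"paused"` status value, and the `reconcileStuckJobs` helper are assumptions, and the real functions would also take a `planId` and read from the actual plan file and `jobs.json`.

```typescript
// Hypothetical sketch of the proposed fix; not the project's implementation.
type JobStatus = "running" | "completed" | "failed";
interface Plan {
  status: string;
  checkpoint: string | null;
  completedAt?: string;
  prUrl?: string;
  jobs: Record<string, JobStatus>;
}
// Only plan-level fields may be patched; job states are deliberately excluded.
type PlanFields = Partial<Pick<Plan, "status" | "checkpoint" | "completedAt" | "prUrl">>;

const clone = <T>(value: T): T => JSON.parse(JSON.stringify(value)) as T;

// In-memory stand-in for the plan file on disk.
let disk: Plan = {
  status: "running",
  checkpoint: null,
  jobs: { A: "running", B: "running", C: "running" },
};

// Minimal promise-chain mutex: callbacks run one at a time, in call order.
let tail: Promise<unknown> = Promise.resolve();
function withPlanMutex<T>(fn: () => T): Promise<T> {
  const next = tail.then(fn);
  tail = next.catch(() => undefined);
  return next;
}

function updatePlanJob(job: string, status: JobStatus): Promise<void> {
  return withPlanMutex(() => {
    const plan = clone(disk);
    plan.jobs[job] = status;
    disk = plan;
  });
}

// Atomic read-modify-write: the plan is re-read inside the mutex, so job
// states written by concurrent updatePlanJob() calls are preserved.
function updatePlanFields(fields: PlanFields): Promise<Plan> {
  return withPlanMutex(() => {
    const fresh = clone(disk);
    const next: Plan = { ...fresh, ...fields, jobs: fresh.jobs };
    disk = next;
    return clone(next);
  });
}

// Safety net: for plan jobs still marked running, adopt the terminal state
// recorded in jobs.json (passed in here as a plain map).
function reconcileStuckJobs(jobsJson: Record<string, JobStatus>): Promise<string[]> {
  return withPlanMutex(() => {
    const plan = clone(disk);
    const fixed: string[] = [];
    for (const [name, status] of Object.entries(plan.jobs)) {
      const actual = jobsJson[name];
      if (status === "running" && actual !== undefined && actual !== "running") {
        plan.jobs[name] = actual;
        fixed.push(name);
      }
    }
    disk = plan;
    return fixed;
  });
}

async function demoFix(): Promise<Plan> {
  await updatePlanJob("A", "failed");
  void updatePlanJob("B", "completed"); // queued before the checkpoint write
  void updatePlanJob("C", "completed");
  return updatePlanFields({ status: "paused", checkpoint: "on_error" });
}
```

Because `updatePlanFields()` never accepts a caller-supplied `jobs` map, a stale snapshot held by `setCheckpoint` cannot clobber job states. The reconcile helper only covers plans that were already corrupted before the fix, or any write path that still bypasses the mutex.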
