Description
When a plan has multiple jobs running and one job fails, the other jobs that have already sent `mc_report(status: "completed")` remain stuck as `running` in the plan status. The plan shows 1 failed and N running, even though the other jobs completed successfully.
This is a race condition in `Orchestrator.setCheckpoint()` → `savePlan()` that overwrites concurrent `updatePlanJob()` updates with a stale plan snapshot. There is no recovery path: once overwritten, the monitor won't re-check those jobs.
Steps to Reproduce
- Create a plan with 3 jobs (no dependencies between them)
- Have one job fail (e.g. exit with a non-zero code)
- Have the other two jobs call `mc_report(status: "completed")` before or around the same time
- Run `mc_plan_status`
Expected Behavior
Plan status shows 1 failed, 2 completed. The plan pauses at the `on_error` checkpoint with the correct job states.
Actual Behavior
Plan status shows 1 failed, 2 running. The two completed jobs are permanently stuck as `running` despite having reported completion.
Root Cause
When `handleJobFailed` fires in `orchestrator.ts`:
- Fire-and-forgets `updatePlanJob(planId, failedJob, { status: 'failed' })`
- Calls `loadPlan()` outside the mutex, getting a stale snapshot in which the sibling jobs are still `running`
- Passes that snapshot to `setCheckpoint()`, which calls `savePlan(plan)`, writing the entire stale plan object back to disk
- This overwrites the `updatePlanJob()` changes that `handleJobComplete` concurrently made for the other jobs
Race sequence:
a. `handleJobFailed`: `loadPlan()` → snapshot `{A=running, B=running, C=running}`
b. `updatePlanJob(A, 'failed')` → disk: `{A=failed, B=running, C=running}`
c. `updatePlanJob(B, 'completed')` → disk: `{A=failed, B=completed, C=running}`
d. `updatePlanJob(C, 'completed')` → disk: `{A=failed, B=completed, C=completed}`
e. `savePlan(staleSnapshot)` → disk: `{A=failed, B=running, C=running}` ← OVERWRITES
The `planMutex` in `savePlan` doesn't prevent this because it writes the passed-in object, not a freshly-read one. The monitor marks jobs completed in `jobs.json` (removing them from `getRunningJobs()`), so they are never re-polled.
Affected code: the `savePlan` calls in `setCheckpoint()`, `clearCheckpoint()`, and `_doReconcile()` in `src/lib/orchestrator.ts`; the blind overwrite in `savePlan()` in `src/lib/plan-state.ts`.
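The lost update can be reproduced in miniature. Even if every write is serialized, saving a previously-loaded object clobbers fields that changed after the load. A minimal sketch (simplified, hypothetical types; an in-memory `stored` variable stands in for the plan file on disk):

```typescript
type JobStatus = "running" | "completed" | "failed";
type Plan = { jobs: Record<string, JobStatus> };

// Stands in for the persisted plan file.
let stored: Plan = { jobs: { A: "running", B: "running", C: "running" } };

const loadPlan = (): Plan => structuredClone(stored);

// Stands in for the mutex-guarded savePlan(): it faithfully writes
// whatever object the caller passes in -- that blind write is the bug.
const savePlan = (plan: Plan): void => {
  stored = structuredClone(plan);
};

// Correct read-modify-write: always re-reads before mutating.
const updatePlanJob = (job: string, status: JobStatus): void => {
  const fresh = loadPlan();
  fresh.jobs[job] = status;
  savePlan(fresh);
};

// Race sequence from the report:
const stale = loadPlan();        // a) handleJobFailed snapshots the plan
updatePlanJob("A", "failed");    // b)
updatePlanJob("B", "completed"); // c)
updatePlanJob("C", "completed"); // d)
stale.jobs["A"] = "failed";
savePlan(stale);                 // e) stale snapshot overwrites b-d

console.log(stored.jobs); // { A: "failed", B: "running", C: "running" }
```

B's and C's completions are gone even though their writes succeeded, which matches the `1 failed, 2 running` state in the report.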
Environment
- OS: macOS
- Discovered during normal plan execution with 3 parallel jobs
Additional Context
Proposed fix:
- Add `updatePlanFields()` to `plan-state.ts`: an atomic read-modify-write for plan-level fields only (status, checkpoint, completedAt, prUrl) inside the mutex, preserving job states
- Replace `savePlan(staleSnapshot)` in `setCheckpoint`/`clearCheckpoint`/`_doReconcile` with `updatePlanFields()`
- Add a reconciliation safety net in `_doReconcile`: cross-reference `jobs.json` for plan jobs stuck as `running` when they've already completed
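A possible shape for `updatePlanFields()`, sketched against an in-memory store with a promise-chain mutex (all names and types here are illustrative assumptions, not the project's actual API):

```typescript
type Plan = {
  status: string;
  checkpoint?: string;
  completedAt?: string;
  prUrl?: string;
  jobs: Record<string, string>;
};

// Only plan-level fields may be patched; job states are untouchable here.
type PlanFields = Partial<Pick<Plan, "status" | "checkpoint" | "completedAt" | "prUrl">>;

// Stands in for the persisted plan file.
let stored: Plan = { status: "running", jobs: { A: "failed", B: "completed" } };

// Stand-in for the project's planMutex: serializes critical sections
// by chaining them onto a single promise tail.
let mutexTail: Promise<void> = Promise.resolve();
function withPlanMutex<T>(fn: () => Promise<T>): Promise<T> {
  const run = mutexTail.then(fn);
  mutexTail = run.then(() => undefined, () => undefined);
  return run;
}

// Atomic read-modify-write: re-reads the plan *inside* the mutex and
// merges only the requested plan-level fields, so job-state updates
// written by concurrent updatePlanJob() calls survive.
async function updatePlanFields(fields: PlanFields): Promise<Plan> {
  return withPlanMutex(async () => {
    const fresh = structuredClone(stored); // loadPlan()
    Object.assign(fresh, fields);          // never touches fresh.jobs
    stored = fresh;                        // savePlan()
    return fresh;
  });
}
```

With this in place, a caller like `setCheckpoint` would do `await updatePlanFields({ status: "paused", checkpoint: "on_error" })` instead of `savePlan(staleSnapshot)`, and the stale snapshot it loaded earlier can no longer clobber sibling job states.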