setCheckpoint overwrites concurrent job state updates — completed plan jobs stuck as running #63

@nigel-dev

Description

When a plan has multiple jobs running and one job fails, the other jobs that have already sent mc_report(status: "completed") remain stuck as running in the plan status. The plan shows 1 failed and N running, even though the other jobs completed successfully.

This is a race condition in Orchestrator.setCheckpoint() → savePlan(), which overwrites concurrent updatePlanJob() updates with a stale plan snapshot. There is no recovery path: once the job states are overwritten, the monitor won't re-check those jobs.

Steps to Reproduce

  1. Create a plan with 3 jobs (no dependencies between them)
  2. Have one job fail (e.g. exit with non-zero code)
  3. Have the other two jobs call mc_report(status: "completed") shortly before, or at about the same time as, the first job fails
  4. Run mc_plan_status

Expected Behavior

Plan status shows 1 failed, 2 completed. The plan pauses at on_error checkpoint with the correct job states.

Actual Behavior

Plan status shows 1 failed, 2 running. The two completed jobs are permanently stuck as running despite having reported completion.

Root Cause

When handleJobFailed fires in orchestrator.ts:

  1. Fire-and-forgets updatePlanJob(planId, failedJob, { status: 'failed' })
  2. Calls loadPlan() outside the mutex — gets a stale snapshot where sibling jobs are still running
  3. Passes that snapshot to setCheckpoint(), which calls savePlan(plan) — writing the entire stale plan object back to disk
  4. That write overwrites the updatePlanJob() changes that handleJobComplete concurrently made for the other jobs

Race sequence:
  a. handleJobFailed: loadPlan() → snapshot {A=running, B=running, C=running}
  b. updatePlanJob(A, 'failed')    → disk: {A=failed, B=running, C=running}
  c. updatePlanJob(B, 'completed') → disk: {A=failed, B=completed, C=running}
  d. updatePlanJob(C, 'completed') → disk: {A=failed, B=completed, C=completed}
  e. savePlan(staleSnapshot)       → disk: {A=failed, B=running, C=running} ← OVERWRITES

The planMutex in savePlan doesn't prevent this because it writes the passed-in object, not a freshly-read one. The monitor marks jobs completed in jobs.json (removing them from getRunningJobs()), so they are never re-polled.
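
The lost update can be replayed deterministically in a few lines. This is a minimal sketch, not the project's code: the Plan shape, the in-memory `disk` string standing in for the plan file, the 'paused' status value, and the simplified function signatures are all assumptions; only the names loadPlan, savePlan, and updatePlanJob come from this report. It also assumes the failed job's update lands before the snapshot is taken, which matches the observed result.

```typescript
type JobStatus = 'running' | 'completed' | 'failed';

// Hypothetical plan shape, for illustration only.
interface Plan {
  status: string;
  checkpoint: string | null;
  jobs: Record<string, JobStatus>;
}

const initialPlan: Plan = {
  status: 'running',
  checkpoint: null,
  jobs: { A: 'running', B: 'running', C: 'running' },
};

// Stand-in for the plan file; JSON text so every load returns a fresh copy.
let disk = JSON.stringify(initialPlan);

function loadPlan(): Plan {
  return JSON.parse(disk) as Plan;
}

// Blind overwrite: writes whatever object it is given, fresh or stale.
function savePlan(plan: Plan): void {
  disk = JSON.stringify(plan);
}

// Read-modify-write of a single job's status.
function updatePlanJob(jobId: string, status: JobStatus): void {
  const plan = loadPlan();
  plan.jobs[jobId] = status;
  savePlan(plan);
}

function replayRace(): Plan {
  disk = JSON.stringify(initialPlan);
  updatePlanJob('A', 'failed');        // failed job is recorded
  const staleSnapshot = loadPlan();    // handler snapshot: B and C still running
  updatePlanJob('B', 'completed');     // sibling completions land on disk
  updatePlanJob('C', 'completed');
  staleSnapshot.status = 'paused';     // setCheckpoint mutates the snapshot...
  staleSnapshot.checkpoint = 'on_error';
  savePlan(staleSnapshot);             // ...and writes all of it back
  return loadPlan();
}

const afterRace = replayRace();
console.log(afterRace.jobs); // { A: 'failed', B: 'running', C: 'running' }
```

Holding a lock around savePlan alone would not change this outcome, because the stale data was read before the lock was taken.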

Affected code: setCheckpoint(), clearCheckpoint(), and _doReconcile() savePlan calls in src/lib/orchestrator.ts; savePlan() blind overwrite in src/lib/plan-state.ts.

Environment

  • OS: macOS
  • Discovered during normal plan execution with 3 parallel jobs

Additional Context

Proposed fix:

  1. Add updatePlanFields() to plan-state.ts — atomic read-modify-write for plan-level fields only (status, checkpoint, completedAt, prUrl) inside the mutex, preserving job states
  2. Replace savePlan(staleSnapshot) in setCheckpoint / clearCheckpoint / _doReconcile with updatePlanFields()
  3. Add reconciliation safety net in _doReconcile — cross-reference jobs.json for plan jobs stuck as running when they've already completed
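
A rough sketch of what items 1 and 3 could look like, under stated assumptions: the Plan shape, the in-memory `planFile` string, the 'paused' status value, the simplified signatures, and the helper name reconcileStuckJobs are hypothetical. Locking is omitted here; in the real module the body of updatePlanFields would run while holding planMutex.

```typescript
type JobStatus = 'running' | 'completed' | 'failed';

// Hypothetical plan shape, for illustration only.
interface Plan {
  status: string;
  checkpoint: string | null;
  completedAt: string | null;
  prUrl: string | null;
  jobs: Record<string, JobStatus>;
}

// Plan-level fields only; job states are deliberately excluded.
type PlanFields = Partial<Pick<Plan, 'status' | 'checkpoint' | 'completedAt' | 'prUrl'>>;

const initialPlan: Plan = {
  status: 'running', checkpoint: null, completedAt: null, prUrl: null,
  jobs: { A: 'running', B: 'running', C: 'running' },
};

// Stand-in for the plan file; JSON text so every load returns a fresh copy.
let planFile = JSON.stringify(initialPlan);

function loadPlan(): Plan { return JSON.parse(planFile) as Plan; }
function savePlan(plan: Plan): void { planFile = JSON.stringify(plan); }

function updatePlanJob(jobId: string, status: JobStatus): void {
  const plan = loadPlan();
  plan.jobs[jobId] = status;
  savePlan(plan);
}

// Re-read the plan at write time and merge only plan-level fields,
// so job states written by other handlers are preserved.
function updatePlanFields(fields: PlanFields): Plan {
  const fresh = loadPlan();
  const next: Plan = { ...fresh, ...fields, jobs: fresh.jobs };
  savePlan(next);
  return next;
}

// Safety net: copy terminal states from the monitor's records (jobs.json)
// onto plan jobs that are still marked running. Returns the repaired job ids.
function reconcileStuckJobs(monitorJobs: Record<string, JobStatus>): string[] {
  const plan = loadPlan();
  const repaired: string[] = [];
  for (const jobId of Object.keys(plan.jobs)) {
    const actual = monitorJobs[jobId];
    if (plan.jobs[jobId] === 'running' && actual !== undefined && actual !== 'running') {
      plan.jobs[jobId] = actual;
      repaired.push(jobId);
    }
  }
  if (repaired.length > 0) savePlan(plan);
  return repaired;
}

// Same interleaving as the race sequence, with the checkpoint written
// through updatePlanFields instead of savePlan(staleSnapshot).
function replayWithFix(): Plan {
  planFile = JSON.stringify(initialPlan);
  updatePlanJob('A', 'failed');
  updatePlanJob('B', 'completed');
  updatePlanJob('C', 'completed');
  updatePlanFields({ status: 'paused', checkpoint: 'on_error' });
  return loadPlan();
}

console.log(replayWithFix().jobs); // { A: 'failed', B: 'completed', C: 'completed' }
```

The merge takes `jobs` from the freshly read plan rather than from the caller, so a handler holding an old snapshot can no longer roll job states back. The reconciliation helper only repairs plans that were already corrupted before such a fix is deployed.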

Metadata

Assignees

No one assigned

    Labels

    P0: critical (Must fix immediately, blocks core functionality)
    bug (Something isn't working)
