Closed
Labels
P1: high (important fix or feature, next up after critical), enhancement (new feature or request)
Description
Problem
When a plan job completes but has modified files outside its touchSet, the orchestrator marks the job as failed and pauses the plan at an on_error checkpoint. Currently the only suggested action is:
Fix the branch and retry with
mc_plan_approve(checkpoint: "on_error", retry: "{jobName}")
This has several problems:
- No accept path. If the violations are legitimate (e.g., the agent needed to update a shared types file), there's no way to say "these changes are fine, proceed."
- Retry doesn't actually retry. mc_plan_approve(retry: "job") sets the job to ready_to_merge, which bypasses touchSet re-validation entirely. It doesn't relaunch the agent.
- No correction path. If the violations are actually wrong, there's no mechanism to relaunch the agent with context about what to fix.
- Invalid state transition. completed → failed was missing from VALID_JOB_TRANSITIONS, producing a console warning. (Fixed separately: failed was added to completed's transition list.)
Proposed Solution
Three distinct actions for mc_plan_approve when a touchSet violation occurs:
Path 1 — Accept
The user reviews the violations and determines they're valid.
- mc_plan_approve(checkpoint: "on_error"): accepts the checkpoint job's violations
- Moves the specific touchSet-failed job from failed → ready_to_merge
- Uses structured checkpoint context (checkpointContext.jobName + failureKind: "touchset") instead of parsing error strings
- Only acts on the job that triggered the checkpoint, not all failed jobs
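The accept path can be sketched as follows. This is a minimal illustration, not the actual handler: the PlanJob shape and the acceptTouchSetViolations name are hypothetical, and only the checkpointContext fields come from this proposal.

```typescript
// Hypothetical shapes for illustration; only the checkpointContext fields
// (jobName, failureKind, touchSetViolations) are taken from the proposal.
type FailureKind = 'touchset' | 'merge_conflict' | 'test_failure' | 'job_failed';

interface CheckpointContext {
  jobName: string;
  failureKind: FailureKind;
  touchSetViolations?: string[];
}

interface PlanJob {
  name: string;
  status: 'running' | 'completed' | 'failed' | 'ready_to_merge';
}

// Accept violations for the single job named in the checkpoint context.
// Returns the accepted job's name, or null when nothing qualifies.
function acceptTouchSetViolations(
  jobs: PlanJob[],
  ctx: CheckpointContext | undefined,
): string | null {
  // Decide from structured context, not by parsing an error string.
  if (!ctx || ctx.failureKind !== 'touchset') return null;

  // Only the job that triggered the checkpoint is touched.
  const job = jobs.find((j) => j.name === ctx.jobName);
  if (!job || job.status !== 'failed') return null;

  job.status = 'ready_to_merge';
  return job.name;
}
```

Other failed jobs in the plan are left untouched, which matches the "only the triggering job" rule above.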
Path 2 — Correct (Relaunch)
The user wants the agent to fix the violations.
- mc_plan_approve(checkpoint: "on_error", relaunch: "jobName")
- New relaunchJobForCorrection method on Orchestrator (separate from launchJob)
- Reuses the existing worktree and branch (changes are already there)
- Kills the old tmux session if still alive
- Constructs a correction prompt containing:
  - The original task prompt
  - The specific violations (which files)
  - The allowed touchSet patterns
  - Instructions to revert violating files without breaking intended work
- Creates a new tmux session in the existing worktree
- Sets pane-died hook, updates job entry in place (preserves identity)
- Job transitions: failed → running
- On completion, touchSet re-validates normally through the reconciler
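One possible shape for the correction prompt is sketched below. The function name buildCorrectionPrompt and the exact wording are assumptions made for illustration; the four sections mirror the list above.

```typescript
// Hypothetical helper: assembles the four prompt sections listed in the proposal.
// The wording of each section is illustrative only.
function buildCorrectionPrompt(
  originalPrompt: string,
  violations: string[],
  touchSet: string[],
): string {
  return [
    'Your previous run modified files outside the allowed touchSet.',
    '',
    'Original task:',
    originalPrompt,
    '',
    'Files modified outside the touchSet:',
    ...violations.map((file) => `- ${file}`),
    '',
    'Allowed touchSet patterns:',
    ...touchSet.map((pattern) => `- ${pattern}`),
    '',
    'Revert the violating files without breaking the intended work.',
  ].join('\n');
}

// Example usage with the paths from the notification example in this issue.
const examplePrompt = buildCorrectionPrompt(
  'Add search indexes to the database layer',
  ['src/types/search.ts', 'src/utils/format.ts'],
  ['src/db/**'],
);
console.log(examplePrompt);
```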
Path 3 — Retry (existing, fixed)
The user manually fixed the branch and wants re-validation.
- mc_plan_approve(checkpoint: "on_error", retry: "jobName"): existing param, corrected behavior
- Re-runs validateTouchSet before moving to ready_to_merge
- If still violating, stays failed and reports remaining violations
- Fixes the current bug where retry skips validation entirely
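The corrected retry flow might look like the sketch below. The real validateTouchSet signature is not shown in this issue, so findViolations, retryAfterManualFix, and the simplified glob matcher (supporting only * and **) are stand-ins, not the project's implementation.

```typescript
// Simplified glob matcher, an assumption for this sketch:
// "**" matches across directories, "*" matches within one path segment.
function globToRegExp(glob: string): RegExp {
  // Escape regex metacharacters, leaving "*" for the replacements below.
  const escaped = glob.replace(/[.+?^${}()|[\]\\]/g, '\\$&');
  const pattern = escaped
    .replace(/\*\*/g, '\u0000') // placeholder so the next step skips "**"
    .replace(/\*/g, '[^/]*')
    .replace(/\u0000/g, '.*');
  return new RegExp(`^${pattern}$`);
}

// Stand-in for validateTouchSet: returns changed files matching no pattern.
function findViolations(changedFiles: string[], touchSet: string[]): string[] {
  const matchers = touchSet.map(globToRegExp);
  return changedFiles.filter((file) => !matchers.some((re) => re.test(file)));
}

type RetryResult =
  | { status: 'ready_to_merge' }
  | { status: 'failed'; violations: string[] };

// Retry re-validates first; the job only advances when nothing violates.
function retryAfterManualFix(changedFiles: string[], touchSet: string[]): RetryResult {
  const violations = findViolations(changedFiles, touchSet);
  return violations.length === 0
    ? { status: 'ready_to_merge' }
    : { status: 'failed', violations };
}
```

Returning the remaining violations alongside the failed status lets the caller report them without re-running validation.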
State Machine Changes
// Add to VALID_JOB_TRANSITIONS:
completed: ['ready_to_merge', 'failed', 'stopped', 'canceled'], // already done
failed: ['ready_to_merge', 'running', 'stopped', 'canceled'], // add 'running'
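A minimal sketch of how the transition table could be checked is shown below. Only the two rows above come from this proposal; the JobStatus union and the canTransition helper are assumptions, and the remaining rows of the real table are omitted.

```typescript
// Assumed status union, limited to the statuses named in this issue.
type JobStatus =
  | 'running'
  | 'completed'
  | 'failed'
  | 'ready_to_merge'
  | 'stopped'
  | 'canceled';

// Only the two rows discussed in the proposal; other rows are left out here.
const VALID_JOB_TRANSITIONS: Partial<Record<JobStatus, JobStatus[]>> = {
  completed: ['ready_to_merge', 'failed', 'stopped', 'canceled'],
  failed: ['ready_to_merge', 'running', 'stopped', 'canceled'],
};

// Hypothetical helper: true when the table allows moving from one status to another.
function canTransition(from: JobStatus, to: JobStatus): boolean {
  return VALID_JOB_TRANSITIONS[from]?.includes(to) ?? false;
}

console.log(canTransition('failed', 'running')); // relaunch path
console.log(canTransition('completed', 'failed')); // touchSet violation detected
```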
Implementation Details
Structured Checkpoint Context
Store failure metadata instead of relying on error string parsing:
// On PlanSpec or alongside checkpoint
checkpointContext?: {
jobName: string;
failureKind: 'touchset' | 'merge_conflict' | 'test_failure' | 'job_failed';
touchSetViolations?: string[]; // file paths
};
Relaunch Method
New relaunchJobForCorrection(job, violations, touchSet) on Orchestrator that:
- Kills old tmux session/pane if alive
- Writes correction prompt + launcher script to existing worktree
- Creates new tmux session pointing at existing worktree
- Sets pane-died hook
- Updates existing job entry in place (new tmux target, reset timestamps, increment attempt count)
- Sets plan job status to running
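The in-place update step could be sketched like this. The JobEntry fields and the applyRelaunch name are hypothetical; the sketch only illustrates keeping the job's identity while swapping the tmux target, resetting timestamps, and incrementing the attempt count.

```typescript
// Hypothetical job entry shape; field names are assumptions for this sketch.
interface JobEntry {
  id: string;
  name: string;
  worktreePath: string;
  branch: string;
  tmuxTarget: string;
  status: 'running' | 'completed' | 'failed' | 'ready_to_merge';
  startedAt: string;
  completedAt?: string;
  attempt: number;
}

// Mutates the existing entry so id, worktree, and branch are preserved.
function applyRelaunch(job: JobEntry, newTmuxTarget: string, now: Date = new Date()): JobEntry {
  job.tmuxTarget = newTmuxTarget; // point at the freshly created session
  job.status = 'running'; // failed -> running
  job.startedAt = now.toISOString(); // reset timestamps
  delete job.completedAt;
  job.attempt += 1; // shown to the user on repeated corrections
  return job;
}
```

Mutating the entry instead of creating a new one means anything that refers to the job by id keeps working after a relaunch.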
Updated Notification Message
❌ Job "{name}" modified files outside its touchSet:
Violations: src/types/search.ts, src/utils/format.ts
Allowed: src/db/**
Options:
• Accept violations: mc_plan_approve(checkpoint: "on_error")
• Agent fixes branch: mc_plan_approve(checkpoint: "on_error", relaunch: "{name}")
• You fix, re-check: mc_plan_approve(checkpoint: "on_error", retry: "{name}")
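The message above could be produced by a small formatter such as the following sketch. The function name formatTouchSetNotification is an assumption; the text of each line follows the example message.

```typescript
// Hypothetical formatter that reproduces the notification layout shown above.
function formatTouchSetNotification(
  name: string,
  violations: string[],
  touchSet: string[],
): string {
  return [
    `❌ Job "${name}" modified files outside its touchSet:`,
    `Violations: ${violations.join(', ')}`,
    `Allowed: ${touchSet.join(', ')}`,
    'Options:',
    '• Accept violations: mc_plan_approve(checkpoint: "on_error")',
    `• Agent fixes branch: mc_plan_approve(checkpoint: "on_error", relaunch: "${name}")`,
    `• You fix, re-check: mc_plan_approve(checkpoint: "on_error", retry: "${name}")`,
  ].join('\n');
}

console.log(
  formatTouchSetNotification(
    'add-search',
    ['src/types/search.ts', 'src/utils/format.ts'],
    ['src/db/**'],
  ),
);
```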
Files to Modify
- src/lib/plan-types.ts: Add failed → running transition, checkpoint context types
- src/lib/orchestrator.ts: relaunchJobForCorrection method, checkpoint context storage, updated notifications
- src/tools/plan-approve.ts: Handle accept (no retry/relaunch), relaunch param, fix retry to re-validate
- src/lib/job-state.ts: Support in-place job updates for relaunch
Edge Cases
- Correction agent also violates touchSet: Normal re-validation catches it. Infinite manual retries are fine since each requires explicit user action. Track attempt count and display it.
- Multiple jobs fail before plan pauses: Checkpoint context stores the specific job that triggered the pause. Other failures are handled separately.
- Old tmux session still alive on relaunch: Kill deterministically before creating new session to prevent stale pane-died hooks from misfiring.