feat: add 3-tier outcome evaluation and re-prompt cycle to supervisor (t128.3)#378
Conversation
… (t128.3) Enhance supervisor-helper.sh with outcome evaluation and retry intelligence: - 3-tier evaluation: deterministic signals, heuristic error patterns, AI eval (Sonnet) - extract_log_metadata(): structured log parsing (signals, PR URLs, error counts) - evaluate_with_ai(): dispatch cheap Sonnet call for ambiguous outcomes (~30s) - cmd_reprompt: re-prompt workers in existing worktree with failure context - cmd_evaluate: manual evaluation command for debugging stuck tasks - Max retries marks tasks as blocked (not failed) for human investigation - Pulse cycle now uses re-prompt instead of naive re-dispatch - Updated help text and subagent-index.toon with new commands Zero ShellCheck violations.
|
Warning You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again! |
WalkthroughThe PR extends the supervisor-helper script with log analysis utilities and AI-assisted task evaluation capabilities. New functions extract log metadata and perform multi-tier evaluation (deterministic, heuristic, AI-based). Additional CLI commands enable manual task re-prompting and evaluation, with state machine adjustments to accommodate new fields and outcomes. Changes
Sequence Diagram(s)sequenceDiagram
actor User
participant Main as main()
participant Eval as evaluate_worker()
participant Heur as Heuristic Check
participant AI as evaluate_with_ai()
participant Claude as Claude API
participant DB as Database
participant Log as Log File
User->>Main: cmd_evaluate task_id [--no-ai]
Main->>Eval: evaluate_worker(task_id, skip_ai)
Eval->>Log: extract_log_metadata()
Log-->>Eval: log size, completion, errors
Eval->>Heur: deterministic patterns
alt Deterministic Decision
Heur-->>Eval: complete/failed
else Ambiguous
alt skip_ai == false
Eval->>AI: evaluate_with_ai()
AI->>Log: read full log
AI->>Claude: dispatch evaluation
Claude-->>AI: VERDICT line
AI->>DB: store audit entry
AI-->>Eval: verdict result
else no-ai flag
Eval->>Eval: defer to heuristic
end
end
Eval-->>Main: result (complete/retry/blocked/failed)
Main-->>User: display evaluation report
sequenceDiagram
actor User
participant Main as main()
participant Reprompt as cmd_reprompt()
participant DB as Database
participant Worker as worker process
participant Dispatch as Task Dispatch
User->>Main: cmd_reprompt task_id [--prompt "text"]
Main->>Reprompt: cmd_reprompt task_id [--prompt]
Reprompt->>DB: read task + session_id
DB-->>Reprompt: task metadata
Reprompt->>Reprompt: construct AI-assisted prompt
Reprompt->>DB: increment retry count, check max_retries
alt retry_count < max_retries
Reprompt->>DB: update task state (retrying)
Reprompt->>Reprompt: create new retry log
Reprompt->>Dispatch: dispatch task in worktree
Dispatch->>Worker: spawn worker process
Worker-->>Dispatch: task in progress
else max retries exceeded
Reprompt->>DB: transition to blocked + escalation
end
Reprompt-->>User: confirm re-prompt + dispatch
Estimated Code Review Effort🎯 3 (Moderate) | ⏱️ ~20 minutes Poem
🚥 Pre-merge checks | ✅ 3✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
🔍 Code Quality Report�[0;35m[MONITOR]�[0m Code Review Monitoring Report �[0;34m[INFO]�[0m Latest Quality Status: �[0;34m[INFO]�[0m Recent monitoring activity: 📈 Current Quality Metrics
Generated on: Fri Feb 6 04:52:58 UTC 2026 Generated by AI DevOps Framework Code Review Monitoring |
|
There was a problem hiding this comment.
Actionable comments posted: 5
🤖 Fix all issues with AI agents
In @.agent/scripts/supervisor-helper.sh:
- Around line 1565-1569: The TASK_COMPLETE branch currently only treats
exit_code=="0" as task-only complete; update the logic so that when meta[signal]
== "TASK_COMPLETE" (regardless of meta[exit_code]) the script echoes
"complete:task_only" and returns 0, thereby making TASK_COMPLETE a tier-1 signal
that bypasses the heuristic fallback (modify the conditional around the meta
array check for "TASK_COMPLETE" in the supervisor-helper.sh snippet to ignore
exit_code or add a separate branch that handles meta[signal]=="TASK_COMPLETE"
first).
- Around line 1706-1723: The VERDICT parsing is too restrictive and the verdict
is inserted into SQL unescaped; update the grep/regex used to extract
verdict_line (the variable verdict_line and the pattern matching around
VERDICT:) to allow digits, hyphens and underscores (e.g., include 0-9, - and _
in the token characters or use a more permissive non-whitespace capture for the
detail portion) so strings like VERDICT:retry:rate-limited and
VERDICT:complete:pr_123 match, then after stripping the VERDICT: prefix when
assigning local verdict use the sql_escape helper on that value in the sqlite3
INSERT (replace 'AI eval verdict: $verdict' with a version that uses sql_escape
"$verdict") to prevent SQL injection.
- Around line 1977-1980: The variable dispatched_count is being incremented
during Phase 1 (in the cmd_reprompt block) but is later re-declared (shadowed)
in Phase 2, so Phase 1 increments are lost; to fix, declare and initialize
dispatched_count (e.g., local dispatched_count=0) together with the other Phase
1 counters before any Phase 1 logic (so increments in the cmd_reprompt branch
are applied to the same variable), then remove the duplicate local
dispatched_count declaration in Phase 2 so the final pulse summary uses the
single shared counter.
- Around line 1782-1787: The retry-exceeded path currently transitions the task
to "failed" causing a state mismatch downstream; update the VALID_TRANSITIONS
array to include the "retrying:blocked" transition and change the cmd_reprompt
logic (the cmd_reprompt function where it checks tretries >= tmax_retries) to
call cmd_transition "$task_id" "blocked" (preserving the existing --error
payload) instead of transitioning to "failed" so that cmd_pulse and escalation
logic see the intended "blocked" state and cmd_cleanup won't remove the
worktree.
| local rate_limit_count=0 auth_error_count=0 conflict_count=0 timeout_count=0 oom_count=0 | ||
| rate_limit_count=$(grep -ci 'rate.limit\|429\|too many requests' "$log_file" 2>/dev/null || echo 0) | ||
| auth_error_count=$(grep -ci 'permission denied\|unauthorized\|403\|401' "$log_file" 2>/dev/null || echo 0) | ||
| conflict_count=$(grep -ci 'merge conflict\|CONFLICT\|conflict marker' "$log_file" 2>/dev/null || echo 0) | ||
| timeout_count=$(grep -ci 'timeout\|timed out\|ETIMEDOUT' "$log_file" 2>/dev/null || echo 0) | ||
| oom_count=$(grep -ci 'out of memory\|OOM\|heap.*exceeded\|ENOMEM' "$log_file" 2>/dev/null || echo 0) | ||
|
|
||
| echo "rate_limit_count=$rate_limit_count" | ||
| echo "auth_error_count=$auth_error_count" | ||
| echo "conflict_count=$conflict_count" | ||
| echo "timeout_count=$timeout_count" | ||
| echo "oom_count=$oom_count" |
There was a problem hiding this comment.
Heuristic grep patterns are too broad — false positives will cause incorrect evaluations.
The short numeric patterns 429, 401, 403 match in non-HTTP contexts (timestamps like 14:29:xx, port numbers, line counts). The case-insensitive OOM matches substrings in "room", "zoom", "bloom", etc. Since evaluate_worker treats any count > 0 as a definitive signal (blocked:auth_error, blocked:out_of_memory), a single false positive will misclassify the outcome.
Proposed fix: use word boundaries and more specific patterns
- rate_limit_count=$(grep -ci 'rate.limit\|429\|too many requests' "$log_file" 2>/dev/null || echo 0)
- auth_error_count=$(grep -ci 'permission denied\|unauthorized\|403\|401' "$log_file" 2>/dev/null || echo 0)
- conflict_count=$(grep -ci 'merge conflict\|CONFLICT\|conflict marker' "$log_file" 2>/dev/null || echo 0)
- timeout_count=$(grep -ci 'timeout\|timed out\|ETIMEDOUT' "$log_file" 2>/dev/null || echo 0)
- oom_count=$(grep -ci 'out of memory\|OOM\|heap.*exceeded\|ENOMEM' "$log_file" 2>/dev/null || echo 0)
+ rate_limit_count=$(grep -cP -i 'rate.limit|\b429\b|too many requests' "$log_file" 2>/dev/null || echo 0)
+ auth_error_count=$(grep -cP -i 'permission denied|unauthorized|\b403\b|\b401\b' "$log_file" 2>/dev/null || echo 0)
+ conflict_count=$(grep -ci 'merge conflict\|CONFLICT\|conflict marker' "$log_file" 2>/dev/null || echo 0)
+ timeout_count=$(grep -ci 'timeout\|timed out\|ETIMEDOUT' "$log_file" 2>/dev/null || echo 0)
+ oom_count=$(grep -cP -i 'out of memory|\bOOM\b|heap.*exceeded|ENOMEM' "$log_file" 2>/dev/null || echo 0)As per coding guidelines, .agent/scripts/*.sh should focus on reliability and robustness — false positive error detection undermines both.
| # TASK_COMPLETE with clean exit = partial success (PR phase may have failed) | ||
| if [[ "${meta[signal]:-}" == "TASK_COMPLETE" && "${meta[exit_code]:-}" == "0" ]]; then | ||
| echo "complete:task_only" | ||
| return 0 | ||
| fi |
There was a problem hiding this comment.
TASK_COMPLETE with non-zero exit falls through to heuristics that may misclassify.
When the log contains TASK_COMPLETE but exit ≠ 0, the flow falls to the heuristic tier. If the worker encountered transient errors earlier in the run (rate limits, timeouts) that were internally recovered, those counts will still be > 0, causing the heuristic to incorrectly return retry:rate_limited or blocked:auth_error instead of recognizing the task actually completed.
Consider treating TASK_COMPLETE (regardless of exit code) as at least complete:task_only — or as a distinct tier-1 signal that bypasses heuristics.
🤖 Prompt for AI Agents
In @.agent/scripts/supervisor-helper.sh around lines 1565 - 1569, The
TASK_COMPLETE branch currently only treats exit_code=="0" as task-only complete;
update the logic so that when meta[signal] == "TASK_COMPLETE" (regardless of
meta[exit_code]) the script echoes "complete:task_only" and returns 0, thereby
making TASK_COMPLETE a tier-1 signal that bypasses the heuristic fallback
(modify the conditional around the meta array check for "TASK_COMPLETE" in the
supervisor-helper.sh snippet to ignore exit_code or add a separate branch that
handles meta[signal]=="TASK_COMPLETE" first).
| # Parse the VERDICT line from AI response | ||
| local verdict_line | ||
| verdict_line=$(echo "$ai_result" | grep -o 'VERDICT:[a-z]*:[a-z_]*' | head -1 || true) | ||
|
|
||
| if [[ -n "$verdict_line" ]]; then | ||
| # Strip VERDICT: prefix and return | ||
| local verdict="${verdict_line#VERDICT:}" | ||
| log_info "AI eval for $task_id: $verdict" | ||
|
|
||
| # Store AI evaluation in state log for audit trail | ||
| sqlite3 "$SUPERVISOR_DB" " | ||
| INSERT INTO state_log (task_id, from_state, to_state, reason) | ||
| VALUES ('$(sql_escape "$task_id")', 'evaluating', 'evaluating', | ||
| 'AI eval verdict: $verdict'); | ||
| " 2>/dev/null || true | ||
|
|
||
| echo "$verdict" | ||
| return 0 |
There was a problem hiding this comment.
🛠️ Refactor suggestion | 🟠 Major
Verdict regex is too restrictive and the value isn't SQL-escaped.
Two issues:
-
Regex
VERDICT:[a-z]*:[a-z_]*rejects hyphens and digits in the detail field. The AI is likely to return verdicts likeVERDICT:retry:rate-limitedorVERDICT:complete:pr_123— both would fail to match, causing the function to return "unparseable" and falling back toretry:ambiguous_ai_unavailableon every ambiguous case. -
SQL safety —
$verdictis interpolated directly into the SQL INSERT at Line 1719 withoutsql_escape. Currently safe due to the regex constraint, but fragile — any future relaxation of the regex creates an injection path.
Proposed fix
- verdict_line=$(echo "$ai_result" | grep -o 'VERDICT:[a-z]*:[a-z_]*' | head -1 || true)
+ verdict_line=$(echo "$ai_result" | grep -oE 'VERDICT:[a-z]+:[a-z0-9_-]+' | head -1 || true) sqlite3 "$SUPERVISOR_DB" "
INSERT INTO state_log (task_id, from_state, to_state, reason)
VALUES ('$(sql_escape "$task_id")', 'evaluating', 'evaluating',
- 'AI eval verdict: $verdict');
+ 'AI eval verdict: $(sql_escape "$verdict")');
" 2>/dev/null || true📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| # Parse the VERDICT line from AI response | |
| local verdict_line | |
| verdict_line=$(echo "$ai_result" | grep -o 'VERDICT:[a-z]*:[a-z_]*' | head -1 || true) | |
| if [[ -n "$verdict_line" ]]; then | |
| # Strip VERDICT: prefix and return | |
| local verdict="${verdict_line#VERDICT:}" | |
| log_info "AI eval for $task_id: $verdict" | |
| # Store AI evaluation in state log for audit trail | |
| sqlite3 "$SUPERVISOR_DB" " | |
| INSERT INTO state_log (task_id, from_state, to_state, reason) | |
| VALUES ('$(sql_escape "$task_id")', 'evaluating', 'evaluating', | |
| 'AI eval verdict: $verdict'); | |
| " 2>/dev/null || true | |
| echo "$verdict" | |
| return 0 | |
| # Parse the VERDICT line from AI response | |
| local verdict_line | |
| verdict_line=$(echo "$ai_result" | grep -oE 'VERDICT:[a-z]+:[a-z0-9_-]+' | head -1 || true) | |
| if [[ -n "$verdict_line" ]]; then | |
| # Strip VERDICT: prefix and return | |
| local verdict="${verdict_line#VERDICT:}" | |
| log_info "AI eval for $task_id: $verdict" | |
| # Store AI evaluation in state log for audit trail | |
| sqlite3 "$SUPERVISOR_DB" " | |
| INSERT INTO state_log (task_id, from_state, to_state, reason) | |
| VALUES ('$(sql_escape "$task_id")', 'evaluating', 'evaluating', | |
| 'AI eval verdict: $(sql_escape "$verdict")'); | |
| " 2>/dev/null || true | |
| echo "$verdict" | |
| return 0 |
🤖 Prompt for AI Agents
In @.agent/scripts/supervisor-helper.sh around lines 1706 - 1723, The VERDICT
parsing is too restrictive and the verdict is inserted into SQL unescaped;
update the grep/regex used to extract verdict_line (the variable verdict_line
and the pattern matching around VERDICT:) to allow digits, hyphens and
underscores (e.g., include 0-9, - and _ in the token characters or use a more
permissive non-whitespace capture for the detail portion) so strings like
VERDICT:retry:rate-limited and VERDICT:complete:pr_123 match, then after
stripping the VERDICT: prefix when assigning local verdict use the sql_escape
helper on that value in the sqlite3 INSERT (replace 'AI eval verdict: $verdict'
with a version that uses sql_escape "$verdict") to prevent SQL injection.
| # Check max retries | ||
| if [[ "$tretries" -ge "$tmax_retries" ]]; then | ||
| log_error "Task $task_id has exceeded max retries ($tretries/$tmax_retries)" | ||
| cmd_transition "$task_id" "failed" --error "Max retries exceeded during re-prompt" | ||
| return 1 | ||
| fi |
There was a problem hiding this comment.
State machine conflict: max-retries path marks "failed" instead of "blocked", defeating the PR's stated intent.
When retries >= max_retries, cmd_reprompt transitions the task from retrying → failed (Line 1785). Back in cmd_pulse (Line 1989), the fallback tries to transition from failed → blocked, but this is not a valid transition (only failed:queued is permitted). The || true suppresses the error silently, so:
- The task ends up
failedinstead ofblocked. cmd_cleanuptargetsfailedtasks, so the worktree gets removed — contradicting the PR objective of preserving the worktree for blocked tasks.- The escalation email (Lines 1991-1996) still fires, but its message says "blocked" while the actual state is "failed".
The fix requires two changes: add retrying:blocked to VALID_TRANSITIONS, and transition to blocked here instead of failed.
Proposed fix
Add the missing transition to the valid transitions array:
"retrying:dispatched"
"retrying:failed"
+ "retrying:blocked"
"retrying:cancelled"Then in cmd_reprompt, transition to blocked instead of failed:
if [[ "$tretries" -ge "$tmax_retries" ]]; then
log_error "Task $task_id has exceeded max retries ($tretries/$tmax_retries)"
- cmd_transition "$task_id" "failed" --error "Max retries exceeded during re-prompt"
+ cmd_transition "$task_id" "blocked" --error "Max retries exceeded during re-prompt"
return 1
fiThis way cmd_pulse's fallback (Line 1989) won't need to re-transition — the task is already blocked, and the escalation logic can key off the state directly.
As per coding guidelines, .agent/scripts/*.sh must focus on error recovery mechanisms — the current path silently loses the intended blocked state.
🤖 Prompt for AI Agents
In @.agent/scripts/supervisor-helper.sh around lines 1782 - 1787, The
retry-exceeded path currently transitions the task to "failed" causing a state
mismatch downstream; update the VALID_TRANSITIONS array to include the
"retrying:blocked" transition and change the cmd_reprompt logic (the
cmd_reprompt function where it checks tretries >= tmax_retries) to call
cmd_transition "$task_id" "blocked" (preserving the existing --error payload)
instead of transitioning to "failed" so that cmd_pulse and escalation logic see
the intended "blocked" state and cmd_cleanup won't remove the worktree.
| # Re-prompt in existing worktree (continues context) | ||
| if cmd_reprompt "$tid" 2>/dev/null; then | ||
| dispatched_count=$((dispatched_count + 1)) | ||
| log_info " $tid: re-prompted successfully" |
There was a problem hiding this comment.
dispatched_count used in Phase 1 before declaration in Phase 2 — pulse summary undercounts.
dispatched_count is incremented here (Line 1979) during re-prompts in Phase 1, but it's declared with local dispatched_count=0 at Line 2027 in Phase 2. The Phase 2 declaration shadows the Phase 1 writes, so the pulse summary (Line 2080) will never include re-prompt dispatches — it only counts Phase 2 fresh dispatches.
Proposed fix: move the declaration before Phase 1
Move local dispatched_count=0 up alongside the other Phase 1 counters:
local completed_count=0
local failed_count=0
+ local dispatched_count=0
if [[ -n "$running_tasks" ]]; thenAnd remove the duplicate declaration at Line 2027:
# Phase 2: Dispatch queued tasks up to concurrency limit
- local dispatched_count=0🤖 Prompt for AI Agents
In @.agent/scripts/supervisor-helper.sh around lines 1977 - 1980, The variable
dispatched_count is being incremented during Phase 1 (in the cmd_reprompt block)
but is later re-declared (shadowed) in Phase 2, so Phase 1 increments are lost;
to fix, declare and initialize dispatched_count (e.g., local dispatched_count=0)
together with the other Phase 1 counters before any Phase 1 logic (so increments
in the cmd_reprompt branch are applied to the same variable), then remove the
duplicate local dispatched_count declaration in Phase 2 so the final pulse
summary uses the single shared counter.



Summary
repromptcommand that re-dispatches workers in their existing worktree with failure context, enabling intelligent retry cyclesevaluatecommand for manual debugging of stuck tasksblocked(notfailed) after exhausting retries, preserving worktree for human investigationChanges
New functions
extract_log_metadata()- Structured log parsing (signals, PR URLs, error counts for rate limit/auth/conflict/OOM/timeout)extract_log_tail()- Safe tail extraction for AI eval contextevaluate_with_ai()- Dispatches cheap Sonnet call to classify ambiguous outcomescmd_reprompt- Re-prompts workers in existing worktree with failure contextcmd_evaluate- Manual evaluation command with--no-aiflagEnhanced evaluation tiers
Pulse cycle improvements
cmd_repromptinstead of naive state re-queueblockedwith mail escalationfailedstateTesting
help,init,evaluatecommands tested manuallyTask
Closes t128.3 (Outcome evaluation and re-prompt cycle)
Summary by CodeRabbit
Release Notes