t1193: Reconcile supervisor DB running count with actual worker state#1813
t1193: Reconcile supervisor DB running count with actual worker state#1813marcusquinn merged 1 commit intomainfrom
Conversation
Phase 0.7 uses updated_at for staleness detection, which can be refreshed by other DB operations (reconciliation, state_log writes) while a worker is dead. This causes the supervisor DB to report tasks in 'running' status with no active workers — stale state from workers that terminated without recording their exit. Phase 0.8 fixes this by using started_at (set once at dispatch, never refreshed) to detect running tasks that have exceeded a wall-clock timeout with no live worker process: - Queries tasks WHERE status='running' AND started_at < now - timeout - Cross-references against PID files + kill -0 liveness check - Tasks with a PR → pr_review (work may be done, only worker died) - Tasks without a PR → failed with reason 'stale_running_no_process' - Logs recovery to state_log and pattern-tracker for observability - Configurable via SUPERVISOR_RUNNING_STALE_SECONDS (default 3600 = 1h) - ShellCheck: zero violations Chose started_at over updated_at because started_at is immutable after dispatch — matches existing patterns in Phase 4 hung-worker detection. Chose 1h default (not 2x expected duration) because tasks lack a duration estimate field; 1h is conservative enough to avoid false positives while catching the reported scenario (3 tasks stuck for hours). Ref #1804
|
Warning You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again! |
WalkthroughIntroduces Phase 0.8 to the supervisor pulse cycle for detecting and recovering stale running tasks. The phase checks task started_at timestamps against a timeout threshold, verifies worker process liveness via PID files, and routes tasks to pr_review or failed states with corresponding cleanup and state logging. Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 3✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
🔍 Code Quality Report�[0;35m[MONITOR]�[0m Code Review Monitoring Report �[0;34m[INFO]�[0m Latest Quality Status: �[0;34m[INFO]�[0m Recent monitoring activity: 📈 Current Quality Metrics
Generated on: Wed Feb 18 22:05:43 UTC 2026 Generated by AI DevOps Framework Code Review Monitoring |
|
There was a problem hiding this comment.
Actionable comments posted: 2
🧹 Nitpick comments (2)
.agents/scripts/supervisor/pulse.sh (2)
401-405:sr_startedandrunning_stale_secondsare interpolated into thestate_logreason string without escaping.While
sr_startedis a DB-sourced ISO timestamp andrunning_stale_secondsis numeric (both unlikely to contain single quotes), thesql_escapeutility is already in use forsr_idon the same line. For consistency with Phase 0.7's pattern and to maintain that zero-technical-debt grade, consider escaping them too — or at minimumsr_startedsince it's external data flowing back into a new INSERT.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.agents/scripts/supervisor/pulse.sh around lines 401 - 405, Escape interpolated variables sr_started (and optionally running_stale_seconds) before embedding them into the SQL INSERT reason string to avoid unescaped external data; use the existing sql_escape helper the same way sr_id is escaped so the VALUES line becomes VALUES ('$(sql_escape "$sr_id")', 'running', 'pr_review', 'Phase 0.8 stale_running_no_process (t1193): started_at=$(sql_escape "$sr_started"), no live worker after $(sql_escape "$running_stale_seconds")s, PR exists') — at minimum wrap sr_started with sql_escape when constructing the state_log INSERT in pulse.sh.
352-360: Consider validatingrunning_stale_secondsis numeric before SQL interpolation.The variable is interpolated directly into the SQL query (line 358). While this mirrors Phase 0.7's
stale_grace_secondspattern (line 243), a non-numericSUPERVISOR_RUNNING_STALE_SECONDSenv var value would produce a broken SQL statement. Both phases share this concern.Optional defensive check
local running_stale_seconds="${SUPERVISOR_RUNNING_STALE_SECONDS:-3600}" + # Ensure numeric to prevent SQL injection via env var + [[ "$running_stale_seconds" =~ ^[0-9]+$ ]] || running_stale_seconds=3600🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.agents/scripts/supervisor/pulse.sh around lines 352 - 360, The script directly interpolates SUPERVISOR_RUNNING_STALE_SECONDS into running_stale_seconds and then into the SQL in the db call (see variable running_stale_seconds and the stale_running_tasks db query), so add a defensive numeric validation: ensure running_stale_seconds contains only digits (optionally allow leading +) and fall back to the default 3600 if not numeric or empty; then use that validated value in the db SQL interpolation to avoid malformed queries.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In @.agents/scripts/supervisor/pulse.sh:
- Around line 396-422: The PR-exists branch in Phase 0.8 skips calling
cleanup_worker_processes, risking zombie child processes; update the logic so
cleanup_worker_processes "$sr_id" is invoked regardless of whether sr_pr_url is
set (i.e., either add cleanup_worker_processes "$sr_id" before the fi or move
the call out of the else branch), keeping existing logging redirection
(2>>"$SUPERVISOR_LOG" || true) and preserving surrounding transitions
(cmd_transition "$sr_id" "pr_review" ... and the DB inserts) so process trees
are always reaped after handling the state transition.
In `@tests/test-supervisor-state-machine.sh`:
- Around line 1799-1810: The test is failing because Phase 1 / Phase 4b detect a
missing PID and transition the task after Phase 0.8 skips it; fix by creating a
live PID file for test-t1193c immediately after marking it running so both Phase
1 and Phase 4b also consider it alive. After calling "sup transition test-t1193c
running", write a PID file containing a live PID (use $$ or start a short-lived
sleep background process and write its PID) to the supervisor's expected
PID-file location for task "test-t1193c" so that the code paths in Phase 1 (the
is_alive check and cmd_transition) and Phase 4b will skip acting on this task;
then run the SUPERVISOR_RUNNING_STALE_SECONDS=3600 sup pulse and assert
get_status still returns "running".
---
Nitpick comments:
In @.agents/scripts/supervisor/pulse.sh:
- Around line 401-405: Escape interpolated variables sr_started (and optionally
running_stale_seconds) before embedding them into the SQL INSERT reason string
to avoid unescaped external data; use the existing sql_escape helper the same
way sr_id is escaped so the VALUES line becomes VALUES ('$(sql_escape
"$sr_id")', 'running', 'pr_review', 'Phase 0.8 stale_running_no_process (t1193):
started_at=$(sql_escape "$sr_started"), no live worker after $(sql_escape
"$running_stale_seconds")s, PR exists') — at minimum wrap sr_started with
sql_escape when constructing the state_log INSERT in pulse.sh.
- Around line 352-360: The script directly interpolates
SUPERVISOR_RUNNING_STALE_SECONDS into running_stale_seconds and then into the
SQL in the db call (see variable running_stale_seconds and the
stale_running_tasks db query), so add a defensive numeric validation: ensure
running_stale_seconds contains only digits (optionally allow leading +) and fall
back to the default 3600 if not numeric or empty; then use that validated value
in the db SQL interpolation to avoid malformed queries.
| if [[ -n "$sr_pr_url" ]]; then | ||
| log_info " Phase 0.8: $sr_id → pr_review (has PR, worker died after running)" | ||
| cmd_transition "$sr_id" "pr_review" --pr-url "$sr_pr_url" \ | ||
| --error "Stale running recovery (Phase 0.8/t1193): worker died after ${running_stale_seconds}s, PR exists" \ | ||
| 2>>"$SUPERVISOR_LOG" || true | ||
| db "$SUPERVISOR_DB" " | ||
| INSERT INTO state_log (task_id, from_state, to_state, reason) | ||
| VALUES ('$(sql_escape "$sr_id")', 'running', 'pr_review', | ||
| 'Phase 0.8 stale_running_no_process (t1193): started_at=${sr_started}, no live worker after ${running_stale_seconds}s, PR exists'); | ||
| " 2>/dev/null || true | ||
| else | ||
| # No PR — mark failed with stale_running_no_process reason | ||
| cmd_transition "$sr_id" "failed" \ | ||
| --error "stale_running_no_process: worker terminated without recording exit after ${running_stale_seconds}s (Phase 0.8/t1193)" \ | ||
| 2>>"$SUPERVISOR_LOG" || true | ||
| db "$SUPERVISOR_DB" " | ||
| INSERT INTO state_log (task_id, from_state, to_state, reason) | ||
| VALUES ('$(sql_escape "$sr_id")', 'running', 'failed', | ||
| 'Phase 0.8 stale_running_no_process (t1193): started_at=${sr_started}, no live worker after ${running_stale_seconds}s'); | ||
| " 2>/dev/null || true | ||
| # Clean up worker process tree (zombie children) | ||
| cleanup_worker_processes "$sr_id" 2>>"$SUPERVISOR_LOG" || true | ||
| # Attempt self-heal for retry eligibility | ||
| attempt_self_heal "$sr_id" "failed" \ | ||
| "stale_running_no_process: worker terminated without recording exit" \ | ||
| "${batch_id:-}" 2>>"$SUPERVISOR_LOG" || true | ||
| fi |
There was a problem hiding this comment.
cleanup_worker_processes is missing for the PR-exists path — potential zombie child leak.
Phase 0.7 (line 322) calls cleanup_worker_processes unconditionally after the transition branch. Phase 0.8 only calls it in the no-PR else branch (line 417), skipping it when routing to pr_review. If the dead worker spawned child processes (shellcheck, node, bash-language-server), those zombies won't be reaped for tasks that have a PR.
Proposed fix: call cleanup unconditionally after the branch
if [[ -n "$sr_pr_url" ]]; then
log_info " Phase 0.8: $sr_id → pr_review (has PR, worker died after running)"
cmd_transition "$sr_id" "pr_review" --pr-url "$sr_pr_url" \
--error "Stale running recovery (Phase 0.8/t1193): worker died after ${running_stale_seconds}s, PR exists" \
2>>"$SUPERVISOR_LOG" || true
db "$SUPERVISOR_DB" "
INSERT INTO state_log (task_id, from_state, to_state, reason)
VALUES ('$(sql_escape "$sr_id")', 'running', 'pr_review',
'Phase 0.8 stale_running_no_process (t1193): started_at=${sr_started}, no live worker after ${running_stale_seconds}s, PR exists');
" 2>/dev/null || true
+ # Clean up worker process tree (zombie children)
+ cleanup_worker_processes "$sr_id" 2>>"$SUPERVISOR_LOG" || true
else
# No PR — mark failed with stale_running_no_process reason
cmd_transition "$sr_id" "failed" \
--error "stale_running_no_process: worker terminated without recording exit after ${running_stale_seconds}s (Phase 0.8/t1193)" \
2>>"$SUPERVISOR_LOG" || true
db "$SUPERVISOR_DB" "
INSERT INTO state_log (task_id, from_state, to_state, reason)
VALUES ('$(sql_escape "$sr_id")', 'running', 'failed',
'Phase 0.8 stale_running_no_process (t1193): started_at=${sr_started}, no live worker after ${running_stale_seconds}s');
" 2>/dev/null || true
# Clean up worker process tree (zombie children)
cleanup_worker_processes "$sr_id" 2>>"$SUPERVISOR_LOG" || true
# Attempt self-heal for retry eligibility
attempt_self_heal "$sr_id" "failed" \
"stale_running_no_process: worker terminated without recording exit" \
"${batch_id:-}" 2>>"$SUPERVISOR_LOG" || true
fiAlternatively, move the cleanup call after the fi (like Phase 0.7 line 322) to avoid duplication.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In @.agents/scripts/supervisor/pulse.sh around lines 396 - 422, The PR-exists
branch in Phase 0.8 skips calling cleanup_worker_processes, risking zombie child
processes; update the logic so cleanup_worker_processes "$sr_id" is invoked
regardless of whether sr_pr_url is set (i.e., either add
cleanup_worker_processes "$sr_id" before the fi or move the call out of the else
branch), keeping existing logging redirection (2>>"$SUPERVISOR_LOG" || true) and
preserving surrounding transitions (cmd_transition "$sr_id" "pr_review" ... and
the DB inserts) so process trees are always reaped after handling the state
transition.
| # Test 3: recently started running task (within timeout) is NOT recovered | ||
| sup add test-t1193c --repo /tmp/test --description "Recent running task" >/dev/null | ||
| sup transition test-t1193c dispatched >/dev/null | ||
| sup transition test-t1193c running >/dev/null | ||
| # started_at is recent (default) — should NOT be touched by Phase 0.8 | ||
| SUPERVISOR_RUNNING_STALE_SECONDS=3600 sup pulse 2>/dev/null || true | ||
| t1193c_status=$(get_status test-t1193c) | ||
| if [[ "$t1193c_status" == "running" ]]; then | ||
| pass "Phase 0.8: recently started running task not falsely recovered (t1193)" | ||
| else | ||
| fail "Phase 0.8: recently started running task was incorrectly recovered" "status=$t1193c_status (expected running)" | ||
| fi |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
# First, let's find the supervisor pulse script and examine Phase 1 logic
find . -name "pulse.sh" -type fRepository: marcusquinn/aidevops
Length of output: 101
🏁 Script executed:
# Get the directory structure to understand the repo layout
ls -la | head -20
git ls-files | grep -E "(supervisor|pulse)" | head -20Repository: marcusquinn/aidevops
Length of output: 2077
🏁 Script executed:
# Let's examine the pulse.sh file to verify Phase 1 and Phase 4b logic
wc -l .agents/scripts/supervisor/pulse.shRepository: marcusquinn/aidevops
Length of output: 104
🏁 Script executed:
# Look for Phase 1 logic and queries around line 442
sed -n '430,500p' .agents/scripts/supervisor/pulse.shRepository: marcusquinn/aidevops
Length of output: 2604
🏁 Script executed:
# Look for Phase 4b logic around line 1607
sed -n '1595,1650p' .agents/scripts/supervisor/pulse.shRepository: marcusquinn/aidevops
Length of output: 2771
🏁 Script executed:
# Now let's find Phase 0.8 to understand when it runs
sed -n '300,450p' .agents/scripts/supervisor/pulse.sh | grep -A 20 "Phase 0.8"Repository: marcusquinn/aidevops
Length of output: 4015
🏁 Script executed:
# Let's also examine the test code around lines 1799-1810
sed -n '1790,1825p' tests/test-supervisor-state-machine.shRepository: marcusquinn/aidevops
Length of output: 1815
🏁 Script executed:
# Find what get_status does and what the test setup looks like
grep -n "get_status\|test-t1193c" tests/test-supervisor-state-machine.sh | head -20Repository: marcusquinn/aidevops
Length of output: 1236
🏁 Script executed:
# Let's verify get_status implementation to confirm it returns the database status
sed -n '70,90p' tests/test-supervisor-state-machine.shRepository: marcusquinn/aidevops
Length of output: 584
🏁 Script executed:
# Also verify cmd_transition is called in Phase 1 to confirm state change
grep -n "cmd_transition.*evaluating" .agents/scripts/supervisor/pulse.sh | head -5Repository: marcusquinn/aidevops
Length of output: 638
Test 3 is fragile — Phase 1 and Phase 4b will interfere with the assertion.
This test expects test-t1193c to remain in running after a full pulse, proving Phase 0.8 skipped it. However:
- Phase 0.8 (lines 350–430) correctly selects only tasks where
status = 'running'ANDstarted_at < now - 3600s, so it will skip your recently created task. ✓ - Phase 1 (line 442) selects ALL tasks where
status IN ('running', 'dispatched', 'evaluating')with no grace period. Sincetest-t1193chas no PID file, Phase 1 detectsis_alive=falseat line 481 and immediately transitions it toevaluatingviacmd_transitionat line 484. - Phase 4b (lines 1607–1625) also catches running tasks with no PID file and evaluates them.
Your assertion t1193c_status == "running" will fail because the database status changes to evaluating (or failed if evaluation fails) after Phase 1/4b processing.
To isolate Phase 0.8's skip behavior, the test needs a live PID file pointing to a real process (e.g., $$ or a sleep background job) so Phases 1 and 4b also skip it:
Proposed fix: create a PID file with a live process
# Test 3: recently started running task (within timeout) is NOT recovered
sup add test-t1193c --repo /tmp/test --description "Recent running task" >/dev/null
sup transition test-t1193c dispatched >/dev/null
sup transition test-t1193c running >/dev/null
# started_at is recent (default) — should NOT be touched by Phase 0.8
+# Create a PID file with a live process so Phase 1/4b also skip this task
+sleep 300 &
+local_bg_pid=$!
+mkdir -p "$TEST_DIR/pids"
+echo "$local_bg_pid" > "$TEST_DIR/pids/test-t1193c.pid"
SUPERVISOR_RUNNING_STALE_SECONDS=3600 sup pulse 2>/dev/null || true
+kill "$local_bg_pid" 2>/dev/null || true
+rm -f "$TEST_DIR/pids/test-t1193c.pid"
t1193c_status=$(get_status test-t1193c)
if [[ "$t1193c_status" == "running" ]]; then
pass "Phase 0.8: recently started running task not falsely recovered (t1193)"
else
fail "Phase 0.8: recently started running task was incorrectly recovered" "status=$t1193c_status (expected running)"
fi📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| # Test 3: recently started running task (within timeout) is NOT recovered | |
| sup add test-t1193c --repo /tmp/test --description "Recent running task" >/dev/null | |
| sup transition test-t1193c dispatched >/dev/null | |
| sup transition test-t1193c running >/dev/null | |
| # started_at is recent (default) — should NOT be touched by Phase 0.8 | |
| SUPERVISOR_RUNNING_STALE_SECONDS=3600 sup pulse 2>/dev/null || true | |
| t1193c_status=$(get_status test-t1193c) | |
| if [[ "$t1193c_status" == "running" ]]; then | |
| pass "Phase 0.8: recently started running task not falsely recovered (t1193)" | |
| else | |
| fail "Phase 0.8: recently started running task was incorrectly recovered" "status=$t1193c_status (expected running)" | |
| fi | |
| # Test 3: recently started running task (within timeout) is NOT recovered | |
| sup add test-t1193c --repo /tmp/test --description "Recent running task" >/dev/null | |
| sup transition test-t1193c dispatched >/dev/null | |
| sup transition test-t1193c running >/dev/null | |
| # started_at is recent (default) — should NOT be touched by Phase 0.8 | |
| # Create a PID file with a live process so Phase 1/4b also skip this task | |
| sleep 300 & | |
| local_bg_pid=$! | |
| mkdir -p "$TEST_DIR/pids" | |
| echo "$local_bg_pid" > "$TEST_DIR/pids/test-t1193c.pid" | |
| SUPERVISOR_RUNNING_STALE_SECONDS=3600 sup pulse 2>/dev/null || true | |
| kill "$local_bg_pid" 2>/dev/null || true | |
| rm -f "$TEST_DIR/pids/test-t1193c.pid" | |
| t1193c_status=$(get_status test-t1193c) | |
| if [[ "$t1193c_status" == "running" ]]; then | |
| pass "Phase 0.8: recently started running task not falsely recovered (t1193)" | |
| else | |
| fail "Phase 0.8: recently started running task was incorrectly recovered" "status=$t1193c_status (expected running)" | |
| fi |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/test-supervisor-state-machine.sh` around lines 1799 - 1810, The test is
failing because Phase 1 / Phase 4b detect a missing PID and transition the task
after Phase 0.8 skips it; fix by creating a live PID file for test-t1193c
immediately after marking it running so both Phase 1 and Phase 4b also consider
it alive. After calling "sup transition test-t1193c running", write a PID file
containing a live PID (use $$ or start a short-lived sleep background process
and write its PID) to the supervisor's expected PID-file location for task
"test-t1193c" so that the code paths in Phase 1 (the is_alive check and
cmd_transition) and Phase 4b will skip acting on this task; then run the
SUPERVISOR_RUNNING_STALE_SECONDS=3600 sup pulse and assert get_status still
returns "running".
Auto-dismissed: bot review does not block autonomous pipeline



Summary
runningtasks whose worker processes have terminated without recording their exitstarted_at(immutable after dispatch) instead ofupdated_at(refreshed by other DB operations) to reliably identify stale running tasksfailedwith reasonstale_running_no_process, or topr_reviewif a PR already existsRoot Cause
Phase 0.7 uses
updated_atfor staleness detection. Other DB operations (reconciliation sweeps, state_log writes, Phase 7b bidirectional sync) can refreshupdated_atwhile the worker is dead, causing Phase 0.7 to miss tasks that have been stuck inrunningfor hours. The result: the supervisor DB reports tasks inrunningstatus with no active workers.Changes
.agents/scripts/supervisor/pulse.shAdded Phase 0.8 between Phase 0.7 and Phase 1:
WHERE status='running' AND started_at < now - SUPERVISOR_RUNNING_STALE_SECONDSkill -0 $pid— skips tasks with live workerspr_review(work done, only worker died)failed+attempt_self_heal+cleanup_worker_processesstate_logwith reasonstale_running_no_processpattern-tracker-helper.shfor observabilitySUPERVISOR_RUNNING_STALE_SECONDS(default 3600 = 1h)tests/test-supervisor-state-machine.shAdded 3 tests for Phase 0.8:
started_at) → recovered tofailed/queuedpr_reviewVerification
Design Decisions
started_atnotupdated_at:started_atis set once at dispatch and never refreshed — immune to the DB churn that defeats Phase 0.7runningstatus:dispatchedandevaluatingare already covered by Phase 0.7 and Phase 1c respectively — no overlapRef #1804
Summary by CodeRabbit
Release Notes
Bug Fixes
Tests