t1193: Reconcile supervisor DB running count with actual worker state by marcusquinn · Pull Request #1813 · marcusquinn/aidevops

marcusquinn · 2026-02-18T22:05:13Z

Summary

Adds Phase 0.8 to the supervisor pulse cycle to detect and recover running tasks whose worker processes have terminated without recording their exit
Uses started_at (immutable after dispatch) instead of updated_at (refreshed by other DB operations) to reliably identify stale running tasks
Transitions orphaned tasks to failed with reason stale_running_no_process, or to pr_review if a PR already exists

Root Cause

Phase 0.7 uses updated_at for staleness detection. Other DB operations (reconciliation sweeps, state_log writes, Phase 7b bidirectional sync) can refresh updated_at while the worker is dead, causing Phase 0.7 to miss tasks that have been stuck in running for hours. The result: the supervisor DB reports tasks in running status with no active workers.

Changes

`.agents/scripts/supervisor/pulse.sh`

Added Phase 0.8 between Phase 0.7 and Phase 1:

Phase 0.8: Stale 'running' task recovery using started_at (t1193)

SQL: WHERE status='running' AND started_at < now - SUPERVISOR_RUNNING_STALE_SECONDS
PID liveness check: kill -0 $pid — skips tasks with live workers
Recovery paths:
- Has PR → pr_review (work done, only worker died)
- No PR → failed + attempt_self_heal + cleanup_worker_processes
Logs to state_log with reason stale_running_no_process
Records to pattern-tracker-helper.sh for observability
Configurable: SUPERVISOR_RUNNING_STALE_SECONDS (default 3600 = 1h)

`tests/test-supervisor-state-machine.sh`

Added 3 tests for Phase 0.8:

Stale running task (no PID, old started_at) → recovered to failed/queued
Stale running task with PR → routed to pr_review
Recently started running task → NOT falsely recovered

Verification

ShellCheck: zero violations on both modified files
Unit tests: 3/3 pass (SQL query logic, state_log recording)

Phase 0.8 recovery confirmed in pulse output:

Phase 0.8: Orphaned running task test-t1193a (started: ..., no live worker after 0s)
Phase 0.8: Recovered 1 stale running task(s) (0 still alive)

Design Decisions

started_at not updated_at: started_at is set once at dispatch and never refreshed — immune to the DB churn that defeats Phase 0.7
1h default: Tasks lack a duration estimate field, so 2x-expected-duration is not computable. 1h is conservative enough to avoid false positives while catching the reported scenario
Only running status: dispatched and evaluating are already covered by Phase 0.7 and Phase 1c respectively — no overlap

Ref #1804

Summary by CodeRabbit

Release Notes

Bug Fixes
- Tasks that remain in running state without an active worker are now automatically detected and recovered.
- Stale orphaned tasks are transitioned to appropriate states (failed, queued, or pr_review) based on configuration.
Tests
- Added test coverage for stale task detection and recovery scenarios.

Phase 0.7 uses updated_at for staleness detection, which can be refreshed by other DB operations (reconciliation, state_log writes) while a worker is dead. This causes the supervisor DB to report tasks in 'running' status with no active workers — stale state from workers that terminated without recording their exit. Phase 0.8 fixes this by using started_at (set once at dispatch, never refreshed) to detect running tasks that have exceeded a wall-clock timeout with no live worker process: - Queries tasks WHERE status='running' AND started_at < now - timeout - Cross-references against PID files + kill -0 liveness check - Tasks with a PR → pr_review (work may be done, only worker died) - Tasks without a PR → failed with reason 'stale_running_no_process' - Logs recovery to state_log and pattern-tracker for observability - Configurable via SUPERVISOR_RUNNING_STALE_SECONDS (default 3600 = 1h) - ShellCheck: zero violations Chose started_at over updated_at because started_at is immutable after dispatch — matches existing patterns in Phase 4 hung-worker detection. Chose 1h default (not 2x expected duration) because tasks lack a duration estimate field; 1h is conservative enough to avoid false positives while catching the reported scenario (3 tasks stuck for hours). Ref #1804

gemini-code-assist · 2026-02-18T22:05:16Z

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

coderabbitai · 2026-02-18T22:05:39Z

Walkthrough

Introduces Phase 0.8 to the supervisor pulse cycle for detecting and recovering stale running tasks. The phase checks task started_at timestamps against a timeout threshold, verifies worker process liveness via PID files, and routes tasks to pr_review or failed states with corresponding cleanup and state logging.

Changes

Cohort / File(s)	Summary
Stale-Running Recovery Phase `.agents/scripts/supervisor/pulse.sh`	Implements Phase 0.8 logic to detect orphaned tasks in running state, check worker liveness via PID files, transition stale tasks to pr_review (with PR URL) or failed (without PR URL), perform cleanup, and record state transitions and patterns.
Phase 0.8 Test Suite `tests/test-supervisor-state-machine.sh`	Adds three validation tests: stale task without PID recovery to failed/queued, stale task with PR URL routing to pr_review, and recent task remaining running. Uses environment variable override (SUPERVISOR_RUNNING_STALE_SECONDS=0) to force immediate detection.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

fix: supervisor self-healing -- macOS timeout, PR detection, model names, stale PID cleanup #429 – Both PRs address stale PID cleanup and missing PR URL fallbacks in supervisor self-healing recovery paths.
fix: supervisor integration testing fixes (t128.7) #384 – Implements complementary worker cleanup helpers and periodic stale-process integration into the pulse flow.
fix(supervisor): prevent worker death on cron exit and fix dispatch deadlock #431 – Extends PID file handling robustness and worker startup patterns that Phase 0.8 relies upon for liveness checks.

Poem

🔍 Stale runners fade in the orphaned night,
Phase 0.8 brings them back to light,
Check the PID, check the time,
Route to review or mark as crime,
Self-heal flows through the DevOps line ✨

🚥 Pre-merge checks | ✅ 3

✅ Passed checks (3 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title clearly and specifically describes the main objective: reconciling supervisor database running count with actual worker state through Phase 0.8 stale-running recovery logic.
Docstring Coverage	✅ Passed	Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

📝 Generate docstrings

🧪 Generate unit tests (beta)

Create PR with unit tests
Post copyable unit tests in a comment
Commit unit tests in branch feature/t1193

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

github-actions · 2026-02-18T22:05:45Z

🔍 Code Quality Report

�[0;35m[MONITOR]�[0m Code Review Monitoring Report

�[0;34m[INFO]�[0m Latest Quality Status:
SonarCloud: 0 bugs, 0 vulnerabilities, 27 code smells

�[0;34m[INFO]�[0m Recent monitoring activity:
Wed Feb 18 22:05:40 UTC 2026: Code review monitoring started
Wed Feb 18 22:05:41 UTC 2026: SonarCloud - Bugs: 0, Vulnerabilities: 0, Code Smells: 27

📈 Current Quality Metrics

BUGS: 0
CODE SMELLS: 27
VULNERABILITIES: 0

Generated on: Wed Feb 18 22:05:43 UTC 2026

Generated by AI DevOps Framework Code Review Monitoring

sonarqubecloud · 2026-02-18T22:06:29Z

Quality Gate passed

Issues
0 New issues
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarQube Cloud

coderabbitai

Actionable comments posted: 2

🧹 Nitpick comments (2)

.agents/scripts/supervisor/pulse.sh (2)
401-405: sr_started and running_stale_seconds are interpolated into the state_log reason string without escaping.

While sr_started is a DB-sourced ISO timestamp and running_stale_seconds is numeric (both unlikely to contain single quotes), the sql_escape utility is already in use for sr_id on the same line. For consistency with Phase 0.7's pattern and to maintain that zero-technical-debt grade, consider escaping them too — or at minimum sr_started since it's external data flowing back into a new INSERT.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.agents/scripts/supervisor/pulse.sh around lines 401 - 405, Escape
interpolated variables sr_started (and optionally running_stale_seconds) before
embedding them into the SQL INSERT reason string to avoid unescaped external
data; use the existing sql_escape helper the same way sr_id is escaped so the
VALUES line becomes VALUES ('$(sql_escape "$sr_id")', 'running', 'pr_review',
'Phase 0.8 stale_running_no_process (t1193): started_at=$(sql_escape
"$sr_started"), no live worker after $(sql_escape "$running_stale_seconds")s, PR
exists') — at minimum wrap sr_started with sql_escape when constructing the
state_log INSERT in pulse.sh.
352-360: Consider validating running_stale_seconds is numeric before SQL interpolation.

The variable is interpolated directly into the SQL query (line 358). While this mirrors Phase 0.7's stale_grace_seconds pattern (line 243), a non-numeric SUPERVISOR_RUNNING_STALE_SECONDS env var value would produce a broken SQL statement. Both phases share this concern.
Optional defensive check
 	local running_stale_seconds="${SUPERVISOR_RUNNING_STALE_SECONDS:-3600}"
+	# Ensure numeric to prevent SQL injection via env var
+	[[ "$running_stale_seconds" =~ ^[0-9]+$ ]] || running_stale_seconds=3600
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.agents/scripts/supervisor/pulse.sh around lines 352 - 360, The script
directly interpolates SUPERVISOR_RUNNING_STALE_SECONDS into
running_stale_seconds and then into the SQL in the db call (see variable
running_stale_seconds and the stale_running_tasks db query), so add a defensive
numeric validation: ensure running_stale_seconds contains only digits
(optionally allow leading +) and fall back to the default 3600 if not numeric or
empty; then use that validated value in the db SQL interpolation to avoid
malformed queries.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.agents/scripts/supervisor/pulse.sh:
- Around line 396-422: The PR-exists branch in Phase 0.8 skips calling
cleanup_worker_processes, risking zombie child processes; update the logic so
cleanup_worker_processes "$sr_id" is invoked regardless of whether sr_pr_url is
set (i.e., either add cleanup_worker_processes "$sr_id" before the fi or move
the call out of the else branch), keeping existing logging redirection
(2>>"$SUPERVISOR_LOG" || true) and preserving surrounding transitions
(cmd_transition "$sr_id" "pr_review" ... and the DB inserts) so process trees
are always reaped after handling the state transition.

In `@tests/test-supervisor-state-machine.sh`:
- Around line 1799-1810: The test is failing because Phase 1 / Phase 4b detect a
missing PID and transition the task after Phase 0.8 skips it; fix by creating a
live PID file for test-t1193c immediately after marking it running so both Phase
1 and Phase 4b also consider it alive. After calling "sup transition test-t1193c
running", write a PID file containing a live PID (use $$ or start a short-lived
sleep background process and write its PID) to the supervisor's expected
PID-file location for task "test-t1193c" so that the code paths in Phase 1 (the
is_alive check and cmd_transition) and Phase 4b will skip acting on this task;
then run the SUPERVISOR_RUNNING_STALE_SECONDS=3600 sup pulse and assert
get_status still returns "running".

---

Nitpick comments:
In @.agents/scripts/supervisor/pulse.sh:
- Around line 401-405: Escape interpolated variables sr_started (and optionally
running_stale_seconds) before embedding them into the SQL INSERT reason string
to avoid unescaped external data; use the existing sql_escape helper the same
way sr_id is escaped so the VALUES line becomes VALUES ('$(sql_escape
"$sr_id")', 'running', 'pr_review', 'Phase 0.8 stale_running_no_process (t1193):
started_at=$(sql_escape "$sr_started"), no live worker after $(sql_escape
"$running_stale_seconds")s, PR exists') — at minimum wrap sr_started with
sql_escape when constructing the state_log INSERT in pulse.sh.
- Around line 352-360: The script directly interpolates
SUPERVISOR_RUNNING_STALE_SECONDS into running_stale_seconds and then into the
SQL in the db call (see variable running_stale_seconds and the
stale_running_tasks db query), so add a defensive numeric validation: ensure
running_stale_seconds contains only digits (optionally allow leading +) and fall
back to the default 3600 if not numeric or empty; then use that validated value
in the db SQL interpolation to avoid malformed queries.

coderabbitai · 2026-02-18T22:10:41Z

.agents/scripts/supervisor/pulse.sh

+			if [[ -n "$sr_pr_url" ]]; then
+				log_info "  Phase 0.8: $sr_id → pr_review (has PR, worker died after running)"
+				cmd_transition "$sr_id" "pr_review" --pr-url "$sr_pr_url" \
+					--error "Stale running recovery (Phase 0.8/t1193): worker died after ${running_stale_seconds}s, PR exists" \
+					2>>"$SUPERVISOR_LOG" || true
+				db "$SUPERVISOR_DB" "
+					INSERT INTO state_log (task_id, from_state, to_state, reason)
+					VALUES ('$(sql_escape "$sr_id")', 'running', 'pr_review',
+						'Phase 0.8 stale_running_no_process (t1193): started_at=${sr_started}, no live worker after ${running_stale_seconds}s, PR exists');
+				" 2>/dev/null || true
+			else
+				# No PR — mark failed with stale_running_no_process reason
+				cmd_transition "$sr_id" "failed" \
+					--error "stale_running_no_process: worker terminated without recording exit after ${running_stale_seconds}s (Phase 0.8/t1193)" \
+					2>>"$SUPERVISOR_LOG" || true
+				db "$SUPERVISOR_DB" "
+					INSERT INTO state_log (task_id, from_state, to_state, reason)
+					VALUES ('$(sql_escape "$sr_id")', 'running', 'failed',
+						'Phase 0.8 stale_running_no_process (t1193): started_at=${sr_started}, no live worker after ${running_stale_seconds}s');
+				" 2>/dev/null || true
+				# Clean up worker process tree (zombie children)
+				cleanup_worker_processes "$sr_id" 2>>"$SUPERVISOR_LOG" || true
+				# Attempt self-heal for retry eligibility
+				attempt_self_heal "$sr_id" "failed" \
+					"stale_running_no_process: worker terminated without recording exit" \
+					"${batch_id:-}" 2>>"$SUPERVISOR_LOG" || true
+			fi


⚠️ Potential issue | 🟡 Minor

cleanup_worker_processes is missing for the PR-exists path — potential zombie child leak.

Phase 0.7 (line 322) calls cleanup_worker_processes unconditionally after the transition branch. Phase 0.8 only calls it in the no-PR else branch (line 417), skipping it when routing to pr_review. If the dead worker spawned child processes (shellcheck, node, bash-language-server), those zombies won't be reaped for tasks that have a PR.

Proposed fix: call cleanup unconditionally after the branch

if [[ -n "$sr_pr_url" ]]; then log_info " Phase 0.8: $sr_id → pr_review (has PR, worker died after running)" cmd_transition "$sr_id" "pr_review" --pr-url "$sr_pr_url" \ --error "Stale running recovery (Phase 0.8/t1193): worker died after ${running_stale_seconds}s, PR exists" \ 2>>"$SUPERVISOR_LOG" || true db "$SUPERVISOR_DB" " INSERT INTO state_log (task_id, from_state, to_state, reason) VALUES ('$(sql_escape "$sr_id")', 'running', 'pr_review', 'Phase 0.8 stale_running_no_process (t1193): started_at=${sr_started}, no live worker after ${running_stale_seconds}s, PR exists'); " 2>/dev/null || true + # Clean up worker process tree (zombie children) + cleanup_worker_processes "$sr_id" 2>>"$SUPERVISOR_LOG" || true else # No PR — mark failed with stale_running_no_process reason cmd_transition "$sr_id" "failed" \ --error "stale_running_no_process: worker terminated without recording exit after ${running_stale_seconds}s (Phase 0.8/t1193)" \ 2>>"$SUPERVISOR_LOG" || true db "$SUPERVISOR_DB" " INSERT INTO state_log (task_id, from_state, to_state, reason) VALUES ('$(sql_escape "$sr_id")', 'running', 'failed', 'Phase 0.8 stale_running_no_process (t1193): started_at=${sr_started}, no live worker after ${running_stale_seconds}s'); " 2>/dev/null || true # Clean up worker process tree (zombie children) cleanup_worker_processes "$sr_id" 2>>"$SUPERVISOR_LOG" || true # Attempt self-heal for retry eligibility attempt_self_heal "$sr_id" "failed" \ "stale_running_no_process: worker terminated without recording exit" \ "${batch_id:-}" 2>>"$SUPERVISOR_LOG" || true fi

Alternatively, move the cleanup call after the fi (like Phase 0.7 line 322) to avoid duplication.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In @.agents/scripts/supervisor/pulse.sh around lines 396 - 422, The PR-exists branch in Phase 0.8 skips calling cleanup_worker_processes, risking zombie child processes; update the logic so cleanup_worker_processes "$sr_id" is invoked regardless of whether sr_pr_url is set (i.e., either add cleanup_worker_processes "$sr_id" before the fi or move the call out of the else branch), keeping existing logging redirection (2>>"$SUPERVISOR_LOG" || true) and preserving surrounding transitions (cmd_transition "$sr_id" "pr_review" ... and the DB inserts) so process trees are always reaped after handling the state transition.

coderabbitai · 2026-02-18T22:10:41Z

tests/test-supervisor-state-machine.sh

+# Test 3: recently started running task (within timeout) is NOT recovered
+sup add test-t1193c --repo /tmp/test --description "Recent running task" >/dev/null
+sup transition test-t1193c dispatched >/dev/null
+sup transition test-t1193c running >/dev/null
+# started_at is recent (default) — should NOT be touched by Phase 0.8
+SUPERVISOR_RUNNING_STALE_SECONDS=3600 sup pulse 2>/dev/null || true
+t1193c_status=$(get_status test-t1193c)
+if [[ "$t1193c_status" == "running" ]]; then
+	pass "Phase 0.8: recently started running task not falsely recovered (t1193)"
+else
+	fail "Phase 0.8: recently started running task was incorrectly recovered" "status=$t1193c_status (expected running)"
+fi


⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

# First, let's find the supervisor pulse script and examine Phase 1 logic find . -name "pulse.sh" -type f

Repository: marcusquinn/aidevops

Length of output: 101

🏁 Script executed:

# Get the directory structure to understand the repo layout ls -la | head -20 git ls-files | grep -E "(supervisor|pulse)" | head -20

Repository: marcusquinn/aidevops

Length of output: 2077

🏁 Script executed:

# Let's examine the pulse.sh file to verify Phase 1 and Phase 4b logic wc -l .agents/scripts/supervisor/pulse.sh

Repository: marcusquinn/aidevops

Length of output: 104

🏁 Script executed:

# Look for Phase 1 logic and queries around line 442 sed -n '430,500p' .agents/scripts/supervisor/pulse.sh

Repository: marcusquinn/aidevops

Length of output: 2604

🏁 Script executed:

# Look for Phase 4b logic around line 1607 sed -n '1595,1650p' .agents/scripts/supervisor/pulse.sh

Repository: marcusquinn/aidevops

Length of output: 2771

🏁 Script executed:

# Now let's find Phase 0.8 to understand when it runs sed -n '300,450p' .agents/scripts/supervisor/pulse.sh | grep -A 20 "Phase 0.8"

Repository: marcusquinn/aidevops

Length of output: 4015

🏁 Script executed:

# Let's also examine the test code around lines 1799-1810 sed -n '1790,1825p' tests/test-supervisor-state-machine.sh

Repository: marcusquinn/aidevops

Length of output: 1815

🏁 Script executed:

# Find what get_status does and what the test setup looks like grep -n "get_status\|test-t1193c" tests/test-supervisor-state-machine.sh | head -20

Repository: marcusquinn/aidevops

Length of output: 1236

🏁 Script executed:

# Let's verify get_status implementation to confirm it returns the database status sed -n '70,90p' tests/test-supervisor-state-machine.sh

Repository: marcusquinn/aidevops

Length of output: 584

🏁 Script executed:

# Also verify cmd_transition is called in Phase 1 to confirm state change grep -n "cmd_transition.*evaluating" .agents/scripts/supervisor/pulse.sh | head -5

Repository: marcusquinn/aidevops

Length of output: 638

Test 3 is fragile — Phase 1 and Phase 4b will interfere with the assertion.

This test expects test-t1193c to remain in running after a full pulse, proving Phase 0.8 skipped it. However:

Phase 0.8 (lines 350–430) correctly selects only tasks where status = 'running' AND started_at < now - 3600s, so it will skip your recently created task. ✓

Phase 1 (line 442) selects ALL tasks where status IN ('running', 'dispatched', 'evaluating') with no grace period. Since test-t1193c has no PID file, Phase 1 detects is_alive=false at line 481 and immediately transitions it to evaluating via cmd_transition at line 484.

Phase 4b (lines 1607–1625) also catches running tasks with no PID file and evaluates them.

Your assertion t1193c_status == "running" will fail because the database status changes to evaluating (or failed if evaluation fails) after Phase 1/4b processing.

To isolate Phase 0.8's skip behavior, the test needs a live PID file pointing to a real process (e.g., $$ or a sleep background job) so Phases 1 and 4b also skip it:

Proposed fix: create a PID file with a live process

# Test 3: recently started running task (within timeout) is NOT recovered sup add test-t1193c --repo /tmp/test --description "Recent running task" >/dev/null sup transition test-t1193c dispatched >/dev/null sup transition test-t1193c running >/dev/null # started_at is recent (default) — should NOT be touched by Phase 0.8 +# Create a PID file with a live process so Phase 1/4b also skip this task +sleep 300 & +local_bg_pid=$! +mkdir -p "$TEST_DIR/pids" +echo "$local_bg_pid" > "$TEST_DIR/pids/test-t1193c.pid" SUPERVISOR_RUNNING_STALE_SECONDS=3600 sup pulse 2>/dev/null || true +kill "$local_bg_pid" 2>/dev/null || true +rm -f "$TEST_DIR/pids/test-t1193c.pid" t1193c_status=$(get_status test-t1193c) if [[ "$t1193c_status" == "running" ]]; then pass "Phase 0.8: recently started running task not falsely recovered (t1193)" else fail "Phase 0.8: recently started running task was incorrectly recovered" "status=$t1193c_status (expected running)" fi

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

# Test 3: recently started running task (within timeout) is NOT recovered

sup add test-t1193c --repo /tmp/test --description "Recent running task" >/dev/null

sup transition test-t1193c dispatched >/dev/null

sup transition test-t1193c running >/dev/null

# started_at is recent (default) — should NOT be touched by Phase 0.8

SUPERVISOR_RUNNING_STALE_SECONDS=3600 sup pulse 2>/dev/null || true

t1193c_status=$(get_status test-t1193c)

if [[ "$t1193c_status" == "running" ]]; then

pass "Phase 0.8: recently started running task not falsely recovered (t1193)"

else

fail "Phase 0.8: recently started running task was incorrectly recovered" "status=$t1193c_status (expected running)"

fi

# Test 3: recently started running task (within timeout) is NOT recovered

sup add test-t1193c --repo /tmp/test --description "Recent running task" >/dev/null

sup transition test-t1193c dispatched >/dev/null

sup transition test-t1193c running >/dev/null

# started_at is recent (default) — should NOT be touched by Phase 0.8

# Create a PID file with a live process so Phase 1/4b also skip this task

sleep 300 &

local_bg_pid=$!

mkdir -p "$TEST_DIR/pids"

echo "$local_bg_pid" > "$TEST_DIR/pids/test-t1193c.pid"

SUPERVISOR_RUNNING_STALE_SECONDS=3600 sup pulse 2>/dev/null || true

kill "$local_bg_pid" 2>/dev/null || true

rm -f "$TEST_DIR/pids/test-t1193c.pid"

t1193c_status=$(get_status test-t1193c)

if [[ "$t1193c_status" == "running" ]]; then

pass "Phase 0.8: recently started running task not falsely recovered (t1193)"

else

fail "Phase 0.8: recently started running task was incorrectly recovered" "status=$t1193c_status (expected running)"

fi

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@tests/test-supervisor-state-machine.sh` around lines 1799 - 1810, The test is failing because Phase 1 / Phase 4b detect a missing PID and transition the task after Phase 0.8 skips it; fix by creating a live PID file for test-t1193c immediately after marking it running so both Phase 1 and Phase 4b also consider it alive. After calling "sup transition test-t1193c running", write a PID file containing a live PID (use $$ or start a short-lived sleep background process and write its PID) to the supervisor's expected PID-file location for task "test-t1193c" so that the code paths in Phase 1 (the is_alive check and cmd_transition) and Phase 4b will skip acting on this task; then run the SUPERVISOR_RUNNING_STALE_SECONDS=3600 sup pulse and assert get_status still returns "running".

Auto-dismissed: bot review does not block autonomous pipeline

coderabbitai bot previously requested changes Feb 18, 2026

View reviewed changes

marcusquinn mentioned this pull request Feb 18, 2026

[Supervisor:marcusquinn] 0 queued, 0 working, 1 in review at 14:46 UTC #1502

Closed

marcusquinn merged commit f5b4d68 into main Feb 18, 2026
19 checks passed

marcusquinn deleted the feature/t1193 branch February 18, 2026 22:33

marcusquinn added a commit that referenced this pull request Feb 18, 2026

chore: mark t1193 complete in TODO.md (#1813)

16fb64c

marcusquinn mentioned this pull request Feb 18, 2026

t1193: Reconcile supervisor DB running count with actual worker state #1804

Closed

coderabbitai bot mentioned this pull request Feb 19, 2026

t1258: fix stale evaluating recovery false positives in Phase 0.7 #1966

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

t1193: Reconcile supervisor DB running count with actual worker state#1813

t1193: Reconcile supervisor DB running count with actual worker state#1813
marcusquinn merged 1 commit intomainfrom
feature/t1193

marcusquinn commented Feb 18, 2026 •

edited by coderabbitai bot

Loading

Uh oh!

gemini-code-assist bot commented Feb 18, 2026

Uh oh!

coderabbitai bot commented Feb 18, 2026 •

edited

Loading

Uh oh!

github-actions bot commented Feb 18, 2026

Uh oh!

sonarqubecloud bot commented Feb 18, 2026

Uh oh!

coderabbitai bot left a comment

Uh oh!

coderabbitai bot Feb 18, 2026

Uh oh!

coderabbitai bot Feb 18, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

marcusquinn commented Feb 18, 2026 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Root Cause

Changes

.agents/scripts/supervisor/pulse.sh

tests/test-supervisor-state-machine.sh

Verification

Design Decisions

Summary by CodeRabbit

Release Notes

Uh oh!

gemini-code-assist bot commented Feb 18, 2026

Uh oh!

coderabbitai bot commented Feb 18, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

Poem

Uh oh!

github-actions bot commented Feb 18, 2026

🔍 Code Quality Report

📈 Current Quality Metrics

Uh oh!

sonarqubecloud bot commented Feb 18, 2026

Quality Gate passed

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Feb 18, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Feb 18, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

marcusquinn commented Feb 18, 2026 •

edited by coderabbitai bot

Loading

`.agents/scripts/supervisor/pulse.sh`

`tests/test-supervisor-state-machine.sh`

coderabbitai bot commented Feb 18, 2026 •

edited

Loading