Skip to content

t1193: Reconcile supervisor DB running count with actual worker state#1813

Merged
marcusquinn merged 1 commit intomainfrom
feature/t1193
Feb 18, 2026
Merged

t1193: Reconcile supervisor DB running count with actual worker state#1813
marcusquinn merged 1 commit intomainfrom
feature/t1193

Conversation

@marcusquinn
Copy link
Owner

@marcusquinn marcusquinn commented Feb 18, 2026

Summary

  • Adds Phase 0.8 to the supervisor pulse cycle to detect and recover running tasks whose worker processes have terminated without recording their exit
  • Uses started_at (immutable after dispatch) instead of updated_at (refreshed by other DB operations) to reliably identify stale running tasks
  • Transitions orphaned tasks to failed with reason stale_running_no_process, or to pr_review if a PR already exists

Root Cause

Phase 0.7 uses updated_at for staleness detection. Other DB operations (reconciliation sweeps, state_log writes, Phase 7b bidirectional sync) can refresh updated_at while the worker is dead, causing Phase 0.7 to miss tasks that have been stuck in running for hours. The result: the supervisor DB reports tasks in running status with no active workers.

Changes

.agents/scripts/supervisor/pulse.sh

Added Phase 0.8 between Phase 0.7 and Phase 1:

Phase 0.8: Stale 'running' task recovery using started_at (t1193)
  • SQL: WHERE status='running' AND started_at < now - SUPERVISOR_RUNNING_STALE_SECONDS
  • PID liveness check: kill -0 $pid — skips tasks with live workers
  • Recovery paths:
    • Has PR → pr_review (work done, only worker died)
    • No PR → failed + attempt_self_heal + cleanup_worker_processes
  • Logs to state_log with reason stale_running_no_process
  • Records to pattern-tracker-helper.sh for observability
  • Configurable: SUPERVISOR_RUNNING_STALE_SECONDS (default 3600 = 1h)

tests/test-supervisor-state-machine.sh

Added 3 tests for Phase 0.8:

  1. Stale running task (no PID, old started_at) → recovered to failed/queued
  2. Stale running task with PR → routed to pr_review
  3. Recently started running task → NOT falsely recovered

Verification

  • ShellCheck: zero violations on both modified files
  • Unit tests: 3/3 pass (SQL query logic, state_log recording)
  • Phase 0.8 recovery confirmed in pulse output:
    Phase 0.8: Orphaned running task test-t1193a (started: ..., no live worker after 0s)
    Phase 0.8: Recovered 1 stale running task(s) (0 still alive)
    

Design Decisions

  • started_at not updated_at: started_at is set once at dispatch and never refreshed — immune to the DB churn that defeats Phase 0.7
  • 1h default: Tasks lack a duration estimate field, so 2x-expected-duration is not computable. 1h is conservative enough to avoid false positives while catching the reported scenario
  • Only running status: dispatched and evaluating are already covered by Phase 0.7 and Phase 1c respectively — no overlap

Ref #1804

Summary by CodeRabbit

Release Notes

  • Bug Fixes

    • Tasks that remain in running state without an active worker are now automatically detected and recovered.
    • Stale orphaned tasks are transitioned to appropriate states (failed, queued, or pr_review) based on configuration.
  • Tests

    • Added test coverage for stale task detection and recovery scenarios.

Phase 0.7 uses updated_at for staleness detection, which can be refreshed
by other DB operations (reconciliation, state_log writes) while a worker
is dead. This causes the supervisor DB to report tasks in 'running' status
with no active workers — stale state from workers that terminated without
recording their exit.

Phase 0.8 fixes this by using started_at (set once at dispatch, never
refreshed) to detect running tasks that have exceeded a wall-clock timeout
with no live worker process:

- Queries tasks WHERE status='running' AND started_at < now - timeout
- Cross-references against PID files + kill -0 liveness check
- Tasks with a PR → pr_review (work may be done, only worker died)
- Tasks without a PR → failed with reason 'stale_running_no_process'
- Logs recovery to state_log and pattern-tracker for observability
- Configurable via SUPERVISOR_RUNNING_STALE_SECONDS (default 3600 = 1h)
- ShellCheck: zero violations

Chose started_at over updated_at because started_at is immutable after
dispatch — matches existing patterns in Phase 4 hung-worker detection.
Chose 1h default (not 2x expected duration) because tasks lack a duration
estimate field; 1h is conservative enough to avoid false positives while
catching the reported scenario (3 tasks stuck for hours).

Ref #1804
@gemini-code-assist
Copy link

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

@coderabbitai
Copy link
Contributor

coderabbitai bot commented Feb 18, 2026

Walkthrough

Introduces Phase 0.8 to the supervisor pulse cycle for detecting and recovering stale running tasks. The phase checks task started_at timestamps against a timeout threshold, verifies worker process liveness via PID files, and routes tasks to pr_review or failed states with corresponding cleanup and state logging.

Changes

Cohort / File(s) Summary
Stale-Running Recovery Phase
.agents/scripts/supervisor/pulse.sh
Implements Phase 0.8 logic to detect orphaned tasks in running state, check worker liveness via PID files, transition stale tasks to pr_review (with PR URL) or failed (without PR URL), perform cleanup, and record state transitions and patterns.
Phase 0.8 Test Suite
tests/test-supervisor-state-machine.sh
Adds three validation tests: stale task without PID recovery to failed/queued, stale task with PR URL routing to pr_review, and recent task remaining running. Uses environment variable override (SUPERVISOR_RUNNING_STALE_SECONDS=0) to force immediate detection.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

Poem

🔍 Stale runners fade in the orphaned night,
Phase 0.8 brings them back to light,
Check the PID, check the time,
Route to review or mark as crime,
Self-heal flows through the DevOps line ✨

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly and specifically describes the main objective: reconciling supervisor database running count with actual worker state through Phase 0.8 stale-running recovery logic.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch feature/t1193

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@github-actions
Copy link

🔍 Code Quality Report

�[0;35m[MONITOR]�[0m Code Review Monitoring Report

�[0;34m[INFO]�[0m Latest Quality Status:
SonarCloud: 0 bugs, 0 vulnerabilities, 27 code smells

�[0;34m[INFO]�[0m Recent monitoring activity:
Wed Feb 18 22:05:40 UTC 2026: Code review monitoring started
Wed Feb 18 22:05:41 UTC 2026: SonarCloud - Bugs: 0, Vulnerabilities: 0, Code Smells: 27

📈 Current Quality Metrics

  • BUGS: 0
  • CODE SMELLS: 27
  • VULNERABILITIES: 0

Generated on: Wed Feb 18 22:05:43 UTC 2026


Generated by AI DevOps Framework Code Review Monitoring

@sonarqubecloud
Copy link

coderabbitai[bot]
coderabbitai bot previously requested changes Feb 18, 2026
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (2)
.agents/scripts/supervisor/pulse.sh (2)

401-405: sr_started and running_stale_seconds are interpolated into the state_log reason string without escaping.

While sr_started is a DB-sourced ISO timestamp and running_stale_seconds is numeric (both unlikely to contain single quotes), the sql_escape utility is already in use for sr_id on the same line. For consistency with Phase 0.7's pattern and to maintain that zero-technical-debt grade, consider escaping them too — or at minimum sr_started since it's external data flowing back into a new INSERT.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.agents/scripts/supervisor/pulse.sh around lines 401 - 405, Escape
interpolated variables sr_started (and optionally running_stale_seconds) before
embedding them into the SQL INSERT reason string to avoid unescaped external
data; use the existing sql_escape helper the same way sr_id is escaped so the
VALUES line becomes VALUES ('$(sql_escape "$sr_id")', 'running', 'pr_review',
'Phase 0.8 stale_running_no_process (t1193): started_at=$(sql_escape
"$sr_started"), no live worker after $(sql_escape "$running_stale_seconds")s, PR
exists') — at minimum wrap sr_started with sql_escape when constructing the
state_log INSERT in pulse.sh.

352-360: Consider validating running_stale_seconds is numeric before SQL interpolation.

The variable is interpolated directly into the SQL query (line 358). While this mirrors Phase 0.7's stale_grace_seconds pattern (line 243), a non-numeric SUPERVISOR_RUNNING_STALE_SECONDS env var value would produce a broken SQL statement. Both phases share this concern.

Optional defensive check
 	local running_stale_seconds="${SUPERVISOR_RUNNING_STALE_SECONDS:-3600}"
+	# Ensure numeric to prevent SQL injection via env var
+	[[ "$running_stale_seconds" =~ ^[0-9]+$ ]] || running_stale_seconds=3600
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.agents/scripts/supervisor/pulse.sh around lines 352 - 360, The script
directly interpolates SUPERVISOR_RUNNING_STALE_SECONDS into
running_stale_seconds and then into the SQL in the db call (see variable
running_stale_seconds and the stale_running_tasks db query), so add a defensive
numeric validation: ensure running_stale_seconds contains only digits
(optionally allow leading +) and fall back to the default 3600 if not numeric or
empty; then use that validated value in the db SQL interpolation to avoid
malformed queries.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.agents/scripts/supervisor/pulse.sh:
- Around line 396-422: The PR-exists branch in Phase 0.8 skips calling
cleanup_worker_processes, risking zombie child processes; update the logic so
cleanup_worker_processes "$sr_id" is invoked regardless of whether sr_pr_url is
set (i.e., either add cleanup_worker_processes "$sr_id" before the fi or move
the call out of the else branch), keeping existing logging redirection
(2>>"$SUPERVISOR_LOG" || true) and preserving surrounding transitions
(cmd_transition "$sr_id" "pr_review" ... and the DB inserts) so process trees
are always reaped after handling the state transition.

In `@tests/test-supervisor-state-machine.sh`:
- Around line 1799-1810: The test is failing because Phase 1 / Phase 4b detect a
missing PID and transition the task after Phase 0.8 skips it; fix by creating a
live PID file for test-t1193c immediately after marking it running so both Phase
1 and Phase 4b also consider it alive. After calling "sup transition test-t1193c
running", write a PID file containing a live PID (use $$ or start a short-lived
sleep background process and write its PID) to the supervisor's expected
PID-file location for task "test-t1193c" so that the code paths in Phase 1 (the
is_alive check and cmd_transition) and Phase 4b will skip acting on this task;
then run the SUPERVISOR_RUNNING_STALE_SECONDS=3600 sup pulse and assert
get_status still returns "running".

---

Nitpick comments:
In @.agents/scripts/supervisor/pulse.sh:
- Around line 401-405: Escape interpolated variables sr_started (and optionally
running_stale_seconds) before embedding them into the SQL INSERT reason string
to avoid unescaped external data; use the existing sql_escape helper the same
way sr_id is escaped so the VALUES line becomes VALUES ('$(sql_escape
"$sr_id")', 'running', 'pr_review', 'Phase 0.8 stale_running_no_process (t1193):
started_at=$(sql_escape "$sr_started"), no live worker after $(sql_escape
"$running_stale_seconds")s, PR exists') — at minimum wrap sr_started with
sql_escape when constructing the state_log INSERT in pulse.sh.
- Around line 352-360: The script directly interpolates
SUPERVISOR_RUNNING_STALE_SECONDS into running_stale_seconds and then into the
SQL in the db call (see variable running_stale_seconds and the
stale_running_tasks db query), so add a defensive numeric validation: ensure
running_stale_seconds contains only digits (optionally allow leading +) and fall
back to the default 3600 if not numeric or empty; then use that validated value
in the db SQL interpolation to avoid malformed queries.

Comment on lines +396 to +422
if [[ -n "$sr_pr_url" ]]; then
log_info " Phase 0.8: $sr_id → pr_review (has PR, worker died after running)"
cmd_transition "$sr_id" "pr_review" --pr-url "$sr_pr_url" \
--error "Stale running recovery (Phase 0.8/t1193): worker died after ${running_stale_seconds}s, PR exists" \
2>>"$SUPERVISOR_LOG" || true
db "$SUPERVISOR_DB" "
INSERT INTO state_log (task_id, from_state, to_state, reason)
VALUES ('$(sql_escape "$sr_id")', 'running', 'pr_review',
'Phase 0.8 stale_running_no_process (t1193): started_at=${sr_started}, no live worker after ${running_stale_seconds}s, PR exists');
" 2>/dev/null || true
else
# No PR — mark failed with stale_running_no_process reason
cmd_transition "$sr_id" "failed" \
--error "stale_running_no_process: worker terminated without recording exit after ${running_stale_seconds}s (Phase 0.8/t1193)" \
2>>"$SUPERVISOR_LOG" || true
db "$SUPERVISOR_DB" "
INSERT INTO state_log (task_id, from_state, to_state, reason)
VALUES ('$(sql_escape "$sr_id")', 'running', 'failed',
'Phase 0.8 stale_running_no_process (t1193): started_at=${sr_started}, no live worker after ${running_stale_seconds}s');
" 2>/dev/null || true
# Clean up worker process tree (zombie children)
cleanup_worker_processes "$sr_id" 2>>"$SUPERVISOR_LOG" || true
# Attempt self-heal for retry eligibility
attempt_self_heal "$sr_id" "failed" \
"stale_running_no_process: worker terminated without recording exit" \
"${batch_id:-}" 2>>"$SUPERVISOR_LOG" || true
fi
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

cleanup_worker_processes is missing for the PR-exists path — potential zombie child leak.

Phase 0.7 (line 322) calls cleanup_worker_processes unconditionally after the transition branch. Phase 0.8 only calls it in the no-PR else branch (line 417), skipping it when routing to pr_review. If the dead worker spawned child processes (shellcheck, node, bash-language-server), those zombies won't be reaped for tasks that have a PR.

Proposed fix: call cleanup unconditionally after the branch
 			if [[ -n "$sr_pr_url" ]]; then
 				log_info "  Phase 0.8: $sr_id → pr_review (has PR, worker died after running)"
 				cmd_transition "$sr_id" "pr_review" --pr-url "$sr_pr_url" \
 					--error "Stale running recovery (Phase 0.8/t1193): worker died after ${running_stale_seconds}s, PR exists" \
 					2>>"$SUPERVISOR_LOG" || true
 				db "$SUPERVISOR_DB" "
 					INSERT INTO state_log (task_id, from_state, to_state, reason)
 					VALUES ('$(sql_escape "$sr_id")', 'running', 'pr_review',
 						'Phase 0.8 stale_running_no_process (t1193): started_at=${sr_started}, no live worker after ${running_stale_seconds}s, PR exists');
 				" 2>/dev/null || true
+				# Clean up worker process tree (zombie children)
+				cleanup_worker_processes "$sr_id" 2>>"$SUPERVISOR_LOG" || true
 			else
 				# No PR — mark failed with stale_running_no_process reason
 				cmd_transition "$sr_id" "failed" \
 					--error "stale_running_no_process: worker terminated without recording exit after ${running_stale_seconds}s (Phase 0.8/t1193)" \
 					2>>"$SUPERVISOR_LOG" || true
 				db "$SUPERVISOR_DB" "
 					INSERT INTO state_log (task_id, from_state, to_state, reason)
 					VALUES ('$(sql_escape "$sr_id")', 'running', 'failed',
 						'Phase 0.8 stale_running_no_process (t1193): started_at=${sr_started}, no live worker after ${running_stale_seconds}s');
 				" 2>/dev/null || true
 				# Clean up worker process tree (zombie children)
 				cleanup_worker_processes "$sr_id" 2>>"$SUPERVISOR_LOG" || true
 				# Attempt self-heal for retry eligibility
 				attempt_self_heal "$sr_id" "failed" \
 					"stale_running_no_process: worker terminated without recording exit" \
 					"${batch_id:-}" 2>>"$SUPERVISOR_LOG" || true
 			fi

Alternatively, move the cleanup call after the fi (like Phase 0.7 line 322) to avoid duplication.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.agents/scripts/supervisor/pulse.sh around lines 396 - 422, The PR-exists
branch in Phase 0.8 skips calling cleanup_worker_processes, risking zombie child
processes; update the logic so cleanup_worker_processes "$sr_id" is invoked
regardless of whether sr_pr_url is set (i.e., either add
cleanup_worker_processes "$sr_id" before the fi or move the call out of the else
branch), keeping existing logging redirection (2>>"$SUPERVISOR_LOG" || true) and
preserving surrounding transitions (cmd_transition "$sr_id" "pr_review" ... and
the DB inserts) so process trees are always reaped after handling the state
transition.

Comment on lines +1799 to +1810
# Test 3: recently started running task (within timeout) is NOT recovered
sup add test-t1193c --repo /tmp/test --description "Recent running task" >/dev/null
sup transition test-t1193c dispatched >/dev/null
sup transition test-t1193c running >/dev/null
# started_at is recent (default) — should NOT be touched by Phase 0.8
SUPERVISOR_RUNNING_STALE_SECONDS=3600 sup pulse 2>/dev/null || true
t1193c_status=$(get_status test-t1193c)
if [[ "$t1193c_status" == "running" ]]; then
pass "Phase 0.8: recently started running task not falsely recovered (t1193)"
else
fail "Phase 0.8: recently started running task was incorrectly recovered" "status=$t1193c_status (expected running)"
fi
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

# First, let's find the supervisor pulse script and examine Phase 1 logic
find . -name "pulse.sh" -type f

Repository: marcusquinn/aidevops

Length of output: 101


🏁 Script executed:

# Get the directory structure to understand the repo layout
ls -la | head -20
git ls-files | grep -E "(supervisor|pulse)" | head -20

Repository: marcusquinn/aidevops

Length of output: 2077


🏁 Script executed:

# Let's examine the pulse.sh file to verify Phase 1 and Phase 4b logic
wc -l .agents/scripts/supervisor/pulse.sh

Repository: marcusquinn/aidevops

Length of output: 104


🏁 Script executed:

# Look for Phase 1 logic and queries around line 442
sed -n '430,500p' .agents/scripts/supervisor/pulse.sh

Repository: marcusquinn/aidevops

Length of output: 2604


🏁 Script executed:

# Look for Phase 4b logic around line 1607
sed -n '1595,1650p' .agents/scripts/supervisor/pulse.sh

Repository: marcusquinn/aidevops

Length of output: 2771


🏁 Script executed:

# Now let's find Phase 0.8 to understand when it runs
sed -n '300,450p' .agents/scripts/supervisor/pulse.sh | grep -A 20 "Phase 0.8"

Repository: marcusquinn/aidevops

Length of output: 4015


🏁 Script executed:

# Let's also examine the test code around lines 1799-1810
sed -n '1790,1825p' tests/test-supervisor-state-machine.sh

Repository: marcusquinn/aidevops

Length of output: 1815


🏁 Script executed:

# Find what get_status does and what the test setup looks like
grep -n "get_status\|test-t1193c" tests/test-supervisor-state-machine.sh | head -20

Repository: marcusquinn/aidevops

Length of output: 1236


🏁 Script executed:

# Let's verify get_status implementation to confirm it returns the database status
sed -n '70,90p' tests/test-supervisor-state-machine.sh

Repository: marcusquinn/aidevops

Length of output: 584


🏁 Script executed:

# Also verify cmd_transition is called in Phase 1 to confirm state change
grep -n "cmd_transition.*evaluating" .agents/scripts/supervisor/pulse.sh | head -5

Repository: marcusquinn/aidevops

Length of output: 638


Test 3 is fragile — Phase 1 and Phase 4b will interfere with the assertion.

This test expects test-t1193c to remain in running after a full pulse, proving Phase 0.8 skipped it. However:

  • Phase 0.8 (lines 350–430) correctly selects only tasks where status = 'running' AND started_at < now - 3600s, so it will skip your recently created task. ✓
  • Phase 1 (line 442) selects ALL tasks where status IN ('running', 'dispatched', 'evaluating') with no grace period. Since test-t1193c has no PID file, Phase 1 detects is_alive=false at line 481 and immediately transitions it to evaluating via cmd_transition at line 484.
  • Phase 4b (lines 1607–1625) also catches running tasks with no PID file and evaluates them.

Your assertion t1193c_status == "running" will fail because the database status changes to evaluating (or failed if evaluation fails) after Phase 1/4b processing.

To isolate Phase 0.8's skip behavior, the test needs a live PID file pointing to a real process (e.g., $$ or a sleep background job) so Phases 1 and 4b also skip it:

Proposed fix: create a PID file with a live process
 # Test 3: recently started running task (within timeout) is NOT recovered
 sup add test-t1193c --repo /tmp/test --description "Recent running task" >/dev/null
 sup transition test-t1193c dispatched >/dev/null
 sup transition test-t1193c running >/dev/null
 # started_at is recent (default) — should NOT be touched by Phase 0.8
+# Create a PID file with a live process so Phase 1/4b also skip this task
+sleep 300 &
+local_bg_pid=$!
+mkdir -p "$TEST_DIR/pids"
+echo "$local_bg_pid" > "$TEST_DIR/pids/test-t1193c.pid"
 SUPERVISOR_RUNNING_STALE_SECONDS=3600 sup pulse 2>/dev/null || true
+kill "$local_bg_pid" 2>/dev/null || true
+rm -f "$TEST_DIR/pids/test-t1193c.pid"
 t1193c_status=$(get_status test-t1193c)
 if [[ "$t1193c_status" == "running" ]]; then
 	pass "Phase 0.8: recently started running task not falsely recovered (t1193)"
 else
 	fail "Phase 0.8: recently started running task was incorrectly recovered" "status=$t1193c_status (expected running)"
 fi
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
# Test 3: recently started running task (within timeout) is NOT recovered
sup add test-t1193c --repo /tmp/test --description "Recent running task" >/dev/null
sup transition test-t1193c dispatched >/dev/null
sup transition test-t1193c running >/dev/null
# started_at is recent (default) — should NOT be touched by Phase 0.8
SUPERVISOR_RUNNING_STALE_SECONDS=3600 sup pulse 2>/dev/null || true
t1193c_status=$(get_status test-t1193c)
if [[ "$t1193c_status" == "running" ]]; then
pass "Phase 0.8: recently started running task not falsely recovered (t1193)"
else
fail "Phase 0.8: recently started running task was incorrectly recovered" "status=$t1193c_status (expected running)"
fi
# Test 3: recently started running task (within timeout) is NOT recovered
sup add test-t1193c --repo /tmp/test --description "Recent running task" >/dev/null
sup transition test-t1193c dispatched >/dev/null
sup transition test-t1193c running >/dev/null
# started_at is recent (default) — should NOT be touched by Phase 0.8
# Create a PID file with a live process so Phase 1/4b also skip this task
sleep 300 &
local_bg_pid=$!
mkdir -p "$TEST_DIR/pids"
echo "$local_bg_pid" > "$TEST_DIR/pids/test-t1193c.pid"
SUPERVISOR_RUNNING_STALE_SECONDS=3600 sup pulse 2>/dev/null || true
kill "$local_bg_pid" 2>/dev/null || true
rm -f "$TEST_DIR/pids/test-t1193c.pid"
t1193c_status=$(get_status test-t1193c)
if [[ "$t1193c_status" == "running" ]]; then
pass "Phase 0.8: recently started running task not falsely recovered (t1193)"
else
fail "Phase 0.8: recently started running task was incorrectly recovered" "status=$t1193c_status (expected running)"
fi
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/test-supervisor-state-machine.sh` around lines 1799 - 1810, The test is
failing because Phase 1 / Phase 4b detect a missing PID and transition the task
after Phase 0.8 skips it; fix by creating a live PID file for test-t1193c
immediately after marking it running so both Phase 1 and Phase 4b also consider
it alive. After calling "sup transition test-t1193c running", write a PID file
containing a live PID (use $$ or start a short-lived sleep background process
and write its PID) to the supervisor's expected PID-file location for task
"test-t1193c" so that the code paths in Phase 1 (the is_alive check and
cmd_transition) and Phase 4b will skip acting on this task; then run the
SUPERVISOR_RUNNING_STALE_SECONDS=3600 sup pulse and assert get_status still
returns "running".

@marcusquinn marcusquinn dismissed coderabbitai[bot]’s stale review February 18, 2026 22:22

Auto-dismissed: bot review does not block autonomous pipeline

@marcusquinn marcusquinn merged commit f5b4d68 into main Feb 18, 2026
19 checks passed
@marcusquinn marcusquinn deleted the feature/t1193 branch February 18, 2026 22:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant