Skip to content

t1301: fix rc=143 SIGTERM kills during AI reasoning phase#2120

Merged
marcusquinn merged 2 commits intomainfrom
feature/t1301
Feb 22, 2026
Merged

t1301: fix rc=143 SIGTERM kills during AI reasoning phase#2120
marcusquinn merged 2 commits intomainfrom
feature/t1301

Conversation

@marcusquinn
Copy link
Owner

@marcusquinn marcusquinn commented Feb 22, 2026

Summary

  • Root cause identified: Concurrent supervisor pulses were breaking stale locks (600s timeout < actual pulse duration including Phase 14 AI reasoning) and running Phase 4e orphan cleanup which killed opencode/claude processes mid-reasoning, producing rc=143 (SIGTERM)
  • Three-part fix: (1) increase PULSE_LOCK_TIMEOUT 600s→1800s to prevent concurrent pulses, (2) write ai-supervisor.pid during Phase 14 so Phase 4e protects AI-spawned processes, (3) increase SUPERVISOR_AI_TIMEOUT 300s→600s for large context prompts
  • Diagnostic improvement: explicit rc=143 log message in Phase 14 for future diagnosis

Changes

File Change
supervisor-helper.sh Default PULSE_LOCK_TIMEOUT 600→1800s
supervisor/launchd.sh Set SUPERVISOR_PULSE_LOCK_TIMEOUT=1800 and SUPERVISOR_AI_TIMEOUT=600 in launchd plist env
supervisor/pulse.sh Write/remove ai-supervisor.pid around Phase 14 pipeline; protect it in Phase 4e orphan sweep; add rc=143 log branch
supervisor/utility.sh Add comment explaining the 1800s timeout rationale
TODO.md Resolve merge conflict from concurrent supervisor edits

Root Cause Analysis

The rc=143 kills at 01:10 and 01:17 on 2026-02-22 were caused by:

  1. Pulse A starts, acquires lock, begins Phase 14 AI reasoning (108KB+ context)
  2. AI reasoning takes >600s (the old PULSE_LOCK_TIMEOUT)
  3. Pulse B starts (launchd fires every 2 min), sees lock age >600s, breaks it as "stale"
  4. Pulse B runs Phase 4e orphan cleanup, finds opencode/claude processes with PPID=1 (reparented after pulse A's subshell exited)
  5. Phase 4e kills those processes → rc=143 in Pulse A's pipeline

The fix ensures: (a) the lock timeout exceeds the maximum pulse duration, and (b) even if a concurrent pulse breaks the lock, the ai-supervisor.pid file causes Phase 4e to protect the AI session's processes.

Testing

  • Shellcheck clean on all modified files
  • Logic verified: ai-supervisor.pid protection uses kill -0 liveness check, so stale PID files from crashed pulses don't cause false protection

Closes #2119

Summary by CodeRabbit

Release Notes

  • Chores
    • Optimized supervisor timeout configurations to enhance system stability and reliability
    • Enhanced concurrent process protection to prevent race conditions
    • Updated environment variables for improved timeout management across supervisor processes
    • Refined task orchestration and process lifecycle handling

@gemini-code-assist
Copy link

Summary of Changes

Hello @marcusquinn, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request resolves a critical issue where AI reasoning processes were being prematurely terminated by the supervisor due to conflicts arising from concurrent pulse executions. By adjusting lock and AI timeouts and introducing a PID file-based protection mechanism, the changes ensure that AI operations can complete reliably without interruption, significantly improving the stability of AI-driven tasks.

Highlights

  • Root Cause Identified: Concurrent supervisor pulses were breaking stale locks and running orphan cleanup, which led to the premature termination (rc=143 SIGTERM) of opencode/claude processes during AI reasoning phases.
  • Three-Part Fix Implemented: The fix involves increasing PULSE_LOCK_TIMEOUT from 600s to 1800s, writing ai-supervisor.pid during Phase 14 to protect AI-spawned processes from orphan cleanup, and increasing SUPERVISOR_AI_TIMEOUT from 300s to 600s for large context prompts.
  • Diagnostic Improvement: An explicit log message for rc=143 (SIGTERM) has been added in Phase 14 for better future diagnosis of similar issues.
Changelog
  • .agents/scripts/supervisor-helper.sh
    • Increased the default PULSE_LOCK_TIMEOUT from 600 seconds to 1800 seconds.
  • .agents/scripts/supervisor/launchd.sh
    • Set SUPERVISOR_PULSE_LOCK_TIMEOUT to 1800 and SUPERVISOR_AI_TIMEOUT to 600 within the launchd plist environment variables.
    • Added comments explaining the rationale behind the increased timeout values.
  • .agents/scripts/supervisor/pulse.sh
    • Implemented logic to protect ai-supervisor.pid and its descendant processes during the Phase 4e orphan sweep.
    • Added code to write ai-supervisor.pid at the start of the Phase 14 AI pipeline and remove it upon completion.
    • Introduced a specific log branch for rc=143 (SIGTERM) in Phase 14 for improved diagnostics.
  • .agents/scripts/supervisor/utility.sh
    • Added a comment explaining the rationale for the 1800s PULSE_LOCK_TIMEOUT.
  • TODO.md
    • Resolved merge conflict markers related to concurrent supervisor edits.
Activity
  • The root cause of the rc=143 SIGTERM kills during AI reasoning was identified as concurrent supervisor pulses breaking stale locks and killing AI processes.
  • A three-part fix was designed and implemented to address the identified root cause.
  • Testing was performed to verify the logic, including shellcheck on modified files and liveness checks for ai-supervisor.pid protection.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@coderabbitai
Copy link
Contributor

coderabbitai bot commented Feb 22, 2026

Walkthrough

This PR addresses rc=143 supervisor pipeline kills during concurrent AI reasoning by extending timeout values from 600s to 1800s and implementing AI session PID protection in Phase 14 and Phase 4e cleanup logic. Changes prevent Phase 4e cleanup from prematurely terminating AI-spawned processes during overlapping pulse cycles.

Changes

Cohort / File(s) Summary
Timeout Configuration
.agents/scripts/supervisor-helper.sh, .agents/scripts/supervisor/launchd.sh, .agents/scripts/supervisor/utility.sh
Extended PULSE_LOCK_TIMEOUT default from 600s to 1800s (30 min); injected SUPERVISOR_PULSE_LOCK_TIMEOUT and SUPERVISOR_AI_TIMEOUT environment variables; added explanatory comments clarifying timeout semantics for concurrent pulse prevention.
AI Session Process Protection
.agents/scripts/supervisor/pulse.sh
Introduced ai-supervisor.pid file creation in Phase 14 before AI pipeline startup and cleanup after completion; expanded Phase 4e PID-protection logic to guard AI supervisor and descendants from concurrent pulse termination; added sweep-basename alignment and SIGTERM (rc=143) handling for AI-related failures.
Task Tracking
TODO.md
Resolved merge conflicts and reorganized task entries (t1301, t1302, t1284–t1287); consolidated PR references and stash markers; net reduction in task tracking lines reflecting conflict resolution and cleanup.

Sequence Diagram

sequenceDiagram
    actor Pulse as Pulse Cycle
    participant Phase14 as Phase 14: AI Supervisor
    participant PID as PID Registry
    participant AI as AI Pipeline
    participant Phase4e as Phase 4e: Cleanup
    
    Pulse->>Phase14: Start AI supervision
    Phase14->>PID: Write ai-supervisor.pid
    Phase14->>AI: Launch AI pipeline
    AI->>AI: Execute reasoning/prompt
    
    Note over Pulse,Phase4e: Concurrent pulse arrives
    
    Phase4e->>PID: Check protection (ai-supervisor.pid exists)
    Phase4e->>Phase4e: Skip AI supervisor & descendants
    Phase4e->>PID: Remove other stale PIDs
    
    AI->>AI: Complete work
    Phase14->>PID: Remove ai-supervisor.pid
    Phase14->>Pulse: AI session complete
Loading

Estimated Code Review Effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly Related PRs

  • #1641: Modifies Phase 14 AI supervision with concurrent session protection mechanisms, directly aligned with this PR's PID-based approach.
  • #1869: Adjusts supervisor pulse cleanup logic and environment timeout variables, sharing timeout configuration patterns.
  • #429: Touches Phase 4 cleanup and PID handling logic related to process supervision boundaries.

Suggested Labels

needs-review

Poem

🔒 Pulse locks extend to thirty minutes long,
AI sessions wrapped in PID shields strong,
Phase 4e yields to concurrent grace—
No more rc=143 in this space! ✨

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The PR title clearly and specifically describes the fix for SIGTERM rc=143 kills during AI reasoning phase, which matches the primary changeset focus.
Linked Issues check ✅ Passed All code changes directly address #2119 objectives: increased PULSE_LOCK_TIMEOUT prevents concurrent pulses, ai-supervisor.pid protection guards AI processes, SUPERVISOR_AI_TIMEOUT increase accommodates large prompts, and rc=143 logging aids diagnostics.
Out of Scope Changes check ✅ Passed All functional changes target the t1301 fix scope. TODO.md churn from merge conflict resolution is overhead, not a scope deviation.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
  • 📝 Generate docstrings (stacked PR)
  • 📝 Generate docstrings (commit on current branch)
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch feature/t1301

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

The pull request effectively addresses the issue of SIGTERM kills (rc=143) during the AI reasoning phase by increasing timeouts and implementing a protection mechanism for AI-spawned processes. The use of a temporary PID file (ai-supervisor.pid) that the orphan cleanup phase respects is a robust solution for handling concurrent supervisor pulses. I have identified a few style guide violations regarding local variable declarations and blanket error suppression with 2>/dev/null that should be addressed to maintain consistency and improve debuggability, aligning with repository-specific rules.

if [[ "$sweep_basename" == "ai-supervisor" ]] && kill -0 "$sweep_pid" 2>/dev/null; then
protected_pids="${protected_pids} ${sweep_pid}"
local ai_sweep_descendants
ai_sweep_descendants=$(_list_descendants "$sweep_pid" 2>/dev/null || true)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Avoid blanket suppression of stderr with 2>/dev/null. If _list_descendants fails, the error message should be visible in the logs for diagnosis, especially since this logic is critical for protecting AI processes.

Suggested change
ai_sweep_descendants=$(_list_descendants "$sweep_pid" 2>/dev/null || true)
ai_sweep_descendants=$(_list_descendants "$sweep_pid" || true)
References
  1. 2>/dev/null is acceptable ONLY when redirecting to log files, not blanket suppression. (link)
  2. Avoid blanket suppression of errors with '2>/dev/null' to ensure that authentication, syntax, or system issues remain visible for debugging.

fi
local sweep_task_status
sweep_task_status=$(db "$SUPERVISOR_DB" "SELECT status FROM tasks WHERE id = '$(sql_escape "$(basename "$pid_file" .pid)")';" 2>/dev/null || echo "")
sweep_task_status=$(db "$SUPERVISOR_DB" "SELECT status FROM tasks WHERE id = '$(sql_escape "$sweep_basename")';" 2>/dev/null || echo "")

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Avoid blanket suppression of stderr with 2>/dev/null. Database errors should be visible in the logs to help diagnose issues with task status lookups during the sweep.

Suggested change
sweep_task_status=$(db "$SUPERVISOR_DB" "SELECT status FROM tasks WHERE id = '$(sql_escape "$sweep_basename")';" 2>/dev/null || echo "")
sweep_task_status=$(db "$SUPERVISOR_DB" "SELECT status FROM tasks WHERE id = '$(sql_escape "$sweep_basename")';" || echo "")
References
  1. 2>/dev/null is acceptable ONLY when redirecting to log files, not blanket suppression. (link)
  2. Avoid blanket suppression of errors with '2>/dev/null' to ensure that authentication, syntax, or system issues remain visible for debugging.

# Write a PID file for the AI session so Phase 4e does not kill
# opencode/claude processes spawned during reasoning or action execution
# (t1301: concurrent pulses can trigger Phase 4e while AI is running).
local ai_pid_file="${SUPERVISOR_DIR}/pids/ai-supervisor.pid"

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Style guide violation: declare and assign separately for local variables to ensure exit code safety.

Suggested change
local ai_pid_file="${SUPERVISOR_DIR}/pids/ai-supervisor.pid"
local ai_pid_file
ai_pid_file="${SUPERVISOR_DIR}/pids/ai-supervisor.pid"
References
  1. Use local var="$1" pattern in functions (declare and assign separately for exit code safety). (link)
  2. Separate 'local' declaration from assignment for local variables to ensure exit code safety and consistency.

# opencode/claude processes spawned during reasoning or action execution
# (t1301: concurrent pulses can trigger Phase 4e while AI is running).
local ai_pid_file="${SUPERVISOR_DIR}/pids/ai-supervisor.pid"
echo "$$" >"$ai_pid_file" 2>/dev/null || true

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Avoid blanket suppression of stderr with 2>/dev/null. This masks potential system errors (e.g., permission denied, directory missing) that are important for debugging, especially since this PID file is critical for the new protection logic.

Suggested change
echo "$$" >"$ai_pid_file" 2>/dev/null || true
echo "$$" >"$ai_pid_file" || true
References
  1. 2>/dev/null is acceptable ONLY when redirecting to log files, not blanket suppression. (link)
  2. Avoid blanket suppression of errors with '2>/dev/null' to ensure that authentication, syntax, or system issues remain visible for debugging.

ai_result=$(run_ai_actions_pipeline "$ai_repo_path" "full" 2>>"$ai_log_file") || ai_rc=$?

# Remove AI session PID file now that the pipeline has completed
rm -f "$ai_pid_file" 2>/dev/null || true

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Avoid blanket suppression of stderr with 2>/dev/null. rm -f is already silent if the file doesn't exist; suppressing all errors can mask permission issues or other filesystem problems.

Suggested change
rm -f "$ai_pid_file" 2>/dev/null || true
rm -f "$ai_pid_file" || true
References
  1. 2>/dev/null is acceptable ONLY when redirecting to log files, not blanket suppression. (link)
  2. Avoid blanket suppression of errors with '2>/dev/null' to ensure that authentication, syntax, or system issues remain visible for debugging.

@github-actions
Copy link

🔍 Code Quality Report

�[0;35m[MONITOR]�[0m Code Review Monitoring Report

�[0;34m[INFO]�[0m Latest Quality Status:
SonarCloud: 0 bugs, 0 vulnerabilities, 38 code smells

�[0;34m[INFO]�[0m Recent monitoring activity:
Sun Feb 22 02:18:48 UTC 2026: Code review monitoring started
Sun Feb 22 02:18:49 UTC 2026: SonarCloud - Bugs: 0, Vulnerabilities: 0, Code Smells: 38

📈 Current Quality Metrics

  • BUGS: 0
  • CODE SMELLS: 38
  • VULNERABILITIES: 0

Generated on: Sun Feb 22 02:18:51 UTC 2026


Generated by AI DevOps Framework Code Review Monitoring

Root cause: concurrent supervisor pulses were breaking stale locks (600s
timeout < actual pulse duration with Phase 14 AI reasoning) and running
Phase 4e orphan cleanup which killed opencode/claude processes mid-reasoning.

Three-part fix:
1. Increase PULSE_LOCK_TIMEOUT from 600s to 1800s (30 min) to prevent
   concurrent pulses when AI reasoning runs long (108KB+ context ~6+ min)
2. Write ai-supervisor.pid during Phase 14 AI pipeline execution so Phase 4e
   protects the pulse process and all its AI-spawned descendants from orphan
   cleanup by concurrent pulses
3. Increase SUPERVISOR_AI_TIMEOUT from 300s to 600s (10 min) via launchd
   plist env var to give AI CLI more time for large context prompts
4. Add explicit rc=143 log message in Phase 14 to aid future diagnosis

Also resolves TODO.md merge conflict from concurrent supervisor edits.
@github-actions
Copy link

🔍 Code Quality Report

�[0;35m[MONITOR]�[0m Code Review Monitoring Report

�[0;34m[INFO]�[0m Latest Quality Status:
SonarCloud: 0 bugs, 0 vulnerabilities, 38 code smells

�[0;34m[INFO]�[0m Recent monitoring activity:
Sun Feb 22 02:37:06 UTC 2026: Code review monitoring started
Sun Feb 22 02:37:06 UTC 2026: SonarCloud - Bugs: 0, Vulnerabilities: 0, Code Smells: 38

📈 Current Quality Metrics

  • BUGS: 0
  • CODE SMELLS: 38
  • VULNERABILITIES: 0

Generated on: Sun Feb 22 02:37:09 UTC 2026


Generated by AI DevOps Framework Code Review Monitoring

@sonarqubecloud
Copy link

@marcusquinn marcusquinn merged commit 0d424e0 into main Feb 22, 2026
10 of 11 checks passed
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (3)
.agents/scripts/supervisor/pulse.sh (2)

2930-2946: Core protection logic is correct; descendant matching has a pre-existing newline-delimiter flaw now extended here.

The kill -0 guard, continue-skip of the DB-status path, and sweep_basename refactor are all clean. One subtle issue worth fixing: _list_descendants emits newline-separated PIDs, so when concatenated into protected_pids (space-delimited string) and checked with grep -q " ${opid} ", grep matches line-by-line. Any PID that lands in the middle of a newline-embedded string (preceded by \n rather than a space) won't match PID and won't be skipped by the orphan check. Only the last descendant in the output (which gets the trailing space from echo "...") reliably matches. The same flaw exists in the pre-existing sweep_descendants block at line 2950—this PR duplicates the pattern for ai_sweep_descendants.

🛡️ Proposed fix — normalize newlines to spaces before concatenation (apply to both occurrences)
-			if [[ -n "$ai_sweep_descendants" ]]; then
-				protected_pids="${protected_pids} ${ai_sweep_descendants}"
-			fi
+			while IFS= read -r _desc_pid; do
+				[[ -n "$_desc_pid" ]] && protected_pids="${protected_pids} ${_desc_pid}"
+			done <<< "$ai_sweep_descendants"

Apply the same fix to the pre-existing sweep_descendants block (lines 2950–2953):

-			if [[ -n "$sweep_descendants" ]]; then
-				protected_pids="${protected_pids} ${sweep_descendants}"
-			fi
+			while IFS= read -r _desc_pid; do
+				[[ -n "$_desc_pid" ]] && protected_pids="${protected_pids} ${_desc_pid}"
+			done <<< "$sweep_descendants"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.agents/scripts/supervisor/pulse.sh around lines 2930 - 2946, The descendant
PID lists from _list_descendants are newline-delimited and are being
concatenated into the space-delimited protected_pids string, causing later
checks that use grep -q " ${opid} " to miss PIDs; change the concatenation to
normalize newlines to spaces (e.g. convert ai_sweep_descendants and
sweep_descendants output from _list_descendants into a space-separated string
before appending to protected_pids) so every PID is separated by spaces and the
grep check will reliably match; update the blocks that set ai_sweep_descendants
and sweep_descendants accordingly while keeping the rest of the logic (kill -0,
continue, DB status) unchanged.

3674-3694: PID lifecycle is correctly bracketed around the pipeline; consider adding cleanup to the EXIT trap as defense-in-depth.

The write-before/remove-after placement is correct, and || true on both operations prevents write errors from aborting the pulse. If the pulse is killed (SIGTERM/SIGKILL) between lines 3678 and 3694, the file persists until the next successful Phase 14 run overwrites it. The stale file is self-healing (a dead $$ causes kill -0 to fail, so no protection triggers), but adding explicit cleanup to the EXIT trap at line 696 would be cleaner since the path is deterministic:

♻️ Optional: harden the EXIT trap to clean up the PID file on unexpected pulse termination
-	trap "release_pulse_lock; rm -f '${SUPERVISOR_DIR}/MODELS.md.tmp' 2>/dev/null || true" EXIT INT TERM
+	trap "release_pulse_lock; rm -f '${SUPERVISOR_DIR}/MODELS.md.tmp' '${SUPERVISOR_DIR}/pids/ai-supervisor.pid' 2>/dev/null || true" EXIT INT TERM
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.agents/scripts/supervisor/pulse.sh around lines 3674 - 3694, Add
deterministic cleanup for the AI session PID file to the existing EXIT trap:
detect the same ai_pid_file name used when creating the PID (variable
ai_pid_file / ai-supervisor.pid), and in the EXIT trap read the file and only
remove it if its contents match the current shell PID ($$) to avoid deleting
another run's PID; ensure removal uses rm -f and errors are ignored (|| true) so
the trap never aborts.
.agents/scripts/supervisor/launchd.sh (1)

150-160: Optional: parameterise the hardcoded timeouts and move the comment before the assignment.

Two minor housekeeping points:

  1. 1800 and 600 are hardcoded literals. A future timeout change requires editing both supervisor-helper.sh (line 206) and this function. Consider accepting them as optional positional args (e.g., $7 / $8 with defaults) so callers can override without touching plist generation logic.

  2. The explanatory comment block sits after the env_dict closing ", making it visually disconnected from the values it documents. Moving it immediately before the env_dict= assignment would read more naturally.

♻️ Suggested refactor (comment placement + optional parameterisation)
 _generate_supervisor_pulse_plist() {
 	local script_path="$1"
 	local interval_seconds="${2:-120}"
 	local log_path="$3"
 	local batch_arg="${4:-}"
 	local env_path="${5:-/usr/local/bin:/usr/bin:/bin:/usr/sbin:/sbin}"
 	local gh_token="${6:-}"
+	# SUPERVISOR_PULSE_LOCK_TIMEOUT (default 1800s/30min): must exceed max pulse
+	# duration including Phase 14 AI reasoning to prevent concurrent pulses from
+	# breaking a live lock and running Phase 4e orphan cleanup (t1301).
+	# SUPERVISOR_AI_TIMEOUT (default 600s/10min): headroom for large-context AI prompts.
+	local pulse_lock_timeout="${7:-1800}"
+	local ai_timeout="${8:-600}"

 	...

-	# SUPERVISOR_PULSE_LOCK_TIMEOUT=1800s (30 min): prevents concurrent pulses when
-	# ...
 	env_dict="<key>PATH</key>
 		<string>${env_path}</string>
 		<key>SUPERVISOR_PULSE_LOCK_TIMEOUT</key>
-		<string>1800</string>
+		<string>${pulse_lock_timeout}</string>
 		<key>SUPERVISOR_AI_TIMEOUT</key>
-		<string>600</string>"
+		<string>${ai_timeout}</string>"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.agents/scripts/supervisor/launchd.sh around lines 150 - 160, Parameterise
the hardcoded timeouts and move the explanatory comment before the
plist/environment block: update the code that builds env_dict (referencing
env_dict) to accept optional positional args (e.g., $7 for
SUPERVISOR_PULSE_LOCK_TIMEOUT and $8 for SUPERVISOR_AI_TIMEOUT) with defaults of
1800 and 600 respectively, replace the literal "1800" and "600" strings with
those variables when emitting SUPERVISOR_PULSE_LOCK_TIMEOUT and
SUPERVISOR_AI_TIMEOUT, and relocate the explanatory comment block so it appears
immediately before the env_dict= assignment (and document the new parameters so
callers can override timeouts without editing plist generation).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.agents/scripts/supervisor/launchd.sh:
- Around line 151-154: The cron-based installer isn’t being given
SUPERVISOR_AI_TIMEOUT, so update the cron launch code (_cmd_cron_linux) to
export SUPERVISOR_AI_TIMEOUT=600 (or read from an existing config/env) before
executing the supervisor command so ai-reason.sh’s timeout fallback (the code
referenced in ai-reason.sh around line 246) sees the new value; alternatively
expose SUPERVISOR_AI_TIMEOUT as a configurable environment variable that
_cmd_cron_linux will preserve/forward when writing the cron job so Linux cron
installs get the same 600s AI timeout as the macOS launchd plist.

---

Nitpick comments:
In @.agents/scripts/supervisor/launchd.sh:
- Around line 150-160: Parameterise the hardcoded timeouts and move the
explanatory comment before the plist/environment block: update the code that
builds env_dict (referencing env_dict) to accept optional positional args (e.g.,
$7 for SUPERVISOR_PULSE_LOCK_TIMEOUT and $8 for SUPERVISOR_AI_TIMEOUT) with
defaults of 1800 and 600 respectively, replace the literal "1800" and "600"
strings with those variables when emitting SUPERVISOR_PULSE_LOCK_TIMEOUT and
SUPERVISOR_AI_TIMEOUT, and relocate the explanatory comment block so it appears
immediately before the env_dict= assignment (and document the new parameters so
callers can override timeouts without editing plist generation).

In @.agents/scripts/supervisor/pulse.sh:
- Around line 2930-2946: The descendant PID lists from _list_descendants are
newline-delimited and are being concatenated into the space-delimited
protected_pids string, causing later checks that use grep -q " ${opid} " to miss
PIDs; change the concatenation to normalize newlines to spaces (e.g. convert
ai_sweep_descendants and sweep_descendants output from _list_descendants into a
space-separated string before appending to protected_pids) so every PID is
separated by spaces and the grep check will reliably match; update the blocks
that set ai_sweep_descendants and sweep_descendants accordingly while keeping
the rest of the logic (kill -0, continue, DB status) unchanged.
- Around line 3674-3694: Add deterministic cleanup for the AI session PID file
to the existing EXIT trap: detect the same ai_pid_file name used when creating
the PID (variable ai_pid_file / ai-supervisor.pid), and in the EXIT trap read
the file and only remove it if its contents match the current shell PID ($$) to
avoid deleting another run's PID; ensure removal uses rm -f and errors are
ignored (|| true) so the trap never aborts.

Comment on lines +151 to +154
<key>SUPERVISOR_PULSE_LOCK_TIMEOUT</key>
<string>1800</string>
<key>SUPERVISOR_AI_TIMEOUT</key>
<string>600</string>"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

rg -n 'SUPERVISOR_AI_TIMEOUT' --type sh

Repository: marcusquinn/aidevops

Length of output: 357


🏁 Script executed:

# Check cron.sh for any SUPERVISOR_AI_TIMEOUT handling or environment setup
fd -t f 'cron.sh' --type f

Repository: marcusquinn/aidevops

Length of output: 98


🏁 Script executed:

# Check pulse.sh for SUPERVISOR_AI_TIMEOUT references
fd -t f 'pulse.sh' --type f

Repository: marcusquinn/aidevops

Length of output: 137


🏁 Script executed:

# Check cron.sh to see if it sets up environment variables like SUPERVISOR_AI_TIMEOUT
cat -n .agents/scripts/supervisor/cron.sh | head -100

Repository: marcusquinn/aidevops

Length of output: 3323


🏁 Script executed:

# Check pulse.sh to understand the entry point and environment setup
cat -n .agents/scripts/supervisor/pulse.sh | head -150

Repository: marcusquinn/aidevops

Length of output: 5725


🏁 Script executed:

# Check ai-reason.sh around line 246 to see the timeout usage
sed -n '240,260p' .agents/scripts/supervisor/ai-reason.sh

Repository: marcusquinn/aidevops

Length of output: 722


🏁 Script executed:

# Check _cmd_cron_linux function to see if it handles SUPERVISOR_AI_TIMEOUT
sed -n '100,200p' .agents/scripts/supervisor/cron.sh

Repository: marcusquinn/aidevops

Length of output: 2706


🏁 Script executed:

# Check supervisor-helper.sh to see how it invokes pulse and whether it sets env vars
rg -n 'SUPERVISOR_AI_TIMEOUT|supervisor-pulse' .agents/scripts/supervisor/supervisor-helper.sh -A 3 -B 3

Repository: marcusquinn/aidevops

Length of output: 151


🏁 Script executed:

# Find supervisor-helper.sh
fd -t f 'supervisor-helper.sh'

Repository: marcusquinn/aidevops

Length of output: 100


🏁 Script executed:

# Check the Linux cron backend to understand how environment variables are passed
sed -n '150,200p' .agents/scripts/supervisor/cron.sh

Repository: marcusquinn/aidevops

Length of output: 1638


🏁 Script executed:

# Check supervisor-helper.sh to see how it invokes pulse and whether it sets SUPERVISOR_AI_TIMEOUT
cat -n .agents/scripts/supervisor-helper.sh | head -100

Repository: marcusquinn/aidevops

Length of output: 7703


🏁 Script executed:

# Search for how pulse is called and whether SUPERVISOR_AI_TIMEOUT is set in supervisor-helper.sh
rg -n 'pulse|SUPERVISOR_AI_TIMEOUT' .agents/scripts/supervisor-helper.sh -A 2 -B 2

Repository: marcusquinn/aidevops

Length of output: 9204


🏁 Script executed:

# Check if SUPERVISOR_AI_TIMEOUT is set anywhere else in the codebase (environment files, config, etc.)
rg -n 'SUPERVISOR_AI_TIMEOUT' --type sh --type bash

Repository: marcusquinn/aidevops

Length of output: 92


🏁 Script executed:

# Confirm the exact lines in launchd.sh where SUPERVISOR_AI_TIMEOUT is injected
sed -n '150,165p' .agents/scripts/supervisor/launchd.sh

Repository: marcusquinn/aidevops

Length of output: 789


🏁 Script executed:

# Final verification: check what default SUPERVISOR_AI_TIMEOUT is in ai-reason.sh
sed -n '240,250p' .agents/scripts/supervisor/ai-reason.sh

Repository: marcusquinn/aidevops

Length of output: 354


Cron-based installs do not receive the new SUPERVISOR_AI_TIMEOUT=600 timeout.

The launchd plist injection sets SUPERVISOR_AI_TIMEOUT=600 for macOS, but the Linux cron backend (_cmd_cron_linux) passes only PATH as an environment variable. Linux cron users will fall back to the hardcoded default of 300 seconds in ai-reason.sh:246, retaining the old (shorter) AI timeout and remaining vulnerable to the same large-context timeout kills that this PR addresses on macOS. Either inject SUPERVISOR_AI_TIMEOUT=600 in the cron command or expose it as a configurable environment variable for cron installs.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.agents/scripts/supervisor/launchd.sh around lines 151 - 154, The cron-based
installer isn’t being given SUPERVISOR_AI_TIMEOUT, so update the cron launch
code (_cmd_cron_linux) to export SUPERVISOR_AI_TIMEOUT=600 (or read from an
existing config/env) before executing the supervisor command so ai-reason.sh’s
timeout fallback (the code referenced in ai-reason.sh around line 246) sees the
new value; alternatively expose SUPERVISOR_AI_TIMEOUT as a configurable
environment variable that _cmd_cron_linux will preserve/forward when writing the
cron job so Linux cron installs get the same 600s AI timeout as the macOS
launchd plist.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

t1301: Investigate rc=143 supervisor pipeline kills during AI reasoning phase

1 participant