t1301: fix rc=143 SIGTERM kills during AI reasoning phase by marcusquinn · Pull Request #2120 · marcusquinn/aidevops

marcusquinn · 2026-02-22T01:53:02Z

Summary

Root cause identified: Concurrent supervisor pulses were breaking stale locks (600s timeout < actual pulse duration including Phase 14 AI reasoning) and running Phase 4e orphan cleanup which killed opencode/claude processes mid-reasoning, producing rc=143 (SIGTERM)
Three-part fix: (1) increase PULSE_LOCK_TIMEOUT 600s→1800s to prevent concurrent pulses, (2) write ai-supervisor.pid during Phase 14 so Phase 4e protects AI-spawned processes, (3) increase SUPERVISOR_AI_TIMEOUT 300s→600s for large context prompts
Diagnostic improvement: explicit rc=143 log message in Phase 14 for future diagnosis

Changes

File	Change
`supervisor-helper.sh`	Default `PULSE_LOCK_TIMEOUT` 600→1800s
`supervisor/launchd.sh`	Set `SUPERVISOR_PULSE_LOCK_TIMEOUT=1800` and `SUPERVISOR_AI_TIMEOUT=600` in launchd plist env
`supervisor/pulse.sh`	Write/remove `ai-supervisor.pid` around Phase 14 pipeline; protect it in Phase 4e orphan sweep; add rc=143 log branch
`supervisor/utility.sh`	Add comment explaining the 1800s timeout rationale
`TODO.md`	Resolve merge conflict from concurrent supervisor edits

Root Cause Analysis

The rc=143 kills at 01:10 and 01:17 on 2026-02-22 were caused by:

Pulse A starts, acquires lock, begins Phase 14 AI reasoning (108KB+ context)
AI reasoning takes >600s (the old PULSE_LOCK_TIMEOUT)
Pulse B starts (launchd fires every 2 min), sees lock age >600s, breaks it as "stale"
Pulse B runs Phase 4e orphan cleanup, finds opencode/claude processes with PPID=1 (reparented after pulse A's subshell exited)
Phase 4e kills those processes → rc=143 in Pulse A's pipeline

The fix ensures: (a) the lock timeout exceeds the maximum pulse duration, and (b) even if a concurrent pulse breaks the lock, the ai-supervisor.pid file causes Phase 4e to protect the AI session's processes.

Testing

Shellcheck clean on all modified files
Logic verified: ai-supervisor.pid protection uses kill -0 liveness check, so stale PID files from crashed pulses don't cause false protection

Closes #2119

Summary by CodeRabbit

Release Notes

Chores
- Optimized supervisor timeout configurations to enhance system stability and reliability
- Enhanced concurrent process protection to prevent race conditions
- Updated environment variables for improved timeout management across supervisor processes
- Refined task orchestration and process lifecycle handling

gemini-code-assist · 2026-02-22T01:53:18Z

Summary of Changes

Hello @marcusquinn, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request resolves a critical issue where AI reasoning processes were being prematurely terminated by the supervisor due to conflicts arising from concurrent pulse executions. By adjusting lock and AI timeouts and introducing a PID file-based protection mechanism, the changes ensure that AI operations can complete reliably without interruption, significantly improving the stability of AI-driven tasks.

Highlights

Root Cause Identified: Concurrent supervisor pulses were breaking stale locks and running orphan cleanup, which led to the premature termination (rc=143 SIGTERM) of opencode/claude processes during AI reasoning phases.
Three-Part Fix Implemented: The fix involves increasing PULSE_LOCK_TIMEOUT from 600s to 1800s, writing ai-supervisor.pid during Phase 14 to protect AI-spawned processes from orphan cleanup, and increasing SUPERVISOR_AI_TIMEOUT from 300s to 600s for large context prompts.
Diagnostic Improvement: An explicit log message for rc=143 (SIGTERM) has been added in Phase 14 for better future diagnosis of similar issues.

Changelog

.agents/scripts/supervisor-helper.sh
- Increased the default PULSE_LOCK_TIMEOUT from 600 seconds to 1800 seconds.
.agents/scripts/supervisor/launchd.sh
- Set SUPERVISOR_PULSE_LOCK_TIMEOUT to 1800 and SUPERVISOR_AI_TIMEOUT to 600 within the launchd plist environment variables.
- Added comments explaining the rationale behind the increased timeout values.
.agents/scripts/supervisor/pulse.sh
- Implemented logic to protect ai-supervisor.pid and its descendant processes during the Phase 4e orphan sweep.
- Added code to write ai-supervisor.pid at the start of the Phase 14 AI pipeline and remove it upon completion.
- Introduced a specific log branch for rc=143 (SIGTERM) in Phase 14 for improved diagnostics.
.agents/scripts/supervisor/utility.sh
- Added a comment explaining the rationale for the 1800s PULSE_LOCK_TIMEOUT.
TODO.md
- Resolved merge conflict markers related to concurrent supervisor edits.

Activity

The root cause of the rc=143 SIGTERM kills during AI reasoning was identified as concurrent supervisor pulses breaking stale locks and killing AI processes.
A three-part fix was designed and implemented to address the identified root cause.
Testing was performed to verify the logic, including shellcheck on modified files and liveness checks for ai-supervisor.pid protection.

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature	Command	Description
Code Review	`/gemini review`	Performs a code review for the current pull request in its current state.
Pull Request Summary	`/gemini summary`	Provides a summary of the current pull request in its current state.
Comment	@gemini-code-assist	Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help	`/gemini help`	Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩

coderabbitai · 2026-02-22T01:53:21Z

Walkthrough

This PR addresses rc=143 supervisor pipeline kills during concurrent AI reasoning by extending timeout values from 600s to 1800s and implementing AI session PID protection in Phase 14 and Phase 4e cleanup logic. Changes prevent Phase 4e cleanup from prematurely terminating AI-spawned processes during overlapping pulse cycles.

Changes

Cohort / File(s)	Summary
Timeout Configuration `.agents/scripts/supervisor-helper.sh`, `.agents/scripts/supervisor/launchd.sh`, `.agents/scripts/supervisor/utility.sh`	Extended PULSE_LOCK_TIMEOUT default from 600s to 1800s (30 min); injected SUPERVISOR_PULSE_LOCK_TIMEOUT and SUPERVISOR_AI_TIMEOUT environment variables; added explanatory comments clarifying timeout semantics for concurrent pulse prevention.
AI Session Process Protection `.agents/scripts/supervisor/pulse.sh`	Introduced ai-supervisor.pid file creation in Phase 14 before AI pipeline startup and cleanup after completion; expanded Phase 4e PID-protection logic to guard AI supervisor and descendants from concurrent pulse termination; added sweep-basename alignment and SIGTERM (rc=143) handling for AI-related failures.
Task Tracking `TODO.md`	Resolved merge conflicts and reorganized task entries (t1301, t1302, t1284–t1287); consolidated PR references and stash markers; net reduction in task tracking lines reflecting conflict resolution and cleanup.

Sequence Diagram

sequenceDiagram
    actor Pulse as Pulse Cycle
    participant Phase14 as Phase 14: AI Supervisor
    participant PID as PID Registry
    participant AI as AI Pipeline
    participant Phase4e as Phase 4e: Cleanup
    
    Pulse->>Phase14: Start AI supervision
    Phase14->>PID: Write ai-supervisor.pid
    Phase14->>AI: Launch AI pipeline
    AI->>AI: Execute reasoning/prompt
    
    Note over Pulse,Phase4e: Concurrent pulse arrives
    
    Phase4e->>PID: Check protection (ai-supervisor.pid exists)
    Phase4e->>Phase4e: Skip AI supervisor & descendants
    Phase4e->>PID: Remove other stale PIDs
    
    AI->>AI: Complete work
    Phase14->>PID: Remove ai-supervisor.pid
    Phase14->>Pulse: AI session complete

Estimated Code Review Effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly Related PRs

#1641: Modifies Phase 14 AI supervision with concurrent session protection mechanisms, directly aligned with this PR's PID-based approach.
#1869: Adjusts supervisor pulse cleanup logic and environment timeout variables, sharing timeout configuration patterns.
#429: Touches Phase 4 cleanup and PID handling logic related to process supervision boundaries.

Suggested Labels

needs-review

Poem

🔒 Pulse locks extend to thirty minutes long,
AI sessions wrapped in PID shields strong,
Phase 4e yields to concurrent grace—
No more rc=143 in this space! ✨

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The PR title clearly and specifically describes the fix for SIGTERM rc=143 kills during AI reasoning phase, which matches the primary changeset focus.
Linked Issues check	✅ Passed	All code changes directly address `#2119` objectives: increased PULSE_LOCK_TIMEOUT prevents concurrent pulses, ai-supervisor.pid protection guards AI processes, SUPERVISOR_AI_TIMEOUT increase accommodates large prompts, and rc=143 logging aids diagnostics.
Out of Scope Changes check	✅ Passed	All functional changes target the t1301 fix scope. TODO.md churn from merge conflict resolution is overhead, not a scope deviation.
Docstring Coverage	✅ Passed	Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

📝 Generate docstrings (stacked PR)
📝 Generate docstrings (commit on current branch)

🧪 Generate unit tests (beta)

Create PR with unit tests
Post copyable unit tests in a comment
Commit unit tests in branch feature/t1301

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

gemini-code-assist

Code Review

The pull request effectively addresses the issue of SIGTERM kills (rc=143) during the AI reasoning phase by increasing timeouts and implementing a protection mechanism for AI-spawned processes. The use of a temporary PID file (ai-supervisor.pid) that the orphan cleanup phase respects is a robust solution for handling concurrent supervisor pulses. I have identified a few style guide violations regarding local variable declarations and blanket error suppression with 2>/dev/null that should be addressed to maintain consistency and improve debuggability, aligning with repository-specific rules.

gemini-code-assist · 2026-02-22T02:11:36Z