
refactor: replace deterministic supervisor lifecycle with AI-first decision engine#2206

Merged
marcusquinn merged 1 commit into main from
refactor/supervisor-ai-first
Feb 24, 2026
Conversation


@marcusquinn marcusquinn commented Feb 24, 2026

Summary

  • Remove ~1,000 lines of deterministic shell logic that prevented the supervisor from solving problems autonomously
  • Replace with AI-first architecture: GATHER (data) -> DECIDE (opus model) -> EXECUTE (action)
  • New interactive worker actions: resolve_conflicts, fix_ci, fix_and_push dispatch opus workers with full tool access to actually solve problems instead of just logging and waiting
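
The GATHER -> DECIDE -> EXECUTE split can be sketched in a few lines of bash. This is a hypothetical stand-in, not code from the PR: all three functions are stubbed (the real ai_decide calls the opus model, and the real gatherer queries the DB, GitHub, and git), and only the function names follow the description above.

```shell
#!/usr/bin/env bash
set -eu

gather_task_state() {   # GATHER: pure data collection (stubbed)
  printf '{"task_id":"%s","pr_state":"OPEN","ci_summary":"failed:1"}' "$1"
}

ai_decide() {           # DECIDE: stand-in for the opus model call
  printf '{"action":"fix_ci","reason":"1 failing check"}'
}

execute_action() {      # EXECUTE: run whatever the model chose
  echo "task $1: executing $2"
}

process_ai_lifecycle() {
  local task_id state decision action
  for task_id in "$@"; do
    state=$(gather_task_state "$task_id")
    decision=$(ai_decide "$state")
    # Pull the "action" field out of the model's JSON reply
    action=$(printf '%s' "$decision" | sed -n 's/.*"action":"\([^"]*\)".*/\1/p')
    execute_action "$task_id" "$action"
  done
}

process_ai_lifecycle T-101   # -> task T-101: executing fix_ci
```

The point of the shape is that no branch in the loop encodes policy: every "what next" question goes through the DECIDE step.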

What was removed

| Component | Lines | Problem |
| --- | --- | --- |
| fast_path_decision() | ~110 | Deterministic shortcuts that skipped AI for "obvious" cases — but couldn't handle edge cases |
| Phase 3b2 reconciliation | ~200 | Shell case statements trying to reconcile blocked/verify_failed tasks |
| Phase 3c issue sync | ~50 | Deterministic GitHub issue label sync |
| Phase 3d verified PR cleanup | ~90 | Deterministic merge/update for verified tasks with open PRs |
| Phase 3.5 rebase retry | ~60 | Retry counter with max cap — gave up instead of solving |
| Phase 3.6 escalation | ~150 | Hardcoded 10-step prompt that couldn't handle complex conflicts |
| Phase 4b2 stale pr_review | ~50 | Called legacy cmd_pr_lifecycle which has its own deterministic logic |
| process_post_pr_lifecycle() | ~110 | Legacy parallel lifecycle processor (now redirects to AI) |

What replaced it

ai_decide() — a single function that sends the task's real-world state to opus and gets back a JSON action. The AI sees the same state a human would and picks the next step. No case statements, no fast-paths, no retry counters.
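
Because the model reply is free-form text containing a JSON object, the caller has to extract and validate the action before executing anything. A minimal validation sketch — parse_ai_decision is a hypothetical helper, not the PR's code; the action names are taken from this PR's description:

```shell
# Hypothetical helper: pull "action" out of a JSON decision and reject
# anything outside the known action set, falling back to a safe "wait".
parse_ai_decision() {
  local raw="$1" action
  action=$(printf '%s' "$raw" | sed -n 's/.*"action"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p')
  case "$action" in
    merge|wait|rebase_branch|resolve_conflicts|fix_ci|fix_and_push)
      printf '%s' "$action" ;;
    *)
      printf 'wait' ;;   # unknown or malformed decision: do nothing
  esac
}

parse_ai_decision '{"action":"fix_ci","reason":"2 checks red"}'   # -> fix_ci
```

Whitelisting the action set keeps a hallucinated or garbled model reply from ever reaching the executor.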

_dispatch_ai_worker() — for complex problems (conflicts, CI failures, unknown blockers), dispatches an interactive AI session with full tool access that can read code, understand context, and fix the actual problem.
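
The dispatch side reduces to "launch a session in the background, record its PID, check liveness later". A sketch under stated assumptions: the AI session is replaced by a sleep stub, and everything except the function name is illustrative.

```shell
# Stand-in sketch: the real worker would be an interactive claude/opencode
# session with full tool access; here it is a sleep so the control flow runs.
_dispatch_ai_worker() {
  local task_id="$1" action_type="$2"
  sleep 5 </dev/null >/dev/null 2>&1 &   # stand-in long-running AI session
  local pid=$!
  echo "dispatched ${action_type} worker for ${task_id} (pid ${pid})" >&2
  printf '%s' "$pid"                     # caller stores this for kill -0 checks
}

pid=$(_dispatch_ai_worker T-42 resolve_conflicts)
kill -0 "$pid" && echo "worker ${pid} is alive"
kill "$pid" 2>/dev/null || true          # cleanup for this demo
```

Note that `kill -0` only answers for local numeric PIDs; any stored remote worker identifier needs its own liveness path.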

Testing

  • ShellCheck clean on both files
  • Bash syntax validation passes
  • process_post_pr_lifecycle kept as backward-compatible redirect
  • extract_parent_id, adopt_untracked_prs, Phase 3b (verify queue) preserved
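
The backward-compatible redirect can be as small as a one-line wrapper. A sketch, with process_ai_lifecycle stubbed so the forwarding is visible — not the repository's actual implementation:

```shell
# Stub standing in for the real AI lifecycle entry point.
process_ai_lifecycle() { echo "ai-lifecycle: $*"; }

# Deprecated legacy entry point: kept only so existing callers keep working.
process_post_pr_lifecycle() {
  process_ai_lifecycle "$@"
}

process_post_pr_lifecycle T-1 T-2   # -> ai-lifecycle: T-1 T-2
```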

Summary by CodeRabbit

Release Notes

  • Refactor
    • Restructured core task lifecycle architecture with centralized orchestration and decision-making processes
    • Consolidated multiple workflow phases into a unified execution path
    • Enhanced handling of complex operations and task dependencies through improved dispatch capabilities
    • Expanded task state tracking with enriched metadata and comprehensive audit logging

refactor: replace deterministic supervisor lifecycle with AI-first decision engine

Remove ~1,000 lines of deterministic shell logic (fast_path_decision,
Phase 3b2 reconciliation, Phase 3c/3d/3.5/3.6, process_post_pr_lifecycle)
that prevented the supervisor from solving problems autonomously.

New architecture: GATHER -> DECIDE -> EXECUTE
- gather_task_state(): collects facts from DB, GitHub, git (pure data)
- ai_decide(): sends state to opus model, gets JSON action back
- execute_action(): runs what the AI decided; complex work dispatches
  interactive AI workers with full tool access

Key changes:
- AI model (opus) makes ALL lifecycle decisions, no deterministic shortcuts
- New actions: resolve_conflicts, fix_ci, fix_and_push dispatch interactive
  AI workers that can read code, understand context, and fix problems
- Removed SUPERVISOR_AI_LIFECYCLE toggle (always AI-first now)
- Removed Phase 3b2-3.6 deterministic reconciliation/rebase/escalation
- process_post_pr_lifecycle deprecated (redirects to process_ai_lifecycle)
- Phase 4b2 stale pr_review and 4d stuck deploying simplified

coderabbitai bot commented Feb 24, 2026

Walkthrough

The pull request restructures the supervisor's task lifecycle orchestration from deterministic branching with hard-coded gates to a fully AI-driven decision engine. It centralizes phase 3 flows into a unified process_ai_lifecycle function that gathers enriched task state, queries AI for next actions, and executes those actions in a loop, removing prior mechanical workflows.

Changes

Cohort / File(s) Summary
AI Lifecycle Engine
.agents/scripts/supervisor/ai-lifecycle.sh
Complete architectural overhaul replacing deterministic lifecycle gates with AI-driven orchestration. Functions renamed (decide_next_action → ai_decide, execute_lifecycle_action → execute_action); new _dispatch_ai_worker function for complex task delegation. Introduced enriched task state gathering (worker status, CI summary, PR metadata), structured AI decision prompts with explicit state/actions, and audit logging with decision trails. Updated model selection logic (AI_LIFECYCLE_MODEL_TIER → AI_LIFECYCLE_MODEL) and extended action execution to support worker dispatching and status-tag updates.
Pulse Phase Consolidation
.agents/scripts/supervisor/pulse.sh
Simplified Phase 3 by eliminating deterministic sub-phases (3b2, 3c, 3d, 3.5, 3.6) and routing all task decisions exclusively to AI-driven process_ai_lifecycle. Converted process_post_pr_lifecycle to a thin redirect wrapper for backward compatibility. Streamlined Phase 4d recovery logic and collapsed legacy reconciliation branches that previously fed multi-phase workflows.

Sequence Diagram

sequenceDiagram
    participant Pulse as Pulse Orchestrator
    participant AI as AI Lifecycle Engine
    participant TaskState as Task State Gatherer
    participant AIModel as AI Decision Model
    participant ActionExec as Action Executor
    participant Worker as AI Worker Dispatcher
    
    Pulse->>AI: process_ai_lifecycle(tasks)
    loop For each task
        AI->>TaskState: gather_task_state(task_id)
        TaskState->>TaskState: enrich with worker status,<br/>CI summary, PR metadata
        TaskState-->>AI: structured task state
        AI->>AIModel: ai_decide(task_state, task_id)
        AIModel->>AIModel: format prompt with<br/>CURRENT STATE &<br/>AVAILABLE ACTIONS
        AIModel-->>AI: action, reason, status_tag (JSON)
        AI->>ActionExec: execute_action(task_id, action,<br/>reason, status_tag)
        alt action == deploy or merge or complex
            ActionExec->>Worker: _dispatch_ai_worker(task_id,<br/>action_type, repo, worktree)
            Worker-->>ActionExec: worker PID & health tracking
        else action == simple (wait, cancel, retry)
            ActionExec->>ActionExec: execute inline
        end
        ActionExec->>ActionExec: update TODO.md status tags,<br/>commit changes
        ActionExec-->>AI: action result
    end
    AI-->>Pulse: all tasks processed

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

Suggested labels

needs-review

Poem

🤖 From rigid gates to flowing streams,

AI now orchestrates the dreams—

Gather state, ask Claude to lead,

Execute actions at machine speed! 🚀

No more hard-coded paths confined,

Just prompts and workers, intertwined. ✨

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly and specifically summarizes the main change: replacing a deterministic supervisor lifecycle with an AI-first decision engine, which aligns with the substantial refactoring of supervisor orchestration in both ai-lifecycle.sh and pulse.sh. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |


@gemini-code-assist

Summary of Changes

Hello @marcusquinn, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request fundamentally rearchitects the supervisor's task lifecycle management by shifting from a rule-based, deterministic system to an AI-driven decision engine. The change aims to enhance the supervisor's autonomy and problem-solving capabilities, allowing it to dynamically respond to various task states and issues without relying on hardcoded heuristics. This refactor streamlines the codebase and empowers the system to handle complex scenarios more intelligently.

Highlights

  • AI-First Decision Engine: Replaced approximately 1,000 lines of deterministic shell logic with an AI-first architecture for the supervisor lifecycle, enabling autonomous problem-solving.
  • Interactive AI Worker Actions: Introduced new interactive worker actions such as 'resolve_conflicts', 'fix_ci', and 'fix_and_push' that dispatch AI workers with full tool access to address complex issues.
  • Consolidated Lifecycle Logic: Consolidated various previously deterministic phases (e.g., fast-path decisions, reconciliation, rebase retries, escalation) into a single AI-driven decision process.
Changelog
  • .agents/scripts/supervisor/ai-lifecycle.sh
    • Updated file header comments to reflect the new AI-first architecture and its GATHER → DECIDE → EXECUTE flow.
    • Removed AI_LIFECYCLE_DECISION_TIMEOUT and AI_LIFECYCLE_MODEL_TIER variables, replacing them with AI_LIFECYCLE_MODEL and AI_LIFECYCLE_TIMEOUT.
    • Modified gather_task_state to include worker_pid in the database query and WORKER_ALIVE in the output, and simplified CI status reporting.
    • Renamed decide_next_action to ai_decide and significantly updated its prompt with new available actions and decision rules, removing deterministic shortcuts.
    • Added logging for AI decisions to an audit trail directory.
    • Renamed execute_lifecycle_action to execute_action and refactored its logic to dispatch interactive AI workers for complex actions like resolve_conflicts, fix_ci, and fix_and_push.
    • Removed the fix_ci_failures function, as its functionality is now handled by the generic _dispatch_ai_worker.
    • Removed process_task_lifecycle and fast_path_decision functions, as their logic is integrated into the new AI-first flow.
    • Re-added update_task_status_tag and commit_status_tag_updates functions at the end of the file for backward compatibility with other modules.
    • Updated process_ai_lifecycle to query all task states, remove fast-path decisions, and utilize the new ai_decide and execute_action functions.
  • .agents/scripts/supervisor/pulse.sh
    • Modified Phase 3 to exclusively use process_ai_lifecycle, removing the conditional fallback to process_post_pr_lifecycle.
    • Removed or simplified logic for Phase 3b2 reconciliation, Phase 3c issue sync, Phase 3d verified PR cleanup, Phase 3.5 rebase retry, and Phase 3.6 escalation, noting that these are now handled by the AI lifecycle.
    • Simplified Phase 4d's stuck deploying recovery logic to directly force tasks to 'deployed' after a timeout.
    • Changed the process_post_pr_lifecycle function to a thin redirect that calls process_ai_lifecycle for backward compatibility.
Activity
  • The pull request author, marcusquinn, initiated this significant refactor to transition the supervisor's lifecycle management to an AI-first approach.

@github-actions

🔍 Code Quality Report

[MONITOR] Code Review Monitoring Report

[INFO] Latest Quality Status:
SonarCloud: 0 bugs, 0 vulnerabilities, 43 code smells

[INFO] Recent monitoring activity:
Tue Feb 24 12:40:43 UTC 2026: Code review monitoring started
Tue Feb 24 12:40:44 UTC 2026: SonarCloud - Bugs: 0, Vulnerabilities: 0, Code Smells: 43

📈 Current Quality Metrics

  • BUGS: 0
  • CODE SMELLS: 43
  • VULNERABILITIES: 0

Generated on: Tue Feb 24 12:40:47 UTC 2026


Generated by AI DevOps Framework Code Review Monitoring

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This is an impressive and significant refactoring that replaces a complex, deterministic shell-based lifecycle engine with a much cleaner and more powerful AI-first architecture. The GATHER -> DECIDE -> EXECUTE pattern is well-implemented, and delegating all decisions to the AI model while using shell scripts for data gathering and execution is a solid design. The introduction of specialized AI workers for complex tasks like conflict resolution and CI fixes is a major step forward in autonomous problem-solving. My review comments focus on improving the robustness and debuggability of the new implementation by adhering to the repository's general rules, specifically around not suppressing stderr for critical command-line tools. This will ensure that any underlying issues with gh, jq, or the AI CLIs are visible in the logs, which is crucial for maintaining such a system.

Comment on lines 250 to 264
  if [[ "$ai_cli" == "opencode" ]]; then
-     ai_result=$(portable_timeout "$ai_timeout" opencode run \
+     ai_result=$(portable_timeout "$AI_LIFECYCLE_TIMEOUT" opencode run \
          -m "$ai_model" \
          --format default \
-         --title "lifecycle-decision-$$" \
+         --title "lifecycle-${task_id}-$$" \
          "$prompt" 2>/dev/null || echo "")
      # Strip ANSI codes
      ai_result=$(printf '%s' "$ai_result" | sed 's/\x1b\[[0-9;]*[mGKHF]//g; s/\x1b\[[0-9;]*[A-Za-z]//g; s/\x1b\]//g; s/\x07//g')
  else
      local claude_model="${ai_model#*/}"
-     ai_result=$(portable_timeout "$ai_timeout" claude \
+     ai_result=$(portable_timeout "$AI_LIFECYCLE_TIMEOUT" claude \
          -p "$prompt" \
          --model "$claude_model" \
          --output-format text 2>/dev/null || echo "")
  fi


critical

The calls to the AI CLIs (opencode and claude) are suppressing stderr using 2>/dev/null. This is highly risky as it will hide critical errors such as authentication failures (e.g., missing API keys), network issues, or problems with the model endpoint itself. According to the repository's general rules, blanket error suppression should be avoided to maintain debuggability. If the AI decision engine fails silently, the entire lifecycle process will be compromised. Please remove the 2>/dev/null to ensure any errors from the AI CLIs are logged.

Suggested change
if [[ "$ai_cli" == "opencode" ]]; then
ai_result=$(portable_timeout "$AI_LIFECYCLE_TIMEOUT" opencode run \
-m "$ai_model" \
--format default \
--title "lifecycle-${task_id}-$$" \
"$prompt" || echo "")
# Strip ANSI codes
ai_result=$(printf '%s' "$ai_result" | sed 's/\x1b\[[0-9;]*[mGKHF]//g; s/\x1b\[[0-9;]*[A-Za-z]//g; s/\x1b\]//g; s/\x07//g')
else
local claude_model="${ai_model#*/}"
ai_result=$(portable_timeout "$AI_LIFECYCLE_TIMEOUT" claude \
-p "$prompt" \
--model "$claude_model" \
--output-format text || echo "")
fi
References
  1. Avoid using '2>/dev/null' for blanket suppression of command errors in shell scripts to ensure that authentication, syntax, or system issues remain visible for debugging.

Comment on lines 345 to 346
base_ref=$(gh pr view "$pr_number" --repo "$pr_repo_slug" \
--json baseRefName --jq '.baseRefName' 2>/dev/null) || base_ref=""


high

The gh pr view command is suppressing stderr with 2>/dev/null. This can hide important errors related to authentication, network connectivity, or the PR not being found. Per the repository's general rules, stderr should not be suppressed for such commands to aid in debugging. Please remove the 2>/dev/null.

Suggested change
base_ref=$(gh pr view "$pr_number" --repo "$pr_repo_slug" \
--json baseRefName --jq '.baseRefName' 2>/dev/null) || base_ref=""
base_ref=$(gh pr view "$pr_number" --repo "$pr_repo_slug" \
--json baseRefName --jq '.baseRefName') || base_ref=""
References
  1. Avoid using '2>/dev/null' for blanket suppression of command errors in shell scripts to ensure that authentication, syntax, or system issues remain visible for debugging.

Comment on lines +803 to +804
base_ref=$(gh pr view "$tpr" --repo "$(detect_repo_slug "$trepo" 2>/dev/null || echo "")" \
--json baseRefName --jq '.baseRefName' 2>/dev/null) || base_ref="main"


high

The gh pr view command here suppresses stderr via 2>/dev/null. This is problematic as it can hide errors from the gh CLI, such as authentication failures or if the PR URL is invalid. The repository's general rules advise against suppressing stderr for such commands to ensure errors are visible for debugging. Please remove 2>/dev/null.

Suggested change
base_ref=$(gh pr view "$tpr" --repo "$(detect_repo_slug "$trepo" 2>/dev/null || echo "")" \
--json baseRefName --jq '.baseRefName' 2>/dev/null) || base_ref="main"
base_ref=$(gh pr view "$tpr" --repo "$(detect_repo_slug "$trepo" 2>/dev/null || echo "")" \
--json baseRefName --jq '.baseRefName') || base_ref="main"
References
  1. Avoid using '2>/dev/null' for blanket suppression of command errors in shell scripts to ensure that authentication, syntax, or system issues remain visible for debugging.

Comment on lines 88 to 115
  pr_state=$(printf '%s' "$pr_json" | jq -r '.state // "UNKNOWN"' 2>/dev/null || echo "UNKNOWN")
  pr_merge_state=$(printf '%s' "$pr_json" | jq -r '.mergeStateStatus // "UNKNOWN"' 2>/dev/null || echo "UNKNOWN")
+ pr_review_decision=$(printf '%s' "$pr_json" | jq -r '.reviewDecision // "NONE"' 2>/dev/null || echo "NONE")
  pr_base_ref=$(printf '%s' "$pr_json" | jq -r '.baseRefName // "main"' 2>/dev/null || echo "main")

- # Retry once if UNKNOWN (GitHub lazy-loads mergeStateStatus)
- if [[ "$pr_merge_state" == "UNKNOWN" ]]; then
-     sleep 2
-     local retry_json
-     retry_json=$(gh pr view "$pr_number" --repo "$pr_repo_slug" \
-         --json mergeable,mergeStateStatus 2>/dev/null || echo "")
-     if [[ -n "$retry_json" ]]; then
-         pr_merge_state=$(printf '%s' "$retry_json" | jq -r '.mergeStateStatus // "UNKNOWN"' 2>/dev/null || echo "UNKNOWN")
-     fi
- fi

  local is_draft
  is_draft=$(printf '%s' "$pr_json" | jq -r '.isDraft // false' 2>/dev/null || echo "false")
  if [[ "$is_draft" == "true" ]]; then
      pr_state="DRAFT"
  fi

- pr_review_decision=$(printf '%s' "$pr_json" | jq -r '.reviewDecision // "NONE"' 2>/dev/null || echo "NONE")

- # Summarize CI status
+ # CI summary
  local check_rollup
  check_rollup=$(printf '%s' "$pr_json" | jq -r '.statusCheckRollup // []' 2>/dev/null || echo "[]")
  if [[ "$check_rollup" != "[]" && "$check_rollup" != "null" ]]; then
-     local pending failed passed
+     local pending failed passed total
      pending=$(printf '%s' "$check_rollup" | jq '[.[] | select(.status == "IN_PROGRESS" or .status == "QUEUED" or .status == "PENDING")] | length' 2>/dev/null || echo "0")
      failed=$(printf '%s' "$check_rollup" | jq '[.[] | select((.conclusion | test("FAILURE|TIMED_OUT|ACTION_REQUIRED")) or .state == "FAILURE" or .state == "ERROR")] | length' 2>/dev/null || echo "0")
      passed=$(printf '%s' "$check_rollup" | jq '[.[] | select(.conclusion == "SUCCESS" or .state == "SUCCESS")] | length' 2>/dev/null || echo "0")
-     pr_ci_status="passed:${passed} failed:${failed} pending:${pending}"
-
-     # Extract names of failed checks for fix_ci routing
-     local failed_check_names
-     failed_check_names=$(printf '%s' "$check_rollup" | jq -r '[.[] | select((.conclusion | test("FAILURE|TIMED_OUT|ACTION_REQUIRED")) or .state == "FAILURE" or .state == "ERROR") | .name] | join(",")' 2>/dev/null || echo "")
-     if [[ -n "$failed_check_names" ]]; then
-         pr_ci_failed_checks="$failed_check_names"
+     total=$(printf '%s' "$check_rollup" | jq 'length' 2>/dev/null || echo "0")
+     pr_ci_summary="total:${total} passed:${passed} failed:${failed} pending:${pending}"
+
+     # Names of failed checks
+     local failed_names
+     failed_names=$(printf '%s' "$check_rollup" | jq -r '[.[] | select((.conclusion | test("FAILURE|TIMED_OUT|ACTION_REQUIRED")) or .state == "FAILURE" or .state == "ERROR") | .name] | join(", ")' 2>/dev/null || echo "")
+     if [[ -n "$failed_names" ]]; then
+         pr_ci_failed_names="$failed_names"
      fi


medium

Throughout this block, jq is called with 2>/dev/null, which suppresses standard error. According to the repository's general rules, stderr should not be suppressed for commands like jq to ensure that syntax or system errors are visible for debugging. While the || echo ... provides a fallback, hiding the actual error from jq makes it harder to diagnose issues with the JSON processing logic or malformed input from the gh command.

Please remove 2>/dev/null from these jq calls.

 					pr_state=$(printf '%s' "$pr_json" | jq -r '.state // "UNKNOWN"' || echo "UNKNOWN")
 					pr_merge_state=$(printf '%s' "$pr_json" | jq -r '.mergeStateStatus // "UNKNOWN"' || echo "UNKNOWN")
 					pr_review_decision=$(printf '%s' "$pr_json" | jq -r '.reviewDecision // "NONE"' || echo "NONE")
 					pr_base_ref=$(printf '%s' "$pr_json" | jq -r '.baseRefName // "main"' || echo "main")

 					local is_draft
 					is_draft=$(printf '%s' "$pr_json" | jq -r '.isDraft // false' || echo "false")
 					if [[ "$is_draft" == "true" ]]; then
 						pr_state="DRAFT"
 					fi

 					# CI summary
 					local check_rollup
 					check_rollup=$(printf '%s' "$pr_json" | jq -r '.statusCheckRollup // []' || echo "[]")
 					if [[ "$check_rollup" != "[]" && "$check_rollup" != "null" ]]; then
 						local pending failed passed total
 						pending=$(printf '%s' "$check_rollup" | jq '[.[] | select(.status == "IN_PROGRESS" or .status == "QUEUED" or .status == "PENDING")] | length' || echo "0")
 						failed=$(printf '%s' "$check_rollup" | jq '[.[] | select((.conclusion | test("FAILURE|TIMED_OUT|ACTION_REQUIRED")) or .state == "FAILURE" or .state == "ERROR")] | length' || echo "0")
 						passed=$(printf '%s' "$check_rollup" | jq '[.[] | select(.conclusion == "SUCCESS" or .state == "SUCCESS")] | length' || echo "0")
 						total=$(printf '%s' "$check_rollup" | jq 'length' || echo "0")
 						pr_ci_summary="total:${total} passed:${passed} failed:${failed} pending:${pending}"

 						# Names of failed checks
 						local failed_names
 						failed_names=$(printf '%s' "$check_rollup" | jq -r '[.[] | select((.conclusion | test("FAILURE|TIMED_OUT|ACTION_REQUIRED")) or .state == "FAILURE" or .state == "ERROR") | .name] | join(", ")' || echo "")
 						if [[ -n "$failed_names" ]]; then
 							pr_ci_failed_names="$failed_names"
 						fi
References
  1. In shell scripts with 'set -e' enabled, use '|| true' to prevent the script from exiting when a command like 'jq' fails on an optional lookup. Do not suppress stderr with '2>/dev/null' so that actual syntax or system errors remain visible for debugging.


# Validate required fields
local action
action=$(printf '%s' "$json_block" | jq -r '.action // ""' 2>/dev/null || echo "")


medium

This jq call suppresses stderr using 2>/dev/null, which violates the repository's general rule about not hiding errors from commands. If the JSON block is malformed, this will fail silently, making it harder to debug why an action wasn't parsed correctly. Please remove 2>/dev/null to allow potential jq errors to be logged.

Suggested change
action=$(printf '%s' "$json_block" | jq -r '.action // ""' 2>/dev/null || echo "")
action=$(printf '%s' "$json_block" | jq -r '.action // ""' || echo "")
References
  1. In shell scripts with 'set -e' enabled, use '|| true' to prevent the script from exiting when a command like 'jq' fails on an optional lookup. Do not suppress stderr with '2>/dev/null' so that actual syntax or system errors remain visible for debugging.

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.agents/scripts/supervisor/ai-lifecycle.sh:
- Around line 397-406: The rebase_branch case only increments rebase_attempts on
success; move or duplicate the increment logic so rebase_attempts is increased
on every attempt (success or failure). Specifically, ensure the db update that
reads current_attempts and writes rebase_attempts = $((current_attempts + 1))
(the commands using current_attempts, db "$SUPERVISOR_DB" "SELECT ...", and db
"$SUPERVISOR_DB" "UPDATE ...") runs regardless of whether rebase_sibling_pr
succeeds — e.g., perform the SELECT/UPDATE immediately after calling
rebase_sibling_pr (or in a finally-style block) before returning so the counter
always increments.
- Around line 122-131: The current check uses kill -0 on tpid which can be in
remote:host:pid form and will incorrectly report remote workers as dead; update
the block that sets worker_alive to detect remote PID formats (check if tpid
contains ':' or matches a remote pattern) before calling kill -0. If tpid looks
remote (e.g., contains two colons or matches remote:host:pid), set worker_alive
to a remote indicator like "remote (tpid)" or extract the real PID after the
last ':' and handle accordingly; only run kill -0 when tpid is a plain numeric
PID. Ensure changes are applied around the tpid / worker_alive logic that
currently does kill -0.

ℹ️ Review info

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f414143 and 09c3f22.

📒 Files selected for processing (2)
  • .agents/scripts/supervisor/ai-lifecycle.sh
  • .agents/scripts/supervisor/pulse.sh

Comment on lines +122 to +131
# Worker process state
local worker_alive="unknown"
if [[ -n "$tpid" && "$tpid" != "0" ]]; then
if kill -0 "$tpid" 2>/dev/null; then
worker_alive="yes"
else
worker_alive="no (PID $tpid dead)"
fi
else
worker_alive="no worker"


⚠️ Potential issue | 🟠 Major

Handle remote worker PID formats to avoid premature AI actions.

worker_pid can be stored as remote:host:pid (see pulse.sh remote dispatch handling). kill -0 will mark those as dead and the AI may take corrective actions while the remote worker is still running.

🛠️ Suggested fix to detect remote workers
 	local worker_alive="unknown"
 	if [[ -n "$tpid" && "$tpid" != "0" ]]; then
-		if kill -0 "$tpid" 2>/dev/null; then
-			worker_alive="yes"
-		else
-			worker_alive="no (PID $tpid dead)"
-		fi
+		if [[ "$tpid" == remote:* ]]; then
+			local _remote_host _remote_pid remote_helper
+			_remote_host=$(printf '%s' "$tpid" | cut -d: -f2)
+			_remote_pid=$(printf '%s' "$tpid" | cut -d: -f3)
+			remote_helper="${SCRIPT_DIR}/../remote-dispatch-helper.sh"
+			if [[ -x "$remote_helper" ]] && "$remote_helper" status "$task_id" "$_remote_host" >/dev/null 2>&1; then
+				worker_alive="yes (remote)"
+			else
+				worker_alive="no (remote worker dead)"
+			fi
+		elif kill -0 "$tpid" 2>/dev/null; then
+			worker_alive="yes"
+		else
+			worker_alive="no (PID $tpid dead)"
+		fi
 	else
 		worker_alive="no worker"
 	fi
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.agents/scripts/supervisor/ai-lifecycle.sh around lines 122 - 131, The
current check uses kill -0 on tpid which can be in remote:host:pid form and will
incorrectly report remote workers as dead; update the block that sets
worker_alive to detect remote PID formats (check if tpid contains ':' or matches
a remote pattern) before calling kill -0. If tpid looks remote (e.g., contains
two colons or matches remote:host:pid), set worker_alive to a remote indicator
like "remote (tpid)" or extract the real PID after the last ':' and handle
accordingly; only run kill -0 when tpid is a plain numeric PID. Ensure changes
are applied around the tpid / worker_alive logic that currently does kill -0.

Comment on lines 397 to +406
  rebase_branch)
      log_info "ai-lifecycle: rebasing branch for $task_id"
      update_task_status_tag "$task_id" "rebasing" "$repo_path"

      if rebase_sibling_pr "$task_id" 2>>"$SUPERVISOR_LOG"; then
          log_success "ai-lifecycle: rebase succeeded for $task_id"
          update_task_status_tag "$task_id" "ci-running" "$repo_path"
+         # Increment rebase counter
+         local current_attempts
+         current_attempts=$(db "$SUPERVISOR_DB" "SELECT rebase_attempts FROM tasks WHERE id = '$escaped_id';" 2>/dev/null || echo "0")
+         db "$SUPERVISOR_DB" "UPDATE tasks SET rebase_attempts = $((current_attempts + 1)) WHERE id = '$escaped_id';" 2>/dev/null || true
          return 0
-     else
-         log_warn "ai-lifecycle: rebase failed for $task_id"
-         update_task_status_tag "$task_id" "has-conflicts" "$repo_path"
-         return 1
      fi
+     log_warn "ai-lifecycle: rebase failed for $task_id"
+     return 1
⚠️ Potential issue | 🟠 Major

Increment rebase_attempts on every try, not just success.

Right now failures don’t increment, so the “rebase_attempts > 3 → resolve_conflicts” guard may never trigger, leading to infinite rebase loops.

🛠️ Suggested fix to count every attempt

```diff
 	rebase_branch)
-		if rebase_sibling_pr "$task_id" 2>>"$SUPERVISOR_LOG"; then
-			log_success "ai-lifecycle: rebase succeeded for $task_id"
-			local current_attempts
-			current_attempts=$(db "$SUPERVISOR_DB" "SELECT rebase_attempts FROM tasks WHERE id = '$escaped_id';" 2>/dev/null || echo "0")
-			db "$SUPERVISOR_DB" "UPDATE tasks SET rebase_attempts = $((current_attempts + 1)) WHERE id = '$escaped_id';" 2>/dev/null || true
-			return 0
-		fi
+		local current_attempts
+		current_attempts=$(db "$SUPERVISOR_DB" "SELECT rebase_attempts FROM tasks WHERE id = '$escaped_id';" 2>/dev/null || echo "0")
+		db "$SUPERVISOR_DB" "UPDATE tasks SET rebase_attempts = $((current_attempts + 1)) WHERE id = '$escaped_id';" 2>/dev/null || true
+		if rebase_sibling_pr "$task_id" 2>>"$SUPERVISOR_LOG"; then
+			log_success "ai-lifecycle: rebase succeeded for $task_id"
+			return 0
+		fi
 		log_warn "ai-lifecycle: rebase failed for $task_id"
 		return 1
 		;;
```
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

```shell
	rebase_branch)
		log_info "ai-lifecycle: rebasing branch for $task_id"
		update_task_status_tag "$task_id" "rebasing" "$repo_path"

		# Increment rebase counter on every attempt, success or failure,
		# so the rebase_attempts > 3 escalation guard can actually trigger.
		local current_attempts
		current_attempts=$(db "$SUPERVISOR_DB" "SELECT rebase_attempts FROM tasks WHERE id = '$escaped_id';" 2>/dev/null || echo "0")
		db "$SUPERVISOR_DB" "UPDATE tasks SET rebase_attempts = $((current_attempts + 1)) WHERE id = '$escaped_id';" 2>/dev/null || true

		if rebase_sibling_pr "$task_id" 2>>"$SUPERVISOR_LOG"; then
			log_success "ai-lifecycle: rebase succeeded for $task_id"
			update_task_status_tag "$task_id" "ci-running" "$repo_path"
			return 0
		fi
		log_warn "ai-lifecycle: rebase failed for $task_id"
		update_task_status_tag "$task_id" "has-conflicts" "$repo_path"
		return 1
		;;
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.agents/scripts/supervisor/ai-lifecycle.sh around lines 397-406: the `rebase_branch` case only increments `rebase_attempts` on success; move or duplicate the increment logic so `rebase_attempts` increases on every attempt, success or failure. Specifically, ensure the db SELECT/UPDATE pair that reads `current_attempts` and writes `rebase_attempts = $((current_attempts + 1))` runs regardless of whether `rebase_sibling_pr` succeeds, e.g. by performing the SELECT/UPDATE immediately after calling `rebase_sibling_pr` (or in a finally-style block) before returning, so the counter always increments.
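The failure mode the review describes can be shown with a few lines of shell. This is a minimal sketch, not the supervisor's code: `attempt_rebase`, `MAX_REBASE_ATTEMPTS`, and the `ACTION` variable are hypothetical stand-ins for `rebase_sibling_pr`, the "> 3" guard, and the AI-chosen action.

```shell
#!/bin/sh
# Sketch: bump the counter BEFORE trying, so failed attempts also move
# the task toward the resolve_conflicts escalation instead of looping.
MAX_REBASE_ATTEMPTS=3
rebase_attempts=0

attempt_rebase() { return 1; }  # stub: always fails, like a stuck conflict

next_action() {
	# Counting here (before the try) is what the suggested fix changes.
	rebase_attempts=$((rebase_attempts + 1))
	if [ "$rebase_attempts" -gt "$MAX_REBASE_ATTEMPTS" ]; then
		ACTION="resolve_conflicts"   # hand off to an interactive AI worker
	elif attempt_rebase; then
		ACTION="rebased"
	else
		ACTION="retry"
	fi
}

next_action; echo "$ACTION"   # retry  (attempt 1)
next_action; echo "$ACTION"   # retry  (attempt 2)
next_action; echo "$ACTION"   # retry  (attempt 3)
next_action; echo "$ACTION"   # resolve_conflicts (attempt 4 trips the guard)
```

With the original increment-on-success-only logic, a rebase that always fails would leave the counter at 0 and `next_action` would print "retry" forever; counting every attempt guarantees the guard trips on the fourth try.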

@marcusquinn marcusquinn merged commit dee2573 into main Feb 24, 2026
26 of 27 checks passed
@marcusquinn marcusquinn deleted the refactor/supervisor-ai-first branch February 24, 2026 12:50