
t1050: Escalate rebase-blocked PRs to opus worker (Phase 3.6)#1483

Merged
marcusquinn merged 1 commit into main from feature/t1050
Feb 14, 2026

Conversation

@marcusquinn
Owner

@marcusquinn marcusquinn commented Feb 14, 2026

Summary

  • Adds Phase 3.6 to the supervisor pulse cycle that escalates rebase-blocked PRs to an opus worker when auto-rebase exhausts retries (default 5 attempts)
  • Sequential processing: only ONE escalation runs at a time with 5-minute cooldown, ensuring each subsequent rebase has a clean base
  • Completes the self-healing rebase pipeline: Phase 3.5 (auto-rebase) → Phase 3.6 (opus escalation) → resolved
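
The lock-plus-cooldown gate described above can be sketched roughly as follows. This is an illustrative sketch only: the lock path, variable names, and use of a temp directory are assumptions for demonstration, not the actual pulse.sh code.

```shell
#!/usr/bin/env bash
# Illustrative cooldown gate (hypothetical names; not the actual pulse.sh code).
lock_dir=$(mktemp -d)
escalation_lock="$lock_dir/rebase-escalation.lock"
escalation_cooldown=300 # 5-minute cooldown between escalations

should_escalate=true
if [ -f "$escalation_lock" ]; then
	lock_ts=$(cat "$escalation_lock" 2>/dev/null || echo 0)
	lock_age=$(( $(date +%s) - lock_ts ))
	if [ "$lock_age" -lt "$escalation_cooldown" ]; then
		should_escalate=false # a recent escalation is still cooling down
	else
		rm -f "$escalation_lock" # stale lock: clear it and proceed
	fi
fi

if [ "$should_escalate" = true ]; then
	date +%s > "$escalation_lock" # take the lock before dispatching ONE worker
fi
echo "should_escalate=$should_escalate"
```

On a fresh run (no lock file) this prints should_escalate=true and records the current epoch in the lock, so a subsequent pulse inside the cooldown window skips escalation.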

Context

After PRs #1477 (AI-completed rebase detection), #1480 (stale rebase cleanup), and #1481 (dirty worktree stash), the auto-rebase pipeline handles most conflicts. However, genuinely complex conflicts (e.g., 21 PRs all touching the same files) exhaust retries and stay permanently blocked. Phase 3.6 provides the escalation path.

Implementation

  • Escalation lock file prevents concurrent opus workers
  • SQL query finds ONE candidate with exhausted retries and an existing PR
  • Opus worker receives detailed prompt: fetch latest main, rebase, resolve conflicts, run shellcheck, push, request merge
  • DB tracks escalation worker PID for monitoring
  • Mail notification sent on dispatch
  • Uses cmd_transition so GitHub issue labels sync automatically
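
To make the candidate-selection bullet concrete, here is a hedged sketch of such a query run against a throwaway SQLite database. The tasks schema, column names, and error text below are inferred from this PR's discussion, not copied from the repo; only the query shape mirrors the description.

```shell
#!/usr/bin/env bash
# Hypothetical schema and sample rows; illustrative only.
db=$(mktemp)
sqlite3 "$db" "CREATE TABLE tasks (id TEXT, status TEXT, error TEXT, rebase_attempts INT, pr_url TEXT);"
sqlite3 "$db" "INSERT INTO tasks VALUES
	('t1', 'blocked', 'Merge conflict - auto-rebase failed', 5, 'https://example.com/pr/1'),
	('t2', 'blocked', 'Merge conflict - auto-rebase failed', 2, 'https://example.com/pr/2'),
	('t3', 'running', '', 9, 'no_pr');"

max_retry_cycles=3
# ONE candidate: blocked, retries exhausted, and a real PR URL on record.
candidate=$(sqlite3 "$db" "
	SELECT id FROM tasks
	WHERE status = 'blocked'
	  AND error LIKE '%Merge conflict%auto-rebase failed%'
	  AND rebase_attempts >= $max_retry_cycles
	  AND pr_url IS NOT NULL AND pr_url != '' AND pr_url != 'no_pr'
	ORDER BY rebase_attempts ASC, id ASC
	LIMIT 1;")
echo "candidate=$candidate"
```

Here only t1 qualifies: t2 has not exhausted its retries, and t3 is neither blocked nor carrying a usable PR URL.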

Testing

  • ShellCheck clean (no new warnings in Phase 3.6 code)
  • All pre-existing SC2034 warnings are in unrelated code sections

Summary by CodeRabbit

Release Notes

  • New Features
    • Introduced automatic escalation of blocked rebase pull requests to specialized workers when standard auto-rebase attempts fail beyond the retry threshold, improving handling of complex rebase scenarios.

When auto-rebase exhausts retries (default 5), Phase 3.6 dispatches a
single opus worker to manually rebase, resolve conflicts, and merge.
Sequential processing ensures each subsequent rebase has a clean base.

Includes:
- Escalation lock file to prevent concurrent escalations
- 5-minute cooldown between escalations
- DB tracking of escalation worker PID
- Mail notification on escalation dispatch
@gemini-code-assist

Summary of Changes

Hello @marcusquinn, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the automated rebase pipeline by introducing an intelligent escalation mechanism for pull requests that encounter complex merge conflicts. It integrates an AI worker to resolve conflicts that automated tools cannot handle, ensuring that blocked PRs can progress. This new phase improves the system's ability to self-heal and maintain a clean, up-to-date codebase without manual intervention for most rebase issues.

Highlights

  • New Escalation Phase: Introduced "Phase 3.6" into the supervisor pulse cycle to handle pull requests that are persistently blocked by rebase conflicts after automated attempts.
  • AI-Powered Conflict Resolution: Implemented a mechanism to dispatch an "opus" AI worker to manually rebase and resolve complex merge conflicts for PRs that have exhausted auto-rebase retries.
  • Sequential Processing & Cooldown: Ensured that only one rebase escalation runs at a time with a 5-minute cooldown period, providing a clean base for subsequent rebase attempts.
  • Enhanced Self-Healing Pipeline: Completed the self-healing rebase pipeline by adding this escalation path, addressing complex conflicts that previously caused permanent blocks.
Changelog
  • .agents/scripts/supervisor/pulse.sh
    • Added a new section for "Phase 3.6: Escalate rebase-blocked PRs to opus worker".
    • Implemented logic to check for an existing escalation lock and apply a cooldown.
    • Added a SQL query to identify a single task that has exhausted auto-rebase retries.
    • Included logic to create an escalation lock file upon dispatch.
    • Developed a detailed prompt for the AI worker, outlining steps for conflict resolution, rebasing, pushing, and merging.
    • Integrated resolve_ai_cli and resolve_model functions to determine the AI client and model.
    • Added code to dispatch the AI worker in a background process, logging its output.
    • Updated the task status in the database to 'running' and recorded the worker PID.
    • Included a notification mechanism to send an email upon worker dispatch.
Activity
  • No specific activity (comments, reviews, progress updates) has been recorded for this pull request yet.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

| Feature | Command | Description |
| --- | --- | --- |
| Code Review | /gemini review | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | /gemini summary | Provides a summary of the current pull request in its current state. |
| Comment | @gemini-code-assist | Responds in comments when explicitly tagged, both in pull request comments and review comments. |
| Help | /gemini help | Displays a list of available commands. |

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@coderabbitai
Contributor

coderabbitai bot commented Feb 14, 2026

Walkthrough

Introduces Phase 3.6 escalation flow in pulse.sh that autonomously escalates blocked rebase pull requests to an opus worker when auto-rebase failures exceed retry thresholds. Implements lock-based cooldown to prevent concurrent escalations, selects escalation candidates, constructs detailed prompts, resolves AI CLI, determines worktree paths, and dispatches worker with logging and database updates.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Phase 3.6 Escalation Flow: `.agents/scripts/supervisor/pulse.sh` | Adds escalation logic for blocked rebase PRs with lock-based concurrency control, candidate selection, AI CLI resolution, worktree path handling, opus worker dispatch, and DB/notification updates. Code block appears duplicated in two script locations. |

Sequence Diagram

sequenceDiagram
    participant Pulse as Pulse Script
    participant Lock as Lock File<br/>(Cooldown)
    participant DB as Database
    participant CLI as AI CLI
    participant Worker as Opus Worker
    participant Notify as Notifications

    Pulse->>Lock: Check escalation lock<br/>(prevent concurrent)
    alt Lock exists and valid
        Pulse->>Pulse: Wait/skip escalation
    else Lock absent or expired
        Pulse->>Lock: Create lock file
        Pulse->>DB: Query blocked PR candidate
        Pulse->>Pulse: Construct escalation prompt<br/>(with PR context)
        Pulse->>CLI: Resolve AI CLI path
        alt CLI available
            Pulse->>Pulse: Determine worktree path
            Pulse->>Worker: Dispatch opus worker<br/>(with escalation prompt)
            Pulse->>DB: Update task status
            Pulse->>Notify: Emit success notification
        else CLI unavailable
            Pulse->>Lock: Cleanup lock
            Pulse->>Notify: Log escalation abort
        end
    end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~30 minutes


Poem

🚀 When rebases rebel and retries run dry,
A pulse beats strong—Phase 3.6 springs high!
Opus workers summoned to save the day,
Locks prevent chaos, automation's way! 🔐✨

🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit's high-level summary is enabled. |
| Title Check | ✅ Passed | The title directly and clearly summarizes the main change: adding Phase 3.6 escalation logic to escalate rebase-blocked PRs to an opus worker, which aligns with the core purpose of the changeset. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00%, which is sufficient. The required threshold is 80.00%. |
| Merge Conflict Detection | ✅ Passed | No merge conflicts detected when merging into main. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.


Comment @coderabbitai help to get the list of available commands and usage tips.

@github-actions

🔍 Code Quality Report

[MONITOR] Code Review Monitoring Report

[INFO] Latest Quality Status:
SonarCloud: 0 bugs, 0 vulnerabilities, 5 code smells

[INFO] Recent monitoring activity:
Sat Feb 14 19:12:22 UTC 2026: Code review monitoring started
Sat Feb 14 19:12:26 UTC 2026: SonarCloud - Bugs: 0, Vulnerabilities: 0, Code Smells: 5

📈 Current Quality Metrics

  • BUGS: 0
  • CODE SMELLS: 5
  • VULNERABILITIES: 0

Generated on: Sat Feb 14 19:12:29 UTC 2026


Generated by AI DevOps Framework Code Review Monitoring

@sonarqubecloud

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces Phase 3.6 to the supervisor, which escalates rebase-blocked PRs to an opus worker. The implementation is well-structured and follows a sequential processing model with a cooldown to ensure stability. My review focuses on adherence to the repository's shell scripting style guide and long-term maintainability. I've identified a repeated violation of the error redirection rule and a potential maintainability issue with string-based logic in an SQL query. Overall, this is a valuable addition to the self-healing capabilities of the system.

log_verbose " Phase 3.6: escalation cooldown (${lock_age}s/${escalation_cooldown}s)"
else
# Stale lock — remove it
rm -f "$escalation_lock" 2>/dev/null || true


medium

Using 2>/dev/null here violates the repository's style guide (rule 50), which states that stderr should only be redirected to a log file. This pattern is repeated throughout the new code block (e.g., lines 651, 673, 683, 687, 690, 695, 736, 740, and 766). Suppressing errors can make debugging more difficult.

Please redirect to the supervisor log file instead to adhere to the style guide and improve observability.

Suggested change
rm -f "$escalation_lock" 2>/dev/null || true
rm -f "$escalation_lock" 2>>"$SUPERVISOR_LOG" || true
References
  1. The style guide specifies that 2>/dev/null is only acceptable when redirecting to log files, not for blanket suppression of errors. This change violates that rule. (link)

SELECT t.id, t.repo, t.pr_url, t.branch, t.rebase_attempts
FROM tasks t
WHERE t.status = 'blocked'
AND t.error LIKE '%Merge conflict%auto-rebase failed%'


medium

Relying on a specific error string with LIKE can be brittle. If the error message from the rebase process changes in the future, this query will fail to find candidates for escalation.

For better long-term maintainability, consider adding a dedicated column (e.g., rebase_failure_reason) to the tasks table to store a structured error code instead of relying on string matching in the error text.
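
A minimal sketch of that suggestion, assuming a hypothetical rebase_failure_reason column holding an enumerated code (neither the column nor the codes exist in the repo today; they are illustrative only):

```shell
#!/usr/bin/env bash
# Hypothetical structured-error column; illustrates the reviewer's idea only.
db=$(mktemp)
sqlite3 "$db" "CREATE TABLE tasks (id TEXT, status TEXT, rebase_failure_reason TEXT);"
sqlite3 "$db" "INSERT INTO tasks VALUES
	('t1', 'blocked', 'auto_rebase_exhausted'),
	('t2', 'blocked', 'dirty_worktree');"

# Structured match: survives any rewording of the human-readable error text.
candidate=$(sqlite3 "$db" "
	SELECT id FROM tasks
	WHERE status = 'blocked'
	  AND rebase_failure_reason = 'auto_rebase_exhausted'
	LIMIT 1;")
echo "candidate=$candidate"
```

Because selection keys on the enumerated code rather than the free-form error string, rewording the rebase failure message would no longer silently break escalation.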

@marcusquinn marcusquinn merged commit 1a65e61 into main Feb 14, 2026
18 of 19 checks passed
@marcusquinn marcusquinn deleted the feature/t1050 branch February 14, 2026 19:14

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 4

🤖 Fix all issues with AI agents
In @.agents/scripts/supervisor/pulse.sh:
- Around line 661-673: The bug is that max_retry_cycles is declared only inside
the Phase 3.5 while loop (local at the Phase 3.5 scope) so it can be unset when
Phase 3.5 doesn't run, breaking the Phase 3.6 SQL that uses it; fix by defining
max_retry_cycles in the shared scope before Phase 3.5/3.6 (e.g., immediately
above the Phase 3.5 loop) and remove the duplicate local declaration inside the
while loop so both Phase 3.5 logic and the escalation SQL (the db query that
checks AND t.rebase_attempts >= $max_retry_cycles) always sees a defined value.
- Around line 760-766: The block that directly runs db "$SUPERVISOR_DB" "UPDATE
tasks SET status = 'running' ..." bypasses cmd_transition and therefore skips
GitHub label sync and state_log; replace this raw UPDATE with a call to
cmd_transition so the task with id "$esc_id" is transitioned to 'running' and
the worker_pid is set to $esc_pid (use the same variables esc_id and esc_pid and
ensure any returned errors are handled similarly to the original db call),
preserving the updated_at behavior if cmd_transition doesn't set it
automatically.
- Around line 752-769: The escalation branch records esc_pid in the DB but never
writes a PID file or updates the log_file column; add code after capturing
esc_pid to create a PID file (e.g. "$SUPERVISOR_DIR/pids/$esc_pid.pid"
containing the PID) so Phase 1/4b can detect the live worker, and modify the DB
UPDATE (the db(...) call that sets status/error/worker_pid for esc_id) to also
set log_file = '$(sql_escape "$esc_log_file")'; ensure the PID file is created
atomically and is associated with esc_pid/esc_id so cleanup logic can remove it
when the worker exits.
- Around line 700-732: The escalation prompt stored in esc_prompt instructs the
worker to emit outcome markers (ESCALATION_MERGED, ESCALATION_REBASED,
ESCALATION_FAILED) but the supervisor never reads or parses that worker output;
update the supervisor to capture and parse the worker's escalation log (or wait
for the worker to finish) and detect those markers to transition the task state
accordingly, mapping ESCALATION_MERGED/REBASED to success states and
ESCALATION_FAILED to failure with the provided reason; also harden esc_prompt by
quoting/interpolating $esc_branch, $esc_pr, and $esc_repo in the prompt text to
avoid word-splitting/injection, and add an explicit supervisor-side CI check
(e.g., call gh pr view --json checkSuites/commit/status to ensure required
checks passed) before performing gh pr merge so merges are only attempted when
CI is green.
🧹 Nitpick comments (1)
.agents/scripts/supervisor/pulse.sh (1)

649-658: Lock age uses file mtime, but content is epoch — consider using the content instead.

Line 651 uses stat -f %m / stat -c %Y to get the lock file's mtime, while line 683 writes the current epoch as the file's content. Using mtime is fragile (filesystem caching, NFS, etc.). Since you're already writing the epoch into the file, reading it back is simpler and more portable:

♻️ Suggested simplification
 	if [[ -f "$escalation_lock" ]]; then
 		local lock_age
-		lock_age=$(($(date +%s) - $(stat -f %m "$escalation_lock" 2>/dev/null || stat -c %Y "$escalation_lock" 2>/dev/null || echo "0")))
+		local lock_ts
+		lock_ts=$(cat "$escalation_lock" 2>/dev/null || echo "0")
+		lock_age=$(($(date +%s) - lock_ts))
 		if [[ "$lock_age" -lt "$escalation_cooldown" ]]; then

Comment on lines +661 to +673
if [[ "$should_escalate" == "true" ]]; then
# Find ONE task that has exhausted auto-rebase retries
local escalation_candidate
escalation_candidate=$(db "$SUPERVISOR_DB" "
SELECT t.id, t.repo, t.pr_url, t.branch, t.rebase_attempts
FROM tasks t
WHERE t.status = 'blocked'
AND t.error LIKE '%Merge conflict%auto-rebase failed%'
AND t.rebase_attempts >= $max_retry_cycles
AND t.pr_url IS NOT NULL AND t.pr_url != '' AND t.pr_url != 'no_pr'
ORDER BY t.rebase_attempts ASC, t.id ASC
LIMIT 1;
" 2>/dev/null || echo "")

⚠️ Potential issue | 🔴 Critical

Bug: $max_retry_cycles is unset when no Phase 3.5 blocked tasks exist.

max_retry_cycles is declared local at line 597 inside the Phase 3.5 while loop body, which only executes when blocked_tasks is non-empty. If no merge-conflict-blocked tasks exist (the if at line 592 is false), $max_retry_cycles is never assigned. The SQL at line 669 then expands to AND t.rebase_attempts >= — a SQLite syntax error, silently swallowed by || echo "".

This means Phase 3.6 escalation never fires when Phase 3.5 has nothing to process, which is precisely when escalation candidates (retries already exhausted) are most likely to exist.

🐛 Proposed fix: define the constant at Phase 3.5/3.6 shared scope
+	# Shared constant for Phase 3.5 and 3.6
+	local max_retry_cycles=3
+
 	# Phase 3.5: Auto-retry blocked merge-conflict tasks (t1029)
 	# When a task is blocked with "Merge conflict — auto-rebase failed", periodically
 	# re-attempt the rebase after main advances. Other PRs merging often resolve conflicts.
 	local blocked_tasks

And remove the duplicate declaration inside the loop:

-			# Cap at 3 total retry cycles to prevent infinite loops
-			local max_retry_cycles=3

Comment on lines +700 to +732
# Build the escalation prompt
local esc_prompt="You are resolving a merge conflict that automated tools could not handle.

TASK: $esc_id
BRANCH: $esc_branch
PR: $esc_pr
REPO: $esc_repo
WORKTREE: ${esc_worktree:-$esc_repo}

STEPS:
1. cd to the worktree (or repo if no worktree)
2. Run: git fetch origin main
3. Abort any in-progress rebase: git rebase --abort (ignore errors)
4. Clean any dirty state: git stash push -m 'pre-escalation' (ignore errors)
5. Run: git rebase origin/main
6. If conflicts occur, resolve ALL of them:
- Read each conflicting file
- Understand both sides' intent
- Merge intelligently (keep both sides' changes where possible)
- Remove ALL conflict markers
- git add each resolved file
- git rebase --continue
- Repeat for each commit in the rebase
7. After rebase completes: git push --force-with-lease origin $esc_branch
8. Verify the PR is no longer in conflict: gh pr view $esc_pr --json mergeStateStatus
9. If CI passes, merge: gh pr merge $esc_pr --squash
10. Output ONLY: 'ESCALATION_MERGED: $esc_id' if merged, 'ESCALATION_REBASED: $esc_id' if rebased but not merged, or 'ESCALATION_FAILED: reason' if failed

RULES:
- Do NOT modify the intent of any code — only resolve conflicts
- Prefer the feature branch for new functionality, main for structural changes
- If a file has been deleted on main but modified on the branch, keep the branch version
- Do NOT create new commits beyond what the rebase produces"

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Scripts executed (repository: marcusquinn/aidevops):

# Read the file around the specified lines
wc -l .agents/scripts/supervisor/pulse.sh

# Get context around lines 700-732
sed -n '690,740p' .agents/scripts/supervisor/pulse.sh

# Check if ESCALATION_MERGED/REBASED/FAILED markers are referenced anywhere
rg -n 'ESCALATION_MERGED|ESCALATION_REBASED|ESCALATION_FAILED' --type=sh

# Check how esc_branch, esc_pr, esc_repo are sourced and quoted
rg -n 'esc_branch\|esc_pr\|esc_repo' --type=sh -B2 -A2 | head -100

# Look for evaluate_worker function
rg -n 'evaluate_worker' --type=sh -A20

# Check if there's any CI status checking before merge
rg -n 'mergeStateStatus|ciStatus|check.*pass' --type=sh

# Check where the escalation worker is dispatched and what happens with its output
rg -n 'escalation' .agents/scripts/supervisor/pulse.sh -A5 -B5 | grep -E '(dispatch|worker|eval|outcome|log|ESCALATION)' | head -60

# Look for where the escalation process is expected to complete or be handled
rg -n 'esc_log_dir|esc_model|dispatch.*esc' .agents/scripts/supervisor/pulse.sh -A10

# Check if there's a comment about handling escalation outcomes
rg -n 'escalation.*outcome|ESCALATION.*handling|phase.*escalation' .agents/scripts/supervisor/ -i

# Verify variable quoting in actual shell commands within the prompt
sed -n '710,735p' .agents/scripts/supervisor/pulse.sh | grep -E 'git|gh|branch|pr|repo'

# Look at the full escalation section to see variable sourcing
sed -n '640,750p' .agents/scripts/supervisor/pulse.sh

# Check where esc_branch, esc_pr, esc_repo are first set
rg -n 'esc_branch=|esc_pr=|esc_repo=' .agents/scripts/supervisor/pulse.sh

# Check if escalation outcome is parsed anywhere after dispatch
rg -n 'esc_log_file|escalation.*log' .agents/scripts/supervisor/pulse.sh -A5

# Look for any subsequent handling of escalated tasks
rg -n 'escalation_candidate|esc_id' .agents/scripts/supervisor/pulse.sh | tail -20


Escalation outcome markers are defined in the prompt but never parsed by the supervisor.

The prompt (line 726) instructs the worker to output ESCALATION_MERGED: $esc_id, ESCALATION_REBASED: $esc_id, or ESCALATION_FAILED: reason, but the supervisor never reads the escalation log file or acts on these markers. After dispatch, the worker runs in the background and the outcome is discarded — the task stays in running state until a timeout or health check transitions it. Phase 1 evaluation will need to handle outcome parsing.

Also: $esc_branch, $esc_pr, and $esc_repo are interpolated unquoted into the prompt text (steps 7–9). Low risk since they originate from your DB, but defensive quoting would harden it. Additionally, the prompt relies on the worker to verify CI passes before merging with no supervisor-side validation — gh pr merge will reject if required checks fail, but explicit pre-merge CI validation in the supervisor would be safer.
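
One way the supervisor could pick up those markers after the worker exits, sketched under the assumption that the escalation log file is readable at evaluation time (the simulated log content and state mapping below are illustrative, not the repo's actual evaluation code):

```shell
#!/usr/bin/env bash
# Simulated worker log; in the real flow this would be the dispatched
# worker's esc_log_file.
esc_log_file=$(mktemp)
printf 'ESCALATION_MERGED: t1050\n' > "$esc_log_file"

# Map the markers the prompt asks for onto a task-state decision.
outcome=unknown
reason=""
if grep -q '^ESCALATION_MERGED:' "$esc_log_file"; then
	outcome=merged # task can transition to a merged/success state
elif grep -q '^ESCALATION_REBASED:' "$esc_log_file"; then
	outcome=rebased # rebased but unmerged: leave the PR to the normal merge flow
elif grep -q '^ESCALATION_FAILED:' "$esc_log_file"; then
	outcome=failed
	reason=$(sed -n 's/^ESCALATION_FAILED: //p' "$esc_log_file")
fi
echo "outcome=$outcome${reason:+ reason=$reason}"
```

With the simulated log above this prints outcome=merged; a real implementation would then call the appropriate transition with the parsed reason on failure.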


Comment on lines +752 to +769
local esc_pid=$!
else
(cd "$esc_workdir" && $esc_ai_cli -p "$esc_prompt" \
${esc_model:+--model "$esc_model"} \
>"$esc_log_file" 2>&1) &
local esc_pid=$!
fi

# Record the escalation in the DB
db "$SUPERVISOR_DB" "UPDATE tasks SET
status = 'running',
error = 'Escalation: opus rebase worker (PID $esc_pid)',
worker_pid = $esc_pid,
updated_at = strftime('%Y-%m-%dT%H:%M:%SZ','now')
WHERE id = '$(sql_escape "$esc_id")';" 2>/dev/null || true

log_success " Phase 3.6: dispatched opus worker PID $esc_pid for $esc_id"
send_task_notification "$esc_id" "escalated" "Opus rebase worker dispatched (PID $esc_pid)" 2>>"$SUPERVISOR_LOG" || true

⚠️ Potential issue | 🔴 Critical

Bug: No PID file written — next pulse will orphan-kill the escalation worker.

The background worker PID is captured (esc_pid=$!) and recorded in the DB (worker_pid), but no PID file is created under $SUPERVISOR_DIR/pids/. Phase 1 (line 147–156) and Phase 4b (line 866–883) both rely on PID files to track live workers. Without one:

  • Phase 4b will immediately flag this task as a "DB orphan" (running status, no PID file) and force-fail it.
  • The opus worker gets killed or ignored before it can finish the rebase.

Additionally, the log_file column is not updated in the DB, so evaluation won't find the worker's output.

🐛 Proposed fix: write PID file and update log_file
 					local esc_pid=$!
 				fi

+				# Write PID file so Phase 1/4 can track the worker
+				mkdir -p "$SUPERVISOR_DIR/pids" 2>/dev/null || true
+				echo "$esc_pid" > "$SUPERVISOR_DIR/pids/${esc_id}.pid"
+
 				# Record the escalation in the DB
 				db "$SUPERVISOR_DB" "UPDATE tasks SET
 					status = 'running',
 					error = 'Escalation: opus rebase worker (PID $esc_pid)',
 					worker_pid = $esc_pid,
+					log_file = '$(sql_escape "$esc_log_file")',
 					updated_at = strftime('%Y-%m-%dT%H:%M:%SZ','now')
 				WHERE id = '$(sql_escape "$esc_id")';" 2>/dev/null || true
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
local esc_pid=$!
else
(cd "$esc_workdir" && $esc_ai_cli -p "$esc_prompt" \
${esc_model:+--model "$esc_model"} \
>"$esc_log_file" 2>&1) &
local esc_pid=$!
fi
# Record the escalation in the DB
db "$SUPERVISOR_DB" "UPDATE tasks SET
status = 'running',
error = 'Escalation: opus rebase worker (PID $esc_pid)',
worker_pid = $esc_pid,
updated_at = strftime('%Y-%m-%dT%H:%M:%SZ','now')
WHERE id = '$(sql_escape "$esc_id")';" 2>/dev/null || true
log_success " Phase 3.6: dispatched opus worker PID $esc_pid for $esc_id"
send_task_notification "$esc_id" "escalated" "Opus rebase worker dispatched (PID $esc_pid)" 2>>"$SUPERVISOR_LOG" || true
local esc_pid=$!
else
(cd "$esc_workdir" && $esc_ai_cli -p "$esc_prompt" \
${esc_model:+--model "$esc_model"} \
>"$esc_log_file" 2>&1) &
local esc_pid=$!
fi
# Write PID file so Phase 1/4 can track the worker
mkdir -p "$SUPERVISOR_DIR/pids" 2>/dev/null || true
echo "$esc_pid" > "$SUPERVISOR_DIR/pids/${esc_id}.pid"
# Record the escalation in the DB
db "$SUPERVISOR_DB" "UPDATE tasks SET
status = 'running',
error = 'Escalation: opus rebase worker (PID $esc_pid)',
worker_pid = $esc_pid,
log_file = '$(sql_escape "$esc_log_file")',
updated_at = strftime('%Y-%m-%dT%H:%M:%SZ','now')
WHERE id = '$(sql_escape "$esc_id")';" 2>/dev/null || true
log_success " Phase 3.6: dispatched opus worker PID $esc_pid for $esc_id"
send_task_notification "$esc_id" "escalated" "Opus rebase worker dispatched (PID $esc_pid)" 2>>"$SUPERVISOR_LOG" || true

Comment on lines +760 to +766
# Record the escalation in the DB
db "$SUPERVISOR_DB" "UPDATE tasks SET
status = 'running',
error = 'Escalation: opus rebase worker (PID $esc_pid)',
worker_pid = $esc_pid,
updated_at = strftime('%Y-%m-%dT%H:%M:%SZ','now')
WHERE id = '$(sql_escape "$esc_id")';" 2>/dev/null || true

⚠️ Potential issue | 🟠 Major

Status update bypasses cmd_transition — GitHub issue labels won't sync.

The PR description states "Uses cmd_transition so GitHub issue labels sync automatically," but line 761–766 performs a raw UPDATE tasks SET status = 'running' instead of calling cmd_transition. This skips label synchronization, state_log recording, and any other side effects that cmd_transition provides.

♻️ Proposed fix: use cmd_transition
-				db "$SUPERVISOR_DB" "UPDATE tasks SET
-					status = 'running',
-					error = 'Escalation: opus rebase worker (PID $esc_pid)',
-					worker_pid = $esc_pid,
-					updated_at = strftime('%Y-%m-%dT%H:%M:%SZ','now')
-				WHERE id = '$(sql_escape "$esc_id")';" 2>/dev/null || true
+				# Transition via cmd_transition for label sync + state_log
+				cmd_transition "$esc_id" "running" --error "Escalation: opus rebase worker (PID $esc_pid)" 2>>"$SUPERVISOR_LOG" || true
+				# Update worker_pid separately (cmd_transition doesn't handle this field)
+				db "$SUPERVISOR_DB" "UPDATE tasks SET
+					worker_pid = $esc_pid,
+					log_file = '$(sql_escape "$esc_log_file")'
+				WHERE id = '$(sql_escape "$esc_id")';" 2>/dev/null || true
