
t1050: Fix escalation lock — PID tracking and post-dispatch creation #1485

Merged

marcusquinn merged 1 commit into main from bugfix/t1050-lock-fix on Feb 14, 2026

Conversation

@marcusquinn (Owner) commented Feb 14, 2026

Summary

  • Fixes Phase 3.6 escalation lock that was blocking all escalations after a killed pulse
  • Lock now stores worker PID instead of timestamp, enabling stale detection via process liveness
  • Lock created AFTER successful dispatch instead of before, preventing orphaned locks from failed dispatches

Root Cause

The escalation lock was created before the worker dispatch. When a pulse was killed (hung in post-PR processing), the lock remained with no worker running. The 5-minute cooldown then blocked all subsequent escalation attempts. This is why Phase 3.6 never fired despite the code being correct.

Fix

  1. Lock stores worker PID (checked with kill -0)
  2. Lock created only after successful background dispatch
  3. Stale detection: dead PID + past cooldown = immediate removal
  4. Alive PID + past cooldown = warning + removal (runaway worker)

Summary by CodeRabbit

  • Bug Fixes

    • Improved escalation lock handling with process liveness verification and cooldown enforcement to prevent duplicate escalations.
    • Optimized lock file creation timing and lifecycle management to reduce stale lock scenarios.
  • Chores

    • Enhanced observability with detailed logging for lock state transitions, cleanup operations, and dispatch completion.

…ch creation

The escalation lock was created BEFORE dispatch, so if the pulse died
between lock creation and worker launch, the lock blocked all future
escalations for the full cooldown period. Now:

- Lock stores worker PID (not timestamp) for stale detection
- Lock created AFTER successful dispatch
- Stale lock detection checks if PID is alive, not just age
- Dead PID + past cooldown = stale lock removed immediately
@gemini-code-assist commented
Summary of Changes

Hello @marcusquinn, I'm Gemini Code Assist[1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses a critical issue where the escalation lock mechanism could block all subsequent escalations after a worker process was unexpectedly terminated. The changes refine the lock management by integrating PID tracking, ensuring locks are only established upon successful worker dispatch, and implementing more intelligent stale lock detection. This prevents false positives and ensures escalations proceed as intended, improving the reliability of the system's automated responses.

Highlights

  • Escalation Lock Mechanism: The escalation lock now stores the worker's Process ID (PID) instead of just a timestamp, enabling robust liveness checks.
  • Lock Creation Timing: The escalation lock is now created only after a successful worker dispatch, preventing orphaned locks if dispatch fails.
  • Stale Lock Detection: Improved logic for detecting and removing stale locks, differentiating between locks held by active processes, recently completed processes, and truly crashed or runaway workers based on PID status and cooldown period.
Changelog
  • .agents/scripts/supervisor/pulse.sh
    • Implemented PID tracking for escalation locks, allowing for liveness checks using kill -0.
    • Modified escalation lock logic to differentiate between active, recently completed, and stale locks based on PID status and age.
    • Removed the premature creation of the escalation lock before worker dispatch.
    • Ensured the escalation lock is created only after a successful worker dispatch, storing the worker's PID.
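The post-dispatch ordering described in the changelog can be sketched as follows. The helper names (`dispatch_rebase_worker`, `run_escalation`) and the lock path are hypothetical stand-ins, not the actual pulse.sh functions:

```shell
#!/usr/bin/env bash
# Hypothetical sketch of post-dispatch lock creation: the lock is
# written only after the worker is running, so a failed dispatch can
# never leave an orphaned lock behind.
ESCALATION_LOCK="${ESCALATION_LOCK:-/tmp/escalation.lock}"

# Stand-in for launching the background rebase worker; prints its PID.
dispatch_rebase_worker() {
    sleep 5 &
    echo $!
}

run_escalation() {
    local esc_pid
    esc_pid=$(dispatch_rebase_worker) || return 1   # dispatch FIRST
    [ -n "$esc_pid" ] || return 1                   # no PID: no lock
    # Only now record the lock, storing the worker PID for later
    # kill -0 liveness checks.
    echo "$esc_pid" >"$ESCALATION_LOCK"
}
```

Compare this with the old ordering, where the `echo ... >"$escalation_lock"` ran before dispatch and a killed pulse left the lock with no worker behind it.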

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@coderabbitai (Contributor) commented Feb 14, 2026

Walkthrough

Enhanced escalation lock handling in the supervisor's Phase 3.6 with process liveness verification and cooldown-aware suppression logic. Lock creation moved from pre-dispatch to post-dispatch, replacing stale-lock-age-only checks with PID-alive verification. Added observability logging for lock state transitions.

Changes

Cohort / File(s): Escalation Lock Liveness & Cooldown — .agents/scripts/supervisor/pulse.sh
Summary: Replaces simple lock-age checks with process liveness verification; reads the PID from the lock file and verifies process status before deciding escalation suppression. Moves lock file creation to the post-dispatch phase. Adds explicit logging for lock-alive, cooldown-active, and stale-lock-removal scenarios.

Sequence Diagram(s)

sequenceDiagram
    participant Supervisor as Supervisor (Phase 3.6)
    participant LockFile as Escalation Lock File
    participant Process as Worker Process
    participant Logger as Logging System

    Supervisor->>LockFile: Check if escalation lock exists
    alt Lock Exists
        Supervisor->>LockFile: Read worker PID from lock
        Supervisor->>Process: Check if PID process is alive
        alt Process is Alive
            Supervisor->>Supervisor: Check if within cooldown period
            alt Within Cooldown
                Supervisor->>Logger: Log: suppress escalation (alive & cooldown)
                Note over Supervisor: Escalation suppressed
            else Past Cooldown
                Supervisor->>LockFile: Remove stale lock
                Supervisor->>Logger: Log: stale lock removed
                Note over Supervisor: Proceed with escalation
            end
        else Process is Dead
            Supervisor->>Supervisor: Check if within cooldown period
            alt Within Cooldown
                Supervisor->>Logger: Log: suppress escalation (dead & cooldown)
                Note over Supervisor: Escalation suppressed
            else Past Cooldown
                Supervisor->>LockFile: Remove stale lock
                Supervisor->>Logger: Log: stale lock removed
                Note over Supervisor: Proceed with escalation
            end
        end
    else No Lock Exists
        Supervisor->>Supervisor: Dispatch opus rebase worker
        Supervisor->>Supervisor: Wait for dispatch completion
        Supervisor->>LockFile: Create escalation lock with worker PID
        Supervisor->>Logger: Log: lock created post-dispatch
    end

Estimated Code Review Effort

🎯 3 (Moderate) | ⏱️ ~20 minutes


Poem

🔒 The lock now whispers of living things,
Process breath checked, cooldown rings,
Dispatch done—then lock takes flight,
Zero debt shines ever bright. ✨

🚥 Pre-merge checks: ✅ 4 passed

  • Description Check: ✅ Passed (check skipped; CodeRabbit's high-level summary is enabled).
  • Title Check: ✅ Passed. The title directly and specifically describes the main changes (fixing an escalation lock mechanism with PID tracking and post-dispatch creation), aligning with the primary objectives of the pull request.
  • Docstring Coverage: ✅ Passed. Docstring coverage is 100.00%, above the required threshold of 80.00%.
  • Merge Conflict Detection: ✅ Passed. No merge conflicts detected when merging into main.



No actionable comments were generated in the recent review. 🎉

🧹 Recent nitpick comments
.agents/scripts/supervisor/pulse.sh (2)

650-672: Solid lock-checking logic — PID-based liveness is a significant improvement over timestamp-only checks.

The four-way branching (alive+fresh, alive+stale, dead+fresh, dead+stale) covers the state space well, and moving away from pure age-based staleness detection directly addresses the orphaned-lock scenario from the bug report.

One minor edge case: if the lock file exists but is empty or corrupt, lock_pid will be "". The kill -0 check is correctly skipped (guarded by -n "$lock_pid"), but the fallthrough to line 670 will log "PID dead" (empty PID string), which slightly degrades log forensics.

🛡️ Optional: guard the stale-lock log message
-			log_info "  Phase 3.6: removing stale escalation lock (PID $lock_pid dead, ${lock_age}s old)"
+			log_info "  Phase 3.6: removing stale escalation lock (PID ${lock_pid:-unknown} dead, ${lock_age}s old)"

659-663: Alive-but-stale removal allows overlapping escalations — verify this is acceptable.

When a worker is alive but past cooldown (runaway), the lock is removed, which allows the next pulse to dispatch a different escalation candidate while the runaway worker is still active. The Phase 3.6 comment at line 643 states "Only ONE escalation runs at a time (sequential) so each subsequent rebase has a clean base."

Removing the lock here breaks that sequential guarantee. If two concurrent rebase workers push force-with-lease on different branches, you likely won't hit conflicts, but it's worth confirming this is the intended trade-off (liveness over strict serialization).

If strict serialization is desired even for runaway workers, consider killing the stale process before removing the lock:

♻️ Optional: kill runaway worker before unlocking
 			# Running too long — stale, remove
 			log_warn "  Phase 3.6: escalation lock stale (PID $lock_pid alive but ${lock_age}s old), removing"
+			kill "$lock_pid" 2>/dev/null || true
 			rm -f "$escalation_lock" 2>/dev/null || true


@github-actions commented

🔍 Code Quality Report

[MONITOR] Code Review Monitoring Report

[INFO] Latest Quality Status:
SonarCloud: 0 bugs, 0 vulnerabilities, 5 code smells

[INFO] Recent monitoring activity:
Sat Feb 14 21:17:27 UTC 2026: Code review monitoring started
Sat Feb 14 21:17:27 UTC 2026: SonarCloud - Bugs: 0, Vulnerabilities: 0, Code Smells: 5

📈 Current Quality Metrics

  • BUGS: 0
  • CODE SMELLS: 5
  • VULNERABILITIES: 0

Generated on: Sat Feb 14 21:17:29 UTC 2026


Generated by AI DevOps Framework Code Review Monitoring

@sonarqubecloud commented

@gemini-code-assist bot left a comment


Code Review

This pull request introduces a significant improvement to the escalation locking mechanism by using the worker's PID for liveness checks and creating the lock only after a successful dispatch. This correctly addresses the issue of orphaned locks from killed processes. The logic is sound, but I've identified a potential race condition in the new implementation. My review includes two related comments with suggestions to make the lock check atomic by storing the timestamp within the lock file itself, which will make the mechanism more robust.

Comment on lines +650 to 652
local lock_pid lock_age
lock_pid=$(head -1 "$escalation_lock" 2>/dev/null || echo "")
lock_age=$(($(date +%s) - $(stat -f %m "$escalation_lock" 2>/dev/null || stat -c %Y "$escalation_lock" 2>/dev/null || echo "0")))


Severity: high

There's a potential race condition here. The PID is read from the lock file's content, while the lock's age is calculated from the file's modification timestamp (stat). It's possible for another process to replace the lock file between reading the PID and stat-ing the file. This could lead to using a PID from an old lock with a timestamp from a new lock, causing incorrect stale lock detection.

To fix this, the timestamp should be stored inside the lock file along with the PID (as suggested in a corresponding comment) and both should be read atomically here. This avoids relying on filesystem metadata and makes the check robust.

Suggested change
-local lock_pid lock_age
-lock_pid=$(head -1 "$escalation_lock" 2>/dev/null || echo "")
-lock_age=$(($(date +%s) - $(stat -f %m "$escalation_lock" 2>/dev/null || stat -c %Y "$escalation_lock" 2>/dev/null || echo "0")))
+local lock_pid lock_ts lock_age
+read -r lock_pid lock_ts < "$escalation_lock" 2>/dev/null || true
+lock_age=$(($(date +%s) - ${lock_ts:-0}))

WHERE id = '$(sql_escape "$esc_id")';" 2>/dev/null || true

# Create lock file AFTER successful dispatch (stores worker PID for stale detection)
echo "$esc_pid" >"$escalation_lock" 2>/dev/null || true


Severity: high

To prevent a race condition when checking the lock, it's safer to store the creation timestamp inside the lock file along with the PID. This makes reading the lock status an atomic operation.

I've added another comment where the lock is read, explaining the race condition and how to parse this new lock format.

Suggested change
-echo "$esc_pid" >"$escalation_lock" 2>/dev/null || true
+echo "$esc_pid $(date +%s)" >"$escalation_lock" 2>/dev/null || true

@marcusquinn marcusquinn merged commit 7107186 into main Feb 14, 2026
14 of 16 checks passed
@marcusquinn marcusquinn deleted the bugfix/t1050-lock-fix branch February 14, 2026 21:19