
t1050: Fix escalation lock — PID tracking and post-dispatch creation #1485

Merged

marcusquinn merged 1 commit into main from bugfix/t1050-lock-fix on Feb 14, 2026

Conversation

@marcusquinn (Owner) commented Feb 14, 2026

Summary

  • Fixes Phase 3.6 escalation lock that was blocking all escalations after a killed pulse
  • Lock now stores worker PID instead of timestamp, enabling stale detection via process liveness
  • Lock created AFTER successful dispatch instead of before, preventing orphaned locks from failed dispatches

Root Cause

The escalation lock was created before the worker dispatch. When a pulse was killed (hung in post-PR processing), the lock remained with no worker running. The 5-minute cooldown then blocked all subsequent escalation attempts. This is why Phase 3.6 never fired despite the code being correct.

Fix

  1. Lock stores worker PID (checked with kill -0)
  2. Lock created only after successful background dispatch
  3. Stale detection: dead PID + past cooldown = immediate removal
  4. Alive PID + past cooldown = warning + removal (runaway worker)

Summary by CodeRabbit

  • Bug Fixes

    • Improved escalation lock handling with process liveness verification and cooldown enforcement to prevent duplicate escalations.
    • Optimized lock file creation timing and lifecycle management to reduce stale lock scenarios.
  • Chores

    • Enhanced observability with detailed logging for lock state transitions, cleanup operations, and dispatch completion.

…ch creation

The escalation lock was created BEFORE dispatch, so if the pulse died
between lock creation and worker launch, the lock blocked all future
escalations for the full cooldown period. Now:

- Lock stores worker PID (not timestamp) for stale detection
- Lock created AFTER successful dispatch
- Stale lock detection checks if PID is alive, not just age
- Dead PID + past cooldown = stale lock removed immediately
@gemini-code-assist commented
Summary of Changes

Hello @marcusquinn, I'm Gemini Code Assist[1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses a critical issue where the escalation lock mechanism could block all subsequent escalations after a worker process was unexpectedly terminated. The changes refine the lock management by integrating PID tracking, ensuring locks are only established upon successful worker dispatch, and implementing more intelligent stale lock detection. This prevents false positives and ensures escalations proceed as intended, improving the reliability of the system's automated responses.

Highlights

  • Escalation Lock Mechanism: The escalation lock now stores the worker's Process ID (PID) instead of just a timestamp, enabling robust liveness checks.
  • Lock Creation Timing: The escalation lock is now created only after a successful worker dispatch, preventing orphaned locks if dispatch fails.
  • Stale Lock Detection: Improved logic for detecting and removing stale locks, differentiating between locks held by active processes, recently completed processes, and truly crashed or runaway workers based on PID status and cooldown period.
Changelog
  • .agents/scripts/supervisor/pulse.sh
    • Implemented PID tracking for escalation locks, allowing for liveness checks using kill -0.
    • Modified escalation lock logic to differentiate between active, recently completed, and stale locks based on PID status and age.
    • Removed the premature creation of the escalation lock before worker dispatch.
    • Ensured the escalation lock is created only after a successful worker dispatch, storing the worker's PID.
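The post-dispatch ordering described in the changelog can be sketched as follows. The helper names (`dispatch_rebase_worker`, `run_escalation`) and the lock path are hypothetical stand-ins, not the actual pulse.sh functions:

```shell
#!/usr/bin/env bash
# Hypothetical sketch of post-dispatch lock creation: the lock is
# written only after the worker is running, so a failed dispatch can
# never leave an orphaned lock behind.
ESCALATION_LOCK="${ESCALATION_LOCK:-/tmp/escalation.lock}"

# Stand-in for launching the background rebase worker; prints its PID.
dispatch_rebase_worker() {
    sleep 5 &
    echo $!
}

run_escalation() {
    local esc_pid
    esc_pid=$(dispatch_rebase_worker) || return 1   # dispatch FIRST
    [ -n "$esc_pid" ] || return 1                   # no PID: no lock
    # Only now record the lock, storing the worker PID for later
    # kill -0 liveness checks.
    echo "$esc_pid" >"$ESCALATION_LOCK"
}
```

Compare this with the old ordering, where the `echo ... >"$escalation_lock"` ran before dispatch and a killed pulse left the lock with no worker behind it.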

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@coderabbitai (Contributor) commented Feb 14, 2026

Walkthrough

Enhanced escalation lock handling in the supervisor's Phase 3.6 with process liveness verification and cooldown-aware suppression logic. Lock creation moved from pre-dispatch to post-dispatch, replacing stale-lock-age-only checks with PID-alive verification. Added observability logging for lock state transitions.

Changes

Cohort / File(s): Escalation Lock Liveness & Cooldown — .agents/scripts/supervisor/pulse.sh
Summary: Replaces simple lock-age checks with process liveness verification; reads the PID from the lock file and verifies process status before deciding escalation suppression. Moves lock file creation to the post-dispatch phase. Adds explicit logging for lock-alive, cooldown-active, and stale-lock-removal scenarios.

Sequence Diagram(s)

sequenceDiagram
    participant Supervisor as Supervisor (Phase 3.6)
    participant LockFile as Escalation Lock File
    participant Process as Worker Process
    participant Logger as Logging System

    Supervisor->>LockFile: Check if escalation lock exists
    alt Lock Exists
        Supervisor->>LockFile: Read worker PID from lock
        Supervisor->>Process: Check if PID process is alive
        alt Process is Alive
            Supervisor->>Supervisor: Check if within cooldown period
            alt Within Cooldown
                Supervisor->>Logger: Log: suppress escalation (alive & cooldown)
                Note over Supervisor: Escalation suppressed
            else Past Cooldown
                Supervisor->>LockFile: Remove stale lock
                Supervisor->>Logger: Log: stale lock removed
                Note over Supervisor: Proceed with escalation
            end
        else Process is Dead
            Supervisor->>Supervisor: Check if within cooldown period
            alt Within Cooldown
                Supervisor->>Logger: Log: suppress escalation (dead & cooldown)
                Note over Supervisor: Escalation suppressed
            else Past Cooldown
                Supervisor->>LockFile: Remove stale lock
                Supervisor->>Logger: Log: stale lock removed
                Note over Supervisor: Proceed with escalation
            end
        end
    else No Lock Exists
        Supervisor->>Supervisor: Dispatch opus rebase worker
        Supervisor->>Supervisor: Wait for dispatch completion
        Supervisor->>LockFile: Create escalation lock with worker PID
        Supervisor->>Logger: Log: lock created post-dispatch
    end

Estimated Code Review Effort

🎯 3 (Moderate) | ⏱️ ~20 minutes


Poem

🔒 The lock now whispers of living things,
Process breath checked, cooldown rings,
Dispatch done—then lock takes flight,
Zero debt shines ever bright. ✨

🚥 Pre-merge checks: ✅ 4 passed

  • Description Check: ✅ Passed (check skipped; CodeRabbit's high-level summary is enabled).
  • Title Check: ✅ Passed. The title directly and specifically describes the main changes (fixing an escalation lock mechanism with PID tracking and post-dispatch creation), aligning with the primary objectives of the pull request.
  • Docstring Coverage: ✅ Passed. Docstring coverage is 100.00%, above the required threshold of 80.00%.
  • Merge Conflict Detection: ✅ Passed. No merge conflicts detected when merging into main.



No actionable comments were generated in the recent review. 🎉

🧹 Recent nitpick comments
.agents/scripts/supervisor/pulse.sh (2)

650-672: Solid lock-checking logic — PID-based liveness is a significant improvement over timestamp-only checks.

The four-way branching (alive+fresh, alive+stale, dead+fresh, dead+stale) covers the state space well, and moving away from pure age-based staleness detection directly addresses the orphaned-lock scenario from the bug report.

One minor edge case: if the lock file exists but is empty or corrupt, lock_pid will be "". The kill -0 check is correctly skipped (guarded by -n "$lock_pid"), but the fallthrough to line 670 will log "PID dead" (empty PID string), which slightly degrades log forensics.

🛡️ Optional: guard the stale-lock log message
-			log_info "  Phase 3.6: removing stale escalation lock (PID $lock_pid dead, ${lock_age}s old)"
+			log_info "  Phase 3.6: removing stale escalation lock (PID ${lock_pid:-unknown} dead, ${lock_age}s old)"

659-663: Alive-but-stale removal allows overlapping escalations — verify this is acceptable.

When a worker is alive but past cooldown (runaway), the lock is removed, which allows the next pulse to dispatch a different escalation candidate while the runaway worker is still active. The Phase 3.6 comment at line 643 states "Only ONE escalation runs at a time (sequential) so each subsequent rebase has a clean base."

Removing the lock here breaks that sequential guarantee. If two concurrent rebase workers push force-with-lease on different branches, you likely won't hit conflicts, but it's worth confirming this is the intended trade-off (liveness over strict serialization).

If strict serialization is desired even for runaway workers, consider killing the stale process before removing the lock:

♻️ Optional: kill runaway worker before unlocking
 			# Running too long — stale, remove
 			log_warn "  Phase 3.6: escalation lock stale (PID $lock_pid alive but ${lock_age}s old), removing"
+			kill "$lock_pid" 2>/dev/null || true
 			rm -f "$escalation_lock" 2>/dev/null || true


@github-actions commented

🔍 Code Quality Report

[MONITOR] Code Review Monitoring Report

[INFO] Latest Quality Status:
SonarCloud: 0 bugs, 0 vulnerabilities, 5 code smells

[INFO] Recent monitoring activity:
Sat Feb 14 21:17:27 UTC 2026: Code review monitoring started
Sat Feb 14 21:17:27 UTC 2026: SonarCloud - Bugs: 0, Vulnerabilities: 0, Code Smells: 5

📈 Current Quality Metrics

  • BUGS: 0
  • CODE SMELLS: 5
  • VULNERABILITIES: 0

Generated on: Sat Feb 14 21:17:29 UTC 2026


Generated by AI DevOps Framework Code Review Monitoring

@sonarqubecloud commented

@gemini-code-assist bot left a comment


Code Review

This pull request introduces a significant improvement to the escalation locking mechanism by using the worker's PID for liveness checks and creating the lock only after a successful dispatch. This correctly addresses the issue of orphaned locks from killed processes. The logic is sound, but I've identified a potential race condition in the new implementation. My review includes two related comments with suggestions to make the lock check atomic by storing the timestamp within the lock file itself, which will make the mechanism more robust.

Comment on lines +650 to 652
local lock_pid lock_age
lock_pid=$(head -1 "$escalation_lock" 2>/dev/null || echo "")
lock_age=$(($(date +%s) - $(stat -f %m "$escalation_lock" 2>/dev/null || stat -c %Y "$escalation_lock" 2>/dev/null || echo "0")))


Severity: high

There's a potential race condition here. The PID is read from the lock file's content, while the lock's age is calculated from the file's modification timestamp (stat). It's possible for another process to replace the lock file between reading the PID and stat-ing the file. This could lead to using a PID from an old lock with a timestamp from a new lock, causing incorrect stale lock detection.

To fix this, the timestamp should be stored inside the lock file along with the PID (as suggested in a corresponding comment) and both should be read atomically here. This avoids relying on filesystem metadata and makes the check robust.

Suggested change
-local lock_pid lock_age
-lock_pid=$(head -1 "$escalation_lock" 2>/dev/null || echo "")
-lock_age=$(($(date +%s) - $(stat -f %m "$escalation_lock" 2>/dev/null || stat -c %Y "$escalation_lock" 2>/dev/null || echo "0")))
+local lock_pid lock_ts lock_age
+read -r lock_pid lock_ts < "$escalation_lock" 2>/dev/null || true
+lock_age=$(($(date +%s) - ${lock_ts:-0}))

WHERE id = '$(sql_escape "$esc_id")';" 2>/dev/null || true

# Create lock file AFTER successful dispatch (stores worker PID for stale detection)
echo "$esc_pid" >"$escalation_lock" 2>/dev/null || true


Severity: high

To prevent a race condition when checking the lock, it's safer to store the creation timestamp inside the lock file along with the PID. This makes reading the lock status an atomic operation.

I've added another comment where the lock is read, explaining the race condition and how to parse this new lock format.

Suggested change
-echo "$esc_pid" >"$escalation_lock" 2>/dev/null || true
+echo "$esc_pid $(date +%s)" >"$escalation_lock" 2>/dev/null || true

@marcusquinn marcusquinn merged commit 7107186 into main Feb 14, 2026
14 of 16 checks passed
@marcusquinn marcusquinn deleted the bugfix/t1050-lock-fix branch February 14, 2026 21:19