t1050: Fix escalation lock — PID tracking and post-dispatch creation#1485
marcusquinn merged 1 commit into main
Conversation
t1050: Fix escalation lock — PID tracking and post-dispatch creation

The escalation lock was created BEFORE dispatch, so if the pulse died between lock creation and worker launch, the lock blocked all future escalations for the full cooldown period. Now:

- Lock stores worker PID (not timestamp) for stale detection
- Lock created AFTER successful dispatch
- Stale lock detection checks if PID is alive, not just age
- Dead PID + past cooldown = stale lock removed immediately
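As a rough sketch of the new check, the logic above can be expressed as a small shell function. Everything here is illustrative, not the repository's actual code: the function name, `COOLDOWN_SECS`, and the log strings are assumptions; suppression is governed by the cooldown window, with the `kill -0` liveness probe distinguishing the two log paths.

```shell
# Illustrative sketch of the stale-lock check (names are hypothetical).
COOLDOWN_SECS=300   # 5-minute cooldown, per the PR description

escalation_blocked() {
    local lock_file="$1" lock_pid lock_age mtime
    [ -f "$lock_file" ] || return 1   # no lock: escalation may proceed

    lock_pid=$(head -1 "$lock_file" 2>/dev/null || echo "")
    # GNU stat first; fall back to BSD/macOS stat
    mtime=$(stat -c %Y "$lock_file" 2>/dev/null || stat -f %m "$lock_file" 2>/dev/null || echo 0)
    lock_age=$(( $(date +%s) - mtime ))

    if [ "$lock_age" -lt "$COOLDOWN_SECS" ]; then
        if [ -n "$lock_pid" ] && kill -0 "$lock_pid" 2>/dev/null; then
            echo "suppress escalation: worker $lock_pid alive, within cooldown" >&2
        else
            echo "suppress escalation: worker $lock_pid dead, within cooldown" >&2
        fi
        return 0                      # escalation suppressed
    fi

    rm -f "$lock_file"                # past cooldown: stale lock removed
    echo "stale escalation lock removed (age ${lock_age}s)" >&2
    return 1                          # escalation may proceed
}
```

A caller would treat exit status 0 as "suppressed" and non-zero as "proceed with escalation".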
Summary of Changes

Hello @marcusquinn, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request addresses a critical issue where the escalation lock mechanism could block all subsequent escalations after a worker process was unexpectedly terminated. The changes refine the lock management by integrating PID tracking, ensuring locks are only established upon successful worker dispatch, and implementing more intelligent stale lock detection. This prevents false positives and ensures escalations proceed as intended, improving the reliability of the system's automated responses.

Highlights
Changelog
Walkthrough

Enhanced escalation lock handling in the supervisor's Phase 3.6 with process liveness verification and cooldown-aware suppression logic. Lock creation moved from pre-dispatch to post-dispatch, replacing stale-lock-age-only checks with PID-alive verification. Added observability logging for lock state transitions.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Supervisor as Supervisor (Phase 3.6)
    participant LockFile as Escalation Lock File
    participant Process as Worker Process
    participant Logger as Logging System
    Supervisor->>LockFile: Check if escalation lock exists
    alt Lock Exists
        Supervisor->>LockFile: Read worker PID from lock
        Supervisor->>Process: Check if PID process is alive
        alt Process is Alive
            Supervisor->>Supervisor: Check if within cooldown period
            alt Within Cooldown
                Supervisor->>Logger: Log: suppress escalation (alive & cooldown)
                Note over Supervisor: Escalation suppressed
            else Past Cooldown
                Supervisor->>LockFile: Remove stale lock
                Supervisor->>Logger: Log: stale lock removed
                Note over Supervisor: Proceed with escalation
            end
        else Process is Dead
            Supervisor->>Supervisor: Check if within cooldown period
            alt Within Cooldown
                Supervisor->>Logger: Log: suppress escalation (dead & cooldown)
                Note over Supervisor: Escalation suppressed
            else Past Cooldown
                Supervisor->>LockFile: Remove stale lock
                Supervisor->>Logger: Log: stale lock removed
                Note over Supervisor: Proceed with escalation
            end
        end
    else No Lock Exists
        Supervisor->>Supervisor: Dispatch opus rebase worker
        Supervisor->>Supervisor: Wait for dispatch completion
        Supervisor->>LockFile: Create escalation lock with worker PID
        Supervisor->>Logger: Log: lock created post-dispatch
    end
```
Estimated Code Review Effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly Related PRs
🚥 Pre-merge checks: ✅ 4 passed
No actionable comments were generated in the recent review. 🎉
🔍 Code Quality Report

[MONITOR] Code Review Monitoring Report
[INFO] Latest Quality Status:
[INFO] Recent monitoring activity:

📈 Current Quality Metrics

Generated on: Sat Feb 14 21:17:29 UTC 2026
Generated by AI DevOps Framework Code Review Monitoring
Code Review
This pull request introduces a significant improvement to the escalation locking mechanism by using the worker's PID for liveness checks and creating the lock only after a successful dispatch. This correctly addresses the issue of orphaned locks from killed processes. The logic is sound, but I've identified a potential race condition in the new implementation. My review includes two related comments with suggestions to make the lock check atomic by storing the timestamp within the lock file itself, which will make the mechanism more robust.
```shell
local lock_pid lock_age
lock_pid=$(head -1 "$escalation_lock" 2>/dev/null || echo "")
lock_age=$(($(date +%s) - $(stat -f %m "$escalation_lock" 2>/dev/null || stat -c %Y "$escalation_lock" 2>/dev/null || echo "0")))
```
There's a potential race condition here. The PID is read from the lock file's content, while the lock's age is calculated from the file's modification timestamp (stat). It's possible for another process to replace the lock file between reading the PID and stat-ing the file. This could lead to using a PID from an old lock with a timestamp from a new lock, causing incorrect stale lock detection.
To fix this, the timestamp should be stored inside the lock file along with the PID (as suggested in a corresponding comment) and both should be read atomically here. This avoids relying on filesystem metadata and makes the check robust.
```diff
-local lock_pid lock_age
-lock_pid=$(head -1 "$escalation_lock" 2>/dev/null || echo "")
-lock_age=$(($(date +%s) - $(stat -f %m "$escalation_lock" 2>/dev/null || stat -c %Y "$escalation_lock" 2>/dev/null || echo "0")))
+local lock_pid lock_ts lock_age
+read -r lock_pid lock_ts < "$escalation_lock" 2>/dev/null || true
+lock_age=$(($(date +%s) - ${lock_ts:-0}))
```
```shell
        WHERE id = '$(sql_escape "$esc_id")';" 2>/dev/null || true

# Create lock file AFTER successful dispatch (stores worker PID for stale detection)
echo "$esc_pid" >"$escalation_lock" 2>/dev/null || true
```
To prevent a race condition when checking the lock, it's safer to store the creation timestamp inside the lock file along with the PID. This makes reading the lock status an atomic operation.
I've added another comment where the lock is read, explaining the race condition and how to parse this new lock format.
```diff
-echo "$esc_pid" >"$escalation_lock" 2>/dev/null || true
+echo "$esc_pid $(date +%s)" >"$escalation_lock" 2>/dev/null || true
```
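The reviewer's two suggestions combine into a write/read pair: the writer stores the PID and creation timestamp together, so the reader never has to mix file content with filesystem metadata. A minimal sketch, with hypothetical helper names:

```shell
# Writer side: store "PID TIMESTAMP" in one write, so a later reader
# gets both values from the same lock generation. Names are illustrative.
write_escalation_lock() {
    local lock_file="$1" worker_pid="$2"
    echo "$worker_pid $(date +%s)" >"$lock_file"
}

# Reader side: parse PID and timestamp in a single read, then derive the
# lock age from the stored timestamp rather than the file's mtime.
read_escalation_lock() {              # prints "PID AGE_SECONDS"
    local lock_file="$1" lock_pid lock_ts
    read -r lock_pid lock_ts <"$lock_file" 2>/dev/null || true
    echo "$lock_pid $(( $(date +%s) - ${lock_ts:-0} ))"
}
```

Because both fields come from one `read -r`, a concurrent replacement of the lock file can no longer pair an old PID with a new timestamp.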



Summary
Root Cause
The escalation lock was created before the worker dispatch. When a pulse was killed (hung in post-PR processing), the lock remained with no worker running. The 5-minute cooldown then blocked all subsequent escalation attempts. This is why Phase 3.6 never fired despite the code being correct.
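The ordering change can be illustrated with a small sketch. The helper names (`dispatch_and_lock`, `dispatch_opus_worker`) are hypothetical stand-ins, not the repository's actual functions:

```shell
# Stand-in for the real worker dispatch (hypothetical).
dispatch_opus_worker() { sleep 0.1; }

# Before the fix, the lock was touched BEFORE dispatch, so a crash between
# the two steps left an orphaned lock for the whole cooldown:
#   touch "$escalation_lock"
#   dispatch_opus_worker &        # may never run
#
# After the fix: dispatch first, then record the worker's PID in the lock.
# A crash before dispatch now leaves no lock behind.
dispatch_and_lock() {
    local lock_file="$1"
    dispatch_opus_worker &
    local esc_pid=$!              # PID of the dispatched worker
    echo "$esc_pid" >"$lock_file" 2>/dev/null || true
}
```

With this ordering, the lock file existing always implies a worker was actually launched, which is what makes the PID-based staleness check meaningful.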
Fix
- Lock now stores the worker PID and is created only after successful dispatch
- Stale lock detection checks whether the recorded worker PID is still alive (kill -0)

Summary by CodeRabbit
Bug Fixes
Chores