t1222: Add two-phase worker hang detection with graceful termination #1869
marcusquinn merged 2 commits into main from
Conversation
…t1222) At 50% of the hung timeout, send SIGTERM for graceful shutdown via the wrapper trap. If the worker survives the grace period (min 120s, max 240s), escalate to a hard SIGKILL. This saves ~15 minutes per hung worker versus waiting the full timeout before killing. Chose the marker-file approach over a DB field: it is stateless across pulse cycles, needs no schema migration, and matches the existing PID file pattern in supervisor/pids/. Env: SUPERVISOR_HANG_GRACEFUL=true|false (default: true) toggles the behavior.
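The timing bounds described above can be sketched in shell. This is a minimal illustration of the math only; the variable names (`task_hung_timeout`, `warn_threshold`, `grace_period`) are stand-ins, since the actual pulse.sh code is not shown in this excerpt:

```shell
#!/usr/bin/env bash
# Sketch of the two-phase timing: SIGTERM at 50% of the hung timeout,
# then a grace period of 25% of the timeout clamped to [120s, 240s].
task_hung_timeout=1800  # 30-minute hung timeout, as in the failed t311.x runs

warn_threshold=$((task_hung_timeout / 2))   # SIGTERM fires here (50%)

grace_period=$((task_hung_timeout / 4))     # 25% of the hung timeout...
((grace_period < 120)) && grace_period=120  # ...but never under 120s
((grace_period > 240)) && grace_period=240  # ...and never over 240s

echo "warn at ${warn_threshold}s, hard kill ${grace_period}s later"
```

With a 30-minute timeout this warns at 900s and hard-kills 240s later, i.e. roughly 15 minutes earlier than waiting for the full 1800s.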
Walkthrough
Adds a SUPERVISOR_HANG_GRACEFUL env var and implements the hang-warn marker lifecycle: creation/cleanup and two-phase hang handling (SIGTERM at ~50%, then SIGKILL at timeout) across the supervisor helper, pulse, and cleanup scripts.
Changes
Sequence Diagram

```mermaid
sequenceDiagram
    participant HangDetector as Hang Detector
    participant Worker as Worker Process
    participant FileSystem as File System
    participant Supervisor as Supervisor
    Note over HangDetector,FileSystem: Two-phase hang detection (SUPERVISOR_HANG_GRACEFUL=true)
    HangDetector->>Worker: Monitor / check liveness
    alt 50% timeout reached (warn)
        HangDetector->>FileSystem: Create ${task}.hang-warned marker
        HangDetector->>Worker: Send SIGTERM (graceful)
        HangDetector->>HangDetector: Wait remaining timeout
    else Worker dies before warn
        Worker-->>Supervisor: exit / dead PID
        Supervisor->>FileSystem: Remove ${task}.hang-warned marker (cleanup)
    end
    alt 100% timeout reached (still hung)
        HangDetector->>Worker: Send SIGKILL (force)
        HangDetector->>FileSystem: Remove ${task}.hang-warned marker
        HangDetector->>Supervisor: Report worker dead
    end
```
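The flow in the diagram can be sketched as a single per-pulse check. All names here (`check_hung_worker`, `health_pid`, `hang_warn_marker`) are illustrative, not the real pulse.sh functions; the defensive reads and error-tolerant writes reflect the review findings below:

```shell
#!/usr/bin/env bash
# Sketch of one pulse-cycle hang check (hypothetical helper).
# Args: worker PID, worker start epoch, hung timeout (s), marker path.
check_hung_worker() {
  local health_pid=$1 started_epoch=$2 task_hung_timeout=$3 hang_warn_marker=$4
  local now_epoch elapsed
  now_epoch=$(date +%s)
  elapsed=$((now_epoch - started_epoch))

  if [ -f "$hang_warn_marker" ]; then
    # Phase 2: grace period already started; hard-kill once it expires.
    local warn_epoch grace_elapsed
    warn_epoch=$(cat "$hang_warn_marker" 2>/dev/null)
    # Default to 0 if the marker is empty or non-numeric (see review).
    [[ "$warn_epoch" =~ ^[0-9]+$ ]] || warn_epoch=0
    grace_elapsed=$((now_epoch - warn_epoch))
    if ((grace_elapsed >= 240)); then
      kill -KILL "$health_pid" 2>/dev/null || true
      rm -f "$hang_warn_marker" 2>/dev/null || true
      echo "hard-killed"
    fi
  elif ((elapsed >= task_hung_timeout / 2)); then
    # Phase 1: past 50% of the hung timeout; record the warn and ask nicely.
    echo "$now_epoch" >"$hang_warn_marker" 2>/dev/null || true
    kill -TERM "$health_pid" 2>/dev/null || true
    echo "warned"
  fi
}
```

The marker file carries the warn timestamp between pulse invocations, which is what makes the two-phase scheme stateless across cycles.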
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
Possibly related issues
🚥 Pre-merge checks: ✅ 3 passed
🔍 Code Quality Report
[MONITOR] Code Review Monitoring Report
[INFO] Latest Quality Status:
[INFO] Recent monitoring activity:
📈 Current Quality Metrics
Generated on: Thu Feb 19 02:36:11 UTC 2026
Generated by AI DevOps Framework Code Review Monitoring
Actionable comments posted: 4
🧹 Nitpick comments (1)
.agents/scripts/supervisor/pulse.sh (1)
2109-2116: The comment "enough for 2 pulse cycles" is only accurate at a 2-minute cron cadence; at the documented default of 5-minute cron, the grace period (max 240 s) is shorter than one pulse interval (300 s), so the hard kill fires on the very next pulse. The logic itself is correct (the grace period bounds are safe either way), but the comment sets a misleading expectation for operators configuring longer cron intervals.
📝 Clearer comment
```diff
-# Grace period: min(240s, 25% of hung timeout) — enough for 2 pulse cycles
+# Grace period: min(240s, max(120s, 25% of hung timeout)).
+# At 2-min cron this spans 1-2 cycles; at 5-min cron the hard kill fires on the next cycle.
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.agents/scripts/supervisor/pulse.sh around lines 2109-2116, update the misleading inline comment around the grace_period calculation (variable grace_period and task_hung_timeout) to accurately describe behavior for different cron cadences: replace "enough for 2 pulse cycles" with a note that the grace_period is min(240s, 25% of task_hung_timeout) and that at the default 5-minute pulse interval (300s) the capped 240s may be shorter than one pulse interval, so a hard kill can occur on the next pulse; keep the existing bounds logic unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In @.agents/scripts/supervisor-helper.sh:
- Line 470: The help text for SUPERVISOR_HANG_GRACEFUL currently contains "50%%"
and "100%%" inside a quoted heredoc, which renders literal "%%"; update the
heredoc content so the percentages are single percent signs by replacing "50%%"
with "50%" and "100%%" with "100%" in the SUPERVISOR_HANG_GRACEFUL help string
to produce the intended "50% / 100%".
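The `%%` finding is easy to reproduce in isolation: heredocs and `echo` emit `%` literally, so doubling it is only needed inside `printf` format strings. A quick demo (the help-text wording is illustrative, not the actual supervisor-helper.sh content):

```shell
#!/usr/bin/env bash
# Heredocs do not interpret printf-style conversions, so "50%%"
# renders literally as "50%%". Doubling belongs only in printf.
cat <<'EOF'
SIGTERM at 50%% of timeout    <- prints a doubled percent sign (the bug)
EOF

printf 'SIGTERM at 50%% of timeout\n'   # printf: %% renders as a single %
echo "SIGTERM at 50% of timeout"        # echo/heredoc: write a single % directly
```

So the fix is simply to write single percent signs in the quoted heredoc text.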
In @.agents/scripts/supervisor/pulse.sh:
- Around line 2101-2103: The write to the marker file in the pulse loop uses
echo "$now_epoch" >"$hang_warn_marker" without suppressing errors, which can
break the script under set -euo pipefail or produce noisy stderr; change the
write to mirror other file-mutating ops by redirecting stderr and swallowing
errors (e.g., add 2>/dev/null || true) so failures to create/update the marker
won't abort the script or spam logs before the subsequent kill -TERM
"$health_pid" call.
- Around line 2107-2108: The variable warn_epoch can be empty if the marker file
is zero bytes, causing the arithmetic $((now_epoch - warn_epoch)) to fail;
modify the code that reads hang_warn_marker (the warn_epoch assignment) to
defensively default to 0 when the file is missing, empty, or contains
non-numeric data (e.g., read the file into warn_epoch and then check if
warn_epoch is empty or not a number and set warn_epoch=0), then proceed to
compute grace_elapsed as before using now_epoch and warn_epoch.
- Around line 2028-2029: cleanup_worker_processes currently removes only the PID
file (rm -f "$pid_file") and therefore leaves behind the .hang-warned marker,
which causes recovered workers to be killed later; update
cleanup_worker_processes (the function named cleanup_worker_processes in
cleanup.sh) to also remove the corresponding hang marker after removing the PID
file by invoking rm -f on
"$SUPERVISOR_DIR/pids/${<task_id_or_health_task_variable>}.hang-warned" (use the
same task identifier variable already used to build $pid_file), so Phase 0.7,
Phase 0.8 and all Phase 1 callers automatically clear the .hang-warned marker
whenever they remove the PID file.
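The marker-cleanup fix described in the last comment might look like the following. This is a sketch only: the real cleanup_worker_processes does more than this, and the `task_id` parameter name is assumed from the review comment's wording:

```shell
#!/usr/bin/env bash
# Sketch: remove the .hang-warned marker alongside the PID file so a
# recovered worker is not hard-killed on a later pulse cycle.
cleanup_worker_processes() {
  local task_id=$1
  local pid_file="$SUPERVISOR_DIR/pids/${task_id}.pid"

  rm -f "$pid_file" 2>/dev/null || true
  # New: clear the hang marker whenever the PID file is removed,
  # so every caller (Phase 0.7, 0.8, Phase 1) gets this for free.
  rm -f "$SUPERVISOR_DIR/pids/${task_id}.hang-warned" 2>/dev/null || true
}

# Demo: both files are gone after one cleanup call.
SUPERVISOR_DIR="${SUPERVISOR_DIR:-$(mktemp -d)}"
mkdir -p "$SUPERVISOR_DIR/pids"
touch "$SUPERVISOR_DIR/pids/t1222.pid" "$SUPERVISOR_DIR/pids/t1222.hang-warned"
cleanup_worker_processes t1222
```

Centralizing the marker removal in the same function that removes the PID file keeps the two artifacts' lifecycles in lockstep.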
---
Nitpick comments:
In @.agents/scripts/supervisor/pulse.sh:
- Around line 2109-2116: Update the misleading inline comment around the
grace_period calculation (variable grace_period and task_hung_timeout) to
accurately describe behavior for different cron cadences: replace "enough for 2
pulse cycles" with a note that the grace_period is min(240s, 25% of
task_hung_timeout) and that at the default 5-minute pulse interval (300s) the
capped 240s may be shorter than one pulse interval, so a hard kill can occur on
the next pulse; keep the existing bounds logic unchanged.
Auto-dismissed: bot review does not block autonomous pipeline
🔍 Code Quality Report
[MONITOR] Code Review Monitoring Report
[INFO] Latest Quality Status:
[INFO] Recent monitoring activity:
📈 Current Quality Metrics
Generated on: Thu Feb 19 03:06:38 UTC 2026
Generated by AI DevOps Framework Code Review Monitoring
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In @.agents/scripts/supervisor/pulse.sh:
- Around line 2028-2029: The rm cleanup for the hang-warned marker is currently
a best-effort mutation but can abort the script under set -euo if it returns a
non-zero status; change the cleanup to tolerate failures by suppressing errors
(e.g., run the rm command in a way that it never causes a non-zero exit for the
script) for the specific invocation that removes
"$SUPERVISOR_DIR/pids/${health_task}.hang-warned" (and the similar cleanup at
the other occurrence around the same block), ensuring the removal remains
attempted but will not abort the pulse on permission/IO errors.
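The hardened pattern asked for here can be shown in a few lines. Note that `rm -f` already returns 0 for a missing file; the `|| true` guards against permission or I/O errors, which would otherwise abort a `set -euo pipefail` script. Paths and names below are illustrative:

```shell
#!/usr/bin/env bash
set -euo pipefail
# Under set -e, an rm that exits non-zero (e.g. EPERM on the pids dir)
# would abort the whole pulse. "2>/dev/null || true" keeps the removal
# strictly best-effort. SUPERVISOR_DIR and health_task are stand-ins.
SUPERVISOR_DIR="${SUPERVISOR_DIR:-$(mktemp -d)}"
health_task="t1222"
mkdir -p "$SUPERVISOR_DIR/pids"

rm -f "$SUPERVISOR_DIR/pids/${health_task}.hang-warned" 2>/dev/null || true
echo "pulse continues"
```

The same suffix is the convention the earlier comment recommends for the marker write, so both mutations degrade gracefully instead of killing the pulse.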
Auto-dismissed: bot review does not block autonomous pipeline



Summary
Add intermediate hang detection at 50% of the hung timeout that sends SIGTERM for graceful shutdown before escalating to hard SIGKILL at 100% timeout. This saves ~15 minutes of wasted compute per hung worker occurrence (t311.2, t311.3, t303 all failed with 30-minute waits).
Changes
- At 50% of task_hung_timeout: send SIGTERM to the worker (triggers the wrapper cleanup_children trap for graceful shutdown)
- At 100%: escalate to _kill_descendants + SIGKILL
- Marker file ($SUPERVISOR_DIR/pids/<task>.hang-warned) tracks warning state across pulse cycles: stateless, no DB schema change needed
- SUPERVISOR_HANG_GRACEFUL env var (default: true)
- Cleanup removes .hang-warned markers when PID files are cleaned up

Design Decisions
- Worker wrapper already installs trap cleanup_children EXIT INT TERM: SIGTERM triggers proper child cleanup
- SUPERVISOR_HANG_GRACEFUL=false reverts to the original single-phase behavior for debugging

Testing
- With SUPERVISOR_HANG_GRACEFUL=false, behavior is identical to pre-t1222

Summary by CodeRabbit