
t1222: Add two-phase worker hang detection with graceful termination#1869

Merged
marcusquinn merged 2 commits into main from feature/t1222
Feb 19, 2026
Conversation

@marcusquinn
Owner

@marcusquinn marcusquinn commented Feb 19, 2026

Summary

Add intermediate hang detection at 50% of the hung timeout that sends SIGTERM for graceful shutdown before escalating to hard SIGKILL at 100% timeout. This saves ~15 minutes of wasted compute per hung worker occurrence (t311.2, t311.3, t303 all failed with 30-minute waits).

Changes

  • pulse.sh Phase 4 Check 2: Two-phase hang detection
    • At 50% of task_hung_timeout: send SIGTERM to worker (triggers wrapper cleanup_children trap for graceful shutdown)
    • Grace period (min 120s, max 240s, or 25% of timeout): wait for worker to terminate
    • At grace period expiry or 100% timeout: hard kill with _kill_descendants + SIGKILL
    • Marker file ($SUPERVISOR_DIR/pids/<task>.hang-warned) tracks warning state across pulse cycles — stateless, no DB schema change needed
  • supervisor-helper.sh: Document SUPERVISOR_HANG_GRACEFUL env var (default: true)
  • Dead worker cleanup: Remove stale .hang-warned markers when PID files are cleaned up
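
The two-phase flow above can be sketched as a pure decision function. This is an illustrative reconstruction under stated assumptions, not the actual pulse.sh code: `hang_phase` and its argument layout are invented for the example; the real logic lives inline in Phase 4 Check 2.

```shell
#!/usr/bin/env bash
# Illustrative sketch of the two-phase decision (hang_phase is a made-up helper).
#
# hang_phase ELAPSED TIMEOUT GRACE_ELAPSED GRACE -> none|warn|wait|kill
#   ELAPSED       seconds since last worker activity
#   TIMEOUT       task_hung_timeout
#   GRACE_ELAPSED seconds since the .hang-warned marker was written (-1 = no marker)
#   GRACE         grace period in seconds
hang_phase() {
  local elapsed=$1 timeout=$2 grace_elapsed=$3 grace=$4
  if [ "$elapsed" -ge "$timeout" ]; then
    echo kill                          # 100% timeout: hard SIGKILL regardless
  elif [ "$grace_elapsed" -ge 0 ]; then
    if [ "$grace_elapsed" -ge "$grace" ]; then
      echo kill                        # SIGTERM already sent, grace expired
    else
      echo wait                        # grace period still running
    fi
  elif [ "$elapsed" -ge $((timeout / 2)) ]; then
    echo warn                          # 50% threshold: write marker, send SIGTERM
  else
    echo none
  fi
}
```

Keeping the decision separate from the kill side effects is what makes the marker-file approach stateless: each pulse only needs the elapsed times and whether the marker exists.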

Design Decisions

  • Marker file over DB field: Avoids schema migration, matches existing PID file pattern, automatically cleaned up with PID files
  • SIGTERM over SIGUSR1: The wrapper already has trap cleanup_children EXIT INT TERM — SIGTERM triggers proper child cleanup
  • Grace period bounds (120-240s): Minimum 120s ensures at least 1 pulse cycle can detect the dead worker; maximum 240s prevents excessive waiting
  • Env var toggle: SUPERVISOR_HANG_GRACEFUL=false reverts to original single-phase behavior for debugging
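
The 120–240s clamp described above can be written as a small helper. `grace_period` is a hypothetical name for illustration; the PR computes this inline.

```shell
#!/usr/bin/env bash
# Hypothetical helper mirroring the grace-period bounds described above.
grace_period() {
  local timeout=$1
  local g=$((timeout / 4))              # 25% of task_hung_timeout
  if [ "$g" -lt 120 ]; then g=120; fi   # floor: at least 1 pulse cycle to observe death
  if [ "$g" -gt 240 ]; then g=240; fi   # ceiling: cap time spent waiting
  echo "$g"
}
```

For example, a 30-minute hung timeout yields 25% = 450s, clamped to 240s, while a 400s timeout yields 100s, raised to the 120s floor.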

Testing

  • ShellCheck: zero violations
  • Bash syntax: verified on both modified files
  • Backward compatible: when SUPERVISOR_HANG_GRACEFUL=false, behavior is identical to pre-t1222

Summary by CodeRabbit

  • Chores
    • Enabled two-phase hang detection that attempts graceful termination before hard-killing timed-out tasks (enabled by default).
    • Improved cleanup of hang-warning state so stale warnings are cleared and do not affect subsequent task runs.

…t1222)

At 50% of hung timeout, send SIGTERM for graceful shutdown via wrapper trap.
If worker survives grace period (min 120s, max 240s), escalate to hard SIGKILL.
Saves ~15 minutes per hung worker vs waiting full timeout before killing.

Chose marker-file approach over DB field — stateless across pulse cycles,
no schema migration needed, matches existing PID file pattern in supervisor/pids/.

Env: SUPERVISOR_HANG_GRACEFUL=true|false (default: true) to toggle behavior.
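
A minimal sketch of the toggle's default-true pattern; `hang_mode` is invented for illustration, only the env var name comes from the PR.

```shell
#!/usr/bin/env bash
# Default-true env toggle: unset or "true" selects two-phase behavior.
hang_mode() {
  if [ "${SUPERVISOR_HANG_GRACEFUL:-true}" = "true" ]; then
    echo "two-phase"
  else
    echo "single-phase"
  fi
}
```

The `${VAR:-true}` expansion is what makes the flag opt-out: operators only need to set it when they want the original single-phase behavior back.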
@gemini-code-assist

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

@marcusquinn marcusquinn marked this pull request as ready for review February 19, 2026 02:35

@coderabbitai
Contributor

coderabbitai bot commented Feb 19, 2026

Walkthrough

Adds SUPERVISOR_HANG_GRACEFUL env var and implements hang-warn marker lifecycle changes: creation/cleanup and two-phase hang handling (SIGTERM at ~50% then SIGKILL at timeout) across supervisor helper, pulse, and cleanup scripts.

Changes

Cohort / File(s) Summary
Hang Detection Config
.agents/scripts/supervisor-helper.sh
Adds SUPERVISOR_HANG_GRACEFUL environment variable to enable two-phase hang detection (graceful SIGTERM at ~50% timeout, SIGKILL at 100%).
Per-task Hang Handling
.agents/scripts/supervisor/pulse.sh
Adds hang-warn marker creation/cleanup and control flow for two-phase detection: issue SIGTERM at warn threshold, SIGKILL at final timeout, and ensure marker removal on dead-worker and post-kill paths.
Cleanup Enhancements
.agents/scripts/supervisor/cleanup.sh
Removes per-task hang-warn marker during worker PID cleanup to avoid stale marker state on redispatch.

Sequence Diagram

sequenceDiagram
    participant HangDetector as Hang Detector
    participant Worker as Worker Process
    participant FileSystem as File System
    participant Supervisor as Supervisor

    Note over HangDetector,FileSystem: Two-phase hang detection (SUPERVISOR_HANG_GRACEFUL=true)

    HangDetector->>Worker: Monitor / check liveness
    alt 50% timeout reached (warn)
        HangDetector->>FileSystem: Create `${task}.hang-warned` marker
        HangDetector->>Worker: Send SIGTERM (graceful)
        HangDetector->>HangDetector: Wait remaining timeout
    else Worker dies before warn
        Worker-->>Supervisor: exit / dead PID
        Supervisor->>FileSystem: Remove `${task}.hang-warned` marker (cleanup)
    end
    alt 100% timeout reached (still hung)
        HangDetector->>Worker: Send SIGKILL (force)
        HangDetector->>FileSystem: Remove `${task}.hang-warned` marker
        HangDetector->>Supervisor: Report worker dead
    end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🕰️ Watchers set a gentle, warned alarm,
A SIGTERM plea to end with calm,
If silence stays and time runs thin,
SIGKILL finishes what must end,
Markers cleared — the cycle starts again.

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
  • Description Check: ✅ Passed (check skipped; CodeRabbit's high-level summary is enabled)
  • Title Check: ✅ Passed (the title accurately summarizes the main change, two-phase worker hang detection with graceful termination (t1222), which corresponds to the modifications across the three affected files)
  • Docstring Coverage: ✅ Passed (coverage is 100.00%, above the required 80.00% threshold)


@github-actions

🔍 Code Quality Report

[MONITOR] Code Review Monitoring Report

[INFO] Latest Quality Status:
SonarCloud: 0 bugs, 0 vulnerabilities, 28 code smells

[INFO] Recent monitoring activity:
Thu Feb 19 02:36:08 UTC 2026: Code review monitoring started
Thu Feb 19 02:36:08 UTC 2026: SonarCloud - Bugs: 0, Vulnerabilities: 0, Code Smells: 28

📈 Current Quality Metrics

  • BUGS: 0
  • CODE SMELLS: 28
  • VULNERABILITIES: 0

Generated on: Thu Feb 19 02:36:11 UTC 2026


Generated by AI DevOps Framework Code Review Monitoring

coderabbitai[bot]
coderabbitai bot previously requested changes Feb 19, 2026
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 4

🧹 Nitpick comments (1)
.agents/scripts/supervisor/pulse.sh (1)

2109-2116: Comment "enough for 2 pulse cycles" is only accurate at 2-minute cron cadence; at the documented default of 5-minute cron the grace period (max 240 s) is shorter than one pulse interval (300 s), so the hard kill fires on the very next pulse.

The logic itself is correct — the grace period bounds are safe either way — but the comment sets a misleading expectation for operators configuring longer cron intervals.

📝 Clearer comment
-# Grace period: min(240s, 25% of hung timeout) — enough for 2 pulse cycles
+# Grace period: min(240s, max(120s, 25% of hung timeout)).
+# At 2-min cron this spans 1-2 cycles; at 5-min cron the hard kill fires on the next cycle.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.agents/scripts/supervisor/pulse.sh around lines 2109 - 2116, Update the
misleading inline comment around the grace_period calculation (variable
grace_period and task_hung_timeout) to accurately describe behavior for
different cron cadences: replace "enough for 2 pulse cycles" with a note that
the grace_period is min(240s, 25% of task_hung_timeout) and that at the default
5-minute pulse interval (300s) the capped 240s may be shorter than one pulse
interval, so a hard kill can occur on the next pulse; keep the existing bounds
logic unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.agents/scripts/supervisor-helper.sh:
- Line 470: The help text for SUPERVISOR_HANG_GRACEFUL currently contains "50%%"
and "100%%" inside a quoted heredoc, which renders literal "%%"; update the
heredoc content so the percentages are single percent signs by replacing "50%%"
with "50%" and "100%%" with "100%" in the SUPERVISOR_HANG_GRACEFUL help string
to produce the intended "50% / 100%".

In @.agents/scripts/supervisor/pulse.sh:
- Around line 2101-2103: The write to the marker file in the pulse loop uses
echo "$now_epoch" >"$hang_warn_marker" without suppressing errors, which can
break the script under set -euo pipefail or produce noisy stderr; change the
write to mirror other file-mutating ops by redirecting stderr and swallowing
errors (e.g., add 2>/dev/null || true) so failures to create/update the marker
won't abort the script or spam logs before the subsequent kill -TERM
"$health_pid" call.
- Around line 2107-2108: The variable warn_epoch can be empty if the marker file
is zero bytes, causing the arithmetic $((now_epoch - warn_epoch)) to fail;
modify the code that reads hang_warn_marker (the warn_epoch assignment) to
defensively default to 0 when the file is missing, empty, or contains
non-numeric data (e.g., read the file into warn_epoch and then check if
warn_epoch is empty or not a number and set warn_epoch=0), then proceed to
compute grace_elapsed as before using now_epoch and warn_epoch.
- Around line 2028-2029: cleanup_worker_processes currently removes only the PID
file (rm -f "$pid_file") and therefore leaves behind the .hang-warned marker,
which causes recovered workers to be killed later; update
cleanup_worker_processes (the function named cleanup_worker_processes in
cleanup.sh) to also remove the corresponding hang marker after removing the PID
file by invoking rm -f on
"$SUPERVISOR_DIR/pids/${<task_id_or_health_task_variable>}.hang-warned" (use the
same task identifier variable already used to build $pid_file), so Phase 0.7,
Phase 0.8 and all Phase 1 callers automatically clear the .hang-warned marker
whenever they remove the PID file.

---

Nitpick comments:
In @.agents/scripts/supervisor/pulse.sh:
- Around line 2109-2116: Update the misleading inline comment around the
grace_period calculation (variable grace_period and task_hung_timeout) to
accurately describe behavior for different cron cadences: replace "enough for 2
pulse cycles" with a note that the grace_period is min(240s, 25% of
task_hung_timeout) and that at the default 5-minute pulse interval (300s) the
capped 240s may be shorter than one pulse interval, so a hard kill can occur on
the next pulse; keep the existing bounds logic unchanged.
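
The marker-read hardening the bot asks for might look like the sketch below. `read_epoch` and the test paths are illustrative, not the actual pulse.sh identifiers; the defensive default is the point.

```shell
#!/usr/bin/env bash
# Defensive read: default to 0 when the marker file is missing, empty, or
# contains non-numeric data, so $((now_epoch - warn_epoch)) never fails.
read_epoch() {
  local marker=$1 epoch
  epoch=$(cat "$marker" 2>/dev/null)
  case "$epoch" in
    ''|*[!0-9]*) epoch=0 ;;   # reject anything that is not a bare integer
  esac
  echo "$epoch"
}
```

The `case` pattern check is POSIX-portable and avoids spawning `grep` on every pulse, which matters in a loop that runs per tracked task.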

@marcusquinn marcusquinn dismissed coderabbitai[bot]’s stale review February 19, 2026 02:52

Auto-dismissed: bot review does not block autonomous pipeline

@github-actions

🔍 Code Quality Report

[MONITOR] Code Review Monitoring Report

[INFO] Latest Quality Status:
SonarCloud: 0 bugs, 0 vulnerabilities, 28 code smells

[INFO] Recent monitoring activity:
Thu Feb 19 03:06:36 UTC 2026: Code review monitoring started
Thu Feb 19 03:06:36 UTC 2026: SonarCloud - Bugs: 0, Vulnerabilities: 0, Code Smells: 28

📈 Current Quality Metrics

  • BUGS: 0
  • CODE SMELLS: 28
  • VULNERABILITIES: 0

Generated on: Thu Feb 19 03:06:38 UTC 2026


Generated by AI DevOps Framework Code Review Monitoring


coderabbitai[bot]
coderabbitai bot previously requested changes Feb 19, 2026
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.agents/scripts/supervisor/pulse.sh:
- Around line 2028-2029: The rm cleanup for the hang-warned marker is currently
a best-effort mutation but can abort the script under set -euo if it returns a
non-zero status; change the cleanup to tolerate failures by suppressing errors
(e.g., run the rm command in a way that it never causes a non-zero exit for the
script) for the specific invocation that removes
"$SUPERVISOR_DIR/pids/${health_task}.hang-warned" (and the similar cleanup at
the other occurrence around the same block), ensuring the removal remains
attempted but will not abort the pulse on permission/IO errors.

@marcusquinn marcusquinn dismissed coderabbitai[bot]’s stale review February 19, 2026 03:26

Auto-dismissed: bot review does not block autonomous pipeline

@marcusquinn marcusquinn merged commit 42809e0 into main Feb 19, 2026
31 checks passed
@marcusquinn marcusquinn deleted the feature/t1222 branch February 19, 2026 03:42