t1222: Add two-phase worker hang detection with graceful termination #1869
marcusquinn merged 2 commits into main from
Conversation
…t1222) At 50% of the hung timeout, send SIGTERM for graceful shutdown via the wrapper trap. If the worker survives the grace period (min 120s, max 240s), escalate to a hard SIGKILL. This saves ~15 minutes per hung worker versus waiting the full timeout before killing. Chose the marker-file approach over a DB field: it is stateless across pulse cycles, needs no schema migration, and matches the existing PID file pattern in supervisor/pids/. Env: SUPERVISOR_HANG_GRACEFUL=true|false (default: true) toggles the behavior.
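The timing bounds described above can be sketched in shell. This is a minimal illustration of the math only; the variable names (`task_hung_timeout`, `warn_threshold`, `grace_period`) are stand-ins, since the actual pulse.sh code is not shown in this excerpt:

```shell
#!/usr/bin/env bash
# Sketch of the two-phase timing: SIGTERM at 50% of the hung timeout,
# then a grace period of 25% of the timeout clamped to [120s, 240s].
task_hung_timeout=1800  # 30-minute hung timeout, as in the failed t311.x runs

warn_threshold=$((task_hung_timeout / 2))   # SIGTERM fires here (50%)

grace_period=$((task_hung_timeout / 4))     # 25% of the hung timeout...
((grace_period < 120)) && grace_period=120  # ...but never under 120s
((grace_period > 240)) && grace_period=240  # ...and never over 240s

echo "warn at ${warn_threshold}s, hard kill ${grace_period}s later"
```

With a 30-minute timeout this warns at 900s and hard-kills 240s later, i.e. roughly 15 minutes earlier than waiting for the full 1800s.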
Walkthrough
Adds a SUPERVISOR_HANG_GRACEFUL env var and implements the hang-warn marker lifecycle: creation/cleanup and two-phase hang handling (SIGTERM at ~50%, then SIGKILL at timeout) across the supervisor helper, pulse, and cleanup scripts.
Changes
Sequence Diagram

```mermaid
sequenceDiagram
    participant HangDetector as Hang Detector
    participant Worker as Worker Process
    participant FileSystem as File System
    participant Supervisor as Supervisor
    Note over HangDetector,FileSystem: Two-phase hang detection (SUPERVISOR_HANG_GRACEFUL=true)
    HangDetector->>Worker: Monitor / check liveness
    alt 50% timeout reached (warn)
        HangDetector->>FileSystem: Create ${task}.hang-warned marker
        HangDetector->>Worker: Send SIGTERM (graceful)
        HangDetector->>HangDetector: Wait remaining timeout
    else Worker dies before warn
        Worker-->>Supervisor: exit / dead PID
        Supervisor->>FileSystem: Remove ${task}.hang-warned marker (cleanup)
    end
    alt 100% timeout reached (still hung)
        HangDetector->>Worker: Send SIGKILL (force)
        HangDetector->>FileSystem: Remove ${task}.hang-warned marker
        HangDetector->>Supervisor: Report worker dead
    end
```
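The flow in the diagram can be sketched as a single per-pulse check. All names here (`check_hung_worker`, `health_pid`, `hang_warn_marker`) are illustrative, not the real pulse.sh functions; the defensive reads and error-tolerant writes reflect the review findings below:

```shell
#!/usr/bin/env bash
# Sketch of one pulse-cycle hang check (hypothetical helper).
# Args: worker PID, worker start epoch, hung timeout (s), marker path.
check_hung_worker() {
  local health_pid=$1 started_epoch=$2 task_hung_timeout=$3 hang_warn_marker=$4
  local now_epoch elapsed
  now_epoch=$(date +%s)
  elapsed=$((now_epoch - started_epoch))

  if [ -f "$hang_warn_marker" ]; then
    # Phase 2: grace period already started; hard-kill once it expires.
    local warn_epoch grace_elapsed
    warn_epoch=$(cat "$hang_warn_marker" 2>/dev/null)
    # Default to 0 if the marker is empty or non-numeric (see review).
    [[ "$warn_epoch" =~ ^[0-9]+$ ]] || warn_epoch=0
    grace_elapsed=$((now_epoch - warn_epoch))
    if ((grace_elapsed >= 240)); then
      kill -KILL "$health_pid" 2>/dev/null || true
      rm -f "$hang_warn_marker" 2>/dev/null || true
      echo "hard-killed"
    fi
  elif ((elapsed >= task_hung_timeout / 2)); then
    # Phase 1: past 50% of the hung timeout; record the warn and ask nicely.
    echo "$now_epoch" >"$hang_warn_marker" 2>/dev/null || true
    kill -TERM "$health_pid" 2>/dev/null || true
    echo "warned"
  fi
}
```

The marker file carries the warn timestamp between pulse invocations, which is what makes the two-phase scheme stateless across cycles.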
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
Possibly related issues
🚥 Pre-merge checks: ✅ 3 passed
🔍 Code Quality Report
[MONITOR] Code Review Monitoring Report
[INFO] Latest Quality Status:
[INFO] Recent monitoring activity:
📈 Current Quality Metrics
Generated on: Thu Feb 19 02:36:11 UTC 2026
Generated by AI DevOps Framework Code Review Monitoring
Actionable comments posted: 4
🧹 Nitpick comments (1)
.agents/scripts/supervisor/pulse.sh (1)
2109-2116: The comment "enough for 2 pulse cycles" is only accurate at a 2-minute cron cadence; at the documented default of 5-minute cron, the grace period (max 240 s) is shorter than one pulse interval (300 s), so the hard kill fires on the very next pulse. The logic itself is correct (the grace period bounds are safe either way), but the comment sets a misleading expectation for operators configuring longer cron intervals.
📝 Clearer comment
```diff
-# Grace period: min(240s, 25% of hung timeout) — enough for 2 pulse cycles
+# Grace period: min(240s, max(120s, 25% of hung timeout)).
+# At 2-min cron this spans 1-2 cycles; at 5-min cron the hard kill fires on the next cycle.
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.agents/scripts/supervisor/pulse.sh around lines 2109-2116, update the misleading inline comment around the grace_period calculation (variable grace_period and task_hung_timeout) to accurately describe behavior for different cron cadences: replace "enough for 2 pulse cycles" with a note that the grace_period is min(240s, 25% of task_hung_timeout) and that at the default 5-minute pulse interval (300s) the capped 240s may be shorter than one pulse interval, so a hard kill can occur on the next pulse; keep the existing bounds logic unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In @.agents/scripts/supervisor-helper.sh:
- Line 470: The help text for SUPERVISOR_HANG_GRACEFUL currently contains "50%%"
and "100%%" inside a quoted heredoc, which renders literal "%%"; update the
heredoc content so the percentages are single percent signs by replacing "50%%"
with "50%" and "100%%" with "100%" in the SUPERVISOR_HANG_GRACEFUL help string
to produce the intended "50% / 100%".
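The `%%` finding is easy to reproduce in isolation: heredocs and `echo` emit `%` literally, so doubling it is only needed inside `printf` format strings. A quick demo (the help-text wording is illustrative, not the actual supervisor-helper.sh content):

```shell
#!/usr/bin/env bash
# Heredocs do not interpret printf-style conversions, so "50%%"
# renders literally as "50%%". Doubling belongs only in printf.
cat <<'EOF'
SIGTERM at 50%% of timeout    <- prints a doubled percent sign (the bug)
EOF

printf 'SIGTERM at 50%% of timeout\n'   # printf: %% renders as a single %
echo "SIGTERM at 50% of timeout"        # echo/heredoc: write a single % directly
```

So the fix is simply to write single percent signs in the quoted heredoc text.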
In @.agents/scripts/supervisor/pulse.sh:
- Around line 2101-2103: The write to the marker file in the pulse loop uses
echo "$now_epoch" >"$hang_warn_marker" without suppressing errors, which can
break the script under set -euo pipefail or produce noisy stderr; change the
write to mirror other file-mutating ops by redirecting stderr and swallowing
errors (e.g., add 2>/dev/null || true) so failures to create/update the marker
won't abort the script or spam logs before the subsequent kill -TERM
"$health_pid" call.
- Around line 2107-2108: The variable warn_epoch can be empty if the marker file
is zero bytes, causing the arithmetic $((now_epoch - warn_epoch)) to fail;
modify the code that reads hang_warn_marker (the warn_epoch assignment) to
defensively default to 0 when the file is missing, empty, or contains
non-numeric data (e.g., read the file into warn_epoch and then check if
warn_epoch is empty or not a number and set warn_epoch=0), then proceed to
compute grace_elapsed as before using now_epoch and warn_epoch.
- Around line 2028-2029: cleanup_worker_processes currently removes only the PID
file (rm -f "$pid_file") and therefore leaves behind the .hang-warned marker,
which causes recovered workers to be killed later; update
cleanup_worker_processes (the function named cleanup_worker_processes in
cleanup.sh) to also remove the corresponding hang marker after removing the PID
file by invoking rm -f on
"$SUPERVISOR_DIR/pids/${<task_id_or_health_task_variable>}.hang-warned" (use the
same task identifier variable already used to build $pid_file), so Phase 0.7,
Phase 0.8 and all Phase 1 callers automatically clear the .hang-warned marker
whenever they remove the PID file.
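The marker-cleanup fix described in the last comment might look like the following. This is a sketch only: the real cleanup_worker_processes does more than this, and the `task_id` parameter name is assumed from the review comment's wording:

```shell
#!/usr/bin/env bash
# Sketch: remove the .hang-warned marker alongside the PID file so a
# recovered worker is not hard-killed on a later pulse cycle.
cleanup_worker_processes() {
  local task_id=$1
  local pid_file="$SUPERVISOR_DIR/pids/${task_id}.pid"

  rm -f "$pid_file" 2>/dev/null || true
  # New: clear the hang marker whenever the PID file is removed,
  # so every caller (Phase 0.7, 0.8, Phase 1) gets this for free.
  rm -f "$SUPERVISOR_DIR/pids/${task_id}.hang-warned" 2>/dev/null || true
}

# Demo: both files are gone after one cleanup call.
SUPERVISOR_DIR="${SUPERVISOR_DIR:-$(mktemp -d)}"
mkdir -p "$SUPERVISOR_DIR/pids"
touch "$SUPERVISOR_DIR/pids/t1222.pid" "$SUPERVISOR_DIR/pids/t1222.hang-warned"
cleanup_worker_processes t1222
```

Centralizing the marker removal in the same function that removes the PID file keeps the two artifacts' lifecycles in lockstep.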
---
Nitpick comments:
In @.agents/scripts/supervisor/pulse.sh:
- Around line 2109-2116: Update the misleading inline comment around the
grace_period calculation (variable grace_period and task_hung_timeout) to
accurately describe behavior for different cron cadences: replace "enough for 2
pulse cycles" with a note that the grace_period is min(240s, 25% of
task_hung_timeout) and that at the default 5-minute pulse interval (300s) the
capped 240s may be shorter than one pulse interval, so a hard kill can occur on
the next pulse; keep the existing bounds logic unchanged.
Auto-dismissed: bot review does not block autonomous pipeline
🔍 Code Quality Report
[MONITOR] Code Review Monitoring Report
[INFO] Latest Quality Status:
[INFO] Recent monitoring activity:
📈 Current Quality Metrics
Generated on: Thu Feb 19 03:06:38 UTC 2026
Generated by AI DevOps Framework Code Review Monitoring
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In @.agents/scripts/supervisor/pulse.sh:
- Around line 2028-2029: The rm cleanup for the hang-warned marker is currently
a best-effort mutation but can abort the script under set -euo if it returns a
non-zero status; change the cleanup to tolerate failures by suppressing errors
(e.g., run the rm command in a way that it never causes a non-zero exit for the
script) for the specific invocation that removes
"$SUPERVISOR_DIR/pids/${health_task}.hang-warned" (and the similar cleanup at
the other occurrence around the same block), ensuring the removal remains
attempted but will not abort the pulse on permission/IO errors.
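The hardened pattern asked for here can be shown in a few lines. Note that `rm -f` already returns 0 for a missing file; the `|| true` guards against permission or I/O errors, which would otherwise abort a `set -euo pipefail` script. Paths and names below are illustrative:

```shell
#!/usr/bin/env bash
set -euo pipefail
# Under set -e, an rm that exits non-zero (e.g. EPERM on the pids dir)
# would abort the whole pulse. "2>/dev/null || true" keeps the removal
# strictly best-effort. SUPERVISOR_DIR and health_task are stand-ins.
SUPERVISOR_DIR="${SUPERVISOR_DIR:-$(mktemp -d)}"
health_task="t1222"
mkdir -p "$SUPERVISOR_DIR/pids"

rm -f "$SUPERVISOR_DIR/pids/${health_task}.hang-warned" 2>/dev/null || true
echo "pulse continues"
```

The same suffix is the convention the earlier comment recommends for the marker write, so both mutations degrade gracefully instead of killing the pulse.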
Auto-dismissed: bot review does not block autonomous pipeline



Summary
Add intermediate hang detection at 50% of the hung timeout that sends SIGTERM for graceful shutdown before escalating to hard SIGKILL at 100% timeout. This saves ~15 minutes of wasted compute per hung worker occurrence (t311.2, t311.3, t303 all failed with 30-minute waits).
Changes
- At 50% of task_hung_timeout: send SIGTERM to the worker (triggers the wrapper cleanup_children trap for graceful shutdown)
- At 100%: escalate to _kill_descendants + SIGKILL
- Marker file ($SUPERVISOR_DIR/pids/<task>.hang-warned) tracks warning state across pulse cycles: stateless, no DB schema change needed
- SUPERVISOR_HANG_GRACEFUL env var (default: true)
- Cleanup removes .hang-warned markers when PID files are cleaned up

Design Decisions
- Worker wrapper already installs trap cleanup_children EXIT INT TERM: SIGTERM triggers proper child cleanup
- SUPERVISOR_HANG_GRACEFUL=false reverts to the original single-phase behavior for debugging

Testing
- With SUPERVISOR_HANG_GRACEFUL=false, behavior is identical to pre-t1222

Summary by CodeRabbit