t1331: Supervisor circuit breaker — pause on consecutive failures#2271
t1331: Supervisor circuit breaker — pause on consecutive failures#2271marcusquinn merged 3 commits intomainfrom
Conversation
|
Caution Review failedThe pull request is closed. ℹ️ Recent review infoConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Pro 📒 Files selected for processing (4)
WalkthroughImplements a comprehensive circuit-breaker pattern for supervisor task dispatch. The system tracks consecutive task failures, trips dispatch prevention at a configurable threshold, persists state atomically to JSON, integrates with GitHub issue creation/updates, and supports both manual and automatic reset with cooldown periods. Changes
Sequence DiagramssequenceDiagram
participant Pulse as Pulse Coordinator
participant Circuit as Circuit Breaker
participant State as State File
participant GitHub as GitHub API
participant Dispatch as Phase 2 Dispatch
rect rgba(100, 150, 200, 0.5)
Note over Pulse,Dispatch: Task Failure Path
Pulse->>Circuit: cb_record_failure()
Circuit->>State: Read current state
State-->>Circuit: state JSON
Circuit->>Circuit: Increment consecutive_failures
Circuit->>Circuit: Check if threshold met
Circuit->>State: Write updated state (tripped=true)
Circuit->>GitHub: _cb_create_or_update_issue()
GitHub-->>Circuit: Issue created/updated
Circuit-->>Pulse: Breaker tripped
end
rect rgba(150, 200, 100, 0.5)
Note over Pulse,Dispatch: Dispatch Gate Check
Pulse->>Circuit: cb_check()
Circuit->>State: Read state
State-->>Circuit: state JSON
Circuit->>Circuit: If tripped & cooldown elapsed: auto-reset
Circuit-->>Pulse: Dispatch allowed or blocked
alt Breaker Open
Pulse->>Dispatch: Skip dispatch (log message)
else Breaker Closed
Pulse->>Dispatch: Proceed with batch/non-batch logic
end
end
rect rgba(200, 150, 100, 0.5)
Note over Pulse,GitHub: Task Success Path
Pulse->>Circuit: cb_record_success()
Circuit->>State: Reset consecutive_failures to 0
Circuit->>State: Clear tripped flag
Circuit->>GitHub: _cb_close_issue()
GitHub-->>Circuit: Issue closed
Circuit-->>Pulse: Breaker reset
end
Estimated code review effort🎯 4 (Complex) | ⏱️ ~55 minutes Possibly related PRs
Poem
✨ Finishing Touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Summary of ChangesHello @marcusquinn, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request introduces a robust circuit breaker mechanism for the supervisor, designed to enhance system resilience and operational efficiency. By automatically detecting and responding to repeated task failures, it prevents resource exhaustion and provides clear, automated notifications through GitHub issues, allowing for timely intervention and resolution of underlying problems. Highlights
Changelog
Activity
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
🔍 Code Quality Report�[0;35m[MONITOR]�[0m Code Review Monitoring Report �[0;34m[INFO]�[0m Latest Quality Status: �[0;34m[INFO]�[0m Recent monitoring activity: 📈 Current Quality Metrics
Generated on: Wed Feb 25 03:02:52 UTC 2026 Generated by AI DevOps Framework Code Review Monitoring |
There was a problem hiding this comment.
Code Review
This pull request introduces a robust circuit breaker mechanism for the supervisor, which is a valuable addition for preventing resource wastage on consecutive task failures. The implementation correctly handles failure tracking, auto-reset with a configurable cooldown, manual reset, and integrates with GitHub issues for notifications. The overall logic is sound and addresses the problem effectively. However, there are several instances where command error output is suppressed using 2>/dev/null. This practice can hide critical debugging information related to jq parsing errors, date command failures, or gh CLI issues (e.g., authentication, network problems). The comments provided suggest modifications to remove 2>/dev/null or replace it with || true or || echo "0" to ensure that actual syntax or system errors remain visible for debugging, aligning with repository guidelines.
|
|
||
| _cb_state_file() { | ||
| local dir="${SUPERVISOR_DIR:-${HOME}/.aidevops/.agent-workspace/supervisor}" | ||
| mkdir -p "$dir" 2>/dev/null || true |
There was a problem hiding this comment.
Suppressing all errors from mkdir -p with 2>/dev/null can hide important issues, such as permission problems if the directory already exists but is not writable. It's better to allow stderr to be visible for debugging.
| mkdir -p "$dir" 2>/dev/null || true | |
| mkdir -p "$dir" || true |
References
- Avoid using '2>/dev/null' for blanket suppression of command errors in shell scripts to ensure that authentication, syntax, or system issues remain visible for debugging.
| state=$(cb_read_state) | ||
|
|
||
| local current_count | ||
| current_count=$(printf '%s' "$state" | jq -r '.consecutive_failures // 0' 2>/dev/null || echo "0") |
There was a problem hiding this comment.
Suppressing jq errors with 2>/dev/null can hide syntax errors in the jq filter or issues with the jq executable itself. The // operator handles missing keys, so 2>/dev/null is not needed for that purpose. It's better to let jq errors be visible for debugging.
| current_count=$(printf '%s' "$state" | jq -r '.consecutive_failures // 0' 2>/dev/null || echo "0") | |
| current_count=$(printf '%s' "$state" | jq -r '.consecutive_failures // 0' || echo "0") |
References
- In shell scripts with 'set -e' enabled, use '|| true' to prevent the script from exiting when a command like 'jq' fails on an optional lookup. Do not suppress stderr with '2>/dev/null' so that actual syntax or system errors remain visible for debugging.
| now=$(date -u +%Y-%m-%dT%H:%M:%SZ) | ||
|
|
||
| local tripped | ||
| tripped=$(printf '%s' "$state" | jq -r '.tripped // false' 2>/dev/null || echo "false") |
There was a problem hiding this comment.
Suppressing jq errors with 2>/dev/null can hide syntax errors in the jq filter or issues with the jq executable itself. The // operator handles missing keys, so 2>/dev/null is not needed for that purpose. It's better to let jq errors be visible for debugging.
| tripped=$(printf '%s' "$state" | jq -r '.tripped // false' 2>/dev/null || echo "false") | |
| tripped=$(printf '%s' "$state" | jq -r '.tripped // false' || echo "false") |
References
- In shell scripts with 'set -e' enabled, use '|| true' to prevent the script from exiting when a command like 'jq' fails on an optional lookup. Do not suppress stderr with '2>/dev/null' so that actual syntax or system errors remain visible for debugging.
| --arg task "$task_id" \ | ||
| --arg reason "$failure_reason" \ | ||
| '.consecutive_failures = $count | .last_failure_at = $now | .last_failure_task = $task | .last_failure_reason = $reason' \ | ||
| 2>/dev/null) |
There was a problem hiding this comment.
Suppressing jq errors with 2>/dev/null can hide syntax errors in the jq filter or issues with the jq executable itself. It's better to let jq errors be visible for debugging.
| 2>/dev/null) | |
| new_state=$(printf '%s' "$state" | jq \ | |
| --argjson count "$new_count" \ | |
| --arg now "$now" \ | |
| --arg task "$task_id" \ | |
| --arg reason "$failure_reason" \ | |
| '.consecutive_failures = $count | .last_failure_at = $now | .last_failure_task = $task | .last_failure_reason = $reason') |
References
- In shell scripts with 'set -e' enabled, use '|| true' to prevent the script from exiting when a command like 'jq' fails on an optional lookup. Do not suppress stderr with '2>/dev/null' so that actual syntax or system errors remain visible for debugging.
| new_state=$(printf '%s' "$new_state" | jq \ | ||
| --arg now "$now" \ | ||
| '.tripped = true | .tripped_at = $now' \ | ||
| 2>/dev/null) |
There was a problem hiding this comment.
Suppressing jq errors with 2>/dev/null can hide syntax errors in the jq filter or issues with the jq executable itself. It's better to let jq errors be visible for debugging.
| 2>/dev/null) | |
| new_state=$(printf '%s' "$new_state" | jq \ | |
| --arg now "$now" \ | |
| '.tripped = true | .tripped_at = $now') |
References
- In shell scripts with 'set -e' enabled, use '|| true' to prevent the script from exiting when a command like 'jq' fails on an optional lookup. Do not suppress stderr with '2>/dev/null' so that actual syntax or system errors remain visible for debugging.
| 2>/dev/null || true | ||
|
|
||
| local issue_url | ||
| issue_url=$(gh issue create \ | ||
| --title "Supervisor circuit breaker tripped — ${failure_count} consecutive failures" \ |
There was a problem hiding this comment.
Suppressing gh CLI errors with 2>/dev/null can hide issues with GitHub API access, authentication, or network problems. While || true makes it non-blocking, it's better to see the errors for debugging.
| 2>/dev/null || true | |
| local issue_url | |
| issue_url=$(gh issue create \ | |
| --title "Supervisor circuit breaker tripped — ${failure_count} consecutive failures" \ | |
| gh label create "circuit-breaker" \ | |
| --description "Supervisor circuit breaker tripped — dispatch paused" \ | |
| --color "D93F0B" \ | |
| --force || true |
References
- Avoid using '2>/dev/null' for blanket suppression of command errors in shell scripts to ensure that authentication, syntax, or system issues remain visible for debugging.
| 2>/dev/null) || { | ||
| log_warn "circuit-breaker: failed to create GitHub issue" | ||
| return 1 | ||
| } | ||
| log_info "circuit-breaker: created GitHub issue: $issue_url" |
There was a problem hiding this comment.
Suppressing gh CLI errors with 2>/dev/null can hide issues with GitHub API access, authentication, or network problems. While the || { ... } block catches the failure, 2>/dev/null prevents seeing the actual error message from gh.
| 2>/dev/null) || { | |
| log_warn "circuit-breaker: failed to create GitHub issue" | |
| return 1 | |
| } | |
| log_info "circuit-breaker: created GitHub issue: $issue_url" | |
| issue_url=$(gh issue create \ | |
| --title "Supervisor circuit breaker tripped — ${failure_count} consecutive failures" \ | |
| --body "$body" \ | |
| --label "circuit-breaker") || { |
References
- Avoid using '2>/dev/null' for blanket suppression of command errors in shell scripts to ensure that authentication, syntax, or system issues remain visible for debugging.
| --state open \ | ||
| --json number \ | ||
| --jq '.[0].number // empty' \ | ||
| 2>/dev/null) || existing_issue="" | ||
|
|
There was a problem hiding this comment.
Suppressing gh CLI errors with 2>/dev/null can hide issues with GitHub API access, authentication, or network problems. It's better to let these errors be visible for debugging.
| --state open \ | |
| --json number \ | |
| --jq '.[0].number // empty' \ | |
| 2>/dev/null) || existing_issue="" | |
| existing_issue=$(gh issue list \ | |
| --label "circuit-breaker" \ | |
| --state open \ | |
| --json number \ | |
| --jq '.[0].number // empty') || existing_issue="" |
References
- Avoid using '2>/dev/null' for blanket suppression of command errors in shell scripts to ensure that authentication, syntax, or system issues remain visible for debugging.
| fi | ||
|
|
||
| gh issue close "$existing_issue" \ | ||
| --comment "Circuit breaker reset: ${reason}" \ |
There was a problem hiding this comment.
Suppressing gh CLI errors with 2>/dev/null can hide issues with GitHub API access, authentication, or network problems. While the || { ... } block catches the failure, 2>/dev/null prevents seeing the actual error message from gh.
| --comment "Circuit breaker reset: ${reason}" \ | |
| gh issue close "$existing_issue" \ | |
| --comment "Circuit breaker reset: ${reason}" || { |
References
- Avoid using '2>/dev/null' for blanket suppression of command errors in shell scripts to ensure that authentication, syntax, or system issues remain visible for debugging.
| state=$(jq -n \ | ||
| --argjson count "$CIRCUIT_BREAKER_THRESHOLD" \ | ||
| --arg now "$now" \ | ||
| --arg task "$task_id" \ | ||
| --arg reason "$reason" \ | ||
| '{consecutive_failures: $count, tripped: true, tripped_at: $now, last_failure_at: $now, last_failure_task: $task, last_failure_reason: $reason}') |
There was a problem hiding this comment.
Suppressing jq errors with 2>/dev/null can hide syntax errors in the jq filter or issues with the jq executable itself. It's better to let jq errors be visible for debugging.
| state=$(jq -n \ | |
| --argjson count "$CIRCUIT_BREAKER_THRESHOLD" \ | |
| --arg now "$now" \ | |
| --arg task "$task_id" \ | |
| --arg reason "$reason" \ | |
| '{consecutive_failures: $count, tripped: true, tripped_at: $now, last_failure_at: $now, last_failure_task: $task, last_failure_reason: $reason}') | |
| state=$(jq -n \ | |
| --argjson count "$CIRCUIT_BREAKER_THRESHOLD" \ | |
| --arg now "$now" \ | |
| --arg task "$task_id" \ | |
| --arg reason "$reason" \ | |
| '{consecutive_failures: $count, tripped: true, tripped_at: $now, last_failure_at: $now, last_failure_task: $task, last_failure_reason: $reason}') |
References
- In shell scripts with 'set -e' enabled, use '|| true' to prevent the script from exiting when a command like 'jq' fails on an optional lookup. Do not suppress stderr with '2>/dev/null' so that actual syntax or system errors remain visible for debugging.
da2f1d5 to
a309c59
Compare
New file: .agents/scripts/supervisor/circuit-breaker.sh - Tracks consecutive task failures globally via state file - Trips after N failures (default: 3, configurable via SUPERVISOR_CIRCUIT_BREAKER_THRESHOLD) - Pauses dispatch when tripped, creates/updates GitHub issue with circuit-breaker label - Manual reset via supervisor-helper.sh circuit-breaker reset - Auto-reset after configurable cooldown (SUPERVISOR_CIRCUIT_BREAKER_COOLDOWN_SECS) - Counter resets on any task success
- Remove unused 'cooldown_secs' variable declaration in cb_status() - Remove unused 'repo_path' variable in _cb_create_or_update_issue() - Add blank lines before fenced code blocks in t1331-brief.md (MD031)
… into pulse Remove blanket 2>/dev/null suppression from jq, gh, date, and mkdir commands in circuit-breaker.sh per Codacy review. Errors now flow to stderr for debugging while || fallbacks prevent script abort. Integrate cb_record_success/cb_record_failure/cb_check into pulse.sh: - Phase 1: record success on task completion, failure on retry/blocked - Phase 2: gate dispatch on cb_check (skip if circuit breaker tripped) Wire circuit-breaker subcommand in supervisor-helper.sh main router. Ref #2264
a309c59 to
a150e85
Compare
🤖 Augment PR SummarySummary: This PR introduces a supervisor-wide circuit breaker to pause dispatch after consecutive task failures. Changes:
Technical Notes: Threshold and cooldown are configurable via env vars; dispatch can auto-resume after cooldown or via manual reset. 🤖 Was this summary useful? React with 👍 or 👎 |
| state_file=$(_cb_state_file) | ||
|
|
||
| if [[ -f "$state_file" ]]; then | ||
| cat "$state_file" |
There was a problem hiding this comment.
Because supervisor-helper.sh runs with set -euo pipefail, a read failure here (e.g., state file exists but isn’t readable) can abort the entire supervisor run even though the circuit breaker is intended to be non-blocking. Consider making state reads resilient (treat read errors as “no state” and continue).
Severity: high
🤖 Was this useful? React with 👍 or 👎, or 🚀 if it prevented an incident/outage.
|
|
||
| # Atomic write via temp file + mv | ||
| local tmp_file="${state_file}.tmp.$$" | ||
| printf '%s\n' "$state_json" >"$tmp_file" |
There was a problem hiding this comment.
With set -euo pipefail enabled in the caller, a write failure (permissions/disk full) on this redirection can exit the supervisor before the circuit-breaker functions can return 0. Consider explicitly handling write/move failures so the breaker can fail-open without killing dispatch.
Severity: high
🤖 Was this useful? React with 👍 or 👎, or 🚀 if it prevented an incident/outage.
|
|
||
| # Update state with new failure count | ||
| local new_state | ||
| new_state=$(printf '%s' "$state" | jq \ |
There was a problem hiding this comment.
This jq call is unguarded, so if it fails (missing jq, corrupted JSON, etc.) it can terminate the supervisor under set -euo pipefail before the subsequent -z "$new_state" check runs. Consider ensuring JSON update failures are non-fatal in all cb_* paths.
Severity: high
Other Locations
.agents/scripts/supervisor/circuit-breaker.sh:161.agents/scripts/supervisor/circuit-breaker.sh:246
🤖 Was this useful? React with 👍 or 👎, or 🚀 if it prevented an incident/outage.
| local tripped count tripped_at last_failure last_reset | ||
| tripped=$(printf '%s' "$state" | jq -r '.tripped // false' || echo "false") | ||
| count=$(printf '%s' "$state" | jq -r '.consecutive_failures // 0' || echo "0") | ||
| tripped_at=$(printf '%s' "$state" | jq -r '.tripped_at // "never"' || echo "never") |
There was a problem hiding this comment.
The state defaults use empty-string timestamps, but jq’s // won’t replace empty strings, so tripped_at/last_reset_at may print blank (and the cooldown math may try to parse an empty date). Consider normalizing empty strings to a sentinel like never before display/math.
Severity: medium
🤖 Was this useful? React with 👍 or 👎, or 🚀 if it prevented an incident/outage.
| retry) | ||
| log_warn " $tid: RETRY ($outcome_detail)" | ||
| # t1331: Record failure for circuit breaker (consecutive failure tracking) | ||
| cb_record_failure "$tid" "$outcome_detail" 2>>"$SUPERVISOR_LOG" || true |
There was a problem hiding this comment.
Right now the breaker increments on retry/blocked, but the failed) outcome path later in this same case doesn’t record a failure, so repeated hard failures may never trip the breaker. Consider whether failed (or at least non-environment failed) should also feed cb_record_failure to match “consecutive task failures” semantics.
Severity: medium
🤖 Was this useful? React with 👍 or 👎, or 🚀 if it prevented an incident/outage.
🔍 Code Quality Report�[0;35m[MONITOR]�[0m Code Review Monitoring Report �[0;34m[INFO]�[0m Latest Quality Status: �[0;34m[INFO]�[0m Recent monitoring activity: 📈 Current Quality Metrics
Generated on: Wed Feb 25 14:27:21 UTC 2026 Generated by AI DevOps Framework Code Review Monitoring |
|



WIP - incremental commits
Ref #2264
Summary by CodeRabbit