fix: resolve three supervisor pulse-blocking bugs #2185

Merged
marcusquinn merged 1 commit into main from bugfix/supervisor-pulse-blockers
Feb 23, 2026

@marcusquinn (Owner) commented Feb 23, 2026

Summary

Fixes three bugs that caused the supervisor pulse to die before reaching Phase 3+ (PR lifecycle, health issue update), creating a loop where every 2-minute pulse would crash at the same point.

Bugs Fixed

1. Dedup guard invalid state transition (dispatch.sh)

The dedup guard attempted a queued -> blocked transition, which is not in VALID_TRANSITIONS. The transition failed silently, but its downstream effects (TODO.md commit, mail send) caused the pulse to exit under set -e. The guard now uses queued -> cancelled, which is a valid transition.

2. --reason invalid option for cmd_transition (todo-sync.sh)

Phase 0.6 (queue-dispatchability reconciliation) and Phase 7b (DB-TODO reconciliation) passed --reason to cmd_transition, but the only valid option is --error. The unknown-option handler returned 1, so each call fell into the || { continue } error path. Fixed 4 occurrences.
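
To illustrate the failure mode, here is a minimal sketch of an option parser that rejects unknown flags. It assumes nothing about the real cmd_transition beyond what the description states (--error is accepted, any other option returns 1). The names demo_transition and reconcile, the task IDs, and the loop body are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for cmd_transition: accepts only --error and
# returns 1 on any other option, as the PR description states.
demo_transition() {
	local task_id="$1" new_state="$2"
	shift 2
	local error_msg=""
	while [[ $# -gt 0 ]]; do
		case "$1" in
		--error)
			[[ $# -ge 2 ]] || return 1
			error_msg="$2"
			shift 2
			;;
		*)
			echo "unknown option: $1" >&2
			return 1
			;;
		esac
	done
	echo "$task_id -> $new_state (${error_msg:-no error})"
}

# Caller pattern described for the reconciliation phases:
# a failed transition is skipped via || { continue }.
reconcile() {
	local flag="$1" applied=0 task
	for task in t1 t2; do
		demo_transition "$task" "cancelled" "$flag" "stale" >/dev/null 2>&1 || {
			continue
		}
		applied=$((applied + 1))
	done
	echo "$applied"
}

reconcile --reason # prints 0: every call takes the error path
reconcile --error  # prints 2: both transitions are applied
```

In this sketch the loop itself never fails, which is why a wrong flag can go unnoticed: the only symptom is that no task ever changes state.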

3. queued:blocked added to VALID_TRANSITIONS (supervisor-helper.sh)

Defense-in-depth: even though the dedup guard now uses cancelled, queued -> blocked is semantically valid (e.g., a queued task discovering unresolved dependencies). Added to prevent future code from hitting the same invalid transition error.
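
A minimal sketch of how an allow-list of this kind might be checked. Only the array name VALID_TRANSITIONS and the from:to entry format (queued:blocked) come from this PR. The helper is_valid_transition is hypothetical, and the array below is a two-entry subset for illustration, not the real contents of supervisor-helper.sh.

```shell
#!/usr/bin/env bash
# Illustrative subset only; the real array has other entries that are omitted here.
VALID_TRANSITIONS=(
	"queued:cancelled"
	"queued:blocked" # the entry added by this PR
)

# Hypothetical helper: succeeds when "from:to" is in the allow-list.
is_valid_transition() {
	local from="$1" to="$2" entry
	for entry in "${VALID_TRANSITIONS[@]}"; do
		if [[ "$entry" == "$from:$to" ]]; then
			return 0
		fi
	done
	return 1
}

if is_valid_transition queued blocked; then
	echo "queued -> blocked: allowed"
else
	echo "queued -> blocked: rejected"
fi
```

Removing the "queued:blocked" line from the array makes the same call print the rejected branch, which is the pre-fix behaviour the dedup guard ran into.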

Impact

Verification

  • ShellCheck: zero warnings on all 3 changed files
  • rg 'cmd_transition.*--reason' returns zero matches
  • queued:blocked now in VALID_TRANSITIONS array
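
The second check can also be reproduced without ripgrep. The sketch below assumes a grep that supports -r (GNU and BSD grep do); the helper name no_reason_flag_left is hypothetical, and the directory argument is whichever checkout contains the scripts.

```shell
#!/usr/bin/env bash
# Hypothetical helper: returns 0 when no cmd_transition call under the
# given directory still passes --reason, 1 when one does, 2 on grep errors.
no_reason_flag_left() {
	local dir="$1" status=0
	grep -rqE -e 'cmd_transition.*--reason' "$dir" || status=$?
	case "$status" in
	0) return 1 ;; # a match was found: the invalid flag is still in use
	1) return 0 ;; # no match: clean
	*) return 2 ;; # grep itself failed (for example, unreadable path)
	esac
}
```

A possible invocation would be `no_reason_flag_left .agents/scripts && echo clean`, run from the repository root.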

Summary by CodeRabbit

Release Notes

  • Bug Fixes
    • Fixed task failure handling: tasks are now properly cancelled after reaching maximum consecutive failures instead of being blocked
    • Fixed task queue state transition logic to prevent invalid state combinations and improve system stability
    • Refined error reporting mechanism to accurately track and communicate failure context across task transitions

- Dedup guard: use cancelled instead of blocked for queued tasks (queued->blocked was invalid)
- Phase 0.6/7b: change --reason to --error (--reason is not a valid cmd_transition option)
- Add queued:blocked to VALID_TRANSITIONS as defense-in-depth for future code paths
@gemini-code-assist

Summary of Changes

Hello @marcusquinn, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses critical bugs within the supervisor pulse mechanism that were causing it to prematurely terminate. By rectifying invalid state transitions, correcting command-line options, and enhancing state validation, the changes ensure the supervisor pulse now reliably completes all its operational phases, leading to consistent health issue updates, proper PR lifecycle processing, and uninterrupted task dispatching.

Highlights

  • Dedup guard state transition fix: Corrected an invalid state transition from 'queued' to 'blocked' in the dispatch dedup guard by changing it to 'queued' to 'cancelled', preventing silent failures and ensuring proper task handling.
  • cmd_transition option correction: Replaced the invalid '--reason' option with the correct '--error' option in 'cmd_transition' calls within reconciliation phases, resolving errors caused by unknown option handling.
  • Added 'queued:blocked' to valid transitions: Included 'queued:blocked' in the 'VALID_TRANSITIONS' array as a defense-in-depth measure, acknowledging its semantic validity for future use cases despite the immediate fix using 'cancelled'.

Changelog

  • .agents/scripts/supervisor-helper.sh
    • Added "queued:blocked" to the VALID_TRANSITIONS array.
  • .agents/scripts/supervisor/dispatch.sh
    • Updated the dedup guard logic to transition tasks from 'queued' to 'cancelled' instead of 'blocked' when consecutive failures occur.
    • Modified log messages, task notifications, and failure pattern storage to reflect the 'cancelled' state.
  • .agents/scripts/supervisor/todo-sync.sh
    • Replaced the '--reason' option with '--error' in four instances of 'cmd_transition' calls within 'cmd_reconcile_db_todo' and 'cmd_reconcile_queue_dispatchability' functions.

@coderabbitai (bot, Contributor) commented Feb 23, 2026

Walkthrough

Three supervisor scripts updated to refine state machine transitions and error reporting. Added queued:blocked transition, changed dispatch to cancel instead of block on max consecutive failures, and replaced --reason with --error flags for error reporting across multiple transition calls.

Changes

| Cohort / File(s) | Summary |
|---|---|
| State Transitions: .agents/scripts/supervisor-helper.sh | Added "queued:blocked" to VALID_TRANSITIONS array, expanding permitted state transitions. |
| Dispatch Deduplication Logic: .agents/scripts/supervisor/dispatch.sh | Modified max-failure handling to transition tasks to CANCELLED instead of BLOCKED, with comment clarifying that queued->blocked is invalid and manual re-queuing remains possible. |
| Error Reporting Flags: .agents/scripts/supervisor/todo-sync.sh | Replaced --reason CLI flag with --error flag across Phase 7b, Phase 0.6 queue reconciliation, and phantom/dispatch transition checks for consistent error categorization. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

Poem

🔄 States flow with grace, from queued to blocked at last,
But cancelled now when failures strike too fast,
Error flags rename, --reason becomes --error true,
The supervisor dances—old transitions made anew! 🎭

🚥 Pre-merge checks | ✅ 3 passed

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit's high-level summary is enabled. |
| Title check | ✅ Passed | The title 'fix: resolve three supervisor pulse-blocking bugs' directly and accurately summarizes the main change: fixing three critical bugs in supervisor scripts that were causing pulse failures. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00%, which is sufficient. The required threshold is 80.00%. |


@github-actions

🔍 Code Quality Report

[MONITOR] Code Review Monitoring Report

[INFO] Latest Quality Status:
SonarCloud: 0 bugs, 0 vulnerabilities, 40 code smells

[INFO] Recent monitoring activity:
Mon Feb 23 01:35:09 UTC 2026: Code review monitoring started
Mon Feb 23 01:35:09 UTC 2026: SonarCloud - Bugs: 0, Vulnerabilities: 0, Code Smells: 40

📈 Current Quality Metrics

  • BUGS: 0
  • CODE SMELLS: 40
  • VULNERABILITIES: 0

Generated on: Mon Feb 23 01:35:12 UTC 2026


Generated by AI DevOps Framework Code Review Monitoring

@marcusquinn marcusquinn merged commit d81009d into main Feb 23, 2026
10 of 11 checks passed
@marcusquinn marcusquinn deleted the bugfix/supervisor-pulse-blockers branch February 23, 2026 01:37

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.agents/scripts/supervisor/dispatch.sh:
- Around line 1531-1540: The TODO sync call uses update_todo_on_blocked after
transitioning the task to "cancelled", causing TODO.md to show BLOCKED instead
of CANCELLED; replace the call to update_todo_on_blocked with
update_todo_on_cancelled (referencing cmd_transition, update_todo_on_cancelled,
send_task_notification, store_failure_pattern, and the local block_reason
variable) so the TODO state matches the DB, and update or remove the stale
comment about "queued->blocked is not a valid transition" to reflect the current
VALID_TRANSITIONS or explain why cancellation is still chosen here.

Comment on lines +1531 to +1540
+	# Rule 2: Cancel after max_consecutive identical failures
+	# Note: queued->blocked is not a valid transition; use cancelled instead.
+	# The task can be manually re-queued after investigation.
 	if [[ "$consecutive_count" -ge "$max_consecutive" ]]; then
 		local block_reason="Dispatch dedup guard: $consecutive_count consecutive identical failures (error: ${last_error:-unknown}) — manual intervention required (t1206)"
-		log_warn " $task_id: BLOCKED by dedup guard — $consecutive_count consecutive identical failures with error '${last_error:-unknown}'"
-		cmd_transition "$task_id" "blocked" --error "$block_reason" 2>/dev/null || true
+		log_warn " $task_id: CANCELLED by dedup guard — $consecutive_count consecutive identical failures with error '${last_error:-unknown}'"
+		cmd_transition "$task_id" "cancelled" --error "$block_reason" 2>/dev/null || true
 		update_todo_on_blocked "$task_id" "$block_reason" 2>/dev/null || true
-		send_task_notification "$task_id" "blocked" "$block_reason" 2>/dev/null || true
-		store_failure_pattern "$task_id" "blocked" "$block_reason" "dispatch-dedup-guard" 2>/dev/null || true
+		send_task_notification "$task_id" "cancelled" "$block_reason" 2>/dev/null || true
+		store_failure_pattern "$task_id" "cancelled" "$block_reason" "dispatch-dedup-guard" 2>/dev/null || true

⚠️ Potential issue | 🟠 Major

Use cancellation-specific TODO updates to avoid state drift.

The task now transitions to cancelled, but the TODO sync still uses update_todo_on_blocked, which will annotate BLOCKED in TODO.md and diverge from the DB state. Please switch to update_todo_on_cancelled. Also, the note about queued->blocked being invalid is now stale with the new VALID_TRANSITIONS entry—either update it or clarify why cancellation is still preferred here.

🛠️ Suggested fix
-		update_todo_on_blocked "$task_id" "$block_reason" 2>/dev/null || true
+		update_todo_on_cancelled "$task_id" "$block_reason" 2>/dev/null || true

Suggested change (replacement for lines 1531 to 1540)
	# Rule 2: Cancel after max_consecutive identical failures
	# Note: queued->blocked is not a valid transition; use cancelled instead.
	# The task can be manually re-queued after investigation.
	if [[ "$consecutive_count" -ge "$max_consecutive" ]]; then
		local block_reason="Dispatch dedup guard: $consecutive_count consecutive identical failures (error: ${last_error:-unknown}) — manual intervention required (t1206)"
		log_warn " $task_id: CANCELLED by dedup guard — $consecutive_count consecutive identical failures with error '${last_error:-unknown}'"
		cmd_transition "$task_id" "cancelled" --error "$block_reason" 2>/dev/null || true
		update_todo_on_cancelled "$task_id" "$block_reason" 2>/dev/null || true
		send_task_notification "$task_id" "cancelled" "$block_reason" 2>/dev/null || true
		store_failure_pattern "$task_id" "cancelled" "$block_reason" "dispatch-dedup-guard" 2>/dev/null || true

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

The pull request addresses critical pulse-blocking bugs by correcting invalid state transitions and fixing incorrect CLI options. However, there are some inconsistencies in the implementation of the dedup guard in dispatch.sh and several violations of the repository's shell scripting guidelines regarding variable declarations and error suppression. Specifically, the PR makes queued:blocked a valid transition in the state machine but then avoids using it in the dedup guard logic, opting for cancelled instead. Additionally, blanket error suppression with 2>/dev/null is used in several places, which violates Rule 1 of the repository guidelines, and variable declarations do not follow Rule 9.
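
The review does not quote Rule 9, but its suggested change splits `local block_reason` from the assignment, which suggests the rule concerns declaring and assigning on one line. The general hazard behind that convention (ShellCheck reports it as SC2155) is that `local var="$(cmd)"` yields the exit status of `local`, not of `cmd`. In the dedup guard the value is a plain string, so nothing is masked there; the split matters when the right-hand side is a command substitution. A minimal sketch with hypothetical function names:

```shell
#!/usr/bin/env bash
# Hypothetical command that always fails with status 3.
failing_lookup() { return 3; }

combined() {
	local value="$(failing_lookup)" # status of `local` (0) hides the 3
	echo "$?"
}

split() {
	local value
	value="$(failing_lookup)" # status of the substitution is preserved
	echo "$?"
}

combined # prints 0
split    # prints 3
```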

Comment on lines +1531 to 1541
+	# Rule 2: Cancel after max_consecutive identical failures
+	# Note: queued->blocked is not a valid transition; use cancelled instead.
+	# The task can be manually re-queued after investigation.
 	if [[ "$consecutive_count" -ge "$max_consecutive" ]]; then
 		local block_reason="Dispatch dedup guard: $consecutive_count consecutive identical failures (error: ${last_error:-unknown}) — manual intervention required (t1206)"
-		log_warn " $task_id: BLOCKED by dedup guard — $consecutive_count consecutive identical failures with error '${last_error:-unknown}'"
-		cmd_transition "$task_id" "blocked" --error "$block_reason" 2>/dev/null || true
+		log_warn " $task_id: CANCELLED by dedup guard — $consecutive_count consecutive identical failures with error '${last_error:-unknown}'"
+		cmd_transition "$task_id" "cancelled" --error "$block_reason" 2>/dev/null || true
 		update_todo_on_blocked "$task_id" "$block_reason" 2>/dev/null || true
-		send_task_notification "$task_id" "blocked" "$block_reason" 2>/dev/null || true
-		store_failure_pattern "$task_id" "blocked" "$block_reason" "dispatch-dedup-guard" 2>/dev/null || true
+		send_task_notification "$task_id" "cancelled" "$block_reason" 2>/dev/null || true
+		store_failure_pattern "$task_id" "cancelled" "$block_reason" "dispatch-dedup-guard" 2>/dev/null || true
 		return 1

medium

There is a logic inconsistency here. The PR adds queued:blocked to VALID_TRANSITIONS in supervisor-helper.sh (line 248), but this code switches to cancelled on the grounds that queued:blocked is invalid. Since the transition is now valid, blocked should be used, as it correctly reflects the 'manual intervention required' status mentioned in the logs. Furthermore, the blanket 2>/dev/null || true error suppression violates Rule 1 of the repository guidelines, which says to avoid 2>/dev/null for blanket suppression so that errors remain visible for debugging. Errors should be redirected to ${SUPERVISOR_LOG} instead.

Suggested change (replacement for lines 1531 to 1541)
	# Rule 2: Block after max_consecutive identical failures
	# Note: queued->blocked is now a valid transition in VALID_TRANSITIONS.
	# The task can be manually re-queued after investigation.
	if [[ "$consecutive_count" -ge "$max_consecutive" ]]; then
		local block_reason
		block_reason="Dispatch dedup guard: $consecutive_count consecutive identical failures (error: ${last_error:-unknown}) — manual intervention required (t1206)"
		log_warn " $task_id: BLOCKED by dedup guard — $consecutive_count consecutive identical failures with error '${last_error:-unknown}'"
		cmd_transition "$task_id" "blocked" --error "$block_reason" 2>>"${SUPERVISOR_LOG:-/dev/null}" || true
		update_todo_on_blocked "$task_id" "$block_reason" 2>>"${SUPERVISOR_LOG:-/dev/null}" || true
		send_task_notification "$task_id" "blocked" "$block_reason" 2>>"${SUPERVISOR_LOG:-/dev/null}" || true
		store_failure_pattern "$task_id" "blocked" "$block_reason" "dispatch-dedup-guard" 2>>"${SUPERVISOR_LOG:-/dev/null}" || true
		return 1
	fi

References
  1. Avoid using '2>/dev/null' for blanket suppression of command errors in shell scripts to ensure that authentication, syntax, or system issues remain visible for debugging.
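
A minimal sketch of the difference between the two redirections. SUPERVISOR_LOG is the variable named in the review; the failing helper flaky_step, its message, and the temporary log path are hypothetical.

```shell
#!/usr/bin/env bash
# Temporary file standing in for the supervisor log (assumption for this demo).
SUPERVISOR_LOG="$(mktemp)"

# Hypothetical step that fails and explains why on stderr.
flaky_step() {
	echo "auth token expired" >&2
	return 1
}

# Blanket suppression: execution continues, but the reason is lost.
flaky_step 2>/dev/null || true

# Log-preserving variant from the suggestion: execution still continues,
# and the message is appended to the log (or discarded if the variable is unset).
flaky_step 2>>"${SUPERVISOR_LOG:-/dev/null}" || true

cat "$SUPERVISOR_LOG" # shows a single "auth token expired" line
```

Both forms keep `|| true`, so neither aborts a script running under set -e; the only difference is whether the error text survives for later debugging.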
