fix: resolve three supervisor pulse-blocking bugs#2185
Conversation
- Dedup guard: use cancelled instead of blocked for queued tasks (queued->blocked was invalid) - Phase 0.6/7b: change --reason to --error (--reason is not a valid cmd_transition option) - Add queued:blocked to VALID_TRANSITIONS as defense-in-depth for future code paths
Summary of ChangesHello @marcusquinn, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request addresses critical bugs within the supervisor pulse mechanism that were causing it to prematurely terminate. By rectifying invalid state transitions, correcting command-line options, and enhancing state validation, the changes ensure the supervisor pulse now reliably completes all its operational phases, leading to consistent health issue updates, proper PR lifecycle processing, and uninterrupted task dispatching. Highlights
Changelog
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
WalkthroughThree supervisor scripts updated to refine state machine transitions and error reporting. Added Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 3✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
🔍 Code Quality Report�[0;35m[MONITOR]�[0m Code Review Monitoring Report �[0;34m[INFO]�[0m Latest Quality Status: �[0;34m[INFO]�[0m Recent monitoring activity: 📈 Current Quality Metrics
Generated on: Mon Feb 23 01:35:12 UTC 2026 Generated by AI DevOps Framework Code Review Monitoring |
|
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In @.agents/scripts/supervisor/dispatch.sh:
- Around line 1531-1540: The TODO sync call uses update_todo_on_blocked after
transitioning the task to "cancelled", causing TODO.md to show BLOCKED instead
of CANCELLED; replace the call to update_todo_on_blocked with
update_todo_on_cancelled (referencing cmd_transition, update_todo_on_cancelled,
send_task_notification, store_failure_pattern, and the local block_reason
variable) so the TODO state matches the DB, and update or remove the stale
comment about "queued->blocked is not a valid transition" to reflect the current
VALID_TRANSITIONS or explain why cancellation is still chosen here.
| # Rule 2: Cancel after max_consecutive identical failures | ||
| # Note: queued->blocked is not a valid transition; use cancelled instead. | ||
| # The task can be manually re-queued after investigation. | ||
| if [[ "$consecutive_count" -ge "$max_consecutive" ]]; then | ||
| local block_reason="Dispatch dedup guard: $consecutive_count consecutive identical failures (error: ${last_error:-unknown}) — manual intervention required (t1206)" | ||
| log_warn " $task_id: BLOCKED by dedup guard — $consecutive_count consecutive identical failures with error '${last_error:-unknown}'" | ||
| cmd_transition "$task_id" "blocked" --error "$block_reason" 2>/dev/null || true | ||
| log_warn " $task_id: CANCELLED by dedup guard — $consecutive_count consecutive identical failures with error '${last_error:-unknown}'" | ||
| cmd_transition "$task_id" "cancelled" --error "$block_reason" 2>/dev/null || true | ||
| update_todo_on_blocked "$task_id" "$block_reason" 2>/dev/null || true | ||
| send_task_notification "$task_id" "blocked" "$block_reason" 2>/dev/null || true | ||
| store_failure_pattern "$task_id" "blocked" "$block_reason" "dispatch-dedup-guard" 2>/dev/null || true | ||
| send_task_notification "$task_id" "cancelled" "$block_reason" 2>/dev/null || true | ||
| store_failure_pattern "$task_id" "cancelled" "$block_reason" "dispatch-dedup-guard" 2>/dev/null || true |
There was a problem hiding this comment.
Use cancellation-specific TODO updates to avoid state drift.
The task now transitions to cancelled, but the TODO sync still uses update_todo_on_blocked, which will annotate BLOCKED in TODO.md and diverge from the DB state. Please switch to update_todo_on_cancelled. Also, the note about queued->blocked being invalid is now stale with the new VALID_TRANSITIONS entry—either update it or clarify why cancellation is still preferred here.
🛠️ Suggested fix
- update_todo_on_blocked "$task_id" "$block_reason" 2>/dev/null || true
+ update_todo_on_cancelled "$task_id" "$block_reason" 2>/dev/null || true📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| # Rule 2: Cancel after max_consecutive identical failures | |
| # Note: queued->blocked is not a valid transition; use cancelled instead. | |
| # The task can be manually re-queued after investigation. | |
| if [[ "$consecutive_count" -ge "$max_consecutive" ]]; then | |
| local block_reason="Dispatch dedup guard: $consecutive_count consecutive identical failures (error: ${last_error:-unknown}) — manual intervention required (t1206)" | |
| log_warn " $task_id: BLOCKED by dedup guard — $consecutive_count consecutive identical failures with error '${last_error:-unknown}'" | |
| cmd_transition "$task_id" "blocked" --error "$block_reason" 2>/dev/null || true | |
| log_warn " $task_id: CANCELLED by dedup guard — $consecutive_count consecutive identical failures with error '${last_error:-unknown}'" | |
| cmd_transition "$task_id" "cancelled" --error "$block_reason" 2>/dev/null || true | |
| update_todo_on_blocked "$task_id" "$block_reason" 2>/dev/null || true | |
| send_task_notification "$task_id" "blocked" "$block_reason" 2>/dev/null || true | |
| store_failure_pattern "$task_id" "blocked" "$block_reason" "dispatch-dedup-guard" 2>/dev/null || true | |
| send_task_notification "$task_id" "cancelled" "$block_reason" 2>/dev/null || true | |
| store_failure_pattern "$task_id" "cancelled" "$block_reason" "dispatch-dedup-guard" 2>/dev/null || true | |
| # Rule 2: Cancel after max_consecutive identical failures | |
| # Note: queued->blocked is not a valid transition; use cancelled instead. | |
| # The task can be manually re-queued after investigation. | |
| if [[ "$consecutive_count" -ge "$max_consecutive" ]]; then | |
| local block_reason="Dispatch dedup guard: $consecutive_count consecutive identical failures (error: ${last_error:-unknown}) — manual intervention required (t1206)" | |
| log_warn " $task_id: CANCELLED by dedup guard — $consecutive_count consecutive identical failures with error '${last_error:-unknown}'" | |
| cmd_transition "$task_id" "cancelled" --error "$block_reason" 2>/dev/null || true | |
| update_todo_on_cancelled "$task_id" "$block_reason" 2>/dev/null || true | |
| send_task_notification "$task_id" "cancelled" "$block_reason" 2>/dev/null || true | |
| store_failure_pattern "$task_id" "cancelled" "$block_reason" "dispatch-dedup-guard" 2>/dev/null || true |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In @.agents/scripts/supervisor/dispatch.sh around lines 1531 - 1540, The TODO
sync call uses update_todo_on_blocked after transitioning the task to
"cancelled", causing TODO.md to show BLOCKED instead of CANCELLED; replace the
call to update_todo_on_blocked with update_todo_on_cancelled (referencing
cmd_transition, update_todo_on_cancelled, send_task_notification,
store_failure_pattern, and the local block_reason variable) so the TODO state
matches the DB, and update or remove the stale comment about "queued->blocked is
not a valid transition" to reflect the current VALID_TRANSITIONS or explain why
cancellation is still chosen here.
There was a problem hiding this comment.
Code Review
The pull request addresses critical pulse-blocking bugs by correcting invalid state transitions and fixing incorrect CLI options. However, there are some inconsistencies in the implementation of the dedup guard in dispatch.sh and several violations of the repository's shell scripting guidelines regarding variable declarations and error suppression. Specifically, the PR makes queued:blocked a valid transition in the state machine but then avoids using it in the dedup guard logic, opting for cancelled instead. Additionally, blanket error suppression with 2>/dev/null is used in several places, which violates Rule 1 of the repository guidelines, and variable declarations do not follow Rule 9.
| # Rule 2: Cancel after max_consecutive identical failures | ||
| # Note: queued->blocked is not a valid transition; use cancelled instead. | ||
| # The task can be manually re-queued after investigation. | ||
| if [[ "$consecutive_count" -ge "$max_consecutive" ]]; then | ||
| local block_reason="Dispatch dedup guard: $consecutive_count consecutive identical failures (error: ${last_error:-unknown}) — manual intervention required (t1206)" | ||
| log_warn " $task_id: BLOCKED by dedup guard — $consecutive_count consecutive identical failures with error '${last_error:-unknown}'" | ||
| cmd_transition "$task_id" "blocked" --error "$block_reason" 2>/dev/null || true | ||
| log_warn " $task_id: CANCELLED by dedup guard — $consecutive_count consecutive identical failures with error '${last_error:-unknown}'" | ||
| cmd_transition "$task_id" "cancelled" --error "$block_reason" 2>/dev/null || true | ||
| update_todo_on_blocked "$task_id" "$block_reason" 2>/dev/null || true | ||
| send_task_notification "$task_id" "blocked" "$block_reason" 2>/dev/null || true | ||
| store_failure_pattern "$task_id" "blocked" "$block_reason" "dispatch-dedup-guard" 2>/dev/null || true | ||
| send_task_notification "$task_id" "cancelled" "$block_reason" 2>/dev/null || true | ||
| store_failure_pattern "$task_id" "cancelled" "$block_reason" "dispatch-dedup-guard" 2>/dev/null || true | ||
| return 1 |
There was a problem hiding this comment.
There is a logic inconsistency here. The PR adds queued:blocked to VALID_TRANSITIONS in supervisor-helper.sh (line 248), but this code switches to cancelled because it claims queued:blocked is invalid. Since the transition is now valid, blocked should be used as it correctly reflects the 'manual intervention required' status mentioned in the logs. Furthermore, the use of 2>/dev/null || true for blanket error suppression violates the repository guidelines (Rule 1), which states that 2>/dev/null should be avoided for blanket suppression to ensure errors remain visible for debugging. Errors should be redirected to ${SUPERVISOR_LOG} to maintain visibility for debugging.
| # Rule 2: Cancel after max_consecutive identical failures | |
| # Note: queued->blocked is not a valid transition; use cancelled instead. | |
| # The task can be manually re-queued after investigation. | |
| if [[ "$consecutive_count" -ge "$max_consecutive" ]]; then | |
| local block_reason="Dispatch dedup guard: $consecutive_count consecutive identical failures (error: ${last_error:-unknown}) — manual intervention required (t1206)" | |
| log_warn " $task_id: BLOCKED by dedup guard — $consecutive_count consecutive identical failures with error '${last_error:-unknown}'" | |
| cmd_transition "$task_id" "blocked" --error "$block_reason" 2>/dev/null || true | |
| log_warn " $task_id: CANCELLED by dedup guard — $consecutive_count consecutive identical failures with error '${last_error:-unknown}'" | |
| cmd_transition "$task_id" "cancelled" --error "$block_reason" 2>/dev/null || true | |
| update_todo_on_blocked "$task_id" "$block_reason" 2>/dev/null || true | |
| send_task_notification "$task_id" "blocked" "$block_reason" 2>/dev/null || true | |
| store_failure_pattern "$task_id" "blocked" "$block_reason" "dispatch-dedup-guard" 2>/dev/null || true | |
| send_task_notification "$task_id" "cancelled" "$block_reason" 2>/dev/null || true | |
| store_failure_pattern "$task_id" "cancelled" "$block_reason" "dispatch-dedup-guard" 2>/dev/null || true | |
| return 1 | |
| # Rule 2: Block after max_consecutive identical failures | |
| # Note: queued->blocked is now a valid transition in VALID_TRANSITIONS. | |
| # The task can be manually re-queued after investigation. | |
| if [[ "$consecutive_count" -ge "$max_consecutive" ]]; then | |
| local block_reason | |
| block_reason="Dispatch dedup guard: $consecutive_count consecutive identical failures (error: ${last_error:-unknown}) — manual intervention required (t1206)" | |
| log_warn " $task_id: BLOCKED by dedup guard — $consecutive_count consecutive identical failures with error '${last_error:-unknown}'" | |
| cmd_transition "$task_id" "blocked" --error "$block_reason" 2>>"${SUPERVISOR_LOG:-/dev/null}" || true | |
| update_todo_on_blocked "$task_id" "$block_reason" 2>>"${SUPERVISOR_LOG:-/dev/null}" || true | |
| send_task_notification "$task_id" "blocked" "$block_reason" 2>>"${SUPERVISOR_LOG:-/dev/null}" || true | |
| store_failure_pattern "$task_id" "blocked" "$block_reason" "dispatch-dedup-guard" 2>>"${SUPERVISOR_LOG:-/dev/null}" || true | |
| return 1 | |
| fi |
References
- Avoid using '2>/dev/null' for blanket suppression of command errors in shell scripts to ensure that authentication, syntax, or system issues remain visible for debugging.



Summary
Fixes three bugs that caused the supervisor pulse to die before reaching Phase 3+ (PR lifecycle, health issue update), creating a loop where every 2-minute pulse would crash at the same point.
Bugs Fixed
1. Dedup guard invalid state transition (dispatch.sh)
The dedup guard tried
queued -> blockedwhich is not in VALID_TRANSITIONS. The transition silently failed, but downstream effects (TODO.md commit, mail send) caused the pulse to exit underset -e. Changed toqueued -> cancelledwhich is a valid transition.2.
--reasoninvalid option forcmd_transition(todo-sync.sh)Phase 0.6 (queue-dispatchability reconciliation) and Phase 7b (DB-TODO reconciliation) passed
--reasontocmd_transition, but the only valid option is--error. The unknown option handler returned 1, causing the|| { continue }error path. Fixed 4 occurrences.3.
queued:blockedadded to VALID_TRANSITIONS (supervisor-helper.sh)Defense-in-depth: even though the dedup guard now uses
cancelled,queued -> blockedis semantically valid (e.g., a queued task discovering unresolved dependencies). Added to prevent future code from hitting the same invalid transition error.Impact
Verification
rg 'cmd_transition.*--reason'returns zero matchesqueued:blockednow in VALID_TRANSITIONS arraySummary by CodeRabbit
Release Notes