t1248: Fix success rate metric — exclude cancelled tasks from failure count #1983

Merged
marcusquinn merged 1 commit into main from feature/t1248 on Feb 19, 2026

@marcusquinn
Owner

Investigation Findings

Root cause analysis of the drop in the 7-day success rate to 89%, against a 94% overall rate.

Hung Workers (Feb 12) — ALREADY FIXED

Tasks t302, t303, t311.2, t311.3 all timed out at ~1800s (30min default).

Root cause: These tasks had ~1h or ~3h estimates but the hung timeout was a fixed 1800s default that didn't read the estimate field. The workers were legitimately busy (large refactors of 14,644-line supervisor-helper.sh) but got killed as false-positive hangs.

Common characteristics: All were large shell script refactoring tasks (#refactor), dispatched at opus tier, on the aidevops repo. None were actually hung — they were doing real work.

Fixes already merged: t1199 (estimate-based timeout) and t1222 (graceful SIGTERM at 50% timeout).
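For illustration, an estimate-aware timeout could look roughly like the sketch below. This is not the t1199 implementation: the function name `estimate_to_timeout`, the accepted estimate formats, and the 2x headroom factor are all assumptions. The only value taken from the findings above is the 1800s default, which the sketch keeps as a floor and as the fallback for missing or unparseable estimates.

```shell
# Hypothetical helper (not from t1199): turn an estimate such as "~1h",
# "~3h", "45m" or "~1h30m" into a hung-worker timeout in seconds.
# The function body runs in a subshell so its variables do not leak.
estimate_to_timeout() (
    default_timeout=1800   # old fixed default, used as floor and fallback
    headroom=2             # assumed safety factor over the estimate
    est=${1#[~]}           # drop a leading "~" if present
    hours=0
    minutes=0

    case "$est" in
        *h*m) hours=${est%%h*}; minutes=${est#*h}; minutes=${minutes%m} ;;
        *h)   hours=${est%h} ;;
        *m)   minutes=${est%m} ;;
        *)    echo "$default_timeout"; exit 0 ;;
    esac

    # Fall back to the default if anything other than digits was captured.
    case "$hours$minutes" in
        '' | *[!0-9]*) echo "$default_timeout"; exit 0 ;;
    esac

    # awk reads the values as decimal numbers, so "08" is not parsed as octal.
    awk -v h="$hours" -v m="$minutes" -v k="$headroom" -v d="$default_timeout" \
        'BEGIN { s = (h * 3600 + m * 60) * k; if (s < d) s = d; print s }'
)
```

With these assumed parameters, `estimate_to_timeout "~1h"` prints 7200 and `estimate_to_timeout "~3h"` prints 21600, so a worker with a ~1h estimate would no longer be killed at 1800s.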

worker_never_started:no_sentinel (Feb 13) — FIX PENDING

12 failures at 14:40-14:56 UTC for tasks t1010, t1030, t1032.1, t1032.2.

Root cause: Concurrent dispatches used fixed-filename wrapper scripts (e.g., t1010-wrapper.sh). A second dispatch overwrote the script before the first wrapper process read it. The first wrapper executed the new script, writing WORKER_STARTED to a different log file, leaving the original log with only metadata (no sentinel → no_sentinel failure).

Model availability was healthy during the failure window (opencode cache_check: healthy, 32 models).

Fix: t1190 (PR #1981, open) — timestamped filenames prevent overwrite race, WRAPPER_STARTED sentinel added for sub-classification.
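The overwrite race and the sentinel sub-classification can be sketched as follows. This is a minimal illustration, not the code in PR #1981: the function names, the filename layout, and the `wrapper_started_only` label are hypothetical. The sketch also goes one step beyond a plain timestamp by letting `mktemp` append a random suffix, since two dispatches within the same second would otherwise still collide.

```shell
# Sketch only: naming scheme and classification labels are assumptions,
# not taken from PR #1981.

# make_wrapper_path TASK_ID DIR
# Creates an empty, uniquely named wrapper file and prints its path.
make_wrapper_path() (
    task_id=$1
    dir=$2
    stamp=$(date -u +%Y%m%dT%H%M%SZ)
    # mktemp replaces XXXXXX with random characters and creates the file
    # atomically, so two dispatches in the same second get different names.
    mktemp "$dir/${task_id}-wrapper-${stamp}-XXXXXX"
)

# classify_startup LOG_FILE
# Prints how far startup got, based on which sentinel lines the log holds.
classify_startup() (
    log_file=$1
    if grep -q 'WORKER_STARTED' "$log_file" 2>/dev/null; then
        echo "started"
    elif grep -q 'WRAPPER_STARTED' "$log_file" 2>/dev/null; then
        echo "wrapper_started_only"   # wrapper ran, worker never reported in
    else
        echo "no_sentinel"            # log holds metadata only
    fi
)
```

Because each dispatch writes to its own file, a second dispatch of the same task can no longer replace a script that the first wrapper process has not yet read.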

Metric Accuracy Issue — FIXED IN THIS PR

Root cause of the apparent 89% rate: The build_health_context() function in ai-context.sh included cancelled tasks in the failure count. Cancelled tasks are administrative cleanup (orphaned DB entries, superseded tasks, cross-repo misregistration) — not worker failures.

Actual numbers:

  • 7-day: 473 completed, 2 actually failed, 55 cancelled
  • True failure rate: 2/475 = 0.4% (not 11%)
  • Pattern tracker overall: 94% (977/1037) — counts retry attempts, not final status

Fix: Split failed and cancelled into separate metric rows. The success rate denominator is now completed + failed, so only tasks with status='failed' count against the rate; cancelled tasks are excluded.
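The effect on the headline number can be checked with the 7-day counts quoted above. The helper below is a hypothetical sketch (the name `success_rate` and the one-decimal formatting are assumptions, not code from ai-context.sh); it only reproduces the arithmetic.

```shell
# success_rate COMPLETED FAILED
# Prints completed / (completed + failed) as a percentage, one decimal place.
success_rate() {
    awk -v c="$1" -v f="$2" 'BEGIN {
        total = c + f
        if (total == 0) { print "n/a"; exit }
        printf "%.1f\n", 100 * c / total
    }'
}

# 7-day counts from the investigation findings.
completed_7d=473
failed_7d=2
cancelled_7d=55

# Old behaviour: cancelled tasks were lumped in with failures.
success_rate "$completed_7d" "$((failed_7d + cancelled_7d))"   # prints 89.2

# New behaviour: only status='failed' counts against the rate.
success_rate "$completed_7d" "$failed_7d"                      # prints 99.6
```

The first call matches the apparent 89% rate; the second corresponds to the 2/475 failure rate once cancellations are left out.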

Cancellation Breakdown (Feb 18-19)

  • 13 tasks: orphaned DB entries not in TODO.md
  • 8 tasks: superseded by feature/supervisor-self-healing branch
  • 12 tasks: cross-repo misregistration cleanup (t1237)
  • 5 tasks: pre-dispatch already-completed detection
  • 1 task: duplicate of another task

None of these are worker failures.

Ref #1944

Cancelled tasks are administrative cleanup actions (orphaned tasks, superseded
work, cross-repo misregistration cleanup) — not worker failures. Including them
in the failure count inflated the 7-day failure rate from <1% to 11%, causing
false alarms and masking the true worker reliability signal.

Root cause analysis (t1248):
- Feb 12: 5 hung workers (t302, t303, t311.2x2, t311.3) — all hit the 1800s
  default timeout because their ~1h estimates weren't parsed. Fixed by t1199
  (estimate-based timeout) and t1222 (graceful SIGTERM at 50% timeout).
- Feb 13: 12 no_sentinel failures (t1010, t1030, t1032.1x4, t1032.2x2) at
  14:40-14:56 UTC — caused by a race condition where concurrent dispatches
  overwrote fixed-filename wrapper scripts. Fixed by t1190 (timestamped
  filenames, WRAPPER_STARTED sentinel) — PR #1981 pending merge.
- Feb 18-19: 67 cancellations (orphaned DB tasks, superseded tasks, cross-repo
  misregistration cleanup via t1237) — these are intentional, not failures.

Fix: split 'failed' and 'cancelled' into separate metrics. The success rate
denominator is now completed + failed, so only status='failed' counts against
the rate. Cancelled tasks get their own row for visibility without distorting it.

True 7-day worker failure rate: 2/475 = 0.4% (not 11%)
Pattern tracker overall rate: 94% (977/1037) — unchanged, counts retries

@gemini-code-assist

Summary of Changes

Hello @marcusquinn, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request refines the task health metrics by correcting how task failures and cancellations are reported. The primary goal is to provide a more precise success rate by excluding administratively cancelled tasks from the failure count, which previously misrepresented the system's operational health. This change ensures that the reported metrics accurately reflect genuine worker performance issues versus intentional administrative cleanups.

Highlights

  • Metric Accuracy Fix: The success rate metric now accurately distinguishes between actual task failures and administrative cancellations. Previously, cancelled tasks were incorrectly included in the failure count, leading to an inflated failure rate.
  • Separate Cancellation Metric: A new metric has been introduced to track the number of cancelled tasks separately, providing clearer insight into administrative actions without skewing the failure rate.
  • Updated Success Rate Calculation: The calculation for the 7-day success rate has been adjusted to exclude cancelled tasks from the denominator, ensuring it reflects only completed and genuinely failed tasks.

Changelog

  • .agents/scripts/supervisor/ai-context.sh
    • Modified the failed_7d query to count only tasks with a 'failed' status, excluding 'cancelled' tasks.
    • Introduced a new cancelled_7d variable and corresponding database query to count tasks with a 'cancelled' status.
    • Updated the total_7d calculation to correctly sum completed_7d and the newly defined failed_7d.
    • Added a new output line to display the 'Cancelled (7d)' metric in the health context report.
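The changelog items above can be pictured with a small stand-in. The real script queries the supervisor database, whose schema is not shown in this PR, so the sketch below tallies statuses from a plain text file of `task_id status` rows instead. The variable names `completed_7d`, `failed_7d`, `cancelled_7d` and `total_7d` and the 'Cancelled (7d)' label come from the changelog; the function names, the file format, the `completed` status value, and the other row labels are assumptions.

```shell
# count_status FILE STATUS
# Stand-in for the database query: counts rows whose second field is STATUS.
count_status() {
    awk -v want="$2" '$2 == want { n++ } END { print n + 0 }' "$1"
}

# health_rows FILE
# Prints one row per metric. Runs in a subshell so variables stay local.
health_rows() (
    tasks_file=$1
    completed_7d=$(count_status "$tasks_file" completed)
    failed_7d=$(count_status "$tasks_file" failed)        # 'failed' only
    cancelled_7d=$(count_status "$tasks_file" cancelled)  # tracked separately
    total_7d=$((completed_7d + failed_7d))                # cancelled left out

    echo "Completed (7d): $completed_7d"
    echo "Failed (7d): $failed_7d"
    echo "Cancelled (7d): $cancelled_7d"
    if [ "$total_7d" -gt 0 ]; then
        echo "Success rate (7d): $((completed_7d * 100 / total_7d))%"
    fi
)
```

Keeping `cancelled_7d` as its own row preserves visibility of administrative cleanup without letting it enter `total_7d`.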

Activity

  • No human activity has been recorded on this pull request yet.

@github-actions

🔍 Code Quality Report

[MONITOR] Code Review Monitoring Report

[INFO] Latest Quality Status:
SonarCloud: 0 bugs, 0 vulnerabilities, 30 code smells

[INFO] Recent monitoring activity:
Thu Feb 19 23:10:09 UTC 2026: Code review monitoring started
Thu Feb 19 23:10:09 UTC 2026: SonarCloud - Bugs: 0, Vulnerabilities: 0, Code Smells: 30

📈 Current Quality Metrics

  • BUGS: 0
  • CODE SMELLS: 30
  • VULNERABILITIES: 0

Generated on: Thu Feb 19 23:10:11 UTC 2026


Generated by AI DevOps Framework Code Review Monitoring

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

The pull request effectively addresses the issue of inflated failure rates by correctly distinguishing between 'failed' and 'cancelled' tasks. The changes accurately reflect the true success rate by excluding administrative cancellations from the failure count and introducing a separate metric for cancelled tasks. The code is clear and follows the established shell scripting patterns for database queries and variable handling.

Comment on lines +847 to +849
# Tasks failed in last 7 days (t1248: exclude cancelled — cancelled tasks are
# administrative cleanup, not worker failures; including them inflates the failure
# rate and causes false alarms. Cancelled tasks have their own metric row.)

medium

The comment correctly explains the rationale, but it is verbose. Condensing it improves readability and aligns with the rule for providing a single, clear justification.

Suggested change
- # Tasks failed in last 7 days (t1248: exclude cancelled — cancelled tasks are
- # administrative cleanup, not worker failures; including them inflates the failure
- # rate and causes false alarms. Cancelled tasks have their own metric row.)
+ # Tasks failed in last 7 days (t1248: exclude cancelled tasks from failure count;
+ # cancelled tasks are administrative cleanup, not worker failures.)
References
  1. Ensure comments provide a single, clear justification for a design choice, avoiding the presentation of multiple, potentially conflicting rationales.

Comment on lines +857 to +858
# Tasks cancelled in last 7 days (separate from failures — cancellations are
# intentional administrative actions: orphaned tasks, superseded work, cleanup)

medium

The comment for cancelled_7d is verbose. Shortening it while retaining clarity aligns with the rule for providing a single, clear justification.

Suggested change
- # Tasks cancelled in last 7 days (separate from failures — cancellations are
- # intentional administrative actions: orphaned tasks, superseded work, cleanup)
+ # Tasks cancelled in last 7 days (administrative actions, not worker failures)
References
  1. Ensure comments provide a single, clear justification for a design choice, avoiding the presentation of multiple, potentially conflicting rationales.

@marcusquinn marcusquinn merged commit 8b36325 into main Feb 19, 2026
19 checks passed
@marcusquinn marcusquinn deleted the feature/t1248 branch February 19, 2026 23:33