Performance Summary
- Agents analyzed: 16 (from 31 total runs sampled, past 2 days)
- Total tokens (sample): ~165M (includes Codex high-parallelism runs)
- Total cost (today): ~$5.94 | yesterday: ~$6.14
- Average quality score: 86/100 (↓ 3 from 89)
- Average effectiveness score: 87/100 (↓ 1 from 88)
- Top performers: The Great Escapi, Contribution Check, Daily Safe Outputs Conformance Checker
- Needs attention: AI Moderator (missing tool regression), Chroma Issue Indexer (extreme token usage), Semantic Function Refactoring (elevated cost)
Critical Findings
❌ P0 Ongoing: Lockdown Token Failures (3+ weeks)
4 workflows remain locked out — Issue Monster, PR Triage Agent, Daily Issues Report, Org Health Report. All fix paths closed (#17414, #17807 both rejected as "not_planned"). Manual repo admin intervention required. These failures continue to skew ecosystem quality metrics.
AI Moderator: 1 of 3 runs today (run §22453521501) reported the GitHub MCP tool (read issue/comment content) as missing, identical to the Docker MCP intermittency pattern last seen 2026-02-24, which was believed resolved by switching to `mode: remote`. With `mode: remote` now also showing intermittency, the root cause may be upstream GitHub MCP availability rather than anything Docker-specific. The other 2 runs succeeded but had very low turn counts (1–2 turns), which may indicate noop runs rather than full processing.
Chroma Issue Indexer: today's run consumed 3.6M tokens in 10.5 minutes with 102 blocked firewall requests, the highest blocked count of any workflow today. If the issue index is growing, this trend will worsen. The 47% firewall block rate across the ecosystem (439/926 requests blocked) has this workflow and Semantic Function Refactoring (72 blocked) as its two largest contributors.
View Detailed Quality Analysis
Agent Quality Scores (Today)
| Agent | Engine | Quality | Duration | Tokens | Cost | Notes |
|---|---|---|---|---|---|---|
| The Great Escapi | copilot | 94/100 | 3.5m | 74k | — | Ultra-efficient |
| Contribution Check | copilot | 93/100 | 2.8m | 181k | — | Fast, clean |
| Daily Safe Outputs Conformance Checker | claude | 92/100 | 3.1m | 134k | $0.33 | Efficient |
| Auto-Triage Issues | copilot | 90/100 | 3.5m | 136k | — | Success |
| Agent Container Smoke Test | copilot | 90/100 | 4.4m | 174k | — | Clean |
| Smoke Copilot | copilot | 90/100 | 6.7m | — | — | 49 turns, passing |
| Smoke Claude | claude | 87/100 | 12.9m | 991k | $1.47 | 42 turns, long |
| Lockfile Statistics Analysis Agent | claude | 87/100 | 5.0m | 456k | $0.82 | 14 turns, normal |
| AI Moderator (×3) | codex | 82/100 | 7.5–8.9m | 210–372k | — | 1/3 missing tool |
| Scout | claude | 80/100 | 4.9m | 613k | $0.81 | 19 turns |
| Smoke Codex | codex | 80/100 | 6.8m | 32M | — | 17 turns, Codex tokens |
| Slide Deck Maintainer | copilot | 78/100 | 6.7m | 1.5M | — | High tokens |
| Changeset Generator | codex | 75/100 | 8.2m | 123M | — | Codex parallelism |
| Semantic Function Refactoring | claude | 72/100 | 9.1m | 295k | $3.97 | High cost, 12 turns |
| Chroma Issue Indexer | copilot | 68/100 | 10.5m | 3.6M | — | Extreme tokens |
Cancelled Runs Analysis
14 runs were cancelled in a batch (runs 22450833xxx–22450834xxx). This is expected behavior from a Release workflow trigger — these represent staggered workflow starts that were cancelled before the new release artifacts were ready. Not a quality issue.
View Effectiveness Metrics
Task Completion Rates (Sampled Agent Runs)
- High completion (>80%): 13/15 agent workflows (87%)
- Partial/Degraded: AI Moderator (1/3 runs degraded), Chroma Issue Indexer (functional but inefficient)
- Infrastructure failures (not quality): Issue Monster, PR Triage Agent, Daily Issues Report, Org Health Report (lockdown)
Cost Efficiency Trends
| Agent | Today | Yesterday | Δ |
|---|---|---|---|
| Semantic Function Refactoring | $3.97 | $4.82 | ↓ $0.85 ✅ |
| Scout | $0.81 | — | New data point |
| Daily Safe Outputs Conformance Checker | $0.33 | — | Consistent |
| Lockfile Statistics Analysis Agent | $0.82 | — | Consistent |
| Smoke Claude | $1.47 | — | Long duration |
| Total (metered) | $5.94 | $6.14 | ↓ $0.20 ✅ |
Firewall Request Analysis
Total 926 requests across all workflows: 487 allowed (53%), 439 blocked (47%).
Top blocked workflows:
- Chroma Issue Indexer: 102 blocked — likely local socket connections (Serena MCP pattern)
- Semantic Function Refactoring: 72 blocked, consistent with the "-" domain pattern
- Changeset Generator: 61 blocked, Codex parallelism reaching out broadly
- Slide Deck Maintainer: 43 blocked — investigating
- Smoke Codex: 38 blocked — expected for engine behavior
The "-" domain appearing in blocked list is a known Serena MCP local socket artifact (see issue #18388).
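The firewall percentages above can be re-derived from the raw counts. The following is a minimal sketch, using only the figures reported in this section; the variable names and dictionary layout are illustrative assumptions, not part of any workflow tooling.

```python
# Counts copied from the report; names and structure are illustrative only.
TOTAL_REQUESTS = 926
ALLOWED = 487
BLOCKED = 439

blocked_by_workflow = {
    "Chroma Issue Indexer": 102,
    "Semantic Function Refactoring": 72,
    "Changeset Generator": 61,
    "Slide Deck Maintainer": 43,
    "Smoke Codex": 38,
}


def percent(part: int, whole: int) -> float:
    """Return part/whole as a percentage."""
    return 100.0 * part / whole


# Ecosystem-wide block rate.
block_rate = percent(BLOCKED, TOTAL_REQUESTS)

# How much of the blocked traffic the five listed workflows account for.
top_five_blocked = sum(blocked_by_workflow.values())
top_five_share = percent(top_five_blocked, BLOCKED)

if __name__ == "__main__":
    print(f"Block rate: {block_rate:.1f}%")
    print(f"Top five workflows: {top_five_blocked} blocked ({top_five_share:.1f}% of all blocked)")
    for name, count in blocked_by_workflow.items():
        print(f"  {name}: {percent(count, BLOCKED):.1f}% of blocked")
```

Tracking each workflow's share of blocked requests, rather than only the raw count, makes it easier to see whether a single workflow is responsible for a shift in the overall block rate.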
View Behavioral Patterns
Productive Patterns ✅
- Release → Smoke cancellation → Re-run: Expected orchestration behavior, not a failure
- Daily Safe Outputs Conformance Checker: Continues to be highly efficient (3 turns, $0.33)
- The Great Escapi: Maintaining minimal footprint, high reliability across 2+ weeks
Problematic Patterns ⚠️
- AI Moderator GitHub MCP intermittency: 3rd occurrence of the missing tool issue. Pattern: `mode: remote` was supposed to fix this (2026-02-24), but 1/3 runs today was missing GitHub MCP again. Silent failures: the moderation trigger runs but does nothing. Impact: ~33% of moderation events missed.
- Semantic Function Refactoring high cost: 12th consecutive day of elevated cost. Despite a slight improvement ($4.82→$3.97), it is still about 12× the cheapest metered claude workflow ($0.33) and roughly 3–5× the others ($0.81–$1.47). Root cause under investigation via issue #18388 ([refactor] Semantic Function Clustering Analysis: Misplaced Functions and Duplicate Patterns in pkg/workflow).
- Chroma Issue Indexer token growth: 3.6M tokens is abnormally high for an issue indexer. If the issue backlog is growing, this will continue to scale up linearly. No issue yet created.
- Codex extreme token counts: Changeset Generator (123M) and Smoke Codex (32M) show Codex engine's parallel-context behavior. Not quality issues but skew overall token metrics significantly.
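To illustrate how strongly the two Codex runs skew the token total, here is a small sketch using the reported figures. It assumes both runs are included in the ~165M sample total, which the summary implies but does not state outright; all names are illustrative.

```python
# Approximate figures from the report; names are illustrative only.
SAMPLE_TOTAL_TOKENS = 165_000_000  # "~165M" from the summary

codex_heavy_runs = {
    "Changeset Generator": 123_000_000,
    "Smoke Codex": 32_000_000,
}

# Tokens attributable to the two Codex high-parallelism runs.
codex_tokens = sum(codex_heavy_runs.values())
codex_share_pct = 100 * codex_tokens / SAMPLE_TOTAL_TOKENS

# What is left for every other sampled run combined.
remaining_tokens = SAMPLE_TOTAL_TOKENS - codex_tokens

if __name__ == "__main__":
    print(f"Codex-heavy runs: {codex_tokens / 1e6:.0f}M tokens ({codex_share_pct:.0f}% of sample)")
    print(f"All other runs:   {remaining_tokens / 1e6:.0f}M tokens")
```

Reporting the total both with and without these runs would keep the headline token figure comparable from day to day.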
Ecosystem Coverage Assessment
- ✅ Security: The Great Escapi active and efficient
- ✅ Code quality: Smoke tests (Copilot/Claude/Codex) passing on main
- ✅ Documentation: Slide Deck Maintainer running (high tokens, worth monitoring)
- ✅ Release: Workflow completed successfully today
- ⚠️ Issue triage: AI Moderator intermittent (33% miss rate today)
- ❌ Issue monitoring: Issue Monster, Daily Issues Report locked out
Recommendations
High Priority
- Investigate AI Moderator GitHub MCP reliability: 3rd incident in a week
  - The 1/3 miss rate today suggests `mode: remote` is not a reliable fix
  - Consider: adding retry logic, falling back to `mode: local` if remote is unavailable, or alerting on noop runs
  - Affected run: §22453521501
- Chroma Issue Indexer token usage investigation: 3.6M tokens is a new high
  - Determine if issue backlog growth is expected or indicates runaway indexing
  - 102 blocked firewall requests is also the highest in the ecosystem; understand what it's attempting to reach
  - Consider creating an issue to track and cap maximum tokens per run
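The retry, fallback, and noop-alert ideas in the first recommendation could be sketched as follows. This is a hypothetical illustration, not the workflow engine's actual behavior: `pick_working_mode`, `looks_like_noop`, the `probe` callback, and the 2-turn noop threshold are all assumptions introduced here (the threshold is taken from the 1–2 turn runs observed today).

```python
import time
from typing import Callable, Sequence


class ToolUnavailable(RuntimeError):
    """Raised when no mode exposes the required tool after all retries."""


def pick_working_mode(
    probe: Callable[[str], bool],
    modes: Sequence[str] = ("remote", "local"),
    attempts_per_mode: int = 3,
    base_delay_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Return the first mode whose probe succeeds, retrying with backoff.

    `probe` is a hypothetical callback that reports whether the required
    tool is reachable in the given mode.
    """
    for mode in modes:
        for attempt in range(attempts_per_mode):
            if probe(mode):
                return mode
            # Back off between retries, but not after the last attempt for a mode.
            if attempt < attempts_per_mode - 1:
                sleep(base_delay_s * 2 ** attempt)
    raise ToolUnavailable(f"tool missing in all modes: {', '.join(modes)}")


def looks_like_noop(turns: int, max_noop_turns: int = 2) -> bool:
    """Flag runs that finished in so few turns that they probably did nothing."""
    return turns <= max_noop_turns


if __name__ == "__main__":
    # Simulated probe: remote is unavailable, local works.
    chosen = pick_working_mode(lambda m: m == "local", sleep=lambda _s: None)
    print(f"Using mode: {chosen}")
    print(f"1-turn run flagged as noop: {looks_like_noop(1)}")
```

Raising an explicit error when every mode fails turns the current silent failure into a visible one, which is the main point of the recommendation; the backoff values are placeholders to tune.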
Medium Priority
- Semantic Function Refactoring cost: slight improvement ($3.97) but still high
  - Issue #18388 ([refactor] Semantic Function Clustering Analysis: Misplaced Functions and Duplicate Patterns in pkg/workflow) exists; check whether any action has been taken
  - 72 blocked requests suggest scope creep beyond allowed network
- Lockdown P0 escalation: all programmatic fix paths closed (#17414 "[P1] Lockdown mode failing: GH_AW_GITHUB_TOKEN not configured — 5 workflows affected" and #17807 "[q] fix(workflows): remove explicit lockdown:true to stop recurring failures", both "not_planned")
  - 4 workflows generating failure noise daily
  - Recommend direct escalation to repository maintainers (not via issue)
Low Priority
- Smoke Claude duration: at 12.9m (42 turns), it is the longest-running smoke test
  - All other smokes complete in <7m; investigate whether Smoke Claude is testing more or is stuck in retry loops
Trends (7-day)
- Agent quality: 86/100 (↓ from 89 — AI Moderator regression and Chroma concern)
- Total metered cost: $5.94 (↓ from $6.14 — small improvement)
- Firewall block rate: 47% (stable/elevated — "-" domain artifacts persist)
- Smoke test health: ✅ All passing on main
- Lockdown failures: 4 workflows (→ unchanged, 3+ weeks)
Actions Taken This Run
- Updated `agent-performance-latest.md` in shared repo memory
- Updated `shared-alerts.md` with AI Moderator regression and Chroma concern
- Generated this performance report discussion
Analysis period: 2026-02-25 → 2026-02-26
Next report: 2026-02-27
References: §22453850435 | §22408567616 | §22453521501
Warning
This was intended to be a discussion, but discussions could not be created due to permissions issues. This issue was created as a fallback.
Discussion creation may fail if the specified category is not announcement-capable. Consider using the "Announcements" category or another announcement-capable category in your workflow configuration.
Generated by Agent Performance Analyzer - Meta-Orchestrator
- expires on Feb 27, 2026, 5:48 PM UTC