Commit f7f92c4

sjarmak and claude committed

chore: add filtered selection file for MCP-unique tasks 121-141

Subset of `selected_mcp_unique_tasks.json` containing only the 21 new tasks (use case IDs 121-141) for targeted staging runs. Also regenerates the stale script registry and agent navigation.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

1 parent: dba9a39
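The commit message describes deriving a 21-task subset from `selected_mcp_unique_tasks.json` by use-case ID range. As a rough sketch of that filtering step (the actual JSON schema is not shown in this commit; the `tasks` and `use_case_id` keys below are assumptions, and the real selection tooling lives in `scripts/`):

```python
def filter_by_use_case(selection: dict, lo: int = 121, hi: int = 141) -> dict:
    """Keep only tasks whose use-case ID falls in [lo, hi].

    Hypothetical schema: a top-level "tasks" list whose entries carry a
    "use_case_id" integer. The real selected_mcp_unique_tasks.json may differ.
    """
    kept = [t for t in selection.get("tasks", []) if lo <= t["use_case_id"] <= hi]
    return {**selection, "tasks": kept}

# Toy example: 50 candidate tasks with IDs 100-149.
full = {"suite": "ccb_mcp", "tasks": [{"use_case_id": i} for i in range(100, 150)]}
subset = filter_by_use_case(full)
print(len(subset["tasks"]))  # 21 (IDs 121-141 inclusive)
```

Writing `subset` back out with the standard `json` module (`json.dump(subset, fh, indent=2)`) would yield the kind of filtered selection file this commit adds.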

File tree: 136 files changed (+3526 / -546 lines)


.github/workflows/docs-consistency.yml

Lines changed: 3 additions & 0 deletions

```diff
@@ -20,3 +20,6 @@ jobs:
 
       - name: Validate docs references
         run: python3 scripts/docs_consistency_check.py
+
+      - name: Verify generated agent navigation artifacts are fresh
+        run: python3 scripts/refresh_agent_navigation.py --check
```

.github/workflows/repo_health.yml

Lines changed: 3 additions & 0 deletions

```diff
@@ -21,3 +21,6 @@ jobs:
 
       - name: Repo health gate (full)
         run: python3 scripts/repo_health.py
+
+      - name: Verify generated agent navigation artifacts are fresh
+        run: python3 scripts/refresh_agent_navigation.py --check
```
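Both workflows add the same freshness gate: rerun the generator in `--check` mode and fail the CI step if committed artifacts have drifted from what the generator would produce. A minimal sketch of that pattern (the internals of `scripts/refresh_agent_navigation.py` are not shown in this commit; `render_navigation` is a stand-in):

```python
import sys
from pathlib import Path

def render_navigation() -> str:
    # Stand-in for the real generator, which presumably renders AGENTS.md
    # and friends from sources under docs/ops/.
    return "# CodeContextBench Agent Router\n"

def is_fresh(path: Path) -> bool:
    """True when the committed artifact matches a fresh render."""
    return path.exists() and path.read_text() == render_navigation()

def main(argv: list) -> int:
    target = Path("AGENTS.md")
    if "--check" in argv:
        # CI mode: exit non-zero on drift so the workflow step fails.
        return 0 if is_fresh(target) else 1
    target.write_text(render_navigation())  # regenerate in place
    return 0

if __name__ == "__main__":
    sys.exit(main(sys.argv[1:]))
```

The key design point is that `--check` never writes: it only compares, so the CI job stays read-only and a red build tells the author to rerun the generator locally and commit the result.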

AGENTS.md

Lines changed: 48 additions & 195 deletions
````diff
@@ -1,197 +1,50 @@
-# CodeContextBench Operations Guide
-
-This file is the operational quick-reference for benchmark maintenance.
-`CLAUDE.md` mirrors this file.
-
-## Benchmark Overview
-8 SDLC phase suites + 10 MCP-unique suites. SDLC tasks measure code quality
-across phases: build, debug, design, document, fix, secure, test, understand.
-MCP-unique tasks measure org-scale cross-repo discovery and retrieval.
-See `README.md` for the full suite table and `docs/TASK_CATALOG.md` for
-per-task details. See `docs/MCP_UNIQUE_TASKS.md` for the MCP-unique extension.
-
-## Canonical References
-- `README.md` - repo overview and quick start
-- `docs/CONFIGS.md` - config matrix and MCP behavior
-- `docs/QA_PROCESS.md` - pre-run, run-time, post-run validation
-- `docs/ERROR_CATALOG.md` - known failures and remediation
-- `docs/TASK_SELECTION.md` - curation/difficulty policy
-- `docs/TASK_CATALOG.md` - current task inventory
-- `docs/SCORING_SEMANTICS.md` - reward/pass interpretation (incl. oracle checks + hybrid scoring)
-- `docs/EVALUATION_PIPELINE.md` - unified eval: verifier → LLM judge → statistics → report
-- `docs/RETRIEVAL_EVAL_SPEC.md` - full retrieval/IR evaluation pipeline (normalized events → metrics/probes/taxonomy artifacts)
-- `docs/MCP_UNIQUE_TASKS.md` - MCP-unique task system (suites, authoring, oracle, DS tasks)
-- `docs/MCP_UNIQUE_CALIBRATION.md` - oracle coverage analysis and threshold calibration data
-- `docs/WORKFLOW_METRICS.md` - timing/cost metric definitions
-- `docs/AGENT_INTERFACE.md` - runtime I/O contract
-- `docs/EXTENSIBILITY.md` - safe suite/task/config extension
-- `docs/REPO_HEALTH.md` - health gate and branch hygiene (reduce drift)
-- `docs/LEADERBOARD.md` - ranking policy
-- `docs/SUBMISSION.md` - submission format
-- `docs/SKILLS.md` - AI agent skill system overview
-- `skills/` - operational runbooks for AI agents (see `skills/README.md`)
-
-## Git Policy
-- **All work happens on `main`** — do NOT create feature branches.
-- Never run `git checkout -b` or `git switch -c`.
-- Commit directly to `main`. This avoids cross-session branch confusion when multiple agents work on the repo.
-
-## Run Launch Policy
-- **Every `harbor run` invocation MUST be gated by interactive confirmation.**
-  The user must see a pre-flight summary and press Enter before any benchmark
-  task launches. There is no `--yes` or unattended mode.
-- Use `confirm_launch "description" "config" N` from `_common.sh` in one-off
-  scripts. `run_selected_tasks.sh` has its own built-in pre-flight gate.
-- **Never write a script that calls `harbor run` without a confirmation gate.**
-- **Never pass `--yes` to `run_selected_tasks.sh`** — the flag has been removed.
-
-## Typical Skill Routing
-Use these defaults unless there is a task-specific reason not to.
-
-- **Before commit or push:** `repo-health` — run `python3 scripts/repo_health.py` (or `--quick` if only docs/config changed). Do not commit/push with failing health checks.
-- Pre-run readiness: `check-infra`, `validate-tasks`
-- Launch/runs: `run-benchmark`, `run-status`, `watch-benchmarks`
-- Failure investigation: `triage-failure`, `quick-rerun`
-- Cross-config analysis: `compare-configs`, `mcp-audit`, `ir-analysis`
-- Cost/reporting: `cost-report`, `generate-report`
-- Data hygiene: `sync-metadata`, `reextract-metrics`, `archive-run`
-- Planning/prioritization: `whats-next`
-
-## Evaluation Configs
-Config names encode three dimensions: `{agent}-{source}-{verifier}`.
-
-**SDLC suites** (`ccb_build`, `ccb_debug`, etc.): use **baseline-local-direct**
-+ **mcp-remote-direct**. Agent produces code changes; verifier checks git diffs.
-
-**MCP-unique suites** (`ccb_mcp_*`): use **baseline-local-artifact** +
-**mcp-remote-artifact**. Agent produces `answer.json`; verifier scores against
-oracle. Never use `-direct` configs for MCP-unique suites.
-
-MCP configs use `Dockerfile.sg_only` (direct) or `Dockerfile.artifact_only`
-(artifact) so the agent must discover code via MCP tools. The verifier clones
-the mirror repo at verification time and overlays agent changes before scoring.
-See `docs/CONFIGS.md` for the full config matrix.
-
-## Standard Workflow
-0. **Before commit or push:** Run `python3 scripts/repo_health.py` (or `--quick`). Fix any failures so main stays clean and drift is caught early (see `docs/REPO_HEALTH.md`).
-1. Run infrastructure checks before any batch.
-2. Validate task integrity before launch (include runtime smoke for new/changed tasks).
-3. Run the benchmark config (`configs/*_2config.sh` or equivalent).
-4. Monitor progress and classify errors while tasks are running.
-5. Validate outputs after each batch (`result.json`, `flagged_tasks.json`, trajectory coverage).
-6. Triage failures before rerunning; avoid blind reruns.
-7. Regenerate `MANIFEST.json` and evaluation report after run completion.
-8. Sync metadata if task definitions changed.
-
-## Quality Gates
-A run is considered healthy only if all are true:
-
-- No infra blockers (tokens, Docker, disk, credentials)
-- No unexpected missing `result.json`
-- Errored tasks are classified and actionable
-- Zero-reward clusters are explained (task difficulty vs infra/tooling)
-- Trajectory gaps are accounted for (or JSONL fallback noted)
-- Config comparisons are based on matched task sets
-
-## Run Hygiene
-- Prefer isolated, well-scoped reruns (don't mix unrelated fixes in one batch).
-- Use parallel mode only when multi-account token state is confirmed fresh.
-- Keep run naming and suite/config metadata consistent.
-- Do not treat archived or draft analyses as canonical docs.
-- Keep `docs/` focused on maintained operational guidance.
-
-## Escalation Rules
-- Repeated infra failures: stop batch reruns and fix root cause first.
-- Suspected verifier bug: quarantine task, document evidence, and open follow-up.
-- Missing trajectories: use transcript fallback and record the limitation.
-- Widespread MCP regressions: run MCP usage audit before changing prompts/configs.
-
-## High-Use Commands
+# CodeContextBench Agent Router
+
+This file is the root entrypoint for AI agents working in this repository.
+Keep it small. Use it to route to the right workflow and local guide, not as the
+full operations manual.
+
+## Non-Negotiables
+- All work happens on `main`. Do not create feature branches.
+- Every `harbor run` must be gated by interactive confirmation.
+- Before commit/push, run `python3 scripts/repo_health.py` (or `--quick` for docs/config-only changes).
+
+## Minimal Loading Policy
+- Default load order: this file + one relevant skill + one relevant doc.
+- Do not open broad catalogs (`docs/TASK_CATALOG.md`, large script lists, full reports) unless required.
+- Prefer directory-local `AGENTS.md` / `CLAUDE.md` when working under `scripts/`, `configs/`, `tasks/`, or `docs/`.
+
+## Fast Routing By Intent
+- Launch or rerun benchmarks: `docs/START_HERE_BY_TASK.md` -> "Launch / Rerun Benchmarks"
+- Monitor / status: `docs/START_HERE_BY_TASK.md` -> "Monitor Active Runs"
+- Triage failures: `docs/START_HERE_BY_TASK.md` -> "Triage Failed Tasks"
+- Compare configs / MCP impact / IR: `docs/START_HERE_BY_TASK.md` -> "Analyze Results"
+- Repo policy / health gate: `docs/REPO_HEALTH.md`, `docs/ops/WORKFLOWS.md`
+- Script discovery: `docs/ops/SCRIPT_INDEX.md`
+
+## Local Guides
+- `scripts/AGENTS.md` - script categories, safe usage, one-off handling
+- `configs/AGENTS.md` - run launcher wrappers and confirmation gate policy
+- `tasks/AGENTS.md` - task metadata and validation workflow
+- `docs/AGENTS.md` - documentation IA and canonical vs archive guidance
+
+## Compaction / Handoff Checkpoints
+- Compact after exploration, before multi-file edits.
+- Compact after launching a benchmark batch.
+- Compact after completing a triage batch or report generation pass.
+- Use `docs/ops/HANDOFF_TEMPLATE.md` when handing work to a new session.
+
+## Canonical Maps
+- `docs/START_HERE_BY_TASK.md` - task-based read order
+- `docs/ops/WORKFLOWS.md` - operational workflow summaries
+- `docs/ops/TROUBLESHOOTING.md` - escalation and common failure routing
+- `docs/ops/SCRIPT_INDEX.md` - generated script registry index
+- `docs/reference/README.md` - stable specs and reference docs
+- `docs/explanations/README.md` - rationale and context docs
+
+## Maintenance
+- Root and local `AGENTS.md` / `CLAUDE.md` files are generated from sources in `docs/ops/`.
+- Regenerate after edits (single command):
 ```bash
-python3 scripts/check_infra.py
-python3 scripts/validate_tasks_preflight.py --all
-python3 scripts/validate_tasks_preflight.py --task <task_dir> --smoke-runtime
-python3 scripts/validate_task_run.py --run <run_dir>
-python3 scripts/aggregate_status.py --staging
-python3 scripts/compare_configs.py --run <run_dir>
-python3 scripts/mcp_audit.py --run <run_dir>
-python3 scripts/cost_report.py --run <run_dir>
-python3 scripts/generate_manifest.py
-python3 scripts/generate_eval_report.py
-python3 scripts/abc_audit.py --suite <suite>  # quality audit
-python3 scripts/abc_score_task.py --suite <suite>  # per-task quality score
-python3 scripts/docs_consistency_check.py  # documentation drift guard
-python3 scripts/repo_health.py  # repo health gate (before push); --quick for fast check
+python3 scripts/refresh_agent_navigation.py
 ```
-
-## Script Entrypoints
-- `configs/_common.sh` - shared run infra (parallelism, token refresh, validation hooks, `confirm_launch()`, `validate_config_name()`)
-- `configs/sdlc_suite_2config.sh` - generic SDLC runner (used by phase wrappers)
-- `configs/{build,debug,design,document,fix,secure,test}_2config.sh` - thin SDLC phase wrappers
-- `configs/run_selected_tasks.sh` - unified runner from `selected_benchmark_tasks.json`
-- `configs/validate_one_per_benchmark.sh --smoke-runtime` - quick no-agent runtime smoke (1 task per suite)
-  - Smoke interpretation: `smoke_verifier_nonzero_with_reward` is acceptable in no-agent mode.
-  - Timeout diagnostics: `smoke_build_timeout` (image build phase) vs `smoke_verify_timeout` (verifier phase).
-- `scripts/promote_run.py` - staging to official promotion flow
-
-## Script Categories
-
-### Core Operations (used in every run)
-- `check_infra.py` - infrastructure readiness checker
-- `validate_tasks_preflight.py` - pre-flight task validation (static + optional runtime smoke)
-- `aggregate_status.py` - run scanner, status classification, watch mode
-- `validate_task_run.py` - post-run output validation
-- `status_fingerprints.py` - error classification (12 regex patterns)
-- `generate_eval_report.py` - deterministic evaluation report generator
-- `generate_manifest.py` - rebuild MANIFEST from on-disk results
-
-### Analysis & Comparison
-- `compare_configs.py` - cross-config divergence analysis
-- `mcp_audit.py` - MCP tool usage audit
-- `ir_analysis.py` - information retrieval analysis
-- `cost_report.py` - token/cost aggregation
-- `cost_breakdown_analysis.py` - detailed cost breakdown
-- `failure_analysis.py` - failure pattern analysis
-- `reliability_analysis.py` - reliability metrics
-- `audit_traces.py` - agent trace auditing
-- `ds_audit.py` - Deep Search usage audit
-
-### Repo health (reduce drift, clean branches)
-- `repo_health.py` - single gate: docs consistency + selection file + task preflight (see docs/REPO_HEALTH.md)
-- `docs_consistency_check.py` - documentation drift guard
-
-### Quality Assurance
-- `abc_audit.py` - ABC benchmark quality audit (32 criteria across 3 dimensions)
-- `abc_score_task.py` - per-task quality scoring
-- `abc_criteria.py` - ABC criteria data model
-- `validate_official_integrity.py` - official run integrity checks
-- `quarantine_invalid_tasks.py` - quarantine tasks with zero MCP usage
-
-### Data Management
-- `sync_task_metadata.py` - task.toml vs registry reconciliation (--fix to auto-update)
-- `archive_run.py` - archive old runs to save disk
-- `rerun_failed.py` - generate rerun commands for failed tasks
-- `promote_run.py` - staging to official promotion flow
-- `extract_task_metrics.py` - per-task metric extraction
-- `reextract_all_metrics.py` - bulk re-extraction
-
-### Submission & Reporting
-- `validate_submission.py` - validate submission format
-- `package_submission.py` - package submission archive
-- `generate_leaderboard.py` - generate leaderboard rankings
-- `generate_comprehensive_report.py` - comprehensive analysis report
-- `ingest_judge_results.py` - ingest LLM judge results
-
-### Task Creation & Selection
-- `select_benchmark_tasks.py` - canonical task selection pipeline
-- `mine_bug_tasks.py` - mine GitHub for bug-fix tasks
-- `generate_pytorch_expected_diffs.py` - generate PyTorch ground truth diffs
-
-### One-Off / Historical
-Scripts in `scripts/` prefixed with `rerun_`, `backfill_`, `fix_`, or `repair_`
-are one-off scripts used to address specific past issues. They are preserved
-for auditability but are not part of the standard workflow.
-
-DependEval-specific scripts (`dependeval_eval_*.py`, `generate_dependeval_tasks.py`,
-`select_dependeval_tasks.py`, `materialize_dependeval_repos.py`) relate to the
-archived ccb_dependeval suite.
````
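The AGENTS.md text above (in both the old operations guide and the router that replaces it) encodes a config-naming rule: config names are `{agent}-{source}-{verifier}` triples, and MCP-unique suites (`ccb_mcp_*`) must use `-artifact`, never `-direct`. A hedged sketch of that check (the real `validate_config_name()` lives in `configs/_common.sh` and is not shown here; this Python rendition and its allowed-value sets are inferred from config names in the doc):

```python
# Allowed components, inferred from names like "baseline-local-direct"
# and "mcp-remote-artifact"; the real lists may be longer.
VALID_AGENTS = {"baseline", "mcp"}
VALID_SOURCES = {"local", "remote"}
VALID_VERIFIERS = {"direct", "artifact"}

def validate_config_name(config: str, suite: str) -> bool:
    """Check an {agent}-{source}-{verifier} config name against a suite."""
    parts = config.split("-")
    if len(parts) != 3:
        return False
    agent, source, verifier = parts
    if agent not in VALID_AGENTS or source not in VALID_SOURCES \
            or verifier not in VALID_VERIFIERS:
        return False
    # MCP-unique suites score answer.json against an oracle, so they must
    # use artifact verification, never direct git-diff verification.
    if suite.startswith("ccb_mcp_") and verifier != "artifact":
        return False
    return True
```

For example, `validate_config_name("mcp-remote-artifact", "ccb_mcp_discovery")` passes, while `validate_config_name("baseline-local-direct", "ccb_mcp_discovery")` is rejected (the suite name here is hypothetical).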
