-# CodeContextBench Operations Guide
-
-This file is the operational quick-reference for benchmark maintenance.
-`CLAUDE.md` mirrors this file.
-
-## Benchmark Overview
-8 SDLC phase suites + 10 MCP-unique suites. SDLC tasks measure code quality
-across phases: build, debug, design, document, fix, secure, test, understand.
-MCP-unique tasks measure org-scale cross-repo discovery and retrieval.
-See `README.md` for the full suite table and `docs/TASK_CATALOG.md` for
-per-task details. See `docs/MCP_UNIQUE_TASKS.md` for the MCP-unique extension.
-
-## Canonical References
-- `README.md` - repo overview and quick start
-- `docs/CONFIGS.md` - config matrix and MCP behavior
-- `docs/QA_PROCESS.md` - pre-run, run-time, post-run validation
-- `docs/ERROR_CATALOG.md` - known failures and remediation
-- `docs/TASK_SELECTION.md` - curation/difficulty policy
-- `docs/TASK_CATALOG.md` - current task inventory
-- `docs/SCORING_SEMANTICS.md` - reward/pass interpretation (incl. oracle checks + hybrid scoring)
-- `docs/EVALUATION_PIPELINE.md` - unified eval: verifier → LLM judge → statistics → report
-- `docs/RETRIEVAL_EVAL_SPEC.md` - full retrieval/IR evaluation pipeline (normalized events → metrics/probes/taxonomy artifacts)
-- `docs/MCP_UNIQUE_TASKS.md` - MCP-unique task system (suites, authoring, oracle, DS tasks)
-- `docs/MCP_UNIQUE_CALIBRATION.md` - oracle coverage analysis and threshold calibration data
-- `docs/WORKFLOW_METRICS.md` - timing/cost metric definitions
-- `docs/AGENT_INTERFACE.md` - runtime I/O contract
-- `docs/EXTENSIBILITY.md` - safe suite/task/config extension
-- `docs/REPO_HEALTH.md` - health gate and branch hygiene (reduce drift)
-- `docs/LEADERBOARD.md` - ranking policy
-- `docs/SUBMISSION.md` - submission format
-- `docs/SKILLS.md` - AI agent skill system overview
-- `skills/` - operational runbooks for AI agents (see `skills/README.md`)
-
-## Git Policy
-- **All work happens on `main`** — do NOT create feature branches.
-- Never run `git checkout -b` or `git switch -c`.
-- Commit directly to `main`. This avoids cross-session branch confusion when multiple agents work on the repo.
-
-## Run Launch Policy
-- **Every `harbor run` invocation MUST be gated by interactive confirmation.**
-  The user must see a pre-flight summary and press Enter before any benchmark
-  task launches. There is no `--yes` or unattended mode.
-- Use `confirm_launch "description" "config" N` from `_common.sh` in one-off
-  scripts. `run_selected_tasks.sh` has its own built-in pre-flight gate.
-- **Never write a script that calls `harbor run` without a confirmation gate.**
-- **Never pass `--yes` to `run_selected_tasks.sh`** — the flag has been removed.
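The confirmation gate described above can be sketched roughly as follows. This is an illustrative approximation only; the real `confirm_launch` lives in `configs/_common.sh` and its exact output and behavior may differ.

```bash
# Illustrative sketch of the interactive launch gate; not the real
# implementation in configs/_common.sh.
confirm_launch() {
  local description="$1" config="$2" task_count="$3"
  # Show the pre-flight summary the operator must review.
  echo "=== Pre-flight summary ==="
  echo "Run:    ${description}"
  echo "Config: ${config}"
  echo "Tasks:  ${task_count}"
  # Block until the operator presses Enter; Ctrl-C aborts the launch.
  read -r -p "Press Enter to launch (Ctrl-C to abort)... "
}
```

A one-off script would `source configs/_common.sh` and call `confirm_launch` with a description, config name, and task count before any `harbor run` line.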
-
-## Typical Skill Routing
-Use these defaults unless there is a task-specific reason not to.
-
-- **Before commit or push:** `repo-health` — run `python3 scripts/repo_health.py` (or `--quick` if only docs/config changed). Do not commit/push with failing health checks.
-- Pre-run readiness: `check-infra`, `validate-tasks`
-- Launch/runs: `run-benchmark`, `run-status`, `watch-benchmarks`
-- Failure investigation: `triage-failure`, `quick-rerun`
-- Cross-config analysis: `compare-configs`, `mcp-audit`, `ir-analysis`
-- Cost/reporting: `cost-report`, `generate-report`
-- Data hygiene: `sync-metadata`, `reextract-metrics`, `archive-run`
-- Planning/prioritization: `whats-next`
-
-## Evaluation Configs
-Config names encode three dimensions: `{agent}-{source}-{verifier}`.
-
-**SDLC suites** (`ccb_build`, `ccb_debug`, etc.): use **baseline-local-direct**
-+ **mcp-remote-direct**. Agent produces code changes; verifier checks git diffs.
-
-**MCP-unique suites** (`ccb_mcp_*`): use **baseline-local-artifact** +
-**mcp-remote-artifact**. Agent produces `answer.json`; verifier scores against
-oracle. Never use `-direct` configs for MCP-unique suites.
-
-MCP configs use `Dockerfile.sg_only` (direct) or `Dockerfile.artifact_only`
-(artifact) so the agent must discover code via MCP tools. The verifier clones
-the mirror repo at verification time and overlays agent changes before scoring.
-See `docs/CONFIGS.md` for the full config matrix.
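Because the naming scheme is a fixed three-part encoding, the "never `-direct` for MCP-unique suites" rule can be checked mechanically. A minimal sketch; the helper name `assert_config_for_suite` is hypothetical, not an existing repo script:

```bash
# Hypothetical guard: reject -direct configs for MCP-unique (ccb_mcp_*) suites.
assert_config_for_suite() {
  local suite="$1" config="$2"
  local agent source verifier
  # Split {agent}-{source}-{verifier} on hyphens.
  IFS=- read -r agent source verifier <<<"$config"
  if [[ "$suite" == ccb_mcp_* && "$verifier" != "artifact" ]]; then
    echo "ERROR: $suite requires an -artifact config, got $config" >&2
    return 1
  fi
  return 0
}
```

For example, `assert_config_for_suite ccb_mcp_search mcp-remote-direct` would fail, while the same suite with `mcp-remote-artifact` passes.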
-
-## Standard Workflow
-0. **Before commit or push:** Run `python3 scripts/repo_health.py` (or `--quick`). Fix any failures so main stays clean and drift is caught early (see `docs/REPO_HEALTH.md`).
-1. Run infrastructure checks before any batch.
-2. Validate task integrity before launch (include runtime smoke for new/changed tasks).
-3. Run the benchmark config (`configs/*_2config.sh` or equivalent).
-4. Monitor progress and classify errors while tasks are running.
-5. Validate outputs after each batch (`result.json`, `flagged_tasks.json`, trajectory coverage).
-6. Triage failures before rerunning; avoid blind reruns.
-7. Regenerate `MANIFEST.json` and the evaluation report after run completion.
-8. Sync metadata if task definitions changed.
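The fail-fast ordering of steps 0-2 can be expressed as a tiny gate helper. The helper `run_gate` is hypothetical (not a repo script); the gate commands shown in the comment are the real ones from this guide:

```bash
# Hypothetical helper: run one pre-flight gate, abort the sequence on failure.
run_gate() {
  echo "gate: $*"
  "$@" || { echo "gate FAILED: $*" >&2; return 1; }
}

# Intended usage (steps 0-2 from the workflow above):
#   run_gate python3 scripts/repo_health.py &&
#   run_gate python3 scripts/check_infra.py &&
#   run_gate python3 scripts/validate_tasks_preflight.py --all
```

Chaining with `&&` means a failed health gate stops the batch before infra checks or task validation run, which matches the "fix failures before launch" policy.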
-
-## Quality Gates
-A run is considered healthy only if all of the following are true:
-
-- No infra blockers (tokens, Docker, disk, credentials)
-- No unexpected missing `result.json`
-- Errored tasks are classified and actionable
-- Zero-reward clusters are explained (task difficulty vs infra/tooling)
-- Trajectory gaps are accounted for (or JSONL fallback noted)
-- Config comparisons are based on matched task sets
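The "no unexpected missing `result.json`" gate can be checked mechanically. A minimal sketch; the helper name `missing_results` and the layout assumption (one subdirectory per task under the run directory) are ours, and may not match the real run layout:

```bash
# Hypothetical check: print task dirs under a run that lack result.json.
# Assumes one subdirectory per task, which may differ from the real layout.
missing_results() {
  local run_dir="$1"
  local d
  for d in "$run_dir"/*/; do
    [ -f "${d}result.json" ] || echo "missing: ${d}"
  done
}
```

Any lines this prints are candidates for triage before the run can be called healthy.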
-
-## Run Hygiene
-- Prefer isolated, well-scoped reruns (don't mix unrelated fixes in one batch).
-- Use parallel mode only when multi-account token state is confirmed fresh.
-- Keep run naming and suite/config metadata consistent.
-- Do not treat archived or draft analyses as canonical docs.
-- Keep `docs/` focused on maintained operational guidance.
-
-## Escalation Rules
-- Repeated infra failures: stop batch reruns and fix the root cause first.
-- Suspected verifier bug: quarantine the task, document the evidence, and open a follow-up.
-- Missing trajectories: use transcript fallback and record the limitation.
-- Widespread MCP regressions: run an MCP usage audit before changing prompts/configs.
-
-## High-Use Commands
+# CodeContextBench Agent Router
+
+This file is the root entrypoint for AI agents working in this repository.
+Keep it small. Use it to route to the right workflow and local guide, not as the
+full operations manual.
+
+## Non-Negotiables
+- All work happens on `main`. Do not create feature branches.
+- Every `harbor run` must be gated by interactive confirmation.
+- Before commit/push, run `python3 scripts/repo_health.py` (or `--quick` for docs/config-only changes).
+
+## Minimal Loading Policy
+- Default load order: this file + one relevant skill + one relevant doc.
+- Do not open broad catalogs (`docs/TASK_CATALOG.md`, large script lists, full reports) unless required.
+- Prefer directory-local `AGENTS.md` / `CLAUDE.md` when working under `scripts/`, `configs/`, `tasks/`, or `docs/`.
+
+## Fast Routing By Intent
+- Launch or rerun benchmarks: `docs/START_HERE_BY_TASK.md` -> "Launch / Rerun Benchmarks"
+- Monitor / status: `docs/START_HERE_BY_TASK.md` -> "Monitor Active Runs"
+- Triage failures: `docs/START_HERE_BY_TASK.md` -> "Triage Failed Tasks"
+- Compare configs / MCP impact / IR: `docs/START_HERE_BY_TASK.md` -> "Analyze Results"
+- Repo policy / health gate: `docs/REPO_HEALTH.md`, `docs/ops/WORKFLOWS.md`
+- Script discovery: `docs/ops/SCRIPT_INDEX.md`
+
+## Local Guides
+- `scripts/AGENTS.md` - script categories, safe usage, one-off handling
+- `configs/AGENTS.md` - run launcher wrappers and confirmation gate policy
+- `tasks/AGENTS.md` - task metadata and validation workflow
+- `docs/AGENTS.md` - documentation IA and canonical vs archive guidance
+
+## Compaction / Handoff Checkpoints
+- Compact after exploration, before multi-file edits.
+- Compact after launching a benchmark batch.
+- Compact after completing a triage batch or report generation pass.
+- Use `docs/ops/HANDOFF_TEMPLATE.md` when handing work off to a new session.
+
+## Canonical Maps
+- `docs/START_HERE_BY_TASK.md` - task-based read order
+- `docs/ops/WORKFLOWS.md` - operational workflow summaries
+- `docs/ops/TROUBLESHOOTING.md` - escalation and common failure routing
+- `docs/ops/SCRIPT_INDEX.md` - generated script registry index
+- `docs/reference/README.md` - stable specs and reference docs
+- `docs/explanations/README.md` - rationale and context docs
+
+## Maintenance
+- Root and local `AGENTS.md` / `CLAUDE.md` files are generated from sources in `docs/ops/`.
+- Regenerate after edits (single command):
 ```bash
-python3 scripts/check_infra.py
-python3 scripts/validate_tasks_preflight.py --all
-python3 scripts/validate_tasks_preflight.py --task <task_dir> --smoke-runtime
-python3 scripts/validate_task_run.py --run <run_dir>
-python3 scripts/aggregate_status.py --staging
-python3 scripts/compare_configs.py --run <run_dir>
-python3 scripts/mcp_audit.py --run <run_dir>
-python3 scripts/cost_report.py --run <run_dir>
-python3 scripts/generate_manifest.py
-python3 scripts/generate_eval_report.py
-python3 scripts/abc_audit.py --suite <suite>       # quality audit
-python3 scripts/abc_score_task.py --suite <suite>  # per-task quality score
-python3 scripts/docs_consistency_check.py          # documentation drift guard
-python3 scripts/repo_health.py                     # repo health gate (before push); --quick for fast check
+python3 scripts/refresh_agent_navigation.py
 ```
-
-## Script Entrypoints
-- `configs/_common.sh` - shared run infra (parallelism, token refresh, validation hooks, `confirm_launch()`, `validate_config_name()`)
-- `configs/sdlc_suite_2config.sh` - generic SDLC runner (used by phase wrappers)
-- `configs/{build,debug,design,document,fix,secure,test}_2config.sh` - thin SDLC phase wrappers
-- `configs/run_selected_tasks.sh` - unified runner from `selected_benchmark_tasks.json`
-- `configs/validate_one_per_benchmark.sh --smoke-runtime` - quick no-agent runtime smoke (1 task per suite)
-  - Smoke interpretation: `smoke_verifier_nonzero_with_reward` is acceptable in no-agent mode.
-  - Timeout diagnostics: `smoke_build_timeout` (image build phase) vs `smoke_verify_timeout` (verifier phase).
-- `scripts/promote_run.py` - staging to official promotion flow
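The smoke statuses named above can be told apart mechanically during triage. A sketch; the helper `classify_smoke` is hypothetical, but the status strings are the ones this guide defines:

```bash
# Hypothetical classifier for the smoke statuses named in this guide.
classify_smoke() {
  case "$1" in
    smoke_verifier_nonzero_with_reward) echo "acceptable in no-agent mode" ;;
    smoke_build_timeout)  echo "timed out during image build" ;;
    smoke_verify_timeout) echo "timed out during verifier phase" ;;
    *) echo "unrecognized smoke status: $1" ;;
  esac
}
```

The split matters operationally: a build timeout points at Docker/image infrastructure, while a verify timeout points at the verifier phase itself.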
136 | | - |
137 | | -## Script Categories |
138 | | - |
139 | | -### Core Operations (used in every run) |
140 | | -- `check_infra.py` - infrastructure readiness checker |
141 | | -- `validate_tasks_preflight.py` - pre-flight task validation (static + optional runtime smoke) |
142 | | -- `aggregate_status.py` - run scanner, status classification, watch mode |
143 | | -- `validate_task_run.py` - post-run output validation |
144 | | -- `status_fingerprints.py` - error classification (12 regex patterns) |
145 | | -- `generate_eval_report.py` - deterministic evaluation report generator |
146 | | -- `generate_manifest.py` - rebuild MANIFEST from on-disk results |
147 | | - |
148 | | -### Analysis & Comparison |
149 | | -- `compare_configs.py` - cross-config divergence analysis |
150 | | -- `mcp_audit.py` - MCP tool usage audit |
151 | | -- `ir_analysis.py` - information retrieval analysis |
152 | | -- `cost_report.py` - token/cost aggregation |
153 | | -- `cost_breakdown_analysis.py` - detailed cost breakdown |
154 | | -- `failure_analysis.py` - failure pattern analysis |
155 | | -- `reliability_analysis.py` - reliability metrics |
156 | | -- `audit_traces.py` - agent trace auditing |
157 | | -- `ds_audit.py` - Deep Search usage audit |
158 | | - |
159 | | -### Repo health (reduce drift, clean branches) |
160 | | -- `repo_health.py` - single gate: docs consistency + selection file + task preflight (see docs/REPO_HEALTH.md) |
161 | | -- `docs_consistency_check.py` - documentation drift guard |
162 | | - |
163 | | -### Quality Assurance |
164 | | -- `abc_audit.py` - ABC benchmark quality audit (32 criteria across 3 dimensions) |
165 | | -- `abc_score_task.py` - per-task quality scoring |
166 | | -- `abc_criteria.py` - ABC criteria data model |
167 | | -- `validate_official_integrity.py` - official run integrity checks |
168 | | -- `quarantine_invalid_tasks.py` - quarantine tasks with zero MCP usage |
169 | | - |
170 | | -### Data Management |
171 | | -- `sync_task_metadata.py` - task.toml vs registry reconciliation (--fix to auto-update) |
172 | | -- `archive_run.py` - archive old runs to save disk |
173 | | -- `rerun_failed.py` - generate rerun commands for failed tasks |
174 | | -- `promote_run.py` - staging to official promotion flow |
175 | | -- `extract_task_metrics.py` - per-task metric extraction |
176 | | -- `reextract_all_metrics.py` - bulk re-extraction |
177 | | - |
178 | | -### Submission & Reporting |
179 | | -- `validate_submission.py` - validate submission format |
180 | | -- `package_submission.py` - package submission archive |
181 | | -- `generate_leaderboard.py` - generate leaderboard rankings |
182 | | -- `generate_comprehensive_report.py` - comprehensive analysis report |
183 | | -- `ingest_judge_results.py` - ingest LLM judge results |
184 | | - |
185 | | -### Task Creation & Selection |
186 | | -- `select_benchmark_tasks.py` - canonical task selection pipeline |
187 | | -- `mine_bug_tasks.py` - mine GitHub for bug-fix tasks |
188 | | -- `generate_pytorch_expected_diffs.py` - generate PyTorch ground truth diffs |
189 | | - |
190 | | -### One-Off / Historical |
191 | | -Scripts in `scripts/` prefixed with `rerun_`, `backfill_`, `fix_`, or `repair_` |
192 | | -are one-off scripts used to address specific past issues. They are preserved |
193 | | -for auditability but are not part of the standard workflow. |
194 | | - |
-
-DependEval-specific scripts (`dependeval_eval_*.py`, `generate_dependeval_tasks.py`,
-`select_dependeval_tasks.py`, `materialize_dependeval_repos.py`) relate to the
-archived `ccb_dependeval` suite.