# Retrieval Evaluation Specification

> **Status**: v1 — standalone, non-ranking.
> This framework evaluates retrieval quality and its downstream impact on task
> outcomes without changing primary CCB scoring or leaderboard semantics.

## Purpose

Measure three aspects of agent retrieval behavior:

1. **Retrieval quality** — did the agent find the right files/symbols?
2. **Utilization quality** — did the agent use retrieved evidence correctly?
3. **Downstream impact** — how do retrieval metrics correlate with task
   outcomes, cost, and time?

## Schema Overview

The normalized retrieval event schema
(`schemas/retrieval_events_schema.json`, version 1.0) defines a single
JSON document per task-config pair containing:

| Section | Purpose |
|---------|---------|
| `provenance` | Run/task/config identification |
| `coverage` | Trace and ground-truth availability flags |
| `ground_truth` | Expected files, optional symbols and chunks |
| `events` | Ordered step-level retrieval events |
| `summary` | Pre-computed aggregate counts (optional) |
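
For orientation, the sketch below shows what one such document could look like,
written as a Python literal. Section and field names follow this spec; every
value is an illustrative placeholder rather than output from any real run.

```python
# Illustrative skeleton of a retrieval-events document (all values are placeholders).
example_doc = {
    "schema_version": "1.0",
    "provenance": {
        "run_id": "run-0000-placeholder",       # hypothetical run directory name
        "batch_timestamp": "19700101T000000",   # hypothetical batch subdirectory
        "task_name": "example_task",
        "config_name": "baseline-local-direct",
        "benchmark": "ccb_fix",
    },
    "coverage": {
        "has_trajectory": True,
        "has_transcript": True,
        "has_ground_truth": True,
        "has_chunk_ground_truth": False,
        "trace_source": "merged",
        "degraded_reason": None,
    },
    "ground_truth": {
        "files": ["src/example.py"],
        "symbols": [],
        "chunks": [],
    },
    "events": [
        {
            "step_index": 0,
            "tool_name": "Read",
            "tool_category": "file_read",
            "is_mcp": False,
            "target_files": ["src/example.py"],
            "hits_ground_truth": True,
            "cumulative_tokens": 1200,
            "elapsed_seconds": 3.4,
        },
    ],
    "summary": {"total_events": 1, "mcp_events": 0, "local_events": 1},
}
```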

## Field Semantics

### Provenance

Uniquely identifies the task execution:

- `run_id` — staging or official run directory name.
- `batch_timestamp` — batch subdirectory within the run.
- `task_name` — canonical task identifier (matches `task.toml` name).
- `config_name` — full config label (e.g. `baseline-local-direct`,
  `mcp-remote-direct`).
- `benchmark` — suite name (e.g. `ccb_fix`, `ccb_mcp_crossorg`).

### Coverage Flags

Every document reports trace availability explicitly so downstream stages
can filter or flag results:

- `has_trajectory` — `agent/trajectory.json` was found and parseable.
- `has_transcript` — `agent/claude-code.txt` (JSONL) was found and parseable.
- `has_ground_truth` — file-level expected files exist for the task.
- `has_chunk_ground_truth` — line-range annotations exist (e.g. defect
  locations in code-review tasks).
- `trace_source` — which source produced the events:
  - `trajectory` — events from `trajectory.json` only.
  - `transcript` — events from `claude-code.txt` only.
  - `merged` — events from both sources combined (trajectory preferred for
    tool calls, transcript for timestamps or subagent recovery).
  - `null` — degraded mode (no usable trace).
- `degraded_reason` — human-readable explanation when events are empty or
  incomplete.

### Ground Truth

Ground truth is loaded from the task definition directory using the existing
priority chain in `ccb_metrics/ground_truth.py`:

1. `tests/ground_truth.json` (high confidence)
2. `tests/expected_defects.json` (high confidence)
3. `tests/expected_changes.json` (high confidence)
4. `tests/reference_fix.patch` / `tests/expected.diff` (high confidence)
5. `solution/solve.sh` gold patch (medium confidence)
6. `instruction.md` / `tests/test.sh` regex extraction (medium/low confidence)
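
The fallthrough itself is first-hit-wins: try each source in order and stop at
the first one that exists. A minimal sketch under that assumption follows; the
helper name and return shape are hypothetical and do not mirror the actual code
in `ccb_metrics/ground_truth.py`.

```python
from pathlib import Path

# Hypothetical illustration of a first-hit-wins priority chain.
# (relative path, confidence) pairs mirror the ordering documented above.
_GT_SOURCES = [
    ("tests/ground_truth.json", "high"),
    ("tests/expected_defects.json", "high"),
    ("tests/expected_changes.json", "high"),
    ("tests/reference_fix.patch", "high"),
    ("tests/expected.diff", "high"),
    ("solution/solve.sh", "medium"),
    ("instruction.md", "medium/low"),
    ("tests/test.sh", "medium/low"),
]

def find_ground_truth_source(task_dir: str) -> tuple[Path, str] | None:
    """Return the first ground-truth source that exists, with its confidence label."""
    for rel_path, confidence in _GT_SOURCES:
        candidate = Path(task_dir) / rel_path
        if candidate.exists():
            return candidate, confidence
    return None  # no ground truth: coverage.has_ground_truth stays False
```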

Three levels of ground truth are supported:

- **File-level** (`ground_truth.files`) — always populated when ground truth
  exists. Repo-relative paths.
- **Symbol-level** (`ground_truth.symbols`) — optional. Function/class names
  within ground-truth files, loaded from `task_spec.json` oracle items.
- **Chunk-level** (`ground_truth.chunks`) — optional. Line ranges within files,
  loaded from `expected_defects.json` annotations or similar.

When `coverage.has_ground_truth` is false, `ground_truth.files` is an empty
array and all IR metrics are marked as non-computable.

### Retrieval Events

Each event represents one retrieval-related tool call by the agent:

- `step_index` — zero-based position in the trace. Preserves execution order.
- `tool_name` — raw name from the trace (e.g. `Read`,
  `mcp__sourcegraph__sg_keyword_search`).
- `tool_category` — normalized category for cross-config comparison:

| Category | Local tools | MCP tools |
|----------|-------------|-----------|
| `file_read` | Read | read_file |
| `file_search` | Glob, Grep | list_files |
| `symbol_navigation` | — | find_references, go_to_definition |
| `code_search` | Grep (pattern) | keyword_search, nls_search |
| `commit_search` | — | commit_search, diff_search, compare_revisions |
| `deep_search` | — | deepsearch, deepsearch_read |
| `file_write` | Write, Edit | — |
| `other` | Bash, Task | get_contributor_repos, list_repos |

- `is_mcp` — true for any `mcp__sourcegraph__*` tool call.
- `target_files` — normalized file paths accessed or returned. Normalization
  strips `/workspace/`, `/repo_full/`, `/testbed/`, and diff `a/`/`b/` prefixes;
  paths are lowercased for matching (see the sketch after this list).
- `hits_ground_truth` — true if any `target_file` matches a ground-truth file.
- `cumulative_tokens` — running token total up to this step (when available).
- `elapsed_seconds` — wall-clock time from agent execution start.
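
A sketch of the normalization and ground-truth matching described for
`target_files`; the prefix list comes from the bullet above, while the function
names are hypothetical.

```python
# Hypothetical sketch of the path normalization used for matching.
_STRIP_PREFIXES = ("/workspace/", "/repo_full/", "/testbed/", "a/", "b/")

def normalize_path(raw: str) -> str:
    """Strip known sandbox and diff prefixes, drop leading slashes, lowercase."""
    path = raw.strip()
    for prefix in _STRIP_PREFIXES:
        if path.startswith(prefix):
            path = path[len(prefix):]
            break
    return path.lstrip("/").lower()

def hits_ground_truth(target_files: list[str], gt_files: list[str]) -> bool:
    """True if any normalized target file matches a normalized ground-truth file."""
    gt = {normalize_path(p) for p in gt_files}
    return any(normalize_path(p) in gt for p in target_files)
```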

### Event Summary

Optional pre-computed counts to avoid re-scanning the events array:

- `total_events`, `mcp_events`, `local_events`
- `unique_files_accessed`, `ground_truth_files_hit`
- `first_ground_truth_hit_step`
- `events_by_category` (keyed by `tool_category`)
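
As a sketch, the summary block can be recomputed from the events array and the
ground-truth file set; the dict-based event shape here is an assumption for
illustration, and paths are assumed already normalized.

```python
from collections import Counter

def build_summary(events: list[dict], gt_files: set[str]) -> dict:
    """Derive the optional summary block from an ordered list of event dicts."""
    files = {f for ev in events for f in ev.get("target_files", [])}
    hit_steps = [ev["step_index"] for ev in events if ev.get("hits_ground_truth")]
    return {
        "total_events": len(events),
        "mcp_events": sum(1 for ev in events if ev.get("is_mcp")),
        "local_events": sum(1 for ev in events if not ev.get("is_mcp")),
        "unique_files_accessed": len(files),
        "ground_truth_files_hit": len(files & gt_files),
        "first_ground_truth_hit_step": min(hit_steps) if hit_steps else None,
        "events_by_category": dict(Counter(ev["tool_category"] for ev in events)),
    }
```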

## Degraded Mode Behavior

The pipeline handles incomplete data gracefully:

| Condition | Behavior |
|-----------|----------|
| No trajectory AND no transcript | `events` is empty, `coverage.trace_source` is null, `coverage.degraded_reason` explains why |
| Trajectory only (no transcript) | Events extracted from trajectory; timestamps may be absent for some steps |
| Transcript only (no trajectory) | Events extracted from transcript; subagent tool calls may be missed |
| No ground truth | `ground_truth.files` is empty; `hits_ground_truth` is false for all events; IR metrics non-computable |
| No chunk ground truth | `ground_truth.chunks` absent; chunk-level metrics emit a `resolution: "file_level_only"` flag |

Downstream metric stages MUST check `coverage` flags before computing metrics
and propagate appropriate `non_computable` markers rather than emitting
misleading zeroes.
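
One way a metric stage might honor this, sketched with the coverage flags
above; the marker shape (`computable` plus `reason`) is an assumption, not the
pipeline's exact output schema.

```python
def non_computable_marker(coverage: dict) -> dict | None:
    """Return a non-computable marker when prerequisites are missing, else None."""
    if coverage.get("trace_source") is None:
        return {"computable": False,
                "reason": coverage.get("degraded_reason") or "no usable trace"}
    if not coverage.get("has_ground_truth"):
        return {"computable": False, "reason": "no file-level ground truth"}
    return None  # safe to compute IR metrics
```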

## Schema Versioning

- The `schema_version` field is a semver-style string (currently `"1.0"`).
- **Minor bumps** (1.1, 1.2, ...) add optional fields. Consumers of 1.0 data
  continue to work unchanged.
- **Major bumps** (2.0) change required fields or remove/rename existing ones.
  Consumers must update.
- The normalization CLI embeds the schema version it was built against.
  Metric stages validate `schema_version` on load and reject unknown major
  versions.
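
A minimal sketch of that load-time check, assuming the `MAJOR.MINOR` string
form described above; the constant and function names are illustrative.

```python
SUPPORTED_MAJOR = 1  # major schema version this hypothetical consumer understands

def check_schema_version(doc: dict) -> None:
    """Reject documents whose major schema version the consumer was not built for."""
    version = str(doc.get("schema_version", ""))
    major, _, _ = version.partition(".")
    if not major.isdigit() or int(major) != SUPPORTED_MAJOR:
        raise ValueError(
            f"unsupported schema_version {version!r}; expected major {SUPPORTED_MAJOR}"
        )
```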

## Output Paths

Normalized retrieval event files are written to a parallel directory structure
that does not overwrite existing run artifacts:

```
runs/{staging|official}/{run_id}/retrieval_events/
  {config_name}/
    {task_name}.retrieval_events.json
```

Run-level aggregates are written alongside:

```
runs/{staging|official}/{run_id}/retrieval_events/
  run_retrieval_summary.json
```
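
A small helper that mirrors the layout above can make the convention concrete;
the function names are illustrative only, while the path template itself comes
from this section.

```python
from pathlib import Path

def retrieval_events_path(runs_root: str, tier: str, run_id: str,
                          config_name: str, task_name: str) -> Path:
    """Per-task output path; tier is 'staging' or 'official'."""
    return (Path(runs_root) / tier / run_id / "retrieval_events"
            / config_name / f"{task_name}.retrieval_events.json")

def run_summary_path(runs_root: str, tier: str, run_id: str) -> Path:
    """Run-level aggregate path alongside the per-task files."""
    return Path(runs_root) / tier / run_id / "retrieval_events" / "run_retrieval_summary.json"
```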

## Pipeline Stages

The full evaluation pipeline (`scripts/retrieval_eval_pipeline.py`) runs five
stages on each normalized event document:

### Stage 1: File-Level IR Metrics

Standard information retrieval metrics computed from the ordered list of
retrieved files against ground-truth files:

- **Precision@K, Recall@K, F1@K** (K = 1, 3, 5, 10)
- **MRR** (Mean Reciprocal Rank)
- **nDCG@K** (normalized Discounted Cumulative Gain)
- **MAP** (Mean Average Precision)
- **File-level recall** (fraction of GT files found anywhere in retrieved list)
- **Context efficiency** (fraction of retrieved files that are relevant)
- **TTFR** (time-to-first-relevant file, in seconds and tokens)

Tasks without ground truth are marked `computable: false`.
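
For reference, the rank-based metrics follow their textbook definitions over
the ordered, de-duplicated list of retrieved files; the sketch below is that
textbook form, not the pipeline's exact implementation.

```python
def precision_recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> tuple[float, float]:
    """Precision@K and Recall@K over an ordered list of unique retrieved files."""
    top_k = retrieved[:k]
    hits = sum(1 for f in top_k if f in relevant)
    precision = hits / k if k else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """Per-task reciprocal rank; MRR is this value averaged across tasks."""
    for rank, f in enumerate(retrieved, start=1):
        if f in relevant:
            return 1.0 / rank
    return 0.0
```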

### Stage 2: Chunk-Level Relevance Metrics

When chunk-level ground truth (line-range annotations) is available:

- **Chunk recall** = fraction of GT chunks whose file was accessed by the agent.
- **Resolution** field: `"chunk_level"` or `"file_level_only"`.
- **Validity** field: `"file_match_only"` (v1 granularity) or `"unsupported"`.

**Chunking assumption**: In v1, a retrieval event "covers" a ground-truth
chunk if any `target_file` matches the chunk's file path. Finer-grained
matching (e.g. exact line-range overlap) requires structured diff data and is
deferred to future schema versions.
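
Under that file-match assumption, chunk recall reduces to a membership check;
the chunk shape (a dict with a `file` key) is assumed for illustration.

```python
def chunk_recall(gt_chunks: list[dict], accessed_files: set[str]) -> float | None:
    """Fraction of ground-truth chunks whose file was accessed (v1 file-match granularity)."""
    if not gt_chunks:
        return None  # no chunk ground truth: resolution is "file_level_only"
    covered = sum(1 for chunk in gt_chunks if chunk["file"] in accessed_files)
    return covered / len(gt_chunks)
```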

### Stage 3: Utilization Probe Metrics

Measures whether retrieved evidence was actually *used* by the agent:

- **`util_referenced_file_correctness`** = |files_written ∩ GT| / |GT|.
  Measures whether the agent wrote to the correct files after retrieval.
- **`util_read_before_write_ratio`** = fraction of written files that were
  read by the agent before being written to. High values indicate deliberate
  evidence consumption.
- **Coverage**: `probe_available: false` when the agent performed no file writes
  or when no ground truth exists. The probe requires write events to measure
  utilization — read-only tasks produce no utilization signal.
- **Limitations**: These probes measure file-level correctness only. They do
  not validate whether the *content* written was semantically correct (that is
  the verifier's job). Future probes may add symbol-level or API-level checks.
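
A sketch of both probes from ordered events; treating only `file_read` events
as reads and judging each written file at its first write are assumptions of
this sketch, not statements about the actual stage.

```python
def utilization_probes(events: list[dict], gt_files: set[str]) -> dict:
    """File-level utilization probes; read-only tasks yield probe_available=False."""
    seen_reads: set[str] = set()
    written: set[str] = set()
    read_before_write: set[str] = set()
    for ev in events:  # events are in execution order
        targets = set(ev.get("target_files", []))
        if ev["tool_category"] == "file_write":
            for f in targets - written:  # judge each file at its first write
                if f in seen_reads:
                    read_before_write.add(f)
            written |= targets
        elif ev["tool_category"] == "file_read":
            seen_reads |= targets
    if not written or not gt_files:
        return {"probe_available": False}
    return {
        "probe_available": True,
        "util_referenced_file_correctness": len(written & gt_files) / len(gt_files),
        "util_read_before_write_ratio": len(read_before_write) / len(written),
    }
```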

### Stage 4: Error Taxonomy and Calibration Slices

Five taxonomy labels classify retrieval error modes per task:

| Label | Definition |
|-------|------------|
| `irrelevant_retrieval` | Files retrieved that are not in ground truth |
| `missed_key_evidence` | Ground truth files never retrieved |
| `wrong_evidence_used` | Non-GT files the agent wrote to |
| `unused_correct_retrieval` | GT files retrieved but never written to |
| `ambiguity_near_miss` | Retrieved files in the same directory as a GT file |

Two calibration slice dimensions:

- **Candidate set size**: `small` (≤5 files), `medium` (6–20), `large` (>20)
- **Evidence type**: `local` (no MCP tools used) or `mcp` (at least one MCP call)
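
A sketch of how the labels and the candidate-set slice can be derived from file
sets computed in earlier stages; the boolean-flag output shape and the
same-directory heuristic implementation are assumptions of this sketch.

```python
def taxonomy_labels(retrieved: set[str], written: set[str], gt: set[str]) -> dict[str, bool]:
    """Per-task error-mode flags from retrieved / written / ground-truth file sets."""
    gt_dirs = {f.rsplit("/", 1)[0] for f in gt if "/" in f}
    return {
        "irrelevant_retrieval": bool(retrieved - gt),
        "missed_key_evidence": bool(gt - retrieved),
        "wrong_evidence_used": bool(written - gt),
        "unused_correct_retrieval": bool((retrieved & gt) - written),
        "ambiguity_near_miss": any(
            "/" in f and f.rsplit("/", 1)[0] in gt_dirs for f in retrieved - gt
        ),
    }

def candidate_set_slice(candidate_count: int) -> str:
    """Bucket the candidate set size: small (<=5 files), medium (6-20), large (>20)."""
    if candidate_count <= 5:
        return "small"
    return "medium" if candidate_count <= 20 else "large"
```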

### Stage 5: Artifact Emission

Per-task artifacts (`{task_name}.retrieval_metrics.json`) contain all four
metric stages plus provenance and coverage metadata. Run-level summaries
(`run_retrieval_summary.json`) contain aggregated statistics across all
computable tasks.

## Relationship to Existing Pipeline

This evaluation is **standalone and non-ranking** in v1:

- Does not modify `result.json`, `task_metrics.json`, or `MANIFEST.json`.
- Does not affect verifier rewards or leaderboard scoring.
- Consumes the same run artifacts as `ir_analysis.py` and `mcp_audit.py`.
- Future versions may feed retrieval metrics into `generate_eval_report.py`
  as an optional supplementary section.

## v1 Rollout Boundaries

### What v1 Does

- Normalizes agent traces into step-level retrieval events.
- Computes file-level IR metrics, chunk-level metrics (with fallback),
  utilization probes, and error taxonomy.
- Correlates retrieval metrics with task outcomes (association only).
- Generates matched task comparisons between baseline and MCP configs.
- Produces standalone human-readable reports.

### What v1 Does NOT Do

- Does not change verifier rewards, leaderboard scoring, or MANIFEST.json.
- Does not block or gate benchmark runs on retrieval quality.
- Does not modify existing evaluation pipeline outputs.
- Does not claim causal relationships between retrieval and outcomes.

### Comparability Requirements

Matched task comparisons require:

- **Same task** executed in both baseline and MCP configs.
- **Same model** and harness version across paired configs.
- **Result.json present** with valid reward for both configs.
- **At least 3 matched tasks** for aggregate statistics.

Unmatched tasks (present in one config but not the other) are excluded from
matched comparisons but included in per-config aggregates.

### Coverage Caveats

- Tasks without file-level ground truth (MCP-unique discovery tasks,
  write-only tasks) are excluded from IR metrics.
- Tasks in degraded mode (no trajectory or transcript) emit empty events
  and are flagged in coverage metadata.
- Chunk-level metrics operate at file-match granularity in v1.

## Future Integration Points

The following touchpoints exist for optional future integration. **None of
these should be implemented without explicit policy discussion.**

### `docs/EVALUATION_PIPELINE.md`

- **Optional Layer 5**: Add retrieval evaluation as an optional post-run
  analysis layer alongside the existing 4-layer pipeline.
- Retrieval metrics could appear as supplementary columns in the eval report
  tables without affecting the primary scoring dimensions.

### `docs/SCORING_SEMANTICS.md`

- **Retrieval-aware composite scores**: A future version could define a
  weighted composite that includes retrieval quality alongside verifier
  reward. This would require consensus on weight calibration and must not
  change existing per-task reward semantics.
- **Confidence gating**: Tasks with low retrieval coverage could receive
  confidence flags that downstream consumers use for filtering but not
  score modification.

### `docs/MCP_UNIQUE_TASKS.md` / `docs/MCP_UNIQUE_CALIBRATION.md`

- **Oracle coverage integration**: MCP-unique task oracle items could be
  mapped to retrieval events for oracle-aware retrieval scoring.
- **Deep Search effectiveness**: The `deep_search` tool category enables
  future analysis of Deep Search ROI versus keyword/NLS search.

### `docs/LEADERBOARD.md`

- **Retrieval-conditioned rankings**: Future leaderboard views could show
  rankings conditioned on retrieval quality tiers (e.g. "among tasks where
  the agent retrieved ≥50% of ground truth files"). This would be
  supplementary and would not replace the primary ranking.

### `scripts/generate_eval_report.py`

- **Supplementary tables**: A future version of the report generator could
  optionally include retrieval quality tables and correlation summaries
  from the retrieval pipeline output.

## See Also

- `schemas/retrieval_events_schema.json` — JSON Schema definition
- `docs/EVALUATION_PIPELINE.md` — primary evaluation pipeline
- `docs/SCORING_SEMANTICS.md` — reward interpretation
- `docs/MCP_UNIQUE_TASKS.md` — MCP-unique task system