Commit 9aa95a0

sjarmak and claude committed
feat: add standalone retrieval evaluation framework (v1)
Implements a 5-stage pipeline for evaluating retrieval quality, utilization, error taxonomy, and downstream impact without changing primary CCB scoring.

New files:

- `schemas/retrieval_events_schema.json` — normalized event schema (v1.0)
- `docs/RETRIEVAL_EVAL_SPEC.md` — spec with field semantics, pipeline stages, rollout boundaries, and future integration points
- `scripts/normalize_retrieval_events.py` — trace normalization CLI
- `scripts/compute_retrieval_metrics.py` — standalone file-level IR metrics
- `scripts/retrieval_eval_pipeline.py` — full 5-stage pipeline (file/chunk metrics, utilization probes, error taxonomy, artifact emission)
- `scripts/retrieval_impact_analysis.py` — correlation + matched comparison
- `scripts/generate_retrieval_report.py` — Markdown report generator

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1 parent 0073dc9 commit 9aa95a0

7 files changed: +3498 −0

docs/RETRIEVAL_EVAL_SPEC.md

Lines changed: 338 additions & 0 deletions
# Retrieval Evaluation Specification

> **Status**: v1 — standalone, non-ranking.
> This framework evaluates retrieval quality and its downstream impact on task
> outcomes without changing primary CCB scoring or leaderboard semantics.
## Purpose

Measure three aspects of agent retrieval behavior:

1. **Retrieval quality** — did the agent find the right files/symbols?
2. **Utilization quality** — did the agent use retrieved evidence correctly?
3. **Downstream impact** — how do retrieval metrics correlate with task
   outcomes, cost, and time?
## Schema Overview

The normalized retrieval event schema
(`schemas/retrieval_events_schema.json`, version 1.0) defines a single
JSON document per task-config pair containing:

| Section | Purpose |
|---------|---------|
| `provenance` | Run/task/config identification |
| `coverage` | Trace and ground-truth availability flags |
| `ground_truth` | Expected files, optional symbols and chunks |
| `events` | Ordered step-level retrieval events |
| `summary` | Pre-computed aggregate counts (optional) |
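To make the five sections concrete, here is a minimal sketch of one normalized document as a Python literal. All field values are invented for illustration; only the section and field names come from this spec:

```python
# Hypothetical example of a normalized retrieval event document.
# Run/task/config values are invented; the structure follows the schema overview.
minimal_doc = {
    "schema_version": "1.0",
    "provenance": {
        "run_id": "run-example",              # invented example value
        "task_name": "example-fix-task",      # invented example value
        "config_name": "baseline-local-direct",
        "benchmark": "ccb_fix",
    },
    "coverage": {
        "has_trajectory": True,
        "has_transcript": True,
        "has_ground_truth": True,
        "has_chunk_ground_truth": False,
        "trace_source": "merged",
        "degraded_reason": None,
    },
    "ground_truth": {"files": ["src/parser.py"], "symbols": [], "chunks": []},
    "events": [
        {
            "step_index": 0,
            "tool_name": "Read",
            "tool_category": "file_read",
            "is_mcp": False,
            "target_files": ["src/parser.py"],
            "hits_ground_truth": True,
        }
    ],
    "summary": {"total_events": 1, "ground_truth_files_hit": 1},
}
```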
## Field Semantics

### Provenance

Uniquely identifies the task execution:

- `run_id` — staging or official run directory name.
- `batch_timestamp` — batch subdirectory within the run.
- `task_name` — canonical task identifier (matches `task.toml` name).
- `config_name` — full config label (e.g. `baseline-local-direct`,
  `mcp-remote-direct`).
- `benchmark` — suite name (e.g. `ccb_fix`, `ccb_mcp_crossorg`).
### Coverage Flags

Every document reports trace availability explicitly so downstream stages
can filter or flag results:

- `has_trajectory` — `agent/trajectory.json` was found and parseable.
- `has_transcript` — `agent/claude-code.txt` (JSONL) was found and parseable.
- `has_ground_truth` — file-level expected files exist for the task.
- `has_chunk_ground_truth` — line-range annotations exist (e.g. defect
  locations in code-review tasks).
- `trace_source` — which source produced the events:
  - `trajectory` — events from `trajectory.json` only.
  - `transcript` — events from `claude-code.txt` only.
  - `merged` — events from both sources combined (trajectory preferred for
    tool calls, transcript for timestamps or subagent recovery).
  - `null` — degraded mode (no usable trace).
- `degraded_reason` — human-readable explanation when events are empty or
  incomplete.
### Ground Truth

Ground truth is loaded from the task definition directory using the existing
priority chain in `ccb_metrics/ground_truth.py`:

1. `tests/ground_truth.json` (high confidence)
2. `tests/expected_defects.json` (high confidence)
3. `tests/expected_changes.json` (high confidence)
4. `tests/reference_fix.patch` / `tests/expected.diff` (high confidence)
5. `solution/solve.sh` gold patch (medium confidence)
6. `instruction.md` / `tests/test.sh` regex extraction (medium/low confidence)

Three levels of ground truth are supported:

- **File-level** (`ground_truth.files`) — always populated when ground truth
  exists. Repo-relative paths.
- **Symbol-level** (`ground_truth.symbols`) — optional. Function/class names
  within ground-truth files, loaded from `task_spec.json` oracle items.
- **Chunk-level** (`ground_truth.chunks`) — optional. Line ranges within files,
  loaded from `expected_defects.json` annotations or similar.

When `coverage.has_ground_truth` is false, `ground_truth.files` is an empty
array and all IR metrics are marked as non-computable.
### Retrieval Events

Each event represents one retrieval-related tool call by the agent:

- `step_index` — zero-based position in the trace. Preserves execution order.
- `tool_name` — raw name from the trace (e.g. `Read`,
  `mcp__sourcegraph__sg_keyword_search`).
- `tool_category` — normalized category for cross-config comparison:

  | Category | Local tools | MCP tools |
  |----------|-------------|-----------|
  | `file_read` | Read | read_file |
  | `file_search` | Glob, Grep | list_files |
  | `symbol_navigation` | — | find_references, go_to_definition |
  | `code_search` | Grep (pattern) | keyword_search, nls_search |
  | `commit_search` | — | commit_search, diff_search, compare_revisions |
  | `deep_search` | — | deepsearch, deepsearch_read |
  | `file_write` | Write, Edit | — |
  | `other` | Bash, Task | get_contributor_repos, list_repos |

- `is_mcp` — true for any `mcp__sourcegraph__*` tool call.
- `target_files` — normalized file paths accessed or returned. Normalization
  strips `/workspace/`, `/repo_full/`, `/testbed/`, and diff `a/`/`b/` prefixes;
  paths are lowercased for matching.
- `hits_ground_truth` — true if any `target_file` matches a ground-truth file.
- `cumulative_tokens` — running token total up to this step (when available).
- `elapsed_seconds` — wall-clock time from agent execution start.
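The path normalization rule for `target_files` can be sketched as follows. This is an illustrative helper, not the pipeline's actual implementation; the function name is invented:

```python
def normalize_target_path(raw: str) -> str:
    """Strip known workspace/diff prefixes and lowercase for matching.

    Illustrative sketch of the normalization described above; the real
    normalizer in scripts/normalize_retrieval_events.py may differ.
    """
    path = raw.lstrip("/")
    # Diff-style prefixes: a/foo.py, b/foo.py
    for diff_prefix in ("a/", "b/"):
        if path.startswith(diff_prefix):
            path = path[len(diff_prefix):]
    # Sandbox mount prefixes.
    for mount in ("workspace/", "repo_full/", "testbed/"):
        if path.startswith(mount):
            path = path[len(mount):]
    return path.lower()
```

For example, `/workspace/src/App.py` and `b/src/App.py` both normalize to `src/app.py`, so the same file matches ground truth regardless of which tool reported it.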
### Event Summary

Optional pre-computed counts to avoid re-scanning the events array:

- `total_events`, `mcp_events`, `local_events`
- `unique_files_accessed`, `ground_truth_files_hit`
- `first_ground_truth_hit_step`
- `events_by_category` (keyed by `tool_category`)
## Degraded Mode Behavior

The pipeline handles incomplete data gracefully:

| Condition | Behavior |
|-----------|----------|
| No trajectory AND no transcript | `events` is empty, `coverage.trace_source` is null, `coverage.degraded_reason` explains why |
| Trajectory only (no transcript) | Events extracted from trajectory; timestamps may be absent for some steps |
| Transcript only (no trajectory) | Events extracted from transcript; subagent tool calls may be missed |
| No ground truth | `ground_truth.files` is empty; `hits_ground_truth` is false for all events; IR metrics non-computable |
| No chunk ground truth | `ground_truth.chunks` absent; chunk-level metrics emit `resolution: "file_level_only"` flag |

Downstream metric stages MUST check `coverage` flags before computing metrics
and propagate appropriate `non_computable` markers rather than emitting
misleading zeroes.
## Schema Versioning

- The `schema_version` field is a semver-style string (currently `"1.0"`).
- **Minor bumps** (1.1, 1.2, ...) add optional fields. Consumers of 1.0 data
  continue to work unchanged.
- **Major bumps** (2.0) change required fields or remove/rename existing ones.
  Consumers must update.
- The normalization CLI embeds the schema version it was built against.
  Metric stages validate `schema_version` on load and reject unknown major
  versions.
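A load-time version check consistent with these rules might look like the following. The function name and supported-version constant are assumptions for this sketch, not the pipeline's actual API:

```python
SUPPORTED_MAJOR = 1  # assumed constant; the real consumer embeds its own

def check_schema_version(doc: dict) -> None:
    """Reject documents whose schema_version has an unknown major version."""
    version = doc.get("schema_version", "")
    try:
        major = int(version.split(".", 1)[0])
    except ValueError:
        raise ValueError(f"malformed schema_version: {version!r}")
    if major != SUPPORTED_MAJOR:
        raise ValueError(
            f"unsupported schema_version {version!r}; "
            f"this consumer handles major version {SUPPORTED_MAJOR}"
        )
    # Minor bumps (1.1, 1.2, ...) only add optional fields, so they pass.
```

Note that a 1.2 document passes unchanged under a 1.0 consumer, while a 2.0 document is rejected, matching the minor/major semantics above.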
## Output Paths

Normalized retrieval event files are written to a parallel directory structure
that does not overwrite existing run artifacts:

```
runs/{staging|official}/{run_id}/retrieval_events/
  {config_name}/
    {task_name}.retrieval_events.json
```

Run-level aggregates are written alongside:

```
runs/{staging|official}/{run_id}/retrieval_events/
  run_retrieval_summary.json
```
## Pipeline Stages

The full evaluation pipeline (`scripts/retrieval_eval_pipeline.py`) runs five
stages on each normalized event document:

### Stage 1: File-Level IR Metrics

Standard information retrieval metrics computed from the ordered list of
retrieved files against ground-truth files:

- **Precision@K, Recall@K, F1@K** (K = 1, 3, 5, 10)
- **MRR** (Mean Reciprocal Rank)
- **nDCG@K** (normalized Discounted Cumulative Gain)
- **MAP** (Mean Average Precision)
- **File-level recall** (fraction of GT files found anywhere in retrieved list)
- **Context efficiency** (fraction of retrieved files that are relevant)
- **TTFR** (time-to-first-relevant file, in seconds and tokens)

Tasks without ground truth are marked `computable: false`.
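Two of these list-based metrics can be sketched directly from the ordered retrieval list. This is illustrative only; the function names are assumptions, and the real implementations live in `scripts/compute_retrieval_metrics.py`:

```python
def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Precision@K: relevant hits among the first k retrieved files, divided by k."""
    if k <= 0 or not retrieved:
        return 0.0
    hits = sum(1 for f in retrieved[:k] if f in relevant)
    return hits / k

def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """Reciprocal rank of the first relevant file (MRR averages this across tasks)."""
    for rank, f in enumerate(retrieved, start=1):
        if f in relevant:
            return 1.0 / rank
    return 0.0
```

For instance, if the agent's ordered file accesses are `["a", "b", "c"]` and ground truth is `{"a", "c"}`, Precision@2 is 0.5 and the reciprocal rank is 1.0.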
### Stage 2: Chunk-Level Relevance Metrics

When chunk-level ground truth (line-range annotations) is available:

- **Chunk recall** = fraction of GT chunks whose file was accessed by the agent.
- **Resolution** field: `"chunk_level"` or `"file_level_only"`.
- **Validity** field: `"file_match_only"` (v1 granularity) or `"unsupported"`.

**Chunking assumption**: In v1, a retrieval event "covers" a ground-truth
chunk if any `target_file` matches the chunk's file path. Line-level matching
(e.g. exact line-range overlap) requires structured diff data and is deferred
to future schema versions.
### Stage 3: Utilization Probe Metrics

Measures whether retrieved evidence was actually *used* by the agent:

- **`util_referenced_file_correctness`** = |files_written ∩ GT| / |GT|.
  Measures whether the agent wrote to the correct files after retrieval.
- **`util_read_before_write_ratio`** = fraction of written files that were
  read by the agent before being written to. High values indicate deliberate
  evidence consumption.

**Coverage**: `probe_available: false` when the agent performed no file writes
or when no ground truth exists. The probe requires write events to measure
utilization — read-only tasks produce no utilization signal.

**Limitations**: These probes measure file-level correctness only. They do
not validate whether the *content* written was semantically correct (that is
the verifier's job). Future probes may add symbol-level or API-level checks.
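The two probes can be sketched as a single pass over the ordered event stream. The `(tool_category, file_path)` pair shape is an assumption for illustration, not the real event schema:

```python
def utilization_probes(events: list[tuple[str, str]],
                       ground_truth: set[str]) -> dict:
    """Sketch of the two v1 utilization probes described above.

    `events` is an ordered list of (tool_category, file_path) pairs;
    this flattened shape is assumed for illustration only.
    """
    read_so_far: set[str] = set()
    written: set[str] = set()
    read_before_write: set[str] = set()
    for category, path in events:
        if category == "file_read":
            read_so_far.add(path)
        elif category == "file_write":
            if path in read_so_far:
                read_before_write.add(path)
            written.add(path)
    # No writes or no ground truth: the probe has nothing to measure.
    if not written or not ground_truth:
        return {"probe_available": False}
    return {
        "probe_available": True,
        "util_referenced_file_correctness":
            len(written & ground_truth) / len(ground_truth),
        "util_read_before_write_ratio":
            len(read_before_write) / len(written),
    }
```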
### Stage 4: Error Taxonomy and Calibration Slices

Five taxonomy labels classify retrieval error modes per-task:

| Label | Definition |
|-------|-----------|
| `irrelevant_retrieval` | Files retrieved that are not in ground truth |
| `missed_key_evidence` | Ground truth files never retrieved |
| `wrong_evidence_used` | Non-GT files the agent wrote to |
| `unused_correct_retrieval` | GT files retrieved but never written to |
| `ambiguity_near_miss` | Retrieved files in the same directory as a GT file |

Two calibration slice dimensions:

- **Candidate set size**: `small` (≤5 files), `medium` (6–20), `large` (>20)
- **Evidence type**: `local` (no MCP tools used) or `mcp` (at least one MCP call)
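Each taxonomy label reduces to a set operation over retrieved files, written files, and ground truth. A minimal sketch, where the directory-prefix test for `ambiguity_near_miss` is an assumption about how the pipeline defines "same directory":

```python
import posixpath

def taxonomy_labels(retrieved: set[str], written: set[str],
                    ground_truth: set[str]) -> dict[str, set[str]]:
    """Sketch of the five per-task taxonomy labels as set operations."""
    gt_dirs = {posixpath.dirname(f) for f in ground_truth}
    non_gt_retrieved = retrieved - ground_truth
    return {
        "irrelevant_retrieval": non_gt_retrieved,
        "missed_key_evidence": ground_truth - retrieved,
        "wrong_evidence_used": written - ground_truth,
        "unused_correct_retrieval": (retrieved & ground_truth) - written,
        # Assumed definition: a non-GT file sharing a directory with a GT file.
        "ambiguity_near_miss": {
            f for f in non_gt_retrieved if posixpath.dirname(f) in gt_dirs
        },
    }
```

Note the labels are not mutually exclusive: a near-miss file is also counted as an irrelevant retrieval.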
### Stage 5: Artifact Emission

Per-task artifacts (`{task_name}.retrieval_metrics.json`) contain all four
metric stages plus provenance and coverage metadata. Run-level summaries
(`run_retrieval_summary.json`) contain aggregated statistics across all
computable tasks.
## Relationship to Existing Pipeline

This evaluation is **standalone and non-ranking** in v1:

- Does not modify `result.json`, `task_metrics.json`, or `MANIFEST.json`.
- Does not affect verifier rewards or leaderboard scoring.
- Consumes the same run artifacts as `ir_analysis.py` and `mcp_audit.py`.
- Future versions may feed retrieval metrics into `generate_eval_report.py`
  as an optional supplementary section.
## v1 Rollout Boundaries

### What v1 Does

- Normalizes agent traces into step-level retrieval events.
- Computes file-level IR metrics, chunk-level metrics (with fallback),
  utilization probes, and error taxonomy.
- Correlates retrieval metrics with task outcomes (association only).
- Generates matched task comparisons between baseline and MCP configs.
- Produces standalone human-readable reports.

### What v1 Does NOT Do

- Does not change verifier rewards, leaderboard scoring, or `MANIFEST.json`.
- Does not block or gate benchmark runs on retrieval quality.
- Does not modify existing evaluation pipeline outputs.
- Does not claim causal relationships between retrieval and outcomes.
### Comparability Requirements

Matched task comparisons require:

- **Same task** executed in both baseline and MCP configs.
- **Same model** and harness version across paired configs.
- **`result.json` present** with a valid reward for both configs.
- **At least 3 matched tasks** for aggregate statistics.

Unmatched tasks (present in one config but not the other) are excluded from
matched comparisons but included in per-config aggregates.
### Coverage Caveats

- Tasks without file-level ground truth (MCP-unique discovery tasks,
  write-only tasks) are excluded from IR metrics.
- Tasks in degraded mode (no trajectory or transcript) emit empty events
  and are flagged in coverage metadata.
- Chunk-level metrics operate at file-match granularity in v1.
## Future Integration Points

The following touchpoints exist for optional future integration. **None of
these should be implemented without explicit policy discussion.**

### `docs/EVALUATION_PIPELINE.md`

- **Optional Layer 5**: Add retrieval evaluation as an optional post-run
  analysis layer alongside the existing 4-layer pipeline.
- Retrieval metrics could appear as supplementary columns in the eval report
  tables without affecting the primary scoring dimensions.

### `docs/SCORING_SEMANTICS.md`

- **Retrieval-aware composite scores**: A future version could define a
  weighted composite that includes retrieval quality alongside verifier
  reward. This would require consensus on weight calibration and must not
  change existing per-task reward semantics.
- **Confidence gating**: Tasks with low retrieval coverage could receive
  confidence flags that downstream consumers use for filtering but not
  score modification.

### `docs/MCP_UNIQUE_TASKS.md` / `docs/MCP_UNIQUE_CALIBRATION.md`

- **Oracle coverage integration**: MCP-unique task oracle items could be
  mapped to retrieval events for oracle-aware retrieval scoring.
- **Deep Search effectiveness**: The `deep_search` tool category enables
  future analysis of Deep Search ROI versus keyword/NLS search.

### `docs/LEADERBOARD.md`

- **Retrieval-conditioned rankings**: Future leaderboard views could show
  rankings conditioned on retrieval quality tiers (e.g. "among tasks where
  the agent retrieved ≥50% of ground truth files"). This would be
  supplementary, not replacing the primary ranking.

### `scripts/generate_eval_report.py`

- **Supplementary tables**: A future version of the report generator could
  optionally include retrieval quality tables and correlation summaries
  from the retrieval pipeline output.

## See Also

- `schemas/retrieval_events_schema.json` — JSON Schema definition
- `docs/EVALUATION_PIPELINE.md` — primary evaluation pipeline
- `docs/SCORING_SEMANTICS.md` — reward interpretation
- `docs/MCP_UNIQUE_TASKS.md` — MCP-unique task system
