Commit dba9a39

docs: link retrieval eval pipeline spec
1 parent a5d9671 commit dba9a39

3 files changed: +6 -1 lines changed

AGENTS.md

Lines changed: 1 addition & 0 deletions
@@ -19,6 +19,7 @@ per-task details. See `docs/MCP_UNIQUE_TASKS.md` for the MCP-unique extension.
 - `docs/TASK_CATALOG.md` - current task inventory
 - `docs/SCORING_SEMANTICS.md` - reward/pass interpretation (incl. oracle checks + hybrid scoring)
 - `docs/EVALUATION_PIPELINE.md` - unified eval: verifier → LLM judge → statistics → report
+- `docs/RETRIEVAL_EVAL_SPEC.md` - full retrieval/IR evaluation pipeline (normalized events → metrics/probes/taxonomy artifacts)
 - `docs/MCP_UNIQUE_TASKS.md` - MCP-unique task system (suites, authoring, oracle, DS tasks)
 - `docs/MCP_UNIQUE_CALIBRATION.md` - oracle coverage analysis and threshold calibration data
 - `docs/WORKFLOW_METRICS.md` - timing/cost metric definitions

CLAUDE.md

Lines changed: 1 addition & 0 deletions
@@ -20,6 +20,7 @@ per-task details. See `docs/MCP_UNIQUE_TASKS.md` for the MCP-unique extension.
 - `docs/TASK_CATALOG.md` - current task inventory
 - `docs/SCORING_SEMANTICS.md` - reward/pass interpretation (incl. oracle checks + hybrid scoring)
 - `docs/EVALUATION_PIPELINE.md` - unified eval: verifier → LLM judge → statistics → report
+- `docs/RETRIEVAL_EVAL_SPEC.md` - full retrieval/IR evaluation pipeline (normalized events → metrics/probes/taxonomy artifacts)
 - `docs/MCP_UNIQUE_TASKS.md` - MCP-unique task system (suites, authoring, oracle, DS tasks)
 - `docs/MCP_UNIQUE_CALIBRATION.md` - oracle coverage analysis and threshold calibration data
 - `docs/WORKFLOW_METRICS.md` - timing/cost metric definitions

docs/EVALUATION_PIPELINE.md

Lines changed: 4 additions & 1 deletion
@@ -6,7 +6,10 @@ and statistical modules provide confidence intervals and correlation analysis.

 This document covers the pipeline architecture. For per-benchmark scoring
 details, see [SCORING_SEMANTICS.md](SCORING_SEMANTICS.md). For MCP-unique
-oracle checks, see [MCP_UNIQUE_TASKS.md](MCP_UNIQUE_TASKS.md).
+oracle checks, see [MCP_UNIQUE_TASKS.md](MCP_UNIQUE_TASKS.md). For the full
+retrieval/IR evaluation pipeline (normalized retrieval events, file/chunk IR
+metrics, utilization probes, taxonomy, and emitted artifacts), see
+[RETRIEVAL_EVAL_SPEC.md](RETRIEVAL_EVAL_SPEC.md).

 ---
