Skip to content

Add retrieval eval arm#2

Merged
olety merged 1 commit into
mainfrom
codex/onerb-2
Jul 2, 2026
Merged

Add retrieval eval arm#2
olety merged 1 commit into
mainfrom
codex/onerb-2

Conversation

@olety

@olety olety commented Jul 2, 2026

Copy link
Copy Markdown
Member

Summary

  • Adds scripts/eval_retrieval.py for query-to-document retrieval evaluation with hf: and api: adapter specs.
  • Adds retrieval schema helpers, embedding adapters, cache-keyed corpus embeddings, MRL dim ablation, bucketed retrieval metrics, and p50/p95 query embedding latency.
  • Adds a 12-document / 6-query JA+EN smoke fixture and tests for metric literals, metric cutoffs, dim truncation/renorm, native-dim defaulting, language normalization, and cache hit/miss behavior.

Linear: ONE-1329

Gate

Rebased onto origin/main at 17c05d6 Wire temporal classifier into conversion (#1) before final gate.

uv venv .venv
uv pip install --python .venv/bin/python -r requirements.txt -q
python -m pytest tests/ -x -q
105 passed in 0.06s

python scripts/validate_mappings.py
ALL CHECKS PASSED

python scripts/eval_retrieval.py --pairs tests/fixtures/retr_pairs_smoke.jsonl --corpus tests/fixtures/retr_corpus_smoke.jsonl --model api:dummy --dims 256,768 --out /tmp/onerb2-smoke
test -f /tmp/onerb2-smoke/metrics.json
metrics.json exists

Fusion Pass

Ran one post-gate fusion review pass. Findings folded into this PR: stronger metric cutoff tests, canonical language normalization, native-dimension default for omitted --dims, partial-cache accounting, and best-effort HF resolved revision for cache keys.

Copilot AI review requested due to automatic review settings July 2, 2026 10:17
@coderabbitai

coderabbitai Bot commented Jul 2, 2026

Copy link
Copy Markdown

Warning

Review limit reached

You’ve reached a temporary PR review limit under our Fair Usage Limits Policy.

Your recent review volume is higher than typical usage, so adaptive limits are currently applied.

Next review available in: 5 minutes

Enable usage-based reviews in Billing to review now. Otherwise, wait until the next included review is available.
You're only billed for reviews past your plan's rate limits ($0.25/file).

How can I continue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

To avoid repeated limits, reduce automatic review volume by pausing incremental auto-reviews earlier, using label-based review opt-in, excluding WIP or generated PR titles, or requesting reviews manually when the PR is ready. If your team needs uninterrupted high-volume reviews, an organization admin can enable usage-based reviews.

How do review limits work?

CodeRabbit enforces per-developer PR review limits for each organization. Most developers receive the normal plan review availability.

For paid Pro and Pro+ PR reviews, CodeRabbit uses adaptive limits for sustained high-volume activity. When a developer's recent PR review activity reaches the 95th percentile or higher among CodeRabbit users, additional reviews become available more gradually as earlier reviews age out of the rolling window.

Please refer docs for additional details.

Review details
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 243c49ba-676e-4170-bd4c-3637c8bb4db5

📥 Commits

Reviewing files that changed from the base of the PR and between 17c05d6 and 3fb81c4.

📒 Files selected for processing (6)
  • scripts/eval_retrieval.py
  • scripts/lib/embed_adapters.py
  • scripts/lib/retrieval_schema.py
  • tests/fixtures/retr_corpus_smoke.jsonl
  • tests/fixtures/retr_pairs_smoke.jsonl
  • tests/test_eval_retrieval.py
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch codex/onerb-2

Comment @coderabbitai help to get the list of available commands.

@olety olety marked this pull request as ready for review July 2, 2026 10:17
@qodo-code-review

Copy link
Copy Markdown

PR Summary by Qodo

Add retrieval evaluation script with embedding adapters, caching, and smoke tests

✨ Enhancement 🧪 Tests 🕐 20-40 Minutes

Grey Divider

AI Description

• Add CLI to evaluate query→document retrieval with nDCG/MRR/Recall and latency percentiles.
• Introduce embedding adapters (HF + dummy API) plus cache-keyed corpus embeddings and dim
 ablations.
• Add JA/EN smoke fixtures and tests covering metrics cutoffs, dim truncation, language
 normalization, and cache hits.
Diagram

graph TD
  A["scripts/eval_retrieval.py"] --> B["retrieval_schema"] --> C["pairs/corpus JSONL"] --> D["embed_adapters"] --> E[("embedding cache")]
  A --> F["ranking + metrics"] --> G["metrics.json"]
  subgraph Legend
    direction LR
    _cli["CLI/Script"] ~~~ _mod["Module"] ~~~ _data["Data file"] ~~~ _db[("Cache/Store")]
  end
Loading
High-Level Assessment

The following are alternative approaches to this PR:

1. Use a standard IR evaluation library (e.g., pytrec_eval)
  • ➕ Reduces risk of subtle metric/cutoff mistakes (nDCG/MRR variants).
  • ➕ Easier to extend to more metrics (MAP, R-prec, etc.) consistently.
  • ➖ Adds dependency + integration friction for a small internal CLI.
  • ➖ May require reshaping inputs into qrels/run formats.
2. Vectorize ranking with NumPy (matrix multiply) instead of Python loops
  • ➕ Much faster for larger corpora and multiple dims.
  • ➕ Simplifies consistent dot-product computation and top-k retrieval.
  • ➖ Introduces NumPy dependency (if not already standard) and memory considerations.
  • ➖ More work to keep cache/streaming behavior simple.
3. Batch query embedding instead of per-query calls
  • ➕ More realistic throughput measurement for HF and API providers.
  • ➕ Lower overhead and fewer adapter calls.
  • ➖ Harder to attribute per-query latency percentiles without extra instrumentation.
  • ➖ Requires adapter contract clarity around batching limits.

Recommendation: The PR’s lightweight, dependency-minimal approach is appropriate for a first retrieval-eval arm (especially with strong schema validation + cache-keying + tests). If this is expected to scale beyond smoke/regression runs, the next most valuable upgrade would be vectorized ranking (NumPy) and/or adopting pytrec_eval to reduce long-term metric-maintenance risk.

Files changed (6) +946 / -0

Enhancement (3) +791 / -0
eval_retrieval.pyAdd retrieval evaluation CLI with dim ablations, caching, and bucketed metrics +408/-0

Add retrieval evaluation CLI with dim ablations, caching, and bucketed metrics

• Introduces a CLI that loads retrieval pairs/corpus JSONL, embeds queries/docs via an adapter, and computes nDCG@10/MRR@10/Recall@10/Recall@50 across language buckets. Adds cache-keyed corpus embeddings per (model spec, resolved revision, dim, doc hash) plus p50/p95 query-embedding latency reporting and a tabular stdout formatter.

scripts/eval_retrieval.py

embed_adapters.pyIntroduce embedding adapter interface with HF and deterministic dummy implementations +241/-0

Introduce embedding adapter interface with HF and deterministic dummy implementations

• Adds an EmbeddingAdapter protocol, model-spec parsing for hf:/api: formats, and adapter factory construction. Implements a Hugging Face adapter (SentenceTransformers preferred, Transformers mean-pooling fallback) and a deterministic hash-based dummy API adapter for tests/smoke runs, plus truncation-and-renormalization helpers for MRL-style dim truncation.

scripts/lib/embed_adapters.py

retrieval_schema.pyAdd retrieval pair/corpus schema parsing and validation with language canonicalization +142/-0

Add retrieval pair/corpus schema parsing and validation with language canonicalization

• Defines RetrievalDoc/RetrievalPair dataclasses, JSONL loaders with strict validation (non-empty fields, deduped positives, non-empty corpus), and pair-vs-corpus consistency checks. Normalizes language codes (casefold, '_'→'-', region stripped) for consistent bucketing (e.g., EN-US→en).

scripts/lib/retrieval_schema.py

Tests (3) +155 / -0
retr_corpus_smoke.jsonlAdd 12-document bilingual (JA/EN) retrieval corpus smoke fixture +12/-0

Add 12-document bilingual (JA/EN) retrieval corpus smoke fixture

• Adds a small mixed-language corpus fixture used by retrieval evaluation smoke tests to exercise language buckets and cross-lingual positives.

tests/fixtures/retr_corpus_smoke.jsonl

retr_pairs_smoke.jsonlAdd 6-query retrieval pairs smoke fixture referencing the corpus +6/-0

Add 6-query retrieval pairs smoke fixture referencing the corpus

• Adds query-to-positive-doc pairs spanning both same-language and cross-language positives, referencing the corpus via corpus_ref to test default corpus resolution paths.

tests/fixtures/retr_pairs_smoke.jsonl

test_eval_retrieval.pyAdd tests for retrieval metrics, dim truncation, language normalization, and cache behavior +137/-0

Add tests for retrieval metrics, dim truncation, language normalization, and cache behavior

• Adds smoke tests ensuring metric keys/buckets are stable across dims and that omitted dims default to the adapter’s native dimension. Validates ranking metric cutoffs with multiple positives, tests truncate+renorm math, verifies cache hit/miss/partial-cache accounting via a counting adapter, and confirms canonical language normalization.

tests/test_eval_retrieval.py

@olety olety merged commit 6a2d1bd into main Jul 2, 2026
2 checks passed
@olety olety deleted the codex/onerb-2 branch July 2, 2026 10:19
@qodo-code-review

Copy link
Copy Markdown

Code Review by Qodo

🐞 Bugs (2) 📘 Rule violations (0) 📜 Skill insights (0)

Grey Divider


Action required

1. Hardcoded cross-lingual buckets 🐞 Bug ≡ Correctness
Description
compute_buckets() only emits cross-lingual buckets for ja->en and en->ja, so any other language pair
is silently omitted from the metrics output. This conflicts with the retrieval schema documentation
that describes cross-lingual buckets as generally inferred from query_lang and positive document
languages.
Code

scripts/eval_retrieval.py[R234-237]

+        for target_lang, target_positives in positive_langs.items():
+            name = f"cross:{ranking.pair.query_lang}->{target_lang}"
+            if ranking.pair.query_lang != target_lang and name in {"cross:ja->en", "cross:en->ja"}:
+                bucket_items.setdefault(name, []).append((ranking, target_positives))
Evidence
The evaluation code explicitly restricts cross-lingual buckets to only ja/en, while the schema
documentation describes cross-lingual bucket inference as a general mechanism based on query_lang
and positive doc languages.

scripts/eval_retrieval.py[223-237]
scripts/lib/retrieval_schema.py[20-25]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
`compute_buckets()` currently filters cross-lingual buckets to only `{cross:ja->en, cross:en->ja}`. This causes missing cross-lingual metrics for any other query/document language combinations, while the schema docs describe cross-lingual buckets as inferred in general.

### Issue Context
This affects both `metrics.json` and the printed table, because those are sourced from the buckets produced by `compute_buckets()`.

### Fix Focus Areas
- scripts/eval_retrieval.py[223-242]
- scripts/lib/retrieval_schema.py[20-26]

### Suggested fix
- Remove the hard-coded filter and emit a cross bucket for every `query_lang != target_lang` pair (e.g., always add `cross:{query_lang}->{target_lang}`).
- If the intent is to only support ja/en for now, then:
 - keep the filter but update the schema/module docstrings and CLI help text to explicitly state the limitation, and
 - consider emitting a warning when other language pairs are encountered so omissions are not silent.

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools



Remediation recommended

2. Cache parse can crash 🐞 Bug ☼ Reliability
Description
_read_cached_embedding() converts cached embedding elements using float(value) without handling
ValueError/TypeError, so a syntactically valid cache file with non-numeric values will abort
evaluation instead of being treated as a cache miss. This makes runs brittle when cache entries are
corrupted/tampered or otherwise contain invalid numeric payloads.
Code

scripts/eval_retrieval.py[R308-315]

+    try:
+        payload = json.loads(path.read_text(encoding="utf-8"))
+    except (OSError, json.JSONDecodeError):
+        return None
+    embedding = payload.get("embedding")
+    if not isinstance(embedding, list) or len(embedding) != dim:
+        return None
+    return [float(value) for value in embedding]
Evidence
The reader explicitly catches JSONDecodeError/OSError but then performs an unguarded float cast over
the embedding list, which can raise ValueError/TypeError for invalid numeric entries.

scripts/eval_retrieval.py[299-316]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
`_read_cached_embedding()` gracefully handles unreadable JSON (`JSONDecodeError`) but does not guard the `float(...)` conversions, so a cache entry with non-numeric embedding values can raise and crash the evaluation.

### Issue Context
This function is called for every (dim, doc) pair during cache probing, so a single malformed cache entry can break an entire run.

### Fix Focus Areas
- scripts/eval_retrieval.py[299-316]

### Suggested fix
- Wrap the float conversion in a try/except and treat conversion failures as a cache miss, e.g.:
 - `try: return [float(v) for v in embedding] except (TypeError, ValueError): return None`
- (Optional) Add a small unit test that writes a cache file with `"embedding": ["nan"]` or `"embedding": [null]` and verifies it is treated as a miss and triggers re-embedding rather than crashing.

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools


Grey Divider

Qodo Logo

Comment thread scripts/eval_retrieval.py
Comment on lines +234 to +237
for target_lang, target_positives in positive_langs.items():
name = f"cross:{ranking.pair.query_lang}->{target_lang}"
if ranking.pair.query_lang != target_lang and name in {"cross:ja->en", "cross:en->ja"}:
bucket_items.setdefault(name, []).append((ranking, target_positives))

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Action required

1. Hardcoded cross-lingual buckets 🐞 Bug ≡ Correctness

compute_buckets() only emits cross-lingual buckets for ja->en and en->ja, so any other language pair
is silently omitted from the metrics output. This conflicts with the retrieval schema documentation
that describes cross-lingual buckets as generally inferred from query_lang and positive document
languages.
Agent Prompt
### Issue description
`compute_buckets()` currently filters cross-lingual buckets to only `{cross:ja->en, cross:en->ja}`. This causes missing cross-lingual metrics for any other query/document language combinations, while the schema docs describe cross-lingual buckets as inferred in general.

### Issue Context
This affects both `metrics.json` and the printed table, because those are sourced from the buckets produced by `compute_buckets()`.

### Fix Focus Areas
- scripts/eval_retrieval.py[223-242]
- scripts/lib/retrieval_schema.py[20-26]

### Suggested fix
- Remove the hard-coded filter and emit a cross bucket for every `query_lang != target_lang` pair (e.g., always add `cross:{query_lang}->{target_lang}`).
- If the intent is to only support ja/en for now, then:
  - keep the filter but update the schema/module docstrings and CLI help text to explicitly state the limitation, and
  - consider emitting a warning when other language pairs are encountered so omissions are not silent.

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools

Comment thread scripts/eval_retrieval.py
Comment on lines +308 to +315
try:
payload = json.loads(path.read_text(encoding="utf-8"))
except (OSError, json.JSONDecodeError):
return None
embedding = payload.get("embedding")
if not isinstance(embedding, list) or len(embedding) != dim:
return None
return [float(value) for value in embedding]

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remediation recommended

2. Cache parse can crash 🐞 Bug ☼ Reliability

_read_cached_embedding() converts cached embedding elements using float(value) without handling
ValueError/TypeError, so a syntactically valid cache file with non-numeric values will abort
evaluation instead of being treated as a cache miss. This makes runs brittle when cache entries are
corrupted/tampered or otherwise contain invalid numeric payloads.
Agent Prompt
### Issue description
`_read_cached_embedding()` gracefully handles unreadable JSON (`JSONDecodeError`) but does not guard the `float(...)` conversions, so a cache entry with non-numeric embedding values can raise and crash the evaluation.

### Issue Context
This function is called for every (dim, doc) pair during cache probing, so a single malformed cache entry can break an entire run.

### Fix Focus Areas
- scripts/eval_retrieval.py[299-316]

### Suggested fix
- Wrap the float conversion in a try/except and treat conversion failures as a cache miss, e.g.:
  - `try: return [float(v) for v in embedding] except (TypeError, ValueError): return None`
- (Optional) Add a small unit test that writes a cache file with `"embedding": ["nan"]` or `"embedding": [null]` and verifies it is treated as a miss and triggers re-embedding rather than crashing.

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR introduces a new retrieval-evaluation “arm” to the repository, enabling query→document retrieval scoring for embedding models via hf: and api: model specs, with caching and bucketed metrics, plus a JA/EN smoke fixture and tests to validate core behaviors.

Changes:

  • Add scripts/eval_retrieval.py to run retrieval evaluation, emit metrics.json, and report bucketed metrics + p50/p95 query embedding latency.
  • Add retrieval helpers: JSONL schema loading/validation (scripts/lib/retrieval_schema.py) and embedding adapter plumbing (scripts/lib/embed_adapters.py) including dim truncation/renorm and cache-keyed corpus embeddings.
  • Add tests + fixtures for metric correctness, dim-defaulting, truncation behavior, language canonicalization, and cache hit/miss accounting.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
scripts/eval_retrieval.py New retrieval evaluation CLI + caching, ranking, metrics bucketing, and JSON report output.
scripts/lib/embed_adapters.py Model-spec parsing and embedding adapter implementations (HF + dummy API), plus truncation/normalization utilities.
scripts/lib/retrieval_schema.py JSONL schema loaders/validators for pairs and corpus, including language canonicalization.
tests/test_eval_retrieval.py Test coverage for metrics, dim handling, truncation/renorm, cache behavior, and language normalization.
tests/fixtures/retr_pairs_smoke.jsonl Small JA/EN retrieval-pairs smoke fixture.
tests/fixtures/retr_corpus_smoke.jsonl Small JA/EN retrieval-corpus smoke fixture.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread scripts/eval_retrieval.py
Comment on lines +55 to +62
for part in raw.split(","):
value = part.strip()
if not value:
continue
dim = int(value)
if dim <= 0:
raise ValueError("dims must be positive integers")
dims.append(dim)
Comment thread scripts/eval_retrieval.py
Comment on lines +234 to +237
for target_lang, target_positives in positive_langs.items():
name = f"cross:{ranking.pair.query_lang}->{target_lang}"
if ranking.pair.query_lang != target_lang and name in {"cross:ja->en", "cross:en->ja"}:
bucket_items.setdefault(name, []).append((ranking, target_positives))
Comment thread scripts/eval_retrieval.py
Comment on lines +280 to +281
def dot(left: list[float], right: list[float]) -> float:
return sum(a * b for a, b in zip(left, right))


def load_corpus(path: Path) -> list[RetrievalDoc]:
docs = [RetrievalDoc.from_dict(row, line_no) for line_no, row in _read_jsonl(path)]


def load_pairs(path: Path) -> list[RetrievalPair]:
pairs = [RetrievalPair.from_dict(row, line_no) for line_no, row in _read_jsonl(path)]
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants