Add retrieval eval arm by olety · Pull Request #2 · oneiron-dev/oneiroNER

olety · 2026-07-02T10:17:28Z

Summary

Adds scripts/eval_retrieval.py for query-to-document retrieval evaluation with hf: and api: adapter specs.
Adds retrieval schema helpers, embedding adapters, cache-keyed corpus embeddings, MRL dim ablation, bucketed retrieval metrics, and p50/p95 query embedding latency.
Adds a 12-document / 6-query JA+EN smoke fixture and tests for metric literals, metric cutoffs, dim truncation/renorm, native-dim defaulting, language normalization, and cache hit/miss behavior.

Linear: ONE-1329

Gate

Rebased onto origin/main at 17c05d6 Wire temporal classifier into conversion (#1) before final gate.

uv venv .venv
uv pip install --python .venv/bin/python -r requirements.txt -q
python -m pytest tests/ -x -q
105 passed in 0.06s

python scripts/validate_mappings.py
ALL CHECKS PASSED

python scripts/eval_retrieval.py --pairs tests/fixtures/retr_pairs_smoke.jsonl --corpus tests/fixtures/retr_corpus_smoke.jsonl --model api:dummy --dims 256,768 --out /tmp/onerb2-smoke
test -f /tmp/onerb2-smoke/metrics.json
metrics.json exists

Fusion Pass

Ran one post-gate fusion review pass. Findings folded into this PR: stronger metric cutoff tests, canonical language normalization, native-dimension default for omitted --dims, partial-cache accounting, and best-effort HF resolved revision for cache keys.

coderabbitai · 2026-07-02T10:17:35Z

Warning

Review limit reached

You’ve reached a temporary PR review limit under our Fair Usage Limits Policy.

Your recent review volume is higher than typical usage, so adaptive limits are currently applied.

Next review available in: 5 minutes

Enable usage-based reviews in Billing to review now. Otherwise, wait until the next included review is available.
You're only billed for reviews past your plan's rate limits ($0.25/file).

How can I continue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

To avoid repeated limits, reduce automatic review volume by pausing incremental auto-reviews earlier, using label-based review opt-in, excluding WIP or generated PR titles, or requesting reviews manually when the PR is ready. If your team needs uninterrupted high-volume reviews, an organization admin can enable usage-based reviews.

How do review limits work?

CodeRabbit enforces per-developer PR review limits for each organization. Most developers receive the normal plan review availability.

For paid Pro and Pro+ PR reviews, CodeRabbit uses adaptive limits for sustained high-volume activity. When a developer's recent PR review activity reaches the 95th percentile or higher among CodeRabbit users, additional reviews become available more gradually as earlier reviews age out of the rolling window.

Please refer docs for additional details.

Review details

⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 243c49ba-676e-4170-bd4c-3637c8bb4db5

📥 Commits

Reviewing files that changed from the base of the PR and between 17c05d6 and 3fb81c4.

📒 Files selected for processing (6)

scripts/eval_retrieval.py
scripts/lib/embed_adapters.py
scripts/lib/retrieval_schema.py
tests/fixtures/retr_corpus_smoke.jsonl
tests/fixtures/retr_pairs_smoke.jsonl
tests/test_eval_retrieval.py

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch codex/onerb-2

_{Comment @coderabbitai help to get the list of available commands.}

qodo-code-review · 2026-07-02T10:18:26Z

PR Summary by Qodo

Add retrieval evaluation script with embedding adapters, caching, and smoke tests

✨ Enhancement 🧪 Tests 🕐 20-40 Minutes

AI Description

• Add CLI to evaluate query→document retrieval with nDCG/MRR/Recall and latency percentiles.
• Introduce embedding adapters (HF + dummy API) plus cache-keyed corpus embeddings and dim
 ablations.
• Add JA/EN smoke fixtures and tests covering metrics cutoffs, dim truncation, language
 normalization, and cache hits.

Diagram

graph TD
  A["scripts/eval_retrieval.py"] --> B["retrieval_schema"] --> C["pairs/corpus JSONL"] --> D["embed_adapters"] --> E[("embedding cache")]
  A --> F["ranking + metrics"] --> G["metrics.json"]
  subgraph Legend
    direction LR
    _cli["CLI/Script"] ~~~ _mod["Module"] ~~~ _data["Data file"] ~~~ _db[("Cache/Store")]
  end

High-Level Assessment

The following are alternative approaches to this PR:

1. Use a standard IR evaluation library (e.g., pytrec_eval)

➕ Reduces risk of subtle metric/cutoff mistakes (nDCG/MRR variants).
➕ Easier to extend to more metrics (MAP, R-prec, etc.) consistently.
➖ Adds dependency + integration friction for a small internal CLI.
➖ May require reshaping inputs into qrels/run formats.

2. Vectorize ranking with NumPy (matrix multiply) instead of Python loops

➕ Much faster for larger corpora and multiple dims.
➕ Simplifies consistent dot-product computation and top-k retrieval.
➖ Introduces NumPy dependency (if not already standard) and memory considerations.
➖ More work to keep cache/streaming behavior simple.

3. Batch query embedding instead of per-query calls

➕ More realistic throughput measurement for HF and API providers.
➕ Lower overhead and fewer adapter calls.
➖ Harder to attribute per-query latency percentiles without extra instrumentation.
➖ Requires adapter contract clarity around batching limits.

Recommendation: The PR’s lightweight, dependency-minimal approach is appropriate for a first retrieval-eval arm (especially with strong schema validation + cache-keying + tests). If this is expected to scale beyond smoke/regression runs, the next most valuable upgrade would be vectorized ranking (NumPy) and/or adopting pytrec_eval to reduce long-term metric-maintenance risk.

Files changed (6) +946 / -0

Enhancement (3) +791 / -0

eval_retrieval.pyAdd retrieval evaluation CLI with dim ablations, caching, and bucketed metrics +408/-0

Add retrieval evaluation CLI with dim ablations, caching, and bucketed metrics

• Introduces a CLI that loads retrieval pairs/corpus JSONL, embeds queries/docs via an adapter, and computes nDCG@10/MRR@10/Recall@10/Recall@50 across language buckets. Adds cache-keyed corpus embeddings per (model spec, resolved revision, dim, doc hash) plus p50/p95 query-embedding latency reporting and a tabular stdout formatter.

scripts/eval_retrieval.py

embed_adapters.pyIntroduce embedding adapter interface with HF and deterministic dummy implementations +241/-0

Introduce embedding adapter interface with HF and deterministic dummy implementations

• Adds an EmbeddingAdapter protocol, model-spec parsing for hf:/api: formats, and adapter factory construction. Implements a Hugging Face adapter (SentenceTransformers preferred, Transformers mean-pooling fallback) and a deterministic hash-based dummy API adapter for tests/smoke runs, plus truncation-and-renormalization helpers for MRL-style dim truncation.

scripts/lib/embed_adapters.py

retrieval_schema.pyAdd retrieval pair/corpus schema parsing and validation with language canonicalization +142/-0

Add retrieval pair/corpus schema parsing and validation with language canonicalization

• Defines RetrievalDoc/RetrievalPair dataclasses, JSONL loaders with strict validation (non-empty fields, deduped positives, non-empty corpus), and pair-vs-corpus consistency checks. Normalizes language codes (casefold, '_'→'-', region stripped) for consistent bucketing (e.g., EN-US→en).

scripts/lib/retrieval_schema.py

Tests (3) +155 / -0

retr_corpus_smoke.jsonlAdd 12-document bilingual (JA/EN) retrieval corpus smoke fixture +12/-0

Add 12-document bilingual (JA/EN) retrieval corpus smoke fixture
• Adds a small mixed-language corpus fixture used by retrieval evaluation smoke tests to exercise language buckets and cross-lingual positives.
tests/fixtures/retr_corpus_smoke.jsonl

retr_pairs_smoke.jsonlAdd 6-query retrieval pairs smoke fixture referencing the corpus +6/-0

Add 6-query retrieval pairs smoke fixture referencing the corpus
• Adds query-to-positive-doc pairs spanning both same-language and cross-language positives, referencing the corpus via corpus_ref to test default corpus resolution paths.
tests/fixtures/retr_pairs_smoke.jsonl

test_eval_retrieval.pyAdd tests for retrieval metrics, dim truncation, language normalization, and cache behavior +137/-0

Add tests for retrieval metrics, dim truncation, language normalization, and cache behavior

• Adds smoke tests ensuring metric keys/buckets are stable across dims and that omitted dims default to the adapter’s native dimension. Validates ranking metric cutoffs with multiple positives, tests truncate+renorm math, verifies cache hit/miss/partial-cache accounting via a counting adapter, and confirms canonical language normalization.

tests/test_eval_retrieval.py

qodo-code-review · 2026-07-02T10:21:31Z

Code Review by Qodo

🐞 Bugs (2) 📘 Rule violations (0) 📜 Skill insights (0)

1. Hardcoded cross-lingual buckets 🐞 Bug ≡ Correctness

Description

compute_buckets() only emits cross-lingual buckets for ja->en and en->ja, so any other language pair
is silently omitted from the metrics output. This conflicts with the retrieval schema documentation
that describes cross-lingual buckets as generally inferred from query_lang and positive document
languages.

Code

scripts/eval_retrieval.py[R234-237]

+        for target_lang, target_positives in positive_langs.items():
+            name = f"cross:{ranking.pair.query_lang}->{target_lang}"
+            if ranking.pair.query_lang != target_lang and name in {"cross:ja->en", "cross:en->ja"}:
+                bucket_items.setdefault(name, []).append((ranking, target_positives))

Evidence

The evaluation code explicitly restricts cross-lingual buckets to only ja/en, while the schema
documentation describes cross-lingual bucket inference as a general mechanism based on query_lang
and positive doc languages.

scripts/eval_retrieval.py[223-237]
scripts/lib/retrieval_schema.py[20-25]

Agent prompt

The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
`compute_buckets()` currently filters cross-lingual buckets to only `{cross:ja->en, cross:en->ja}`. This causes missing cross-lingual metrics for any other query/document language combinations, while the schema docs describe cross-lingual buckets as inferred in general.

### Issue Context
This affects both `metrics.json` and the printed table, because those are sourced from the buckets produced by `compute_buckets()`.

### Fix Focus Areas
- scripts/eval_retrieval.py[223-242]
- scripts/lib/retrieval_schema.py[20-26]

### Suggested fix
- Remove the hard-coded filter and emit a cross bucket for every `query_lang != target_lang` pair (e.g., always add `cross:{query_lang}->{target_lang}`).
- If the intent is to only support ja/en for now, then:
 - keep the filter but update the schema/module docstrings and CLI help text to explicitly state the limitation, and
 - consider emitting a warning when other language pairs are encountered so omissions are not silent.

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools

2. Cache parse can crash 🐞 Bug ☼ Reliability

Description

_read_cached_embedding() converts cached embedding elements using float(value) without handling
ValueError/TypeError, so a syntactically valid cache file with non-numeric values will abort
evaluation instead of being treated as a cache miss. This makes runs brittle when cache entries are
corrupted/tampered or otherwise contain invalid numeric payloads.

Code

scripts/eval_retrieval.py[R308-315]

+    try:
+        payload = json.loads(path.read_text(encoding="utf-8"))
+    except (OSError, json.JSONDecodeError):
+        return None
+    embedding = payload.get("embedding")
+    if not isinstance(embedding, list) or len(embedding) != dim:
+        return None
+    return [float(value) for value in embedding]

Evidence

The reader explicitly catches JSONDecodeError/OSError but then performs an unguarded float cast over
the embedding list, which can raise ValueError/TypeError for invalid numeric entries.

scripts/eval_retrieval.py[299-316]

Agent prompt

The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
`_read_cached_embedding()` gracefully handles unreadable JSON (`JSONDecodeError`) but does not guard the `float(...)` conversions, so a cache entry with non-numeric embedding values can raise and crash the evaluation.

### Issue Context
This function is called for every (dim, doc) pair during cache probing, so a single malformed cache entry can break an entire run.

### Fix Focus Areas
- scripts/eval_retrieval.py[299-316]

### Suggested fix
- Wrap the float conversion in a try/except and treat conversion failures as a cache miss, e.g.:
 - `try: return [float(v) for v in embedding] except (TypeError, ValueError): return None`
- (Optional) Add a small unit test that writes a cache file with `"embedding": ["nan"]` or `"embedding": [null]` and verifies it is treated as a miss and triggers re-embedding rather than crashing.

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools

qodo-code-review · 2026-07-02T10:21:32Z

+        for target_lang, target_positives in positive_langs.items():
+            name = f"cross:{ranking.pair.query_lang}->{target_lang}"
+            if ranking.pair.query_lang != target_lang and name in {"cross:ja->en", "cross:en->ja"}:
+                bucket_items.setdefault(name, []).append((ranking, target_positives))


1. Hardcoded cross-lingual buckets 🐞 Bug ≡ Correctness

compute_buckets() only emits cross-lingual buckets for ja->en and en->ja, so any other language pair is silently omitted from the metrics output. This conflicts with the retrieval schema documentation that describes cross-lingual buckets as generally inferred from query_lang and positive document languages.

Agent Prompt

### Issue description `compute_buckets()` currently filters cross-lingual buckets to only `{cross:ja->en, cross:en->ja}`. This causes missing cross-lingual metrics for any other query/document language combinations, while the schema docs describe cross-lingual buckets as inferred in general. ### Issue Context This affects both `metrics.json` and the printed table, because those are sourced from the buckets produced by `compute_buckets()`. ### Fix Focus Areas - scripts/eval_retrieval.py[223-242] - scripts/lib/retrieval_schema.py[20-26] ### Suggested fix - Remove the hard-coded filter and emit a cross bucket for every `query_lang != target_lang` pair (e.g., always add `cross:{query_lang}->{target_lang}`). - If the intent is to only support ja/en for now, then: - keep the filter but update the schema/module docstrings and CLI help text to explicitly state the limitation, and - consider emitting a warning when other language pairs are encountered so omissions are not silent.

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools

qodo-code-review · 2026-07-02T10:21:32Z

+    try:
+        payload = json.loads(path.read_text(encoding="utf-8"))
+    except (OSError, json.JSONDecodeError):
+        return None
+    embedding = payload.get("embedding")
+    if not isinstance(embedding, list) or len(embedding) != dim:
+        return None
+    return [float(value) for value in embedding]


2. Cache parse can crash 🐞 Bug ☼ Reliability

_read_cached_embedding() converts cached embedding elements using float(value) without handling ValueError/TypeError, so a syntactically valid cache file with non-numeric values will abort evaluation instead of being treated as a cache miss. This makes runs brittle when cache entries are corrupted/tampered or otherwise contain invalid numeric payloads.

Agent Prompt

### Issue description `_read_cached_embedding()` gracefully handles unreadable JSON (`JSONDecodeError`) but does not guard the `float(...)` conversions, so a cache entry with non-numeric embedding values can raise and crash the evaluation. ### Issue Context This function is called for every (dim, doc) pair during cache probing, so a single malformed cache entry can break an entire run. ### Fix Focus Areas - scripts/eval_retrieval.py[299-316] ### Suggested fix - Wrap the float conversion in a try/except and treat conversion failures as a cache miss, e.g.: - `try: return [float(v) for v in embedding] except (TypeError, ValueError): return None` - (Optional) Add a small unit test that writes a cache file with `"embedding": ["nan"]` or `"embedding": [null]` and verifies it is treated as a miss and triggers re-embedding rather than crashing.

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools

Copilot

Pull request overview

This PR introduces a new retrieval-evaluation “arm” to the repository, enabling query→document retrieval scoring for embedding models via hf: and api: model specs, with caching and bucketed metrics, plus a JA/EN smoke fixture and tests to validate core behaviors.

Changes:

Add scripts/eval_retrieval.py to run retrieval evaluation, emit metrics.json, and report bucketed metrics + p50/p95 query embedding latency.
Add retrieval helpers: JSONL schema loading/validation (scripts/lib/retrieval_schema.py) and embedding adapter plumbing (scripts/lib/embed_adapters.py) including dim truncation/renorm and cache-keyed corpus embeddings.
Add tests + fixtures for metric correctness, dim-defaulting, truncation behavior, language canonicalization, and cache hit/miss accounting.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 5 comments.

Show a summary per file

File	Description
scripts/eval_retrieval.py	New retrieval evaluation CLI + caching, ranking, metrics bucketing, and JSON report output.
scripts/lib/embed_adapters.py	Model-spec parsing and embedding adapter implementations (HF + dummy API), plus truncation/normalization utilities.
scripts/lib/retrieval_schema.py	JSONL schema loaders/validators for pairs and corpus, including language canonicalization.
tests/test_eval_retrieval.py	Test coverage for metrics, dim handling, truncation/renorm, cache behavior, and language normalization.
tests/fixtures/retr_pairs_smoke.jsonl	Small JA/EN retrieval-pairs smoke fixture.
tests/fixtures/retr_corpus_smoke.jsonl	Small JA/EN retrieval-corpus smoke fixture.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

+    for part in raw.split(","):
+        value = part.strip()
+        if not value:
+            continue
+        dim = int(value)
+        if dim <= 0:
+            raise ValueError("dims must be positive integers")
+        dims.append(dim)


+        for target_lang, target_positives in positive_langs.items():
+            name = f"cross:{ranking.pair.query_lang}->{target_lang}"
+            if ranking.pair.query_lang != target_lang and name in {"cross:ja->en", "cross:en->ja"}:
+                bucket_items.setdefault(name, []).append((ranking, target_positives))


+def dot(left: list[float], right: list[float]) -> float:
+    return sum(a * b for a, b in zip(left, right))


+
+
+def load_corpus(path: Path) -> list[RetrievalDoc]:
+    docs = [RetrievalDoc.from_dict(row, line_no) for line_no, row in _read_jsonl(path)]


+
+
+def load_pairs(path: Path) -> list[RetrievalPair]:
+    pairs = [RetrievalPair.from_dict(row, line_no) for line_no, row in _read_jsonl(path)]


Add retrieval eval arm

3fb81c4

Copilot AI review requested due to automatic review settings July 2, 2026 10:17

olety marked this pull request as ready for review July 2, 2026 10:17

Copilot started reviewing on behalf of olety July 2, 2026 10:17 View session

olety merged commit 6a2d1bd into main Jul 2, 2026
2 checks passed

olety deleted the codex/onerb-2 branch July 2, 2026 10:19

qodo-code-review Bot reviewed Jul 2, 2026

View reviewed changes

Copilot AI reviewed Jul 2, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Add retrieval eval arm#2

Add retrieval eval arm#2
olety merged 1 commit into
mainfrom
codex/onerb-2

olety commented Jul 2, 2026

Uh oh!

coderabbitai Bot commented Jul 2, 2026 •

edited

Loading

Review limit reached

Uh oh!

qodo-code-review Bot commented Jul 2, 2026

Uh oh!

Uh oh!

qodo-code-review Bot commented Jul 2, 2026

Uh oh!

qodo-code-review Bot Jul 2, 2026

Uh oh!

qodo-code-review Bot Jul 2, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

		def dot(left: list[float], right: list[float]) -> float:
		return sum(a * b for a, b in zip(left, right))



		def load_corpus(path: Path) -> list[RetrievalDoc]:
		docs = [RetrievalDoc.from_dict(row, line_no) for line_no, row in _read_jsonl(path)]



		def load_pairs(path: Path) -> list[RetrievalPair]:
		pairs = [RetrievalPair.from_dict(row, line_no) for line_no, row in _read_jsonl(path)]

Uh oh!

Conversation

olety commented Jul 2, 2026

Summary

Gate

Fusion Pass

Uh oh!

coderabbitai Bot commented Jul 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review limit reached

Uh oh!

qodo-code-review Bot commented Jul 2, 2026

PR Summary by Qodo

Uh oh!

Uh oh!

qodo-code-review Bot commented Jul 2, 2026

Code Review by Qodo

Uh oh!

qodo-code-review Bot Jul 2, 2026

Choose a reason for hiding this comment

Uh oh!

qodo-code-review Bot Jul 2, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

coderabbitai Bot commented Jul 2, 2026 •

edited

Loading