Add retrieval eval arm#2
Conversation
|
Warning Review limit reachedYou’ve reached a temporary PR review limit under our Fair Usage Limits Policy. Next review available in: 5 minutes Enable usage-based reviews in Billing to review now. Otherwise, wait until the next included review is available. How can I continue?After more reviews become available, a review can be triggered using the To avoid repeated limits, reduce automatic review volume by pausing incremental auto-reviews earlier, using label-based review opt-in, excluding WIP or generated PR titles, or requesting reviews manually when the PR is ready. If your team needs uninterrupted high-volume reviews, an organization admin can enable usage-based reviews. How do review limits work?CodeRabbit enforces per-developer PR review limits for each organization. Most developers receive the normal plan review availability. For paid Pro and Pro+ PR reviews, CodeRabbit uses adaptive limits for sustained high-volume activity. When a developer's recent PR review activity reaches the 95th percentile or higher among CodeRabbit users, additional reviews become available more gradually as earlier reviews age out of the rolling window. Please refer docs for additional details. Review details⚙️ Run configurationConfiguration used: Organization UI Review profile: ASSERTIVE Plan: Pro Run ID: 📒 Files selected for processing (6)
✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
PR Summary by QodoAdd retrieval evaluation script with embedding adapters, caching, and smoke tests
AI Description
Diagram
High-Level Assessment
Files changed (6)
|
Code Review by Qodo
1. Hardcoded cross-lingual buckets
|
| for target_lang, target_positives in positive_langs.items(): | ||
| name = f"cross:{ranking.pair.query_lang}->{target_lang}" | ||
| if ranking.pair.query_lang != target_lang and name in {"cross:ja->en", "cross:en->ja"}: | ||
| bucket_items.setdefault(name, []).append((ranking, target_positives)) |
There was a problem hiding this comment.
1. Hardcoded cross-lingual buckets 🐞 Bug ≡ Correctness
compute_buckets() only emits cross-lingual buckets for ja->en and en->ja, so any other language pair is silently omitted from the metrics output. This conflicts with the retrieval schema documentation that describes cross-lingual buckets as generally inferred from query_lang and positive document languages.
Agent Prompt
### Issue description
`compute_buckets()` currently filters cross-lingual buckets to only `{cross:ja->en, cross:en->ja}`. This causes missing cross-lingual metrics for any other query/document language combinations, while the schema docs describe cross-lingual buckets as inferred in general.
### Issue Context
This affects both `metrics.json` and the printed table, because those are sourced from the buckets produced by `compute_buckets()`.
### Fix Focus Areas
- scripts/eval_retrieval.py[223-242]
- scripts/lib/retrieval_schema.py[20-26]
### Suggested fix
- Remove the hard-coded filter and emit a cross bucket for every `query_lang != target_lang` pair (e.g., always add `cross:{query_lang}->{target_lang}`).
- If the intent is to only support ja/en for now, then:
- keep the filter but update the schema/module docstrings and CLI help text to explicitly state the limitation, and
- consider emitting a warning when other language pairs are encountered so omissions are not silent.
ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools
| try: | ||
| payload = json.loads(path.read_text(encoding="utf-8")) | ||
| except (OSError, json.JSONDecodeError): | ||
| return None | ||
| embedding = payload.get("embedding") | ||
| if not isinstance(embedding, list) or len(embedding) != dim: | ||
| return None | ||
| return [float(value) for value in embedding] |
There was a problem hiding this comment.
2. Cache parse can crash 🐞 Bug ☼ Reliability
_read_cached_embedding() converts cached embedding elements using float(value) without handling ValueError/TypeError, so a syntactically valid cache file with non-numeric values will abort evaluation instead of being treated as a cache miss. This makes runs brittle when cache entries are corrupted/tampered or otherwise contain invalid numeric payloads.
Agent Prompt
### Issue description
`_read_cached_embedding()` gracefully handles unreadable JSON (`JSONDecodeError`) but does not guard the `float(...)` conversions, so a cache entry with non-numeric embedding values can raise and crash the evaluation.
### Issue Context
This function is called for every (dim, doc) pair during cache probing, so a single malformed cache entry can break an entire run.
### Fix Focus Areas
- scripts/eval_retrieval.py[299-316]
### Suggested fix
- Wrap the float conversion in a try/except and treat conversion failures as a cache miss, e.g.:
- `try: return [float(v) for v in embedding] except (TypeError, ValueError): return None`
- (Optional) Add a small unit test that writes a cache file with `"embedding": ["nan"]` or `"embedding": [null]` and verifies it is treated as a miss and triggers re-embedding rather than crashing.
ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools
There was a problem hiding this comment.
Pull request overview
This PR introduces a new retrieval-evaluation “arm” to the repository, enabling query→document retrieval scoring for embedding models via hf: and api: model specs, with caching and bucketed metrics, plus a JA/EN smoke fixture and tests to validate core behaviors.
Changes:
- Add
scripts/eval_retrieval.pyto run retrieval evaluation, emitmetrics.json, and report bucketed metrics + p50/p95 query embedding latency. - Add retrieval helpers: JSONL schema loading/validation (
scripts/lib/retrieval_schema.py) and embedding adapter plumbing (scripts/lib/embed_adapters.py) including dim truncation/renorm and cache-keyed corpus embeddings. - Add tests + fixtures for metric correctness, dim-defaulting, truncation behavior, language canonicalization, and cache hit/miss accounting.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
| scripts/eval_retrieval.py | New retrieval evaluation CLI + caching, ranking, metrics bucketing, and JSON report output. |
| scripts/lib/embed_adapters.py | Model-spec parsing and embedding adapter implementations (HF + dummy API), plus truncation/normalization utilities. |
| scripts/lib/retrieval_schema.py | JSONL schema loaders/validators for pairs and corpus, including language canonicalization. |
| tests/test_eval_retrieval.py | Test coverage for metrics, dim handling, truncation/renorm, cache behavior, and language normalization. |
| tests/fixtures/retr_pairs_smoke.jsonl | Small JA/EN retrieval-pairs smoke fixture. |
| tests/fixtures/retr_corpus_smoke.jsonl | Small JA/EN retrieval-corpus smoke fixture. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| for part in raw.split(","): | ||
| value = part.strip() | ||
| if not value: | ||
| continue | ||
| dim = int(value) | ||
| if dim <= 0: | ||
| raise ValueError("dims must be positive integers") | ||
| dims.append(dim) |
| for target_lang, target_positives in positive_langs.items(): | ||
| name = f"cross:{ranking.pair.query_lang}->{target_lang}" | ||
| if ranking.pair.query_lang != target_lang and name in {"cross:ja->en", "cross:en->ja"}: | ||
| bucket_items.setdefault(name, []).append((ranking, target_positives)) |
| def dot(left: list[float], right: list[float]) -> float: | ||
| return sum(a * b for a, b in zip(left, right)) |
|
|
||
|
|
||
| def load_corpus(path: Path) -> list[RetrievalDoc]: | ||
| docs = [RetrievalDoc.from_dict(row, line_no) for line_no, row in _read_jsonl(path)] |
|
|
||
|
|
||
| def load_pairs(path: Path) -> list[RetrievalPair]: | ||
| pairs = [RetrievalPair.from_dict(row, line_no) for line_no, row in _read_jsonl(path)] |
Summary
scripts/eval_retrieval.pyfor query-to-document retrieval evaluation withhf:andapi:adapter specs.Linear: ONE-1329
Gate
Rebased onto
origin/mainat17c05d6 Wire temporal classifier into conversion (#1)before final gate.Fusion Pass
Ran one post-gate fusion review pass. Findings folded into this PR: stronger metric cutoff tests, canonical language normalization, native-dimension default for omitted
--dims, partial-cache accounting, and best-effort HF resolved revision for cache keys.