Add gold generation scaffold#3
Conversation
|
Warning Review limit reachedYou’ve reached a temporary PR review limit under our Fair Usage Limits Policy. Next review available in: 50 minutes Enable usage-based reviews in Billing to review now. Otherwise, wait until the next included review is available. How can I continue?After more reviews become available, a review can be triggered using the To avoid repeated limits, reduce automatic review volume by pausing incremental auto-reviews earlier, using label-based review opt-in, excluding WIP or generated PR titles, or requesting reviews manually when the PR is ready. If your team needs uninterrupted high-volume reviews, an organization admin can enable usage-based reviews. How do review limits work?CodeRabbit enforces per-developer PR review limits for each organization. Most developers receive the normal plan review availability. For paid Pro and Pro+ PR reviews, CodeRabbit uses adaptive limits for sustained high-volume activity. When a developer's recent PR review activity reaches the 95th percentile or higher among CodeRabbit users, additional reviews become available more gradually as earlier reviews age out of the rolling window. Please refer docs for additional details. Review details⚙️ Run configurationConfiguration used: Organization UI Review profile: ASSERTIVE Plan: Pro Run ID: 📒 Files selected for processing (8)
✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
PR Summary by QodoAdd synthetic gold generation scaffold with QA export/import and schema 1.2 guards
AI Description
Diagram
High-Level Assessment
Files changed (8)
|
Code Review by Qodo
1. LLM parse can crash
|
There was a problem hiding this comment.
Pull request overview
This PR adds a scripts/goldgen/ scaffold to generate synthetic eval-only “gold candidate” NER records across multiple domain/language buckets, plus a QA export/import loop and tests to validate bucket coverage, span/hallucination gates, schema promotion, and env-based LLM config.
Changes:
- Add bucket definitions and prompt template for chat/notes/journal/reminders across en/ja/zh/ko, including JA hard-case buckets.
- Implement an OpenAI-compatible LLM client and a generator that enforces verbatim-surface + span computation and promotes schema-valid records.
- Add QA export/import scripts and tests for schema validity, rejection paths, and QA round-tripping.
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
| tests/test_goldgen.py | Adds tests for bucket coverage, hallucination/span gating, schema promotion, env config, QA round trip, and dry-run prompt emission. |
| scripts/goldgen/init.py | Introduces the goldgen package scaffold. |
| scripts/goldgen/buckets.yaml | Defines bucket matrix + JA hard-case buckets and default negative types. |
| scripts/goldgen/generate.py | Implements prompt rendering, LLM response parsing, candidate building, schema validation/promotion, and JSONL writing. |
| scripts/goldgen/llm.py | Adds provider-agnostic OpenAI-compatible /chat/completions client and env loading. |
| scripts/goldgen/prompts/candidate.md | Adds the candidate-generation prompt template. |
| scripts/goldgen/qa_export.py | Exports candidate JSONL to Markdown + CSV for human QA. |
| scripts/goldgen/qa_import.py | Imports QA decisions, applies optional fixes, and promotes accepted records to schema-valid synthetic gold. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| "schema_version": "1.0", | ||
| "source": SOURCE, | ||
| "source_id": f"{bucket['name']}-{date.today().isoformat()}-{index}-{uuid.uuid4().hex[:8]}", | ||
| "language": bucket["language"], |
| if bucket["domain"] == "chat" or "turns" in example: | ||
| turns = example.get("turns") | ||
| if not isinstance(turns, list) or len(turns) < 2: | ||
| raise CandidateRejected("Chat candidate must include at least two turns") | ||
| clean_turns = [] | ||
| for turn in turns: | ||
| speaker = turn.get("speaker") | ||
| text = turn.get("text") | ||
| if not isinstance(speaker, str) or not speaker or not isinstance(text, str) or not text: | ||
| raise CandidateRejected(f"Bad turn shape: {turn!r}") | ||
| clean_turns.append({"speaker": speaker, "text": text}) | ||
| spanned = _compute_conversation_entities(clean_turns, entities) | ||
| query_types = _query_types(example, spanned, default_negative_types) | ||
| record = { | ||
| **base, | ||
| "format": "conversation", | ||
| "turns": clean_turns, | ||
| "query_types": query_types, | ||
| "entities": spanned, | ||
| } | ||
| _validate_promotable_candidate(record) | ||
| return record | ||
|
|
||
| text = example.get("text") | ||
| if not isinstance(text, str) or not text: |
| def _apply_fix(candidate: dict[str, Any], decision: dict[str, str]) -> dict[str, Any]: | ||
| fixed = dict(candidate) | ||
| if decision.get("fixed_text"): | ||
| fixed["text"] = decision["fixed_text"] | ||
| fixed.pop("turns", None) | ||
| fixed.pop("format", None) | ||
| if decision.get("fixed_turns_json"): | ||
| fixed["turns"] = json.loads(decision["fixed_turns_json"]) | ||
| fixed["format"] = "conversation" | ||
| fixed.pop("text", None) | ||
| if decision.get("fixed_entities_json"): | ||
| raw_entities = _normal_entities(json.loads(decision["fixed_entities_json"])) | ||
| if fixed.get("format") == "conversation": | ||
| fixed["entities"] = _compute_conversation_entities(fixed["turns"], raw_entities) | ||
| else: | ||
| fixed["entities"] = _compute_passage_entities(fixed["text"], raw_entities) | ||
| return fixed |
| def import_decisions(candidates: Path, decisions: Path, out: Path) -> list[dict[str, Any]]: | ||
| records = {record["source_id"]: record for record in _load_jsonl(candidates)} | ||
| decision_rows = _load_decisions(decisions) | ||
| promoted = [] | ||
| rejected = [] |
| Return strict JSON only, with this shape: | ||
|
|
||
| ```json | ||
| {{ | ||
| "examples": [ | ||
| {{ | ||
| "text": "single passage text, unless domain is chat", | ||
| "turns": [ | ||
| {{"speaker": "A", "text": "chat turn"}}, | ||
| {{"speaker": "B", "text": "chat turn"}} | ||
| ], | ||
| "entities": [ | ||
| {{"surface": "verbatim text span", "type": "PERSON"}} | ||
| ], | ||
| "query_types": ["PERSON", "DATE", "ORG", "PRODUCT"] | ||
| }} | ||
| ] | ||
| }} | ||
| ``` |
| def parse_llm_json(content: str) -> list[dict[str, Any]]: | ||
| payload = json.loads(content) | ||
| examples = payload.get("examples") | ||
| if not isinstance(examples, list): | ||
| raise CandidateRejected("LLM response must contain an examples array") | ||
| return examples |
|
Boundary-check update: ONERB-3 is blocked on AC#4 as written. The SOW requires output records to carry
|
| def parse_llm_json(content: str) -> list[dict[str, Any]]: | ||
| payload = json.loads(content) | ||
| examples = payload.get("examples") | ||
| if not isinstance(examples, list): | ||
| raise CandidateRejected("LLM response must contain an examples array") | ||
| return examples |
| content = llm.chat_completion(prompt) | ||
| examples = parse_llm_json(content) | ||
| records = [] | ||
| rejected = [] |
| def import_decisions(candidates: Path, decisions: Path, out: Path) -> list[dict[str, Any]]: | ||
| records = {record["source_id"]: record for record in _load_jsonl(candidates)} | ||
| decision_rows = _load_decisions(decisions) | ||
| promoted = [] | ||
| rejected = [] | ||
|
|
||
| for source_id, candidate in records.items(): | ||
| decision = decision_rows.get(source_id, {}) | ||
| action = (decision.get("decision") or "").strip().lower() |
| def parse_llm_json(content: str) -> list[dict[str, Any]]: | ||
| payload = json.loads(content) | ||
| examples = payload.get("examples") | ||
| if not isinstance(examples, list): | ||
| raise CandidateRejected("LLM response must contain an examples array") | ||
| return examples |
There was a problem hiding this comment.
1. Llm parse can crash 🐞 Bug ☼ Reliability
parse_llm_json() calls json.loads() and then payload.get("examples") without guarding
JSONDecodeError or non-dict payloads, and generate() does not catch those failures. A single
malformed/unsupported LLM response aborts the whole generation run and prevents writing
candidates/rejected outputs.
Agent Prompt
### Issue description
`scripts/goldgen/generate.parse_llm_json()` assumes the LLM returns valid JSON with a top-level object and an `examples` array. If the provider returns malformed JSON, or a non-object JSON value (e.g., a list/string), the code raises `JSONDecodeError` or `AttributeError` and the whole `generate()` run aborts.
### Issue Context
This is a reliability problem because LLM responses can occasionally be malformed (provider outages, truncation, safety filters, etc.). Right now, those failures happen before any candidate/rejected JSONL is written.
### Fix Focus Areas
- scripts/goldgen/generate.py[79-84]
- scripts/goldgen/generate.py[323-359]
### Implementation notes
- In `parse_llm_json`, catch `json.JSONDecodeError` and raise `CandidateRejected` with a short message (optionally include a small preview of `content`).
- Also validate `payload` is a `dict` before calling `.get()`; otherwise raise `CandidateRejected`.
- In `generate()`, consider catching `CandidateRejected` around `parse_llm_json(content)` and writing a single rejected-row file (or raising a clean `SystemExit` that still leaves the prompt file behind), so the failure is diagnosable.
ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools
| def _apply_fix(candidate: dict[str, Any], decision: dict[str, str]) -> dict[str, Any]: | ||
| fixed = dict(candidate) | ||
| if decision.get("fixed_text"): | ||
| fixed["text"] = decision["fixed_text"] | ||
| fixed.pop("turns", None) | ||
| fixed.pop("format", None) | ||
| if decision.get("fixed_turns_json"): | ||
| fixed["turns"] = _normal_turns(_load_fix_json(decision, "fixed_turns_json")) | ||
| fixed["format"] = "conversation" | ||
| fixed.pop("text", None) | ||
| if decision.get("fixed_entities_json"): | ||
| loaded_entities = _load_fix_json(decision, "fixed_entities_json") | ||
| if not isinstance(loaded_entities, list): | ||
| raise CandidateRejected("fixed_entities_json must contain an entity list") | ||
| raw_entities = _normal_entities(loaded_entities) | ||
| if fixed.get("format") == "conversation": | ||
| fixed["entities"] = _compute_conversation_entities(fixed["turns"], raw_entities) | ||
| else: | ||
| fixed["entities"] = _compute_passage_entities(fixed["text"], raw_entities) | ||
| return fixed |
There was a problem hiding this comment.
2. Qa fixes keep stale spans 🐞 Bug ≡ Correctness
qa_import._apply_fix() updates fixed_text/fixed_turns_json but does not recompute entity start/end spans unless fixed_entities_json is also provided. Any text/turn edit that shifts offsets will cause promoted_schema_record() validation to fail with span mismatch and reject otherwise-valid QA fixes.
Agent Prompt
### Issue description
When QA provides `fixed_text` or `fixed_turns_json`, `_apply_fix()` changes the underlying text but leaves the original `entities` list (including `start`/`end`) untouched unless `fixed_entities_json` is also provided. This creates stale offsets, and promotion fails later because schema validation enforces `text[start:end] == surface`.
### Issue Context
This makes the `fix` workflow fragile: simple edits like adding/removing a character before an entity will cause rejection even if the entity surfaces still appear verbatim.
### Fix Focus Areas
- scripts/goldgen/qa_import.py[56-75]
- scripts/lib/schema.py[44-68]
### Implementation notes
Choose one of the following (either is acceptable, but be explicit):
1) **Auto-recompute spans** whenever `fixed_text` or `fixed_turns_json` is provided and `fixed_entities_json` is empty:
- Build a raw entity list from existing entities using only `surface/type/original_type/(turn_index)` (ignore existing `start/end`).
- Re-run `_compute_passage_entities()` or `_compute_conversation_entities()` to regenerate offsets.
2) **Fail fast with a clear error**: if text/turns changed but entities weren’t provided, raise `CandidateRejected` with a message like "fixed_text/fixed_turns_json requires fixed_entities_json (offsets must be recomputed)".
Option (1) is more forgiving and reduces unnecessary rejections.
ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools
| entities = _normal_entities(example.get("entities", [])) | ||
| metadata = _metadata(bucket, model=model) | ||
| base = { | ||
| "schema_version": "1.2", | ||
| "source": SOURCE, | ||
| "source_id": f"{bucket['name']}-{date.today().isoformat()}-{index}-{uuid.uuid4().hex[:8]}", | ||
| "language": bucket["language"], | ||
| "split": "eval", | ||
| "confidence": CANDIDATE_CONFIDENCE, | ||
| "provenance": _metadata_provenance(metadata), | ||
| "generation_metadata": metadata, | ||
| } |
There was a problem hiding this comment.
3. Source_id date may drift 🐞 Bug ⚙ Maintainability
build_candidate_record() embeds date.today() (local time) into source_id while generation_metadata/provenance uses a UTC generated_at-derived date. Around midnight or on non-UTC systems, a single record can carry inconsistent dates across its ID vs provenance metadata.
Agent Prompt
### Issue description
`source_id` uses `date.today()` (local timezone) but `generation_metadata.date` is derived from a UTC timestamp. These can disagree and create confusing traceability.
### Issue Context
This matters for debugging/deduping and for reliably grouping outputs by a single run date.
### Fix Focus Areas
- scripts/goldgen/generate.py[87-95]
- scripts/goldgen/generate.py[252-271]
### Implementation notes
- Use a single UTC-derived date for both fields (e.g., `metadata['date']`) when constructing `source_id`, or compute `run_date = datetime.now(timezone.utc).date().isoformat()` once and reuse it for both `source_id` and `generation_metadata.date`.
ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools
|
Code review by qodo was updated up to the latest commit 0b9c0cc |
Summary
scripts/goldgenscaffold for synthetic eval gold candidates across the required domain/language buckets and JA hard cases.confidence: "synthetic-gold-candidate", schema1.2, and schema-valid generation metadata (model,prompt_id,generated_at, plus generator context).synthetic-gold, harden malformed QA fixes into rejected rows, dedupe query types while preserving order, and enforce schema-compatible 2-4 turn chat candidates.Refs ONE-1330.
Gate
Rebased onto
origin/mainat1ee9e64 ONERB-6: add schema 1.2 goldgen candidate guardbefore this gate.Scope
Files changed are limited to
scripts/goldgen/andtests/test_goldgen.pyas required by ONERB-3.