Add gold generation scaffold by olety · Pull Request #3 · oneiron-dev/oneiroNER

olety · 2026-07-02T10:20:37Z

Summary

Add scripts/goldgen scaffold for synthetic eval gold candidates across the required domain/language buckets and JA hard cases.
Emit pre-QA candidates with confidence: "synthetic-gold-candidate", schema 1.2, and schema-valid generation metadata (model, prompt_id, generated_at, plus generator context).
Keep QA promotion to synthetic-gold, harden malformed QA fixes into rejected rows, dedupe query types while preserving order, and enforce schema-compatible 2-4 turn chat candidates.
Add focused coverage for schema-1.2 candidate validation, QA malformed-fix rejection, query type dedupe, chat turn limits, dry-run prompts, and the existing hallucination/span gates.

Refs ONE-1330.

Gate

Rebased onto origin/main at 1ee9e64 ONERB-6: add schema 1.2 goldgen candidate guard before this gate.

uv venv .venv
Using CPython 3.13.5
Creating virtual environment at: .venv

uv pip install --python .venv/bin/python -r requirements.txt -q
ok

.venv/bin/python -m pytest tests/ -x -q
136 passed in 8.46s

.venv/bin/python scripts/validate_mappings.py
Train entries: 566
Eval entries:  684
ALL CHECKS PASSED

.venv/bin/python scripts/goldgen/generate.py --bucket chat_ja --n 2 --dry-run --out /tmp/onerb3-smoke
Prompt emitted for chat_ja with JA hard cases and the 2 to 4 turn chat rule.

.venv/bin/python -m pytest tests/test_goldgen.py -q
11 passed in 0.05s

Scope

Files changed are limited to scripts/goldgen/ and tests/test_goldgen.py as required by ONERB-3.

coderabbitai · 2026-07-02T10:20:46Z

Warning

Review limit reached

You’ve reached a temporary PR review limit under our Fair Usage Limits Policy.

Your recent review volume is higher than typical usage, so adaptive limits are currently applied.

Next review available in: 50 minutes

Enable usage-based reviews in Billing to review now. Otherwise, wait until the next included review is available.
You're only billed for reviews past your plan's rate limits ($0.25/file).

How can I continue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

To avoid repeated limits, reduce automatic review volume by pausing incremental auto-reviews earlier, using label-based review opt-in, excluding WIP or generated PR titles, or requesting reviews manually when the PR is ready. If your team needs uninterrupted high-volume reviews, an organization admin can enable usage-based reviews.

How do review limits work?

CodeRabbit enforces per-developer PR review limits for each organization. Most developers receive the normal plan review availability.

For paid Pro and Pro+ PR reviews, CodeRabbit uses adaptive limits for sustained high-volume activity. When a developer's recent PR review activity reaches the 95th percentile or higher among CodeRabbit users, additional reviews become available more gradually as earlier reviews age out of the rolling window.

Please refer docs for additional details.

Review details

⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: d8d8c477-121e-413a-82ff-281f53e07866

📥 Commits

Reviewing files that changed from the base of the PR and between 1ee9e64 and 0b9c0cc.

📒 Files selected for processing (8)

scripts/goldgen/__init__.py
scripts/goldgen/buckets.yaml
scripts/goldgen/generate.py
scripts/goldgen/llm.py
scripts/goldgen/prompts/candidate.md
scripts/goldgen/qa_export.py
scripts/goldgen/qa_import.py
tests/test_goldgen.py

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch codex/onerb-3

_{Comment @coderabbitai help to get the list of available commands.}

qodo-code-review · 2026-07-02T10:22:04Z

PR Summary by Qodo

Add synthetic gold generation scaffold with QA export/import and schema 1.2 guards

✨ Enhancement 🧪 Tests 🕐 40+ Minutes

AI Description

• Introduce scripts/goldgen to generate schema-1.2 synthetic eval gold candidates per bucket.
• Enforce hallucination/span gates, query-type dedupe, and 2–4 turn chat constraints.
• Add QA export/import flow to promote accepted candidates to synthetic-gold safely.

Diagram

graph TD
  A["generate.py (CLI)"] --> B["buckets.yaml"] --> C["candidate.md prompt"] --> D["llm.py (chat API)"]
  A --> E["span_computer (compute_spans_batch)"] --> F["configs/schema.json"]
  A --> G["candidates.jsonl + rejected.jsonl"] --> H["qa_export.py / qa_import.py"]

High-Level Assessment

The following are alternative approaches to this PR:

1. Use OpenAI SDK (or vendor SDK) instead of urllib

➕ Less bespoke HTTP handling and response-shape parsing
➕ Easier support for retries, timeouts, and structured outputs
➖ Adds dependency surface and possible vendor lock-in
➖ Harder to stay provider-agnostic across OpenAI-compatible endpoints

2. Model records with Pydantic (or attrs/dataclasses) for validation

➕ Cleaner invariants for turns/entities/query_types and better error messages
➕ Can unify candidate vs promoted record shaping without key filtering
➖ Additional dependency and a second validation system alongside jsonschema
➖ Still must validate against the repo’s canonical JSON schema

3. Parse true YAML with PyYAML/ruamel.yaml

➕ Allows comments, anchors, and more readable config as buckets expand
➕ Avoids JSON-in-YAML constraint
➖ Introduces a runtime YAML dependency the scaffold explicitly avoids

Recommendation: The current approach is solid for a gated scaffold: JSON-compatible buckets avoid new dependencies, and double validation (jsonschema + internal record validators) reduces risk of producing non-promotable candidates. The main follow-up to consider is making the QA importer rely on a small public helper API (rather than importing underscore-prefixed functions) to keep module boundaries cleaner, but the overall strategy is appropriate for ONERB-3’s constrained scope.

Files changed (8) +1268 / -0

Enhancement (5) +671 / -0

__init__.pyCreate goldgen package marker +2/-0

Create goldgen package marker

• Introduces the 'scripts.goldgen' package with a module docstring to scope the feature as synthetic eval gold generation.

scripts/goldgen/init.py

generate.pyImplement candidate generation with schema 1.2 + hallucination/span gates +376/-0

Implement candidate generation with schema 1.2 + hallucination/span gates

• Adds the main generator CLI and core record-building logic, including prompt rendering, OpenAI-compatible LLM invocation, span computation, query-type dedupe/padding, and strict schema validation. Produces schema-1.2 candidate records with generation metadata and writes rejected examples with reasons; also provides promotion-safe validation for QA acceptance.

scripts/goldgen/generate.py

llm.pyAdd provider-agnostic OpenAI-compatible chat client +81/-0

Add provider-agnostic OpenAI-compatible chat client

• Implements a minimal HTTP client for '/chat/completions' using 'GOLDGEN_MODEL', 'GOLDGEN_API_BASE', and 'GOLDGEN_API_KEY'. Enforces missing-env errors and normalizes API base URLs for consistent request construction.

scripts/goldgen/llm.py

qa_export.pyExport candidates to Markdown + CSV QA worksheet +88/-0

Export candidates to Markdown + CSV QA worksheet

• Adds a QA export utility that renders candidates into a human-reviewable Markdown batch and a CSV decision sheet. Includes entity spans (and turn indices for chat) to help reviewers verify boundaries and correctness.

scripts/goldgen/qa_export.py

qa_import.pyImport QA decisions and promote accepted candidates to synthetic-gold +124/-0

Import QA decisions and promote accepted candidates to synthetic-gold

• Adds a QA import utility that reads CSV decisions, applies optional text/turn/entity fixes, recomputes spans, and promotes accepted/fixed records to 'synthetic-gold'. Malformed fixes and invalid promotions are captured as rejected rows without aborting the batch.

scripts/goldgen/qa_import.py

Tests (1) +351 / -0

test_goldgen.pyAdd focused tests for schema 1.2 candidates, QA flow, and guards +351/-0

Add focused tests for schema 1.2 candidates, QA flow, and guards

• Adds tests ensuring bucket coverage, hallucination rejection, Japanese span computation, candidate/promoted schema validity, query-type dedupe behavior, chat turn limits, dry-run prompt emission, and QA export/import rejection handling for malformed fixes.

tests/test_goldgen.py

Other (2) +246 / -0

buckets.yamlDefine domain/language buckets and JA hard-case buckets +203/-0

Define domain/language buckets and JA hard-case buckets

• Adds bucket configuration covering chat/notes/journal/reminders across en/ja/zh/ko, plus dedicated Japanese hard-case buckets. Also defines a default list of negative entity types used to pad query types.

scripts/goldgen/buckets.yaml

candidate.mdAdd strict JSON prompt template with chat and span rules +43/-0

Add strict JSON prompt template with chat and span rules

• Introduces a prompt template that constrains the model to return strict JSON examples and explicitly requires verbatim entity surfaces. Encodes the 2–4 turn chat rule and negative query-type requirements to reduce invalid outputs.

scripts/goldgen/prompts/candidate.md

qodo-code-review · 2026-07-02T10:23:52Z

Code Review by Qodo

🐞 Bugs (3) 📘 Rule violations (0) 📜 Skill insights (0)

1. LLM parse can crash 🐞 Bug ☼ Reliability ⭐ New

Description

parse_llm_json() calls json.loads() and then payload.get("examples") without guarding
JSONDecodeError or non-dict payloads, and generate() does not catch those failures. A single
malformed/unsupported LLM response aborts the whole generation run and prevents writing
candidates/rejected outputs.

Code

scripts/goldgen/generate.py[R79-84]

+def parse_llm_json(content: str) -> list[dict[str, Any]]:
+    payload = json.loads(content)
+    examples = payload.get("examples")
+    if not isinstance(examples, list):
+        raise CandidateRejected("LLM response must contain an examples array")
+    return examples

Evidence

The code currently performs unguarded JSON parsing and dict access, and the call site does not wrap
those exceptions, so response-shape errors will terminate the run before output is written.

scripts/goldgen/generate.py[79-84]
scripts/goldgen/generate.py[323-359]

Agent prompt

The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
`scripts/goldgen/generate.parse_llm_json()` assumes the LLM returns valid JSON with a top-level object and an `examples` array. If the provider returns malformed JSON, or a non-object JSON value (e.g., a list/string), the code raises `JSONDecodeError` or `AttributeError` and the whole `generate()` run aborts.

### Issue Context
This is a reliability problem because LLM responses can occasionally be malformed (provider outages, truncation, safety filters, etc.). Right now, those failures happen before any candidate/rejected JSONL is written.

### Fix Focus Areas
- scripts/goldgen/generate.py[79-84]
- scripts/goldgen/generate.py[323-359]

### Implementation notes
- In `parse_llm_json`, catch `json.JSONDecodeError` and raise `CandidateRejected` with a short message (optionally include a small preview of `content`).
- Also validate `payload` is a `dict` before calling `.get()`; otherwise raise `CandidateRejected`.
- In `generate()`, consider catching `CandidateRejected` around `parse_llm_json(content)` and writing a single rejected-row file (or raising a clean `SystemExit` that still leaves the prompt file behind), so the failure is diagnosable.

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools

2. QA fixes keep stale spans 🐞 Bug ≡ Correctness ⭐ New

Description

qa_import._apply_fix() updates fixed_text/fixed_turns_json but does not recompute entity start/end
spans unless fixed_entities_json is also provided. Any text/turn edit that shifts offsets will cause
promoted_schema_record() validation to fail with span mismatch and reject otherwise-valid QA fixes.

Code

scripts/goldgen/qa_import.py[R56-75]

+def _apply_fix(candidate: dict[str, Any], decision: dict[str, str]) -> dict[str, Any]:
+    fixed = dict(candidate)
+    if decision.get("fixed_text"):
+        fixed["text"] = decision["fixed_text"]
+        fixed.pop("turns", None)
+        fixed.pop("format", None)
+    if decision.get("fixed_turns_json"):
+        fixed["turns"] = _normal_turns(_load_fix_json(decision, "fixed_turns_json"))
+        fixed["format"] = "conversation"
+        fixed.pop("text", None)
+    if decision.get("fixed_entities_json"):
+        loaded_entities = _load_fix_json(decision, "fixed_entities_json")
+        if not isinstance(loaded_entities, list):
+            raise CandidateRejected("fixed_entities_json must contain an entity list")
+        raw_entities = _normal_entities(loaded_entities)
+        if fixed.get("format") == "conversation":
+            fixed["entities"] = _compute_conversation_entities(fixed["turns"], raw_entities)
+        else:
+            fixed["entities"] = _compute_passage_entities(fixed["text"], raw_entities)
+    return fixed

Evidence

_apply_fix mutates text/turns without updating offsets, while the record validators assert that
offsets must still select the exact surface string; those two behaviors are incompatible after any
offset-shifting edit.

scripts/goldgen/qa_import.py[56-75]
scripts/lib/schema.py[44-68]

Agent prompt

The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
When QA provides `fixed_text` or `fixed_turns_json`, `_apply_fix()` changes the underlying text but leaves the original `entities` list (including `start`/`end`) untouched unless `fixed_entities_json` is also provided. This creates stale offsets, and promotion fails later because schema validation enforces `text[start:end] == surface`.

### Issue Context
This makes the `fix` workflow fragile: simple edits like adding/removing a character before an entity will cause rejection even if the entity surfaces still appear verbatim.

### Fix Focus Areas
- scripts/goldgen/qa_import.py[56-75]
- scripts/lib/schema.py[44-68]

### Implementation notes
Choose one of the following (either is acceptable, but be explicit):
1) **Auto-recompute spans** whenever `fixed_text` or `fixed_turns_json` is provided and `fixed_entities_json` is empty:
  - Build a raw entity list from existing entities using only `surface/type/original_type/(turn_index)` (ignore existing `start/end`).
  - Re-run `_compute_passage_entities()` or `_compute_conversation_entities()` to regenerate offsets.
2) **Fail fast with a clear error**: if text/turns changed but entities weren’t provided, raise `CandidateRejected` with a message like "fixed_text/fixed_turns_json requires fixed_entities_json (offsets must be recomputed)".

Option (1) is more forgiving and reduces unnecessary rejections.

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools

3. ~~QA import can hard-crash~~ ✓ Resolved 🐞 Bug ☼ Reliability

Description

qa_import._apply_fix calls json.loads on fixed_*_json fields without error handling or structural
validation, so malformed JSON or wrong turn shape can raise JSONDecodeError/KeyError/TypeError and
abort the entire import batch.

Code

scripts/goldgen/qa_import.py[R34-50]

+def _apply_fix(candidate: dict[str, Any], decision: dict[str, str]) -> dict[str, Any]:
+    fixed = dict(candidate)
+    if decision.get("fixed_text"):
+        fixed["text"] = decision["fixed_text"]
+        fixed.pop("turns", None)
+        fixed.pop("format", None)
+    if decision.get("fixed_turns_json"):
+        fixed["turns"] = json.loads(decision["fixed_turns_json"])
+        fixed["format"] = "conversation"
+        fixed.pop("text", None)
+    if decision.get("fixed_entities_json"):
+        raw_entities = _normal_entities(json.loads(decision["fixed_entities_json"]))
+        if fixed.get("format") == "conversation":
+            fixed["entities"] = _compute_conversation_entities(fixed["turns"], raw_entities)
+        else:
+            fixed["entities"] = _compute_passage_entities(fixed["text"], raw_entities)
+    return fixed

Evidence
The import path does not catch JSON parsing errors and passes decoded data straight into span
computation which indexes required keys directly; additionally, import_decisions does not catch
CandidateRejected, so any exception will terminate the whole import.
scripts/goldgen/qa_import.py[34-71]
scripts/goldgen/generate.py[173-177]
scripts/goldgen/generate.py[195-206]

Agent prompt

The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`scripts/goldgen/qa_import.py::_apply_fix` assumes `fixed_turns_json` and `fixed_entities_json` are valid JSON and in the expected shape. If a QA editor produces malformed JSON or a wrong structure (e.g., turns missing `text`), `_compute_conversation_entities` will index `turn["text"]` and crash with `KeyError`/`TypeError`, aborting the whole `import_decisions` run.

## Issue Context
This is a human-in-the-loop flow; row-level failures are common and should be reported as `CandidateRejected` (ideally recorded into a rejected file) rather than crashing the batch.

## Fix Focus Areas
- scripts/goldgen/qa_import.py[34-71]
- scripts/goldgen/generate.py[173-177]
- scripts/goldgen/generate.py[195-206]

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools

4. ~~Turn-count schema mismatch~~ ✓ Resolved 🐞 Bug ☼ Reliability

Description

Chat candidates only enforce a minimum of two turns, but ConversationRecord validation rejects
records with more than four turns; since the prompt doesn’t specify the max, the LLM can generate >4
turns and candidates will be rejected.

Code

scripts/goldgen/generate.py[R259-262]

+    if bucket["domain"] == "chat" or "turns" in example:
+        turns = example.get("turns")
+        if not isinstance(turns, list) or len(turns) < 2:
+            raise CandidateRejected("Chat candidate must include at least two turns")

Evidence
Generator-side validation only checks len(turns) < 2, while the canonical conversation validator
enforces an upper bound of 4 turns; the prompt lacks a max-turn constraint, so LLM outputs can
routinely violate the validator.
scripts/goldgen/generate.py[259-269]
scripts/lib/schema.py[135-145]
scripts/goldgen/prompts/candidate.md[34-37]

Agent prompt

The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`build_candidate_record()` accepts any chat `turns` length >= 2, but `ConversationRecord.validate()` enforces `2 <= len(turns) <= 4`. This causes avoidable rejections when the LLM emits 5+ turns.

## Issue Context
The prompt template currently instructs to use `turns` for chat but does not specify an upper bound, making this mismatch likely.

## Fix Focus Areas
- scripts/goldgen/generate.py[259-269]
- scripts/goldgen/prompts/candidate.md[34-37]
- scripts/lib/schema.py[135-145]

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools

5. ~~Duplicate query_types rejected~~ ✓ Resolved 🐞 Bug ≡ Correctness

Description

_query_types preserves duplicates from the LLM-provided query_types list, but the schema and record
validators require query_types to be unique, so otherwise-valid candidates will be rejected as
non-promotable.

Code

scripts/goldgen/generate.py[R126-157]

+def _query_types(example: dict[str, Any], entities: list[dict[str, Any]], default_negatives: list[str]) -> list[str]:
+    positives = []
+    for entity in entities:
+        if entity["type"] not in positives:
+            positives.append(entity["type"])
+
+    query_types = [item for item in example.get("query_types", []) if isinstance(item, str) and item]
+    for item in positives:
+        if item not in query_types:
+            query_types.append(item)
+
+    negative_count = len([item for item in query_types if item not in positives])
+    for item in default_negatives:
+        if negative_count >= 2:
+            break
+        if item not in query_types and item not in positives:
+            query_types.append(item)
+            negative_count += 1
+
+    if len([item for item in query_types if item not in positives]) > 5:
+        allowed = set(positives)
+        trimmed = []
+        negatives = 0
+        for item in query_types:
+            if item in allowed:
+                trimmed.append(item)
+            elif negatives < 5:
+                trimmed.append(item)
+                negatives += 1
+        query_types = trimmed
+
+    return query_types

Evidence
The generator currently returns query_types without deduplicating, but both the JSON schema and
the runtime validators reject duplicates, causing CandidateRejected for otherwise-usable examples.
scripts/goldgen/generate.py[126-157]
configs/schema.json[145-150]
scripts/lib/schema.py[44-52]
scripts/lib/schema.py[145-147]

Agent prompt

The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`scripts/goldgen/generate.py::_query_types` can return duplicate `query_types` (it filters for strings but does not dedupe). `configs/schema.json` requires `query_types` to have `uniqueItems: true`, and `scripts/lib/schema.py` enforces this with assertions, so candidates will be rejected during `_validate_promotable_candidate` / `promoted_schema_record`.

## Issue Context
This is especially likely with LLM output where the model repeats types. The dedupe should preserve order and still guarantee 2–5 negatives after normalization.

## Fix Focus Areas
- scripts/goldgen/generate.py[126-157]

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools

6. source_id date may drift 🐞 Bug ⚙ Maintainability ⭐ New

Description

build_candidate_record() embeds date.today() (local time) into source_id while
generation_metadata/provenance uses a UTC generated_at-derived date. Around midnight or on non-UTC
systems, a single record can carry inconsistent dates across its ID vs provenance metadata.

Code

scripts/goldgen/generate.py[R260-271]

+    entities = _normal_entities(example.get("entities", []))
+    metadata = _metadata(bucket, model=model)
+    base = {
+        "schema_version": "1.2",
+        "source": SOURCE,
+        "source_id": f"{bucket['name']}-{date.today().isoformat()}-{index}-{uuid.uuid4().hex[:8]}",
+        "language": bucket["language"],
+        "split": "eval",
+        "confidence": CANDIDATE_CONFIDENCE,
+        "provenance": _metadata_provenance(metadata),
+        "generation_metadata": metadata,
+    }

Evidence
The metadata date is explicitly derived from a UTC timestamp, while the source_id date is derived
from local date.today(), which can differ near day boundaries.
scripts/goldgen/generate.py[87-95]
scripts/goldgen/generate.py[252-271]

Agent prompt

The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
`source_id` uses `date.today()` (local timezone) but `generation_metadata.date` is derived from a UTC timestamp. These can disagree and create confusing traceability.

### Issue Context
This matters for debugging/deduping and for reliably grouping outputs by a single run date.

### Fix Focus Areas
- scripts/goldgen/generate.py[87-95]
- scripts/goldgen/generate.py[252-271]

### Implementation notes
- Use a single UTC-derived date for both fields (e.g., `metadata['date']`) when constructing `source_id`, or compute `run_date = datetime.now(timezone.utc).date().isoformat()` once and reuse it for both `source_id` and `generation_metadata.date`.

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools

Previous review results

Review updated until commit 0b9c0cc

Results up to commit 97be296

🐞 Bugs (0) 📘 Rule violations (0) 📎 Requirement gaps (0) 🎨 UX issues (0) 🔗 Cross-repo conflicts (0) 📜 Skill insights (0)

1. ~~QA import can hard-crash~~ ✓ Resolved 🐞 Bug ☼ Reliability

Description

qa_import._apply_fix calls json.loads on fixed_*_json fields without error handling or structural
validation, so malformed JSON or wrong turn shape can raise JSONDecodeError/KeyError/TypeError and
abort the entire import batch.

Code

scripts/goldgen/qa_import.py[R34-50]

+def _apply_fix(candidate: dict[str, Any], decision: dict[str, str]) -> dict[str, Any]:
+    fixed = dict(candidate)
+    if decision.get("fixed_text"):
+        fixed["text"] = decision["fixed_text"]
+        fixed.pop("turns", None)
+        fixed.pop("format", None)
+    if decision.get("fixed_turns_json"):
+        fixed["turns"] = json.loads(decision["fixed_turns_json"])
+        fixed["format"] = "conversation"
+        fixed.pop("text", None)
+    if decision.get("fixed_entities_json"):
+        raw_entities = _normal_entities(json.loads(decision["fixed_entities_json"]))
+        if fixed.get("format") == "conversation":
+            fixed["entities"] = _compute_conversation_entities(fixed["turns"], raw_entities)
+        else:
+            fixed["entities"] = _compute_passage_entities(fixed["text"], raw_entities)
+    return fixed

Evidence
The import path does not catch JSON parsing errors and passes decoded data straight into span
computation which indexes required keys directly; additionally, import_decisions does not catch
CandidateRejected, so any exception will terminate the whole import.
scripts/goldgen/qa_import.py[34-71]
scripts/goldgen/generate.py[173-177]
scripts/goldgen/generate.py[195-206]

Agent prompt

The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`scripts/goldgen/qa_import.py::_apply_fix` assumes `fixed_turns_json` and `fixed_entities_json` are valid JSON and in the expected shape. If a QA editor produces malformed JSON or a wrong structure (e.g., turns missing `text`), `_compute_conversation_entities` will index `turn["text"]` and crash with `KeyError`/`TypeError`, aborting the whole `import_decisions` run.

## Issue Context
This is a human-in-the-loop flow; row-level failures are common and should be reported as `CandidateRejected` (ideally recorded into a rejected file) rather than crashing the batch.

## Fix Focus Areas
- scripts/goldgen/qa_import.py[34-71]
- scripts/goldgen/generate.py[173-177]
- scripts/goldgen/generate.py[195-206]

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools

2. ~~Turn-count schema mismatch~~ ✓ Resolved 🐞 Bug ☼ Reliability

Description

Chat candidates only enforce a minimum of two turns, but ConversationRecord validation rejects
records with more than four turns; since the prompt doesn’t specify the max, the LLM can generate >4
turns and candidates will be rejected.

Code

scripts/goldgen/generate.py[R259-262]

+    if bucket["domain"] == "chat" or "turns" in example:
+        turns = example.get("turns")
+        if not isinstance(turns, list) or len(turns) < 2:
+            raise CandidateRejected("Chat candidate must include at least two turns")

Evidence
Generator-side validation only checks len(turns) < 2, while the canonical conversation validator
enforces an upper bound of 4 turns; the prompt lacks a max-turn constraint, so LLM outputs can
routinely violate the validator.
scripts/goldgen/generate.py[259-269]
scripts/lib/schema.py[135-145]
scripts/goldgen/prompts/candidate.md[34-37]

Agent prompt

The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`build_candidate_record()` accepts any chat `turns` length >= 2, but `ConversationRecord.validate()` enforces `2 <= len(turns) <= 4`. This causes avoidable rejections when the LLM emits 5+ turns.

## Issue Context
The prompt template currently instructs to use `turns` for chat but does not specify an upper bound, making this mismatch likely.

## Fix Focus Areas
- scripts/goldgen/generate.py[259-269]
- scripts/goldgen/prompts/candidate.md[34-37]
- scripts/lib/schema.py[135-145]

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools

3. ~~Duplicate query_types rejected~~ ✓ Resolved 🐞 Bug ≡ Correctness

Description

_query_types preserves duplicates from the LLM-provided query_types list, but the schema and record
validators require query_types to be unique, so otherwise-valid candidates will be rejected as
non-promotable.

Code

scripts/goldgen/generate.py[R126-157]

+def _query_types(example: dict[str, Any], entities: list[dict[str, Any]], default_negatives: list[str]) -> list[str]:
+    positives = []
+    for entity in entities:
+        if entity["type"] not in positives:
+            positives.append(entity["type"])
+
+    query_types = [item for item in example.get("query_types", []) if isinstance(item, str) and item]
+    for item in positives:
+        if item not in query_types:
+            query_types.append(item)
+
+    negative_count = len([item for item in query_types if item not in positives])
+    for item in default_negatives:
+        if negative_count >= 2:
+            break
+        if item not in query_types and item not in positives:
+            query_types.append(item)
+            negative_count += 1
+
+    if len([item for item in query_types if item not in positives]) > 5:
+        allowed = set(positives)
+        trimmed = []
+        negatives = 0
+        for item in query_types:
+            if item in allowed:
+                trimmed.append(item)
+            elif negatives < 5:
+                trimmed.append(item)
+                negatives += 1
+        query_types = trimmed
+
+    return query_types

Evidence
The generator currently returns query_types without deduplicating, but both the JSON schema and
the runtime validators reject duplicates, causing CandidateRejected for otherwise-usable examples.
scripts/goldgen/generate.py[126-157]
configs/schema.json[145-150]
scripts/lib/schema.py[44-52]
scripts/lib/schema.py[145-147]

Agent prompt

The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`scripts/goldgen/generate.py::_query_types` can return duplicate `query_types` (it filters for strings but does not dedupe). `configs/schema.json` requires `query_types` to have `uniqueItems: true`, and `scripts/lib/schema.py` enforces this with assertions, so candidates will be rejected during `_validate_promotable_candidate` / `promoted_schema_record`.

## Issue Context
This is especially likely with LLM output where the model repeats types. The dedupe should preserve order and still guarantee 2–5 negatives after normalization.

## Fix Focus Areas
- scripts/goldgen/generate.py[126-157]

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools

Copilot

Pull request overview

This PR adds a scripts/goldgen/ scaffold to generate synthetic eval-only “gold candidate” NER records across multiple domain/language buckets, plus a QA export/import loop and tests to validate bucket coverage, span/hallucination gates, schema promotion, and env-based LLM config.

Changes:

Add bucket definitions and prompt template for chat/notes/journal/reminders across en/ja/zh/ko, including JA hard-case buckets.
Implement an OpenAI-compatible LLM client and a generator that enforces verbatim-surface + span computation and promotes schema-valid records.
Add QA export/import scripts and tests for schema validity, rejection paths, and QA round-tripping.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 6 comments.

Show a summary per file

File	Description
tests/test_goldgen.py	Adds tests for bucket coverage, hallucination/span gating, schema promotion, env config, QA round trip, and dry-run prompt emission.
scripts/goldgen/init.py	Introduces the goldgen package scaffold.
scripts/goldgen/buckets.yaml	Defines bucket matrix + JA hard-case buckets and default negative types.
scripts/goldgen/generate.py	Implements prompt rendering, LLM response parsing, candidate building, schema validation/promotion, and JSONL writing.
scripts/goldgen/llm.py	Adds provider-agnostic OpenAI-compatible `/chat/completions` client and env loading.
scripts/goldgen/prompts/candidate.md	Adds the candidate-generation prompt template.
scripts/goldgen/qa_export.py	Exports candidate JSONL to Markdown + CSV for human QA.
scripts/goldgen/qa_import.py	Imports QA decisions, applies optional fixes, and promotes accepted records to schema-valid synthetic gold.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

+        "schema_version": "1.0",
+        "source": SOURCE,
+        "source_id": f"{bucket['name']}-{date.today().isoformat()}-{index}-{uuid.uuid4().hex[:8]}",
+        "language": bucket["language"],


+    if bucket["domain"] == "chat" or "turns" in example:
+        turns = example.get("turns")
+        if not isinstance(turns, list) or len(turns) < 2:
+            raise CandidateRejected("Chat candidate must include at least two turns")
+        clean_turns = []
+        for turn in turns:
+            speaker = turn.get("speaker")
+            text = turn.get("text")
+            if not isinstance(speaker, str) or not speaker or not isinstance(text, str) or not text:
+                raise CandidateRejected(f"Bad turn shape: {turn!r}")
+            clean_turns.append({"speaker": speaker, "text": text})
+        spanned = _compute_conversation_entities(clean_turns, entities)
+        query_types = _query_types(example, spanned, default_negative_types)
+        record = {
+            **base,
+            "format": "conversation",
+            "turns": clean_turns,
+            "query_types": query_types,
+            "entities": spanned,
+        }
+        _validate_promotable_candidate(record)
+        return record
+
+    text = example.get("text")
+    if not isinstance(text, str) or not text:


+def _apply_fix(candidate: dict[str, Any], decision: dict[str, str]) -> dict[str, Any]:
+    fixed = dict(candidate)
+    if decision.get("fixed_text"):
+        fixed["text"] = decision["fixed_text"]
+        fixed.pop("turns", None)
+        fixed.pop("format", None)
+    if decision.get("fixed_turns_json"):
+        fixed["turns"] = json.loads(decision["fixed_turns_json"])
+        fixed["format"] = "conversation"
+        fixed.pop("text", None)
+    if decision.get("fixed_entities_json"):
+        raw_entities = _normal_entities(json.loads(decision["fixed_entities_json"]))
+        if fixed.get("format") == "conversation":
+            fixed["entities"] = _compute_conversation_entities(fixed["turns"], raw_entities)
+        else:
+            fixed["entities"] = _compute_passage_entities(fixed["text"], raw_entities)
+    return fixed


+def import_decisions(candidates: Path, decisions: Path, out: Path) -> list[dict[str, Any]]:
+    records = {record["source_id"]: record for record in _load_jsonl(candidates)}
+    decision_rows = _load_decisions(decisions)
+    promoted = []
+    rejected = []


+Return strict JSON only, with this shape:
+
+```json
+{{
+  "examples": [
+    {{
+      "text": "single passage text, unless domain is chat",
+      "turns": [
+        {{"speaker": "A", "text": "chat turn"}},
+        {{"speaker": "B", "text": "chat turn"}}
+      ],
+      "entities": [
+        {{"surface": "verbatim text span", "type": "PERSON"}}
+      ],
+      "query_types": ["PERSON", "DATE", "ORG", "PRODUCT"]
+    }}
+  ]
+}}
+```


+def parse_llm_json(content: str) -> list[dict[str, Any]]:
+    payload = json.loads(content)
+    examples = payload.get("examples")
+    if not isinstance(examples, list):
+        raise CandidateRejected("LLM response must contain an examples array")
+    return examples


olety · 2026-07-02T10:25:26Z

Boundary-check update: ONERB-3 is blocked on AC#4 as written.

The SOW requires output records to carry confidence: "synthetic-gold-candidate" plus generation_metadata and be valid against configs/schema.json. The current schema rejects both parts: the confidence enum does not include synthetic-gold-candidate, and top-level unevaluatedProperties: false rejects generation_metadata.

configs/schema.json is outside the ONERB-3 SOW file scope, so I marked this PR back to draft and wrote the required flag at /Users/olety/Desktop/code/fable-queue/oneironer-wave/onerb-3.flag.md. The scaffold diff remains scoped, but this PR should not merge until the schema/SOW conflict is resolved.

Copilot

Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated 3 comments.

+def parse_llm_json(content: str) -> list[dict[str, Any]]:
+    payload = json.loads(content)
+    examples = payload.get("examples")
+    if not isinstance(examples, list):
+        raise CandidateRejected("LLM response must contain an examples array")
+    return examples


+    content = llm.chat_completion(prompt)
+    examples = parse_llm_json(content)
+    records = []
+    rejected = []


+def import_decisions(candidates: Path, decisions: Path, out: Path) -> list[dict[str, Any]]:
+    records = {record["source_id"]: record for record in _load_jsonl(candidates)}
+    decision_rows = _load_decisions(decisions)
+    promoted = []
+    rejected = []
+
+    for source_id, candidate in records.items():
+        decision = decision_rows.get(source_id, {})
+        action = (decision.get("decision") or "").strip().lower()


qodo-code-review · 2026-07-02T14:52:06Z

+def parse_llm_json(content: str) -> list[dict[str, Any]]:
+    payload = json.loads(content)
+    examples = payload.get("examples")
+    if not isinstance(examples, list):
+        raise CandidateRejected("LLM response must contain an examples array")
+    return examples


1. Llm parse can crash 🐞 Bug ☼ Reliability

parse_llm_json() calls json.loads() and then payload.get("examples") without guarding JSONDecodeError or non-dict payloads, and generate() does not catch those failures. A single malformed/unsupported LLM response aborts the whole generation run and prevents writing candidates/rejected outputs.

Agent Prompt

### Issue description `scripts/goldgen/generate.parse_llm_json()` assumes the LLM returns valid JSON with a top-level object and an `examples` array. If the provider returns malformed JSON, or a non-object JSON value (e.g., a list/string), the code raises `JSONDecodeError` or `AttributeError` and the whole `generate()` run aborts. ### Issue Context This is a reliability problem because LLM responses can occasionally be malformed (provider outages, truncation, safety filters, etc.). Right now, those failures happen before any candidate/rejected JSONL is written. ### Fix Focus Areas - scripts/goldgen/generate.py[79-84] - scripts/goldgen/generate.py[323-359] ### Implementation notes - In `parse_llm_json`, catch `json.JSONDecodeError` and raise `CandidateRejected` with a short message (optionally include a small preview of `content`). - Also validate `payload` is a `dict` before calling `.get()`; otherwise raise `CandidateRejected`. - In `generate()`, consider catching `CandidateRejected` around `parse_llm_json(content)` and writing a single rejected-row file (or raising a clean `SystemExit` that still leaves the prompt file behind), so the failure is diagnosable.

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools

qodo-code-review · 2026-07-02T14:52:06Z

+def _apply_fix(candidate: dict[str, Any], decision: dict[str, str]) -> dict[str, Any]:
+    fixed = dict(candidate)
+    if decision.get("fixed_text"):
+        fixed["text"] = decision["fixed_text"]
+        fixed.pop("turns", None)
+        fixed.pop("format", None)
+    if decision.get("fixed_turns_json"):
+        fixed["turns"] = _normal_turns(_load_fix_json(decision, "fixed_turns_json"))
+        fixed["format"] = "conversation"
+        fixed.pop("text", None)
+    if decision.get("fixed_entities_json"):
+        loaded_entities = _load_fix_json(decision, "fixed_entities_json")
+        if not isinstance(loaded_entities, list):
+            raise CandidateRejected("fixed_entities_json must contain an entity list")
+        raw_entities = _normal_entities(loaded_entities)
+        if fixed.get("format") == "conversation":
+            fixed["entities"] = _compute_conversation_entities(fixed["turns"], raw_entities)
+        else:
+            fixed["entities"] = _compute_passage_entities(fixed["text"], raw_entities)
+    return fixed


2. Qa fixes keep stale spans 🐞 Bug ≡ Correctness

qa_import._apply_fix() updates fixed_text/fixed_turns_json but does not recompute entity start/end spans unless fixed_entities_json is also provided. Any text/turn edit that shifts offsets will cause promoted_schema_record() validation to fail with span mismatch and reject otherwise-valid QA fixes.

Agent Prompt

### Issue description When QA provides `fixed_text` or `fixed_turns_json`, `_apply_fix()` changes the underlying text but leaves the original `entities` list (including `start`/`end`) untouched unless `fixed_entities_json` is also provided. This creates stale offsets, and promotion fails later because schema validation enforces `text[start:end] == surface`. ### Issue Context This makes the `fix` workflow fragile: simple edits like adding/removing a character before an entity will cause rejection even if the entity surfaces still appear verbatim. ### Fix Focus Areas - scripts/goldgen/qa_import.py[56-75] - scripts/lib/schema.py[44-68] ### Implementation notes Choose one of the following (either is acceptable, but be explicit): 1) **Auto-recompute spans** whenever `fixed_text` or `fixed_turns_json` is provided and `fixed_entities_json` is empty: - Build a raw entity list from existing entities using only `surface/type/original_type/(turn_index)` (ignore existing `start/end`). - Re-run `_compute_passage_entities()` or `_compute_conversation_entities()` to regenerate offsets. 2) **Fail fast with a clear error**: if text/turns changed but entities weren’t provided, raise `CandidateRejected` with a message like "fixed_text/fixed_turns_json requires fixed_entities_json (offsets must be recomputed)". Option (1) is more forgiving and reduces unnecessary rejections.

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools

qodo-code-review · 2026-07-02T14:52:07Z

+    entities = _normal_entities(example.get("entities", []))
+    metadata = _metadata(bucket, model=model)
+    base = {
+        "schema_version": "1.2",
+        "source": SOURCE,
+        "source_id": f"{bucket['name']}-{date.today().isoformat()}-{index}-{uuid.uuid4().hex[:8]}",
+        "language": bucket["language"],
+        "split": "eval",
+        "confidence": CANDIDATE_CONFIDENCE,
+        "provenance": _metadata_provenance(metadata),
+        "generation_metadata": metadata,
+    }


3. Source_id date may drift 🐞 Bug ⚙ Maintainability

build_candidate_record() embeds date.today() (local time) into source_id while generation_metadata/provenance uses a UTC generated_at-derived date. Around midnight or on non-UTC systems, a single record can carry inconsistent dates across its ID vs provenance metadata.

Agent Prompt

### Issue description `source_id` uses `date.today()` (local timezone) but `generation_metadata.date` is derived from a UTC timestamp. These can disagree and create confusing traceability. ### Issue Context This matters for debugging/deduping and for reliably grouping outputs by a single run date. ### Fix Focus Areas - scripts/goldgen/generate.py[87-95] - scripts/goldgen/generate.py[252-271] ### Implementation notes - Use a single UTC-derived date for both fields (e.g., `metadata['date']`) when constructing `source_id`, or compute `run_date = datetime.now(timezone.utc).date().isoformat()` once and reuse it for both `source_id` and `generation_metadata.date`.

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools

qodo-code-review · 2026-07-02T14:52:09Z

Code review by qodo was updated up to the latest commit 0b9c0cc

Copilot AI review requested due to automatic review settings July 2, 2026 10:20

olety marked this pull request as ready for review July 2, 2026 10:20

Copilot started reviewing on behalf of olety July 2, 2026 10:21 View session

qodo-code-review Bot reviewed Jul 2, 2026

View reviewed changes

Comment thread scripts/goldgen/generate.py

Comment thread scripts/goldgen/generate.py Outdated

Comment thread scripts/goldgen/qa_import.py

Copilot AI reviewed Jul 2, 2026

View reviewed changes

olety marked this pull request as draft July 2, 2026 10:25

olety added 2 commits July 2, 2026 23:41

Add gold generation scaffold

aaa8310

ONERB-3: satisfy schema 1.2 candidate guard

0b9c0cc

olety force-pushed the codex/onerb-3 branch from 97be296 to 0b9c0cc Compare July 2, 2026 14:46

olety marked this pull request as ready for review July 2, 2026 14:46

Copilot AI review requested due to automatic review settings July 2, 2026 14:46

Copilot started reviewing on behalf of olety July 2, 2026 14:47 View session

olety merged commit e0b55ea into main Jul 2, 2026
2 checks passed

olety deleted the codex/onerb-3 branch July 2, 2026 14:49

Copilot AI reviewed Jul 2, 2026

View reviewed changes

qodo-code-review Bot reviewed Jul 2, 2026

View reviewed changes

Uh oh!

Conversation

olety commented Jul 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Gate

Scope

Uh oh!

coderabbitai Bot commented Jul 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review limit reached

Uh oh!

qodo-code-review Bot commented Jul 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PR Summary by Qodo

Uh oh!

qodo-code-review Bot commented Jul 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Code Review by Qodo

Previous review results

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

olety commented Jul 2, 2026

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

qodo-code-review Bot Jul 2, 2026

Choose a reason for hiding this comment

Uh oh!

qodo-code-review Bot Jul 2, 2026

Choose a reason for hiding this comment

Uh oh!

qodo-code-review Bot Jul 2, 2026

Choose a reason for hiding this comment

Uh oh!

qodo-code-review Bot commented Jul 2, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

olety commented Jul 2, 2026 •

edited

Loading

coderabbitai Bot commented Jul 2, 2026 •

edited

Loading

qodo-code-review Bot commented Jul 2, 2026 •

edited

Loading

qodo-code-review Bot commented Jul 2, 2026 •

edited

Loading