Skip to content

Add gold generation scaffold#3

Merged
olety merged 2 commits into
mainfrom
codex/onerb-3
Jul 2, 2026
Merged

Add gold generation scaffold#3
olety merged 2 commits into
mainfrom
codex/onerb-3

Conversation

@olety

@olety olety commented Jul 2, 2026

Copy link
Copy Markdown
Member

Summary

  • Add scripts/goldgen scaffold for synthetic eval gold candidates across the required domain/language buckets and JA hard cases.
  • Emit pre-QA candidates with confidence: "synthetic-gold-candidate", schema 1.2, and schema-valid generation metadata (model, prompt_id, generated_at, plus generator context).
  • Keep QA promotion to synthetic-gold, harden malformed QA fixes into rejected rows, dedupe query types while preserving order, and enforce schema-compatible 2-4 turn chat candidates.
  • Add focused coverage for schema-1.2 candidate validation, QA malformed-fix rejection, query type dedupe, chat turn limits, dry-run prompts, and the existing hallucination/span gates.

Refs ONE-1330.

Gate

Rebased onto origin/main at 1ee9e64 ONERB-6: add schema 1.2 goldgen candidate guard before this gate.

uv venv .venv
Using CPython 3.13.5
Creating virtual environment at: .venv

uv pip install --python .venv/bin/python -r requirements.txt -q
ok

.venv/bin/python -m pytest tests/ -x -q
136 passed in 8.46s

.venv/bin/python scripts/validate_mappings.py
Train entries: 566
Eval entries:  684
ALL CHECKS PASSED

.venv/bin/python scripts/goldgen/generate.py --bucket chat_ja --n 2 --dry-run --out /tmp/onerb3-smoke
Prompt emitted for chat_ja with JA hard cases and the 2 to 4 turn chat rule.

.venv/bin/python -m pytest tests/test_goldgen.py -q
11 passed in 0.05s

Scope

Files changed are limited to scripts/goldgen/ and tests/test_goldgen.py as required by ONERB-3.

Copilot AI review requested due to automatic review settings July 2, 2026 10:20
@olety olety marked this pull request as ready for review July 2, 2026 10:20
@coderabbitai

coderabbitai Bot commented Jul 2, 2026

Copy link
Copy Markdown

Warning

Review limit reached

You’ve reached a temporary PR review limit under our Fair Usage Limits Policy.

Your recent review volume is higher than typical usage, so adaptive limits are currently applied.

Next review available in: 50 minutes

Enable usage-based reviews in Billing to review now. Otherwise, wait until the next included review is available.
You're only billed for reviews past your plan's rate limits ($0.25/file).

How can I continue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

To avoid repeated limits, reduce automatic review volume by pausing incremental auto-reviews earlier, using label-based review opt-in, excluding WIP or generated PR titles, or requesting reviews manually when the PR is ready. If your team needs uninterrupted high-volume reviews, an organization admin can enable usage-based reviews.

How do review limits work?

CodeRabbit enforces per-developer PR review limits for each organization. Most developers receive the normal plan review availability.

For paid Pro and Pro+ PR reviews, CodeRabbit uses adaptive limits for sustained high-volume activity. When a developer's recent PR review activity reaches the 95th percentile or higher among CodeRabbit users, additional reviews become available more gradually as earlier reviews age out of the rolling window.

Please refer docs for additional details.

Review details
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: d8d8c477-121e-413a-82ff-281f53e07866

📥 Commits

Reviewing files that changed from the base of the PR and between 1ee9e64 and 0b9c0cc.

📒 Files selected for processing (8)
  • scripts/goldgen/__init__.py
  • scripts/goldgen/buckets.yaml
  • scripts/goldgen/generate.py
  • scripts/goldgen/llm.py
  • scripts/goldgen/prompts/candidate.md
  • scripts/goldgen/qa_export.py
  • scripts/goldgen/qa_import.py
  • tests/test_goldgen.py
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch codex/onerb-3

Comment @coderabbitai help to get the list of available commands.

@qodo-code-review

qodo-code-review Bot commented Jul 2, 2026

Copy link
Copy Markdown

PR Summary by Qodo

Add synthetic gold generation scaffold with QA export/import and schema 1.2 guards

✨ Enhancement 🧪 Tests 🕐 40+ Minutes

Grey Divider

AI Description

• Introduce scripts/goldgen to generate schema-1.2 synthetic eval gold candidates per bucket.
• Enforce hallucination/span gates, query-type dedupe, and 2–4 turn chat constraints.
• Add QA export/import flow to promote accepted candidates to synthetic-gold safely.
Diagram

graph TD
  A["generate.py (CLI)"] --> B["buckets.yaml"] --> C["candidate.md prompt"] --> D["llm.py (chat API)"]
  A --> E["span_computer (compute_spans_batch)"] --> F["configs/schema.json"]
  A --> G["candidates.jsonl + rejected.jsonl"] --> H["qa_export.py / qa_import.py"]
Loading
High-Level Assessment

The following are alternative approaches to this PR:

1. Use OpenAI SDK (or vendor SDK) instead of urllib
  • ➕ Less bespoke HTTP handling and response-shape parsing
  • ➕ Easier support for retries, timeouts, and structured outputs
  • ➖ Adds dependency surface and possible vendor lock-in
  • ➖ Harder to stay provider-agnostic across OpenAI-compatible endpoints
2. Model records with Pydantic (or attrs/dataclasses) for validation
  • ➕ Cleaner invariants for turns/entities/query_types and better error messages
  • ➕ Can unify candidate vs promoted record shaping without key filtering
  • ➖ Additional dependency and a second validation system alongside jsonschema
  • ➖ Still must validate against the repo’s canonical JSON schema
3. Parse true YAML with PyYAML/ruamel.yaml
  • ➕ Allows comments, anchors, and more readable config as buckets expand
  • ➕ Avoids JSON-in-YAML constraint
  • ➖ Introduces a runtime YAML dependency the scaffold explicitly avoids

Recommendation: The current approach is solid for a gated scaffold: JSON-compatible buckets avoid new dependencies, and double validation (jsonschema + internal record validators) reduces risk of producing non-promotable candidates. The main follow-up to consider is making the QA importer rely on a small public helper API (rather than importing underscore-prefixed functions) to keep module boundaries cleaner, but the overall strategy is appropriate for ONERB-3’s constrained scope.

Files changed (8) +1268 / -0

Enhancement (5) +671 / -0
__init__.pyCreate goldgen package marker +2/-0

Create goldgen package marker

• Introduces the 'scripts.goldgen' package with a module docstring to scope the feature as synthetic eval gold generation.

scripts/goldgen/init.py

generate.pyImplement candidate generation with schema 1.2 + hallucination/span gates +376/-0

Implement candidate generation with schema 1.2 + hallucination/span gates

• Adds the main generator CLI and core record-building logic, including prompt rendering, OpenAI-compatible LLM invocation, span computation, query-type dedupe/padding, and strict schema validation. Produces schema-1.2 candidate records with generation metadata and writes rejected examples with reasons; also provides promotion-safe validation for QA acceptance.

scripts/goldgen/generate.py

llm.pyAdd provider-agnostic OpenAI-compatible chat client +81/-0

Add provider-agnostic OpenAI-compatible chat client

• Implements a minimal HTTP client for '/chat/completions' using 'GOLDGEN_MODEL', 'GOLDGEN_API_BASE', and 'GOLDGEN_API_KEY'. Enforces missing-env errors and normalizes API base URLs for consistent request construction.

scripts/goldgen/llm.py

qa_export.pyExport candidates to Markdown + CSV QA worksheet +88/-0

Export candidates to Markdown + CSV QA worksheet

• Adds a QA export utility that renders candidates into a human-reviewable Markdown batch and a CSV decision sheet. Includes entity spans (and turn indices for chat) to help reviewers verify boundaries and correctness.

scripts/goldgen/qa_export.py

qa_import.pyImport QA decisions and promote accepted candidates to synthetic-gold +124/-0

Import QA decisions and promote accepted candidates to synthetic-gold

• Adds a QA import utility that reads CSV decisions, applies optional text/turn/entity fixes, recomputes spans, and promotes accepted/fixed records to 'synthetic-gold'. Malformed fixes and invalid promotions are captured as rejected rows without aborting the batch.

scripts/goldgen/qa_import.py

Tests (1) +351 / -0
test_goldgen.pyAdd focused tests for schema 1.2 candidates, QA flow, and guards +351/-0

Add focused tests for schema 1.2 candidates, QA flow, and guards

• Adds tests ensuring bucket coverage, hallucination rejection, Japanese span computation, candidate/promoted schema validity, query-type dedupe behavior, chat turn limits, dry-run prompt emission, and QA export/import rejection handling for malformed fixes.

tests/test_goldgen.py

Other (2) +246 / -0
buckets.yamlDefine domain/language buckets and JA hard-case buckets +203/-0

Define domain/language buckets and JA hard-case buckets

• Adds bucket configuration covering chat/notes/journal/reminders across en/ja/zh/ko, plus dedicated Japanese hard-case buckets. Also defines a default list of negative entity types used to pad query types.

scripts/goldgen/buckets.yaml

candidate.mdAdd strict JSON prompt template with chat and span rules +43/-0

Add strict JSON prompt template with chat and span rules

• Introduces a prompt template that constrains the model to return strict JSON examples and explicitly requires verbatim entity surfaces. Encodes the 2–4 turn chat rule and negative query-type requirements to reduce invalid outputs.

scripts/goldgen/prompts/candidate.md

@qodo-code-review

qodo-code-review Bot commented Jul 2, 2026

Copy link
Copy Markdown

Code Review by Qodo

🐞 Bugs (3) 📘 Rule violations (0) 📜 Skill insights (0)

Grey Divider


Action required

1. LLM parse can crash 🐞 Bug ☼ Reliability ⭐ New
Description
parse_llm_json() calls json.loads() and then payload.get("examples") without guarding
JSONDecodeError or non-dict payloads, and generate() does not catch those failures. A single
malformed/unsupported LLM response aborts the whole generation run and prevents writing
candidates/rejected outputs.
Code

scripts/goldgen/generate.py[R79-84]

+def parse_llm_json(content: str) -> list[dict[str, Any]]:
+    payload = json.loads(content)
+    examples = payload.get("examples")
+    if not isinstance(examples, list):
+        raise CandidateRejected("LLM response must contain an examples array")
+    return examples
Evidence
The code currently performs unguarded JSON parsing and dict access, and the call site does not wrap
those exceptions, so response-shape errors will terminate the run before output is written.

scripts/goldgen/generate.py[79-84]
scripts/goldgen/generate.py[323-359]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
`scripts/goldgen/generate.parse_llm_json()` assumes the LLM returns valid JSON with a top-level object and an `examples` array. If the provider returns malformed JSON, or a non-object JSON value (e.g., a list/string), the code raises `JSONDecodeError` or `AttributeError` and the whole `generate()` run aborts.

### Issue Context
This is a reliability problem because LLM responses can occasionally be malformed (provider outages, truncation, safety filters, etc.). Right now, those failures happen before any candidate/rejected JSONL is written.

### Fix Focus Areas
- scripts/goldgen/generate.py[79-84]
- scripts/goldgen/generate.py[323-359]

### Implementation notes
- In `parse_llm_json`, catch `json.JSONDecodeError` and raise `CandidateRejected` with a short message (optionally include a small preview of `content`).
- Also validate `payload` is a `dict` before calling `.get()`; otherwise raise `CandidateRejected`.
- In `generate()`, consider catching `CandidateRejected` around `parse_llm_json(content)` and writing a single rejected-row file (or raising a clean `SystemExit` that still leaves the prompt file behind), so the failure is diagnosable.

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools


2. QA fixes keep stale spans 🐞 Bug ≡ Correctness ⭐ New
Description
qa_import._apply_fix() updates fixed_text/fixed_turns_json but does not recompute entity start/end
spans unless fixed_entities_json is also provided. Any text/turn edit that shifts offsets will cause
promoted_schema_record() validation to fail with span mismatch and reject otherwise-valid QA fixes.
Code

scripts/goldgen/qa_import.py[R56-75]

+def _apply_fix(candidate: dict[str, Any], decision: dict[str, str]) -> dict[str, Any]:
+    fixed = dict(candidate)
+    if decision.get("fixed_text"):
+        fixed["text"] = decision["fixed_text"]
+        fixed.pop("turns", None)
+        fixed.pop("format", None)
+    if decision.get("fixed_turns_json"):
+        fixed["turns"] = _normal_turns(_load_fix_json(decision, "fixed_turns_json"))
+        fixed["format"] = "conversation"
+        fixed.pop("text", None)
+    if decision.get("fixed_entities_json"):
+        loaded_entities = _load_fix_json(decision, "fixed_entities_json")
+        if not isinstance(loaded_entities, list):
+            raise CandidateRejected("fixed_entities_json must contain an entity list")
+        raw_entities = _normal_entities(loaded_entities)
+        if fixed.get("format") == "conversation":
+            fixed["entities"] = _compute_conversation_entities(fixed["turns"], raw_entities)
+        else:
+            fixed["entities"] = _compute_passage_entities(fixed["text"], raw_entities)
+    return fixed
Evidence
_apply_fix mutates text/turns without updating offsets, while the record validators assert that
offsets must still select the exact surface string; those two behaviors are incompatible after any
offset-shifting edit.

scripts/goldgen/qa_import.py[56-75]
scripts/lib/schema.py[44-68]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
When QA provides `fixed_text` or `fixed_turns_json`, `_apply_fix()` changes the underlying text but leaves the original `entities` list (including `start`/`end`) untouched unless `fixed_entities_json` is also provided. This creates stale offsets, and promotion fails later because schema validation enforces `text[start:end] == surface`.

### Issue Context
This makes the `fix` workflow fragile: simple edits like adding/removing a character before an entity will cause rejection even if the entity surfaces still appear verbatim.

### Fix Focus Areas
- scripts/goldgen/qa_import.py[56-75]
- scripts/lib/schema.py[44-68]

### Implementation notes
Choose one of the following (either is acceptable, but be explicit):
1) **Auto-recompute spans** whenever `fixed_text` or `fixed_turns_json` is provided and `fixed_entities_json` is empty:
  - Build a raw entity list from existing entities using only `surface/type/original_type/(turn_index)` (ignore existing `start/end`).
  - Re-run `_compute_passage_entities()` or `_compute_conversation_entities()` to regenerate offsets.
2) **Fail fast with a clear error**: if text/turns changed but entities weren’t provided, raise `CandidateRejected` with a message like "fixed_text/fixed_turns_json requires fixed_entities_json (offsets must be recomputed)".

Option (1) is more forgiving and reduces unnecessary rejections.

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools


3. QA import can hard-crash ✓ Resolved 🐞 Bug ☼ Reliability
Description
qa_import._apply_fix calls json.loads on fixed_*_json fields without error handling or structural
validation, so malformed JSON or wrong turn shape can raise JSONDecodeError/KeyError/TypeError and
abort the entire import batch.
Code

scripts/goldgen/qa_import.py[R34-50]

+def _apply_fix(candidate: dict[str, Any], decision: dict[str, str]) -> dict[str, Any]:
+    fixed = dict(candidate)
+    if decision.get("fixed_text"):
+        fixed["text"] = decision["fixed_text"]
+        fixed.pop("turns", None)
+        fixed.pop("format", None)
+    if decision.get("fixed_turns_json"):
+        fixed["turns"] = json.loads(decision["fixed_turns_json"])
+        fixed["format"] = "conversation"
+        fixed.pop("text", None)
+    if decision.get("fixed_entities_json"):
+        raw_entities = _normal_entities(json.loads(decision["fixed_entities_json"]))
+        if fixed.get("format") == "conversation":
+            fixed["entities"] = _compute_conversation_entities(fixed["turns"], raw_entities)
+        else:
+            fixed["entities"] = _compute_passage_entities(fixed["text"], raw_entities)
+    return fixed
Evidence
The import path does not catch JSON parsing errors and passes decoded data straight into span
computation which indexes required keys directly; additionally, import_decisions does not catch
CandidateRejected, so any exception will terminate the whole import.

scripts/goldgen/qa_import.py[34-71]
scripts/goldgen/generate.py[173-177]
scripts/goldgen/generate.py[195-206]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`scripts/goldgen/qa_import.py::_apply_fix` assumes `fixed_turns_json` and `fixed_entities_json` are valid JSON and in the expected shape. If a QA editor produces malformed JSON or a wrong structure (e.g., turns missing `text`), `_compute_conversation_entities` will index `turn["text"]` and crash with `KeyError`/`TypeError`, aborting the whole `import_decisions` run.

## Issue Context
This is a human-in-the-loop flow; row-level failures are common and should be reported as `CandidateRejected` (ideally recorded into a rejected file) rather than crashing the batch.

## Fix Focus Areas
- scripts/goldgen/qa_import.py[34-71]
- scripts/goldgen/generate.py[173-177]
- scripts/goldgen/generate.py[195-206]

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools



Remediation recommended

4. Turn-count schema mismatch ✓ Resolved 🐞 Bug ☼ Reliability
Description
Chat candidates only enforce a minimum of two turns, but ConversationRecord validation rejects
records with more than four turns; since the prompt doesn’t specify the max, the LLM can generate >4
turns and candidates will be rejected.
Code

scripts/goldgen/generate.py[R259-262]

+    if bucket["domain"] == "chat" or "turns" in example:
+        turns = example.get("turns")
+        if not isinstance(turns, list) or len(turns) < 2:
+            raise CandidateRejected("Chat candidate must include at least two turns")
Evidence
Generator-side validation only checks len(turns) < 2, while the canonical conversation validator
enforces an upper bound of 4 turns; the prompt lacks a max-turn constraint, so LLM outputs can
routinely violate the validator.

scripts/goldgen/generate.py[259-269]
scripts/lib/schema.py[135-145]
scripts/goldgen/prompts/candidate.md[34-37]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`build_candidate_record()` accepts any chat `turns` length >= 2, but `ConversationRecord.validate()` enforces `2 <= len(turns) <= 4`. This causes avoidable rejections when the LLM emits 5+ turns.

## Issue Context
The prompt template currently instructs to use `turns` for chat but does not specify an upper bound, making this mismatch likely.

## Fix Focus Areas
- scripts/goldgen/generate.py[259-269]
- scripts/goldgen/prompts/candidate.md[34-37]
- scripts/lib/schema.py[135-145]

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools


5. Duplicate query_types rejected ✓ Resolved 🐞 Bug ≡ Correctness
Description
_query_types preserves duplicates from the LLM-provided query_types list, but the schema and record
validators require query_types to be unique, so otherwise-valid candidates will be rejected as
non-promotable.
Code

scripts/goldgen/generate.py[R126-157]

+def _query_types(example: dict[str, Any], entities: list[dict[str, Any]], default_negatives: list[str]) -> list[str]:
+    positives = []
+    for entity in entities:
+        if entity["type"] not in positives:
+            positives.append(entity["type"])
+
+    query_types = [item for item in example.get("query_types", []) if isinstance(item, str) and item]
+    for item in positives:
+        if item not in query_types:
+            query_types.append(item)
+
+    negative_count = len([item for item in query_types if item not in positives])
+    for item in default_negatives:
+        if negative_count >= 2:
+            break
+        if item not in query_types and item not in positives:
+            query_types.append(item)
+            negative_count += 1
+
+    if len([item for item in query_types if item not in positives]) > 5:
+        allowed = set(positives)
+        trimmed = []
+        negatives = 0
+        for item in query_types:
+            if item in allowed:
+                trimmed.append(item)
+            elif negatives < 5:
+                trimmed.append(item)
+                negatives += 1
+        query_types = trimmed
+
+    return query_types
Evidence
The generator currently returns query_types without deduplicating, but both the JSON schema and
the runtime validators reject duplicates, causing CandidateRejected for otherwise-usable examples.

scripts/goldgen/generate.py[126-157]
configs/schema.json[145-150]
scripts/lib/schema.py[44-52]
scripts/lib/schema.py[145-147]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`scripts/goldgen/generate.py::_query_types` can return duplicate `query_types` (it filters for strings but does not dedupe). `configs/schema.json` requires `query_types` to have `uniqueItems: true`, and `scripts/lib/schema.py` enforces this with assertions, so candidates will be rejected during `_validate_promotable_candidate` / `promoted_schema_record`.

## Issue Context
This is especially likely with LLM output where the model repeats types. The dedupe should preserve order and still guarantee 2–5 negatives after normalization.

## Fix Focus Areas
- scripts/goldgen/generate.py[126-157]

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools



Informational

6. source_id date may drift 🐞 Bug ⚙ Maintainability ⭐ New
Description
build_candidate_record() embeds date.today() (local time) into source_id while
generation_metadata/provenance uses a UTC generated_at-derived date. Around midnight or on non-UTC
systems, a single record can carry inconsistent dates across its ID vs provenance metadata.
Code

scripts/goldgen/generate.py[R260-271]

+    entities = _normal_entities(example.get("entities", []))
+    metadata = _metadata(bucket, model=model)
+    base = {
+        "schema_version": "1.2",
+        "source": SOURCE,
+        "source_id": f"{bucket['name']}-{date.today().isoformat()}-{index}-{uuid.uuid4().hex[:8]}",
+        "language": bucket["language"],
+        "split": "eval",
+        "confidence": CANDIDATE_CONFIDENCE,
+        "provenance": _metadata_provenance(metadata),
+        "generation_metadata": metadata,
+    }
Evidence
The metadata date is explicitly derived from a UTC timestamp, while the source_id date is derived
from local date.today(), which can differ near day boundaries.

scripts/goldgen/generate.py[87-95]
scripts/goldgen/generate.py[252-271]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
`source_id` uses `date.today()` (local timezone) but `generation_metadata.date` is derived from a UTC timestamp. These can disagree and create confusing traceability.

### Issue Context
This matters for debugging/deduping and for reliably grouping outputs by a single run date.

### Fix Focus Areas
- scripts/goldgen/generate.py[87-95]
- scripts/goldgen/generate.py[252-271]

### Implementation notes
- Use a single UTC-derived date for both fields (e.g., `metadata['date']`) when constructing `source_id`, or compute `run_date = datetime.now(timezone.utc).date().isoformat()` once and reuse it for both `source_id` and `generation_metadata.date`.

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools


Grey Divider

Previous review results

Review updated until commit 0b9c0cc

Results up to commit 97be296


🐞 Bugs (0) 📘 Rule violations (0) 📎 Requirement gaps (0) 🎨 UX issues (0) 🔗 Cross-repo conflicts (0) 📜 Skill insights (0)


Action required
1. QA import can hard-crash ✓ Resolved 🐞 Bug ☼ Reliability
Description
qa_import._apply_fix calls json.loads on fixed_*_json fields without error handling or structural
validation, so malformed JSON or wrong turn shape can raise JSONDecodeError/KeyError/TypeError and
abort the entire import batch.
Code

scripts/goldgen/qa_import.py[R34-50]

+def _apply_fix(candidate: dict[str, Any], decision: dict[str, str]) -> dict[str, Any]:
+    fixed = dict(candidate)
+    if decision.get("fixed_text"):
+        fixed["text"] = decision["fixed_text"]
+        fixed.pop("turns", None)
+        fixed.pop("format", None)
+    if decision.get("fixed_turns_json"):
+        fixed["turns"] = json.loads(decision["fixed_turns_json"])
+        fixed["format"] = "conversation"
+        fixed.pop("text", None)
+    if decision.get("fixed_entities_json"):
+        raw_entities = _normal_entities(json.loads(decision["fixed_entities_json"]))
+        if fixed.get("format") == "conversation":
+            fixed["entities"] = _compute_conversation_entities(fixed["turns"], raw_entities)
+        else:
+            fixed["entities"] = _compute_passage_entities(fixed["text"], raw_entities)
+    return fixed
Evidence
The import path does not catch JSON parsing errors and passes decoded data straight into span
computation which indexes required keys directly; additionally, import_decisions does not catch
CandidateRejected, so any exception will terminate the whole import.

scripts/goldgen/qa_import.py[34-71]
scripts/goldgen/generate.py[173-177]
scripts/goldgen/generate.py[195-206]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`scripts/goldgen/qa_import.py::_apply_fix` assumes `fixed_turns_json` and `fixed_entities_json` are valid JSON and in the expected shape. If a QA editor produces malformed JSON or a wrong structure (e.g., turns missing `text`), `_compute_conversation_entities` will index `turn["text"]` and crash with `KeyError`/`TypeError`, aborting the whole `import_decisions` run.

## Issue Context
This is a human-in-the-loop flow; row-level failures are common and should be reported as `CandidateRejected` (ideally recorded into a rejected file) rather than crashing the batch.

## Fix Focus Areas
- scripts/goldgen/qa_import.py[34-71]
- scripts/goldgen/generate.py[173-177]
- scripts/goldgen/generate.py[195-206]

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools



Remediation recommended
2. Turn-count schema mismatch ✓ Resolved 🐞 Bug ☼ Reliability
Description
Chat candidates only enforce a minimum of two turns, but ConversationRecord validation rejects
records with more than four turns; since the prompt doesn’t specify the max, the LLM can generate >4
turns and candidates will be rejected.
Code

scripts/goldgen/generate.py[R259-262]

+    if bucket["domain"] == "chat" or "turns" in example:
+        turns = example.get("turns")
+        if not isinstance(turns, list) or len(turns) < 2:
+            raise CandidateRejected("Chat candidate must include at least two turns")
Evidence
Generator-side validation only checks len(turns) < 2, while the canonical conversation validator
enforces an upper bound of 4 turns; the prompt lacks a max-turn constraint, so LLM outputs can
routinely violate the validator.

scripts/goldgen/generate.py[259-269]
scripts/lib/schema.py[135-145]
scripts/goldgen/prompts/candidate.md[34-37]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`build_candidate_record()` accepts any chat `turns` length >= 2, but `ConversationRecord.validate()` enforces `2 <= len(turns) <= 4`. This causes avoidable rejections when the LLM emits 5+ turns.

## Issue Context
The prompt template currently instructs to use `turns` for chat but does not specify an upper bound, making this mismatch likely.

## Fix Focus Areas
- scripts/goldgen/generate.py[259-269]
- scripts/goldgen/prompts/candidate.md[34-37]
- scripts/lib/schema.py[135-145]

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools


3. Duplicate query_types rejected ✓ Resolved 🐞 Bug ≡ Correctness
Description
_query_types preserves duplicates from the LLM-provided query_types list, but the schema and record
validators require query_types to be unique, so otherwise-valid candidates will be rejected as
non-promotable.
Code

scripts/goldgen/generate.py[R126-157]

+def _query_types(example: dict[str, Any], entities: list[dict[str, Any]], default_negatives: list[str]) -> list[str]:
+    positives = []
+    for entity in entities:
+        if entity["type"] not in positives:
+            positives.append(entity["type"])
+
+    query_types = [item for item in example.get("query_types", []) if isinstance(item, str) and item]
+    for item in positives:
+        if item not in query_types:
+            query_types.append(item)
+
+    negative_count = len([item for item in query_types if item not in positives])
+    for item in default_negatives:
+        if negative_count >= 2:
+            break
+        if item not in query_types and item not in positives:
+            query_types.append(item)
+            negative_count += 1
+
+    if len([item for item in query_types if item not in positives]) > 5:
+        allowed = set(positives)
+        trimmed = []
+        negatives = 0
+        for item in query_types:
+            if item in allowed:
+                trimmed.append(item)
+            elif negatives < 5:
+                trimmed.append(item)
+                negatives += 1
+        query_types = trimmed
+
+    return query_types
Evidence
The generator currently returns query_types without deduplicating, but both the JSON schema and
the runtime validators reject duplicates, causing CandidateRejected for otherwise-usable examples.

scripts/goldgen/generate.py[126-157]
configs/schema.json[145-150]
scripts/lib/schema.py[44-52]
scripts/lib/schema.py[145-147]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`scripts/goldgen/generate.py::_query_types` can return duplicate `query_types` (it filters for strings but does not dedupe). `configs/schema.json` requires `query_types` to have `uniqueItems: true`, and `scripts/lib/schema.py` enforces this with assertions, so candidates will be rejected during `_validate_promotable_candidate` / `promoted_schema_record`.

## Issue Context
This is especially likely with LLM output where the model repeats types. The dedupe should preserve order and still guarantee 2–5 negatives after normalization.

## Fix Focus Areas
- scripts/goldgen/generate.py[126-157]

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools


Qodo Logo

Comment thread scripts/goldgen/generate.py
Comment thread scripts/goldgen/generate.py Outdated
Comment thread scripts/goldgen/qa_import.py

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adds a scripts/goldgen/ scaffold to generate synthetic eval-only “gold candidate” NER records across multiple domain/language buckets, plus a QA export/import loop and tests to validate bucket coverage, span/hallucination gates, schema promotion, and env-based LLM config.

Changes:

  • Add bucket definitions and prompt template for chat/notes/journal/reminders across en/ja/zh/ko, including JA hard-case buckets.
  • Implement an OpenAI-compatible LLM client and a generator that enforces verbatim-surface + span computation and promotes schema-valid records.
  • Add QA export/import scripts and tests for schema validity, rejection paths, and QA round-tripping.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 6 comments.

Show a summary per file
File Description
tests/test_goldgen.py Adds tests for bucket coverage, hallucination/span gating, schema promotion, env config, QA round trip, and dry-run prompt emission.
scripts/goldgen/init.py Introduces the goldgen package scaffold.
scripts/goldgen/buckets.yaml Defines bucket matrix + JA hard-case buckets and default negative types.
scripts/goldgen/generate.py Implements prompt rendering, LLM response parsing, candidate building, schema validation/promotion, and JSONL writing.
scripts/goldgen/llm.py Adds provider-agnostic OpenAI-compatible /chat/completions client and env loading.
scripts/goldgen/prompts/candidate.md Adds the candidate-generation prompt template.
scripts/goldgen/qa_export.py Exports candidate JSONL to Markdown + CSV for human QA.
scripts/goldgen/qa_import.py Imports QA decisions, applies optional fixes, and promotes accepted records to schema-valid synthetic gold.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread scripts/goldgen/generate.py Outdated
Comment on lines +249 to +252
"schema_version": "1.0",
"source": SOURCE,
"source_id": f"{bucket['name']}-{date.today().isoformat()}-{index}-{uuid.uuid4().hex[:8]}",
"language": bucket["language"],
Comment on lines +259 to +283
if bucket["domain"] == "chat" or "turns" in example:
turns = example.get("turns")
if not isinstance(turns, list) or len(turns) < 2:
raise CandidateRejected("Chat candidate must include at least two turns")
clean_turns = []
for turn in turns:
speaker = turn.get("speaker")
text = turn.get("text")
if not isinstance(speaker, str) or not speaker or not isinstance(text, str) or not text:
raise CandidateRejected(f"Bad turn shape: {turn!r}")
clean_turns.append({"speaker": speaker, "text": text})
spanned = _compute_conversation_entities(clean_turns, entities)
query_types = _query_types(example, spanned, default_negative_types)
record = {
**base,
"format": "conversation",
"turns": clean_turns,
"query_types": query_types,
"entities": spanned,
}
_validate_promotable_candidate(record)
return record

text = example.get("text")
if not isinstance(text, str) or not text:
Comment on lines +34 to +50
def _apply_fix(candidate: dict[str, Any], decision: dict[str, str]) -> dict[str, Any]:
fixed = dict(candidate)
if decision.get("fixed_text"):
fixed["text"] = decision["fixed_text"]
fixed.pop("turns", None)
fixed.pop("format", None)
if decision.get("fixed_turns_json"):
fixed["turns"] = json.loads(decision["fixed_turns_json"])
fixed["format"] = "conversation"
fixed.pop("text", None)
if decision.get("fixed_entities_json"):
raw_entities = _normal_entities(json.loads(decision["fixed_entities_json"]))
if fixed.get("format") == "conversation":
fixed["entities"] = _compute_conversation_entities(fixed["turns"], raw_entities)
else:
fixed["entities"] = _compute_passage_entities(fixed["text"], raw_entities)
return fixed
Comment on lines +53 to +57
def import_decisions(candidates: Path, decisions: Path, out: Path) -> list[dict[str, Any]]:
records = {record["source_id"]: record for record in _load_jsonl(candidates)}
decision_rows = _load_decisions(decisions)
promoted = []
rejected = []
Comment on lines +14 to +32
Return strict JSON only, with this shape:

```json
{{
"examples": [
{{
"text": "single passage text, unless domain is chat",
"turns": [
{{"speaker": "A", "text": "chat turn"}},
{{"speaker": "B", "text": "chat turn"}}
],
"entities": [
{{"surface": "verbatim text span", "type": "PERSON"}}
],
"query_types": ["PERSON", "DATE", "ORG", "PRODUCT"]
}}
]
}}
```
Comment on lines +79 to +84
def parse_llm_json(content: str) -> list[dict[str, Any]]:
payload = json.loads(content)
examples = payload.get("examples")
if not isinstance(examples, list):
raise CandidateRejected("LLM response must contain an examples array")
return examples
@olety olety marked this pull request as draft July 2, 2026 10:25
@olety

olety commented Jul 2, 2026

Copy link
Copy Markdown
Member Author

Boundary-check update: ONERB-3 is blocked on AC#4 as written.

The SOW requires output records to carry confidence: "synthetic-gold-candidate" plus generation_metadata and be valid against configs/schema.json. The current schema rejects both parts: the confidence enum does not include synthetic-gold-candidate, and top-level unevaluatedProperties: false rejects generation_metadata.

configs/schema.json is outside the ONERB-3 SOW file scope, so I marked this PR back to draft and wrote the required flag at /Users/olety/Desktop/code/fable-queue/oneironer-wave/onerb-3.flag.md. The scaffold diff remains scoped, but this PR should not merge until the schema/SOW conflict is resolved.

@olety olety marked this pull request as ready for review July 2, 2026 14:46
Copilot AI review requested due to automatic review settings July 2, 2026 14:46
@olety olety merged commit e0b55ea into main Jul 2, 2026
2 checks passed
@olety olety deleted the codex/onerb-3 branch July 2, 2026 14:49

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated 3 comments.

Comment on lines +79 to +84
def parse_llm_json(content: str) -> list[dict[str, Any]]:
payload = json.loads(content)
examples = payload.get("examples")
if not isinstance(examples, list):
raise CandidateRejected("LLM response must contain an examples array")
return examples
Comment on lines +337 to +340
content = llm.chat_completion(prompt)
examples = parse_llm_json(content)
records = []
rejected = []
Comment on lines +78 to +86
def import_decisions(candidates: Path, decisions: Path, out: Path) -> list[dict[str, Any]]:
records = {record["source_id"]: record for record in _load_jsonl(candidates)}
decision_rows = _load_decisions(decisions)
promoted = []
rejected = []

for source_id, candidate in records.items():
decision = decision_rows.get(source_id, {})
action = (decision.get("decision") or "").strip().lower()
Comment on lines +79 to +84
def parse_llm_json(content: str) -> list[dict[str, Any]]:
payload = json.loads(content)
examples = payload.get("examples")
if not isinstance(examples, list):
raise CandidateRejected("LLM response must contain an examples array")
return examples

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Action required

1. Llm parse can crash 🐞 Bug ☼ Reliability

parse_llm_json() calls json.loads() and then payload.get("examples") without guarding
JSONDecodeError or non-dict payloads, and generate() does not catch those failures. A single
malformed/unsupported LLM response aborts the whole generation run and prevents writing
candidates/rejected outputs.
Agent Prompt
### Issue description
`scripts/goldgen/generate.parse_llm_json()` assumes the LLM returns valid JSON with a top-level object and an `examples` array. If the provider returns malformed JSON, or a non-object JSON value (e.g., a list/string), the code raises `JSONDecodeError` or `AttributeError` and the whole `generate()` run aborts.

### Issue Context
This is a reliability problem because LLM responses can occasionally be malformed (provider outages, truncation, safety filters, etc.). Right now, those failures happen before any candidate/rejected JSONL is written.

### Fix Focus Areas
- scripts/goldgen/generate.py[79-84]
- scripts/goldgen/generate.py[323-359]

### Implementation notes
- In `parse_llm_json`, catch `json.JSONDecodeError` and raise `CandidateRejected` with a short message (optionally include a small preview of `content`).
- Also validate `payload` is a `dict` before calling `.get()`; otherwise raise `CandidateRejected`.
- In `generate()`, consider catching `CandidateRejected` around `parse_llm_json(content)` and writing a single rejected-row file (or raising a clean `SystemExit` that still leaves the prompt file behind), so the failure is diagnosable.

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools

Comment on lines +56 to +75
def _apply_fix(candidate: dict[str, Any], decision: dict[str, str]) -> dict[str, Any]:
fixed = dict(candidate)
if decision.get("fixed_text"):
fixed["text"] = decision["fixed_text"]
fixed.pop("turns", None)
fixed.pop("format", None)
if decision.get("fixed_turns_json"):
fixed["turns"] = _normal_turns(_load_fix_json(decision, "fixed_turns_json"))
fixed["format"] = "conversation"
fixed.pop("text", None)
if decision.get("fixed_entities_json"):
loaded_entities = _load_fix_json(decision, "fixed_entities_json")
if not isinstance(loaded_entities, list):
raise CandidateRejected("fixed_entities_json must contain an entity list")
raw_entities = _normal_entities(loaded_entities)
if fixed.get("format") == "conversation":
fixed["entities"] = _compute_conversation_entities(fixed["turns"], raw_entities)
else:
fixed["entities"] = _compute_passage_entities(fixed["text"], raw_entities)
return fixed

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Action required

2. Qa fixes keep stale spans 🐞 Bug ≡ Correctness

qa_import._apply_fix() updates fixed_text/fixed_turns_json but does not recompute entity start/end
spans unless fixed_entities_json is also provided. Any text/turn edit that shifts offsets will cause
promoted_schema_record() validation to fail with span mismatch and reject otherwise-valid QA fixes.
Agent Prompt
### Issue description
When QA provides `fixed_text` or `fixed_turns_json`, `_apply_fix()` changes the underlying text but leaves the original `entities` list (including `start`/`end`) untouched unless `fixed_entities_json` is also provided. This creates stale offsets, and promotion fails later because schema validation enforces `text[start:end] == surface`.

### Issue Context
This makes the `fix` workflow fragile: simple edits like adding/removing a character before an entity will cause rejection even if the entity surfaces still appear verbatim.

### Fix Focus Areas
- scripts/goldgen/qa_import.py[56-75]
- scripts/lib/schema.py[44-68]

### Implementation notes
Choose one of the following (either is acceptable, but be explicit):
1) **Auto-recompute spans** whenever `fixed_text` or `fixed_turns_json` is provided and `fixed_entities_json` is empty:
   - Build a raw entity list from existing entities using only `surface/type/original_type/(turn_index)` (ignore existing `start/end`).
   - Re-run `_compute_passage_entities()` or `_compute_conversation_entities()` to regenerate offsets.
2) **Fail fast with a clear error**: if text/turns changed but entities weren’t provided, raise `CandidateRejected` with a message like "fixed_text/fixed_turns_json requires fixed_entities_json (offsets must be recomputed)".

Option (1) is more forgiving and reduces unnecessary rejections.

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools

Comment on lines +260 to +271
entities = _normal_entities(example.get("entities", []))
metadata = _metadata(bucket, model=model)
base = {
"schema_version": "1.2",
"source": SOURCE,
"source_id": f"{bucket['name']}-{date.today().isoformat()}-{index}-{uuid.uuid4().hex[:8]}",
"language": bucket["language"],
"split": "eval",
"confidence": CANDIDATE_CONFIDENCE,
"provenance": _metadata_provenance(metadata),
"generation_metadata": metadata,
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Informational

3. Source_id date may drift 🐞 Bug ⚙ Maintainability

build_candidate_record() embeds date.today() (local time) into source_id while
generation_metadata/provenance uses a UTC generated_at-derived date. Around midnight or on non-UTC
systems, a single record can carry inconsistent dates across its ID vs provenance metadata.
Agent Prompt
### Issue description
`source_id` uses `date.today()` (local timezone) but `generation_metadata.date` is derived from a UTC timestamp. These can disagree and create confusing traceability.

### Issue Context
This matters for debugging/deduping and for reliably grouping outputs by a single run date.

### Fix Focus Areas
- scripts/goldgen/generate.py[87-95]
- scripts/goldgen/generate.py[252-271]

### Implementation notes
- Use a single UTC-derived date for both fields (e.g., `metadata['date']`) when constructing `source_id`, or compute `run_date = datetime.now(timezone.utc).date().isoformat()` once and reuse it for both `source_id` and `generation_metadata.date`.

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools

@qodo-code-review

Copy link
Copy Markdown

Code review by qodo was updated up to the latest commit 0b9c0cc

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants