v1.0.2 #4
AronDaron
announced in
Announcements
Hardening: Strict example schema validation prevents extra-key garbage in JSONL exports
Problem
A merged dataset (3378 rows) failed to load in Unsloth Studio with:
One row out of 3378 had an extra top-level `gpt` key alongside `conversations`: the model leaked an additional turn pair as a separate top-level field instead of appending it to `conversations`. HuggingFace Datasets / Arrow treats top-level keys as columns, so 3377 rows with one column plus 1 row with two columns broke schema unification at load time.

A second case was found during audit: another row had an extra `metadata` key inside individual turn objects (`conversations[i].metadata.rationale`). Arrow tolerates this by widening the struct schema (the extra field becomes nullable), and chat-template tokenisers ignore unknown turn keys, so this one slipped silently into fine-tuning. It is still schema noise that should not be in a clean dataset.

The existing `_validate_example_structure` in `backend/app/services/job_runner.py` accepted both. It checked that required keys had valid content (e.g. `conversations` is a non-empty alternating list of `human`/`gpt` turns) but never inspected the set of top-level keys, nor the set of keys inside each turn. The export and merge paths copied `content_json` 1:1 to JSONL with no filter.

Fix
New module `backend/app/services/example_schema.py` is the single source of truth for format whitelists:

| Format | Top-level keys | Turn keys | Roles |
| --- | --- | --- | --- |
| `alpaca` | `instruction`, `input`, `output` | n/a | n/a |
| `sharegpt` | `conversations` | `from`, `value` | `human` / `gpt` |
| `chatml` | `messages` | `role`, `content` | `user` / `assistant` |

Two operations on this schema:

- `validate_example(parsed, fmt) → ValidationResult`: strict. Returns `ok=False` for any extra top-level key, extra turn key, role mismatch, missing required key, wrong type, empty value, too few turns (<2), unknown format, or non-dict payload. Wired into the generation pipeline (`_validate_example_structure` is now a thin wrapper). Rejected examples follow the existing 2-attempt retry; on second failure they are skipped. The structured `reason` and `detail` are attached to the `generation_invalid_structure` activity event for UI surfacing.
- `strip_to_schema(parsed, fmt) → (cleaned, dropped_paths)`: defensive cleanup that removes extra keys without validating types. Used by `serialize_for_jsonl`, called from `export_service.export_job` and `merge_service._jsonl_line`. Both paths emit a single warning summary log when any rows were stripped; merge additionally fires a `merge_strip_extra_keys` activity event with `rows_affected`.
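The two operations could look roughly like the sketch below. It is not the project's `example_schema.py`: the `SCHEMAS` layout, the reason strings, and the choice to let alpaca's `input` be empty are assumptions made for illustration. Only the whitelists and the two function signatures come from the description above.

```python
from dataclasses import dataclass
from typing import Any

# Whitelists from the table above; the dict layout itself is an assumption.
SCHEMAS: dict[str, dict[str, Any]] = {
    "alpaca": {"top": {"instruction", "input", "output"}},
    "sharegpt": {"top": {"conversations"}, "turns": "conversations",
                 "turn_keys": {"from", "value"}, "role_key": "from",
                 "roles": ("human", "gpt")},
    "chatml": {"top": {"messages"}, "turns": "messages",
               "turn_keys": {"role", "content"}, "role_key": "role",
               "roles": ("user", "assistant")},
}


@dataclass
class ValidationResult:
    ok: bool
    reason: str = ""
    detail: str = ""


def _fail(reason: str, detail: str = "") -> ValidationResult:
    return ValidationResult(False, reason, detail)


def validate_example(parsed: Any, fmt: str) -> ValidationResult:
    """Strict check: any key outside the whitelist rejects the example."""
    schema = SCHEMAS.get(fmt)
    if schema is None:
        return _fail("unknown_format", fmt)
    if not isinstance(parsed, dict):
        return _fail("not_a_dict", type(parsed).__name__)
    extra = sorted(set(parsed) - schema["top"])
    if extra:
        return _fail("extra_top_level_keys", f"extra top-level keys: {extra}")
    if "turns" not in schema:  # alpaca: flat string fields
        # Assumption: `input` may be empty or absent; the source does not say.
        for key in ("instruction", "output"):
            if not isinstance(parsed.get(key), str) or not parsed[key].strip():
                return _fail("missing_or_empty", key)
        return ValidationResult(True)
    turns = parsed.get(schema["turns"])
    if not isinstance(turns, list) or len(turns) < 2:
        return _fail("too_few_turns", schema["turns"])
    for i, turn in enumerate(turns):
        if not isinstance(turn, dict):
            return _fail("wrong_type", f"turn {i}")
        if set(turn) != schema["turn_keys"]:
            return _fail("turn_keys_mismatch", f"turn {i}: {sorted(turn)}")
        # Roles must alternate, starting with the first whitelisted role.
        if turn[schema["role_key"]] != schema["roles"][i % 2]:
            return _fail("role_mismatch", f"turn {i}")
        value_key = (schema["turn_keys"] - {schema["role_key"]}).pop()
        if not isinstance(turn[value_key], str) or not turn[value_key].strip():
            return _fail("missing_or_empty", f"turn {i}.{value_key}")
    return ValidationResult(True)


def strip_to_schema(parsed: dict, fmt: str) -> tuple[dict, list[str]]:
    """Drop non-whitelisted keys without validating types; input is not mutated."""
    schema = SCHEMAS[fmt]
    cleaned, dropped = {}, []
    for key, value in parsed.items():
        if key not in schema["top"]:
            dropped.append(key)
        elif key == schema.get("turns") and isinstance(value, list):
            turns = []
            for i, turn in enumerate(value):
                if isinstance(turn, dict):
                    dropped += [f"{key}[{i}].{k}" for k in turn
                                if k not in schema["turn_keys"]]
                    turn = {k: v for k, v in turn.items()
                            if k in schema["turn_keys"]}
                turns.append(turn)
            cleaned[key] = turns
        else:
            cleaned[key] = value
    return cleaned, dropped
```

Under these assumptions, a payload shaped like `{"conversations": [...], "gpt": [...]}` is rejected by `validate_example` with the detail `extra top-level keys: ['gpt']`, while `strip_to_schema` returns a copy without `gpt` and reports `['gpt']` as the dropped path. Building a new dict instead of deleting keys in place mirrors the point that stored rows are left untouched.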
examples.content_jsonrows remain intact and any re-export is idempotent for legacy data.list_examples, the SSE stream, and dedup endpoints continue to return raw DB content (debug-friendly).Impact
- `POST /api/datasets/merge`.
- `generation_invalid_structure` activity event now carries `reason` and `detail`, so the UI can show why a generation was rejected (e.g. `"extra top-level keys: ['gpt']"`).

Why this didn't surface earlier

The old validator's `sharegpt` branch iterated over `conversations` and verified each turn's `from`/`value` fields, but it never inspected the set of top-level keys. A model returning `{"conversations": [...], "gpt": [...]}` looked valid because `conversations` was internally well-formed; the second top-level key was simply ignored.

The bug only surfaces at HuggingFace load time, not during generation, validation, export, or upload, so it sat dormant in datasets until someone tried to fine-tune.
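To make the blind spot concrete, the sketch below reconstructs the kind of check described above. It is a hypothetical stand-in, not the code from `job_runner.py`: the names `lenient_sharegpt_check` and `column_sets` and the sample rows are invented for illustration. It also counts the distinct top-level key sets across rows, which is the property a columnar loader needs to be uniform.

```python
from collections import Counter


def lenient_sharegpt_check(parsed: dict) -> bool:
    """Hypothetical old-style check: looks only inside `conversations`."""
    turns = parsed.get("conversations")
    if not isinstance(turns, list) or not turns:
        return False
    for i, turn in enumerate(turns):
        expected = "human" if i % 2 == 0 else "gpt"
        if not isinstance(turn, dict) or turn.get("from") != expected:
            return False
        if not isinstance(turn.get("value"), str) or not turn["value"]:
            return False
    return True  # never looks at set(parsed) or set(turn)


def column_sets(rows: list[dict]) -> Counter:
    """Count how many rows share each set of top-level keys."""
    return Counter(frozenset(row) for row in rows)


good = {"conversations": [{"from": "human", "value": "hi"},
                          {"from": "gpt", "value": "hello"}]}
leaky = {**good, "gpt": [{"from": "gpt", "value": "stray turn"}]}

rows = [good] * 3377 + [leaky]
shapes = column_sets(rows)
print(lenient_sharegpt_check(leaky))  # True: the extra key is invisible here
print(len(shapes))                    # 2 distinct column sets in 3378 rows
```

A per-row content check passes every row individually, yet the dataset as a whole has two shapes. That mismatch only matters once something tries to unify all rows into one schema.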
Tests
402 tests pass; no regressions in the existing suite.
Files:
- `backend/app/services/example_schema.py` (new)
- `backend/app/services/job_runner.py`
- `backend/app/services/export_service.py`
- `backend/app/services/merge_service.py`
- `backend/app/services/event_log.py`

Hotfix: `JobConfig.categories` `max_length` blocked merges with >10 sub-categories
Problem
`GET /api/jobs` returned 500 Internal Server Error after merging 8 source jobs. The History page wouldn't load ("Failed to fetch"). The backend itself was healthy (`/api/health` returned 200); only the jobs list endpoint crashed.

Root cause in `backend/app/models/jobs.py:23`:

The `max_length=10` was intended as a sanity cap for user-created jobs. But the same Pydantic model is used in the read path to parse existing jobs from the database. Merging 8 source jobs × 2-3 sub-categories each yields 22 sub-categories, which triggered:

→ `_row_to_list_item` raised → the entire endpoint failed.

Fix
One line in `backend/app/models/jobs.py:23`:

`64` leaves headroom for nested merges (e.g. 8 sources × 3 sub-categories × 2 levels of nesting).

Impact
Why this didn't surface earlier
Earlier merge tests (including the 400k-example stress test referenced in internal notes) used jobs with ≤10 sub-categories each. The first 8-job merge with ~3 sub-categories each landed in production on 2026-04-27 and hit this previously dormant constraint.
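The failure mode can be reproduced without the application: a length cap enforced whenever a model is constructed also applies when stored rows are parsed back. The sketch below uses a plain dataclass as a stand-in for the Pydantic model, since the original field definition is not reproduced here. `JobConfigSketch`, `max_categories`, and the category names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class JobConfigSketch:
    """Stand-in for a model whose list field has a length cap."""
    categories: list[str] = field(default_factory=list)
    max_categories: int = 10  # the old cap; 64 after the hotfix

    def __post_init__(self) -> None:
        # Runs on every construction, so it guards the write path
        # (new jobs) and the read path (rows loaded from the DB) alike.
        if len(self.categories) > self.max_categories:
            raise ValueError(
                f"categories has {len(self.categories)} items, "
                f"max is {self.max_categories}"
            )


# A merged job as stored in the DB: 22 sub-categories from 8 sources.
stored = [f"sub-{i}" for i in range(22)]

try:
    JobConfigSketch(categories=stored)  # old cap: parsing the row fails
    old_cap_ok = True
except ValueError:
    old_cap_ok = False

new = JobConfigSketch(categories=stored, max_categories=64)  # parses
worst_case = 8 * 3 * 2  # sources x sub-categories x nesting levels = 48
print(old_cap_ok, len(new.categories), worst_case <= 64)  # False 22 True
```

An alternative design would enforce the cap only where jobs are created and parse stored rows with a more permissive model. As described above, the hotfix keeps a single model and raises the cap instead.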
File: `backend/app/models/jobs.py`

This discussion was created from the release v1.0.2.