fix: pre-ship dogfood fixes (DPI gate, UTF-8 I/O, scan orientation) #1
Merged
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Found while dogfooding the full pipeline on a real anonymized 20-exam batch before the v1 handoff. Three independent bugs, all with tests:
1. Configurable min scan DPI. The 150 DPI ingestion gate was hardcoded, so 144 DPI scans (the real sample) were rejected with no override. Add a `min_native_dpi` setting threaded through all `discover()` call sites.
2. Force UTF-8 on all text I/O. `write_text`/`read_text` defaulted to the platform encoding (cp1252 on Windows), crashing on any non-Latin-1 OCR output such as math superscripts or arrows. Pin `encoding="utf-8"` in the OCR cache, reporting, ingestion manifest, redaction sidecar, and assembler.
3. Dynamic page-orientation correction. Scans carried `/Rotate 270`, so the fixed top-band redaction masked the wrong edge and the student SID leaked into the page sent to Mistral and into the transcript; outputs also rendered sideways. Detect each exam's rotation (geometric text-axis test + identity-OCR to resolve the up/down flip), normalize to upright before masking, and bake the rotation into the embedded scan. Gated by `redaction.auto_orient` (default on).
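
For reviewers who want to see why item 3 leaked the SID, here is a toy sketch, not the code in this PR. It assumes the page raster is a small grid of characters, that the rotation has already been detected (the text-axis test and identity-OCR step are not shown), and that `rotate_cw` and `mask_top_band` are hypothetical helper names. A PDF `/Rotate` value is the clockwise rotation applied for display, so a scan tagged `/Rotate 270` is stored sideways and its visual top edge is not the first rows of the stored raster.

```python
from typing import List

Grid = List[List[str]]


def rotate_cw(grid: Grid, degrees: int) -> Grid:
    """Rotate a row-major grid clockwise by a multiple of 90 degrees."""
    if degrees % 90 != 0:
        raise ValueError("rotation must be a multiple of 90")
    for _ in range((degrees // 90) % 4):
        # One clockwise quarter turn: reverse the rows, then transpose.
        grid = [list(row) for row in zip(*grid[::-1])]
    return grid


def mask_top_band(grid: Grid, band_rows: int, fill: str = "#") -> Grid:
    """Return a copy with the first `band_rows` rows overwritten by `fill`."""
    return [
        [fill] * len(row) if i < band_rows else list(row)
        for i, row in enumerate(grid)
    ]


def contains_sid(grid: Grid) -> bool:
    """True if any cell still shows the toy SID marker 'S'."""
    return any("S" in row for row in grid)


# Upright page: the SID sits in the top row.
upright = [["S", "S"], ["a", "b"], ["c", "d"]]

# Stored raster of a scan that needs 270 degrees clockwise to display upright.
stored = rotate_cw(upright, 90)

# Old behavior (assumed): mask the top band of the stored raster as-is.
naive = mask_top_band(stored, band_rows=1)

# New behavior (assumed): normalize to upright first, then mask.
fixed = mask_top_band(rotate_cw(stored, 270), band_rows=1)

print("naive leaks SID:", contains_sid(naive))
print("fixed leaks SID:", contains_sid(fixed))
```

In the toy grid the SID ends up in the right-hand column of the stored raster, so a band over its first row misses it. Rotating to upright first puts the SID back under the band.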
Verified on the full batch: 20/20 exams, 0 failures, every SID masked, 0 leaks,
all scans upright. Suite: 59 passing (+6 orientation, +1 unicode-cache).
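
As a rough illustration of the encoding fix in item 2 and the kind of round trip a unicode-cache test might check, here is a minimal sketch. The helper names `write_cache`, `read_cache`, and `fails_in_cp1252` and the file name are hypothetical, not the project's actual functions. It shows that text containing an arrow cannot be encoded as cp1252, while pinning `encoding="utf-8"` on both write and read round-trips it on any platform.

```python
import tempfile
from pathlib import Path


def write_cache(path: Path, text: str) -> None:
    # Pin the encoding so the result does not depend on the platform default.
    path.write_text(text, encoding="utf-8")


def read_cache(path: Path) -> str:
    # Read with the same explicit encoding used for writing.
    return path.read_text(encoding="utf-8")


def fails_in_cp1252(text: str) -> bool:
    """True if `text` cannot be represented in the Windows cp1252 code page."""
    try:
        text.encode("cp1252")
    except UnicodeEncodeError:
        return True
    return False


# Sample OCR-like output: superscript four, rightwards arrow, infinity.
ocr_text = "x\u2074 \u2192 \u221e"

with tempfile.TemporaryDirectory() as tmp:
    cache_file = Path(tmp) / "page_001.txt"
    write_cache(cache_file, ocr_text)
    round_trip = read_cache(cache_file)

print("cp1252 would fail:", fails_in_cp1252(ocr_text))
print("utf-8 round trip ok:", round_trip == ocr_text)
```

Passing the encoding at every call site is more verbose than relying on a process-wide default, but it keeps each file's format explicit and independent of how the interpreter was launched.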