fix: pre-ship dogfood fixes (DPI gate, UTF-8 I/O, scan orientation) #1

Merged
briacSck merged 1 commit into main from fix/preship-dogfood-bugs
Jun 20, 2026

@briacSck

Found while dogfooding the full pipeline on a real anonymized 20-exam batch before the v1 handoff. Three independent bugs, all with tests:

  1. Configurable min scan DPI. The 150 DPI ingestion gate was hardcoded, so 144 DPI scans (the real sample) were rejected with no override. Add a `min_native_dpi` setting threaded through all `discover()` call sites.

  2. Force UTF-8 on all text I/O. `write_text`/`read_text` defaulted to the platform encoding (cp1252 on Windows), crashing on OCR output that cp1252 cannot encode, such as math superscripts or arrows. Pin `encoding="utf-8"` in the OCR cache, reporting, ingestion manifest, redaction sidecar, and assembler.

  3. Dynamic page-orientation correction. Scans carried `/Rotate 270`, so the fixed top-band redaction masked the wrong edge and the student SID leaked into the page sent to Mistral and into the transcript; outputs also rendered sideways. Detect each exam's rotation (geometric text-axis test + identity-OCR to resolve the up/down flip), normalize to upright before masking, and bake the rotation into the embedded scan. Gated by `redaction.auto_orient` (default on).
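
To illustrate why fix 3 normalizes before masking, here is a minimal sketch on a toy character grid rather than a real PDF. The helper names (`rotate_cw`, `mask_top_band`) and the grid representation are assumptions made for illustration, not the project's API, and the rotation is taken as already known; the text-axis test and identity-OCR detection steps are not reproduced.

```python
from typing import List

Grid = List[List[str]]


def rotate_cw(grid: Grid, degrees: int) -> Grid:
    """Rotate a row-major grid clockwise by a multiple of 90 degrees."""
    if degrees % 90 != 0:
        raise ValueError("rotation must be a multiple of 90")
    out = [row[:] for row in grid]
    for _ in range((degrees // 90) % 4):
        # One clockwise quarter turn: reverse the rows, then transpose.
        out = [list(col) for col in zip(*out[::-1])]
    return out


def mask_top_band(grid: Grid, rows: int) -> Grid:
    """Replace the first `rows` rows with '#', leaving the rest untouched."""
    return [["#"] * len(r) if i < rows else r[:] for i, r in enumerate(grid)]


# Upright page: the identifier occupies the top row.
upright = [list("SID"), list("abc"), list("def")]

# Hypothetical stored scan: the raster sits a quarter turn away from reading
# orientation, so a further 270 degree clockwise turn is needed to read it.
stored = rotate_cw(upright, 90)

# Masking the stored raster directly covers the wrong edge.
naive = mask_top_band(stored, 1)

# Normalizing to upright first puts the identifier under the band.
fixed = mask_top_band(rotate_cw(stored, 270), 1)
```

In this toy case `naive` still contains the characters `I` and `D`, while `fixed` has the whole identifier row masked and the body rows intact.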

Verified on the full batch: 20/20 exams, 0 failures, every SID masked, 0 leaks, all scans upright. Suite: 59 passing (+6 orientation, +1 unicode-cache).
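
For context on the unicode-cache test, a round-trip check along these lines would exercise the UTF-8 fix. This is a sketch under assumptions: `write_cache`, `read_cache`, `cp1252_can_encode`, and the file name are hypothetical stand-ins, not the project's actual cache functions or test.

```python
import tempfile
from pathlib import Path


def write_cache(path: Path, text: str) -> None:
    # Pin the encoding so the result does not depend on the platform default.
    path.write_text(text, encoding="utf-8")


def read_cache(path: Path) -> str:
    return path.read_text(encoding="utf-8")


def cp1252_can_encode(text: str) -> bool:
    """Report whether the Windows default code page could encode the text."""
    try:
        text.encode("cp1252")
    except UnicodeEncodeError:
        return False
    return True


# Superscript n (U+207F) and a rightwards arrow (U+2192), as OCR might emit.
sample = "x\u207f \u2192 y"

with tempfile.TemporaryDirectory() as tmp:
    cache_file = Path(tmp) / "ocr_cache.txt"
    write_cache(cache_file, sample)
    round_tripped = read_cache(cache_file)
```

The sample string cannot be encoded as cp1252, which is the failure mode described in fix 2, yet it survives the round trip once the encoding is pinned.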

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
briacSck merged commit 2b9fd81 into main on Jun 20, 2026
1 check passed
briacSck deleted the fix/preship-dogfood-bugs branch on June 20, 2026, 12:48