doclayout-ai

Finnish README: README.fi.md

AI-assisted document layout parsing: images and PDFs → structured Markdown + structural PDF.

Hybrid Quality Pipeline: PaddleOCR-VL handles text, reading order and paragraph structure. PP-StructureV3 handles layout, bounding boxes, images and PDF geometry.


Repo	github.com/FoxRav/doclayout-ai
Quick intro	`README-Dummies.md`
Setup	`docs/SETUP.md`
Known issues & lessons	`docs/ERRORS.md`
Languages	Primarily Finnish; also Swedish, English and other PaddleOCR languages
Tests	`pytest tests/` — unit + regression

What the tool does

doclayout-ai converts images and PDF files into two main outputs:

structured Markdown
rebuilt structural PDF

The tool does not copy the entire scan as a PDF background image. It detects document structure, text blocks, photos, headings, reading order and layout, then reconstructs the output.

image or PDF
    → hybrid parser: PaddleOCR-VL + PP-StructureV3
    → PageModel / layout model
    → final text cleanup
    → markdown + structural PDF
    → quality report

Principles

Rule	Implementation
Text regions rendered as PDF text	Headings, meta, body, caption — no raster crops for text
Photos only as image crops	Main photo / photo block — no text regions as raster
Markdown from PageModel	Not directly from raw OCR; final-text cleanup before output
No facsimile by default	Facsimile only with `--emit-facsimile` → `ocr/<name>_facsimile.pdf`
Quality control	Quality gate: `PASS` / `PASS_WITH_WARNINGS` / `FAIL`

Output files

Main output directory (same as input)

File	When	Content
`<name>.md`	Always	Primary markdown — UTF-8 BOM, structured text from PageModel
`<name>_structural.pdf`	Default (`--pdf-mode structural`)	Rebuilt page: typography and layout
`<name>_photo.jpg`	When StructureV3 finds an `image` block	Cropped photo
`<name>_clean.pdf`	Only `--emit-clean`	Reflowed text PDF
`<name>_vl.md`	Hybrid/best	VL raw markdown (comparison)

Not written to repo root: *_test.pdf, *_facsimile*.pdf (facsimile → ocr/).

`ocr/` directory (QA & debug)

File	Content
`<name>_vl_res.json`	PaddleOCR-VL raw output
`<name>_structurev3_res.json`	PP-StructureV3 raw output (hybrid/best)
`<name>_res.json`	StructureV3 (only `--engine structurev3`)
`<name>_hybrid_res.json`	Combined hybrid result
`<name>_compare_report.json`	VL vs StructureV3 differences (best)
`<name>_layout_debug.jpg`	Bbox visualization from source
`<name>_page_model.json`	PageModel debug
`<name>_structural_report.json`	Render report
`<name>_style_debug.json`	Typography and font roles
`<name>_visual_metrics.json`	Layout metrics (gap, whitespace, fonts)
`<name>_source_alignment_metrics.json`	Source vs rendered ratios
`<name>_content_audit.json`	Content audit
`<name>_quality_report.json`	Quality gate checks
`<name>_search_text.txt`	Searchable text layer (debug)
`<name>_facsimile.pdf`	Only `--emit-facsimile`
`<name>_ocr_overlay_debug.pdf`	Only `--debug-pdf`

Raw OCR JSON is not written directly to the final .md.

Final text cleanup

Module src/kuvien_parsinta/text/final_text_cleanup.py — final markdown + PDF text only, not raw OCR:

Remove Unicode replacement / control characters
Soft hyphens and line-break hyphenation
Literal fixes for known OCR errors

Quality gate

After each run, the CLI prints:

Result	Meaning
`QUALITY: PASS`	Content, layout, typography, cleanup and visual metrics OK
`QUALITY: PASS_WITH_WARNINGS`	Content intact; visual polish still needs tuning
`QUALITY: FAIL`	Missing content, text crop, hard layout error or font size below threshold

Checks: content audit, layout zones, visual metrics, text cleanup, crop policy.

Details: docs/ERRORS.md.

Usage example

kuvien-parsinta parse parsittavat\example\document.jpg --engine hybrid --quality max

Output:

document.md
document_structural.pdf
ocr/document_quality_report.json

Quick start

git clone https://github.com/FoxRav/doclayout-ai.git
cd doclayout-ai
powershell -ExecutionPolicy Bypass -File scripts\setup.ps1

.\scripts\activate.ps1
pip install -e ".[dev,pdf]"

kuvien-parsinta parse parsittavat\example\document.jpg
pytest tests/ -q

Input: parsittavat/ — local input directory (not committed to Git; only .gitkeep in the repo).

Output: same directory as input + ocr/ subdirectory.

CLI

kuvien-parsinta parse <file> [OPTIONS]
kuvien-parsinta languages

kuvien-parsinta parse <file> `
  [--engine hybrid|vl|structurev3|best|auto] `
  [--quality standard|max] `
  [--pdf-mode structural|clean|facsimile|all] `
  [--mode auto|flowing|structural] `
  [--out-dir DIR] `
  [--emit-facsimile] [--emit-clean] [--debug-pdf] `
  [--no-pdf]

`--engine`	Usage
`hybrid`	Default — VL text + StructureV3 layout/bbox/photos
`vl`	PaddleOCR-VL only
`structurev3`	PP-StructureV3 only
`best`	Hybrid + comparison report + separate raw outputs

`--quality`	Usage
`max`	Default — orientation, unwarping, best accuracy
`standard`	Faster run

`--pdf-mode`	Usage
`structural`	Default — `<name>_structural.pdf`
`clean`	Only `<name>_clean.pdf`
`facsimile`	Structural; facsimile separately
`all`	Structural; others only via `--emit-*` / `--debug-pdf`

`--mode`	Usage
`auto`	Default — output follows document type
`flowing`	Single-column text
`structural`	Force columns and layout PDF

Flag	Extra output
`--emit-facsimile`	`ocr/<name>_facsimile.pdf`
`--emit-clean`	`<name>_clean.pdf`
`--debug-pdf`	`ocr/_ocr_overlay_debug.pdf`, `ocr/_search_layer_debug.pdf`
`--no-pdf`	Markdown only

Configuration (`.env`)

All variables use the PARSE_ prefix. Place .env in the repo root.

Engines & OCR

Variable	Default	Meaning
`PARSE_ENGINE`	`hybrid`	Default engine
`PARSE_QUALITY`	`max`	`standard` \| `max`
`PARSE_USE_VL_FOR_TEXT`	`true`	Markdown text from VL
`PARSE_USE_VL_FOR_READING_ORDER`	`true`	Reading order from VL
`PARSE_USE_STRUCTUREV3_FOR_LAYOUT`	`true`	Bbox/columns from StructureV3
`PARSE_USE_STRUCTUREV3_FOR_IMAGES`	`true`	Photo cropping
`PARSE_USE_STRUCTUREV3_FOR_PDF_GEOMETRY`	`true`	PDF page size
`PARSE_NO_SILENT_FALLBACK`	`true`	Fallback is logged
`PARSE_VL_PIPELINE_VERSION`	`v1.6`	VL pipeline
`PARSE_VL_DEVICE`	`auto`	VL device
`PARSE_OCR_DEVICE`	`auto`	StructureV3 device
`PARSE_OCR_PRIMARY_LANGUAGE`	`fi`	Primary language
`PARSE_OCR_EXTRA_LANGUAGES`	`sv,en,de,…`	Fallback order

PDF & output

Variable	Default	Meaning
`PARSE_PDF_MODE`	`structural`	Primary PDF type
`PARSE_WRITE_PDF`	`true`	Generate PDF
`PARSE_EMIT_FACSIMILE`	`false`	Facsimile only when requested
`PARSE_EMIT_CLEAN`	`false`	Clean PDF only when requested
`PARSE_EMIT_DEBUG_PDF`	`false`	Debug PDFs
`PARSE_DEBUG_OUTPUT_DIR`	`ocr`	QA directory under input

Structural layout

Variable	Default	Meaning
`PARSE_RENDER_TEXT_AS_IMAGE`	`false`	Text rendered as PDF text, not as image
`PARSE_ALLOW_TEXT_CROPS`	`false`	Raster crop of text regions disabled
`PARSE_ALLOW_PHOTO_CROPS`	`true`	Photo crops allowed
`PARSE_LAYOUT_PRESERVE`	`true`	Preserve source layout where possible
`PARSE_NEWSPAPER_COMPACT`	`true`	Compact structural layout (vertical spacing)
`PARSE_BOTTOM_COLUMN_MIN_FONT_SIZE`	`5.5`	Column text minimum font size
`PARSE_BODY_MIN_FONT_SIZE`	`6.0`	Body text minimum font size

CUDA update: scripts\install_cuda.ps1

Note. Finnish StructureV3 OCR uses the Latin model via Swedish (sv) — PaddleOCR does not provide a separate fi code for the Structure pipeline.

Hybrid Quality Pipeline

Role	PaddleOCR-VL	PP-StructureV3
Responsibility	Text, headings, paragraph structure, reading order	Bbox, columns, photos, block types
Raw JSON	`ocr/<name>_vl_res.json`	`ocr/<name>_structurev3_res.json`
Combined		`ocr/<name>_hybrid_res.json`

Conflicts → ocr/<name>_compare_report.json. Text resolved by VL, layout by StructureV3.

kuvien-parsinta parse parsittavat\example\document.jpg
kuvien-parsinta parse parsittavat\example\document.jpg --engine best
kuvien-parsinta parse parsittavat\example\document.jpg --engine structurev3

Repository structure

├── src/kuvien_parsinta/
│   ├── cli.py, pipeline.py
│   ├── engines/          # VL, StructureV3, hybrid
│   ├── layout/           # PageLayout, PageModel, typography
│   ├── text/             # content assembly, final_text_cleanup
│   ├── pdf/              # structural renderer, facsimile
│   └── quality/          # content audit, visual metrics, quality gate
├── tests/
│   ├── unit/
│   └── regression/
├── scripts/
├── docs/
├── parsittavat/          # local inputs (not committed to Git)
└── pyproject.toml

.venv/, PaddleOCR/ and parsittavat/* (except .gitkeep) are not tracked by Git.

Status

Hybrid pipeline (VL + StructureV3)
Structural PDF: source-anchored layout, typography, compact spacing
Text as PDF text; photos as crops
Final text cleanup
Quality gate + debug reports (ocr/)
pytest (unit + regression)
Batch folder / multiple files at once (v0.2)
CI without GPU

Roadmap: docs/ROADMAP.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

doclayout-ai

What the tool does

Principles

Output files

Main output directory (same as input)

`ocr/` directory (QA & debug)

Final text cleanup

Quality gate

Usage example

Quick start

CLI

Configuration (`.env`)

Engines & OCR

PDF & output

Structural layout

Hybrid Quality Pipeline

Repository structure

Status

About

Uh oh!

Releases

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 10 Commits
docs		docs
parsittavat		parsittavat
requirements		requirements
scripts		scripts
src/kuvien_parsinta		src/kuvien_parsinta
tests		tests
.env.example		.env.example
.gitattributes		.gitattributes
.gitignore		.gitignore
README-Dummies.md		README-Dummies.md
README.fi.md		README.fi.md
README.md		README.md
pyproject.toml		pyproject.toml

Folders and files

Latest commit

History

Repository files navigation

doclayout-ai

What the tool does

Principles

Output files

Main output directory (same as input)

ocr/ directory (QA & debug)

Final text cleanup

Quality gate

Usage example

Quick start

CLI

Configuration (.env)

Engines & OCR

PDF & output

Structural layout

Hybrid Quality Pipeline

Repository structure

Status

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Contributors

Uh oh!

Languages

`ocr/` directory (QA & debug)

Configuration (`.env`)