ocrd-pagetopdf

OCR-D wrapper for prima-page-to-pdf

Transforms all PAGE-XML+IMG to PDF with text layer and (optionally) polygon outlines.

(Converts original images together with text and layout annotations of all pages in the PAGE input file group to PDF. The text is rendered as an overlay.)

Requirements

GNU make
Python 3 with pip and venv
OCR-D
Java runtime (OpenJDK 8 works for PageToPdf 1.1.2)

Installation

Once you have installed Java, make, and Python and set up your virtual environment, run:

make deps # or: pip install ocrd
make install # copies into PREFIX or VIRTUAL_ENV
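
If installation fails, it can help to confirm that the prerequisites are actually on your PATH. The following is a minimal sketch, not part of this repository: the helper names are hypothetical, and the executable names are assumptions derived from the Requirements list (`ocrd` is only expected once `make deps` or `pip install ocrd` has run).

```python
import shutil

# Executable names are assumptions based on the Requirements list;
# adjust them if your system uses different names (e.g. "gmake").
REQUIRED_TOOLS = ["make", "python3", "java", "ocrd"]


def find_tools(names):
    """Map each executable name to its full path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


def missing_tools(names):
    """Return the names that could not be found on PATH."""
    return [name for name, path in find_tools(names).items() if path is None]


if __name__ == "__main__":
    for name in missing_tools(REQUIRED_TOOLS):
        print(f"missing: {name}")
```

The script only reports what is missing; it does not check versions, so a Java runtime other than OpenJDK 8 will still pass.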

Usage

The command-line interface conforms to OCR-D processor specifications.

Assuming you have an OCR-D workspace in your current working directory, simply do:

ocrd-pagetopdf -I PAGE-FILEGRP -O PDF-FILEGRP -p '{"textequiv_level" : "word"}'

This runs the processor and creates a PDF file for each page, with a text layer based on word-level annotations.

There is also an option to create an additional multipage file named merged.pdf, which contains all single pages in the correct order:

ocrd-pagetopdf -I PAGE-FILEGRP -O PDF-FILEGRP -p '{"textequiv_level" : "word", "multipage":"merged"}'
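
When calling the processor from a script, hand-written JSON inside shell quotes is easy to get wrong. The sketch below builds the same command line from Python and serializes the parameters with `json.dumps`. The function names `build_command` and `run` are hypothetical helpers, not part of ocrd-pagetopdf; only the `-I`, `-O`, and `-p` options and the parameter names shown in this README are used.

```python
import json
import subprocess


def build_command(input_grp, output_grp, **params):
    """Return the argument list for one ocrd-pagetopdf call.

    Only -I, -O and -p from the examples in this README are used; the
    keyword arguments become the JSON object passed to -p.
    """
    return [
        "ocrd-pagetopdf",
        "-I", input_grp,
        "-O", output_grp,
        "-p", json.dumps(params),
    ]


def run(input_grp, output_grp, **params):
    """Run the processor in the current workspace; raises if it exits non-zero."""
    subprocess.run(build_command(input_grp, output_grp, **params), check=True)
```

For example, `run("PAGE-FILEGRP", "PDF-FILEGRP", textequiv_level="word", multipage="merged")` should correspond to the multipage call above, assuming it is executed from the workspace directory with the virtual environment active.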

FAQ

Illegal reflective access by com.itextpdf.text.io.ByteBufferRandomAccessSource$1 to method java.nio.DirectByteBuffer.cleaner(): If this warning appears, try installing OpenJDK 8.
java.lang.NullPointerException: If this error appears, try the following workaround, which sets negative coordinates to zero:
```
ocrd-pagetopdf -I PAGE-FILEGRP -O PDF-FILEGRP -p '{"textequiv_level" : "word", "negative2zero": true}'
```
Some letters are illegible? The default display font (AletheiaSans.ttf) does not support all Unicode glyphs. If yours are missing, set a (monospace) Unicode font yourself:
```
ocrd-pagetopdf -I PAGE-FILEGRP -O PDF-FILEGRP -p '{"textequiv_level" : "word", "font": "/usr/share/fonts/truetype/ubuntu/UbuntuMono-R.ttf"}'
```
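
Before using the negative2zero workaround, you may want to see which annotations actually contain negative coordinates. The sketch below scans a PAGE-XML document for `points` attributes (as used by `Coords` and `Baseline` elements) with negative values. It rests on the assumption that such coordinates are what triggers the exception, which this README implies but does not state; the function name is hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# A minus sign directly before a digit, at the start of the attribute
# or right after whitespace or a comma, marks a negative x or y value.
NEGATIVE = re.compile(r"(^|[\s,])-\d")


def elements_with_negative_points(page_xml):
    """Return (parent tag, parent id, points) for every child element whose
    points attribute contains a negative coordinate.

    The XML namespace is stripped from the tag name. page_xml may be str or bytes.
    """
    hits = []
    root = ET.fromstring(page_xml)
    for parent in root.iter():
        for child in parent:
            points = child.get("points")
            if points and NEGATIVE.search(points):
                tag = parent.tag.rsplit("}", 1)[-1]
                hits.append((tag, parent.get("id"), points))
    return hits
```

To check a file, pass its contents, for instance `elements_with_negative_points(open(path, "rb").read())`, where `path` points to one PAGE-XML file of the input file group.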

The page labels of the multipage file can be changed with the pagelabelname parameter, e.g. to consecutive page numbers (pagenumber):

ocrd-pagetopdf -I PAGE-FILEGRP -O PDF-FILEGRP -p '{"textequiv_level" : "word", "multipage":"merged", "pagelabelname":"pagenumber"}'

UB-Mannheim/ocrd_pagetopdf