Tesseract docker images for use in mass production

This repository contains instructions for building and using a Tesseract system in mass production (operationalizing the instructions at: https://tesseract-ocr.github.io/tessdoc/Compiling-%E2%80%93-GitInstallation.html#release-builds-for-mass-production). Essentially, this forces Tesseract to use single-threading on image-level, parallelizing over images. The system is used in the DHLAB at the National Library of Norway.

Build docker image

docker build -t tesseract_massproduction .

Test run

docker run --rm tesseract_massproduction tesseract --list-langs

Example pipeline for IIIF (returning ALTO XML) - modify your paths accordingly

1. Run tesseract on a specific URN

docker run -it -v $PWD/data:/data tesseract_massproduction sh -c "python3 process.py URN:NBN:no-nb_digibok_2018080126011 nor-frak alto"

process.py takes the following arguments:

URN = must be a valid full URN from nb.no (e.g. URN:NBN:no-nb_digibok_2018080126011)
model_name = must be a Tesseract model available in the Docker image (see models/) or a model file otherwise mountained into the container where Tesseract expects to find models (/usr/local/share/tessdata)
output_format = one of the following (alto, hocr, page, pdf, text), if unspecified, the script will output text to stdout

2. Do a quick evaluation using word frequency lists

docker run -it -v $PWD/data:/data tesseract_massproduction sh -c "find /data -type f | python3 validate.py | head"

3. Transform each object (pixel to mm, de-hyphenation etc.)

find data -mindepth 1 -type d -printf "%f\n" | parallel -u -j 5 "docker run -i -v $PWD/data:/data tesseract_massproduction python3 transform_alto.py /data/{}"

4. Make tarballs of each object

cd data
find * -type d -name "*_transformed" | parallel -j10 -u "cd {} && tar cf ../{=s/_transformed// =}_ocr_xml.tar *"

Example pipeline for local files, using GNU Parallel (returning HOCR)

find * -type f -name "*.tif" | parallel -j 5 "echo {} && docker run -v -v $PWD/data:/data --rm tesseract_massproduction tesseract /data/{} /data/{} -c tessedit_create_hocr=1 -c hocr_font_info=0 -l eng"

Name		Name	Last commit message	Last commit date
Latest commit History 45 Commits
data		data
models		models
.gitignore		.gitignore
Dockerfile		Dockerfile
README.md		README.md
alto2txt.py		alto2txt.py
process.py		process.py
requirements.txt		requirements.txt
transform_alto.py		transform_alto.py
validate.py		validate.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Tesseract docker images for use in mass production

Build docker image

Test run

Example pipeline for IIIF (returning ALTO XML) - modify your paths accordingly

1. Run tesseract on a specific URN

2. Do a quick evaluation using word frequency lists

3. Transform each object (pixel to mm, de-hyphenation etc.)

4. Make tarballs of each object

Example pipeline for local files, using GNU Parallel (returning HOCR)

About

Releases

Packages

Languages

Sprakbanken/tesseract_massproduction_docker

Folders and files

Latest commit

History

Repository files navigation

Tesseract docker images for use in mass production

Build docker image

Test run

Example pipeline for IIIF (returning ALTO XML) - modify your paths accordingly

1. Run tesseract on a specific URN

2. Do a quick evaluation using word frequency lists

3. Transform each object (pixel to mm, de-hyphenation etc.)

4. Make tarballs of each object

Example pipeline for local files, using GNU Parallel (returning HOCR)

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages