This repository contains instructions for building and using a Tesseract system for mass production, operationalizing the instructions at https://tesseract-ocr.github.io/tessdoc/Compiling-%E2%80%93-GitInstallation.html#release-builds-for-mass-production. In essence, each Tesseract process runs single-threaded on one image, and parallelism comes from processing many images at once. The system is used in the DHLAB at the National Library of Norway.
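The idea of one single-threaded Tesseract process per image, with many images in flight at once, can be sketched roughly as follows. This is an illustration and not code from this repository: the function names are hypothetical, and the sketch caps threads at runtime with the standard OpenMP variable `OMP_THREAD_LIMIT`, assuming a Tesseract binary that may have been built with OpenMP. The linked build instructions instead disable OpenMP at compile time, in which case the variable is simply ignored.

```python
import os
import subprocess
from concurrent.futures import ThreadPoolExecutor


def single_thread_env():
    """Copy the current environment and cap OpenMP at one thread."""
    env = dict(os.environ)
    env["OMP_THREAD_LIMIT"] = "1"
    return env


def build_command(image, lang="eng"):
    """Tesseract CLI call: input image, output base name, language."""
    output_base = os.path.splitext(image)[0]
    return ["tesseract", image, output_base, "-l", lang]


def ocr_one(image, lang="eng"):
    """Run one Tesseract process for one image and return its exit code."""
    result = subprocess.run(build_command(image, lang), env=single_thread_env())
    return result.returncode


def ocr_all(images, workers=4, lang="eng"):
    """Process several images at once, one Tesseract process per image."""
    # Threads are sufficient here: the heavy work happens in the child processes.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda image: ocr_one(image, lang), images))
```

Calling `ocr_all(["/data/a.tif", "/data/b.tif"], workers=2)` would start up to two Tesseract processes at a time, assuming `tesseract` is on the `PATH`. Setting `workers` close to the number of available CPU cores is a reasonable starting point.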
docker build -t tesseract_massproduction .
docker run --rm tesseract_massproduction tesseract --list-langs
docker run -it -v $PWD/data:/data tesseract_massproduction sh -c "python3 process.py URN:NBN:no-nb_digibok_2018080126011 nor-frak alto"
process.py takes the following arguments:
- URN: a valid full URN from nb.no (e.g. URN:NBN:no-nb_digibok_2018080126011)
- model_name: a Tesseract model available in the Docker image (see models/), or a model file mounted into the container where Tesseract expects to find models (/usr/local/share/tessdata)
- output_format: one of alto, hocr, page, pdf, or text; if unspecified, the script writes text to stdout
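The argument list above could be expressed with `argparse` roughly as shown below. This is a hypothetical sketch of the interface only; the actual process.py may parse its arguments differently, and the names `parse_args` and `OUTPUT_FORMATS` are assumptions made for illustration.

```python
import argparse

# Output formats named in the argument list above.
OUTPUT_FORMATS = ("alto", "hocr", "page", "pdf", "text")


def parse_args(argv=None):
    """Parse URN, model name, and an optional output format."""
    parser = argparse.ArgumentParser(prog="process.py")
    parser.add_argument("urn", help="full URN from nb.no")
    parser.add_argument("model_name", help="Tesseract model, e.g. nor-frak")
    parser.add_argument(
        "output_format",
        nargs="?",               # optional positional argument
        choices=OUTPUT_FORMATS,  # reject anything outside the listed formats
        default=None,            # None stands for "write text to stdout"
        help="one of: " + ", ".join(OUTPUT_FORMATS),
    )
    return parser.parse_args(argv)
```

With this layout, `parse_args(["URN:NBN:no-nb_digibok_2018080126011", "nor-frak"])` leaves `output_format` as `None`, which the caller can treat as the text-to-stdout case.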
docker run -it -v $PWD/data:/data tesseract_massproduction sh -c "find /data -type f | python3 validate.py | head"
find data -mindepth 1 -type d -printf "%f\n" | parallel -u -j 5 "docker run -i -v $PWD/data:/data tesseract_massproduction python3 transform_alto.py /data/{}"
cd data
find * -type d -name "*_transformed" | parallel -j10 -u "cd {} && tar cf ../{=s/_transformed// =}_ocr_xml.tar *"
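For readers less familiar with GNU Parallel's `{= =}` replacement syntax, the archiving step above can be approximated in Python as follows. This is a sketch under assumptions, not part of the repository: the function names are hypothetical, and it assumes the `*_transformed` directories sit directly inside the current directory.

```python
import tarfile
from pathlib import Path

SUFFIX = "_transformed"


def archive_name(dir_name):
    """Map 'X_transformed' to 'X_ocr_xml.tar' (first match removed, like s///)."""
    return dir_name.replace(SUFFIX, "", 1) + "_ocr_xml.tar"


def archive_transformed(directory):
    """Write an uncompressed tar of the directory's contents next to it."""
    directory = Path(directory)
    target = directory.parent / archive_name(directory.name)
    with tarfile.open(target, "w") as tar:
        for entry in sorted(directory.iterdir()):
            if entry.name.startswith("."):
                continue  # the shell glob * skips dotfiles as well
            # Store entries relative to the directory, as `cd dir && tar cf ... *` does.
            tar.add(entry, arcname=entry.name)
    return target
```

Running `archive_transformed("book1_transformed")` would produce `book1_ocr_xml.tar` in the parent directory; looping over `Path(".").glob("*_transformed")` covers the non-parallel case.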
find * -type f -name "*.tif" | parallel -j 5 "echo {} && docker run -v $PWD/data:/data --rm tesseract_massproduction tesseract /data/{} /data/{} -c tessedit_create_hocr=1 -c hocr_font_info=0 -l eng"