- Tesseract 3
- ImageMagick
- Drop the scanned images for a particular state in the `training/input_images` folder.
- Run `./preprocess.sh` to generate the training images.
- Open the `jTessBoxEditorFX` Java program and edit the box files so they are accurate for the images in the `training_images/processed` folder (make sure to "save as" and overwrite after you edit the box values).
- `cd training` and edit the last line of the `train.sh` file to indicate where you want to move the finalized trained language file. This should be the `tessdata` folder of wherever Tesseract is installed.
  - The script is precoded to use `receipt_nh` as the output language name. You can change that in the `train.sh` file.
- Run `./train.sh` and you're done! The new language file will be copied to `tessdata` and will be available to the `tesseract` command using the `-l` flag.
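
To sanity-check the last step, here is a minimal sketch that looks for the trained file before you call Tesseract. It assumes the output follows Tesseract's usual `<lang>.traineddata` naming. `check_lang` is a hypothetical helper that is not part of this repo, and the tessdata path and image name in the usage comment are placeholders you will need to adjust.

```shell
#!/bin/sh
# Hypothetical helper (not part of this repo): confirm that train.sh
# copied the language file into the tessdata folder.
# Usage: check_lang <tessdata_dir> [lang]
check_lang() {
    tessdata_dir="$1"
    # receipt_nh is the default output language name set in train.sh
    lang="${2:-receipt_nh}"
    if [ -f "$tessdata_dir/$lang.traineddata" ]; then
        echo "found: $tessdata_dir/$lang.traineddata"
        return 0
    fi
    echo "missing: $tessdata_dir/$lang.traineddata" >&2
    return 1
}

# Example usage (path and image name are assumptions, adjust to your install):
#   check_lang /usr/share/tesseract-ocr/tessdata \
#       && tesseract receipt.png out -l receipt_nh
```

If the file is reported missing, the destination on the last line of `train.sh` is the first thing to re-check.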
Happy OCRing