Vincent Rasneur vrasneur@free.fr
Dependencies:
- pdftk
- ghostscript
- imagemagick
- tesseract
- aspell (optional)
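Before running the script, it can help to check that the tools listed above are installed. The helper below is a minimal sketch and is not part of ocr.sh; the name missing_tools is hypothetical, and the binary names passed to it are assumptions (for example, ghostscript is usually installed as gs and ImageMagick as convert or magick, depending on the version).

```shell
#!/bin/sh
# missing_tools CMD...: print the name of every command that is not on PATH.
# Hypothetical helper for illustration; not taken from ocr.sh.
missing_tools() {
    for tool in "$@"; do
        # command -v succeeds only if the shell can resolve the name.
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}
```

For example, calling missing_tools pdftk gs convert tesseract aspell prints one line per tool that could not be found, and nothing if all of them are available.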
By default, the script uses the French dictionaries of tesseract and aspell.
Use the -t argument to change the tesseract dictionary.
Use the -a argument to change the aspell dictionary.
By default, the script does not spell-check the output text. To enable spell-checking, add the -s argument (or use the -a argument).
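The option behaviour described above could be expressed with getopts roughly as follows. This is only a sketch of the documented behaviour, not the actual code of ocr.sh: the function name parse_args and the variable names are hypothetical, and the default codes fra (tesseract) and fr (aspell) are assumed from "French dictionaries".

```shell
#!/bin/sh
# parse_args: hypothetical illustration of the -t, -a and -s options.
# Not taken from ocr.sh; default language codes are assumptions.
parse_args() {
    tess_lang=fra      # assumed default tesseract dictionary (French)
    aspell_lang=fr     # assumed default aspell dictionary (French)
    spell_check=no     # spell-checking is off unless requested
    OPTIND=1           # reset so the function can be called more than once
    while getopts "t:a:s" opt; do
        case "$opt" in
            t) tess_lang=$OPTARG ;;
            a) aspell_lang=$OPTARG; spell_check=yes ;;  # -a also enables spell-checking
            s) spell_check=yes ;;
            *) return 1 ;;
        esac
    done
    shift $((OPTIND - 1))
    pdf_file=${1:-}    # first remaining argument is the PDF to process
}
```

With this sketch, parse_args -t eng -a en document.pdf leaves tess_lang set to eng, aspell_lang set to en, and spell_check set to yes.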
To OCR a PDF file:
    ocr.sh document.pdf
To OCR a PDF file and spell-check each page:
    ocr.sh -s document.pdf
To OCR an English PDF and spell-check it:
    ocr.sh -t eng -a en document.pdf
For a PDF file named doc1.pdf, the script:
- creates a directory named doc1
- for each PDF page, creates a file named pg_<number>.txt inside this directory
Or, if the -c argument is used, the script:
- creates a directory named doc1
- creates a single file named doc1/doc1.txt
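If the per-page files already exist and a single text file is wanted without re-running the script, the pages can be concatenated by hand. The function below is a minimal sketch under the assumption that the files are named exactly pg_<number>.txt as described above; join_pages is a hypothetical name and is not part of ocr.sh.

```shell
#!/bin/sh
# join_pages DIR: print every DIR/pg_<number>.txt in numeric page order.
# Hypothetical helper for illustration; not taken from ocr.sh.
join_pages() {
    (
        cd "$1" || exit 1
        # Sort on the number after "pg_" so that pg_10 comes after pg_9,
        # not right after pg_1 as plain alphabetical order would give.
        for page in $(printf '%s\n' pg_*.txt | sort -t_ -k2n); do
            [ -e "$page" ] || continue   # skip the unexpanded pattern if no page exists
            cat "$page"
        done
    )
}
```

For example, join_pages doc1 > doc1_all.txt writes all pages of doc1 to one file. Sorting numerically matters because the page numbers may not be zero-padded.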