pdf-ocr-overlay

A simple way to make scanned PDFs searchable, based on Tesseract.

$ ./ocr.py scanned.pdf output.pdf

Adds a text overlay to a scanned PDF so that your document archive can be indexed and text is easier to find in large documents.
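The following is only a rough sketch of how the tools listed under Dependencies could be chained to produce such an overlay; it is not taken from ocr.py, and the actual script may work differently. The function names and the `page` file prefix are hypothetical. It also assumes a Tesseract version that writes its hOCR output with a `.hocr` extension (older releases used `.html`) and an otherwise empty working directory.

```python
import subprocess
from pathlib import Path


def extract_cmd(pdf, prefix):
    # pdfimages writes the embedded images as prefix-000.ppm, prefix-001.ppm, ...
    return ["pdfimages", str(pdf), str(prefix)]


def ocr_cmd(image, out_base):
    # The "hocr" config makes tesseract emit recognized text with positions.
    return ["tesseract", str(image), str(out_base), "hocr"]


def overlay_cmd(image, page_pdf):
    # hocr2pdf reads hOCR on stdin and writes a PDF with the image plus a text layer.
    return ["hocr2pdf", "-i", str(image), "-o", str(page_pdf)]


def merge_cmd(page_pdfs, output):
    # Ghostscript's pdfwrite device concatenates the per-page PDFs.
    return ["gs", "-q", "-dNOPAUSE", "-dBATCH", "-sDEVICE=pdfwrite",
            "-sOutputFile=" + str(output)] + [str(p) for p in page_pdfs]


def make_searchable(pdf, output, workdir):
    """Hypothetical driver: assumes workdir exists and is empty."""
    workdir = Path(workdir)
    subprocess.run(extract_cmd(pdf, workdir / "page"), check=True)
    page_pdfs = []
    # sorted() materializes the list before any .hocr/.pdf files are created.
    for image in sorted(workdir.glob("page-*")):
        base = image.with_suffix("")
        subprocess.run(ocr_cmd(image, base), check=True)
        page_pdf = base.with_suffix(".pdf")
        with open(base.with_suffix(".hocr"), "rb") as hocr:
            subprocess.run(overlay_cmd(image, page_pdf), stdin=hocr, check=True)
        page_pdfs.append(page_pdf)
    subprocess.run(merge_cmd(page_pdfs, output), check=True)
```

Keeping the command construction separate from execution makes each step easy to inspect or adjust, for example to pass a language option to tesseract.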

Dependencies

python
tesseract-ocr (tesseract)
exactimage (hocr2pdf)
poppler-utils (pdfimages)
ghostscript (gs)
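
Before running the script, it can help to confirm that the external tools are installed. The snippet below is a minimal sketch, not part of ocr.py; the helper name `missing_tools` is hypothetical, and the executable names are the ones given in parentheses above.

```python
import shutil

# Executable names from the dependency list above.
REQUIRED_TOOLS = ["tesseract", "hocr2pdf", "pdfimages", "gs"]


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the tools that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing dependencies: " + ", ".join(missing))
    else:
        print("All dependencies found.")
```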

Alternatives

OCRFeeder lets you rebuild documents from scanned images and documents. It has layout analysis, front-end editing, spell-checking and much more. If you want to modify the extracted text, this is the way to go. However, exported PDFs with an overlay tend to be very large, and processing a batch of documents takes some time.

Google Docs does OCR for uploaded documents. I've not tested this extensively but it's probably good and will probably get better over time. If you want to upload your documents to Google anyway, go for that.

pdfocr by Geza Kovacs is quite similar to this project, although I haven't used it yet. It's based on cuneiform and implemented in Ruby.
