Awesome OCR

This list collects links to software tools, libraries, and literature related to Optical Character Recognition (OCR).

Contributions are welcome, as is feedback.

- Software
  - OCR engines
  - OCR file formats
    - hOCR
    - ALTO XML
  - OCR GUI
  - OCR Preprocessing
  - OCR as a Service
  - OCR evaluation
  - OCR libraries by programming language
    - Go
    - Java
    - .Net
    - NodeJS
    - PHP
    - Python
    - Ruby
- Literature

Software

OCR engines

tesseract - The definitive Open Source OCR engine, Apache 2.0
ocropus - OCR engine based on LSTM, Apache 2.0
Ocrad - The GNU OCR. GPL
ocrad.js - JavaScript port (emscripten) of ocrad
ocracy - Pure JavaScript LSTM RNN implementation based on ocropus
kraken - Ocropus fork with sane defaults
digit - OCR for numbers in meter displays, such as a power meter, using Caffe
ocular - Machine-learning OCR for historic documents
SwiftOCR - Fast and simple OCR library written in Swift

OCR file formats

hOCR

hocr-tools - Tools for doing various useful things with hOCR files, Apache 2.0
hocr-spec - hOCR 1.1 specification
ocr-transform - CLI tool to convert between hOCR and ALTO, MIT
hocr-parser - hOCR Specification Python Parser
hOCRTools - hOCR to ALTO conversion XSLT
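
For orientation, here is a minimal sketch of what reading hOCR can look like, using only the Python standard library. It does not use any of the tools listed above. It assumes the markup conventions of the hOCR specification (words as `ocrx_word` elements whose `title` attribute carries a `bbox x0 y0 x1 y1` property) and that words are plain `span` elements without nested spans. `SAMPLE`, `parse_bbox`, `HocrWordParser`, and `extract_words` are illustrative names, not part of any library.

```python
from html.parser import HTMLParser

# Hypothetical two-word hOCR fragment, made up for this illustration.
SAMPLE = """
<div class='ocr_page' title='bbox 0 0 200 100'>
  <span class='ocr_line' title='bbox 10 20 120 40'>
    <span class='ocrx_word' title='bbox 10 20 50 40; x_wconf 95'>Hello</span>
    <span class='ocrx_word' title='bbox 60 20 120 40; x_wconf 91'>world</span>
  </span>
</div>
"""


def parse_bbox(title):
    """Return (x0, y0, x1, y1) from an hOCR title attribute, or None."""
    for prop in title.split(";"):
        parts = prop.split()
        if len(parts) >= 5 and parts[0] == "bbox":
            return tuple(int(p) for p in parts[1:5])
    return None


class HocrWordParser(HTMLParser):
    """Collect (text, bbox) pairs for elements with class 'ocrx_word'."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._in_word = False
        self._bbox = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "ocrx_word" in (attrs.get("class") or "").split():
            self._in_word = True
            self._bbox = parse_bbox(attrs.get("title") or "")
            self._text = []

    def handle_data(self, data):
        if self._in_word:
            self._text.append(data)

    def handle_endtag(self, tag):
        # Assumption: a word is a span with no nested span inside it.
        if tag == "span" and self._in_word:
            self.words.append(("".join(self._text).strip(), self._bbox))
            self._in_word = False


def extract_words(hocr_text):
    parser = HocrWordParser()
    parser.feed(hocr_text)
    return parser.words


for text, bbox in extract_words(SAMPLE):
    print(text, bbox)
```

Real hOCR output can carry more properties and deeper nesting than this sketch handles, so a dedicated parser such as the ones listed above is likely the better choice for production use.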

ALTO XML

ALTO XML Schema - XML Schema and development of the ALTO XML format
ALTO XML Documentation - Documentation and use cases for ALTO
alto-tools - Various tools to work with ALTO files, Python
AbbyyToAlto - PHP script converting from Abbyy 6 to ALTO XML
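
As a rough illustration of the ALTO structure, the following sketch reads the text of each `TextLine` with the Python standard library. It is not taken from any of the tools above. It assumes that words are `String` elements with a `CONTENT` attribute, matches elements by local name so that the ALTO namespace version does not matter, and joins words with a single space (ignoring `SP` and `HYP` elements). `SAMPLE`, `local_name`, and `alto_lines` are names invented for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical minimal ALTO document, made up for this illustration.
SAMPLE = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v3#">
  <Layout>
    <Page ID="p1" WIDTH="200" HEIGHT="100">
      <PrintSpace>
        <TextBlock ID="b1">
          <TextLine ID="l1" HPOS="10" VPOS="20" WIDTH="110" HEIGHT="20">
            <String CONTENT="Hello" HPOS="10" VPOS="20" WIDTH="40" HEIGHT="20"/>
            <SP WIDTH="10" HPOS="50" VPOS="20"/>
            <String CONTENT="world" HPOS="60" VPOS="20" WIDTH="60" HEIGHT="20"/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>
"""


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree adds to tag names."""
    return tag.rsplit("}", 1)[-1]


def alto_lines(xml_text):
    """Return one string per TextLine, built from its String children."""
    root = ET.fromstring(xml_text)
    lines = []
    for element in root.iter():
        if local_name(element.tag) == "TextLine":
            words = [
                child.get("CONTENT", "")
                for child in element
                if local_name(child.tag) == "String"
            ]
            lines.append(" ".join(words))
    return lines


print(alto_lines(SAMPLE))
```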

OCR GUI

moz-hocr-editor - Firefox Addon for editing hOCR files (discontinued).
qt-box-editor - Qt4 editor of tesseract-ocr box files.
ocr-gt-tools - Client-Server application for editing OCR ground truth.
Paperwork - Using scanners and OCR to grep paper documents the easy way (Linux only).
gImageReader - A simple Gtk/Qt front-end to tesseract-ocr.
VietOCR - A Java/.NET GUI frontend for the Tesseract OCR engine, including jTessBoxEditor, a graphical Tesseract box data editor.
PoToCo - Fast interactive batch corrections of complete OCR error series in OCR'ed historical documents.
OCRFeeder - GTK graphical user interface that lets users correct characters or bounding boxes, with ODT export and more.

OCR Preprocessing

NoiseRemove.java in MathOCR - Java implementation of
binarize.c in ZBar - C implementations of two binarization algorithms, based on Sauvola
typeface-corpus - A repository for typefaces to train Tesseract and OCRopus for natural history collections and digital humanities.
binarizewolfjolion - Comparison of binarization algorithms. Blog post
crop_morphology.py in oldnyc - Cropping a page to just the text block
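
To make the idea of Sauvola-style local binarization concrete, here is a small pure-Python sketch. It is not the code from ZBar or any other project listed above. It assumes a grayscale image given as a list of rows of 0-255 integers, and uses the commonly cited formula T = m * (1 + k * (s / R - 1)), where m and s are the mean and standard deviation of a window around each pixel. The values `k=0.5` and `R=128` are frequently quoted defaults; `sauvola_binarize`, `radius`, and `IMAGE` are names chosen for this example.

```python
from math import sqrt

# Hypothetical 5x5 grayscale image: a dark vertical stroke on a light page.
IMAGE = [
    [200, 200, 200, 200, 200],
    [200, 200, 40, 200, 200],
    [200, 200, 40, 200, 200],
    [200, 200, 40, 200, 200],
    [200, 200, 200, 200, 200],
]


def sauvola_binarize(image, radius=1, k=0.5, R=128.0):
    """Return a copy of image with ink pixels set to 0 and paper to 255.

    Each pixel is compared with a threshold computed from the mean and
    standard deviation of its (2 * radius + 1)-wide window, clipped at
    the image border.
    """
    height, width = len(image), len(image[0])
    result = []
    for y in range(height):
        row = []
        for x in range(width):
            window = [
                image[yy][xx]
                for yy in range(max(0, y - radius), min(height, y + radius + 1))
                for xx in range(max(0, x - radius), min(width, x + radius + 1))
            ]
            mean = sum(window) / len(window)
            variance = sum((v - mean) ** 2 for v in window) / len(window)
            threshold = mean * (1 + k * (sqrt(variance) / R - 1))
            row.append(0 if image[y][x] <= threshold else 255)
        result.append(row)
    return result


for row in sauvola_binarize(IMAGE):
    print(row)
```

This naive version recomputes every window from scratch, which is fine for a demonstration but slow on full pages; practical implementations usually rely on integral images or an image-processing library.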

OCR as a Service

Open OCR - Run Tesseract in Docker containers
tesseract-web-service - An implementation of RESTful web service for tesseract-OCR using tornado.
docker-ocropy - A Docker container for running the ocropy OCR system.
ABBYY Cloud OCR SDK Code samples - Code samples for using the proprietary commercial ABBYY OCR API.
nidaba - An expandable and scalable OCR pipeline
gamera - A meta-framework for building document processing applications, e.g. OCR
ocr-tools - Project to provide CLI and web service interfaces to common OCR engines
ocrad-docker - Run the ocrad OCR engine in a docker container
kraken-docker - Run the kraken OCR engine in a docker container

OCR evaluation

ISRI OCR Evaluation Tools with a User Guide from 1996
- isri-ocr-evaluation-tools - further development by @eddieantonio (2015, 2016)
- ancientgreekocr-evaluation-tools - further development by @nickjwhite (2013, 2014)
ocrevalUAtion - Cross-format evaluation, CLI and GUI
ngram-ocr-eval - Brute and simple OCR evaluation using ngrams
quack - Quality-Assurance-tool for scans with corresponding ALTO-files
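
Many OCR evaluations report a character error rate (CER) or word error rate (WER) based on edit distance. The sketch below shows one common definition, the Levenshtein distance divided by the length of the ground truth. It is a generic illustration, not the metric implementation of any tool listed above, and the function names are chosen for this example only.

```python
def levenshtein(reference, hypothesis):
    """Minimum number of insertions, deletions, and substitutions
    needed to turn reference into hypothesis (strings or lists)."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_item in enumerate(reference, start=1):
        current = [i]
        for j, hyp_item in enumerate(hypothesis, start=1):
            current.append(
                min(
                    previous[j] + 1,  # deletion
                    current[j - 1] + 1,  # insertion
                    previous[j - 1] + (ref_item != hyp_item),  # substitution
                )
            )
        previous = current
    return previous[-1]


def error_rate(reference, hypothesis):
    """Edit distance normalized by the length of the reference."""
    if len(reference) == 0:
        raise ValueError("reference must not be empty")
    return levenshtein(reference, hypothesis) / len(reference)


def character_error_rate(ground_truth, ocr_text):
    return error_rate(ground_truth, ocr_text)


def word_error_rate(ground_truth, ocr_text):
    # Assumption: words are separated by whitespace.
    return error_rate(ground_truth.split(), ocr_text.split())


truth = "optical character recognition"
ocr = "optical charactcr recognition"
print(character_error_rate(truth, ocr))
print(word_error_rate(truth, ocr))
```

Note that this rate can exceed 1.0 when the OCR output is much longer than the ground truth, and that real evaluation tools often normalize whitespace and Unicode before comparing.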

OCR libraries by programming language

Go

gosseract - Golang OCR library, wrapping Tesseract-ocr.

Java

Tess4J - Java Native Access bindings to Tesseract.
tess-two - Tools for compiling Tesseract on Android, plus a Java API.

.Net

tesseract for .net - A .Net wrapper for tesseract-ocr.

NodeJS

node-tesseract - A simple wrapper for the Tesseract OCR package.
node-tesseract-native - C++ module for node providing OCR with tesseract and leptonica.

PHP

Tesseract OCR for PHP - Tesseract PHP bindings.

Python

pytesseract - A Python wrapper for Google Tesseract.
pyocr - A Python wrapper for Tesseract and Cuneiform.
ocrodjvu - A library and standalone tool for doing OCR on DjVu documents, wrapping Cuneiform, gocr, ocrad, ocropus and tesseract

Ruby

rtesseract - Ruby library wrapping the tesseract and imagemagick executables.
ruby-tesseract - Native Tesseract bindings for Ruby MRI and JRuby

Literature

OCR-related publication and link lists

IMPACT: Tools for text digitisation - List of software tools and projects, some related to OCR
OCR-D - List of OCR-related academic articles in the context of the OCR-D project. 🇩🇪
Mendeley Group "OCR - Optical Character Recognition" - Collection of 34 papers on OCR
eadh.org projects - List of Digital Humanities-related projects in Europe, some related to OCR
Wikipedia: Comparison of optical character recognition software
OCR [and Deep Learning] by @handong1587
Ocropus Wiki: Publications

Blog Posts and Tutorials

Tesseract Blends Old and New OCR Technology (2016) @theraysmith
- Tutorial@DAS2016, Updated "What You Always Wanted to Know" slides
What You Always Wanted To Know About Tesseract (2014) @theraysmith
- Tutorial@DAS2014, includes demos
Extracting text from an image using Ocropus (2015)
Training an Ocropus OCR model (2015) @danvk
Ocropus Wiki: Compute errors and confusions (2016) @zuphilip
Ocropus Wiki: Working with Ground Truth (2016) @zuphilip
OCRopus (2016) @jze
- mostly on column separation in ocropus
10 Tips for making your OCR project succeed (2013) @cneud
- general things to consider for OCR projects
Overview of LEADTOOLS Image Cleanup and Pre-processing SDK Technology
- feature list for a commercial image pre-processing library; has nice before-after samples for pre-processing steps related to OCR
Extracting Text from PDFs; Doing OCR; all within R @shawngraham
- How to work with OCR from PDFs in the R programming environment
Tutorial: Command-line OCR on a Mac @bmschmidt
- Tutorial on how to run tesseract on Mac OS X
Practical Experience with OCRopus Model Training (2016) @jze

OCR Showcases

abbyy-finereader-ocr-senate - Using OCR to parse scanned Senate Financial Disclosure forms.
cvOCR - An OCR system for recognizing resume or cv text, implemented in Python and C and based on tesseract
MathOCR - A printed scientific document recognition system, pre-alpha

Academic articles
