
Textract: An end-to-end OCR Python package for scanned documents.

Optical character recognition (OCR) is a technique that allows a computer to read static images of text and convert them into editable, searchable data.

Current OCR models, such as Google Cloud Vision, perform well on text recognition; however, they cannot provide a correct reading order. Textract utilizes image processing techniques for layout analysis and determines the reading order with a topological ordering. Our results show that applying this layout analysis improves Levenshtein similarity by 20% over the Google OCR model alone.
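As a rough illustration of the topological-ordering idea (not the exact implementation used in this package), reading order can be recovered by treating each detected text block as a node and adding an edge whenever one block should be read before another, for example because it lies above it or on the same line to its left. The block structure, line tolerance, and precedence rule below are illustrative assumptions:

# Hedged sketch: reading order via topological sort over text blocks.
# Block fields, thresholds, and the precedence rule are assumptions,
# not this package's actual layout-analysis code.
from collections import defaultdict, deque

def reading_order(blocks):
    """blocks: list of dicts with 'x', 'y' (top-left corner) and 'text'."""
    n = len(blocks)
    edges = defaultdict(list)
    indegree = [0] * n
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            a, b = blocks[i], blocks[j]
            same_line = abs(a["y"] - b["y"]) < 10          # crude line tolerance
            if (a["y"] < b["y"] and not same_line) or (same_line and a["x"] < b["x"]):
                edges[i].append(j)                          # read block i before block j
                indegree[j] += 1
    # Kahn's algorithm for topological ordering
    queue = deque(i for i in range(n) if indegree[i] == 0)
    order = []
    while queue:
        i = queue.popleft()
        order.append(i)
        for j in edges[i]:
            indegree[j] -= 1
            if indegree[j] == 0:
                queue.append(j)
    return [blocks[i]["text"] for i in order]

print(reading_order([
    {"x": 5, "y": 50, "text": "second line"},
    {"x": 5, "y": 10, "text": "first line,"},
    {"x": 80, "y": 12, "text": "still first line"},
]))
# -> ['first line,', 'still first line', 'second line']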

Furthermore, this package also provides another deep learning model, CRNN, for text recognition. The original CRNN is described in "An End-to-End Trainable Neural Network for Image-based Sequence Recognition and Its Application to Scene Text Recognition"; thanks also to the GitHub repo that provides a TensorFlow implementation of the CRNN model.
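For reference, a CRNN is a convolutional feature extractor feeding bidirectional recurrent layers, trained with a CTC loss. A minimal tf.keras sketch is given below; the layer sizes, image shape, and NUM_CLASSES are illustrative assumptions, not the exact configuration shipped with this package:

# Hedged sketch of a CRNN text-recognition model in tf.keras.
# Filter counts, image size, and NUM_CLASSES are illustrative assumptions.
import tensorflow as tf

NUM_CLASSES = 63  # e.g. digits + letters + CTC blank; adjust to your charset

def build_crnn(height=32, width=128, channels=1):
    inputs = tf.keras.Input(shape=(height, width, channels))
    # CNN feature extractor
    x = tf.keras.layers.Conv2D(64, 3, padding="same", activation="relu")(inputs)
    x = tf.keras.layers.MaxPooling2D(2)(x)                # 16 x 64
    x = tf.keras.layers.Conv2D(128, 3, padding="same", activation="relu")(x)
    x = tf.keras.layers.MaxPooling2D(2)(x)                # 8 x 32
    x = tf.keras.layers.Conv2D(256, 3, padding="same", activation="relu")(x)
    x = tf.keras.layers.MaxPooling2D((2, 1))(x)           # 4 x 32
    # Collapse the height dimension so each image column becomes one time step
    x = tf.keras.layers.Permute((2, 1, 3))(x)             # (width, height, channels)
    x = tf.keras.layers.Reshape((32, 4 * 256))(x)         # 32 time steps, 1024 features
    # Bidirectional recurrent layers over the width (sequence) axis
    x = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(256, return_sequences=True))(x)
    x = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(256, return_sequences=True))(x)
    # Per-time-step character probabilities; train with a CTC loss,
    # e.g. tf.keras.backend.ctc_batch_cost
    outputs = tf.keras.layers.Dense(NUM_CLASSES, activation="softmax")(x)
    return tf.keras.Model(inputs, outputs)

model = build_crnn()
model.summary()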

Pipeline

Install

Dependencies

You can install all Python dependencies with either Anaconda or pip.

> conda env create -f conda_env.yml

(If you want to use tensorflow-gpu, replace conda_env.yml with conda_env_gpu.yml.) This will create an Anaconda environment named textract.

or

> pip install -r requirements.txt

(If you want to use tensorflow-gpu, replace requirements.txt with requirements_gpu.txt.)

Download pretrained model

Please download the pretrained weights and the model, and put both files in ./textract/model.

Google OCR setting

Please follow the instructions in the Google Vision API How-to Guides to set up your Google API services. Remember to download the service account key (a .json file) and add it as an environment variable on your computer.

> export GOOGLE_APPLICATION_CREDENTIALS=~/path/to/your/service_account_key.json

For more detail, you can watch this YouTube video: Setting up API and Vision Intro - Google Cloud Python Tutorials p.2.
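Once GOOGLE_APPLICATION_CREDENTIALS is set, you can sanity-check the setup with a minimal Vision API call. This standalone snippet is only a hedged example of the google-cloud-vision client library usage, not part of this package's code; sample.png is a placeholder path for any small image containing text:

# Hedged sketch: verify Google Cloud Vision credentials with one OCR call.
import io
from google.cloud import vision

client = vision.ImageAnnotatorClient()   # picks up GOOGLE_APPLICATION_CREDENTIALS

with io.open("sample.png", "rb") as f:   # placeholder image path
    content = f.read()

image = vision.Image(content=content)    # vision.types.Image in older client versions
response = client.document_text_detection(image=image)
if response.error.message:
    raise RuntimeError(response.error.message)
print(response.full_text_annotation.text)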

Quick Start

You can run a simple test by executing the command below in the terminal. The output text files will be generated in your output folder.

> python app.py --img_dir ./path/to/image/folder --out_dir ./path/to/output/folder 

For example,

> python app.py --img_dir ./test/images --out_dir ./test/output

Then you will find the OCR text files in your output folder.

Evaluation

If you want to run the similarity test on a batch of images, you can use evaluate.py. The folders should be organized as in the structure below (the folder names can be arbitrary); a sketch of the similarity metric follows the example at the end of this section.

  • your images folder

  • your groundtruth file folder

Then, run the below command in your terminal.

> python evaluate.py --img_dir path/to/image/folder --gd_dir path/to/groundtruth/folder --out_dir path/to/output/folder

For example,

> python evaluate.py --img_dir ./evaluate/images --gd_dir ./evaluate/groundtruth --out_dir ./evaluate/output
  • the generated result folder
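For context, the similarity reported by the evaluation is a normalized Levenshtein (edit-distance) similarity between the OCR output and the ground-truth text. The sketch below only illustrates such a metric; evaluate.py may normalize or preprocess the text differently:

# Hedged sketch of a normalized Levenshtein similarity between two strings.
# evaluate.py may compute this differently; this only illustrates the metric.
def levenshtein_distance(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def levenshtein_similarity(ocr_text, ground_truth):
    if not ocr_text and not ground_truth:
        return 1.0
    dist = levenshtein_distance(ocr_text, ground_truth)
    return 1.0 - dist / max(len(ocr_text), len(ground_truth))

print(levenshtein_similarity("hello world", "helo world"))  # ~0.91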

Reference
