This repository contains code and resources for benchmarking various Optical Character Recognition (OCR) engines and APIs, including Adobe, Surya, and Tesseract.
The following OCR APIs are included in this repository:
- Adobe
- Surya
- Tesseract
The repository has the following file structure:
OCR-Benchmarking/
├── ADOBE/
│   ├── pdfservices-python-sdk-samples-main/
│   ├── img2pdf.py
│   └── README.md
├── IMAGES/
├── OUTPUT/
│   ├── ADOBE/
│   ├── SURYA/
│   └── TESSERACT/
├── SURYA/
│   ├── temp.py
│   └── README.md
├── TESSERACT/
│   ├── temp.py
│   └── README.md
├── README.md
└── ...
- ADOBE/: Contains the setup and source code for the Adobe API.
- IMAGES/: Contains the input images for testing the different OCR APIs.
- OUTPUT/: The output generated by the OCR APIs will be stored in this directory.
- SURYA/: Contains the source code for the Surya API.
- TESSERACT/: Contains the source code for the Tesseract API.
- Python 3.6 or later
- Required Python packages (listed in requirements.txt)
- Clone the repository:
  git clone https://github.com/NitinYadav1511/OCR-Benchmarking.git
- Install the required Python packages:
  pip install -r requirements.txt
- Add your test images to the IMAGES directory.
- Run the respective scripts for each OCR API you want to test:
  - Adobe: python ADOBE/pdfservices-python-sdk-samples-main/src/extractpdf/extract_txt_from_pdf.py
  - Surya: python SURYA/temp.py
  - Tesseract: python TESSERACT/temp.py
The output generated by each OCR API will be stored in the OUTPUT directory.
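After a run, it can be useful to check which input images still have no output for a given engine. The following is a minimal sketch, not part of the repository: the helper names `list_images` and `missing_outputs` are hypothetical, and it assumes each script writes one file per image to OUTPUT/<ENGINE>/ using the image's base name with a `.txt` extension. Adjust the suffix and the list of image extensions if the scripts behave differently.

```python
from pathlib import Path

# Engine sub-directories under OUTPUT/, taken from the file structure above.
ENGINES = ("ADOBE", "SURYA", "TESSERACT")

# Assumption: which file extensions count as input images.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".tif", ".tiff", ".bmp"}


def list_images(images_dir):
    """Return the image files in images_dir, sorted by path.

    Returns an empty list if the directory does not exist.
    """
    directory = Path(images_dir)
    if not directory.is_dir():
        return []
    return sorted(
        p for p in directory.iterdir()
        if p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES
    )


def missing_outputs(repo_root, out_suffix=".txt"):
    """Map each engine to the image names that have no output file yet.

    Assumption (not confirmed by the repository): each script writes
    OUTPUT/<ENGINE>/<image stem><out_suffix>.
    """
    root = Path(repo_root)
    images = list_images(root / "IMAGES")
    report = {}
    for engine in ENGINES:
        out_dir = root / "OUTPUT" / engine
        report[engine] = [
            img.name for img in images
            if not (out_dir / (img.stem + out_suffix)).exists()
        ]
    return report


if __name__ == "__main__":
    # Run from the repository root.
    for engine, names in missing_outputs(".").items():
        print(f"{engine}: {len(names)} image(s) without output")
```

Run from the repository root, this prints one line per engine with the number of images that still lack an output file.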
This repository is maintained by Nitin Yadav.
This repository contributes to a major project on Optical Character Recognition (OCR) for Indic languages under IIT Bombay, in collaboration with IIIT Hyderabad. The goal of the project is to create a robust OCR system for Indic scripts. As part of this effort, this repository benchmarks several existing OCR models, including Adobe, Surya, and Tesseract, by evaluating their performance across diverse datasets.
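This README does not state which metric is used to evaluate the engines. One common choice for OCR is the character error rate (CER): the edit distance between the recognized text and a ground-truth transcription, divided by the length of the ground truth. The sketch below is an illustration under that assumption, not the project's scoring code; `edit_distance` and `character_error_rate` are hypothetical names. It compares Unicode code points after NFC normalization, which for Indic scripts is not the same as comparing user-perceived characters.

```python
import unicodedata


def edit_distance(ref, hyp):
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete r from ref
                curr[j - 1] + 1,     # insert h into ref
                prev[j - 1] + cost,  # substitute r with h (or match)
            ))
        prev = curr
    return prev[-1]


def character_error_rate(reference, hypothesis):
    """CER = edit distance / number of code points in the reference.

    Both strings are NFC-normalized first so that equivalent encodings
    of the same character are not counted as errors.
    """
    reference = unicodedata.normalize("NFC", reference)
    hypothesis = unicodedata.normalize("NFC", hypothesis)
    if not reference:
        raise ValueError("reference text must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    # One substituted character out of four.
    print(character_error_rate("abcd", "abxd"))
```

A CER of 0 means the output matches the ground truth exactly. The value can exceed 1 when the engine emits much more text than the reference contains.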
This benchmark suite utilizes the following OCR APIs and libraries:
- Adobe PDF Services API
- Surya OCR API
- Tesseract OCR Engine
Special thanks to the contributors and maintainers of these projects.