OCR-Benchmarking

This repository contains code and resources for benchmarking various Optical Character Recognition (OCR) engines and APIs, including Adobe, Surya, and Tesseract.

OCR APIs

The following OCR APIs are included in this repository:

Adobe
Surya
Tesseract

File System

The repository has the following file structure:

OCR-Benchmarking/
├── ADOBE/
│   ├── pdfservices-python-sdk-samples-main/
│   ├── img2pdf.py
│   └── README.md
├── IMAGES/
├── OUTPUT/
│   ├── ADOBE/
│   ├── SURYA/
│   └── TESSERACT/
├── SURYA/
│   ├── temp.py
│   └── README.md
├── TESSERACT/
│   ├── temp.py
│   └── README.md
├── README.md
└── ...

ADOBE/: Contains the setup and source code for the Adobe API.
IMAGES/: Contains the input images for testing the different OCR APIs.
OUTPUT/: Stores the output generated by the OCR APIs, with one subdirectory per engine (ADOBE/, SURYA/, TESSERACT/).
SURYA/: Contains the source code for the Surya API.
TESSERACT/: Contains the source code for the Tesseract API.
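
As a rough illustration of how this layout could be used from a script, the sketch below pairs each file in IMAGES/ with a per-engine path under OUTPUT/. The `.txt` output naming, the set of image extensions, and the helper name `planned_outputs` are assumptions made for illustration; the repository's own scripts may name their outputs differently.

```python
from pathlib import Path

# Engine sub-directories under OUTPUT/, as shown in the tree above.
ENGINES = ("ADOBE", "SURYA", "TESSERACT")

# Assumed set of image extensions; adjust to match your test data.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".tif", ".tiff"}


def planned_outputs(repo_root):
    """Map each image in IMAGES/ to one hypothetical text file per engine."""
    root = Path(repo_root)
    images_dir = root / "IMAGES"
    if not images_dir.is_dir():
        return {}
    images = sorted(
        p for p in images_dir.iterdir()
        if p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES
    )
    return {
        image: {
            # Assumption: one .txt file per image, named after the image stem.
            engine: root / "OUTPUT" / engine / (image.stem + ".txt")
            for engine in ENGINES
        }
        for image in images
    }


if __name__ == "__main__":
    for image, outputs in planned_outputs(".").items():
        for engine, out_path in outputs.items():
            print(f"{image.name} -> {out_path}")
```

Run from the repository root, this prints one line per image and engine, which can help check that every input has a matching output after a benchmark run.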

Getting Started

Prerequisites

Python 3.6 or later
Required Python packages (listed in requirements.txt)

Installation

Clone the repository:

git clone https://github.com/NitinYadav1511/OCR-Benchmarking.git

Install the required Python packages:

pip install -r requirements.txt

Usage

Add your test images to the IMAGES directory.
Run the script for each OCR API you want to test:
- Adobe: python ADOBE/pdfservices-python-sdk-samples-main/src/extractpdf/extract_txt_from_pdf.py
- Surya: python SURYA/temp.py
- Tesseract: python TESSERACT/temp.py
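
If you want to run several engines in one go, the commands above could be wrapped in a small runner along the following lines. This is a minimal sketch, not part of the repository: the names `SCRIPTS`, `build_command`, and `run_engines` are hypothetical, and it assumes the scripts are meant to be launched from the repository root, as the relative paths in the list suggest. The Adobe sample will typically also need API credentials configured, which this sketch does not handle.

```python
import subprocess
import sys
from pathlib import Path

# Script paths copied from the usage list above, relative to the repository root.
SCRIPTS = {
    "adobe": "ADOBE/pdfservices-python-sdk-samples-main/src/extractpdf/extract_txt_from_pdf.py",
    "surya": "SURYA/temp.py",
    "tesseract": "TESSERACT/temp.py",
}


def build_command(engine, python=sys.executable):
    """Return the argv list that runs one engine's script."""
    try:
        script = SCRIPTS[engine.lower()]
    except KeyError:
        raise ValueError(f"unknown engine: {engine!r}") from None
    return [python, script]


def run_engines(engines, repo_root="."):
    """Run each engine's script from the repository root and collect exit codes."""
    results = {}
    for engine in engines:
        completed = subprocess.run(build_command(engine), cwd=Path(repo_root))
        results[engine] = completed.returncode
    return results
```

Calling `run_engines(["surya", "tesseract"])` from the repository root returns a dictionary of exit codes, so a failing engine can be spotted without stopping the others.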

The output generated by each OCR API will be stored in the OUTPUT directory.
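
This README does not say how the stored outputs are scored. One common way to compare an OCR result against a reference transcription is the character error rate (CER): the edit distance between the two texts divided by the length of the reference. The sketch below assumes you have a ground-truth transcription for each image, which the repository is not stated to provide; the function names are hypothetical.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance between two strings, counted in Unicode code points."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference, hypothesis):
    """Edit distance divided by the reference length."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)
```

Note that this counts Unicode code points. In Indic scripts a single visible character such as a conjunct or a consonant with a vowel sign often spans several code points, so a grapheme-level comparison may be a better fit depending on what you want to measure.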

Contribution

This repository is maintained by Nitin Yadav.

This repository contributes to a major project on Optical Character Recognition (OCR) for Indic languages under IIT Bombay, in collaboration with IIIT Hyderabad. The goal of the project is to create a robust OCR system for Indic scripts. As part of this effort, this repository benchmarks several existing OCR models, including Adobe, Surya, and Tesseract, by evaluating their performance across diverse datasets.

Acknowledgments

This benchmark suite utilizes the following OCR APIs and libraries:

Adobe PDF Services API
Surya OCR API
Tesseract OCR Engine

Special thanks to the contributors and maintainers of these projects.
