Welcome to the OCR project! This repository hosts a performant and extensible web service that performs Optical Character Recognition (OCR) using Visual Language Models (VLMs) via API calls. The initial implementation is in Python, with plans to introduce Rust for further performance gains.
Our goal is to create a high-performance OCR web service that:
- Extracts text from images using modern OCR techniques
- Leverages VLMs for enhanced interpretation and post-processing
- Prioritizes speed, scalability, and robustness
- Serves real-time and batch OCR use cases in business environments
Optical Character Recognition (OCR) is the process of converting the text in scanned documents, photos, PDFs, and other image files into machine-readable form.
OCR is a critical enabler of digital transformation. It helps businesses:
- Automate data entry from paper forms or invoices
- Extract structured data from unstructured documents
- Enable search, indexing, and archiving of scanned files
- Improve accessibility and compliance
Industries like finance, logistics, healthcare, law, and government rely heavily on OCR to streamline operations and reduce manual processing time.
Despite the breadth of existing solutions, many OCR tools:
- Struggle with low-quality images
- Lack semantic understanding of the extracted text
- Are hard to integrate or deploy as scalable web services
- Offer poor performance in real-time applications
This project addresses these limitations by:
- Using VLMs to interpret ambiguous or noisy text
- Designing a modular web API that's easy to extend
- Focusing on low-latency and high throughput
- Enabling multi-language and multi-format support
| Layer | Tooling |
|---|---|
| Language | Python (Rust planned) |
| Model API | OpenAI / Claude / Other LLM APIs |
| API Framework | LitServe (Python) |
| Performance Focus | Rust rewrite (planned) for speed-critical modules |
| Testing | Pytest + Benchmark tools |
## Table of Contents

- Installation
- Configuration
- API Documentation
- Available Extractors
- Project Architecture
- Development
- Roadmap
- Contributing
## Installation

### Prerequisites

- Python 3.11+
- Docker (optional)
- API keys for your chosen LLM provider
### Install with uv

This project uses uv for dependency management. Install it first:

```bash
# Install uv
curl -LsSf https://astral.sh/uv/install.sh | sh
# Clone and setup
git clone https://github.com/FadelMamar/ocr.git
cd ocr
uv sync
```

Alternatively, install with pip:

```bash
git clone https://github.com/FadelMamar/ocr.git
cd ocr
pip install -e .
```

### Quick start with Docker

```bash
# Clone the repository
git clone https://github.com/FadelMamar/ocr.git
cd ocr
# Setup environment
cp example.env .env
# Edit .env with your API keys
# Run with Docker Compose
docker compose up
```

## Configuration

Create a `.env` file based on `example.env`:

```env
# Required API Keys (choose your provider)
GOOGLE_API_KEY=your_google_api_key_here
OPENAI_API_KEY=your_openai_api_key_here
OPENAI_API_BASE=http://localhost:8000/v1
# Model Configuration
MODEL=gemini/gemini-2.5-flash-preview-05-20
EXTRACTOR=smoldocling
TEMPERATURE=0.7
```

### Google Gemini

- Visit Google AI Studio
- Create an API key
- Set `GOOGLE_API_KEY=your_key_here`

### OpenAI

- Visit OpenAI Platform
- Create an API key
- Set `OPENAI_API_KEY=your_key_here`
- Set `OPENAI_API_BASE` if using a custom endpoint

### Ollama (local models)

- Install Ollama
- Pull your preferred model: `ollama pull llama3.2`
- Use model names like `ollama_chat/llama3.2`
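To sanity-check the configuration before starting the service, a quick snippet like the following can confirm the variables are visible. This is a minimal sketch; it assumes the `python-dotenv` package, which may not be a project dependency:

```python
import os

from dotenv import load_dotenv  # assumption: python-dotenv is installed

load_dotenv()  # reads .env from the current working directory

# Report which of the settings from example.env are visible to the process
for var in ("GOOGLE_API_KEY", "OPENAI_API_KEY", "MODEL", "EXTRACTOR", "TEMPERATURE"):
    print(f"{var}: {'set' if os.getenv(var) else 'missing'}")
```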
## API Documentation

### `POST /predict`

Extract text from images or PDFs using OCR.
Request Body:

```json
{
"data": "base64_encoded_image_or_pdf",
"prompt": "Extract the text from this image",
"extractor": "smoldocling",
"filetype": "image"
}
```

Parameters:
- `data` (required): Base64-encoded image or PDF bytes
- `prompt` (optional): Custom extraction prompt
- `extractor` (optional): OCR extractor type (default: `smoldocling`)
- `filetype` (optional): `"image"` or `"pdf"` (default: `"image"`)
Response:

```json
{
"output": "Extracted text content"
}
```

#### Python

```python
import base64
import requests
# Encode image
with open("document.jpg", "rb") as f:
image_data = base64.b64encode(f.read()).decode("utf-8")
# Make request
response = requests.post(
"http://localhost:4242/predict",
json={
"data": image_data,
"prompt": "Extract all text from this document",
"extractor": "gemini",
"filetype": "image"
}
)
print(response.json()["output"])
```
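The same endpoint accepts PDFs. Here is a minimal variation of the example above; the filename is illustrative, and how multi-page documents are rendered is up to the data loader:

```python
import base64

import requests

# Encode a PDF instead of an image (filename is illustrative)
with open("document.pdf", "rb") as f:
    pdf_data = base64.b64encode(f.read()).decode("utf-8")

response = requests.post(
    "http://localhost:4242/predict",
    # prompt and extractor are omitted, so the documented defaults apply
    json={"data": pdf_data, "filetype": "pdf"},
)
print(response.json()["output"])
```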
#### cURL

```bash
# Encode image to base64
IMAGE_B64=$(base64 -w 0 document.jpg)
# Make request
curl -X POST http://localhost:4242/predict \
-H "Content-Type: application/json" \
-d '{
"data": "'$IMAGE_B64'",
"prompt": "Extract all text from this document",
"extractor": "smoldocling",
"filetype": "image"
}'
```

#### JavaScript

```javascript
const fs = require('fs');
// Read and encode image
const imageBuffer = fs.readFileSync('document.jpg');
const imageBase64 = imageBuffer.toString('base64');
// Make request
fetch('http://localhost:4242/predict', {
method: 'POST',
headers: {
'Content-Type': 'application/json',
},
body: JSON.stringify({
data: imageBase64,
prompt: 'Extract all text from this document',
extractor: 'smoldocling',
filetype: 'image'
})
})
.then(response => response.json())
.then(data => console.log(data.output));
```

## Available Extractors

The service supports multiple OCR extractors, each optimized for different use cases:
### SmolDocling (`smoldocling`)

- Type: VLM-based OCR
- Best for: High-quality text extraction with semantic understanding
- Requirements: None (works out of the box)
- Performance: Fast, good accuracy
### PaddleOCR

- Type: Based on PaddleOCR
- Best for: Fast processing of standard documents
- Requirements: Downloads models on first use
- Performance: Very fast, moderate accuracy
### Gemini (`gemini`)

- Type: Google Gemini VLM
- Best for: Complex documents requiring interpretation
- Requirements: `GOOGLE_API_KEY`
- Performance: High accuracy, moderate speed
### DSPy

- Type: DSPy framework with multiple model support
- Best for: Advanced prompting and reasoning
- Requirements: Model configuration (Gemini, OpenAI, Ollama)
- Performance: High accuracy, flexible prompting
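Because the extractor is selected per request, comparing backends on the same document is a short loop. This sketch reuses the API shape from the examples above; extractor keys other than `smoldocling` and `gemini` would need to match what is registered in `EXTRACTOR_MAP`:

```python
import base64

import requests

with open("document.jpg", "rb") as f:
    image_data = base64.b64encode(f.read()).decode("utf-8")

# Keys confirmed by the examples above; extend with any others in EXTRACTOR_MAP
for extractor in ("smoldocling", "gemini"):
    response = requests.post(
        "http://localhost:4242/predict",
        json={"data": image_data, "extractor": extractor, "filetype": "image"},
    )
    print(f"--- {extractor} ---")
    print(response.json()["output"][:300])  # preview the first 300 characters
```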
## Project Architecture

```
src/
├── app.py             # FastAPI/LitServe application
├── orchestrator.py    # Main orchestration logic
├── extractor.py       # OCR extractor implementations
├── loader.py          # Data loading utilities
└── ui.py              # Streamlit web interface

examples/
├── run_ocr.py         # CLI examples and testing
└── webservice.py      # Web service examples

data/                  # Sample images for testing
```
The service follows a modular architecture:
- API Layer (`app.py`): Handles HTTP requests and responses
- Orchestrator (`orchestrator.py`): Coordinates between data loading and extraction
- Extractors (`extractor.py`): Different OCR implementations
- Data Loader (`loader.py`): Handles image/PDF loading and preprocessing
- UI (`ui.py`): Streamlit web interface for easy testing
```
Image/PDF → DataLoader → Orchestrator → Extractor → Response
```
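To make the flow concrete, here is a minimal sketch of how the pieces fit together. The `Extractor.run` signature, `EXTRACTOR_MAP`, and `build_orchestrator()` come from the development notes below; the loader method and the `Orchestrator` internals are illustrative, not the actual source:

```python
from abc import ABC, abstractmethod


class Extractor(ABC):
    """Interface each OCR backend implements (src/extractor.py)."""

    @abstractmethod
    def run(self, image: bytes, prompt: str) -> str: ...


class Orchestrator:
    """Coordinates loading and extraction (src/orchestrator.py)."""

    def __init__(self, loader, extractor: Extractor):
        self.loader = loader
        self.extractor = extractor

    def predict(self, data: bytes, prompt: str, filetype: str = "image") -> str:
        # The DataLoader normalizes the decoded payload into image bytes;
        # for PDFs this would include rasterizing pages.
        image = self.loader.load(data, filetype=filetype)  # illustrative method name
        return self.extractor.run(image, prompt)
```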
## Development

### Running the service

```bash
# Start the API service
python src/app.py
# Start the Streamlit UI (in another terminal)
streamlit run src/ui.py
```

### Testing extractors

```bash
# Test all extractors
python examples/run_ocr.py test_all
# Test specific extractor
python examples/run_ocr.py test_smoldocling
# Test with custom image
python examples/run_ocr.py test_custom_image path/to/image.jpg
```

### Code quality

The project uses ruff for linting and formatting:

```bash
# Check code quality
uvx ruff check src/
# Auto-fix issues
uvx ruff check --fix src/
# Format code
uvx ruff format src/
```

### Tests and benchmarks

```bash
# Run tests (when implemented)
pytest tests/
# Run benchmarks
python examples/run_ocr.py test_all
```

### Development workflow

- Setup: Clone the repo and install dependencies with `uv sync`
- Configure: Copy `example.env` to `.env` and add API keys
- Develop: Use the modular architecture to add new extractors
- Test: Use the example scripts to test functionality
- Format: Run `uvx ruff check --fix src/` before committing
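Until the `tests/` suite mentioned above exists, a hedged sketch of a first test might look like this; the import path and the stub class are assumptions:

```python
# tests/test_extractor.py -- hypothetical first test
from extractor import Extractor  # assumption: src/ is on the import path


class EchoExtractor(Extractor):
    """Stub extractor for exercising the interface without any model calls."""

    def run(self, image: bytes, prompt: str) -> str:
        return f"echo: {prompt}"


def test_run_returns_text():
    extractor = EchoExtractor()
    result = extractor.run(b"fake-image-bytes", "Extract the text")
    assert isinstance(result, str)
    assert "Extract the text" in result
```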
### Adding a new extractor

- Create a new class in `src/extractor.py` inheriting from `Extractor` (a sketch follows this list)
- Implement the `run(image: bytes, prompt: str) -> str` method
- Add the extractor to `EXTRACTOR_MAP` in `orchestrator.py`
- Update the factory function in `build_orchestrator()`
- Add tests in `examples/run_ocr.py`
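As a hedged illustration of the first two steps, here is what a new backend could look like. Only the `Extractor` base class and the `run` signature come from the list above; the Tesseract dependency and the class name are example-only assumptions:

```python
import io

import pytesseract               # assumption: example-only dependency
from PIL import Image            # assumption: Pillow is available

from extractor import Extractor  # base class in src/extractor.py


class TesseractExtractor(Extractor):
    """Hypothetical extractor backed by a local Tesseract install."""

    def run(self, image: bytes, prompt: str) -> str:
        # Tesseract has no prompting mechanism, so the prompt is ignored here.
        pil_image = Image.open(io.BytesIO(image))
        return pytesseract.image_to_string(pil_image)
```

Registered under a key such as `"tesseract"` in `EXTRACTOR_MAP` (step 3), it would become selectable through the `extractor` request parameter.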
## Roadmap

- Image upload endpoint (LitServe)
- LLM integration to enhance or correct OCR output
- Dockerized deployment
- Multiple extractor support
- Streamlit UI
- Introduce Rust modules for performance hotspots (image decoding, pre/post-processing)
- Batch processing mode
- Async and queue-based inference
- CI/CD and monitoring integration
- Multi-language OCR support
- Document structure detection (tables, forms)
- Advanced error handling and retry logic
- Performance monitoring and metrics
## Contributing

We welcome contributions! Here's how to get started:
- Fork the repository
- Clone your fork: `git clone https://github.com/your-username/ocr.git`
- Install dependencies: `uv sync`
- Create a feature branch: `git checkout -b feature/your-feature`
- Make your changes and test with the example scripts
- Format code: `uvx ruff check --fix src/`
- Submit a pull request
### Code style

- Follow PEP 8 with an 88-character line length
- Use type hints for all function parameters and return values
- Add docstrings for all public functions and classes (a short example follows this list)
- Run `uvx ruff check src/` before committing
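For instance, a function following these conventions; the name and signature are illustrative, not taken from the codebase:

```python
import base64


def decode_payload(data: str) -> bytes:
    """Decode a base64-encoded request payload into raw bytes.

    Args:
        data: Base64-encoded image or PDF contents.

    Returns:
        The decoded bytes, ready for the data loader.
    """
    return base64.b64decode(data)
```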
### Testing guidelines

- Add tests for new extractors in `examples/run_ocr.py`
- Test with various image formats and quality levels
- Ensure error handling works correctly
### Getting help

- Issues: Report bugs and feature requests on GitHub
- Discussions: Join community discussions for questions and ideas
- Documentation: Check the examples folder for usage patterns
## Acknowledgments

This project draws on:
- The power of LLMs to understand context and correct OCR noise
- The need for enterprise-grade OCR tools that are fast, reliable, and easy to deploy