Welcome to the OCR project! This repository hosts a performant and extensible web service that performs Optical Character Recognition (OCR) using Visual Language Models (VLMs) via API calls. The initial implementation is in Python, with plans to introduce Rust for further performance gains.
Our goal is to create a high-performance OCR web service that:
- Extracts text from images using modern OCR techniques
- Leverages VLMs for enhanced interpretation and post-processing
- Prioritizes speed, scalability, and robustness
- Serves real-time and batch OCR use cases in business environments
Optical Character Recognition (OCR) is the process of converting the text in scanned documents, photos, PDFs, and other image files into machine-readable form.
OCR is a critical enabler of digital transformation. It helps businesses:
- Automate data entry from paper forms or invoices
- Extract structured data from unstructured documents
- Enable search, indexing, and archiving of scanned files
- Improve accessibility and compliance
Industries like finance, logistics, healthcare, law, and government rely heavily on OCR to streamline operations and reduce manual processing time.
Despite the breadth of existing solutions, many OCR tools:
- Struggle with low-quality images
- Lack semantic understanding of the extracted text
- Are hard to integrate or deploy as scalable web services
- Offer poor performance in real-time applications
This project addresses these limitations by:
- Using VLMs to interpret ambiguous or noisy text
- Designing a modular web API that's easy to extend
- Focusing on low-latency and high throughput
- Enabling multi-language and multi-format support
| Layer | Tooling |
|---|---|
| Language | Python (Rust planned) |
| Model API | OpenAI / Claude / Other LLM APIs |
| API Framework | LitServe (Python) |
| Performance Focus | Rust rewrite (planned) for speed-critical modules |
| Testing | Pytest + Benchmark tools |
## Table of Contents

- Installation
- Configuration
- API Documentation
- Available Extractors
- Project Architecture
- Development
- Roadmap
- Contributing
## Installation

### Prerequisites

- Python 3.11+
- Docker (optional)
- API keys for your chosen LLM provider
### Install with uv

This project uses uv for dependency management. Install it first:

```bash
# Install uv
curl -LsSf https://astral.sh/uv/install.sh | sh
# Clone and setup
git clone https://github.com/FadelMamar/ocr.git
cd ocr
uv sync
```

Alternatively, install with pip:

```bash
git clone https://github.com/FadelMamar/ocr.git
cd ocr
pip install -e .
```

### Quick start with Docker

```bash
# Clone the repository
git clone https://github.com/FadelMamar/ocr.git
cd ocr
# Setup environment
cp example.env .env
# Edit .env with your API keys
# Run with Docker Compose
docker compose up
```

## Configuration

Create a `.env` file based on `example.env`:

```env
# Required API Keys (choose your provider)
GOOGLE_API_KEY=your_google_api_key_here
OPENAI_API_KEY=your_openai_api_key_here
OPENAI_API_BASE=http://localhost:8000/v1
# Model Configuration
MODEL=gemini/gemini-2.5-flash-preview-05-20
EXTRACTOR=smoldocling
TEMPERATURE=0.7
```

### Google Gemini

- Visit Google AI Studio
- Create an API key
- Set `GOOGLE_API_KEY=your_key_here`

### OpenAI

- Visit OpenAI Platform
- Create an API key
- Set `OPENAI_API_KEY=your_key_here`
- Set `OPENAI_API_BASE` if using a custom endpoint

### Ollama (local models)

- Install Ollama
- Pull your preferred model: `ollama pull llama3.2`
- Use model names like `ollama_chat/llama3.2`
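To sanity-check the configuration before starting the service, a quick snippet like the following can confirm the variables are visible. This is a minimal sketch; it assumes the `python-dotenv` package, which may not be a project dependency:

```python
import os

from dotenv import load_dotenv  # assumption: python-dotenv is installed

load_dotenv()  # reads .env from the current working directory

# Report which of the settings from example.env are visible to the process
for var in ("GOOGLE_API_KEY", "OPENAI_API_KEY", "MODEL", "EXTRACTOR", "TEMPERATURE"):
    print(f"{var}: {'set' if os.getenv(var) else 'missing'}")
```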
## API Documentation

### `POST /predict`

Extract text from images or PDFs using OCR.
Request Body:

```json
{
"data": "base64_encoded_image_or_pdf",
"prompt": "Extract the text from this image",
"extractor": "smoldocling",
"filetype": "image"
}
```

Parameters:
- `data` (required): Base64-encoded image or PDF bytes
- `prompt` (optional): Custom extraction prompt
- `extractor` (optional): OCR extractor type (default: `smoldocling`)
- `filetype` (optional): `"image"` or `"pdf"` (default: `"image"`)
Response:

```json
{
"output": "Extracted text content"
}
```

#### Python

```python
import base64
import requests
# Encode image
with open("document.jpg", "rb") as f:
image_data = base64.b64encode(f.read()).decode("utf-8")
# Make request
response = requests.post(
"http://localhost:4242/predict",
json={
"data": image_data,
"prompt": "Extract all text from this document",
"extractor": "gemini",
"filetype": "image"
}
)
print(response.json()["output"])
```
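The same endpoint accepts PDFs. Here is a minimal variation of the example above; the filename is illustrative, and how multi-page documents are rendered is up to the data loader:

```python
import base64

import requests

# Encode a PDF instead of an image (filename is illustrative)
with open("document.pdf", "rb") as f:
    pdf_data = base64.b64encode(f.read()).decode("utf-8")

response = requests.post(
    "http://localhost:4242/predict",
    # prompt and extractor are omitted, so the documented defaults apply
    json={"data": pdf_data, "filetype": "pdf"},
)
print(response.json()["output"])
```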
#### cURL

```bash
# Encode image to base64
IMAGE_B64=$(base64 -w 0 document.jpg)
# Make request
curl -X POST http://localhost:4242/predict \
-H "Content-Type: application/json" \
-d '{
"data": "'$IMAGE_B64'",
"prompt": "Extract all text from this document",
"extractor": "smoldocling",
"filetype": "image"
}'
```

#### JavaScript

```javascript
const fs = require('fs');
// Read and encode image
const imageBuffer = fs.readFileSync('document.jpg');
const imageBase64 = imageBuffer.toString('base64');
// Make request
fetch('http://localhost:4242/predict', {
method: 'POST',
headers: {
'Content-Type': 'application/json',
},
body: JSON.stringify({
data: imageBase64,
prompt: 'Extract all text from this document',
extractor: 'smoldocling',
filetype: 'image'
})
})
.then(response => response.json())
.then(data => console.log(data.output));
```

## Available Extractors

The service supports multiple OCR extractors, each optimized for different use cases:
### SmolDocling (`smoldocling`)

- Type: VLM-based OCR
- Best for: High-quality text extraction with semantic understanding
- Requirements: None (works out of the box)
- Performance: Fast, good accuracy
### PaddleOCR

- Type: Based on PaddleOCR
- Best for: Fast processing of standard documents
- Requirements: Downloads models on first use
- Performance: Very fast, moderate accuracy
### Gemini (`gemini`)

- Type: Google Gemini VLM
- Best for: Complex documents requiring interpretation
- Requirements: `GOOGLE_API_KEY`
- Performance: High accuracy, moderate speed
### DSPy

- Type: DSPy framework with multiple model support
- Best for: Advanced prompting and reasoning
- Requirements: Model configuration (Gemini, OpenAI, Ollama)
- Performance: High accuracy, flexible prompting
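Because the extractor is selected per request, comparing backends on the same document is a short loop. This sketch reuses the API shape from the examples above; extractor keys other than `smoldocling` and `gemini` would need to match what is registered in `EXTRACTOR_MAP`:

```python
import base64

import requests

with open("document.jpg", "rb") as f:
    image_data = base64.b64encode(f.read()).decode("utf-8")

# Keys confirmed by the examples above; extend with any others in EXTRACTOR_MAP
for extractor in ("smoldocling", "gemini"):
    response = requests.post(
        "http://localhost:4242/predict",
        json={"data": image_data, "extractor": extractor, "filetype": "image"},
    )
    print(f"--- {extractor} ---")
    print(response.json()["output"][:300])  # preview the first 300 characters
```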
## Project Architecture

```
src/
├── app.py             # FastAPI/LitServe application
├── orchestrator.py    # Main orchestration logic
├── extractor.py       # OCR extractor implementations
├── loader.py          # Data loading utilities
└── ui.py              # Streamlit web interface

examples/
├── run_ocr.py         # CLI examples and testing
└── webservice.py      # Web service examples

data/                  # Sample images for testing
```
The service follows a modular architecture:
- API Layer (`app.py`): Handles HTTP requests and responses
- Orchestrator (`orchestrator.py`): Coordinates between data loading and extraction
- Extractors (`extractor.py`): Different OCR implementations
- Data Loader (`loader.py`): Handles image/PDF loading and preprocessing
- UI (`ui.py`): Streamlit web interface for easy testing
```
Image/PDF → DataLoader → Orchestrator → Extractor → Response
```
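To make the flow concrete, here is a minimal sketch of how the pieces fit together. The `Extractor.run` signature, `EXTRACTOR_MAP`, and `build_orchestrator()` come from the development notes below; the loader method and the `Orchestrator` internals are illustrative, not the actual source:

```python
from abc import ABC, abstractmethod


class Extractor(ABC):
    """Interface each OCR backend implements (src/extractor.py)."""

    @abstractmethod
    def run(self, image: bytes, prompt: str) -> str: ...


class Orchestrator:
    """Coordinates loading and extraction (src/orchestrator.py)."""

    def __init__(self, loader, extractor: Extractor):
        self.loader = loader
        self.extractor = extractor

    def predict(self, data: bytes, prompt: str, filetype: str = "image") -> str:
        # The DataLoader normalizes the decoded payload into image bytes;
        # for PDFs this would include rasterizing pages.
        image = self.loader.load(data, filetype=filetype)  # illustrative method name
        return self.extractor.run(image, prompt)
```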
## Development

### Running the service

```bash
# Start the API service
python src/app.py
# Start the Streamlit UI (in another terminal)
streamlit run src/ui.py
```

### Testing extractors

```bash
# Test all extractors
python examples/run_ocr.py test_all
# Test specific extractor
python examples/run_ocr.py test_smoldocling
# Test with custom image
python examples/run_ocr.py test_custom_image path/to/image.jpg
```

### Code quality

The project uses ruff for linting and formatting:

```bash
# Check code quality
uvx ruff check src/
# Auto-fix issues
uvx ruff check --fix src/
# Format code
uvx ruff format src/
```

### Tests and benchmarks

```bash
# Run tests (when implemented)
pytest tests/
# Run benchmarks
python examples/run_ocr.py test_all
```

### Development workflow

- Setup: Clone the repo and install dependencies with `uv sync`
- Configure: Copy `example.env` to `.env` and add API keys
- Develop: Use the modular architecture to add new extractors
- Test: Use the example scripts to test functionality
- Format: Run `uvx ruff check --fix src/` before committing
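Until the `tests/` suite mentioned above exists, a hedged sketch of a first test might look like this; the import path and the stub class are assumptions:

```python
# tests/test_extractor.py -- hypothetical first test
from extractor import Extractor  # assumption: src/ is on the import path


class EchoExtractor(Extractor):
    """Stub extractor for exercising the interface without any model calls."""

    def run(self, image: bytes, prompt: str) -> str:
        return f"echo: {prompt}"


def test_run_returns_text():
    extractor = EchoExtractor()
    result = extractor.run(b"fake-image-bytes", "Extract the text")
    assert isinstance(result, str)
    assert "Extract the text" in result
```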
### Adding a new extractor

- Create a new class in `src/extractor.py` inheriting from `Extractor` (a sketch follows this list)
- Implement the `run(image: bytes, prompt: str) -> str` method
- Add the extractor to `EXTRACTOR_MAP` in `orchestrator.py`
- Update the factory function in `build_orchestrator()`
- Add tests in `examples/run_ocr.py`
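As a hedged illustration of the first two steps, here is what a new backend could look like. Only the `Extractor` base class and the `run` signature come from the list above; the Tesseract dependency and the class name are example-only assumptions:

```python
import io

import pytesseract               # assumption: example-only dependency
from PIL import Image            # assumption: Pillow is available

from extractor import Extractor  # base class in src/extractor.py


class TesseractExtractor(Extractor):
    """Hypothetical extractor backed by a local Tesseract install."""

    def run(self, image: bytes, prompt: str) -> str:
        # Tesseract has no prompting mechanism, so the prompt is ignored here.
        pil_image = Image.open(io.BytesIO(image))
        return pytesseract.image_to_string(pil_image)
```

Registered under a key such as `"tesseract"` in `EXTRACTOR_MAP` (step 3), it would become selectable through the `extractor` request parameter.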
## Roadmap

- Image upload endpoint (LitServe)
- LLM integration to enhance or correct OCR output
- Dockerized deployment
- Multiple extractor support
- Streamlit UI
- Introduce Rust modules for performance hotspots (image decoding, pre/post-processing)
- Batch processing mode
- Async and queue-based inference
- CI/CD and monitoring integration
- Multi-language OCR support
- Document structure detection (tables, forms)
- Advanced error handling and retry logic
- Performance monitoring and metrics
## Contributing

We welcome contributions! Here's how to get started:
- Fork the repository
- Clone your fork: `git clone https://github.com/your-username/ocr.git`
- Install dependencies: `uv sync`
- Create a feature branch: `git checkout -b feature/your-feature`
- Make your changes and test with the example scripts
- Format code: `uvx ruff check --fix src/`
- Submit a pull request
### Code style

- Follow PEP 8 with an 88-character line length
- Use type hints for all function parameters and return values
- Add docstrings for all public functions and classes (a short example follows this list)
- Run `uvx ruff check src/` before committing
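For instance, a function following these conventions; the name and signature are illustrative, not taken from the codebase:

```python
import base64


def decode_payload(data: str) -> bytes:
    """Decode a base64-encoded request payload into raw bytes.

    Args:
        data: Base64-encoded image or PDF contents.

    Returns:
        The decoded bytes, ready for the data loader.
    """
    return base64.b64decode(data)
```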
### Testing guidelines

- Add tests for new extractors in `examples/run_ocr.py`
- Test with various image formats and quality levels
- Ensure error handling works correctly
### Getting help

- Issues: Report bugs and feature requests on GitHub
- Discussions: Join community discussions for questions and ideas
- Documentation: Check the examples folder for usage patterns
## Acknowledgments

This project draws on:
- The power of LLMs to understand context and correct OCR noise
- The need for enterprise-grade OCR tools that are fast, reliable, and easy to deploy