# Two-Tier Document Parser

A production-ready, two-tier document parsing API designed for RAG pipelines and data extraction workflows. It intelligently routes parsing requests to either a high-speed CPU-based parser or a high-accuracy VLM parser (with automatic CPU-pipeline fallback when no GPU is available).
- [Features](#features)
- [Quick Start](#quick-start)
- [Architecture](#architecture)
- [API Documentation](#api-documentation)
- [Examples](#examples)
- [Performance Benchmarks](#performance-benchmarks)
- [FAQ](#faq)
- [Troubleshooting](#troubleshooting)
- [Documentation](#documentation)
- [Future Improvements](#future-improvements)
- [License](#license)
- [Acknowledgements](#acknowledgements)
## Features

| Feature | Fast Parser (Tier 1) | Accurate Parser (Tier 2) |
|---|---|---|
| Engine | PyMuPDF4LLM | MinerU (VLM or Pipeline) |
| Hardware | CPU Only | GPU (VLM) or CPU (Pipeline) |
| Auto Fallback | N/A | ✅ Pipeline mode if no GPU |
| Accuracy | 70-75% | 95% (VLM) / 80-85% (Pipeline) |
| Speed | < 1s per page | 15-60s per page (VLM) / 30-60s per page (Pipeline) |
| Output | Markdown Text | Markdown + Images + Tables + Formulas |
| Best For | High volume, simple text | Complex layouts, scanned docs, scientific papers |
| Concurrency | High (CPU-bound) | Medium (GPU-bound) / High (CPU fallback) |
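
Because both tiers expose the same `/parse` interface, routing can live in a thin client. A minimal sketch, using the endpoint URLs from this README; the `needs_accuracy` flag and timeout values are illustrative caller-side choices, not part of the service:

```python
import requests

FAST_URL = "http://localhost:8004/parse"      # Tier 1: speed
ACCURATE_URL = "http://localhost:8005/parse"  # Tier 2: accuracy

def parse(pdf_path: str, needs_accuracy: bool = False) -> dict:
    """Route a document to the fast or accurate tier (hypothetical helper)."""
    url = ACCURATE_URL if needs_accuracy else FAST_URL
    timeout = 600 if needs_accuracy else 30  # VLM parsing can take minutes
    with open(pdf_path, "rb") as f:
        response = requests.post(url, files={"file": f}, timeout=timeout)
    response.raise_for_status()
    return response.json()
```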
## Quick Start

The easiest way to get started is using Docker Compose. Both parsers work with or without a GPU; the accurate parser automatically detects the hardware and adjusts accordingly.
```bash
# Clone the repository
git clone https://github.com/daddal001/two_tier_document_parser.git
cd two_tier_document_parser

# Start services (works with or without GPU)
docker-compose -f deploy/docker-compose.yml up --build -d

# Check service health
curl http://localhost:8004/health  # Fast parser
curl http://localhost:8005/health  # Accurate parser
```

Note for CPU-only machines: If Docker fails to start with an "nvidia runtime" error, see the GPU Configuration Guide below for CPU-only setup instructions.
Services will be available at:

- Fast Parser: `http://localhost:8004` (Swagger UI: `/docs`)
- Accurate Parser: `http://localhost:8005` (Swagger UI: `/docs`)
  - With GPU: VLM backend (95%+ accuracy)
  - Without GPU: Pipeline backend (80-85% accuracy)
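
The accurate parser loads models at startup, so it may take a while before it accepts requests. A small sketch that polls the documented `/health` endpoints before sending work (the 300-second budget and 5-second interval are arbitrary):

```python
import time
import requests

SERVICES = {
    "fast": "http://localhost:8004/health",
    "accurate": "http://localhost:8005/health",
}

def wait_until_healthy(timeout_s: int = 300) -> None:
    """Block until both parsers report healthy, or raise."""
    deadline = time.time() + timeout_s
    pending = dict(SERVICES)
    while pending and time.time() < deadline:
        for name, url in list(pending.items()):
            try:
                if requests.get(url, timeout=5).status_code == 200:
                    print(f"{name} parser is healthy")
                    del pending[name]
            except requests.RequestException:
                pass  # container may still be starting or pulling models
        if pending:
            time.sleep(5)
    if pending:
        raise TimeoutError(f"services not healthy: {sorted(pending)}")

wait_until_healthy()
```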
Install as a Python package for development:

```bash
# Install the package
pip install -e .[fast]      # For fast parser only
pip install -e .[accurate]  # For accurate parser (works with or without GPU)

# Or install with all dependencies
pip install -e .[dev]
```

## Architecture

```mermaid
graph TD
Client[Client / RAG Pipeline]
subgraph "Docker Host"
Fast[Fast Parser Service<br/>Port 8004<br/>CPU-Only]
Accurate[Accurate Parser Service<br/>Port 8005<br/>Auto-Detects Hardware]
end
Client -->|POST /parse<br/>Quick Text| Fast
Client -->|POST /parse<br/>Deep Extraction| Accurate
Fast -->|PyMuPDF4LLM| CPU1[CPU]
Accurate -->|Auto-Detect| Decision{GPU<br/>Available?}
Decision -->|Yes| Transformers[Transformers Backend<br/>MinerU VLM Model<br/>95%+ Accuracy]
Decision -->|No| Pipeline[Pipeline Backend<br/>Traditional CV Models<br/>80-85% Accuracy]
Transformers -->|Uses| GPU[NVIDIA GPU<br/>Tesla T4 / Ampere+]
Pipeline -->|Uses| CPU2[CPU<br/>Traditional CV Models]
style Fast fill:#2563eb,stroke:#1e40af,stroke-width:3px,color:#fff
style Accurate fill:#ea580c,stroke:#c2410c,stroke-width:3px,color:#fff
style Decision fill:#9333ea,stroke:#7e22ce,stroke-width:3px,color:#fff
style Transformers fill:#16a34a,stroke:#15803d,stroke-width:3px,color:#fff
style Pipeline fill:#ca8a04,stroke:#a16207,stroke-width:3px,color:#fff
style CPU1 fill:#16a34a,stroke:#15803d,stroke-width:3px,color:#fff
style CPU2 fill:#16a34a,stroke:#15803d,stroke-width:3px,color:#fff
style GPU fill:#dc2626,stroke:#b91c1c,stroke-width:3px,color:#fff
```

Key Design Decisions:
- Two-tier approach: Fast parser for simple documents, accurate parser for complex ones
- Automatic GPU fallback: Accurate parser detects hardware and selects optimal backend (Transformers or Pipeline)
- Independent services: Each parser runs in its own container for isolation
- RESTful API: Simple HTTP interface for easy integration
- Docker-first: Containerized deployment for consistency
- Hardware agnostic: Works on any hardware - CPU-only to high-end GPUs
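
The backend decision in the diagram reduces to a hardware probe at startup. A minimal sketch of that logic, assuming PyTorch is available; the service's actual detection code may differ:

```python
import torch

def select_backend() -> str:
    """Pick the MinerU backend the way the diagram describes (sketch only)."""
    if torch.cuda.is_available():
        print(f"GPU detected ({torch.cuda.get_device_name(0)}); using VLM backend")
        return "transformers"
    print("No GPU detected; falling back to the CPU pipeline backend")
    return "pipeline"
```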
## API Documentation

### Fast Parser (Tier 1)

Endpoint: `POST http://localhost:8004/parse`
Request:

```bash
curl -X POST "http://localhost:8004/parse" \
  -F "file=@document.pdf"
```

Response:

```json
{
  "markdown": "# Document Title\n\nContent here...",
  "metadata": {
    "pages": 10,
    "processing_time_ms": 5234,
    "parser": "pymupdf4llm",
    "version": "1.0.0"
  }
}
```

### Accurate Parser (Tier 2)

Endpoint: `POST http://localhost:8005/parse`
Request:

```bash
curl -X POST "http://localhost:8005/parse" \
  -F "file=@document.pdf" \
  --max-time 600
```

Response:

```json
{
  "markdown": "# Document Title\n\nContent with images, tables, and formulas...",
  "metadata": {
    "pages": 10,
    "processing_time_ms": 450000,
    "parser": "mineru",
    "backend": "transformers",
    "gpu_used": true,
    "accuracy_tier": "very-high",
    "version": "2.6.4",
    "filename": "document.pdf"
  },
  "images": [
    {
      "image_id": "page_1_img_0",
      "image_base64": "iVBORw0KGgo...",
      "page": 1,
      "bbox": [100, 200, 300, 400]
    }
  ],
  "tables": [
    {
      "table_id": "page_2_table_0",
      "markdown": "| Col1 | Col2 |\n|------|------|\n| Val1 | Val2 |",
      "page": 2,
      "bbox": [50, 100, 500, 300]
    }
  ],
  "formulas": [
    {
      "formula_id": "page_3_formula_0",
      "latex": "E = mc^2",
      "page": 3,
      "bbox": [200, 150, 300, 200]
    }
  ]
}
```

For complete API documentation, see docs/API.md or visit the interactive Swagger UI at `/docs` on each service.
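
The `image_base64` fields are base64-encoded image bytes (the `iVBORw0KGgo...` prefix in the sample response is the PNG magic number), so persisting the extracted artifacts takes only a few lines. A sketch, assuming the response schema shown above:

```python
import base64
import pathlib

def save_artifacts(result: dict, out_dir: str = "parsed") -> None:
    """Write markdown, images, and tables from an accurate-parser response to disk."""
    out = pathlib.Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    (out / "document.md").write_text(result["markdown"], encoding="utf-8")
    for img in result.get("images", []):
        # .png extension assumed from the sample payload; inspect the bytes if unsure
        (out / f"{img['image_id']}.png").write_bytes(base64.b64decode(img["image_base64"]))
    for table in result.get("tables", []):
        (out / f"{table['table_id']}.md").write_text(table["markdown"], encoding="utf-8")
```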
## Examples

```python
import requests

# Fast parsing (CPU)
def parse_fast(pdf_path: str) -> dict:
    with open(pdf_path, 'rb') as f:
        files = {'file': f}
        response = requests.post('http://localhost:8004/parse', files=files)
    return response.json()

# Accurate parsing (GPU)
def parse_accurate(pdf_path: str) -> dict:
    with open(pdf_path, 'rb') as f:
        files = {'file': f}
        # Note: Increase timeout for VLM processing
        response = requests.post(
            'http://localhost:8005/parse',
            files=files,
            timeout=6000
        )
    return response.json()

# Usage
result = parse_fast('document.pdf')
print(result['markdown'])

result = parse_accurate('complex_document.pdf')
print(f"Extracted {len(result['images'])} images")
print(f"Extracted {len(result['tables'])} tables")
```

We provide a professional CLI client with Rich UI:
```bash
# Install client dependencies
pip install rich requests

# Run the demo
python examples/demo_client.py examples/data/sample.pdf --mode fast
python examples/demo_client.py examples/data/sample.pdf --mode accurate --timeout 6000
```

See examples/notebooks/parser_visualization.ipynb for an interactive example with visualizations.
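
For RAG pipelines, the returned markdown still needs chunking before embedding. A deliberately naive sketch that reuses `parse_fast` from the Python example above; the heading-based split and 2000-character cap are illustrative, not project defaults:

```python
def chunk_markdown(markdown: str, max_chars: int = 2000) -> list[str]:
    """Split markdown on headings, then hard-split any oversized chunk."""
    chunks, current = [], []
    for line in markdown.splitlines():
        if line.startswith("#") and current:
            chunks.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current))
    return [c[i:i + max_chars] for c in chunks for i in range(0, len(c), max_chars)]

result = parse_fast("document.pdf")
for chunk in chunk_markdown(result["markdown"]):
    ...  # embed and index each chunk in your vector store
```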
## Performance Benchmarks

| Metric | Fast Parser | Accurate Parser (GPU) | Accurate Parser (CPU) |
|---|---|---|---|
| Hardware | CPU | Tesla T4 / Ampere+ | Any CPU (multi-core recommended) |
| Backend | PyMuPDF4LLM | Transformers (MinerU VLM) | Pipeline (OCR + CV Models) |
| Latency per Page | < 1s | 20-30s (T4) / 15-20s (Ampere+) | 30-60s |
| Total Time (11 pages) | ~8s | 8-10 min (T4) / 5-8 min (Ampere+) | 8-12 min |
| Accuracy | 70-75% | 95%+ | 80-85% |
| Content Types | Text only | Text + Images + Tables + Formulas | Text + Images + Tables + Formulas |
| GPU Required | ❌ No | ✅ Yes (optional) | ❌ No |
| Auto-Fallback | N/A | ✅ Falls back to CPU if no GPU | N/A |
| Concurrent Requests | High (CPU-bound) | Low-Medium (GPU-bound) | Medium (CPU-bound) |
Hardware Notes:

- Fast Parser: Scales with CPU cores (Python no-GIL mode enabled for true parallelism). Tested on 4-core systems.
- Accurate Parser (GPU):
  - Tesla T4 / Turing GPUs (CC 7.5): Uses the `transformers` backend for universal compatibility. Virtual VRAM configured for 15GB.
  - Ampere+ GPUs (A10, A100, RTX 3090+, CC 8.0+): Also uses the `transformers` backend (current implementation). Future vLLM engine support could provide a 2-3x speedup.
- Accurate Parser (CPU Fallback): Automatically uses the `pipeline` backend with OCR and traditional CV models when no GPU is detected. Comparable in processing time to GPU mode, but uses a different inference approach. Multi-core CPUs (8+ cores) are recommended for better throughput.
## FAQ

### When should I use each parser?

- Fast Parser: Use for high-volume text extraction, simple documents, or when speed is critical
- Accurate Parser: Use for complex layouts, scientific papers, documents with tables/formulas, or when accuracy is paramount
### Can I use both parsers together?

Yes! Many production systems use both (see the sketch after this list):
- Fast parser for initial processing and filtering
- Accurate parser for complex documents that need detailed extraction
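
One way to combine them is to try the fast tier first and escalate only when the result looks too thin to be useful. A sketch reusing `parse_fast` and `parse_accurate` from the Examples section; the 200-character threshold is an arbitrary illustration:

```python
def parse_tiered(pdf_path: str) -> dict:
    """Fast first; escalate to the accurate tier when the output looks empty."""
    result = parse_fast(pdf_path)
    if len(result["markdown"].strip()) < 200:  # likely scanned or layout-heavy
        result = parse_accurate(pdf_path)
    return result
```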
### What happens if I don't have a GPU?

The accurate parser automatically detects GPU availability and falls back gracefully; no code or configuration changes are needed:
| GPU Status | Backend Used | Accuracy | Speed | Use Case |
|---|---|---|---|---|
| ✅ GPU Available | Transformers (MinerU VLM) | 95%+ | 15-60s/page | Highest accuracy |
| ❌ No GPU | Pipeline (CPU) | 80-85% | 30-60s/page | Good accuracy without a GPU |
Key Benefits:

- ✅ Fully automatic detection - no environment variables or config files to edit
- ✅ Service starts successfully on any hardware
- ✅ Check `metadata.backend` and `metadata.gpu_used` in the API response to see which mode was used (example below)
- ✅ Pipeline mode still extracts images, tables, and formulas using traditional CV models
- ✅ Same API interface regardless of backend
Docker Note: On machines without NVIDIA drivers, you may need to comment out `runtime: nvidia` in `deploy/docker-compose.yml` if Docker fails to start the container. The Python service will still auto-detect and use CPU mode.
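
To confirm which backend actually served a request, inspect the metadata fields documented in the API section, for example:

```python
meta = parse_accurate("document.pdf")["metadata"]  # parse_accurate from the Examples section
if meta.get("gpu_used"):
    print(f"GPU run via the {meta['backend']} backend")
else:
    print(f"CPU fallback via the {meta['backend']} backend")
```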
### Do I need a GPU?

GPU is completely optional! Both parsers work on any hardware:
- Fast Parser: Always uses CPU (no GPU needed)
- Accurate Parser:
  - ✅ Automatically detects and adapts to available hardware
  - With GPU: Uses VLM mode (95%+ accuracy) - NVIDIA Tesla T4, A10, A100, RTX 3090+
  - Without GPU: Uses Pipeline mode (80-85% accuracy) - any CPU
No configuration changes needed - the service detects GPU availability at startup and selects the optimal backend automatically.
### GPU Configuration Guide

If you want to optimize Docker settings for your specific GPU, here is a comprehensive configuration guide:
| Hardware | Backend | Docker Runtime | GPU VRAM Required | System RAM (Docker Memory) | Virtual VRAM Setting | Expected Performance | Docker Changes Needed |
|---|---|---|---|---|---|---|---|
| Tesla T4 / Turing (CC 7.5) | Transformers (MinerU VLM) | `runtime: nvidia` | 8-16 GB (T4 has 16GB) | 16G (min)<br/>32G (recommended) | `MINERU_VIRTUAL_VRAM_SIZE=15` | 95%+ accuracy<br/>~20-30s/page | ✅ Keep GPU config as-is<br/>💡 Optional: Increase RAM to 32G |
| Ampere+ (A10/A100/RTX 3090+) | Transformers (MinerU VLM) | `runtime: nvidia` | 8-24 GB (A10: 24GB, A100: 40/80GB) | 16G (min)<br/>32G (recommended) | `MINERU_VIRTUAL_VRAM_SIZE=20-32` | 95%+ accuracy<br/>~15-20s/page | ✅ Keep GPU config<br/>💡 Optional: Increase RAM/VRAM |
| CPU Only (No GPU) | Pipeline (CPU) | ❌ Comment out | N/A (no GPU) | 8G (min)<br/>16-32G (better performance) | N/A | 80-85% accuracy<br/>~30-60s/page | ❌ Comment out `runtime: nvidia` and the `devices` section |
Understanding the Settings:

- GPU VRAM Required: Video memory on the GPU chip (hardware specification)
- System RAM (Docker Memory): Host system memory allocated to the Docker container
- Virtual VRAM Setting: `MINERU_VIRTUAL_VRAM_SIZE` tells MinerU how much GPU VRAM to use (set slightly below physical VRAM to leave a buffer, as sketched below)
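
If you want to derive the setting from detected hardware rather than hard-coding it, here is a sketch using PyTorch's device query; the ~1 GB buffer mirrors the guidance above, and setting the variable in docker-compose.yml remains the documented route:

```python
import torch

if torch.cuda.is_available():
    total_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
    suggested = max(1, int(total_gb) - 1)  # leave ~1 GB of headroom
    print(f"Physical VRAM: {total_gb:.1f} GB -> MINERU_VIRTUAL_VRAM_SIZE={suggested}")
else:
    print("No GPU detected; MINERU_VIRTUAL_VRAM_SIZE is not needed")
```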
Quick Setup by Hardware:
🔧 Tesla T4 / Turing GPUs (Compute Capability 7.5)
Current configuration is already optimized for Tesla T4! No changes needed.
Your `docker-compose.yml` already has:

```yaml
environment:
  - MINERU_VIRTUAL_VRAM_SIZE=15
  - CUDA_VISIBLE_DEVICES=0
  - TOKENIZERS_PARALLELISM=false
deploy:
  resources:
    limits:
      memory: 16G
```

Why these settings:

- VRAM 15GB: Optimal for T4's 16GB memory
- Transformers backend: Most stable for Turing architecture
- Memory 16GB: Sufficient for VLM processing
⚡ Ampere+ GPUs (A10, A100, RTX 3090+)
Optional optimization for more VRAM:
Edit `deploy/docker-compose.yml`:

```yaml
environment:
  - MINERU_VIRTUAL_VRAM_SIZE=20   # Increase from 15 to 20+
  - CUDA_VISIBLE_DEVICES=0
  - TOKENIZERS_PARALLELISM=false
deploy:
  resources:
    limits:
      memory: 24G                 # Increase from 16G
```

Performance gains:

- Larger batch sizes
- Faster processing (~15-20s/page vs 20-30s)
- Better handling of large documents
💻 CPU-Only Deployment (No GPU)
The service auto-detects and uses the pipeline backend; only the Docker config needs adjustment.

Edit `deploy/docker-compose.yml` and comment out the GPU settings:

```yaml
accurate-parser:
  # ... other settings ...
  restart: unless-stopped
  # runtime: nvidia          # ← Comment this out
  deploy:
    resources:
      limits:
        cpus: '2'
        memory: 8G           # Reduced from 16G
      reservations:
        cpus: '1'
        memory: 4G
      # devices:             # ← Comment out entire section
      #   - driver: nvidia
      #     count: 1
      #     capabilities: [gpu]
```

No environment variable changes needed - the Python service auto-detects CPU mode.
### How should I scale the services?

- Fast Parser: Scale horizontally (multiple containers) since it's CPU-bound; a round-robin sketch follows this list
- Accurate Parser: Scale vertically (better GPU) or use multiple GPU instances
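
For horizontal scaling without a reverse proxy, even client-side round-robin works as a stopgap. A sketch; the second replica URL is hypothetical, and a real deployment would normally put nginx/traefik or an orchestrator in front:

```python
import itertools
import requests

FAST_REPLICAS = itertools.cycle([
    "http://localhost:8004",
    "http://localhost:8014",  # hypothetical second fast-parser replica
])

def parse_fast_balanced(pdf_path: str) -> dict:
    """Round-robin fast-parser requests across replicas."""
    base = next(FAST_REPLICAS)
    with open(pdf_path, "rb") as f:
        response = requests.post(f"{base}/parse", files={"file": f}, timeout=60)
    response.raise_for_status()
    return response.json()
```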
### Is this production-ready?

Yes! The codebase is designed for production use with:
- Health checks
- Error handling
- Structured logging
- Docker deployment
- API documentation
## Troubleshooting

### CUDA errors or crashes on Tesla T4

Cause: Using the vLLM backend on a Tesla T4 GPU.

Fix: The default configuration uses the `transformers` backend, which is stable on T4. Ensure you haven't manually forced `vllm-engine`.
### Requests time out

Cause: VLM processing takes time (up to 10 minutes on a T4).

Fix: Increase your client timeout:

```python
response = requests.post('http://localhost:8005/parse', files=files, timeout=600)
```

### GPU not detected

Cause: Docker may not have GPU access configured.
Fix: Ensure NVIDIA Container Toolkit is installed and Docker Compose has GPU runtime configured. See docs/DOCKER_SETUP.md for details.
### Out-of-memory errors

Cause: GPU memory exhausted during VLM processing.

Fix: Reduce `MINERU_VIRTUAL_VRAM_SIZE` in docker-compose.yml or use a GPU with more memory.
For more troubleshooting, see docs/DOCKER_SETUP.md.
## Documentation

- API Reference: Complete API documentation
- Docker Setup: Detailed Docker configuration and troubleshooting
- Setup Guide: Step-by-step setup instructions
- Testing Guide: Testing procedures and benchmarks
- Git Submodules: Managing MinerU submodule
## License

This project is licensed under the AGPL-3.0 License due to dependencies on MinerU and PyMuPDF. See LICENSE for details.
## Future Improvements

- vLLM Engine Support: Add a vLLM engine backend for Ampere+ GPUs (A10, A100, RTX 3090+) to achieve 2-3x faster inference compared to the Transformers backend. Currently uses Transformers for universal compatibility with Tesla T4/Turing GPUs.
- Batch Processing API: Support multiple document uploads in a single request
- Streaming Responses: Stream parsing results for large documents
- Multi-GPU Support: Distribute processing across multiple GPUs
- Model Caching: Optimize model loading for faster cold starts
## Acknowledgements

This project builds upon excellent open-source software:
- MinerU (AGPL-3.0) - State-of-the-art multimodal document parsing with VLM
- PyMuPDF (AGPL-3.0) - Lightning-fast PDF text extraction
- vLLM (Apache-2.0) - High-performance LLM inference engine
- Transformers (Apache-2.0) - State-of-the-art ML models and inference
- FastAPI (MIT) - Modern, fast web framework for building APIs
- Uvicorn (BSD-3-Clause) - Lightning-fast ASGI server
See NOTICE for complete licensing information.