# Two-Tier Document Parser

A production-ready, two-tier document parsing API designed for RAG pipelines and data extraction workflows. It intelligently routes parsing requests to either a high-speed CPU-based parser or a high-accuracy VLM parser (with automatic CPU-pipeline fallback when no GPU is available).
- [Features](#features)
- [Quick Start](#quick-start)
- [Architecture](#architecture)
- [API Documentation](#api-documentation)
- [Examples](#examples)
- [Performance Benchmarks](#performance-benchmarks)
- [FAQ](#faq)
- [Troubleshooting](#troubleshooting)
- [Documentation](#documentation)
- [Future Improvements](#future-improvements)
- [License](#license)
- [Acknowledgements](#acknowledgements)
## Features

| Feature | Fast Parser (Tier 1) | Accurate Parser (Tier 2) |
|---|---|---|
| Engine | PyMuPDF4LLM | MinerU (VLM or Pipeline) |
| Hardware | CPU Only | GPU (VLM) or CPU (Pipeline) |
| Auto Fallback | N/A | ✅ Pipeline mode if no GPU |
| Accuracy | 70-75% | 95% (VLM) / 80-85% (Pipeline) |
| Speed | < 1s per page | 15-60s per page (VLM) / 30-60s per page (Pipeline) |
| Output | Markdown Text | Markdown + Images + Tables + Formulas |
| Best For | High volume, simple text | Complex layouts, scanned docs, scientific papers |
| Concurrency | High (CPU-bound) | Medium (GPU-bound) / High (CPU fallback) |
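
Because both tiers expose the same `/parse` interface, routing can live in a thin client. A minimal sketch, using the endpoint URLs from this README; the `needs_accuracy` flag and timeout values are illustrative caller-side choices, not part of the service:

```python
import requests

FAST_URL = "http://localhost:8004/parse"      # Tier 1: speed
ACCURATE_URL = "http://localhost:8005/parse"  # Tier 2: accuracy

def parse(pdf_path: str, needs_accuracy: bool = False) -> dict:
    """Route a document to the fast or accurate tier (hypothetical helper)."""
    url = ACCURATE_URL if needs_accuracy else FAST_URL
    timeout = 600 if needs_accuracy else 30  # VLM parsing can take minutes
    with open(pdf_path, "rb") as f:
        response = requests.post(url, files={"file": f}, timeout=timeout)
    response.raise_for_status()
    return response.json()
```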
## Quick Start

The easiest way to get started is using Docker Compose. Both parsers work with or without a GPU; the accurate parser automatically detects the hardware and adjusts accordingly.
```bash
# Clone the repository
git clone https://github.com/daddal001/two_tier_document_parser.git
cd two_tier_document_parser

# Start services (works with or without GPU)
docker-compose -f deploy/docker-compose.yml up --build -d

# Check service health
curl http://localhost:8004/health  # Fast parser
curl http://localhost:8005/health  # Accurate parser
```

Note for CPU-only machines: If Docker fails to start with an "nvidia runtime" error, see the GPU Configuration Guide below for CPU-only setup instructions.
Services will be available at:

- Fast Parser: `http://localhost:8004` (Swagger UI: `/docs`)
- Accurate Parser: `http://localhost:8005` (Swagger UI: `/docs`)
  - With GPU: VLM backend (95%+ accuracy)
  - Without GPU: Pipeline backend (80-85% accuracy)
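
The accurate parser loads models at startup, so it may take a while before it accepts requests. A small sketch that polls the documented `/health` endpoints before sending work (the 300-second budget and 5-second interval are arbitrary):

```python
import time
import requests

SERVICES = {
    "fast": "http://localhost:8004/health",
    "accurate": "http://localhost:8005/health",
}

def wait_until_healthy(timeout_s: int = 300) -> None:
    """Block until both parsers report healthy, or raise."""
    deadline = time.time() + timeout_s
    pending = dict(SERVICES)
    while pending and time.time() < deadline:
        for name, url in list(pending.items()):
            try:
                if requests.get(url, timeout=5).status_code == 200:
                    print(f"{name} parser is healthy")
                    del pending[name]
            except requests.RequestException:
                pass  # container may still be starting or pulling models
        if pending:
            time.sleep(5)
    if pending:
        raise TimeoutError(f"services not healthy: {sorted(pending)}")

wait_until_healthy()
```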
Install as a Python package for development:

```bash
# Install the package
pip install -e .[fast]      # For fast parser only
pip install -e .[accurate]  # For accurate parser (works with or without GPU)

# Or install with all dependencies
pip install -e .[dev]
```

## Architecture

```mermaid
graph TD
Client[Client / RAG Pipeline]
subgraph "Docker Host"
Fast[Fast Parser Service<br/>Port 8004<br/>CPU-Only]
Accurate[Accurate Parser Service<br/>Port 8005<br/>Auto-Detects Hardware]
end
Client -->|POST /parse<br/>Quick Text| Fast
Client -->|POST /parse<br/>Deep Extraction| Accurate
Fast -->|PyMuPDF4LLM| CPU1[CPU]
Accurate -->|Auto-Detect| Decision{GPU<br/>Available?}
Decision -->|Yes| Transformers[Transformers Backend<br/>MinerU VLM Model<br/>95%+ Accuracy]
Decision -->|No| Pipeline[Pipeline Backend<br/>Traditional CV Models<br/>80-85% Accuracy]
Transformers -->|Uses| GPU[NVIDIA GPU<br/>Tesla T4 / Ampere+]
Pipeline -->|Uses| CPU2[CPU<br/>Traditional CV Models]
style Fast fill:#2563eb,stroke:#1e40af,stroke-width:3px,color:#fff
style Accurate fill:#ea580c,stroke:#c2410c,stroke-width:3px,color:#fff
style Decision fill:#9333ea,stroke:#7e22ce,stroke-width:3px,color:#fff
style Transformers fill:#16a34a,stroke:#15803d,stroke-width:3px,color:#fff
style Pipeline fill:#ca8a04,stroke:#a16207,stroke-width:3px,color:#fff
style CPU1 fill:#16a34a,stroke:#15803d,stroke-width:3px,color:#fff
style CPU2 fill:#16a34a,stroke:#15803d,stroke-width:3px,color:#fff
style GPU fill:#dc2626,stroke:#b91c1c,stroke-width:3px,color:#fff
```

Key Design Decisions:
- Two-tier approach: Fast parser for simple documents, accurate parser for complex ones
- Automatic GPU fallback: Accurate parser detects hardware and selects optimal backend (Transformers or Pipeline)
- Independent services: Each parser runs in its own container for isolation
- RESTful API: Simple HTTP interface for easy integration
- Docker-first: Containerized deployment for consistency
- Hardware agnostic: Works on any hardware - CPU-only to high-end GPUs
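
The backend decision in the diagram reduces to a hardware probe at startup. A minimal sketch of that logic, assuming PyTorch is available; the service's actual detection code may differ:

```python
import torch

def select_backend() -> str:
    """Pick the MinerU backend the way the diagram describes (sketch only)."""
    if torch.cuda.is_available():
        print(f"GPU detected ({torch.cuda.get_device_name(0)}); using VLM backend")
        return "transformers"
    print("No GPU detected; falling back to the CPU pipeline backend")
    return "pipeline"
```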
## API Documentation

### Fast Parser (Tier 1)

Endpoint: `POST http://localhost:8004/parse`
Request:

```bash
curl -X POST "http://localhost:8004/parse" \
  -F "file=@document.pdf"
```

Response:

```json
{
  "markdown": "# Document Title\n\nContent here...",
  "metadata": {
    "pages": 10,
    "processing_time_ms": 5234,
    "parser": "pymupdf4llm",
    "version": "1.0.0"
  }
}
```

### Accurate Parser (Tier 2)

Endpoint: `POST http://localhost:8005/parse`
Request:

```bash
curl -X POST "http://localhost:8005/parse" \
  -F "file=@document.pdf" \
  --max-time 600
```

Response:

```json
{
  "markdown": "# Document Title\n\nContent with images, tables, and formulas...",
  "metadata": {
    "pages": 10,
    "processing_time_ms": 450000,
    "parser": "mineru",
    "backend": "transformers",
    "gpu_used": true,
    "accuracy_tier": "very-high",
    "version": "2.6.4",
    "filename": "document.pdf"
  },
  "images": [
    {
      "image_id": "page_1_img_0",
      "image_base64": "iVBORw0KGgo...",
      "page": 1,
      "bbox": [100, 200, 300, 400]
    }
  ],
  "tables": [
    {
      "table_id": "page_2_table_0",
      "markdown": "| Col1 | Col2 |\n|------|------|\n| Val1 | Val2 |",
      "page": 2,
      "bbox": [50, 100, 500, 300]
    }
  ],
  "formulas": [
    {
      "formula_id": "page_3_formula_0",
      "latex": "E = mc^2",
      "page": 3,
      "bbox": [200, 150, 300, 200]
    }
  ]
}
```

For complete API documentation, see docs/API.md or visit the interactive Swagger UI at `/docs` on each service.
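
The `image_base64` fields are base64-encoded image bytes (the `iVBORw0KGgo...` prefix in the sample response is the PNG magic number), so persisting the extracted artifacts takes only a few lines. A sketch, assuming the response schema shown above:

```python
import base64
import pathlib

def save_artifacts(result: dict, out_dir: str = "parsed") -> None:
    """Write markdown, images, and tables from an accurate-parser response to disk."""
    out = pathlib.Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    (out / "document.md").write_text(result["markdown"], encoding="utf-8")
    for img in result.get("images", []):
        # .png extension assumed from the sample payload; inspect the bytes if unsure
        (out / f"{img['image_id']}.png").write_bytes(base64.b64decode(img["image_base64"]))
    for table in result.get("tables", []):
        (out / f"{table['table_id']}.md").write_text(table["markdown"], encoding="utf-8")
```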
## Examples

```python
import requests

# Fast parsing (CPU)
def parse_fast(pdf_path: str) -> dict:
    with open(pdf_path, 'rb') as f:
        files = {'file': f}
        response = requests.post('http://localhost:8004/parse', files=files)
    return response.json()

# Accurate parsing (GPU)
def parse_accurate(pdf_path: str) -> dict:
    with open(pdf_path, 'rb') as f:
        files = {'file': f}
        # Note: Increase timeout for VLM processing
        response = requests.post(
            'http://localhost:8005/parse',
            files=files,
            timeout=6000
        )
    return response.json()

# Usage
result = parse_fast('document.pdf')
print(result['markdown'])

result = parse_accurate('complex_document.pdf')
print(f"Extracted {len(result['images'])} images")
print(f"Extracted {len(result['tables'])} tables")
```

We provide a professional CLI client with Rich UI:
```bash
# Install client dependencies
pip install rich requests

# Run the demo
python examples/demo_client.py examples/data/sample.pdf --mode fast
python examples/demo_client.py examples/data/sample.pdf --mode accurate --timeout 6000
```

See examples/notebooks/parser_visualization.ipynb for an interactive example with visualizations.
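
For RAG pipelines, the returned markdown still needs chunking before embedding. A deliberately naive sketch that reuses `parse_fast` from the Python example above; the heading-based split and 2000-character cap are illustrative, not project defaults:

```python
def chunk_markdown(markdown: str, max_chars: int = 2000) -> list[str]:
    """Split markdown on headings, then hard-split any oversized chunk."""
    chunks, current = [], []
    for line in markdown.splitlines():
        if line.startswith("#") and current:
            chunks.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current))
    return [c[i:i + max_chars] for c in chunks for i in range(0, len(c), max_chars)]

result = parse_fast("document.pdf")
for chunk in chunk_markdown(result["markdown"]):
    ...  # embed and index each chunk in your vector store
```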
## Performance Benchmarks

| Metric | Fast Parser | Accurate Parser (GPU) | Accurate Parser (CPU) |
|---|---|---|---|
| Hardware | CPU | Tesla T4 / Ampere+ | Any CPU (multi-core recommended) |
| Backend | PyMuPDF4LLM | Transformers (MinerU VLM) | Pipeline (OCR + CV Models) |
| Latency per Page | < 1s | 20-30s (T4) / 15-20s (Ampere+) | 30-60s |
| Total Time (11 pages) | ~8s | 8-10 min (T4) / 5-8 min (Ampere+) | 8-12 min |
| Accuracy | 70-75% | 95%+ | 80-85% |
| Content Types | Text only | Text + Images + Tables + Formulas | Text + Images + Tables + Formulas |
| GPU Required | ❌ No | ✅ Yes (optional) | ❌ No |
| Auto-Fallback | N/A | ✅ Falls back to CPU if no GPU | N/A |
| Concurrent Requests | High (CPU-bound) | Low-Medium (GPU-bound) | Medium (CPU-bound) |
Hardware Notes:

- Fast Parser: Scales with CPU cores (Python no-GIL mode enabled for true parallelism). Tested on 4-core systems.
- Accurate Parser (GPU):
  - Tesla T4 / Turing GPUs (CC 7.5): Uses the `transformers` backend for universal compatibility. Virtual VRAM configured for 15GB.
  - Ampere+ GPUs (A10, A100, RTX 3090+, CC 8.0+): Also uses the `transformers` backend (current implementation). Future vLLM engine support could provide a 2-3x speedup.
- Accurate Parser (CPU Fallback): Automatically uses the `pipeline` backend with OCR and traditional CV models when no GPU is detected. Comparable in processing time to GPU mode, but uses a different inference approach. Multi-core CPUs (8+ cores) are recommended for better throughput.
## FAQ

### When should I use each parser?

- Fast Parser: Use for high-volume text extraction, simple documents, or when speed is critical
- Accurate Parser: Use for complex layouts, scientific papers, documents with tables/formulas, or when accuracy is paramount
### Can I use both parsers together?

Yes! Many production systems use both (see the sketch after this list):
- Fast parser for initial processing and filtering
- Accurate parser for complex documents that need detailed extraction
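
One way to combine them is to try the fast tier first and escalate only when the result looks too thin to be useful. A sketch reusing `parse_fast` and `parse_accurate` from the Examples section; the 200-character threshold is an arbitrary illustration:

```python
def parse_tiered(pdf_path: str) -> dict:
    """Fast first; escalate to the accurate tier when the output looks empty."""
    result = parse_fast(pdf_path)
    if len(result["markdown"].strip()) < 200:  # likely scanned or layout-heavy
        result = parse_accurate(pdf_path)
    return result
```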
### What happens if I don't have a GPU?

The accurate parser automatically detects GPU availability and falls back gracefully; no code or configuration changes are needed:
| GPU Status | Backend Used | Accuracy | Speed | Use Case |
|---|---|---|---|---|
| ✅ GPU Available | Transformers (MinerU VLM) | 95%+ | 15-60s/page | Highest accuracy |
| ❌ No GPU | Pipeline (CPU) | 80-85% | 30-60s/page | Good accuracy without a GPU |
Key Benefits:

- ✅ Fully automatic detection - no environment variables or config files to edit
- ✅ Service starts successfully on any hardware
- ✅ Check `metadata.backend` and `metadata.gpu_used` in the API response to see which mode was used (example below)
- ✅ Pipeline mode still extracts images, tables, and formulas using traditional CV models
- ✅ Same API interface regardless of backend
Docker Note: On machines without NVIDIA drivers, you may need to comment out `runtime: nvidia` in `deploy/docker-compose.yml` if Docker fails to start the container. The Python service will still auto-detect and use CPU mode.
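
To confirm which backend actually served a request, inspect the metadata fields documented in the API section, for example:

```python
meta = parse_accurate("document.pdf")["metadata"]  # parse_accurate from the Examples section
if meta.get("gpu_used"):
    print(f"GPU run via the {meta['backend']} backend")
else:
    print(f"CPU fallback via the {meta['backend']} backend")
```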
### Do I need a GPU?

GPU is completely optional! Both parsers work on any hardware:
- Fast Parser: Always uses CPU (no GPU needed)
- Accurate Parser:
  - ✅ Automatically detects and adapts to available hardware
  - With GPU: Uses VLM mode (95%+ accuracy) - NVIDIA Tesla T4, A10, A100, RTX 3090+
  - Without GPU: Uses Pipeline mode (80-85% accuracy) - any CPU
No configuration changes needed - the service detects GPU availability at startup and selects the optimal backend automatically.
### GPU Configuration Guide

If you want to optimize Docker settings for your specific GPU, here is a comprehensive configuration guide:
| Hardware | Backend | Docker Runtime | GPU VRAM Required | System RAM (Docker Memory) | Virtual VRAM Setting | Expected Performance | Docker Changes Needed |
|---|---|---|---|---|---|---|---|
| Tesla T4 / Turing (CC 7.5) | Transformers (MinerU VLM) | `runtime: nvidia` | 8-16 GB (T4 has 16GB) | 16G (min)<br/>32G (recommended) | `MINERU_VIRTUAL_VRAM_SIZE=15` | 95%+ accuracy<br/>~20-30s/page | ✅ Keep GPU config as-is<br/>💡 Optional: Increase RAM to 32G |
| Ampere+ (A10/A100/RTX 3090+) | Transformers (MinerU VLM) | `runtime: nvidia` | 8-24 GB (A10: 24GB, A100: 40/80GB) | 16G (min)<br/>32G (recommended) | `MINERU_VIRTUAL_VRAM_SIZE=20-32` | 95%+ accuracy<br/>~15-20s/page | ✅ Keep GPU config<br/>💡 Optional: Increase RAM/VRAM |
| CPU Only (No GPU) | Pipeline (CPU) | ❌ Comment out | N/A (no GPU) | 8G (min)<br/>16-32G (better performance) | N/A | 80-85% accuracy<br/>~30-60s/page | ❌ Comment out `runtime: nvidia` and the `devices` section |
Understanding the Settings:

- GPU VRAM Required: Video memory on the GPU chip (hardware specification)
- System RAM (Docker Memory): Host system memory allocated to the Docker container
- Virtual VRAM Setting: `MINERU_VIRTUAL_VRAM_SIZE` tells MinerU how much GPU VRAM to use (set slightly below physical VRAM to leave a buffer, as sketched below)
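
If you want to derive the setting from detected hardware rather than hard-coding it, here is a sketch using PyTorch's device query; the ~1 GB buffer mirrors the guidance above, and setting the variable in docker-compose.yml remains the documented route:

```python
import torch

if torch.cuda.is_available():
    total_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
    suggested = max(1, int(total_gb) - 1)  # leave ~1 GB of headroom
    print(f"Physical VRAM: {total_gb:.1f} GB -> MINERU_VIRTUAL_VRAM_SIZE={suggested}")
else:
    print("No GPU detected; MINERU_VIRTUAL_VRAM_SIZE is not needed")
```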
Quick Setup by Hardware:
🔧 Tesla T4 / Turing GPUs (Compute Capability 7.5)
Current configuration is already optimized for Tesla T4! No changes needed.
Your `docker-compose.yml` already has:

```yaml
environment:
  - MINERU_VIRTUAL_VRAM_SIZE=15
  - CUDA_VISIBLE_DEVICES=0
  - TOKENIZERS_PARALLELISM=false
deploy:
  resources:
    limits:
      memory: 16G
```

Why these settings:

- VRAM 15GB: Optimal for T4's 16GB memory
- Transformers backend: Most stable for Turing architecture
- Memory 16GB: Sufficient for VLM processing
⚡ Ampere+ GPUs (A10, A100, RTX 3090+)
Optional optimization for more VRAM:
Edit `deploy/docker-compose.yml`:

```yaml
environment:
  - MINERU_VIRTUAL_VRAM_SIZE=20   # Increase from 15 to 20+
  - CUDA_VISIBLE_DEVICES=0
  - TOKENIZERS_PARALLELISM=false
deploy:
  resources:
    limits:
      memory: 24G                 # Increase from 16G
```

Performance gains:

- Larger batch sizes
- Faster processing (~15-20s/page vs 20-30s)
- Better handling of large documents
💻 CPU-Only Deployment (No GPU)
The service auto-detects and uses the pipeline backend; only the Docker config needs adjustment.

Edit `deploy/docker-compose.yml` and comment out the GPU settings:

```yaml
accurate-parser:
  # ... other settings ...
  restart: unless-stopped
  # runtime: nvidia          # ← Comment this out
  deploy:
    resources:
      limits:
        cpus: '2'
        memory: 8G           # Reduced from 16G
      reservations:
        cpus: '1'
        memory: 4G
      # devices:             # ← Comment out entire section
      #   - driver: nvidia
      #     count: 1
      #     capabilities: [gpu]
```

No environment variable changes needed - the Python service auto-detects CPU mode.
### How should I scale the services?

- Fast Parser: Scale horizontally (multiple containers) since it's CPU-bound; a round-robin sketch follows this list
- Accurate Parser: Scale vertically (better GPU) or use multiple GPU instances
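
For horizontal scaling without a reverse proxy, even client-side round-robin works as a stopgap. A sketch; the second replica URL is hypothetical, and a real deployment would normally put nginx/traefik or an orchestrator in front:

```python
import itertools
import requests

FAST_REPLICAS = itertools.cycle([
    "http://localhost:8004",
    "http://localhost:8014",  # hypothetical second fast-parser replica
])

def parse_fast_balanced(pdf_path: str) -> dict:
    """Round-robin fast-parser requests across replicas."""
    base = next(FAST_REPLICAS)
    with open(pdf_path, "rb") as f:
        response = requests.post(f"{base}/parse", files={"file": f}, timeout=60)
    response.raise_for_status()
    return response.json()
```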
### Is this production-ready?

Yes! The codebase is designed for production use with:
- Health checks
- Error handling
- Structured logging
- Docker deployment
- API documentation
## Troubleshooting

### CUDA errors or crashes on Tesla T4

Cause: Using the vLLM backend on a Tesla T4 GPU.

Fix: The default configuration uses the `transformers` backend, which is stable on T4. Ensure you haven't manually forced `vllm-engine`.
### Requests time out

Cause: VLM processing takes time (up to 10 minutes on a T4).

Fix: Increase your client timeout:

```python
response = requests.post('http://localhost:8005/parse', files=files, timeout=600)
```

### GPU not detected

Cause: Docker may not have GPU access configured.
Fix: Ensure NVIDIA Container Toolkit is installed and Docker Compose has GPU runtime configured. See docs/DOCKER_SETUP.md for details.
### Out-of-memory errors

Cause: GPU memory exhausted during VLM processing.

Fix: Reduce `MINERU_VIRTUAL_VRAM_SIZE` in docker-compose.yml or use a GPU with more memory.
For more troubleshooting, see docs/DOCKER_SETUP.md.
## Documentation

- API Reference: Complete API documentation
- Docker Setup: Detailed Docker configuration and troubleshooting
- Setup Guide: Step-by-step setup instructions
- Testing Guide: Testing procedures and benchmarks
- Git Submodules: Managing MinerU submodule
## License

This project is licensed under the AGPL-3.0 License due to dependencies on MinerU and PyMuPDF. See LICENSE for details.
## Future Improvements

- vLLM Engine Support: Add a vLLM engine backend for Ampere+ GPUs (A10, A100, RTX 3090+) to achieve 2-3x faster inference compared to the Transformers backend. Currently uses Transformers for universal compatibility with Tesla T4/Turing GPUs.
- Batch Processing API: Support multiple document uploads in a single request
- Streaming Responses: Stream parsing results for large documents
- Multi-GPU Support: Distribute processing across multiple GPUs
- Model Caching: Optimize model loading for faster cold starts
## Acknowledgements

This project builds upon excellent open-source software:
- MinerU (AGPL-3.0) - State-of-the-art multimodal document parsing with VLM
- PyMuPDF (AGPL-3.0) - Lightning-fast PDF text extraction
- vLLM (Apache-2.0) - High-performance LLM inference engine
- Transformers (Apache-2.0) - State-of-the-art ML models and inference
- FastAPI (MIT) - Modern, fast web framework for building APIs
- Uvicorn (BSD-3-Clause) - Lightning-fast ASGI server
See NOTICE for complete licensing information.