An enterprise-grade, fully offline document processing system with built-in LLM inference. Designed for on-premises deployment with no external service dependencies.
✅ Fully Offline & On-Premises
- No external API calls
- No telemetry or analytics
- Suitable for air-gapped environments
✅ Open Source Only
- Tesseract OCR
- PaddleOCR
- llama.cpp for local inference
- FastAPI for backend
- OpenCV for image processing
✅ CPU-Only Operation
- No GPU requirements
- Quantized models for efficiency
- Suitable for cost-effective deployment
✅ Deterministic Processing
- Low-temperature LLM inference
- No-hallucination mode (extracts only values present in the source text)
- Strict JSON schema validation
- Rule-based layout extraction
✅ Enterprise Features
- Document upload and management
- Interactive web UI for review
- Manual field correction
- Validation and consistency checks
- Export to JSON/CSV
```text
IDP System
├── Backend (FastAPI)
│   ├── Document Management
│   ├── Processing Orchestration
│   └── REST API
├── Inference Pipeline
│   ├── OCR Layer (Tesseract/PaddleOCR)
│   ├── Layout Extraction
│   └── LLM Inference (llama.cpp)
└── Web UI
    ├── Document Upload
    ├── Results Review
    └── Field Editing
```
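A rough sketch of how these layers meet in code (the route path matches the API documented below; `run_pipeline` is a hypothetical stand-in for the real orchestration in `backend/app/services.py`):

```python
from fastapi import FastAPI, UploadFile

app = FastAPI()

def run_pipeline(raw: bytes) -> dict:
    # Hypothetical hook: the actual logic lives in backend/app/services.py
    # and the inference/ packages (OCR -> layout -> LLM).
    return {"status": "completed"}

@app.post("/api/v1/documents/upload")
async def upload(file: UploadFile):
    raw = await file.read()     # Document Management: receive the upload
    result = run_pipeline(raw)  # Processing Orchestration: run the pipeline
    return result               # REST API: JSON response to the caller
```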
System requirements:

- Python: 3.10+
- OS: Linux, macOS, Windows
- CPU: 4+ cores recommended
- RAM: 8GB minimum (16GB+ recommended)
- Disk: 10GB+ for models and uploads
Setup:

```bash
cd /path/to/IDPCPU
python3.10 -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
pip install -r requirements.txt
```

Install Tesseract OCR:

Ubuntu/Debian:

```bash
sudo apt-get install -y \
  tesseract-ocr \
  libtesseract-dev \
  python3-dev
```

macOS:

```bash
brew install tesseract
```

Windows: download from https://github.com/UB-Mannheim/tesseract/wiki
Download a quantized GGUF model (required for extraction):

```bash
# Create models directory
mkdir -p /models

# Download a model (e.g., Mistral 7B Instruct, Q4_K_M quantization, ~3-4GB)
wget -O /models/mistral-7b-instruct-q4.gguf \
  https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.1-GGUF/resolve/main/mistral-7b-instruct-v0.1.Q4_K_M.gguf
```

Recommended Models:
- Mistral 7B Quantized (~3GB)
- Llama 2 7B Quantized (~3-4GB)
- Neural Chat 7B (~3GB)
Models should be:
- Quantized (Q4_0, Q4_K_M, or Q5 format)
- Around 3-7GB in size
- In GGUF format
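A quick smoke test that a downloaded model is usable, via llama-cpp-python (a common Python binding for llama.cpp; the path matches the download example above):

```python
from llama_cpp import Llama

# Load the GGUF file and run a tiny completion to confirm it works on CPU.
llm = Llama(
    model_path="/models/mistral-7b-instruct-q4.gguf",
    n_ctx=2048,    # mirrors LLM_CONTEXT_TOKENS in the configuration below
    n_threads=4,   # mirrors LLM_N_THREADS
    verbose=False,
)
out = llm("Reply with OK.", max_tokens=4, temperature=0.1)
print(out["choices"][0]["text"])
```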
Create a `.env` file in the project root:

```ini
# Server
HOST=0.0.0.0
PORT=8000
DEBUG=false
# File handling
UPLOAD_DIR=/tmp/idp_uploads
MAX_UPLOAD_SIZE_MB=50
# OCR Configuration
OCR_ENGINE=tesseract # or paddleocr
OCR_LANGUAGE=eng
TESSERACT_PATH=/usr/bin/tesseract # Optional
# LLM Configuration
LLM_MODEL_PATH=/models/mistral-7b-instruct-q4.gguf
LLM_CONTEXT_TOKENS=2048
LLM_MAX_TOKENS=512
LLM_TEMPERATURE=0.1
LLM_TOP_P=0.95
LLM_N_THREADS=4 # Auto-detected if not set
# Processing Pipeline
ENABLE_OCR=true
ENABLE_LAYOUT_ANALYSIS=true
ENABLE_LLM_EXTRACTION=true
ENABLE_VALIDATION=true
# Logging
LOG_LEVEL=INFO
LOG_SUPPRESS_SENSITIVE_DATA=true
```

Start the server:

```bash
python -m uvicorn backend.app.main:app --host 0.0.0.0 --port 8000
```

The server will start at http://localhost:8000.
Open browser: http://localhost:8000/ui
Upload a document:

```bash
curl -X POST http://localhost:8000/api/v1/documents/upload \
-F "file=@invoice.pdf" \
-F "document_type=invoice"Response:
{
"document_id": "doc_abc123xyz456",
"filename": "invoice.pdf",
"document_type": "invoice",
"size_bytes": 245600,
"upload_timestamp": "2024-01-11T10:30:00",
"status": "pending"
}
```

Check processing status:

```bash
curl http://localhost:8000/api/v1/documents/status/doc_abc123xyz456
```

Correct extracted fields:

```bash
curl -X POST http://localhost:8000/api/v1/documents/correct/doc_abc123xyz456 \
-H "Content-Type: application/json" \
-d '{
"corrections": {
"invoice_number": "INV-2024-001",
"total_amount": 5250.00
},
"notes": "Fixed OCR errors"
  }'
```

Export results:

```bash
curl -X POST http://localhost:8000/api/v1/documents/export/doc_abc123xyz456 \
-H "Content-Type: application/json" \
-d '{"format": "json", "include_ocr": true}' \
  > results.json
```

Health check:

```bash
curl http://localhost:8000/api/v1/documents/health
```

**Processing Pipeline**

Each document passes through the following stages:

**1. Upload**
- Validate file type and size
- Generate unique document ID
- Store in upload directory
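For scripted use, the endpoints shown in the curl examples above can be driven from Python; a minimal client sketch using `requests`:

```python
import requests

BASE = "http://localhost:8000/api/v1/documents"

# Mirrors the curl upload example above
with open("invoice.pdf", "rb") as f:
    resp = requests.post(
        f"{BASE}/upload",
        files={"file": f},
        data={"document_type": "invoice"},
    )
resp.raise_for_status()
doc_id = resp.json()["document_id"]

# Check processing status for the new document
status = requests.get(f"{BASE}/status/{doc_id}").json()
print(doc_id, status)
```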
**2. Preprocessing**
- Deskew image
- Denoise
- Binarize
- Optimize for OCR
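This stage maps directly onto OpenCV; a minimal sketch (parameter values are illustrative, not the project's defaults):

```python
import cv2
import numpy as np

def preprocess_for_ocr(path: str) -> np.ndarray:
    """Grayscale -> denoise -> binarize -> deskew, as outlined above."""
    img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)

    # Denoise: non-local means works well on scanned text
    img = cv2.fastNlMeansDenoising(img, None, 10)

    # Binarize with Otsu's threshold; invert so text pixels are foreground
    _, bw = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)

    # Estimate skew from the minimum-area rectangle around all text pixels
    coords = np.column_stack(np.where(bw > 0)).astype(np.float32)
    angle = cv2.minAreaRect(coords)[-1]
    if angle > 45:  # OpenCV >= 4.5 reports angles in [0, 90)
        angle -= 90

    # Rotate the page back to horizontal
    h, w = bw.shape
    m = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
    return cv2.warpAffine(bw, m, (w, h), flags=cv2.INTER_CUBIC,
                          borderMode=cv2.BORDER_REPLICATE)
```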
**3. OCR**
- Extract text and bounding boxes
- Use Tesseract or PaddleOCR
- Filter by confidence threshold
- Generate text blocks
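With Tesseract as the configured engine, this stage corresponds to pytesseract's `image_to_data` call; a sketch with a hypothetical confidence threshold:

```python
import pytesseract
from pytesseract import Output

MIN_CONFIDENCE = 60  # hypothetical threshold; tune per document type

def ocr_blocks(image) -> list[dict]:
    """Extract words with bounding boxes, filtered by confidence."""
    data = pytesseract.image_to_data(image, lang="eng", output_type=Output.DICT)
    blocks = []
    for i, text in enumerate(data["text"]):
        conf = float(data["conf"][i])  # str in older pytesseract, int in newer
        if text.strip() and conf >= MIN_CONFIDENCE:
            blocks.append({
                "text": text,
                "conf": conf,
                "box": (data["left"][i], data["top"][i],
                        data["width"][i], data["height"][i]),
            })
    return blocks
```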
**4. Layout Analysis**
- Group text blocks by spatial proximity
- Detect tables
- Extract key-value pairs
- Identify document structure
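The grouping is rule-based rather than learned; a simple sketch that merges OCR word blocks into lines by vertical proximity (the tolerance is illustrative):

```python
def group_into_lines(blocks: list[dict], y_tol: int = 8) -> list[list[dict]]:
    """Group word blocks into lines when their top edges are within y_tol px."""
    lines: list[list[dict]] = []
    # box = (left, top, width, height), as produced by the OCR stage
    for block in sorted(blocks, key=lambda b: (b["box"][1], b["box"][0])):
        if lines and abs(block["box"][1] - lines[-1][0]["box"][1]) <= y_tol:
            lines[-1].append(block)  # close enough vertically: same line
        else:
            lines.append([block])    # otherwise start a new line
    return lines
```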
**5. LLM Extraction**
- Use quantized model for inference
- Deterministic extraction (temperature=0.1)
- Strict JSON schema validation
- No-hallucination mode (extraction is restricted to text found in the document)
- No inferred values
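A sketch of how this stage can be wired up with llama-cpp-python and jsonschema; the prompt and schema below are illustrative stand-ins for the project's `config/prompts.py` and `config/schemas.json`:

```python
import json

import jsonschema
from llama_cpp import Llama

SCHEMA = {  # illustrative; the real schemas live in config/schemas.json
    "type": "object",
    "required": ["invoice_number", "total_amount"],
    "properties": {
        "invoice_number": {"type": "string"},
        "total_amount": {"type": "number"},
    },
}

llm = Llama(model_path="/models/mistral-7b-instruct-q4.gguf",
            n_ctx=2048, n_threads=4, verbose=False)

def extract(document_text: str) -> dict:
    prompt = (
        "Extract the invoice fields as JSON. Use only values present in the "
        "text; do not infer or invent anything.\n\n"
        f"{document_text}\n\nJSON:"
    )
    # Low temperature keeps the output (near-)deterministic
    out = llm(prompt, max_tokens=512, temperature=0.1, top_p=0.95)
    data = json.loads(out["choices"][0]["text"])
    jsonschema.validate(data, SCHEMA)  # reject anything off-schema
    return data
```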
**6. Validation**
- Check against business rules
- Verify data consistency
- Confidence scoring
- Detect anomalies
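Business rules are plain code; for example, a hypothetical consistency check that each line item's total matches quantity times unit price:

```python
def check_line_items(extracted: dict, tolerance: float = 0.01) -> list[str]:
    """Return a list of consistency issues found in the extracted items."""
    issues = []
    for i, item in enumerate(extracted.get("items", [])):
        expected = item.get("quantity", 0) * item.get("unit_price", 0.0)
        if abs(expected - item.get("total", 0.0)) > tolerance:
            issues.append(f"item {i}: expected {expected}, got {item.get('total')}")
    return issues
```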
**7. Export**
- JSON format
- CSV format
- Include OCR and layout if requested
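Flattening the nested result is the only non-trivial part of CSV export; a sketch using the standard library (field names follow the invoice example below):

```python
import csv

def export_items_csv(extracted: dict, path: str) -> None:
    """Write invoice line items to CSV, repeating the invoice number per row."""
    fieldnames = ["invoice_number", "description", "quantity",
                  "unit_price", "total"]
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        for item in extracted.get("items", []):
            writer.writerow({"invoice_number": extracted.get("invoice_number"),
                             **item})
```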
Example invoice extraction:

```json
{
"invoice_number": "INV-2024-001",
"invoice_date": "2024-01-08",
"due_date": "2024-02-08",
"vendor_name": "ACME Corporation",
"total_amount": 5250.00,
"currency": "USD",
"items": [
{
"description": "Professional Services",
"quantity": 1,
"unit_price": 5000.00,
"total": 5000.00
}
]
}
```

Example receipt extraction:

```json
{
"receipt_number": "REC-2024-001",
"date_time": "2024-01-11T14:30:00",
"merchant_name": "Coffee Shop ABC",
"items": [
{"item_name": "Espresso", "quantity": 1, "price": 2.50}
],
"total": 2.50,
"payment_method": "card"
}
```

Project structure:

```text
IDPCPU/
├── backend/
│   └── app/
│       ├── main.py            # FastAPI app
│       ├── services.py        # Business logic
│       ├── models/
│       │   └── schemas.py     # Pydantic models
│       └── routers/
│           └── documents.py   # API routes
├── inference/
│   ├── ocr/                   # OCR engines
│   ├── layout/                # Layout analysis
│   └── llm/                   # LLM inference
├── ui/
│   ├── templates/
│   │   └── index.html         # Web UI
│   └── static/                # CSS, JS
├── config/
│   ├── settings.py            # Configuration
│   ├── schemas.json           # JSON schemas
│   └── prompts.py             # LLM prompts
└── examples/
    └── example_usage.py       # Usage examples
```
Run tests:

```bash
pytest tests/ -v
```

Code style:

- Format: Black
- Lint: Flake8
- Type hints: Enabled
- Docstrings: Google style
Troubleshooting:

**OCR not working**
- Verify Tesseract is installed: `which tesseract`
- Check `TESSERACT_PATH` in `.env`
- Ensure the OCR engine is enabled: `ENABLE_OCR=true`

**LLM model not loading**
- Verify the model file exists at `LLM_MODEL_PATH`
- Check the file format (must be GGUF)
- Ensure sufficient disk space
- Try reducing context tokens if you hit OOM errors

**Slow inference**
- Increase `LLM_N_THREADS` up to the CPU core count
- Use more aggressive quantization (Q4 vs Q8)
- Reduce `LLM_MAX_TOKENS`
- Enable `DESKEW_ENABLED=true` for preprocessing

**High memory usage**
- Reduce `LLM_CONTEXT_TOKENS`
- Use Q4_K_M quantized models
- Process smaller batches
- Monitor with `top` or `htop`
Typical processing times (single document, 4-core CPU):
| Stage | Time | Notes |
|---|---|---|
| Upload | <1s | |
| Preprocessing | 2-5s | Image optimization |
| OCR | 5-30s | Depends on image quality |
| Layout Analysis | <1s | Deterministic |
| LLM Extraction | 10-60s | Depends on model and output length |
| Total | 20-120s | Typical: 30-60s |
Batch processing scales linearly. Use background tasks for better UX.
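The background-task pattern here is FastAPI's built-in `BackgroundTasks`; a sketch with a hypothetical processing route:

```python
from fastapi import BackgroundTasks, FastAPI

app = FastAPI()

def process_document(document_id: str) -> None:
    # Placeholder for the 20-120s pipeline described above
    ...

@app.post("/api/v1/documents/process/{document_id}")
async def process(document_id: str, background_tasks: BackgroundTasks):
    # Queue the heavy work; the HTTP response returns immediately
    background_tasks.add_task(process_document, document_id)
    return {"document_id": document_id, "status": "pending"}
```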
✅ No External Connections
- Verify with network monitoring
- Use in air-gapped networks
✅ Data Handling
- Models kept in the `/models` directory
- Uploads in `/tmp/idp_uploads` (configurable)
- No data logging by default
- Optional sensitive data masking
✅ Access Control
- Add authentication layer for multi-user deployment (see the sketch after this list)
- Use firewall rules to restrict API access
- Run behind reverse proxy (nginx, etc.)
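One lightweight way to add that authentication layer is a FastAPI API-key dependency; a sketch (header name and key source are illustrative):

```python
import os
import secrets

from fastapi import Depends, FastAPI, HTTPException, Security
from fastapi.security import APIKeyHeader

api_key_header = APIKeyHeader(name="X-API-Key", auto_error=False)

def require_api_key(key: str | None = Security(api_key_header)) -> None:
    # Compare in constant time against a key provisioned via the environment
    expected = os.environ.get("IDP_API_KEY", "")
    if not key or not secrets.compare_digest(key, expected):
        raise HTTPException(status_code=401, detail="Invalid or missing API key")

# Apply the dependency to every route
app = FastAPI(dependencies=[Depends(require_api_key)])
```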
Docker deployment:

```bash
docker build -t idp-system .
docker run -p 8000:8000 \
  -v /models:/models \
  -v /data:/tmp/idp_uploads \
  idp-system
```

Production checklist:

- Set `DEBUG=false`
- Configure reverse proxy (nginx)
- Set up authentication
- Configure logging and monitoring
- Set upload size limits
- Use production-grade ASGI server
- Monitor system resources
- Regular model updates
- Backup configuration
Contributions welcome! Please ensure:
- Code follows Black formatting
- Type hints are present
- Docstrings are comprehensive
- Tests pass
- No external dependencies added without approval
This system uses:
- Tesseract OCR (Apache 2.0)
- PaddleOCR (Apache 2.0)
- llama.cpp (MIT)
- FastAPI (MIT)
- OpenCV (Apache 2.0)
Ensure compliance with all open-source licenses.
For issues and questions:
- Check troubleshooting section
- Review logs with `LOG_LEVEL=DEBUG`
- Run the health check: `/api/v1/documents/health`
- Verify configuration: check the `.env` file
Planned improvements:

- Database support for job tracking
- Advanced table extraction
- Handwriting recognition
- Improved multi-language support
- Quantized model optimization
- Web UI enhancements
- REST API v2 with batch operations
- WebSocket for real-time updates
Last Updated: 2024-01-11
Version: 1.0.0