AI-powered invoice processing system using Donut (Document Understanding Transformer) for extracting structured data from invoice documents (PDFs and images).
Cloudx Invoice AI is an end-to-end solution for training and deploying AI models that extract text and structured information from invoices. Built on the Donut architecture, it provides:
- Document understanding without OCR
- Support for millions of invoice documents
- REST API for invoice processing
- High-accuracy text extraction
- Structured JSON output
- Multi-format Support: Process PDFs and images (PNG, JPG, TIFF, etc.)
- Transformer-based: Uses state-of-the-art Donut architecture
- Scalable: Handles millions of training samples
- Production-ready: FastAPI server with Docker support
- Comprehensive: Training, evaluation, and inference pipelines
- Flexible: Configurable for different invoice formats
Cloudx Invoice AI/
├── src/
│ ├── data/
│ │ ├── preprocessor.py # Data preprocessing
│ │ └── dataset.py # Dataset loaders
│ ├── training/
│ │ └── trainer.py # Training module
│ ├── evaluation/
│ │ └── metrics.py # Evaluation metrics
│ ├── api/
│ │ └── app.py # FastAPI server
│ └── utils/ # Utilities
├── configs/
│ └── train_config.yaml # Training configuration
├── data/
│ ├── raw/ # Raw invoice files
│ ├── processed/ # Processed data
│ ├── train/ # Training split
│ ├── val/ # Validation split
│ └── test/ # Test split
├── models/
│ └── checkpoints/ # Model checkpoints
├── logs/ # Training logs
├── donut_base/ # Donut base repository
├── train.py # Training script
├── evaluate.py # Evaluation script
├── run_api.py # API server runner
├── Dockerfile # Docker configuration
├── docker-compose.yml # Docker Compose config
├── requirements.txt # Python dependencies
└── README.md # This file
- Python 3.8+
- PyTorch 1.11+
- CUDA 11.x (for GPU training)
- Docker (optional)
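The Python and PyTorch minimums listed above can be checked with a short script before installing anything else. This is a minimal sketch: `check_environment` and `parse_version` are hypothetical helpers, not part of this repository, and the script does not check CUDA or Docker.

```python
import sys
from importlib import metadata


def parse_version(text):
    """Turn a version string such as '2.1.0+cu118' into (2, 1) for a coarse comparison."""
    parts = text.split("+")[0].split(".")
    return tuple(int(p) for p in parts[:2])


def check_environment(min_python=(3, 8), min_torch=(1, 11)):
    """Hypothetical helper: report whether Python and PyTorch meet the listed minimums."""
    report = {"python_ok": sys.version_info[:2] >= min_python}
    try:
        report["torch_ok"] = parse_version(metadata.version("torch")) >= min_torch
    except metadata.PackageNotFoundError:
        # PyTorch is not installed (or not visible in this environment).
        report["torch_ok"] = False
    return report


if __name__ == "__main__":
    print(check_environment())
```

A `False` value for `torch_ok` only means that no suitable PyTorch distribution was found in the active environment.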
- Clone the repository:
git clone <repository-url>
cd Donut
- Create virtual environment:
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
- Install dependencies:
pip install -r requirements.txt
- Install PyTorch (adjust for your CUDA version):
# CPU version
pip install torch torchvision --index-url https://download.pytorch.org/whl/cpu
# GPU version (CUDA 11.8)
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu118
- Install package:
pip install -e .
# Build images
docker-compose build
# For GPU support
docker-compose --profile gpu build training-gpu
Place your invoice files and ground truth data in the data/raw/ directory.
from src.data.preprocessor import InvoicePreprocessor
# Initialize preprocessor
preprocessor = InvoicePreprocessor()
# Process dataset ("..." marks elided ground-truth fields; replace with real values)
metadata_file = preprocessor.process_dataset(
    document_paths=["path/to/invoice1.pdf", "path/to/invoice2.pdf"],
    ground_truths=[{"invoice_number": "INV-001", ...}, {...}],
    output_dir="data/processed"
)
# Local training
python train.py --config configs/train_config.yaml
# Docker training
docker-compose up training
# GPU training
docker-compose --profile gpu up training-gpu
python evaluate.py --checkpoint models/checkpoints/best_model.ckpt
# Local
python run_api.py --checkpoint models/checkpoints/best_model.ckpt
# Docker
docker-compose up api
Basic training:
python train.py --config configs/train_config.yaml
With custom parameters:
python train.py \
--config configs/train_config.yaml \
--gpus 2 \
--batch_size 8 \
--epochs 50 \
--learning_rate 5e-5
Resume from checkpoint:
python train.py \
--config configs/train_config.yaml \
--resume models/checkpoints/last.ckpt
python evaluate.py \
--config configs/train_config.yaml \
--checkpoint models/checkpoints/best_model.ckpt \
--test_data data/processed/test_metadata.jsonl \
--output results/predictions.jsonl
Start the server:
python run_api.py --checkpoint models/checkpoints/best_model.ckpt --port 8000
Process single invoice:
curl -X POST "http://localhost:8000/api/v1/extract-invoice" \
-H "Content-Type: multipart/form-data" \
-F "file=@invoice.pdf"
Response:
{
  "status": "success",
  "invoice_data": {
    "invoice_number": "INV-001",
    "invoice_date": "2024-01-15",
    "vendor_name": "Acme Corp",
    "total": "1500.00",
    ...
  }
}
Process multiple invoices:
curl -X POST "http://localhost:8000/api/v1/extract-invoice-batch" \
-F "files=@invoice1.pdf" \
-F "files=@invoice2.pdf"
Health check:
curl http://localhost:8000/health
Edit configs/train_config.yaml to customize:
- Model settings: Image size, max length, pretrained model
- Training parameters: Batch size, learning rate, epochs
- Data paths: Training/validation/test datasets
- Hardware: GPU count, precision, strategy
- Invoice fields: Fields to extract
Example configuration:
model:
  pretrained_model: "naver-clova-ix/donut-base"
  image_size: [1280, 960]
  max_length: 768
training:
  batch_size: 4
  learning_rate: 3e-5
  max_epochs: 30
task:
  fields:
    - invoice_number
    - invoice_date
    - vendor_name
    - total
Invoice files in data/raw/:
- PDFs: .pdf
- Images: .png, .jpg, .jpeg, .tiff, .bmp
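The extension list above can be turned into a small file-discovery helper. The sketch below is illustrative only: `find_invoice_files` and `SUPPORTED_EXTENSIONS` are hypothetical names, not part of this repository, and the recursive search assumes that invoices may sit in subdirectories of data/raw/.

```python
from pathlib import Path

# Extensions taken from the supported-format list above.
SUPPORTED_EXTENSIONS = {".pdf", ".png", ".jpg", ".jpeg", ".tiff", ".bmp"}


def find_invoice_files(raw_dir="data/raw"):
    """Return supported invoice files under raw_dir, sorted for a stable order."""
    root = Path(raw_dir)
    return sorted(
        path
        for path in root.rglob("*")
        # Compare lowercased suffixes so that "SCAN.PDF" is also picked up.
        if path.is_file() and path.suffix.lower() in SUPPORTED_EXTENSIONS
    )


if __name__ == "__main__":
    for path in find_invoice_files():
        print(path)
```

The resulting paths could then be converted to strings and passed as `document_paths` to `process_dataset`, assuming that method accepts plain path strings as in the preprocessing example.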
Ground truth JSON format:
{
  "invoice_number": "INV-001",
  "invoice_date": "2024-01-15",
  "vendor_name": "Acme Corp",
  "vendor_address": "123 Main St",
  "total": "1500.00",
  "currency": "USD"
}
Metadata JSONL format (one sample per line):
{"image_path": "data/processed/images/invoice_001_page0.png", "ground_truth": {...}, "original_path": "data/raw/invoice_001.pdf", "page_number": 0}
Root endpoint with API information
Health check endpoint
Returns:
{
  "status": "healthy",
  "model_loaded": true,
  "device": "cuda"
}
Extract data from single invoice
Request:
file: Invoice file (multipart/form-data)
Response:
{
  "status": "success",
  "invoice_data": {...},
  "confidence": null
}
Extract data from multiple invoices
Request:
files: List of invoice files
Response:
{
  "results": [
    {
      "filename": "invoice1.pdf",
      "status": "success",
      "invoice_data": {...}
    }
  ]
}
# Build all services
docker-compose build
# Run training
docker-compose up training
# Run API server
docker-compose up api
# Run with GPU
docker-compose --profile gpu up training-gpu
# Stop all services
docker-compose down
# View logs
docker-compose logs -f api
Expected metrics on well-formatted invoices:
- Exact Match Accuracy: 85-95%
- Field Accuracy: 90-98%
- Inference Speed: 1-2 seconds per invoice (CPU), 0.3-0.5s (GPU)
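The two accuracy figures above are not defined in this document. One common reading is that exact match counts an invoice as correct only when every field matches, while field accuracy counts individual fields. The sketch below follows that reading. It assumes whitespace-stripped string equality, the function names are hypothetical, and the sample values are made up for illustration; the actual logic in src/evaluation/metrics.py may differ.

```python
def _same(predicted, expected):
    """Assumed matching rule: string equality after stripping whitespace."""
    return str(predicted).strip() == str(expected).strip()


def field_accuracy(predictions, ground_truths):
    """Fraction of ground-truth fields whose predicted value matches."""
    correct = total = 0
    for pred, truth in zip(predictions, ground_truths):
        for field, expected in truth.items():
            total += 1
            correct += _same(pred.get(field, ""), expected)
    return correct / total if total else 0.0


def exact_match_accuracy(predictions, ground_truths):
    """Fraction of invoices for which every ground-truth field matches."""
    if not ground_truths:
        return 0.0
    matches = sum(
        all(_same(pred.get(field, ""), expected) for field, expected in truth.items())
        for pred, truth in zip(predictions, ground_truths)
    )
    return matches / len(ground_truths)


# Illustrative data only: the second invoice has one wrong field.
truths = [
    {"invoice_number": "INV-001", "total": "1500.00"},
    {"invoice_number": "INV-002", "total": "200.00"},
]
preds = [
    {"invoice_number": "INV-001", "total": "1500.00"},
    {"invoice_number": "INV-002", "total": "250.00"},
]
print(field_accuracy(preds, truths))        # 0.75
print(exact_match_accuracy(preds, truths))  # 0.5
```

Under this definition field accuracy is never lower than exact match accuracy when every invoice has the same number of fields, which is consistent with the ranges quoted above.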
Reduce batch size in config:
training:
  batch_size: 2
  accumulate_grad_batches: 4
Use mixed precision:
hardware:
  precision: 16
Check checkpoint path:
ls -l models/checkpoints/
Run tests:
pytest tests/
Format code:
black src/ train.py evaluate.py run_api.py
Lint code:
flake8 src/
Cloudx internal project. Contact the AI team for contributions.
Proprietary - Cloudx
Built on Donut by Clova AI Research.
For issues and questions, contact:
- Email: ai-team@cloudx.com
- Internal Slack: #cloudx-invoice-ai