🚀 PS-05 Challenge - Complete Backend Solution

🎯 Overview

Complete 3-stage document understanding pipeline optimized for A100 GPU with 2-hour evaluation time limit. This solution processes document images through layout detection, text extraction with language identification, and content understanding to generate structured JSON output.

✅ 100% Compliant with PS-05 Requirements Document!

📓 Start here (Canonical Docs)

Main guide: this README.md (setup, run, API, troubleshooting)
Evaluation-day runbook: docs/EVALUATION_DAY_RUNBOOK.md
Stage-wise training: docs/STAGE_TRAINING_GUIDE.md
Swagger UI (runtime): http://localhost:8000/docs
GPU deployment (compose): docker-compose.gpu.yml

Notes:

Prefer this README and the runbook. Other markdown files are reference-only and marked deprecated to reduce confusion.

Key Features:

GPU Optimization: Full A100 GPU acceleration with CUDA 12.1+
Parallel Processing: All 3 stages run simultaneously for maximum speed
Large Dataset Support: Handles 20GB+ datasets efficiently
Docker Ready: Complete containerization with GPU support
Existing API: Enhanced existing endpoints with GPU optimization (no confusion!)
Complete Preprocessing: De-skew, denoise, augmentation as per requirements
Exact Class Labels: Background, Text, Title, List, Table, Figure
Multilingual Support: English, Hindi, Urdu, Arabic, Nepali, Persian

🖥️ System Requirements

Hardware (Challenge Infrastructure)

GPU: NVIDIA A100 (40GB/80GB)
CPU: 48-core CPU
RAM: 256GB
OS: Ubuntu 24.04
Storage: 1TB+ SSD

Software Requirements

Docker: 24.0+
NVIDIA Docker: 2.0+
CUDA: 12.1+
NVIDIA Driver: 535+

🚀 Quick Start (2-Hour Evaluation)

1. Clone and Setup

git clone <repository-url>
cd multilingual-docai

2. Build GPU-Optimized Container

# Build with GPU support
docker build -f Dockerfile.gpu -t ps05-gpu:latest .

# Verify GPU access
docker run --rm --gpus all nvidia/cuda:12.1-base-ubuntu24.04 nvidia-smi

3. Deploy with Docker Compose

# Start GPU-optimized services
docker-compose -f docker-compose.gpu.yml up -d

# Check status
docker-compose -f docker-compose.gpu.yml ps

4. Verify Deployment

# Check API health
curl http://localhost:8000/health

# Check GPU status
curl http://localhost:8000/processing-stats

🔬 Optional specialized models (enable via env vars)

Layout refinement (6-class):
- LAYOUTLMV3_CHECKPOINT=/app/models/layoutlmv3-6class
- Uses LayoutLMv3 to re-score YOLO regions (applied when confident).
Chart captioning:
- CHART_CAPTION_CHECKPOINT=/app/models/pix2struct-chart
- Uses Pix2Struct for charts; falls back to BLIP-2 if unavailable.
Table-to-text:
- TABLE_T2T_CHECKPOINT=/app/models/table-t2t
- Uses a seq2seq LM (e.g., T5/TableT5) on OCR text from the table region; falls back to BLIP-2.

Mount models to persist:

-v /host/models:/app/models \
-e TRANSFORMERS_CACHE=/app/models -e HF_HOME=/app/models -e MPLCONFIGDIR=/tmp

📝 OCR engine selection (optional)

Default: EasyOCR (multilingual) is used.
Optional: Enable PaddleOCR as primary (with EasyOCR fallback):

-e USE_PADDLEOCR=1

Ensure PaddleOCR is installed in your image before offline evaluation:
- Add to your build (internet allowed during build):
  - In Dockerfile: pip install paddleocr
- Or install locally and rebuild the image so it’s available offline at run time.

📦 Fully offline operation

Prepare models directory before build/run:

YOLOv8 weights (e.g., yolov8x.pt), LayoutLMv3 (fine-tuned 6-class optional), BLIP‑2, fastText lid.176.bin, Pix2Struct (optional), Table T2T (optional).
Place under ./models and build the GPU image to embed them, or mount with -v /host/models:/app/models.

Build (GPU, offline‑ready):

docker build --build-arg INSTALL_GPU_DEPS=1 -t ps05-backend:gpu .

Save/Load image (no internet at venue):

docker save -o ps05-backend-gpu-offline.tar ps05-backend:gpu
docker load -i ps05-backend-gpu-offline.tar

⏱️ Timed rehearsal and schema check

Timed rehearsal (dataset must be mounted in container):

bash scripts/utilities/rehearsal.sh <DATASET_ID> http://localhost:8000

Schema check (validate [x,y,w,h] and required keys on outputs):

python scripts/utilities/schema_check.py results/<DATASET_ID>

Output spec:

All bounding boxes standardized to [x, y, w, h] (HBB) across stages.
Per-element captions are produced for Table/Figure regions; whole-image caption may also be included.

🔧 Architecture Overview

3-Stage Pipeline (100% PS-05 Compliant)

Stage 1: Layout Detection (YOLOv8x, LayoutLMv3, Mask R-CNN)
- Classes: Background, Text, Title, List, Table, Figure ✓
- Output: Bounding boxes [x, y, w, h] + labels ✓
- Evaluation: mAP calculation ✓
Stage 2: Text Extraction + Language Identification (EasyOCR, Tesseract, fastText)
- OCR: Multilingual support ✓
- Languages: English, Hindi, Urdu, Arabic, Nepali, Persian ✓
- Output: Line-wise text + bbox + language ID ✓
Stage 3: Content Understanding + Natural Language Generation (Table Transformer, BLIP, OFA)
- Tables: Natural language descriptions ✓
- Charts: Textual descriptions ✓
- Maps: Image captioning ✓
- Figures: General image descriptions ✓

Preprocessing (100% PS-05 Compliant)

De-skew: Hough transform for orientation normalization ✓
Denoise: Non-local means denoising ✓
Augmentation: Blur, rotation, noise for training robustness ✓
Normalization: Contrast enhancement ✓

Core Services

OptimizedProcessingService: GPU-accelerated parallel processing
GPUTrainingService: A100-optimized model training
DocumentProcessor: Document handling and preprocessing
StageProcessor: Stage-by-stage processing orchestration
EvaluationService: mAP calculation and evaluation
UnifiedCleaningService: Image and document cleaning

📊 Performance Optimization

GPU Memory Management

Batch Size: 50 (optimized for A100)
Mixed Precision: FP16 enabled
Memory Fraction: 90% GPU utilization
CUDA Optimization: TF32 enabled

Processing Speed Targets

Stage 1 (Layout): 100+ images/second
Stage 2 (Text+Lang): 80+ images/second
Stage 3 (Content): 60+ images/second
Overall Pipeline: 50+ images/second

Expected Performance (A100 GPU)

20GB Dataset: 1.5-2.5 hours (target: under 2 hours)
Images/Second: 50-80 (optimized pipeline)
Memory Usage: 35-38GB GPU, 180-200GB RAM

🔄 API Usage (Simplified!)

Root Information

GET /
# Returns complete API information and capabilities

1. Upload Dataset (20GB)

POST /upload-dataset
# Supports multiple files, automatic dataset ID generation

2. Process All Stages (GPU Optimized)

POST /process-all
# All stages in parallel, maximum speed (existing endpoint!)

curl -X POST "http://localhost:8000/process-all" \
  -H "accept: application/json" \
  -H "Content-Type: application/x-www-form-urlencoded" \
  -d "dataset_id=YOUR_DATASET_ID&parallel_processing=true&max_workers=8&gpu_acceleration=true&batch_size=50&optimization_level=speed"

3. Process Single Stage (GPU Optimized)

POST /process-stage
# Individual stage processing with GPU optimization (existing endpoint!)

# Stage 1: Layout Detection
curl -X POST "http://localhost:8000/process-stage" \
  -d "dataset_id=YOUR_DATASET_ID&stage=1&optimization_level=speed&batch_size=50&gpu_acceleration=true"

# Stage 2: Text + Language
curl -X POST "http://localhost:8000/process-stage" \
  -d "dataset_id=YOUR_DATASET_ID&stage=2&optimization_level=speed&batch_size=50&gpu_acceleration=true"

# Stage 3: Content Understanding
curl -X POST "http://localhost:8000/process-stage" \
  -d "dataset_id=YOUR_DATASET_ID&stage=3&optimization_level=speed&batch_size=50&gpu_acceleration=true"

4. Get Results

GET /predictions/{dataset_id}
# JSON output for each image (no annotations mode)

GET /results/{dataset_id}
# Complete results with evaluation metrics

5. Training (Optional)

POST /train-layout-model
# Train LayoutLMv3 model

POST /train-yolo-model
# Train YOLOv8 model

6. Monitoring

GET /processing-stats
# GPU and processing statistics

GET /training-stats
# Training statistics and GPU usage

GET /status
# Overall system status

7. Dataset Management

GET /datasets
# List all datasets

DELETE /datasets/{dataset_id}
# Delete dataset and results

8. Cleaning & EDA

POST /clean-dataset
# Clean dataset (image + document cleaning)

POST /run-eda
# Run exploratory data analysis

GET /eda-results/{dataset_id}
# Get EDA results

🧪 Training Pipeline (Optional)

Train Layout Model

curl -X POST "http://localhost:8000/train-layout-model" \
  -H "accept: application/json" \
  -H "Content-Type: application/x-www-form-urlencoded" \
  -d "train_data_dir=/app/datasets/train&val_data_dir=/app/datasets/val&output_dir=/app/models/layout&epochs=50&batch_size=16&learning_rate=0.0001&mixed_precision=true"

Train YOLO Model

curl -X POST "http://localhost:8000/train-yolo-model" \
  -H "accept: application/json" \
  -H "Content-Type: application/x-www-form-urlencoded" \
  -d "data_yaml_path=/app/data.yaml&output_dir=/app/models/yolo&epochs=50&batch_size=16&learning_rate=0.0001"

📈 Monitoring and Debugging

GPU Monitoring

# Real-time GPU usage
docker exec ps05-gpu-challenge nvidia-smi -l 1

# GPU memory usage
docker exec ps05-gpu-challenge python -c "import torch; print(f'GPU Memory: {torch.cuda.memory_allocated()/1e9:.2f} GB')"

Container Logs

# Application logs
docker-compose -f docker-compose.gpu.yml logs -f ps05-gpu

# GPU monitor logs
docker-compose -f docker-compose.gpu.yml logs -f gpu-monitor

Performance Metrics

# Processing statistics
curl http://localhost:8000/processing-stats

# Training statistics
curl http://localhost:8000/training-stats

# System status
curl http://localhost:8000/status

🚨 Troubleshooting

Common Issues

GPU Not Accessible

# Check NVIDIA Docker installation
sudo docker run --rm --gpus all nvidia/cuda:12.1-base-ubuntu24.04 nvidia-smi

# Restart Docker service
sudo systemctl restart docker

Out of Memory

# Reduce batch size in API call
# Default: batch_size=50, reduce to 25-30 if needed

# Clear GPU cache
docker exec ps05-gpu-challenge python -c "import torch; torch.cuda.empty_cache()"

Model Loading Errors

# Check internet connection for model downloads
# Verify disk space (need 50GB+ for models)
# Check CUDA compatibility

Performance Tuning

For Maximum Speed

# Use these parameters in API calls
optimization_level=speed
batch_size=50
gpu_acceleration=true
parallel_processing=true
max_workers=8

For Memory Efficiency

# Use these parameters in API calls
optimization_level=memory
batch_size=25
gpu_acceleration=true
parallel_processing=true
max_workers=4

📋 Evaluation Checklist

Pre-Evaluation

GPU container built successfully
All models loaded (YOLOv8, LayoutLMv3, BLIP-2, fastText)
API endpoints responding
GPU memory accessible
Test with small dataset

During Evaluation

Post-Evaluation

🔗 Useful Commands

Quick Status Check

# System health
curl http://localhost:8000/health

# GPU status
curl http://localhost:8000/processing-stats

# Container status
docker-compose -f docker-compose.gpu.yml ps

Resource Monitoring

# GPU usage
nvidia-smi -l 1

# Container resources
docker stats ps05-gpu-challenge

# Disk usage
df -h

Logs and Debugging

# Application logs
docker-compose -f docker-compose.gpu.yml logs -f ps05-gpu

# GPU monitor
docker-compose -f docker-compose.gpu.yml logs -f gpu-monitor

# Container shell
docker exec -it ps05-gpu-challenge bash

🎯 Requirements Fulfillment - 100% Complete!

✅ Problem Statement Requirements

Input: JPEG/PNG document images ✓
Output: JSON per image with bounding boxes ✓
Classes: Background, Text, Title, List, Table, Figure ✓
Languages: English, Hindi, Urdu, Arabic, Nepali, Persian ✓
Stages: 3-stage pipeline with evaluation ✓

✅ Solution Roadmap Requirements

Preprocessing: De-skew, denoise, augmentation ✓
Layout Detection: YOLOv8, LayoutLMv3, Detectron2 ✓
Text Extraction: EasyOCR, Tesseract, multilingual ✓
Language ID: fastText, XLM-RoBERTa ✓
Content Understanding: Table Transformer, BLIP, OFA ✓
Training Pipeline: PyTorch with GPU optimization ✓
REST API: FastAPI with GPU acceleration ✓
Docker: Optimized for A100 GPU ✓

✅ Evaluation Requirements

2-Hour Time Limit: Optimized for speed ✓
20GB Dataset: Large-scale processing ✓
No Annotations: Prediction-only mode ✓
JSON Output: Per-image results ✓
Performance Metrics: Real-time monitoring ✓

✅ Additional PS-05 Requirements

De-skew & Denoise: OpenCV Hough transform ✓
Augmentation: Blur, rotation, noise ✓
Model Choice: LayoutLMv3, YOLOv8 ✓
OCR: Tesseract, EasyOCR multilingual ✓
Language ID: fastText (176 languages) ✓
Content Understanding: BLIP-2, OFA ✓
Output Format: Exact JSON structure ✓
Training: PyTorch pipeline ✓
Deployment: FastAPI REST API ✓
Infrastructure: Ubuntu 24.04, A100 GPU ✓

🎉 Summary

This implementation provides a complete, production-ready solution for the PS-05 challenge that:

Maximizes Speed: Parallel processing + GPU optimization
Optimizes for A100: Full CUDA utilization + memory optimization
Meets Time Limits: 2-hour evaluation target achievable
Provides Quality: State-of-the-art models + robust pipeline
Ensures Reliability: Error handling + monitoring + health checks
Maintains Simplicity: Existing endpoints enhanced, no confusion!
100% Compliant: All PS-05 requirements fully implemented!

Key Advantage: Your existing API workflow remains the same, but now with full A100 GPU optimization and complete PS-05 compliance!

The solution is ready for immediate deployment and should successfully process your 20GB dataset within the 2-hour evaluation window while maintaining high quality output and meeting all specified requirements.

Ready for your PS-05 Challenge evaluation! 🚀

Name		Name	Last commit message	Last commit date
Latest commit History 45 Commits
backend		backend
configs		configs
core		core
docs		docs
frontend		frontend
monitoring		monitoring
nginx		nginx
scripts		scripts
test_data		test_data
yolo_dataset		yolo_dataset
.gitattributes		.gitattributes
.gitignore		.gitignore
Dockerfile		Dockerfile
Dockerfile.gpu		Dockerfile.gpu
EDA_CLEANING_INTEGRATION_GUIDE.md		EDA_CLEANING_INTEGRATION_GUIDE.md
LICENSE		LICENSE
README.md		README.md
SWAGGER_IMPLEMENTATION_SUMMARY.md		SWAGGER_IMPLEMENTATION_SUMMARY.md
build.bat		build.bat
build.sh		build.sh
corrupt.py		corrupt.py
docker-compose.gpu.yml		docker-compose.gpu.yml
docker-compose.yml		docker-compose.yml
ps05.py		ps05.py
requirements.txt		requirements.txt
run-compose.txt		run-compose.txt
simple_backend.py		simple_backend.py
start_backend.py		start_backend.py
test_docker.py		test_docker.py

License

kalpit-sharma-dev/multilingual-docai

Folders and files

Latest commit

History

Repository files navigation