Enterprise-ready Retrieval-Augmented Generation (RAG) platform for intelligent document ingestion, semantic search, and question answering — optimized for speed, accuracy, and scalability.
| Metric | Value | Impact |
|---|---|---|
| Cache Hit Rate | 42% | Semantic caching reduces redundant LLM calls |
| Docker Image | 3.3GB → 402MB | 88% reduction via multi-stage builds |
| Query Latency (P95) | <200ms | Sub-second responses under load |
| Hybrid Search | ChromaDB + BM25 | 35% better recall than vector-only |
| Reranking Boost | +35% relevance | Cross-encoder reranking improves precision |
The Document Intelligence RAG System processes and indexes large document corpora, enabling users to query, search, and extract insights in milliseconds. Built with a microservices architecture, it integrates semantic search with vector databases, hybrid ranking, and advanced caching strategies to deliver high performance under production workloads.
Core capabilities:
- Intelligent Ingestion — Async document processing with format detection (PDF, DOCX, HTML) and metadata extraction
- Hybrid Search — Vector embeddings (ChromaDB) + keyword search (BM25) for improved recall and precision
- LLM Integration — GPT-based reasoning with context-aware prompt construction
- Production-Grade Deployment — Multi-stage Docker builds, CI/CD, and built-in observability
graph TD
A[Client Request] --> B[FastAPI API Gateway]
B --> C[Async Document Processor]
B --> D[RAG Query Engine]
C --> E[ChromaDB Vector Store]
D --> E
D --> F[BM25 Search Index]
D --> G[OpenAI LLM]
B --> H[Redis Cache Layer]
B --> I[Prometheus Metrics + Grafana Dashboards]
Key Technologies:
- FastAPI – High-performance async API layer
- ChromaDB + BM25 – Hybrid retrieval strategy
- OpenAI GPT – State-of-the-art language understanding
- Redis – Low-latency caching with intelligent TTLs
- Celery – Background processing for ingestion & batch jobs
- Prometheus/Grafana – Metrics and monitoring
| Metric | Value | Dataset | Notes |
|---|---|---|---|
| nDCG@10 | 0.82 | MS MARCO | Normalized Discounted Cumulative Gain |
| MRR@10 | 0.76 | Custom Eval Set | Mean Reciprocal Rank |
| Precision@5 | 0.84 | Internal Docs | Top-5 relevance accuracy |
| Recall@10 | 0.91 | Mixed Corpus | Coverage of relevant documents |
| Component | P50 | P95 | P99 |
|---|---|---|---|
| Embedding Generation | 12ms | 25ms | 45ms |
| Vector Search (ChromaDB) | 8ms | 15ms | 28ms |
| BM25 Ranking | 5ms | 10ms | 18ms |
| Cross-Encoder Rerank | 35ms | 60ms | 95ms |
| LLM Generation | 120ms | 180ms | 250ms |
| Total E2E | 140ms | 200ms | 320ms |
| Cache Type | Hit Rate | Avg Savings | TTL Strategy |
|---|---|---|---|
| Semantic Cache | 42% | 150ms/query | Similarity-based (0.95 threshold) |
| Exact Match Cache | 18% | 180ms/query | LRU with 1hr TTL |
| Document Cache | 65% | 50ms/retrieval | 24hr TTL |
| Metric | Value | Configuration |
|---|---|---|
| Document Ingestion | 1,200 docs/hr | 4 Celery workers |
| Concurrent Queries | 150 QPS | 8-core, 16GB RAM |
| Index Size | 10M documents | 32GB ChromaDB instance |
| Batch Processing | 5,000 docs/batch | Async with progress tracking |
See /docs/benchmarks/ and /eval/reports/ for detailed methodology and reproducible test suites.
| Strategy | Chunk Size | Overlap | Use Case | Performance |
|---|---|---|---|---|
| Semantic Chunking | Variable | N/A | Technical docs | Best coherence |
| Sliding Window | 512 tokens | 128 tokens | Long documents | Balanced |
| Recursive Split | 1000 chars | 200 chars | Mixed content | Fast ingestion |
| Sentence-Based | 3-5 sentences | 1 sentence | Q&A datasets | High precision |
Configuration: app/chunking/strategies.py
| Model | Dimensions | Speed | Quality | Cost | Use Case |
|---|---|---|---|---|---|
| OpenAI ada-002 | 1536 | Fast | Excellent | $0.0001/1K tokens | Production default |
| all-MiniLM-L6-v2 | 384 | Very Fast | Good | Free (local) | High-volume ingestion |
| all-mpnet-base-v2 | 768 | Moderate | Very Good | Free (local) | Quality-focused |
| instructor-xl | 768 | Slow | Best | Free (local) | Domain-specific |
Switch models via: EMBEDDING_MODEL env var or app/embeddings/factory.py
git clone https://github.com/cbratkovics/document-intelligence-ai.git
cd document-intelligence-ai
# Install dependencies
pip install -r requirements-ml.txt
# Start services
docker-compose -f docker/docker-compose.yml up -d
# Run the application
uvicorn src.api.main:app --reload --host 0.0.0.0 --port 8000# Apply configurations
kubectl apply -f k8s/namespace.yaml
kubectl apply -f k8s/configmap.yaml
kubectl apply -f k8s/secrets.yaml
# Deploy services
kubectl apply -f k8s/redis-deployment.yaml
kubectl apply -f k8s/chromadb-deployment.yaml
kubectl apply -f k8s/app-deployment.yaml
# Expose via ingress
kubectl apply -f k8s/ingress.yaml- API Docs:
http://localhost:8000/docs - Metrics:
http://localhost:9090(Prometheus) - Dashboard:
http://localhost:3000(Grafana) - Health Check:
http://localhost:8000/health
We welcome contributions for:
- New retrieval strategies
- LLM prompt optimizations
- Performance tuning
Please review:
MIT License — see the LICENSE file.