Production-grade Retrieval-Augmented Generation for Document Intelligence
Transform 500+ page documents into instant, accurate answers with confidence scoring and source citations.
| Problem | Solution |
|---|---|
| Analysts spend 40+ hours reviewing documents | Query any document in seconds |
| Information buried in hundreds of pages | AI extracts exactly what you need |
| No way to compare across documents | Cross-document analysis built-in |
| LLMs hallucinate | Confidence scoring + source citations |
Ingested: 3 companies (Meta, Tesla, NVIDIA), 500+ pages
Ingestion time: 2.3 seconds
Query: "What are the main cybersecurity risks?"
Response: 2.4 seconds with HIGH confidence
Sources: 4 cited passages with relevance scores
- Hybrid Search - Combines semantic (dense) + keyword (sparse) search
- Cross-Encoder Reranking - Re-ranks results for precision
- Parent-Child Retrieval - Expands context automatically
- Confidence Scoring - Know when to trust the answer (high/medium/low)
- Source Validation - Minimum source requirements
- Hallucination Prevention - Won't answer without evidence
- Embedding Cache - 436x speedup on repeated content
- Query Cache - 15,000x speedup on repeated queries
- Structure-Aware Chunking - 96% noise reduction
- Cross-Company Comparison - Compare entities side-by-side
- Document Registry - Track all ingested documents
- Metadata Filtering - Filter by company, date, type
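
The guardrail features listed above (confidence scoring, minimum source requirements, refusing to answer without evidence) can be pictured as a small rule-based function. The following is a minimal sketch, not the repository's implementation: the function name `score_confidence`, the `"no_answer"` label, and the `high_cutoff` value are hypothetical, while `score_threshold=0.35` and `min_sources=2` mirror the defaults shown in this README's configuration.

```python
def score_confidence(scores, score_threshold=0.35, min_sources=2, high_cutoff=0.7):
    """Map retrieval relevance scores to a confidence label (illustrative rules)."""
    # Keep only sources whose relevance clears the threshold.
    supported = [s for s in scores if s >= score_threshold]
    if not supported:
        # No usable evidence: refuse rather than risk a hallucinated answer.
        return "no_answer"
    if len(supported) < min_sources:
        # Some evidence, but fewer sources than required.
        return "low"
    mean_score = sum(supported) / len(supported)
    # high_cutoff is an assumed value chosen for this sketch.
    return "high" if mean_score >= high_cutoff else "medium"


print(score_confidence([0.9, 0.8, 0.2]))
print(score_confidence([0.5, 0.4]))
print(score_confidence([0.1]))
```

Because the label depends only on scores and counts, the same inputs always yield the same label, which makes the guardrail straightforward to unit test.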
┌─────────────────────────────────────────────────────────────┐
│                        Client Layer                         │
│               (Streamlit UI / FastAPI / CLI)                │
└─────────────────────────────────────────────────────────────┘
                               ▼
┌─────────────────────────────────────────────────────────────┐
│                       Pipeline Layer                        │
│  ┌───────────┐     ┌───────────┐     ┌───────────────────┐  │
│  │  Loaders  │ ──▶ │ Chunkers  │ ──▶ │    Enrichment     │  │
│  │ PDF/MD/SEC│     │ Structure │     │  Entities/Topics  │  │
│  └───────────┘     └───────────┘     └───────────────────┘  │
└─────────────────────────────────────────────────────────────┘
                               ▼
┌─────────────────────────────────────────────────────────────┐
│                       Retrieval Layer                       │
│   ┌──────────┐   ┌────────┐   ┌────────┐   ┌────────────┐   │
│   │  Dense   │   │ Sparse │   │ Hybrid │   │  Reranker  │   │
│   │Embeddings│   │  BM25  │   │ Fusion │   │CrossEncoder│   │
│   └──────────┘   └────────┘   └────────┘   └────────────┘   │
└─────────────────────────────────────────────────────────────┘
                               ▼
┌─────────────────────────────────────────────────────────────┐
│                        Storage Layer                        │
│           Qdrant (Hybrid Vector Store) + Caching            │
└─────────────────────────────────────────────────────────────┘
                               ▼
┌─────────────────────────────────────────────────────────────┐
│                      Generation Layer                       │
│        LLM (Ollama/OpenAI) + Guardrails + Citations         │
└─────────────────────────────────────────────────────────────┘
See TECHNICAL_ARCHITECTURE.md for a deep dive into architectural decisions, including:
- Why RRF over Weighted Sum for hybrid search
- Deterministic confidence scoring (not LLM-based)
- NLI-based faithfulness evaluation
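
To illustrate the first decision in the list above, here is a minimal sketch of Reciprocal Rank Fusion (RRF). It is not taken from the repository: the function name `rrf_fuse`, the document IDs, and the constant `k=60` (a commonly used default) are assumptions for illustration. RRF scores each document by summing 1/(k + rank) over every ranked list it appears in, so it only needs ranks and avoids normalizing dense similarity scores and BM25 scores onto a common scale, which a weighted sum would require.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60):
    """Fuse several ranked lists of document IDs with Reciprocal Rank Fusion."""
    scores = defaultdict(float)
    for ranking in rankings:
        # Ranks are 1-based: the top hit in each list contributes 1 / (k + 1).
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by document ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


dense_hits = ["doc_a", "doc_b", "doc_c"]   # hypothetical dense (semantic) ranking
sparse_hits = ["doc_b", "doc_d", "doc_a"]  # hypothetical sparse (BM25) ranking
fused = rrf_fuse([dense_hits, sparse_hits])
print(fused)  # ['doc_b', 'doc_a', 'doc_d', 'doc_c']
```

Documents that appear in both lists (`doc_a`, `doc_b`) rise above those found by only one retriever, which is the behavior hybrid search relies on.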
We use a DeBERTa NLI model to verify that answers are grounded in the retrieved context:
| Query Type | Faithfulness | Confidence |
|---|---|---|
| Tesla manufacturing risks | 100% | HIGH |
| Meta advertising revenue | 100% | HIGH |
| NVIDIA data center | 90% | MEDIUM |
| Average | 97.5% | - |
| Metric | Score |
|---|---|
| Context Relevance | 75%+ |
| Precision@5 | 0.7+ |
| MRR | 0.8+ |
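
For readers unfamiliar with the retrieval metrics in the table above, the sketch below shows how Precision@5 and MRR are conventionally computed. The function names and sample document IDs are hypothetical, and the repository's evaluation module may define these metrics differently (for example, in how it handles fewer than k results).

```python
def precision_at_k(retrieved, relevant, k=5):
    """Fraction of the top-k retrieved IDs that are relevant (denominator is k)."""
    top_k = retrieved[:k]
    return sum(1 for doc_id in top_k if doc_id in relevant) / k


def mean_reciprocal_rank(results):
    """Average of 1/rank of the first relevant hit per query (0 if none found).

    `results` is a list of (retrieved_ids, relevant_ids) pairs, one per query.
    """
    if not results:
        return 0.0
    total = 0.0
    for retrieved, relevant in results:
        for rank, doc_id in enumerate(retrieved, start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break
    return total / len(results)


p5 = precision_at_k(["a", "b", "c", "d", "e"], {"a", "c", "e"})
mrr = mean_reciprocal_rank([(["a", "b"], {"a"}), (["x", "y"], {"y"})])
print(p5, mrr)  # 0.6 0.75
```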
| Component | Technology |
|---|---|
| Embeddings | Ollama (nomic-embed-text), OpenAI-compatible |
| Vector Store | Qdrant (hybrid dense + sparse) |
| Sparse Encoder | FastEmbed BM25 |
| LLM | Ollama (Llama 3.2), OpenAI-compatible |
| Reranking | Cross-Encoder (ms-marco-MiniLM) |
| API | FastAPI |
| UI | Streamlit |
| Infrastructure | Docker, Docker Compose |
| Testing | pytest (275+ tests) |
- Docker & Docker Compose
- 16GB+ RAM recommended
git clone https://github.com/[your-username]/rag-system.git
cd rag-system
# Start Qdrant and Ollama
docker-compose up -d
# Pull required models
docker exec rag-ollama ollama pull nomic-embed-text
docker exec rag-ollama ollama pull llama3.2

python -m venv rag-env
source rag-env/bin/activate
pip install -r requirements.txt

# Launch the Streamlit UI
streamlit run src/ui/app.py

# Start the FastAPI server (in a separate terminal)
uvicorn src.api.main:app --reload

# Query endpoint
curl -X POST http://localhost:8000/query \
-H "Content-Type: application/json" \
-d '{"question": "What are the risk factors?"}'

from src.documents import MultiDocumentPipeline
from src.embeddings import OllamaEmbeddings, CachedEmbeddings
from src.vectorstores.qdrant_hybrid_store import QdrantHybridStore
from src.retrieval import HybridRetriever
from src.generation.ollama_llm import OllamaLLM
# Initialize
embeddings = CachedEmbeddings(OllamaEmbeddings(model="nomic-embed-text"))
vectorstore = QdrantHybridStore(collection_name="my_docs", dense_dimensions=768)
retriever = HybridRetriever(embeddings=embeddings, vectorstore=vectorstore)
llm = OllamaLLM(model="llama3.2")
pipeline = MultiDocumentPipeline(
    embeddings=embeddings,
    vectorstore=vectorstore,
    retriever=retriever,
    llm=llm,
)
# Ingest documents
pipeline.ingest_directory("./documents/")
# Query
response = pipeline.query("What are the key findings?")
print(f"Answer: {response.answer}")
print(f"Confidence: {response.confidence}")
print(f"Sources: {len(response.sources)}")

# Query specific company only
response = pipeline.query(
    "What is the revenue growth?",
    filter_companies=["Tesla"],
)

# Compare across multiple companies
response = pipeline.compare_companies(
    "Compare AI strategies",
    companies=["Meta", "Tesla", "NVIDIA"],
)

rag-system/
├── src/
│   ├── api/              # FastAPI endpoints
│   ├── cache/            # Embedding & query caching
│   ├── chunkers/         # Document chunking strategies
│   ├── documents/        # Multi-document pipeline
│   ├── embeddings/       # Embedding providers
│   ├── enrichment/       # Metadata extraction
│   ├── evaluation/       # Retrieval metrics
│   ├── generation/       # LLM providers
│   ├── guardrails/       # Quality controls
│   ├── loaders/          # Document loaders
│   ├── pipeline/         # RAG orchestration
│   ├── reranking/        # Cross-encoder reranking
│   ├── retrieval/        # Search strategies
│   ├── summarization/    # Hierarchical summaries
│   ├── ui/               # Streamlit interface
│   └── vectorstores/     # Vector databases
├── tests/                # 275+ unit tests
├── config/               # YAML configuration
├── docker-compose.yml    # Infrastructure
└── requirements.txt

All settings are defined in config/rag.yaml:
# Chunking
chunking:
  strategy: structure_aware
  chunk_size: 1500

# Retrieval
retrieval:
  search_type: hybrid
  retrieval_top_k: 20
  reranking:
    enabled: true
    top_n: 5

# Guardrails
guardrails:
  score_threshold: 0.35
  min_sources: 2

# Caching
caching:
  embeddings:
    enabled: true
  queries:
    enabled: true
    ttl_seconds: 300

# Run all tests
pytest tests/ --ignore=tests/integration
# Run with coverage
pytest tests/ --cov=src --cov-report=html

| Operation | Time | Improvement |
|---|---|---|
| Ingest 500 pages | 2.3s | - |
| Query (cold) | 1.8s | - |
| Query (cached) | 0.0001s | 15,000x |
| Embedding (cold) | 1.4s | - |
| Embedding (cached) | 0.003s | 436x |
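
The cached-query row above reflects a lookup that skips retrieval and generation entirely. A minimal sketch of such a cache with time-based expiry follows. The class name `TTLQueryCache`, its methods, and the whitespace/case normalization of keys are hypothetical and not taken from the repository's cache module; the 300-second default mirrors the `ttl_seconds` value shown in the configuration. The clock is injectable so expiry can be tested without sleeping.

```python
import time


class TTLQueryCache:
    """In-memory query cache whose entries expire after ttl_seconds."""

    def __init__(self, ttl_seconds=300, clock=time.monotonic):
        self.ttl_seconds = ttl_seconds
        self._clock = clock
        self._entries = {}  # normalized query -> (expiry time, response)

    @staticmethod
    def _key(query):
        # Normalize case and whitespace so trivially different queries share a key.
        return " ".join(query.lower().split())

    def get(self, query):
        key = self._key(query)
        entry = self._entries.get(key)
        if entry is None:
            return None
        expires_at, response = entry
        if self._clock() >= expires_at:
            # Entry is stale: drop it and report a miss.
            del self._entries[key]
            return None
        return response

    def put(self, query, response):
        expires_at = self._clock() + self.ttl_seconds
        self._entries[self._key(query)] = (expires_at, response)


cache = TTLQueryCache(ttl_seconds=300)
cache.put("What are the risk factors?", "cached answer")
print(cache.get("what are the  risk factors?"))  # cached answer
```

A short TTL keeps repeated queries fast while limiting how long a stale answer can be served after new documents are ingested.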
| Metric | Before | After Optimizations |
|---|---|---|
| Faithfulness | ~30% | 97.5% |
| Hallucination Rate | ~40% | <3% |
I build production RAG systems for companies. Services include:
- Custom RAG Development - Tailored to your documents and domain
- AI Chatbot Integration - Over your internal knowledge base
- Performance Optimization - Make your existing RAG faster
- Architecture Consulting - Design review and best practices
If this helped you, consider starring the repo. It helps others find it!