simoncampos1022/Enterprise-RAG-System

🔍 Enterprise RAG System

Production-grade Retrieval-Augmented Generation for Document Intelligence

Transform 500+ page documents into instant, accurate answers with confidence scoring and source citations.


🎯 What This Does

| Problem | Solution |
|---|---|
| Analysts spend 40+ hours reviewing documents | Query any document in seconds |
| Information buried in hundreds of pages | AI extracts exactly what you need |
| No way to compare across documents | Cross-document analysis built-in |
| LLMs hallucinate | Confidence scoring + source citations |

Demo: SEC Filing Analysis

📁 Ingested: 3 companies (Meta, Tesla, NVIDIA) - 500+ pages
⏱️  Ingestion time: 2.3 seconds
🔍 Query: "What are the main cybersecurity risks?"
✅ Response: 2.4 seconds with HIGH confidence
📑 Sources: 4 cited passages with relevance scores

⚑ Key Features

🧠 Intelligent Retrieval

  • Hybrid Search - Combines semantic (dense) + keyword (sparse) search
  • Cross-Encoder Reranking - Re-ranks results for precision
  • Parent-Child Retrieval - Expands context automatically
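
As a rough illustration of parent-child retrieval, the sketch below swaps each matched child chunk for its parent section so the LLM sees the surrounding context. The `Chunk` class and `expand_to_parents` function are hypothetical names used for illustration only; the actual implementation lives in the repository's retrieval code and may differ.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Chunk:
    chunk_id: str
    text: str
    parent_id: Optional[str] = None  # None for chunks that have no parent


def expand_to_parents(hits: list, parents: dict) -> list:
    """Replace each child hit with its parent, preserving rank order.

    Several children often share one parent, so duplicates are dropped.
    A hit with no known parent is returned unchanged.
    """
    seen = set()
    expanded = []
    for hit in hits:
        target = parents.get(hit.parent_id, hit) if hit.parent_id else hit
        if target.chunk_id not in seen:
            seen.add(target.chunk_id)
            expanded.append(target)
    return expanded
```

Small child chunks keep the search precise, while the returned parents give the generator enough context to answer.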

🛡️ Production Guardrails

  • Confidence Scoring - Know when to trust the answer (high/medium/low)
  • Source Validation - Minimum source requirements
  • Hallucination Prevention - Won't answer without evidence
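
A minimal sketch of how a deterministic guardrail along these lines could work. The defaults `score_threshold=0.35` and `min_sources=2` mirror the values in config/rag.yaml; the `high_cutoff` parameter and the way the rules are combined are assumptions made for illustration, not the project's actual logic.

```python
from statistics import mean


def score_confidence(scores, score_threshold=0.35, min_sources=2, high_cutoff=0.7):
    """Return (confidence, should_answer) from retrieval relevance scores.

    No LLM is involved: the same scores always yield the same label.
    high_cutoff is a hypothetical parameter, not taken from the project config.
    """
    # Only sources above the relevance threshold count as evidence.
    supporting = [s for s in scores if s >= score_threshold]
    if len(supporting) < min_sources:
        # Too little evidence: refuse to answer rather than risk a hallucination.
        return "low", False
    if mean(supporting) >= high_cutoff:
        return "high", True
    return "medium", True
```

For example, `score_confidence([0.9, 0.1])` returns `("low", False)` because only one source clears the threshold.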

🚀 Performance Optimized

  • Embedding Cache - 436x speedup on repeated content
  • Query Cache - 15,000x speedup on repeated queries
  • Structure-Aware Chunking - 96% noise reduction
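
The embedding cache idea can be sketched as a content-hash lookup in front of the embedding model. `HashedEmbeddingCache` is a hypothetical class written for this example; the repository's `CachedEmbeddings` wrapper may work differently.

```python
import hashlib
from typing import Callable


class HashedEmbeddingCache:
    """Wrap an embedding function so identical text is embedded only once."""

    def __init__(self, embed_fn: Callable):
        self._embed_fn = embed_fn
        self._store = {}  # SHA-256 of the text -> embedding vector
        self.hits = 0
        self.misses = 0

    def embed(self, text: str):
        # Hashing the content keeps keys small and stable for repeated chunks.
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if key in self._store:
            self.hits += 1
        else:
            self.misses += 1
            self._store[key] = self._embed_fn(text)
        return self._store[key]
```

Re-ingesting an unchanged document then costs a dictionary lookup per chunk instead of a model call.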

📊 Multi-Document Analysis

  • Cross-Company Comparison - Compare entities side-by-side
  • Document Registry - Track all ingested documents
  • Metadata Filtering - Filter by company, date, type

🏗️ Architecture

┌─────────────────────────────────────────────────────────────┐
│                        Client Layer                         │
│              (Streamlit UI / FastAPI / CLI)                 │
└─────────────────────────────────────────────────────────────┘
                               │
┌─────────────────────────────────────────────────────────────┐
│                      Pipeline Layer                         │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────────────┐  │
│  │   Loaders   │→ │  Chunkers   │→ │    Enrichment       │  │
│  │ PDF/MD/SEC  │  │  Structure  │  │ Entities/Topics     │  │
│  └─────────────┘  └─────────────┘  └─────────────────────┘  │
└─────────────────────────────────────────────────────────────┘
                               │
┌─────────────────────────────────────────────────────────────┐
│                     Retrieval Layer                         │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌────────────┐   │
│  │  Dense   │  │  Sparse  │  │  Hybrid  │  │  Reranker  │   │
│  │Embeddings│  │  BM25    │  │  Fusion  │  │CrossEncoder│   │
│  └──────────┘  └──────────┘  └──────────┘  └────────────┘   │
└─────────────────────────────────────────────────────────────┘
                               │
┌─────────────────────────────────────────────────────────────┐
│                     Storage Layer                           │
│         Qdrant (Hybrid Vector Store) + Caching              │
└─────────────────────────────────────────────────────────────┘
                               │
┌─────────────────────────────────────────────────────────────┐
│                    Generation Layer                         │
│        LLM (Ollama/OpenAI) + Guardrails + Citations         │
└─────────────────────────────────────────────────────────────┘

📚 See TECHNICAL_ARCHITECTURE.md for a deep dive into the architectural decisions, including:

  • Why RRF over Weighted Sum for hybrid search
  • Deterministic confidence scoring (not LLM-based)
  • NLI-based faithfulness evaluation
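
For readers unfamiliar with Reciprocal Rank Fusion (RRF), here is a minimal sketch. RRF combines result lists using ranks only, so BM25 scores and cosine similarities never have to be normalized onto a common scale, which a weighted sum would require. The function name is illustrative, and `k=60` is the constant commonly used in the literature; the value this project uses is not stated here.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one list.

    Each document earns 1 / (k + rank) from every list it appears in.
    Ties are broken by document ID so the output is deterministic.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


dense_hits = ["a", "b", "c"]   # e.g. from embedding search
sparse_hits = ["b", "c", "d"]  # e.g. from BM25
fused = reciprocal_rank_fusion([dense_hits, sparse_hits])
```

Documents found by both retrievers ("b" and "c") rise above those found by only one.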

📊 Evaluation Metrics

Faithfulness (NLI-Based)

We use a DeBERTa NLI model to verify that answers are grounded in the retrieved context:

| Query Type | Faithfulness | Confidence |
|---|---|---|
| Tesla manufacturing risks | 100% | HIGH |
| Meta advertising revenue | 100% | HIGH |
| NVIDIA data center | 90% | MEDIUM |
| Average | 97.5% | - |
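
One plausible way to turn NLI judgments into a faithfulness percentage is to score the fraction of answer sentences that the context entails. The sketch below assumes that aggregation; the `entails` callable is a stand-in for the DeBERTa NLI call, and both it and the sentence-level scoring are illustrative assumptions rather than the project's confirmed method.

```python
import re
from typing import Callable


def faithfulness(answer: str, context: str, entails: Callable) -> float:
    """Fraction of answer sentences judged as entailed by the context.

    entails(premise, hypothesis) -> bool is supplied by the caller,
    for example a wrapper around an NLI model.
    """
    # Naive sentence split on ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", answer) if s.strip()]
    if not sentences:
        return 0.0
    supported = sum(1 for sentence in sentences if entails(context, sentence))
    return supported / len(sentences)
```

Injecting the entailment check keeps the scoring logic testable without loading a model.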

Retrieval Quality

| Metric | Score |
|---|---|
| Context Relevance | 75%+ |
| Precision@5 | 0.7+ |
| MRR | 0.8+ |
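
For reference, Precision@k and MRR follow their standard definitions, sketched below. The function names are illustrative and are not taken from the project's src/evaluation module.

```python
def precision_at_k(retrieved, relevant, k=5):
    """Share of the top-k retrieved IDs that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(1 for doc_id in retrieved[:k] if doc_id in relevant) / k


def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant result, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs):
    """Average reciprocal rank over (retrieved, relevant) pairs, one per query."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(ret, rel) for ret, rel in runs) / len(runs)
```

Precision@5 measures how clean the top five results are, while MRR rewards putting the first relevant passage near the top.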

🛠️ Tech Stack

| Component | Technology |
|---|---|
| Embeddings | Ollama (nomic-embed-text), OpenAI-compatible |
| Vector Store | Qdrant (hybrid dense + sparse) |
| Sparse Encoder | FastEmbed BM25 |
| LLM | Ollama (Llama 3.2), OpenAI-compatible |
| Reranking | Cross-Encoder (ms-marco-MiniLM) |
| API | FastAPI |
| UI | Streamlit |
| Infrastructure | Docker, Docker Compose |
| Testing | pytest (275+ tests) |

🚀 Quick Start

Prerequisites

  • Docker & Docker Compose
  • 16GB+ RAM recommended

1. Clone & Start Services

git clone https://github.com/[your-username]/rag-system.git
cd rag-system

# Start Qdrant and Ollama
docker-compose up -d

# Pull required models
docker exec rag-ollama ollama pull nomic-embed-text
docker exec rag-ollama ollama pull llama3.2

2. Install Dependencies

python -m venv rag-env
source rag-env/bin/activate
pip install -r requirements.txt

3. Run the UI

streamlit run src/ui/app.py

4. Or Use the API

uvicorn src.api.main:app --reload

# Query endpoint
curl -X POST http://localhost:8000/query \
  -H "Content-Type: application/json" \
  -d '{"question": "What are the risk factors?"}'

📖 Usage Examples

Basic Query

from src.documents import MultiDocumentPipeline
from src.embeddings import OllamaEmbeddings, CachedEmbeddings
from src.vectorstores.qdrant_hybrid_store import QdrantHybridStore
from src.retrieval import HybridRetriever
from src.generation.ollama_llm import OllamaLLM

# Initialize
embeddings = CachedEmbeddings(OllamaEmbeddings(model="nomic-embed-text"))
vectorstore = QdrantHybridStore(collection_name="my_docs", dense_dimensions=768)
retriever = HybridRetriever(embeddings=embeddings, vectorstore=vectorstore)
llm = OllamaLLM(model="llama3.2")

pipeline = MultiDocumentPipeline(
    embeddings=embeddings,
    vectorstore=vectorstore,
    retriever=retriever,
    llm=llm,
)

# Ingest documents
pipeline.ingest_directory("./documents/")

# Query
response = pipeline.query("What are the key findings?")
print(f"Answer: {response.answer}")
print(f"Confidence: {response.confidence}")
print(f"Sources: {len(response.sources)}")

Filtered Query

# Query specific company only
response = pipeline.query(
    "What is the revenue growth?",
    filter_companies=["Tesla"],
)

Cross-Document Comparison

# Compare across multiple companies
response = pipeline.compare_companies(
    "Compare AI strategies",
    companies=["Meta", "Tesla", "NVIDIA"],
)

📁 Project Structure

rag-system/
├── src/
│   ├── api/              # FastAPI endpoints
│   ├── cache/            # Embedding & query caching
│   ├── chunkers/         # Document chunking strategies
│   ├── documents/        # Multi-document pipeline
│   ├── embeddings/       # Embedding providers
│   ├── enrichment/       # Metadata extraction
│   ├── evaluation/       # Retrieval metrics
│   ├── generation/       # LLM providers
│   ├── guardrails/       # Quality controls
│   ├── loaders/          # Document loaders
│   ├── pipeline/         # RAG orchestration
│   ├── reranking/        # Cross-encoder reranking
│   ├── retrieval/        # Search strategies
│   ├── summarization/    # Hierarchical summaries
│   ├── ui/               # Streamlit interface
│   └── vectorstores/     # Vector databases
├── tests/                # 275+ unit tests
├── config/               # YAML configuration
├── docker-compose.yml    # Infrastructure
└── requirements.txt

⚙️ Configuration

All settings live in config/rag.yaml:

# Chunking
chunking:
  strategy: structure_aware
  chunk_size: 1500

# Retrieval
retrieval:
  search_type: hybrid
  retrieval_top_k: 20
  reranking:
    enabled: true
    top_n: 5

# Guardrails
guardrails:
  score_threshold: 0.35
  min_sources: 2

# Caching
caching:
  embeddings:
    enabled: true
  queries:
    enabled: true
    ttl_seconds: 300
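
To show what the `ttl_seconds` setting implies for query caching, here is a minimal sketch of a time-limited cache. `TTLQueryCache`, its key normalization, and the injectable `clock` are assumptions made for this example, not the project's cache implementation.

```python
import time


class TTLQueryCache:
    """Cache responses per query and expire them after ttl_seconds."""

    def __init__(self, ttl_seconds=300.0, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._entries = {}   # normalized query -> (expiry time, response)

    @staticmethod
    def _key(query):
        # Treat queries that differ only in case or spacing as the same query.
        return " ".join(query.lower().split())

    def get(self, query):
        key = self._key(query)
        entry = self._entries.get(key)
        if entry is None:
            return None
        expires_at, response = entry
        if self._clock() >= expires_at:
            del self._entries[key]  # stale entry: drop it and report a miss
            return None
        return response

    def put(self, query, response):
        self._entries[self._key(query)] = (self._clock() + self._ttl, response)
```

A short TTL such as 300 seconds limits how long a cached answer can outlive a re-ingested document.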

🧪 Testing

# Run unit tests (integration tests are skipped)
pytest tests/ --ignore=tests/integration

# Run with coverage
pytest tests/ --cov=src --cov-report=html

📈 Performance Benchmarks

| Operation | Time | Improvement |
|---|---|---|
| Ingest 500 pages | 2.3s | - |
| Query (cold) | 1.8s | - |
| Query (cached) | 0.0001s | 15,000x |
| Embedding (cold) | 1.4s | - |
| Embedding (cached) | 0.003s | 436x |

Quality Metrics

| Metric | Before | After Optimizations |
|---|---|---|
| Faithfulness | ~30% | 97.5% |
| Hallucination Rate | ~40% | <3% |

🀝 Need Custom Development?

I build production RAG systems for companies. Services include:

  • Custom RAG Development - Tailored to your documents and domain
  • AI Chatbot Integration - Over your internal knowledge base
  • Performance Optimization - Make your existing RAG faster
  • Architecture Consulting - Design review and best practices

⭐ Star This Repo

If this helped you, consider starring the repo. It helps others find it!
