Local Retrieval-Augmented Generation (RAG) system for answering questions over PDF documents using semantic retrieval, reranking, and a locally hosted LLM.
The D-RAGon System is a fully local RAG pipeline that enables natural-language querying over private PDF documents. It retrieves relevant document passages using dense embeddings and reranking, then generates grounded answers using a local LLM via Ollama.
This approach improves factual accuracy and reduces hallucinations compared to standalone LLM generation.
The system runs entirely locally, requiring no external API calls.
Full technical documentation, architecture details, and evaluation methodology are available in Notion.

Key features:
- Fully local inference (no API dependency)
- Semantic retrieval using BGE embeddings
- Vector storage using ChromaDB
- Cross-encoder reranking for improved retrieval precision
- Local Llama-3.1 inference via Ollama
- Conversational chat support
- Source citation display
- Gradio-based web interface
- Evaluation framework with accuracy and hallucination metrics
Pipeline:
PDF → Chunking → Embedding → Vector DB → Retrieval → Re-Ranking → Prompt → LLM → Answer → Gradio UI
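The first stage of the pipeline splits each PDF's text into overlapping chunks before embedding. A minimal stand-in sketch of that step is below; the `chunk_size` and `overlap` values are illustrative, not the project's actual settings, and the real pipeline uses a LangChain splitter rather than this hand-rolled one:

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into fixed-size character chunks with overlap, so a
    sentence cut at one boundary still appears whole in the next chunk."""
    if chunk_size <= overlap:
        raise ValueError("chunk_size must exceed overlap")
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk:
            chunks.append(chunk)
    return chunks

pages = "..." * 400  # stand-in for text extracted from a PDF
chunks = chunk_text(pages, chunk_size=500, overlap=50)
```

Each chunk is then embedded and stored in the vector database; the overlap keeps boundary sentences retrievable from at least one chunk.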
Retrieval configuration:
- Initial retrieval: Top-10 (cosine similarity)
- Re-ranking: cross-encoder/ms-marco-MiniLM-L-6-v2
- Context sent to LLM: Top-4 chunks
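The two-stage configuration above (cosine top-10, then rerank to top-4) can be sketched in plain Python. Here the vector store is an in-memory list and the cross-encoder is replaced by a stand-in `rerank_score` function, purely to show the shape of retrieve-then-rerank, not the actual models:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def two_stage_retrieve(query_vec, store, rerank_score, k1=10, k2=4):
    """Stage 1: top-k1 chunks by cosine similarity to the query embedding.
    Stage 2: rescore those k1 candidates with a reranker, keep top-k2."""
    by_cosine = sorted(store, key=lambda item: cosine(query_vec, item["vec"]),
                       reverse=True)[:k1]
    reranked = sorted(by_cosine, key=lambda item: rerank_score(item["text"]),
                      reverse=True)
    return [item["text"] for item in reranked[:k2]]

# Toy store: 12 chunks with 3-d "embeddings" (stand-ins for BGE vectors).
store = [{"text": f"chunk-{i}", "vec": [1.0, i * 0.1, 0.5]} for i in range(12)]
top4 = two_stage_retrieve([1.0, 0.2, 0.5], store, rerank_score=len, k1=10, k2=4)
```

The cheap first stage narrows the corpus to 10 candidates; the expensive cross-encoder then only has to score 10 query-chunk pairs before the top 4 are packed into the prompt.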
Retrieval performance:
| Recall@K | Score |
|---|---|
| Recall@4 | 0.83 |
| Recall@6 | 0.87 |
| Recall@8 | 0.90 |
| Recall@10 | 0.97 |
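Recall@K here is the fraction of evaluation questions whose gold passage appears among the top-K retrieved chunks. A minimal computation, with hypothetical chunk IDs, looks like:

```python
def recall_at_k(results, gold, k):
    """results: per-question ranked lists of retrieved chunk IDs.
    gold: per-question ID of the chunk containing the answer."""
    hits = sum(1 for ranked, g in zip(results, gold) if g in ranked[:k])
    return hits / len(gold)

# Toy example with 4 questions and 5 retrieved chunks each.
results = [
    ["c1", "c2", "c3", "c4", "c5"],
    ["c9", "c7", "c8", "c2", "c1"],
    ["c4", "c4b", "c6", "c5", "c3"],
    ["c2", "c1", "c9", "c8", "c7"],
]
gold = ["c3", "c2", "c6", "c7"]
print(recall_at_k(results, gold, 4))  # 0.75: Q4's gold chunk is at rank 5
```

The table above reflects the same trade-off: sending more chunks to the LLM raises recall, at the cost of a longer prompt.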
End-to-end performance:
| Mode | Accuracy | Hallucination Rate | Avg Latency |
|---|---|---|---|
| Stateless | 1.00 | 0.13 | 3.78s |
| Conversational | 0.97 | 0.27 | 3.72s |
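Both end-to-end metrics are simple fractions over the evaluation set; a minimal tally, with hypothetical per-question judge labels (the actual labeling procedure is described in the evaluation framework), might look like:

```python
def summarize(labels):
    """labels: per-question dicts with 'correct' and 'hallucinated' booleans,
    e.g. as produced by a human or LLM judge."""
    n = len(labels)
    return {
        "accuracy": sum(l["correct"] for l in labels) / n,
        "hallucination_rate": sum(l["hallucinated"] for l in labels) / n,
    }

labels = [
    {"correct": True, "hallucinated": False},
    {"correct": True, "hallucinated": True},   # right answer, unsupported extra claim
    {"correct": True, "hallucinated": False},
    {"correct": False, "hallucinated": True},
]
stats = summarize(labels)
```

Note the two metrics are independent: an answer can be correct overall yet still contain an unsupported claim, which is why conversational mode can keep high accuracy while its hallucination rate rises.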
Tech stack:
- Python
- LangChain
- ChromaDB
- BAAI/bge-large-en-v1.5 embeddings
- cross-encoder/ms-marco-MiniLM-L-6-v2 reranker
- Llama-3.1-8B-Instruct via Ollama
- Gradio
- RTX 4080 local inference
Project structure:

```
D-RAGon_System/
│
├── Code/
│   ├── Rag_pdf_QA.ipynb        # Development and experimentation notebook
│   ├── Final_pipeline.ipynb    # Final integrated pipeline notebook
│   ├── Simple_pipeline.py      # Basic stateless RAG pipeline
│   ├── Pipeline_With_hist.py   # Conversational RAG pipeline
│   ├── Updated_pipeline.py     # Final production RAG pipeline
│   └── app.py                  # Gradio UI interface
│
├── Data/
│   ├── Faster-RCNN.pdf
│   ├── Cant-Hurt-Me.pdf
│   └── Deep-Work.pdf
│
├── Eval/
│   └── evaluation dataset and scripts
│
├── requirements.txt            # Python dependencies
├── README.md                   # Project documentation
└── .gitignore
```
Due to copyright restrictions, the books are not included in this repository.
Please download the required PDFs manually and place them in `Data/`.
Setup:

Clone the repository and create the environment:

```bash
git clone https://github.com/Daddy-Myth/D-RAGon_System.git
cd D-RAGon_System
conda create -n dragon python=3.10
conda activate dragon
pip install -r requirements.txt
```

Run Ollama in a separate terminal:

```bash
ollama serve
```

If the model is not installed yet, run once:

```bash
ollama run llama3.1
```

Launch the app:

```bash
python Code/app.py
```

Open your browser and go to http://localhost:7860. You can now:
- Upload and index PDF documents
- Ask questions using natural language
- View grounded answers with source citations
- Use conversational chat mode
CLI usage:

Ingest PDFs into the vector database:

```bash
python Code/Updated_pipeline.py ingest
```

Ask a single question:

```bash
python Code/Updated_pipeline.py query --q "What was David Goggins max weight?"
```

Start conversational chat mode:

```bash
python Code/Updated_pipeline.py chat
```

Reset chat history:

```bash
python Code/Updated_pipeline.py reset-chat
```

Show database statistics:

```bash
python Code/Updated_pipeline.py info
```

Sample indexed documents include:
- Faster R-CNN research paper
- Can't Hurt Me — David Goggins
- Deep Work — Cal Newport
Future work:
- Larger document corpus
- Domain-specific knowledge integration
- Hybrid retrieval (BM25 + dense embeddings)
- Faster inference via caching
- FastAPI deployment
- Docker containerization
- Larger embedding and LLM models
- Cloud deployment
Archit Yadav
Samsung Innovation Campus Capstone Project
