Local Retrieval-Augmented Generation (RAG) system for answering questions over PDF documents using semantic retrieval, reranking, and a locally hosted LLM.
The D-RAGon System is a fully local RAG pipeline that enables natural-language querying over private PDF documents. It retrieves relevant document passages using dense embeddings and reranking, then generates grounded answers using a local LLM via Ollama.
This approach improves factual accuracy and reduces hallucinations compared to standalone LLM generation.
The system runs entirely locally, requiring no external API calls.
Full technical documentation, architecture details, and evaluation methodology are available in Notion.

Key features:
- Fully local inference (no API dependency)
- Semantic retrieval using BGE embeddings
- Vector storage using ChromaDB
- Cross-encoder reranking for improved retrieval precision
- Local Llama-3.1 inference via Ollama
- Conversational chat support
- Source citation display
- Gradio-based web interface
- Evaluation framework with accuracy and hallucination metrics
Pipeline:
PDF → Chunking → Embedding → Vector DB → Retrieval → Re-Ranking → Prompt → LLM → Answer → Gradio UI
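The first stage of the pipeline splits each PDF's text into overlapping chunks before embedding. A minimal stand-in sketch of that step is below; the `chunk_size` and `overlap` values are illustrative, not the project's actual settings, and the real pipeline uses a LangChain splitter rather than this hand-rolled one:

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into fixed-size character chunks with overlap, so a
    sentence cut at one boundary still appears whole in the next chunk."""
    if chunk_size <= overlap:
        raise ValueError("chunk_size must exceed overlap")
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk:
            chunks.append(chunk)
    return chunks

pages = "..." * 400  # stand-in for text extracted from a PDF
chunks = chunk_text(pages, chunk_size=500, overlap=50)
```

Each chunk is then embedded and stored in the vector database; the overlap keeps boundary sentences retrievable from at least one chunk.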
Retrieval configuration:
- Initial retrieval: Top-10 (cosine similarity)
- Re-ranking: cross-encoder/ms-marco-MiniLM-L-6-v2
- Context sent to LLM: Top-4 chunks
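The two-stage configuration above (cosine top-10, then rerank to top-4) can be sketched in plain Python. Here the vector store is an in-memory list and the cross-encoder is replaced by a stand-in `rerank_score` function, purely to show the shape of retrieve-then-rerank, not the actual models:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def two_stage_retrieve(query_vec, store, rerank_score, k1=10, k2=4):
    """Stage 1: top-k1 chunks by cosine similarity to the query embedding.
    Stage 2: rescore those k1 candidates with a reranker, keep top-k2."""
    by_cosine = sorted(store, key=lambda item: cosine(query_vec, item["vec"]),
                       reverse=True)[:k1]
    reranked = sorted(by_cosine, key=lambda item: rerank_score(item["text"]),
                      reverse=True)
    return [item["text"] for item in reranked[:k2]]

# Toy store: 12 chunks with 3-d "embeddings" (stand-ins for BGE vectors).
store = [{"text": f"chunk-{i}", "vec": [1.0, i * 0.1, 0.5]} for i in range(12)]
top4 = two_stage_retrieve([1.0, 0.2, 0.5], store, rerank_score=len, k1=10, k2=4)
```

The cheap first stage narrows the corpus to 10 candidates; the expensive cross-encoder then only has to score 10 query-chunk pairs before the top 4 are packed into the prompt.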
Retrieval performance:
| Recall@K | Score |
|---|---|
| Recall@4 | 0.83 |
| Recall@6 | 0.87 |
| Recall@8 | 0.90 |
| Recall@10 | 0.97 |
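Recall@K here is the fraction of evaluation questions whose gold passage appears among the top-K retrieved chunks. A minimal computation, with hypothetical chunk IDs, looks like:

```python
def recall_at_k(results, gold, k):
    """results: per-question ranked lists of retrieved chunk IDs.
    gold: per-question ID of the chunk containing the answer."""
    hits = sum(1 for ranked, g in zip(results, gold) if g in ranked[:k])
    return hits / len(gold)

# Toy example with 4 questions and 5 retrieved chunks each.
results = [
    ["c1", "c2", "c3", "c4", "c5"],
    ["c9", "c7", "c8", "c2", "c1"],
    ["c4", "c4b", "c6", "c5", "c3"],
    ["c2", "c1", "c9", "c8", "c7"],
]
gold = ["c3", "c2", "c6", "c7"]
print(recall_at_k(results, gold, 4))  # 0.75: Q4's gold chunk is at rank 5
```

The table above reflects the same trade-off: sending more chunks to the LLM raises recall, at the cost of a longer prompt.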
End-to-end performance:
| Mode | Accuracy | Hallucination Rate | Avg Latency |
|---|---|---|---|
| Stateless | 1.00 | 0.13 | 3.78s |
| Conversational | 0.97 | 0.27 | 3.72s |
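Both end-to-end metrics are simple fractions over the evaluation set; a minimal tally, with hypothetical per-question judge labels (the actual labeling procedure is described in the evaluation framework), might look like:

```python
def summarize(labels):
    """labels: per-question dicts with 'correct' and 'hallucinated' booleans,
    e.g. as produced by a human or LLM judge."""
    n = len(labels)
    return {
        "accuracy": sum(l["correct"] for l in labels) / n,
        "hallucination_rate": sum(l["hallucinated"] for l in labels) / n,
    }

labels = [
    {"correct": True, "hallucinated": False},
    {"correct": True, "hallucinated": True},   # right answer, unsupported extra claim
    {"correct": True, "hallucinated": False},
    {"correct": False, "hallucinated": True},
]
stats = summarize(labels)
```

Note the two metrics are independent: an answer can be correct overall yet still contain an unsupported claim, which is why conversational mode can keep high accuracy while its hallucination rate rises.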
Tech stack:
- Python
- LangChain
- ChromaDB
- BAAI/bge-large-en-v1.5 embeddings
- cross-encoder/ms-marco-MiniLM-L-6-v2 reranker
- Llama-3.1-8B-Instruct via Ollama
- Gradio
- RTX 4080 local inference
Project structure:

```
D-RAGon_System/
│
├── Code/
│   ├── Rag_pdf_QA.ipynb        # Development and experimentation notebook
│   ├── Final_pipeline.ipynb    # Final integrated pipeline notebook
│   ├── Simple_pipeline.py      # Basic stateless RAG pipeline
│   ├── Pipeline_With_hist.py   # Conversational RAG pipeline
│   ├── Updated_pipeline.py     # Final production RAG pipeline
│   └── app.py                  # Gradio UI interface
│
├── Data/
│   ├── Faster-RCNN.pdf
│   ├── Cant-Hurt-Me.pdf
│   └── Deep-Work.pdf
│
├── Eval/
│   └── evaluation dataset and scripts
│
├── requirements.txt            # Python dependencies
├── README.md                   # Project documentation
└── .gitignore
```
Due to copyright restrictions, the books are not included in this repository.
Please download the required PDFs manually and place them in `Data/`.
Setup:

Clone the repository and create the environment:

```bash
git clone https://github.com/Daddy-Myth/D-RAGon_System.git
cd D-RAGon_System
conda create -n dragon python=3.10
conda activate dragon
pip install -r requirements.txt
```

Run Ollama in a separate terminal:

```bash
ollama serve
```

If the model is not installed yet, run once:

```bash
ollama run llama3.1
```

Launch the app:

```bash
python Code/app.py
```

Open your browser and go to http://localhost:7860. You can now:
- Upload and index PDF documents
- Ask questions using natural language
- View grounded answers with source citations
- Use conversational chat mode
CLI usage:

Ingest PDFs into the vector database:

```bash
python Code/Updated_pipeline.py ingest
```

Ask a single question:

```bash
python Code/Updated_pipeline.py query --q "What was David Goggins max weight?"
```

Start conversational chat mode:

```bash
python Code/Updated_pipeline.py chat
```

Reset chat history:

```bash
python Code/Updated_pipeline.py reset-chat
```

Show database statistics:

```bash
python Code/Updated_pipeline.py info
```

Sample indexed documents include:
- Faster R-CNN research paper
- Can't Hurt Me — David Goggins
- Deep Work — Cal Newport
Future work:
- Larger document corpus
- Domain-specific knowledge integration
- Hybrid retrieval (BM25 + dense embeddings)
- Faster inference via caching
- FastAPI deployment
- Docker containerization
- Larger embedding and LLM models
- Cloud deployment
Archit Yadav
Samsung Innovation Campus Capstone Project
