GitHub - MeidiLprog/DataDocChatbot: A chatbot made using RAG to answer questions regarding dataengineering/sql/mlflow

DataDocChatbot

DataDocChatbot is a fully custom Retrieval-Augmented Generation (RAG) assistant built without LangChain, meaning all internal steps (OCR, chunking, embeddings, vector search, prompt-building, inference) were engineered manually.

It allows a user to ask natural language questions about private PDF documents (SQL manuals, Data Engineering books, Pandas references, MLflow documentation, etc.) including scanned PDFs and returns grounded answers with verifiable citations.

The chatbot reads, processes, and understands entire documents with or without images stored locally, then uses a combination of semantic search + LLM reasoning to answer the user.

What makes this projects unique ?

Unlike many RAG projects that rely on LangChain abstractions, every stage is built from scratch thus the whole pipeline, which demonstrates a deeper engineering understanding:

OCR reading for scanned PDFs
Manual chunking strategy
Mathematical vector embeddings
Direct Pinecone API usage (no wrapper)
Prompt engineering
LLM inference using Groq (Llama-3.1-8B)
Gradio interface with streaming responses

This gives full control over the pipeline, nothing is “magic”.

Architecture

PDFs (docs/)
     |
[PyMuPDF + Tesseract OCR]
     |
     v
Normalize + Chunk  --->  chunks.jsonl
     |                       |
     v                       v
 [Sentence Transformers]   [Question embedding]
      |                           |
      v                           v
   Pinecone Upsert <--- Pinecone Query (top_k)
      |
      v
   RAG Prompt → Groq (Llama-3.1-8B) → Final Answer + Citations

Mathematics Used

Sentence Embeddings

Each text chunk $t$ becomes a 384-dimensional semantic vector:

$$ e(t) \in \mathbb{R}^{384} $$

All embeddings are L2-normalized:

$$ \lVert e \rVert_2 = 1 \quad \Longrightarrow \quad \sum_{i=1}^{384} e_i^2 = 1 $$

This makes cosine similarity equal to the dot product. Based on Cauchy-Schwartz' inequality

Cosine Similarity

Given question embedding $q$ and chunk embedding $e$:

$$ \cos(\theta) = \frac{q \cdot e}{\lVert q \rVert , \lVert e \rVert} $$

With normalized vectors:

$$ \cos(\theta) = q \cdot e $$

Higher score ⇒ more relevant chunk. Pinecone ranks results with this metric.

Overlapping Chunking

Splitting text into fixed-size chunks can break sentences across boundaries.
To preserve context, overlapping windows are used (e.g., 900 words, overlap 120).
This increases recall and improves retrieval accuracy.

Stable Vector IDs

Each vector uses a SHA-1 hash of (doc || page || text):

$$ ID = \mathrm{SHA1}(\text{doc} ,||, \text{page} ,||, \text{text}) $$

This prevents duplicates and makes ingestion idempotent.

Project Files

File	Purpose
`chunky_cut.py`	PDF → OCR → chunking → `.jsonl`
`embedding.py`	JSONL → embeddings → Pinecone upsert
`retrivial_question.py`	Retrieval debug (top‑k check)
`rag_answers.py`	Full RAG: retrieval + Groq + citations
`app.py`	Gradio chat UI (streaming)

Installation

python -m venv chatbot
source chatbot/Scripts/activate       # Windows: .\chatbot\Scripts\activate

pip install -U pip
pip install -U gradio==4.44.1 gradio_client
pip install sentence-transformers pinecone-client groq python-dotenv pymupdf pytesseract pillow

Create .env:

PINECONE_API_KEY=your_key
PINECONE_INDEX=rag-data
PINECONE_NAMESPACE=data_docs
GROQ_API_KEY=your_key

Windows OCR (if Tesseract not in PATH):

pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"

Running the Pipeline

PDF → chunks.jsonl

python chunky_cut.py

Embeddings → Pinecone

python embedding.py

Test retrieval

python retrivial_question.py

Launch chatbot

python app.py

Troubleshooting

Issue	Fix
`TesseractNotFoundError`	Install Tesseract or set explicit path
Empty retrieval results	Check namespace + ingestion done
Gradio schema error	`pip install -U gradio==4.44.1`
Localhost blocked	launch with `server_name="127.0.0.1"`

Security

Do not commit .env to Git
Rotate API keys if exposed
Separate Pinecone namespaces for dev/prod

License

MIT — see LICENSE

Acknowledgments

Sentence Transformers
Pinecone
Groq (Llama‑3.1)
Gradio
PyMuPDF & Tesseract OCR

Conclusion

A project carried out with dedication and commitment to unveil the secrets of RAGS and agents to acquire a solid foundation of the domain, and ready to bring it forth to a larger scale

Name		Name	Last commit message	Last commit date
Latest commit History 42 Commits
Scripts		Scripts
demo		demo
docs		docs
share/man/man1		share/man/man1
src		src
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
chunks_SQL_manual.jsonl		chunks_SQL_manual.jsonl
pyvenv.cfg		pyvenv.cfg
requirements.txt		requirements.txt
versions.sh		versions.sh

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

DataDocChatbot

The chatbot reads, processes, and understands entire documents with or without images stored locally, then uses a combination of semantic search + LLM reasoning to answer the user.

What makes this projects unique ?

This gives full control over the pipeline, nothing is “magic”.

Architecture

Mathematics Used

Sentence Embeddings

Cosine Similarity

Overlapping Chunking

Stable Vector IDs

Project Files

Installation

Running the Pipeline

PDF → chunks.jsonl

Embeddings → Pinecone

Test retrieval

Launch chatbot

Troubleshooting

Security

License

Acknowledgments

Conclusion

About

Uh oh!

Releases

Packages

Languages

License

MeidiLprog/DataDocChatbot

Folders and files

Latest commit

History

Repository files navigation

DataDocChatbot

The chatbot reads, processes, and understands entire documents with or without images stored locally, then uses a combination of semantic search + LLM reasoning to answer the user.

What makes this projects unique ?

This gives full control over the pipeline, nothing is “magic”.

Architecture

Mathematics Used

Sentence Embeddings

Cosine Similarity

Overlapping Chunking

Stable Vector IDs

Project Files

Installation

Running the Pipeline

PDF → chunks.jsonl

Embeddings → Pinecone

Test retrieval

Launch chatbot

Troubleshooting

Security

License

Acknowledgments

Conclusion

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages