DataDocChatbot is a fully custom Retrieval-Augmented Generation (RAG) assistant built without LangChain, meaning all internal steps (OCR, chunking, embeddings, vector search, prompt-building, inference) were engineered manually.
It lets a user ask natural-language questions about private PDF documents (SQL manuals, Data Engineering books, Pandas references, MLflow documentation, etc.), including scanned PDFs, and returns grounded answers with verifiable citations.
The chatbot reads and processes entire locally stored documents, with or without images, then combines semantic search with LLM reasoning to answer the user.
Unlike many RAG projects that rely on LangChain abstractions, every stage of the pipeline is built from scratch, which demonstrates a deeper engineering understanding:
- OCR reading for scanned PDFs
- Manual chunking strategy
- Mathematical vector embeddings
- Direct Pinecone API usage (no wrapper)
- Prompt engineering
- LLM inference using Groq (Llama-3.1-8B)
- Gradio interface with streaming responses
```
PDFs (docs/)
      |
[PyMuPDF + Tesseract OCR]
      |
      v
Normalize + Chunk ---> chunks.jsonl
      |                     |
      v                     v
[Sentence Transformers]  [Question embedding]
      |                     |
      v                     v
Pinecone Upsert  <---  Pinecone Query (top_k)
                            |
                            v
RAG Prompt → Groq (Llama-3.1-8B) → Final Answer + Citations
```
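The first stage (PDF → text) could look like the sketch below: extract the text layer with PyMuPDF, and fall back to Tesseract OCR when a page is scanned. The file path and DPI are illustrative, not the project's exact settings.

```python
import io

import fitz  # PyMuPDF
import pytesseract
from PIL import Image

def extract_page_text(page) -> str:
    text = page.get_text().strip()
    if text:
        return text                     # born-digital page: text layer exists
    pix = page.get_pixmap(dpi=300)      # scanned page: render it, then OCR
    img = Image.open(io.BytesIO(pix.tobytes("png")))
    return pytesseract.image_to_string(img)

with fitz.open("docs/sample.pdf") as doc:
    pages = [extract_page_text(page) for page in doc]
```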
Each text chunk $t_i$ is embedded into a vector $c_i = f(t_i) \in \mathbb{R}^d$ with Sentence Transformers.
All embeddings are L2-normalized:

$$\hat{c}_i = \frac{c_i}{\lVert c_i \rVert_2}$$

This makes cosine similarity equal to the dot product; by the Cauchy-Schwarz inequality, the resulting score is bounded in $[-1, 1]$.
Given a question embedding $q$, normalized the same way ($\hat{q} = q / \lVert q \rVert_2$), with normalized vectors:

$$\text{score}(q, c_i) = \hat{q} \cdot \hat{c}_i = \cos \theta_i$$

Higher score ⇒ more relevant chunk. Pinecone ranks results with this metric.
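A minimal sketch of this scoring, assuming the widely used `all-MiniLM-L6-v2` model (the model actually loaded by `embedding.py` may differ):

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

chunk = "GROUP BY collapses rows that share the same key values."
question = "What does GROUP BY do in SQL?"

# normalize_embeddings=True applies the L2 normalization above, so
# cosine similarity reduces to a plain dot product.
c, q = model.encode([chunk, question], normalize_embeddings=True)

score = float(np.dot(q, c))   # = cos(theta), bounded in [-1, 1]
print(f"similarity: {score:.3f}")
```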
Splitting text into fixed-size chunks can break sentences across boundaries.
To preserve context, overlapping windows are used (e.g., 900 words, overlap 120).
This increases recall and improves retrieval accuracy.
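A minimal word-level sliding-window chunker illustrating the idea (the exact tokenization and boundary handling in `chunky_cut.py` may differ):

```python
def chunk_words(text: str, size: int = 900, overlap: int = 120) -> list[str]:
    """Split text into overlapping word windows (defaults match the README)."""
    words = text.split()
    step = size - overlap
    # Stop once the remaining tail is already covered by the previous window.
    return [
        " ".join(words[start:start + size])
        for start in range(0, max(len(words) - overlap, 1), step)
    ]
```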
Each vector ID is a SHA-1 hash of `(doc || page || text)`.
This prevents duplicates and makes ingestion idempotent.
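A sketch of how such idempotent IDs and upserts might look, assuming the v3+ `pinecone` client API and a placeholder vector (document name and metadata are illustrative):

```python
import hashlib
import os

from pinecone import Pinecone

def chunk_id(doc: str, page: int, text: str) -> str:
    # Same (doc, page, text) -> same ID, so re-running ingestion
    # overwrites the existing vector instead of duplicating it.
    return hashlib.sha1(f"{doc}||{page}||{text}".encode("utf-8")).hexdigest()

pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])
index = pc.Index(os.environ["PINECONE_INDEX"])

embedding = [0.0] * 384  # placeholder; the real vector comes from the model
index.upsert(
    vectors=[{
        "id": chunk_id("sql_manual.pdf", 12, "GROUP BY collapses rows..."),
        "values": embedding,
        "metadata": {"doc": "sql_manual.pdf", "page": 12},
    }],
    namespace=os.environ.get("PINECONE_NAMESPACE", "data_docs"),
)
```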
| File | Purpose |
|---|---|
| `chunky_cut.py` | PDF → OCR → chunking → `.jsonl` |
| `embedding.py` | JSONL → embeddings → Pinecone upsert |
| `retrivial_question.py` | Retrieval debug (top-k check) |
| `rag_answers.py` | Full RAG: retrieval + Groq + citations |
| `app.py` | Gradio chat UI (streaming) |
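A rough sketch of the answer stage in `rag_answers.py` (prompt wording, hit structure, and the Groq model ID are illustrative assumptions, not the project's exact code):

```python
import os

from groq import Groq

client = Groq(api_key=os.environ["GROQ_API_KEY"])

def answer(question: str, hits: list[dict]) -> str:
    # Each hit is assumed to carry the chunk text plus doc/page metadata.
    context = "\n\n".join(
        f"[{h['doc']} p.{h['page']}] {h['text']}" for h in hits
    )
    prompt = (
        "Answer strictly from the context below and cite sources as "
        "[doc p.page]. If the answer is not in the context, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    resp = client.chat.completions.create(
        model="llama-3.1-8b-instant",  # a Groq-hosted Llama-3.1-8B model
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content
```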
```bash
python -m venv chatbot
source chatbot/bin/activate   # Windows: .\chatbot\Scripts\activate
pip install -U pip
pip install -U gradio==4.44.1 gradio_client
pip install sentence-transformers pinecone-client groq python-dotenv pymupdf pytesseract pillow
```

Create `.env`:
```
PINECONE_API_KEY=your_key
PINECONE_INDEX=rag-data
PINECONE_NAMESPACE=data_docs
GROQ_API_KEY=your_key
```
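At runtime these keys can be loaded with `python-dotenv` (already in the dependency list above), for example:

```python
import os

from dotenv import load_dotenv

load_dotenv()  # reads .env from the working directory

PINECONE_API_KEY = os.environ["PINECONE_API_KEY"]
PINECONE_INDEX = os.getenv("PINECONE_INDEX", "rag-data")
PINECONE_NAMESPACE = os.getenv("PINECONE_NAMESPACE", "data_docs")
GROQ_API_KEY = os.environ["GROQ_API_KEY"]
```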
Windows OCR (if Tesseract not in PATH):
```python
pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"
```

Run the pipeline in order:

```bash
python chunky_cut.py
python embedding.py
python retrivial_question.py
python app.py
```

| Issue | Fix |
|---|---|
| `TesseractNotFoundError` | Install Tesseract or set an explicit path |
| Empty retrieval results | Check the namespace and that ingestion completed |
| Gradio schema error | `pip install -U gradio==4.44.1` |
| Localhost blocked | Launch with `server_name="127.0.0.1"` |
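For reference, a minimal streaming Gradio chat in the spirit of `app.py`, using the `server_name` fix from the table above (the word-by-word stub stands in for the real RAG pipeline):

```python
import time

import gradio as gr

def answer_stream(message):
    # Stand-in for the real RAG pipeline: stream an echo word by word.
    for word in f"You asked: {message}".split():
        time.sleep(0.05)
        yield word + " "

def chat_fn(message, history):
    partial = ""
    for token in answer_stream(message):
        partial += token
        yield partial   # each yield updates the chat bubble (streaming)

demo = gr.ChatInterface(chat_fn, title="DataDocChatbot")
demo.launch(server_name="127.0.0.1")
```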
- Do not commit `.env` to Git
- Rotate API keys if exposed
- Separate Pinecone namespaces for dev/prod
MIT — see LICENSE
- Sentence Transformers
- Pinecone
- Groq (Llama‑3.1)
- Gradio
- PyMuPDF & Tesseract OCR
A project carried out with dedication and commitment to demystify RAG systems and agents, build a solid foundation in the domain, and carry that foundation to a larger scale.

