A Retrieval-Augmented Generation (RAG) system for document-based Q&A.
This system lets users upload documents (PDF, TXT, URLs, or raw text) and ask questions whose answers are grounded strictly in those sources. The agent refuses to hallucinate: if the information isn't in the provided documents, it says so clearly.
Built for: Documentation Support Agent Technical Assessment
- ✅ Multi-source ingestion: PDF, TXT files, web URLs, and raw text
- ✅ Semantic chunking: Uses LangChain's SemanticChunker for intelligent text splitting
- ✅ Pure semantic search: sentence-transformers embeddings + FAISS vector store
- ✅ Cosine similarity: Normalized vectors for meaning-based retrieval
- ✅ Zero hallucination: Multi-layer guardrails prevent made-up answers
- ✅ Source highlighting: Shows exact passages with similarity scores
- ✅ Web interface: Clean Streamlit UI with document management
```
User Query
    ↓
[Embedding Model] ← all-MiniLM-L6-v2 (384-dim vectors)
    ↓
[FAISS Search] ← Cosine similarity (IndexFlatIP)
    ↓
[Top-5 Chunks] ← Most relevant passages retrieved
    ↓
[Gemini LLM] ← Answer generation (temp=0.1, strict prompt)
    ↓
[Response] ← Answer + source citations + similarity scores
```
DocumentProcessor
- Extracts text from PDFs, TXT files, URLs
- Uses LangChain SemanticChunker for context-aware splitting
- Preserves semantic coherence across chunks
VectorStore
- sentence-transformers for embeddings
- FAISS IndexFlatIP for fast cosine similarity search
- Normalized vectors for semantic (not magnitude) comparison
AnswerGenerator
- Gemini 2.5 Flash with strict source-only prompting
- Temperature: 0.1 (low creativity = high factuality)
- Mandatory source citations in responses
ChatBot
- Orchestrates the full RAG pipeline (sketched below)
- Manages document lifecycle (ingest/clear)
- Coordinates retrieval and generation
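Taken together, these four classes form a thin pipeline. A minimal orchestration sketch, with hypothetical method names (`process`, `add`, `search`, `answer`) that may differ from the actual implementation in doc_support_agent.py:

```python
# Hypothetical orchestration sketch; real method names may differ.
class ChatBot:
    def __init__(self, processor, store, generator):
        self.processor = processor  # DocumentProcessor
        self.store = store          # VectorStore
        self.generator = generator  # AnswerGenerator

    def ingest(self, source):
        chunks = self.processor.process(source)  # extract text + semantic chunking
        self.store.add(chunks)                   # embed + index in FAISS

    def query(self, question, k=5):
        chunks = self.store.search(question, k)         # top-k semantic retrieval
        return self.generator.answer(question, chunks)  # grounded, cited answer
```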
Prerequisites:
- Python 3.8 or higher
- Gemini API key (get a free key at https://ai.google.dev/)
- Install dependencies:
  ```
  pip install -r requirements.txt
  ```
- Run the application:
  ```
  streamlit run doc_support_agent.py
  ```
- Access the interface:
  - Opens automatically at http://localhost:8501
  - Enter your Gemini API key when prompted
Enter your Gemini API key in the text field. Wait for "✅ Chatbot initialized successfully!"
Choose from three options:
- Upload PDF/TXT: Select local files
- Enter URL: Paste webpage URLs for scraping
- Paste Text: Directly input text content
Multiple documents can be added sequentially.
Type your question in the text field. The system will:
- Search for relevant chunks (semantic search)
- Generate answer using only those sources
- Display answer with source citations
- Show source excerpts with similarity scores
Click "๐๏ธ Clear All Documents" to remove all ingested data and start fresh.
Layer 1: Strict Prompting (see the sketch after Layer 4)

```
"Answer ONLY using information from the sources below"
"DO NOT use any external knowledge"
"If sources don't contain enough info, say so clearly"
```
Layer 2: Low Temperature (0.1)
- Minimizes LLM creativity and randomness
- Ensures deterministic, grounded responses
- Reduces likelihood of invented information
Layer 3: Mandatory Citations
- LLM must reference [Source 1], [Source 2], etc.
- Makes grounding transparent and verifiable
- Easy to trace answers back to documents
Layer 4: Semantic Filtering
- Only retrieves chunks above relevance threshold
- Top-k retrieval (default: 5 chunks)
- Prevents irrelevant context from confusing LLM
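Layers 1 and 3 boil down to a prompt template along these lines. A sketch only, with a hypothetical helper name; the exact wording lives in doc_support_agent.py:

```python
from typing import List

# Hypothetical prompt builder; the actual wording in
# doc_support_agent.py may differ.
def build_prompt(question: str, chunks: List[str]) -> str:
    sources = "\n\n".join(
        f"[Source {i + 1}]\n{chunk}" for i, chunk in enumerate(chunks)
    )
    return (
        "Answer ONLY using information from the sources below.\n"
        "DO NOT use any external knowledge.\n"
        "If the sources don't contain enough information, say so clearly.\n"
        "Cite every claim as [Source 1], [Source 2], etc.\n\n"
        f"{sources}\n\nQuestion: {question}"
    )
```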
Traditional fixed-size chunking (e.g., 1000 characters) often breaks mid-sentence or mid-thought. LangChain's SemanticChunker splits text based on semantic coherence:
```python
# Traditional chunking problems:
"...Python supports OOP. |CHUNK BREAK| Python has simple syntax..."
# Context lost! Each chunk lacks full meaning.

# Semantic chunking preserves context:
"...Python supports OOP. Python has simple syntax..."
# Complete thoughts stay together.
```
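A minimal sketch of wiring SemanticChunker to the same embedding model; import paths vary across LangChain versions, so treat the exact modules as an assumption:

```python
from langchain_experimental.text_splitter import SemanticChunker
from langchain_community.embeddings import HuggingFaceEmbeddings

# Chunk boundaries are picked where embedding similarity between
# consecutive sentences drops, instead of at a fixed character count.
embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
chunker = SemanticChunker(embeddings)

with open("python_tutorial.txt") as f:  # illustrative input file
    chunks = chunker.split_text(f.read())
```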
Normalization is what turns FAISS's inner-product search into cosine similarity:

```python
# Without normalization (Euclidean distance)
v1 = [0.5, 0.5]  # Short vector
v2 = [5.0, 5.0]  # Long vector, SAME direction
distance = 6.36  # Seems very different!

# With normalization (cosine similarity)
faiss.normalize_L2(embeddings)
v1_norm = [0.707, 0.707]
v2_norm = [0.707, 0.707]
similarity = 1.0  # Correctly identifies them as similar!
```

Key insight: For text, we care about semantic direction (meaning), not vector magnitude (arbitrary scale). Normalization + IndexFlatIP gives us pure cosine similarity.
How retrieval works:
- Query encoding: Convert the question to a 384-dim embedding
- Normalization: L2-normalize the query vector
- FAISS search: IndexFlatIP computes dot products (= cosine similarity for normalized vectors)
- Top-k selection: Return the 5 most similar chunks
- Context building: Combine the chunks into the LLM prompt (condensed in the sketch below)
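A condensed sketch of these steps using faiss and sentence-transformers directly; the corpus and variable names are illustrative:

```python
import faiss
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # 384-dim embeddings

chunks = ["Python supports OOP.", "Streamlit builds web UIs in Python."]
emb = model.encode(chunks).astype("float32")
faiss.normalize_L2(emb)                  # unit length -> inner product = cosine
index = faiss.IndexFlatIP(emb.shape[1])  # exact inner-product search
index.add(emb)

query = model.encode(["Which paradigms does Python support?"]).astype("float32")
faiss.normalize_L2(query)
scores, ids = index.search(query, 2)     # top-k chunks with similarity scores
for score, i in zip(scores[0], ids[0]):
    print(f"{score:.3f}  {chunks[i]}")
```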
Configuration options:

```python
# Automatic semantic-based chunking
# No manual chunk_size or overlap needed
# LangChain determines optimal boundaries
```

```python
k = 5  # Number of chunks to retrieve
# Adjustable: chatbot.query(question, k=10)
```

```python
generation_config = {
    "temperature": 0.1,        # Low = factual, high = creative
    "top_p": 0.9,              # Nucleus sampling
    "max_output_tokens": 1500  # Response length limit
}
```

all-MiniLM-L6-v2 (embedding model)
- Speed: Fast inference, only 384 dimensions
- Quality: Good semantic understanding for general text
- Size: 80MB model (reasonable download)
Alternative: all-mpnet-base-v2 (768-dim, better quality, slower)
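Swapping the model is a one-line change in sentence-transformers; note that the FAISS index dimension must then match the new model's output size:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")     # 384-dim, fast, ~80MB
# model = SentenceTransformer("all-mpnet-base-v2")  # 768-dim, better quality, slower
```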
FAISS (vector store)
- Performance: Millisecond search even with 100k+ vectors
- Memory efficient: Optimized C++ implementation
- Scalable: Supports billions of vectors
- Industry standard: Developed by Meta AI
Alternative: Pinecone, Weaviate (managed services, more features)
Gemini 2.5 Flash (LLM)
- Speed: Fast response times
- Quality: Good instruction following
- Cost: Free tier available
- Reliability: Handles strict prompting well
Alternative: GPT-4, Claude (better quality, higher cost), or transformer-based open-source models such as LiquidAI/LFM2-1.2B-RAG
SemanticChunker (text splitting)
- Context preservation: Doesn't split mid-thought
- Semantic coherence: Uses embeddings to find boundaries
- Better retrieval: More meaningful chunks = better matches
Alternative: Fixed-size chunking (simpler, less accurate)
```
.
├── doc_support_agent.py  # Main Streamlit application
├── requirements.txt      # Python dependencies
├── README.md             # This file
└── .gitignore            # Git exclusions (API keys, etc.)
```
```
ChatBot
├── DocumentProcessor (ingestion + chunking)
├── VectorStore (embeddings + search)
├── AnswerGenerator (LLM interface)
└── Similar_source (formatting utilities)
```
- In-memory storage: Documents are cleared on app restart
  - Production fix: Use a persistent vector DB (Pinecone, Weaviate)
- No conversation history: Each query is independent
  - Production fix: Implement chat memory with a context window
- English-optimized: The embedding model is trained primarily on English
  - Production fix: Use multilingual models (paraphrase-multilingual)
- PDF quality dependent: Scanned PDFs won't extract text
  - Production fix: Add OCR (pytesseract, AWS Textract)
- Single-session: No user accounts or saved documents
  - Production fix: Add authentication and database storage
- Re-ranking stage: Add cross-encoder for better precision
- Conversation memory: Track dialogue context
- Document versioning: Update docs without full re-index
- Batch upload: Process multiple files simultaneously
- Query caching: Store common question-answer pairs
- Advanced filters: Filter by document source, date, etc.
- Export functionality: Save Q&A pairs as markdown/PDF
- Analytics: Track popular queries, retrieval quality
- Clear Answer Test
  - Upload a Python tutorial
  - Ask: "What is a list comprehension?"
  - ✅ Should get a detailed answer with sources
- Hallucination Prevention Test
  - Same document
  - Ask: "How do I use React hooks?"
  - ✅ Should refuse (not in the Python docs)
- Multi-source Test
  - Upload multiple documents
  - Ask a question spanning both
  - ✅ Should synthesize from multiple sources
- Edge Cases
  - Empty query → validation error
  - No documents uploaded → warning message
  - Malformed PDF → graceful error handling
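For regression purposes, the hallucination test could be automated against the hypothetical ChatBot interface sketched earlier; the method names and refusal phrasing below are assumptions, not the actual implementation:

```python
def test_hallucination_prevention(chatbot):
    chatbot.ingest("python_tutorial.txt")  # Python-only content
    answer = chatbot.query("How do I use React hooks?")
    # The agent should refuse rather than invent an answer; the exact
    # refusal wording is an assumption about the strict prompt's output.
    refusals = ("not in the provided", "don't contain", "insufficient information")
    assert any(phrase in answer.lower() for phrase in refusals)
```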
- PDF files (PyPDF2)
- TXT files (native Python)
- URLs (BeautifulSoup + requests)
- Raw text (direct input)
- Intelligent chunking (SemanticChunker)
- HuggingFace model (sentence-transformers)
- Vector database (FAISS in-memory)
- Semantic search (cosine similarity)
- No keyword matching (pure embeddings)
- Question input
- Strictly source-based answers
- Source passage highlighting
- Similarity scores displayed
- Clean web UI (Streamlit)
- Strict prompting
- Low temperature (0.1)
- Mandatory citations
- Clear "insufficient information" responses
- No invented content
- Modular structure (4 main classes)
- Clear separation of concerns
- Type hints (Pydantic models)
- Error handling
- Clean, readable code
- Modern RAG: Uses current best practices (semantic chunking, normalized vectors)
- Zero to minimal hallucination: Multiple layers of prevention
- Clean code: Well-structured, typed, documented
For setup issues:
- Check requirements.txt: are all dependencies installed?
- Python 3.8+ installed? Check with `python --version`
- Valid Gemini API key from https://ai.google.dev/
- First run downloads the model (~80MB); wait for it to complete
Documentation Support Agent - Siddhi Pandya
Built with ❤️ for accurate, trustworthy document Q&A
