📚 Documentation Support Agent

A Retrieval-Augmented Generation (RAG) system for document-based Q&A.



🎯 Project Overview

This system lets users upload documents (PDF, TXT, URLs, or raw text) and ask questions that are answered strictly from those sources. The agent refuses to hallucinate: if the information isn't in the provided documents, it clearly says so.

Built for: Documentation Support Agent Technical Assessment


🖥️ User Interface Preview

[Screenshot: Documentation Support Agent UI]


✨ Key Features

  • ✅ Multi-source ingestion: PDF, TXT files, web URLs, and raw text
  • ✅ Semantic chunking: Uses LangChain's SemanticChunker for intelligent text splitting
  • ✅ Pure semantic search: sentence-transformers embeddings + FAISS vector store
  • ✅ Cosine similarity: Normalized vectors for meaning-based retrieval
  • ✅ Hallucination guardrails: Multi-layer defenses prevent made-up answers
  • ✅ Source highlighting: Shows exact passages with similarity scores
  • ✅ Web interface: Clean Streamlit UI with document management

๐Ÿ—๏ธ Architecture

User Query
    โ†“
[Embedding Model] โ†’ all-MiniLM-L6-v2 (384-dim vectors)
    โ†“
[FAISS Search] โ†’ Cosine similarity (IndexFlatIP)
    โ†“
[Top-5 Chunks] โ†’ Most relevant passages retrieved
    โ†“
[Gemini LLM] โ†’ Answer generation (temp=0.1, strict prompt)
    โ†“
[Response] โ†’ Answer + source citations + similarity scores

Core Components

DocumentProcessor

  • Extracts text from PDFs, TXT files, URLs
  • Uses LangChain SemanticChunker for context-aware splitting
  • Preserves semantic coherence across chunks
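
A minimal sketch of how this splitting step might be wired up (the embedding model mirrors the retrieval model used elsewhere in this README; the exact code in doc_support_agent.py may differ):

# Illustrative sketch of the chunking step; actual wiring may differ
from langchain_experimental.text_splitter import SemanticChunker
from langchain_huggingface import HuggingFaceEmbeddings  # or langchain_community.embeddings, depending on version

embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
chunker = SemanticChunker(embeddings)  # splits where embedding similarity drops between sentences

chunks = chunker.split_text(raw_document_text)  # list of semantically coherent passages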

VectorStore

  • sentence-transformers for embeddings
  • FAISS IndexFlatIP for fast cosine similarity search
  • Normalized vectors for semantic (not magnitude) comparison
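
For reference, building the store this way could look roughly like the following sketch (variable names are illustrative; chunks is the list from the chunking sketch above):

# Illustrative index construction; not the exact application code
import faiss
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(chunks)   # float32 array, shape (n_chunks, 384)
faiss.normalize_L2(embeddings)      # unit-length vectors: dot product == cosine similarity
index = faiss.IndexFlatIP(384)      # exact inner-product search
index.add(embeddings)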

AnswerGenerator

  • Gemini 2.5 Flash with strict source-only prompting
  • Temperature: 0.1 (low creativity = high factuality)
  • Mandatory source citations in responses

ChatBot

  • Orchestrates the full RAG pipeline
  • Manages document lifecycle (ingest/clear)
  • Coordinates retrieval and generation

🚀 Quick Start

Prerequisites

  • Python 3.8+ (check with python --version)
  • A Gemini API key (free tier available at https://ai.google.dev/)

Installation

  1. Install dependencies
pip install -r requirements.txt
  2. Run the application
streamlit run doc_support_agent.py
  3. Access the interface
  • Opens automatically at http://localhost:8501
  • Enter your Gemini API key when prompted

📖 Usage

Step 1: Initialize

Enter your Gemini API key in the text field. Wait for "✅ Chatbot initialized successfully!"

Step 2: Upload Documents

Choose from three options:

  • Upload PDF/TXT: Select local files
  • Enter URL: Paste webpage URLs for scraping
  • Paste Text: Directly input text content

Multiple documents can be added sequentially.

Step 3: Ask Questions

Type your question in the text field. The system will:

  1. Search for relevant chunks (semantic search)
  2. Generate answer using only those sources
  3. Display answer with source citations
  4. Show source excerpts with similarity scores

Step 4: Clear Documents (Optional)

Click "🗑️ Clear All Documents" to remove all ingested data and start fresh.


🛡️ Hallucination Prevention Strategy

Four-Layer Defense

Layer 1: Strict Prompting

"Answer ONLY using information from the sources below"
"DO NOT use any external knowledge"
"If sources don't contain enough info, say so clearly"

Layer 2: Low Temperature (0.1)

  • Minimizes LLM creativity and randomness
  • Makes responses more deterministic and grounded
  • Reduces likelihood of invented information

Layer 3: Mandatory Citations

  • LLM must reference [Source 1], [Source 2], etc.
  • Makes grounding transparent and verifiable
  • Easy to trace answers back to documents

Layer 4: Semantic Filtering

  • Only retrieves chunks above relevance threshold
  • Top-k retrieval (default: 5 chunks)
  • Prevents irrelevant context from confusing LLM
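
A minimal sketch of such a filter, assuming a hypothetical MIN_SCORE cutoff (the threshold value used in the app may differ):

# Illustrative relevance filter; MIN_SCORE is an assumed value
scores, ids = index.search(query_vec, 5)   # cosine scores for normalized vectors, top-5
MIN_SCORE = 0.3                            # hypothetical relevance threshold
relevant = [(i, s) for i, s in zip(ids[0], scores[0]) if s >= MIN_SCORE]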

🔬 Technical Deep Dive

Why Semantic Chunking?

Traditional fixed-size chunking (e.g., 1000 characters) often breaks mid-sentence or mid-thought. LangChain's SemanticChunker splits text based on semantic coherence:

# Traditional chunking problems:
"...Python supports OOP. |CHUNK BREAK| Python has simple syntax..."
# Context lost! Each chunk lacks full meaning.

# Semantic chunking preserves context:
"...Python supports OOP. Python has simple syntax..." 
# Complete thoughts stay together.

Why Normalize Vectors?

# Without normalization (Euclidean distance)
v1 = [0.5, 0.5]   # Short vector
v2 = [5.0, 5.0]   # Long vector, SAME direction
distance = 6.36   # Seems very different!

# With normalization (Cosine similarity)
faiss.normalize_L2(embeddings)
v1_norm = [0.707, 0.707]
v2_norm = [0.707, 0.707]
similarity = 1.0  # Correctly identifies as similar!

Key insight: For text, we care about semantic direction (meaning), not vector magnitude (arbitrary scale). Normalization + IndexFlatIP gives us pure cosine similarity.

Retrieval Pipeline

  1. Query encoding: Convert question to 384-dim embedding
  2. Normalization: L2-normalize query vector
  3. FAISS search: IndexFlatIP computes dot products (= cosine similarity for normalized vectors)
  4. Top-k selection: Return 5 most similar chunks
  5. Context building: Combine chunks for LLM
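
Put together, the query side of this pipeline might look like the following sketch (reusing the model, index, and chunks objects from the VectorStore example above):

# Illustrative query path; reuses model/index/chunks from the earlier sketch
query_vec = model.encode(["What is a list comprehension?"])  # step 1: 384-dim embedding
faiss.normalize_L2(query_vec)                                # step 2: L2-normalize
scores, ids = index.search(query_vec, 5)                     # steps 3-4: cosine top-5
context = "\n\n".join(chunks[i] for i in ids[0])             # step 5: build LLM context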

🔧 Configuration Options

Chunking (in DocumentProcessor)

# Automatic semantic-based chunking
# No manual chunk_size or overlap needed
# LangChain determines optimal boundaries

Retrieval (in VectorStore.search)

k = 5  # Number of chunks to retrieve
# Adjustable: chatbot.query(question, k=10)

LLM Generation (in AnswerGenerator)

generation_config = {
    "temperature": 0.1,        # Low = factual, high = creative
    "top_p": 0.9,             # Nucleus sampling
    "max_output_tokens": 1500  # Response length limit
}
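
For context, a sketch of how this config might be passed to the Gemini client (using the google-generativeai package; AnswerGenerator's actual call may differ):

# Illustrative Gemini call; exact usage in AnswerGenerator may differ
import google.generativeai as genai

genai.configure(api_key=GEMINI_API_KEY)   # key supplied via the Streamlit UI
model = genai.GenerativeModel("gemini-2.5-flash")
response = model.generate_content(prompt, generation_config=generation_config)
answer = response.text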

🎓 Technical Decisions & Trade-offs

Why sentence-transformers/all-MiniLM-L6-v2?

  • Speed: Fast inference, only 384 dimensions
  • Quality: Good semantic understanding for general text
  • Size: 80MB model (reasonable download)

Alternative: all-mpnet-base-v2 (768-dim, better quality, slower)

Why FAISS?

  • Performance: Millisecond search even with 100k+ vectors
  • Memory efficient: Optimized C++ implementation
  • Scalable: Supports billions of vectors
  • Industry standard: Developed by Meta AI

Alternative: Pinecone, Weaviate (managed services, more features)

Why Gemini 2.5 Flash?

  • Speed: Fast response times
  • Quality: Good instruction following
  • Cost: Free tier available
  • Reliability: Handles strict prompting well

Alternative: GPT-4, Claude (better quality, higher cost) or Transformer-based open-source models such as LiquidAI/LFM2-1.2B-RAG

Why LangChain SemanticChunker?

  • Context preservation: Doesn't split mid-thought
  • Semantic coherence: Uses embeddings to find boundaries
  • Better retrieval: More meaningful chunks = better matches

Alternative: Fixed-size chunking (simpler, less accurate)


📦 Project Structure

.
├── doc_support_agent.py   # Main Streamlit application
├── requirements.txt       # Python dependencies
├── README.md              # This file
└── .gitignore             # Git exclusions (API keys, etc.)

Class Hierarchy

ChatBot
  ├── DocumentProcessor (ingestion + chunking)
  ├── VectorStore (embeddings + search)
  ├── AnswerGenerator (LLM interface)
  └── Similar_source (formatting utilities)

🚨 Known Limitations

  1. In-memory storage: Documents cleared on app restart

    • Production fix: Use persistent vector DB (Pinecone, Weaviate)
  2. No conversation history: Each query is independent

    • Production fix: Implement chat memory with context window
  3. English-optimized: Model trained primarily on English

    • Production fix: Use multilingual models (paraphrase-multilingual)
  4. PDF quality dependent: Scanned PDFs won't extract text

    • Production fix: Add OCR (pytesseract, AWS Textract)
  5. Single-session: No user accounts or saved documents

    • Production fix: Add authentication and database storage

🔮 Future Enhancements

  • Re-ranking stage: Add cross-encoder for better precision
  • Conversation memory: Track dialogue context
  • Document versioning: Update docs without full re-index
  • Batch upload: Process multiple files simultaneously
  • Query caching: Store common question-answer pairs
  • Advanced filters: Filter by document source, date, etc.
  • Export functionality: Save Q&A pairs as markdown/PDF
  • Analytics: Track popular queries, retrieval quality

🧪 Testing Recommendations

Test Cases

  1. Clear Answer Test

    • Upload a Python tutorial
    • Ask: "What is a list comprehension?"
    • ✅ Should get a detailed answer with sources
  2. Hallucination Prevention Test

    • Same document
    • Ask: "How do I use React hooks?"
    • ✅ Should refuse (not in the Python docs)
  3. Multi-source Test

    • Upload multiple documents
    • Ask a question that spans them
    • ✅ Should synthesize from multiple sources
  4. Edge Cases

    • Empty query → validation error
    • No documents uploaded → warning message
    • Malformed PDF → graceful error handling
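
These cases could be scripted against the chatbot.query interface mentioned under Configuration Options; a rough smoke-test sketch (the ingestion call is a hypothetical name, not the app's confirmed API):

# Hypothetical smoke test; the ingestion method name is assumed
chatbot.ingest("python_tutorial.pdf")                  # hypothetical ingestion call
print(chatbot.query("What is a list comprehension?"))  # expect a sourced answer
print(chatbot.query("How do I use React hooks?"))      # expect a clear refusal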

✅ File/URL/Text Ingestion

  • PDF files (PyPDF2)
  • TXT files (native Python)
  • URLs (BeautifulSoup + requests)
  • Raw text (direct input)
  • Intelligent chunking (SemanticChunker)

✅ Embedding and Retrieval

  • HuggingFace model (sentence-transformers)
  • Vector database (FAISS in-memory)
  • Semantic search (cosine similarity)
  • No keyword matching (pure embeddings)

✅ Chatbot Interface

  • Question input
  • Strictly source-based answers
  • Source passage highlighting
  • Similarity scores displayed
  • Clean web UI (Streamlit)

✅ Hallucination Guardrails

  • Strict prompting
  • Low temperature (0.1)
  • Mandatory citations
  • Clear "insufficient information" responses
  • No invented content

✅ Code Quality

  • Modular structure (4 main classes)
  • Clear separation of concerns
  • Type hints (Pydantic models)
  • Error handling
  • Clean, readable code

๐Ÿ† What Makes This Solution Stand Out

  1. Modern RAG: Uses current best practices (semantic chunking, normalized vectors)
  2. Hallucination-resistant: Multiple layers of prevention keep answers grounded
  3. Clean code: Well-structured, typed, documented

📞 Support

For setup issues:

  1. Are all dependencies from requirements.txt installed?
  2. Is Python 3.8+ installed? Check with python --version
  3. Do you have a valid Gemini API key from https://ai.google.dev/?
  4. The first run downloads the embedding model (~80 MB); wait for it to complete.

👤 Author

Documentation Support Agent - Siddhi Pandya


Built with ❤️ for accurate, trustworthy document Q&A
