An AI-powered document-based Question Answering (QA) and Summarization chatbot built for insurance policy PDFs. It uses advanced language models and retrieval techniques to help users understand dense policy documents via natural language questions and concise summaries.
- **📄 PDF Parsing with Context Awareness**: Extracts structured data from PDFs using `pdfplumber` and `PyMuPDF`, retaining formatting such as headings and bolded phrases.
- **✂️ Chunking with Contextual Embedding**: Text is split using `RecursiveCharacterTextSplitter`, and each chunk is prefixed with its parent heading and highlights, improving semantic understanding.
- **🧠 Embeddings via SentenceTransformers**: Text chunks are embedded using `all-MiniLM-L6-v2` for high-quality semantic similarity matching.
- **📦 ChromaDB Vector Store**: Stores vector representations of document chunks persistently using `Chroma`.
- **🔍 Contextual Compression Retriever**: A two-stage retrieval process: a similarity-based retriever followed by an LLM compressor that filters results for relevance.
- **🧠 Mistral-7B QA LLM**: Uses `mistralai/Mistral-7B-Instruct-v0.1` to generate accurate, explainable answers based on the compressed context.
- **🌐 Web Search Tool**: Falls back to DuckDuckGo for external search when the document lacks enough information.
- **💡 Summary Generation**: Users can generate a document summary highlighting key coverage points, exclusions, and important clauses using the LLM, helpful for quick overviews.
- **💾 Disk-Based Caching**: Query responses are cached using `diskcache` to improve performance on repeated searches.
- **🖥️ Streamlit Interface**: A polished web interface built with Streamlit supports PDF uploads, QA, and summary generation in real time.
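The chunking-with-context step can be illustrated with a small pure-Python sketch. The actual pipeline uses LangChain's `RecursiveCharacterTextSplitter`; the `chunk_with_context` function and its inputs below are hypothetical stand-ins showing how a parent heading is prefixed onto each chunk:

```python
def chunk_with_context(sections, chunk_size=200):
    """Split each section's text into rough fixed-size chunks and prefix
    each chunk with its parent heading, mimicking the contextual-chunking
    step (illustrative stand-in for RecursiveCharacterTextSplitter)."""
    chunks = []
    for heading, text in sections:
        current = []
        for word in text.split():
            current.append(word)
            if len(" ".join(current)) >= chunk_size:
                chunks.append(f"[{heading}] " + " ".join(current))
                current = []
        if current:  # flush any trailing words as a final chunk
            chunks.append(f"[{heading}] " + " ".join(current))
    return chunks
```

Because every chunk carries its heading, the embedding model sees "Exclusions: pre-existing conditions..." rather than a bare sentence, which improves similarity matching for section-specific questions.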
| Layer | Tool/Library |
|---|---|
| Text Extraction | pdfplumber, fitz (PyMuPDF) |
| Chunking | LangChain.text_splitter |
| Embeddings | sentence-transformers |
| Vector Store | ChromaDB |
| LLM | Mistral-7B-Instruct via HuggingFaceHub |
| Retrieval | LangChain retrievers + compression |
| Web Search | DuckDuckGoSearchRun |
| Caching | diskcache |
| UI | Streamlit |
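The first stage of retrieval is a similarity search over the stored embeddings. A minimal pure-Python sketch of that stage is below (in the real pipeline Chroma performs this search, and the second stage, LLM-based compression, then filters the retrieved chunks; the function names here are illustrative):

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity between two equal-length embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def retrieve_top_k(query_vec, chunk_vecs, k=2):
    # Rank chunk embeddings by similarity to the query and keep the best k.
    ranked = sorted(range(len(chunk_vecs)),
                    key=lambda i: cosine_similarity(query_vec, chunk_vecs[i]),
                    reverse=True)
    return ranked[:k]
```

The compression stage matters because the top-k chunks often contain boilerplate; passing only LLM-filtered sentences to Mistral-7B keeps the context window focused on relevant clauses.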
```
.
├── final_project.ipynb   # Core QA & summary pipeline
├── app.py                # Streamlit web application
├── chroma_store/         # Persistent Chroma vector store
├── cache/                # Disk-based cache storage
├── requirements.txt      # Python dependencies
└── README.md             # Project documentation
```
To set up the project:

```bash
git clone https://github.com/your-username/insurance-qa-chatbot.git
cd insurance-qa-chatbot
python -m venv venv
source venv/bin/activate  # or venv\Scripts\activate on Windows
pip install -r requirements.txt
```

Example usage from the pipeline:

```python
query = "What are the exclusions under the critical illness plan?"
results = document_search_tool(query, vectorstore)
print(results)

# To generate a summary
summary = generate_summary(vectorstore)
print(summary)
```

To launch the web interface:

```bash
streamlit run app.py
```

- 📤 Upload any insurance PDF
- 💬 Ask questions interactively
- 📌 Click "Generate Summary" to get a concise explanation of the entire policy
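The web-search fallback described in the features can be sketched as a simple routing decision. This is an illustrative outline only; `doc_search` and `web_search` are hypothetical stand-ins for the document retriever and `DuckDuckGoSearchRun`:

```python
def answer_with_fallback(query, doc_search, web_search, min_hits=1):
    """Try the document store first; fall back to web search when too
    few relevant chunks are found. All helpers are hypothetical."""
    hits = doc_search(query)
    if len(hits) >= min_hits:
        return {"source": "document", "results": hits}
    # Not enough in-document evidence: route the query to web search.
    return {"source": "web", "results": web_search(query)}
```

Keeping the fallback behind a threshold ensures answers are grounded in the uploaded policy whenever possible, and only reach out to the web when the document genuinely lacks coverage.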
