This project implements a Retrieval-Augmented Generation (RAG) pipeline using modern AI/ML tools. It is designed to process various document types (JSON, PDF, DOCX, etc.), chunk the content, convert it into a unified document format, generate embeddings, and store them in a vector database (ChromaDB) for efficient semantic search and retrieval.
- Multi-format File Loading: Supports JSON, PDF, DOCX, and more via flexible loaders.
- Document Chunking: Splits large documents into manageable text chunks for better embedding and retrieval.
- Document Conversion: Converts raw file content into a standardized document format for downstream processing.
- Embeddings Generation: Uses SentenceTransformer models to create high-quality vector representations of text chunks.
- Vector Database Storage: Stores embeddings in ChromaDB for fast similarity search and retrieval (a minimal indexing sketch follows this list).
- RAG Search: Integrates with Groq LLM via LangChain for advanced question answering and summarization over retrieved context.
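
The feature set above boils down to a short indexing pass. Below is a minimal sketch of that pass; the file pattern, collection name, storage path, and embedding model (`all-MiniLM-L6-v2`) are illustrative assumptions rather than this project's actual configuration, and it reads plain-text files only where the real loaders also handle PDF, DOCX, and JSON.

```python
# Minimal indexing sketch: load -> chunk -> embed -> store.
# File pattern, storage path, collection name, and embedding model
# are assumptions for illustration, not this project's configuration.
from pathlib import Path

import chromadb
from langchain_text_splitters import RecursiveCharacterTextSplitter
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)

client = chromadb.PersistentClient(path="chroma_db")  # assumed storage path
collection = client.get_or_create_collection("documents")  # assumed name

for file in Path("data").glob("*.txt"):  # plain text only, for brevity
    text = file.read_text(encoding="utf-8")
    chunks = splitter.split_text(text)
    embeddings = model.encode(chunks).tolist()
    collection.add(
        ids=[f"{file.stem}-{i}" for i in range(len(chunks))],
        documents=chunks,
        embeddings=embeddings,
        metadatas=[{"source": file.name}] * len(chunks),
    )
```

Recursive character splitting tries progressively smaller separators (paragraphs, then lines, then words) so chunks stay under the size limit without cutting mid-sentence more often than necessary.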
- Load Documents: All supported files in the `data/` directory are loaded and parsed.
- Chunk Documents: Each document is split into smaller text chunks using recursive character splitting.
- Convert to Document Format: Chunks are wrapped in a document object for embedding.
- Embed Chunks: Each chunk is embedded using a SentenceTransformer model.
- Store in ChromaDB: Embeddings and metadata are stored in a persistent ChromaDB vector store.
- Semantic Search & RAG: Queries are answered by retrieving relevant chunks and generating LLM-based summaries (see the retrieval sketch below).
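
At query time, the last two steps look roughly like the sketch below. It is hypothetical: the collection name and embedding model must match whatever was used at indexing time, and the Groq model name (`llama-3.1-8b-instant`) is an assumption; `ChatGroq` reads `GROQ_API_KEY` from the environment.

```python
# Minimal retrieval + generation sketch. The collection name, embedding
# model, and Groq model are assumptions for illustration only.
import chromadb
from langchain_core.prompts import ChatPromptTemplate
from langchain_groq import ChatGroq
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # must match indexing
collection = chromadb.PersistentClient(path="chroma_db").get_collection("documents")

def rag_answer(question: str) -> str:
    # Embed the query and fetch the most similar chunks from ChromaDB.
    query_embedding = model.encode([question]).tolist()
    results = collection.query(query_embeddings=query_embedding, n_results=3)
    context = "\n\n".join(results["documents"][0])

    # Ask the Groq-hosted LLM to answer using only the retrieved context.
    prompt = ChatPromptTemplate.from_template(
        "Answer the question using only this context:\n{context}\n\nQuestion: {question}"
    )
    llm = ChatGroq(model="llama-3.1-8b-instant")  # assumed model name
    chain = prompt | llm
    return chain.invoke({"context": context, "question": question}).content

print(rag_answer("What do these documents say about embeddings?"))
```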
- Python 3.12
- LangChain
- SentenceTransformers
- ChromaDB
- Groq LLM
- dotenv
- Place your files (PDF, DOCX, JSON, etc.) in the `data/` directory.
- Set your Groq API key in the `.env` file (loaded at startup, as sketched below): `GROQ_API_KEY=your_actual_groq_api_key_here`
- Run `main.py` to build the vector database and start searching.
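
For reference, here is a sketch of how `main.py` might pick up the key from `.env` at startup; the guard and its message are illustrative, not this project's actual code.

```python
# Sketch of loading the Groq key from .env before any LangChain/Groq calls.
import os

from dotenv import load_dotenv

load_dotenv()  # reads .env from the working directory into os.environ
if not os.getenv("GROQ_API_KEY"):
    raise SystemExit("GROQ_API_KEY is not set; add it to your .env file.")
```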