This project implements a Retrieval-Augmented Generation (RAG) pipeline using modern AI/ML tools. It is designed to process various document types (JSON, PDF, DOCX, etc.), chunk the content, convert it into a unified document format, generate embeddings, and store them in a vector database (ChromaDB) for efficient semantic search and retrieval.
- Multi-format File Loading: Supports JSON, PDF, DOCX, and more via flexible loaders.
- Document Chunking: Splits large documents into manageable text chunks for better embedding and retrieval.
- Document Conversion: Converts raw file content into a standardized document format for downstream processing.
- Embeddings Generation: Uses SentenceTransformer models to create high-quality vector representations of text chunks.
- Vector Database Storage: Stores embeddings in ChromaDB for fast similarity search and retrieval (a minimal indexing sketch follows this list).
- RAG Search: Integrates with Groq LLM via LangChain for advanced question answering and summarization over retrieved context.
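
The feature set above boils down to a short indexing pass. Below is a minimal sketch of that pass; the file pattern, collection name, storage path, and embedding model (`all-MiniLM-L6-v2`) are illustrative assumptions rather than this project's actual configuration, and it reads plain-text files only where the real loaders also handle PDF, DOCX, and JSON.

```python
# Minimal indexing sketch: load -> chunk -> embed -> store.
# File pattern, storage path, collection name, and embedding model
# are assumptions for illustration, not this project's configuration.
from pathlib import Path

import chromadb
from langchain_text_splitters import RecursiveCharacterTextSplitter
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)

client = chromadb.PersistentClient(path="chroma_db")  # assumed storage path
collection = client.get_or_create_collection("documents")  # assumed name

for file in Path("data").glob("*.txt"):  # plain text only, for brevity
    text = file.read_text(encoding="utf-8")
    chunks = splitter.split_text(text)
    embeddings = model.encode(chunks).tolist()
    collection.add(
        ids=[f"{file.stem}-{i}" for i in range(len(chunks))],
        documents=chunks,
        embeddings=embeddings,
        metadatas=[{"source": file.name}] * len(chunks),
    )
```

Recursive character splitting tries progressively smaller separators (paragraphs, then lines, then words) so chunks stay under the size limit without cutting mid-sentence more often than necessary.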
- Load Documents: All supported files in the `data/` directory are loaded and parsed.
- Chunk Documents: Each document is split into smaller text chunks using recursive character splitting.
- Convert to Document Format: Chunks are wrapped in a document object for embedding.
- Embed Chunks: Each chunk is embedded using a SentenceTransformer model.
- Store in ChromaDB: Embeddings and metadata are stored in a persistent ChromaDB vector store.
- Semantic Search & RAG: Queries are answered by retrieving relevant chunks and generating LLM-based summaries (see the retrieval sketch below).
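
At query time, the last two steps look roughly like the sketch below. It is hypothetical: the collection name and embedding model must match whatever was used at indexing time, and the Groq model name (`llama-3.1-8b-instant`) is an assumption; `ChatGroq` reads `GROQ_API_KEY` from the environment.

```python
# Minimal retrieval + generation sketch. The collection name, embedding
# model, and Groq model are assumptions for illustration only.
import chromadb
from langchain_core.prompts import ChatPromptTemplate
from langchain_groq import ChatGroq
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # must match indexing
collection = chromadb.PersistentClient(path="chroma_db").get_collection("documents")

def rag_answer(question: str) -> str:
    # Embed the query and fetch the most similar chunks from ChromaDB.
    query_embedding = model.encode([question]).tolist()
    results = collection.query(query_embeddings=query_embedding, n_results=3)
    context = "\n\n".join(results["documents"][0])

    # Ask the Groq-hosted LLM to answer using only the retrieved context.
    prompt = ChatPromptTemplate.from_template(
        "Answer the question using only this context:\n{context}\n\nQuestion: {question}"
    )
    llm = ChatGroq(model="llama-3.1-8b-instant")  # assumed model name
    chain = prompt | llm
    return chain.invoke({"context": context, "question": question}).content

print(rag_answer("What do these documents say about embeddings?"))
```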
- Python 3.12
- LangChain
- SentenceTransformers
- ChromaDB
- Groq LLM
- dotenv
- Place your files (PDF, DOCX, JSON, etc.) in the `data/` directory.
- Set your Groq API key in the `.env` file (loaded at startup, as sketched below): `GROQ_API_KEY=your_actual_groq_api_key_here`
- Run `main.py` to build the vector database and start searching.
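
For reference, here is a sketch of how `main.py` might pick up the key from `.env` at startup; the guard and its message are illustrative, not this project's actual code.

```python
# Sketch of loading the Groq key from .env before any LangChain/Groq calls.
import os

from dotenv import load_dotenv

load_dotenv()  # reads .env from the working directory into os.environ
if not os.getenv("GROQ_API_KEY"):
    raise SystemExit("GROQ_API_KEY is not set; add it to your .env file.")
```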