This repository provides a step-by-step walkthrough of the RAG (Retrieval-Augmented Generation) pipeline codebase.
The pipeline is implemented using a series of Jupyter notebooks. Follow the steps below to understand and run the pipeline.
Before you begin, ensure you have the following in place:
- Python 3.11.11
- Jupyter Notebook
- Required Python packages (listed in requirements.txt)
- A .env file: rename .env.example to .env (remove the .example suffix) and fill in the required values
- Clone the Repository

      git clone https://github.com/devzohaib/RAG_Pipeline.git
      cd RAG_Pipeline

- Install Dependencies

      pip install -r requirements.txt
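
Before opening the notebooks, it can help to confirm that the environment matches the prerequisites. The following is a minimal sketch using only the standard library; the variable name `OPENAI_API_KEY` is an assumption (check .env.example for the names the notebooks actually expect), and `parse_env_file` and `check_setup` are hypothetical helpers, not part of this repository.

```python
import os
import sys
from pathlib import Path


def parse_env_file(path):
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for raw_line in Path(path).read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip("\"'")
    return values


def check_setup(env_path=".env", required=("OPENAI_API_KEY",)):
    """Return a list of problems found; an empty list means the setup looks fine.

    The names in `required` are assumed, not taken from the repository.
    """
    problems = []
    if sys.version_info[:2] != (3, 11):
        problems.append(f"Python 3.11 expected, found {sys.version.split()[0]}")
    if not Path(env_path).exists():
        problems.append(f"{env_path} not found (rename .env.example to .env)")
        return problems
    values = parse_env_file(env_path)
    for name in required:
        # Accept a value from the file or from the process environment.
        if not values.get(name) and not os.environ.get(name):
            problems.append(f"{name} is not set")
    return problems
```

Calling `check_setup()` from the repository root returns a list of messages to act on; the notebooks themselves may load the .env file differently (for example with python-dotenv, if it is listed in requirements.txt).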
Notebook: 1-Data_Collection.ipynb
- Objective: Prepare and preprocess the dataset for the RAG pipeline.
- Steps:
- Load the dataset.
- Clean and preprocess the text data.
- Save the processed data for further use.
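
The three steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the notebook's actual code: it assumes the dataset is a JSON Lines file with a `text` field, and the cleaning rules (dropping HTML-like tags, collapsing whitespace) and the function names are hypothetical.

```python
import json
import re
from pathlib import Path


def clean_text(text):
    """Replace HTML-like tags with spaces, then collapse runs of whitespace."""
    text = re.sub(r"<[^>]+>", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def preprocess(input_path, output_path):
    """Load JSON Lines records, clean their 'text' field, and save the result.

    Returns the number of records written.
    """
    cleaned = []
    with open(input_path, encoding="utf-8") as src:
        for line in src:
            if not line.strip():
                continue  # skip blank lines in the input file
            record = json.loads(line)
            record["text"] = clean_text(record.get("text", ""))
            if record["text"]:  # drop records that are empty after cleaning
                cleaned.append(record)
    Path(output_path).parent.mkdir(parents=True, exist_ok=True)
    with open(output_path, "w", encoding="utf-8") as dst:
        for record in cleaned:
            dst.write(json.dumps(record, ensure_ascii=False) + "\n")
    return len(cleaned)
```

Writing the processed records to a separate file keeps the raw data untouched, so the embedding notebook can be rerun without repeating the cleaning step.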
Notebook: 2-Data_Embedding_and_Storage.ipynb
- Objective: Create embeddings for the processed dataset and store them in the vector store.
- Steps:
- Load a batch of the processed data.
- Create embeddings for the data using the OpenAI text-embedding-3-small embedding model.
- Add the data to the vector store.
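
A rough sketch of the batch, embed, and store loop is shown below. The embedding call uses the OpenAI Python SDK (v1 style) with the text-embedding-3-small model and expects `OPENAI_API_KEY` in the environment. Since the vector store used by the notebook is not named here, `InMemoryVectorStore` is a hypothetical stand-in, and the batch size of 100 is an arbitrary assumption.

```python
def batched(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def openai_embed(texts, model="text-embedding-3-small"):
    """Embed a list of strings with the OpenAI embeddings endpoint."""
    from openai import OpenAI  # imported lazily so the rest runs without the SDK

    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    response = client.embeddings.create(model=model, input=texts)
    return [item.embedding for item in response.data]


class InMemoryVectorStore:
    """Hypothetical stand-in for the real vector store used by the notebook."""

    def __init__(self):
        self.rows = []  # list of (text, vector) pairs

    def add(self, texts, vectors):
        if len(texts) != len(vectors):
            raise ValueError("texts and vectors must have the same length")
        self.rows.extend(zip(texts, vectors))


def embed_and_store(texts, store, embed_fn=openai_embed, batch_size=100):
    """Embed `texts` batch by batch and add each batch to `store`.

    Returns the total number of rows in the store afterwards.
    """
    for batch in batched(texts, batch_size):
        store.add(batch, embed_fn(batch))
    return len(store.rows)
```

Passing the embedding function as a parameter makes the loop easy to try offline, for example `embed_and_store(texts, InMemoryVectorStore(), embed_fn=lambda batch: [[0.0] for _ in batch])`, before spending API credits on the real model.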