This project is a hands-on guide to building a local Retrieval-Augmented Generation (RAG) system from scratch. The goal is to create a chatbot capable of answering questions about projects from the Academy Software Foundation (ASWF), using a knowledge base built from their official documentation and source code repositories.
The initial focus is on three key ASWF projects:
- OpenColorIO (OCIO)
- OpenImageIO (OIIO)
- OpenEXR
The architecture is designed to be modular and extensible, allowing for the easy addition of other knowledge sources in the future (e.g., Pixar's Universal Scene Description, USD).
This project serves as a practical learning exercise in Python, Machine Learning, and modern AI application development.
The core technologies chosen for this project are:
- Language: Python 3.11
- Core Framework: LangChain for orchestrating the RAG pipeline.
- LLM: Meta-Llama-3-8B-Instruct, run locally in quantized GGUF format
- Vector Database: ChromaDB for local, persistent storage and retrieval of text embeddings.
- Embedding Model: Sentence-Transformers for generating high-quality text embeddings locally.
- Frontend: Streamlit for the chatbot interface.
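
To make the stack concrete, here is a minimal sketch of how these pieces connect. All paths, model names, and parameters are illustrative assumptions; the actual wiring lives in `src/rag_chain.py` and `src/vector_store.py`.

```python
# Minimal end-to-end sketch (illustrative, not the project's exact code).
from langchain.chains import RetrievalQA
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.llms import LlamaCpp
from langchain_community.vectorstores import Chroma

embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")  # a Sentence-Transformers model
db = Chroma(persist_directory="data/chroma", embedding_function=embeddings)  # assumed path
llm = LlamaCpp(model_path="models/Meta-Llama-3-8B-Instruct.Q5_K_M.gguf", n_ctx=8192)
qa = RetrievalQA.from_chain_type(llm=llm, retriever=db.as_retriever())

print(qa.invoke({"query": "What is OpenColorIO?"})["result"])
```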
The project follows a modular structure to keep the code organized and easy to test:
```
/
├── .venv/              # The Python virtual environment
├── data/               # For storing raw or processed data
├── input/
│   └── sources.txt     # List of URLs to scrape for the knowledge base (example below)
├── models/             # For storing the local LLM model
├── notebooks/          # Jupyter notebooks for experimentation
├── src/                # Main source code
│   ├── app.py          # The Streamlit chatbot application
│   ├── data_loader.py  # Scripts for loading and processing data
│   ├── rag_chain.py    # The core RAG chain logic
│   ├── vector_store.py # Scripts for managing the ChromaDB instance
│   └── main.py         # Main application script for data ingestion
├── .gitignore          # Git ignore file
├── requirements.in     # Pip-tools input file for dependencies
├── requirements.txt    # Project dependencies
└── README.md           # This file
```
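
`input/sources.txt` drives the ingestion step and holds one URL per line. The entries below are illustrative examples of ASWF documentation roots you might point it at:

```
https://opencolorio.readthedocs.io/en/latest/
https://openimageio.readthedocs.io/en/latest/
https://openexr.com/en/latest/
```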
Follow these steps to set up your local development environment. You will need Python 3.11 and Git installed.
- Clone the repository:

```bash
git clone <repository-url>
cd RagLangChain
```
- Create and activate a virtual environment:

```bash
python -m venv .venv
source .venv/bin/activate  # On Windows, use `.venv\Scripts\activate`
```
- Install the dependencies:

```bash
pip install -r requirements.txt
```
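
For reference, `requirements.txt` is compiled from `requirements.in` with pip-tools. A plausible `requirements.in` for the stack above might look like this (the exact package set is an assumption; regenerate the lock file with `pip-compile`):

```
langchain
langchain-community
chromadb
sentence-transformers
llama-cpp-python
streamlit
```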
- Download the LLM: place the `Meta-Llama-3-8B-Instruct.Q5_K_M.gguf` model file in the `models/` directory.
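
Before wiring the model into the chain, you can sanity-check the download by loading it directly with llama-cpp-python (a quick illustrative test; the prompt and parameters are arbitrary):

```python
from llama_cpp import Llama

# Smoke test: confirm the downloaded GGUF file loads and can generate text.
llm = Llama(model_path="models/Meta-Llama-3-8B-Instruct.Q5_K_M.gguf", n_ctx=2048)
out = llm("Q: What does RAG stand for? A:", max_tokens=32)
print(out["choices"][0]["text"])
```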
The project has two main parts: data ingestion and the chatbot application.
To build the knowledge base, you first need to ingest the data from the sources defined in `input/sources.txt`:

```bash
python src/main.py
```

This script will scrape the data, create embeddings, and store them in the ChromaDB vector store.
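
In outline, the ingestion step does something like the following. This is an illustrative sketch, not the project's exact code: the chunk sizes, embedding model, and persist path are assumptions, and the real logic is split across `src/main.py`, `src/data_loader.py`, and `src/vector_store.py`.

```python
# Illustrative outline of the ingestion pipeline: scrape -> split -> embed -> store.
from langchain_community.document_loaders import WebBaseLoader
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_text_splitters import RecursiveCharacterTextSplitter

with open("input/sources.txt") as f:
    urls = [line.strip() for line in f if line.strip()]

docs = WebBaseLoader(urls).load()  # fetch and parse each source page
chunks = RecursiveCharacterTextSplitter(
    chunk_size=1000, chunk_overlap=150  # sizes are assumptions
).split_documents(docs)

Chroma.from_documents(
    chunks,
    HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2"),
    persist_directory="data/chroma",  # assumed on-disk location
)
```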
Once the data has been ingested, you can start the chatbot application:

```bash
streamlit run src/app.py
```

This will open a new tab in your browser with the chatbot interface.
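
The front end only needs Streamlit's chat primitives on top of the RAG chain. A minimal shape is sketched below; it is illustrative, and `build_chain` is a hypothetical helper standing in for whatever `src/rag_chain.py` actually exposes.

```python
# Minimal shape of the Streamlit front end (illustrative; the real app is src/app.py).
import streamlit as st
from rag_chain import build_chain  # hypothetical helper in src/rag_chain.py

@st.cache_resource  # build the chain once, not on every Streamlit rerun
def get_chain():
    return build_chain()

st.title("ASWF Docs Chatbot")

if question := st.chat_input("Ask about OCIO, OIIO, or OpenEXR"):
    with st.chat_message("user"):
        st.write(question)
    with st.chat_message("assistant"):
        st.write(get_chain().invoke({"query": question})["result"])
```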
The project is in a functional state. The data ingestion pipeline and the RAG-based chatbot are implemented. Future work could include adding more data sources, experimenting with different LLMs and embedding models, and improving the chatbot's user interface.