
Local RAG Chatbot for ASWF Projects

Overview

This project is a hands-on guide to building a local Retrieval-Augmented Generation (RAG) system from scratch. The goal is to create a chatbot capable of answering questions about projects from the Academy Software Foundation (ASWF), using a knowledge base built from their official documentation and source code repositories.

The initial focus is on three key ASWF projects:

  • OpenColorIO (OCIO)
  • OpenImageIO (OIIO)
  • OpenEXR

The architecture is designed to be modular and extensible, so other knowledge sources (e.g., Pixar's Universal Scene Description, USD) can be added easily in the future.

This project serves as a practical learning exercise in Python, Machine Learning, and modern AI application development.

Tech Stack

The core technologies chosen for this project are:

  • Language: Python 3.11
  • Core Framework: LangChain for orchestrating the RAG pipeline.
  • LLM: Meta-Llama-3-8B-Instruct, run locally from a GGUF quantization.
  • Vector Database: ChromaDB for local, persistent storage and retrieval of text embeddings.
  • Embedding Model: Sentence-Transformers for generating high-quality text embeddings locally.
  • Frontend: Streamlit for the chatbot interface.
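
As a rough illustration of how these pieces fit together, the sketch below wires a persisted ChromaDB retriever and a local GGUF model into a question-answering chain. It assumes the langchain-community, langchain-huggingface, langchain-chroma, and llama-cpp-python packages, plus illustrative choices (embedding model, store path, prompt); the project's actual wiring lives in src/rag_chain.py and may differ.

from langchain_community.llms import LlamaCpp
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_chroma import Chroma
from langchain_core.prompts import PromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser

# Local sentence-transformers embeddings and the persisted Chroma store
embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")  # illustrative model choice
vectordb = Chroma(persist_directory="data/chroma", embedding_function=embeddings)  # path is an assumption
retriever = vectordb.as_retriever(search_kwargs={"k": 4})

# Local Llama 3 served through llama-cpp-python
llm = LlamaCpp(
    model_path="models/Meta-Llama-3-8B-Instruct.Q5_K_M.gguf",
    n_ctx=4096,
    temperature=0.1,
)

prompt = PromptTemplate.from_template(
    "Answer the question using only this context:\n{context}\n\nQuestion: {question}"
)

def format_docs(docs):
    # Concatenate the retrieved chunks into a single context string
    return "\n\n".join(d.page_content for d in docs)

chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

print(chain.invoke("What is OpenColorIO?"))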

Project Structure

The project follows a modular structure to keep the code organized and easy to test:

/
├── .venv/                  # The Python virtual environment
├── data/                   # For storing raw or processed data
├── input/
│   └── sources.txt         # List of URLs to scrape for the knowledge base
├── models/                 # For storing the local LLM model
├── notebooks/              # Jupyter notebooks for experimentation
├── src/                    # Main source code
│   ├── app.py              # The Streamlit chatbot application
│   ├── data_loader.py      # Scripts for loading and processing data
│   ├── rag_chain.py        # The core RAG chain logic
│   ├── vector_store.py     # Scripts for managing the ChromaDB instance
│   └── main.py             # Main application script for data ingestion
├── .gitignore              # Git ignore file
├── requirements.in         # Pip-tools input file for dependencies
├── requirements.txt        # Project dependencies
└── README.md               # This file

Getting Started

Follow these steps to set up your local development environment.

Prerequisites

  • Python 3.11
  • Git

Installation

  1. Clone the repository:
    git clone <repository-url>
    cd RagLangChain
  2. Create and activate a virtual environment:
    python -m venv .venv
    source .venv/bin/activate  # On Windows, use `.venv\Scripts\activate`
  3. Install the dependencies:
    pip install -r requirements.txt
  4. Download the LLM: Download the Meta-Llama-3-8B-Instruct.Q5_K_M.gguf model and place it in the models/ directory.
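
If the model is hosted on the Hugging Face Hub, one way to fetch it is with the huggingface_hub client. The repo id below (a community GGUF mirror) is an assumption; substitute whichever GGUF source you use.

from huggingface_hub import hf_hub_download

# Repo id is an assumption; any mirror providing this GGUF file will do.
hf_hub_download(
    repo_id="QuantFactory/Meta-Llama-3-8B-Instruct-GGUF",
    filename="Meta-Llama-3-8B-Instruct.Q5_K_M.gguf",
    local_dir="models",
)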

Usage

The project has two main parts: data ingestion and the chatbot application.

1. Data Ingestion

To build the knowledge base, you first need to ingest the data from the sources defined in input/sources.txt.
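
The sources file is assumed to hold one documentation URL per line, for example:

https://opencolorio.readthedocs.io/en/latest/
https://openimageio.readthedocs.io/en/latest/
https://openexr.com/en/latest/

With the sources in place, run the ingestion script: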

python src/main.py

This script will scrape the data, create embeddings, and store them in the ChromaDB vector store.
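
A minimal sketch of that flow, assuming LangChain's standard web loader and text splitter (the real logic lives in src/data_loader.py, src/vector_store.py, and src/main.py, and details such as chunk sizes and the persistence path are assumptions):

from langchain_community.document_loaders import WebBaseLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_chroma import Chroma

# Read one URL per line from the sources file
with open("input/sources.txt") as f:
    urls = [line.strip() for line in f if line.strip()]

# Scrape the pages and split them into overlapping chunks
docs = WebBaseLoader(urls).load()
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=150)
chunks = splitter.split_documents(docs)

# Embed the chunks locally and persist them in ChromaDB
embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")  # illustrative model choice
Chroma.from_documents(chunks, embeddings, persist_directory="data/chroma")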

2. Run the Chatbot

Once the data has been ingested, you can start the chatbot application.

streamlit run src/app.py

This will open a new tab in your browser with the chatbot interface.
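
For reference, a stripped-down version of such a Streamlit front end might look like the following; build_chain is a hypothetical helper standing in for whatever src/rag_chain.py actually exposes.

import streamlit as st
from rag_chain import build_chain  # hypothetical helper exposed by src/rag_chain.py

st.title("ASWF RAG Chatbot")

# Build the RAG chain once per session and reuse it across reruns
if "chain" not in st.session_state:
    st.session_state.chain = build_chain()

if question := st.chat_input("Ask about OCIO, OIIO, or OpenEXR"):
    with st.chat_message("user"):
        st.write(question)
    with st.chat_message("assistant"):
        st.write(st.session_state.chain.invoke(question))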

Current Status

The project is in a functional state. The data ingestion pipeline and the RAG-based chatbot are implemented. Future work could include adding more data sources, experimenting with different LLMs and embedding models, and improving the chatbot's user interface.
