RAG System Implementation

A comprehensive Retrieval-Augmented Generation (RAG) system built in Python for evaluating different chunking strategies, embedding models, and retrieval configurations on text corpora.

Overview

This project implements a complete RAG pipeline that:

  • Chunks text documents using configurable token-based splitting
  • Generates embeddings using sentence transformers
  • Performs semantic retrieval using cosine similarity (see the sketch after this list)
  • Evaluates performance with multiple metrics (Recall, Precision, IoU)
  • Provides comprehensive analysis of different parameter configurations
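
For orientation, here is a minimal sketch of the embedding and retrieval stages using the same libraries the project depends on (sentence_transformers and sklearn). The function and variable names are illustrative, not the repository's actual API:

from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer("all-MiniLM-L6-v2")

def retrieve(question, chunks, top_k=5):
    """Return the top_k chunks most similar to the question."""
    chunk_vecs = model.encode(chunks)        # one vector per chunk
    query_vec = model.encode([question])     # single query vector
    scores = cosine_similarity(query_vec, chunk_vecs)[0]
    best = scores.argsort()[::-1][:top_k]    # indices of top scores, best first
    return [chunks[i] for i in best]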

Features

  • Flexible Chunking: Configurable chunk sizes and overlap strategies
  • Semantic Embeddings: Uses all-MiniLM-L6-v2 for high-quality text representations
  • Multiple Evaluation Metrics: Recall, Precision, and Intersection over Union (IoU)
  • Comprehensive Testing: Automated experiments across different parameter combinations
  • Visual Results: Export evaluation results as formatted tables and PNG images

Installation

  1. Clone the repository:
git clone <repository-url>
cd RAG
  2. Install dependencies:
pip install -r requirements.txt

Usage

Quick Start

Run the complete evaluation pipeline:

python main.py

This will execute experiments with different chunk sizes (100-800 tokens), overlap settings (0-400 tokens), and top-k values (2, 5, 10).

Key Components

  • main.py — Entry point to run the full evaluation pipeline
  • load_data.py — Loads corpora and questions into the pipeline
  • fixed_token_chunker.py — Token-based text chunker with configurable overlap (sketched after this list)
  • embedder.py — Text embedding generation using sentence transformers
  • retrieval.py — Semantic retrieval using cosine similarity
  • evaluation.py — Evaluation metrics computation (Recall, Precision, IoU)
  • final_evaluation.py — End-to-end evaluation orchestration
  • export_table.py — Export results to PNG table format
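
As a rough illustration of what fixed_token_chunker.py does, the sketch below splits text into fixed-size token windows with tiktoken. The encoding name and function signature are assumptions, not the module's actual interface:

import tiktoken

def chunk_text(text, chunk_size=400, chunk_overlap=200):
    """Split text into windows of chunk_size tokens; consecutive
    windows share chunk_overlap tokens."""
    assert chunk_overlap < chunk_size, "overlap must be smaller than the chunk"
    enc = tiktoken.get_encoding("cl100k_base")  # assumed encoding
    tokens = enc.encode(text)
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(enc.decode(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # the final window already reaches the end of the text
    return chunks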

Data Structure

├── corpora/
│   └── wikitexts.txt          # Text corpus for retrieval
├── questions/
│   └── questions_df.csv       # Annotated questions with gold references
├── result_tables/
│   └── final_evaluation_results.png  # Exported evaluation results
└── REPORT.md                  # Detailed findings and analysis
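
Loading this data is the job of load_data.py. A hedged sketch of the equivalent pandas code follows; the CSV column names ("question", "references") are assumptions about questions_df.csv:

import pandas as pd

with open("corpora/wikitexts.txt", encoding="utf-8") as f:
    corpus = f.read()

questions_df = pd.read_csv("questions/questions_df.csv")
questions = questions_df["question"].tolist()    # assumed column name
gold_refs = questions_df["references"].tolist()  # assumed column name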

Configuration

The system evaluates multiple configurations automatically:

  • Chunk Sizes: 100, 200, 300, 400, 800 tokens
  • Overlap: 0, 200, 400 tokens
  • Top-k Retrieval: 2, 5, 10 results

Custom Configuration

To run with custom parameters, modify the experiment calls in main.py:

run_experiment(chunk_size=400, chunk_overlap=200, top_k=5)
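
To sweep the full documented grid instead of a single configuration, a loop like the following could be used. It assumes run_experiment is importable from main.py and that combinations where the overlap is not smaller than the chunk size should be skipped:

from main import run_experiment  # assumed import path

for chunk_size in (100, 200, 300, 400, 800):
    for chunk_overlap in (0, 200, 400):
        if chunk_overlap >= chunk_size:
            continue  # overlap must be smaller than the chunk itself
        for top_k in (2, 5, 10):
            run_experiment(chunk_size=chunk_size,
                           chunk_overlap=chunk_overlap,
                           top_k=top_k)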

Evaluation Metrics

  • Recall: Proportion of relevant chunks retrieved out of all relevant chunks
  • Precision: Proportion of retrieved chunks that were actually relevant
  • IoU: Intersection over Union between retrieved and gold standard chunks
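
These metrics reduce to simple set arithmetic. The sketch below treats retrieved and gold references as sets of chunk identifiers; evaluation.py may instead compare token spans, but the definitions are the same:

def recall(retrieved, relevant):
    """|retrieved ∩ relevant| / |relevant|"""
    return len(set(retrieved) & set(relevant)) / len(set(relevant))

def precision(retrieved, relevant):
    """|retrieved ∩ relevant| / |retrieved|"""
    return len(set(retrieved) & set(relevant)) / len(set(retrieved))

def iou(retrieved, relevant):
    """|retrieved ∩ relevant| / |retrieved ∪ relevant|"""
    r, g = set(retrieved), set(relevant)
    return len(r & g) / len(r | g)

For example, retrieving {c1, c2} when the gold set is {c2, c3} gives recall 0.5, precision 0.5, and IoU 1/3.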

Results

The system generates:

  • Console output with detailed metrics for each experiment
  • Tabulated results showing all configurations and their performance
  • PNG export of results table for easy sharing and presentation
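
The PNG export in export_table.py can be approximated with matplotlib's table artist. The sketch below assumes the results arrive as a pandas DataFrame; the output path follows the repository layout but is otherwise illustrative:

import matplotlib.pyplot as plt

def export_table(df, path="result_tables/final_evaluation_results.png"):
    fig, ax = plt.subplots(figsize=(10, 0.4 * len(df) + 1))
    ax.axis("off")                       # hide axes; show only the table
    table = ax.table(cellText=df.values, colLabels=df.columns, loc="center")
    table.auto_set_font_size(False)
    table.set_fontsize(9)
    fig.savefig(path, dpi=200, bbox_inches="tight")
    plt.close(fig)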

Key Findings

Based on comprehensive evaluation:

  • Larger chunk sizes (400-800 tokens) consistently outperform smaller ones
  • Chunk overlap significantly improves recall and IoU scores
  • Lower top-k values improve precision but reduce recall (precision-recall tradeoff)
  • Optimal balanced configuration: 800 tokens, 400 overlap, top-k=5
  • Best high-precision configuration: 800 tokens, 400 overlap, top-k=2

Dependencies

  • pandas - Data manipulation and analysis
  • sentence_transformers - Text embedding generation
  • numpy - Numerical computations
  • scikit-learn (sklearn) - Cosine similarity calculations
  • tiktoken - Token counting and chunking
  • tabulate - Results table formatting
  • matplotlib - Chart generation and PNG export

License

This project is available under the MIT License.

Author

Max0072
