A comprehensive Retrieval-Augmented Generation (RAG) system built in Python for evaluating different chunking strategies, embedding models, and retrieval configurations on text corpora.
This project implements a complete RAG pipeline that:
- Chunks text documents using configurable token-based splitting
- Generates embeddings using sentence transformers
- Performs semantic retrieval using cosine similarity
- Evaluates performance with multiple metrics (Recall, Precision, IoU)
- Provides comprehensive analysis of different parameter configurations
- Flexible Chunking: Configurable chunk sizes and overlap strategies
- Semantic Embeddings: Uses all-MiniLM-L6-v2 for high-quality text representations
- Multiple Evaluation Metrics: Recall, Precision, and Intersection over Union (IoU)
- Comprehensive Testing: Automated experiments across different parameter combinations
- Visual Results: Export evaluation results as formatted tables and PNG images
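To make the fixed-token chunking mentioned above concrete, here is a minimal sketch of token-based splitting with overlap, assuming tiktoken's cl100k_base encoding; the function name and default values are illustrative and not the project's actual API:

```python
import tiktoken

def chunk_text(text: str, chunk_size: int = 400, chunk_overlap: int = 200) -> list[str]:
    """Split text into chunks of at most chunk_size tokens,
    stepping forward by chunk_size - chunk_overlap tokens each time."""
    enc = tiktoken.get_encoding("cl100k_base")  # assumed encoding; the project may use another
    tokens = enc.encode(text)
    step = max(chunk_size - chunk_overlap, 1)
    chunks = []
    for start in range(0, len(tokens), step):
        window = tokens[start:start + chunk_size]
        if not window:
            break
        chunks.append(enc.decode(window))
        if start + chunk_size >= len(tokens):
            break
    return chunks
```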
- Clone the repository:
  `git clone <repository-url>`
  `cd RAG`
- Install dependencies:
  `pip install -r requirements.txt`

Run the complete evaluation pipeline:

`python main.py`

This will execute experiments with different chunk sizes (100-800 tokens), overlap settings (0-400 tokens), and top-k values (2, 5, 10).
- main.py — Entry point to run the full evaluation pipeline
- load_data.py — Loads corpora and questions into the pipeline
- fixed_token_chunker.py — Token-based text chunker with configurable overlap
- embedder.py — Text embedding generation using sentence transformers
- retrieval.py — Semantic retrieval using cosine similarity
- evaluation.py — Evaluation metrics computation (Recall, Precision, IoU)
- final_evaluation.py — End-to-end evaluation orchestration
- export_table.py — Export results to PNG table format
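The embedding and retrieval steps can be illustrated with a short sketch built on sentence-transformers and scikit-learn's cosine similarity; the function names below are hypothetical and only approximate what embedder.py and retrieval.py do:

```python
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer("all-MiniLM-L6-v2")

def embed(texts: list[str]) -> np.ndarray:
    """Encode a list of texts into a (n, 384) embedding matrix."""
    return model.encode(texts)

def retrieve(question: str, chunks: list[str], chunk_embeddings: np.ndarray, top_k: int = 5) -> list[str]:
    """Return the top_k chunks most similar to the question by cosine similarity."""
    q_emb = model.encode([question])
    scores = cosine_similarity(q_emb, chunk_embeddings)[0]
    best = np.argsort(scores)[::-1][:top_k]
    return [chunks[i] for i in best]
```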
├── corpora/
│ └── wikitexts.txt # Text corpus for retrieval
├── questions/
│ └── questions_df.csv # Annotated questions with gold references
├── result_tables/
│ └── final_evaluation_results.png # Exported evaluation results
└── REPORT.md # Detailed findings and analysis
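As a rough idea of how load_data.py might read these files, here is a minimal sketch assuming pandas and the paths shown above; the helper names are illustrative:

```python
import pandas as pd

def load_corpus(path: str = "corpora/wikitexts.txt") -> str:
    """Read the raw text corpus used for chunking and retrieval."""
    with open(path, encoding="utf-8") as f:
        return f.read()

def load_questions(path: str = "questions/questions_df.csv") -> pd.DataFrame:
    """Read the annotated questions with their gold references."""
    return pd.read_csv(path)
```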
The system evaluates multiple configurations automatically:
- Chunk Sizes: 100, 200, 300, 400, 800 tokens
- Overlap: 0, 200, 400 tokens
- Top-k Retrieval: 2, 5, 10 results
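A sketch of how such a parameter sweep could be wired up in main.py; the loop structure and the import location of run_experiment are assumptions:

```python
from itertools import product

# Hypothetical import; in the real project run_experiment may be defined in
# main.py or final_evaluation.py rather than imported like this.
from final_evaluation import run_experiment

# Parameter grid mirroring the configurations listed above
chunk_sizes = [100, 200, 300, 400, 800]
overlaps = [0, 200, 400]
top_ks = [2, 5, 10]

for chunk_size, chunk_overlap, top_k in product(chunk_sizes, overlaps, top_ks):
    if chunk_overlap >= chunk_size:
        continue  # skip settings where the overlap is at least as large as the chunk
    run_experiment(chunk_size=chunk_size, chunk_overlap=chunk_overlap, top_k=top_k)
```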
To run with custom parameters, modify the experiment calls in main.py:
`run_experiment(chunk_size=400, chunk_overlap=200, top_k=5)`

Each experiment is scored with the following metrics:

- Recall: Proportion of relevant chunks retrieved out of all relevant chunks
- Precision: Proportion of retrieved chunks that were actually relevant
- IoU: Intersection over Union between retrieved and gold standard chunks
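Treating the retrieved and gold-standard chunks as sets, the three metrics reduce to simple set arithmetic. A minimal sketch (the set-of-IDs framing is an assumption about how the project matches chunks to references):

```python
def evaluate(retrieved: set[int], gold: set[int]) -> dict[str, float]:
    """Compute Recall, Precision and IoU over sets of relevant chunk IDs/positions."""
    overlap = len(retrieved & gold)
    union = len(retrieved | gold)
    return {
        "recall": overlap / len(gold) if gold else 0.0,
        "precision": overlap / len(retrieved) if retrieved else 0.0,
        "iou": overlap / union if union else 0.0,
    }

# Example: 3 of 4 gold positions appear among 5 retrieved positions
print(evaluate({1, 2, 3, 4, 5}, {3, 4, 5, 6}))  # recall 0.75, precision 0.6, iou 0.5
```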
The system generates:
- Console output with detailed metrics for each experiment
- Tabulated results showing all configurations and their performance
- PNG export of results table for easy sharing and presentation
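One way export_table.py could produce both the formatted console table and the PNG, using tabulate and matplotlib's table artist; the column layout and figure sizing here are illustrative:

```python
import matplotlib.pyplot as plt
import pandas as pd
from tabulate import tabulate

def export_results(results: pd.DataFrame, path: str = "result_tables/final_evaluation_results.png") -> None:
    """Print a formatted results table and save the same data as a PNG image."""
    print(tabulate(results, headers="keys", tablefmt="github", showindex=False))
    fig, ax = plt.subplots(figsize=(10, 0.4 * len(results) + 1))
    ax.axis("off")
    table = ax.table(cellText=results.values, colLabels=results.columns, loc="center")
    table.scale(1, 1.4)
    fig.savefig(path, dpi=200, bbox_inches="tight")
    plt.close(fig)
```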
Based on comprehensive evaluation:
- Larger chunk sizes (400-800 tokens) consistently outperform smaller ones
- Chunk overlap significantly improves recall and IoU scores
- Lower top-k values improve precision but reduce recall (precision-recall tradeoff)
- Optimal balanced configuration: 800 tokens, 400 overlap, top-k=5
- Best high-precision configuration: 800 tokens, 400 overlap, top-k=2
- pandas - Data manipulation and analysis
- sentence_transformers - Text embedding generation
- numpy - Numerical computations
- sklearn - Cosine similarity calculations
- tiktoken - Token counting and chunking
- tabulate - Results table formatting
- matplotlib - Chart generation and PNG export
This project is available under the MIT License.
Max0072