The RAG pipeline shown in Figure 1 consists of four key phases:
- Embed Query: Converts the input query into a vector representation (embedding).
- Search and Retrieval: The query embedding is used to retrieve relevant documents from a corpus.
- Augmentation: The retrieved context is incorporated into the query.
- Referenced Generation: The final step, where the LLM generates a response using the augmented context.
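As a deliberately simplified illustration of the two phases this artifact targets, the sketch below embeds a query with a generic dense encoder and searches a prebuilt FAISS index; the model name and index file are placeholders, not the artifact's actual configuration.
# Sketch of the embed-query and search/retrieval phases (placeholders, not the artifact's setup).
import faiss
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")  # placeholder encoder

# Embed Query: convert the input text into a dense vector.
query = "Which gene is associated with cystic fibrosis?"
query_vec = model.encode([query]).astype("float32")

# Search and Retrieval: look up the nearest passages in a prebuilt index.
index = faiss.read_index("passages.faiss")     # hypothetical index file
distances, ids = index.search(query_vec, 10)   # top-10 passage ids to augment the prompt with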
This artifact provides a framework to evaluate the performance of different retrieval models using real-world biomedical datasets: PubMed (500K passages) and BioASQ (queries). Since the proposed polymorphic accelerator is designed to accelerate the query embedding and search/retrieval phases, this artifact focuses exclusively on these steps. The augmentation and referenced generation phases are not included in this artifact's scope, but the entire RAG pipeline can be run on AWS by following the instructions here. Keep in mind that to run LLaMA, you will need to obtain permission from Meta's website. You can find more information here. For the accelerator, we provide the simulator, since the RTL consists of 87K lines of code across TODO files, which exceeds the permissible file limits.
Overall, the evaluation takes 2-3 hours to complete.
Start by cloning the GitHub repository to your local machine:
git clone https://github.com/rohanmahapatra/ragx
cd ragx/artifact
We provide a pre-configured Docker container for setting up the environment. If you choose this method, you don't need to manually install dependencies. Follow the steps below:
Run the following command to check if Docker is installed:
which docker
If Docker is installed, this command will return its path (e.g., /usr/bin/docker). If nothing is returned, you need to install Docker.
For Ubuntu/Debian, run:
sudo apt update
sudo apt install -y docker.io
For CentOS/RHEL, run:
sudo yum install -y docker
For Mac (using Homebrew):
brew install --cask docker
For Windows, install Docker Desktop from Docker's official website.
If Docker is installed but not running, start it with:
sudo systemctl start docker
sudo systemctl enable docker
If running docker requires sudo, add yourself to the Docker group to run it without sudo:
sudo usermod -aG docker $USER
newgrp docker
Then try:
docker --version
If you encounter permission denied while trying to connect to the Docker daemon socket, follow these steps:
- Ensure the Docker daemon is running:
sudo systemctl status docker
If it's not running, start it:
sudo systemctl start docker
- Re-add your user to the Docker group (if necessary):
sudo usermod -aG docker $USER
newgrp docker
Then log out and log back in, or restart your system.
- Use BuildKit instead of the deprecated legacy builder:
sudo apt install docker-buildx-plugin -y
export DOCKER_BUILDKIT=1
- Test Docker permissions:
docker run hello-world
If this runs successfully, your permissions are correctly set.
- Check the Docker logs for errors:
journalctl -u docker --no-pager | tail -50
More details on using Docker can be found here.
Once Docker is set up, proceed with the following steps to build and run the Docker container in interactive mode:
- Build the Docker image:
./build_docker.sh
- Run the Docker container:
- For GPU-enabled setup:
./run_docker_gpu.sh
- For CPU-only setup:
./run_docker_cpu.sh
Note: If you encounter issues with GPUs not being available in the Docker container, refer to this StackOverflow link to resolve the error.
If you choose to manually set up the environment, follow these instructions. Note that when using Docker, you do not need to install packages or set paths—just proceed to running the scripts.
- Create a conda environment:
conda env create -f environment.yml
conda activate artifact
- Install other dependencies:
sudo apt update
sudo apt install openjdk-17-jdk
pip install pyserini==0.22.0
For our evaluation, we used publicly available datasets: PubMed (biomedical passages) and BioASQ (biomedical queries). These datasets are essential for generating embeddings and running the benchmarks.
We have included a shell script to download the required datasets:
./download_datasets.sh
This script will first download the PubMed biomedical passage dataset, which is composed of ~50 million passages (~25GB). Next, it will extract the first 500K passages to create a smaller version of the dataset. Finally, the BioASQ dataset, consisting of 3,800 biomedical queries, will be downloaded. This process will take ~20 minutes.
- Download PubMed and BioASQ datasets:
cd ragx/dataset
python3 download_pubmed.py
python3 download_bioasq.py
To scale the PubMed dataset to 500K passages, follow these steps:
- Shrink the dataset:
python3 shrink_pubmed.py
- Create a directory for the 500K dataset:
mkdir pubmed_500K
mv pubmed_corpus_500K.jsonl pubmed_500K/
The PubMed (500K) and BioASQ datasets should now be ready for benchmarking.
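For reference, the shrinking step conceptually just keeps the first 500K passages of the full corpus; a minimal sketch is shown below (file names are assumptions; shrink_pubmed.py is the authoritative implementation).
# Keep only the first 500K passages of the full corpus (illustrative sketch).
N = 500_000
with open("pubmed_corpus.jsonl") as src, open("pubmed_corpus_500K.jsonl", "w") as dst:
    for i, line in enumerate(src):
        if i >= N:
            break
        dst.write(line)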
We use five benchmarks in our evaluation: BM25, SPLADEv2, ColBERT, Doc2Vec, and GTR. Below are the instructions to generate the necessary databases for each of these benchmarks.
To generate the databases for BM25, ColBERT, Doc2Vec, and GTR (SPLADEv2 is not included because it is not part of the functional artifact), run the following shell script:
./build_databases.sh
The generated databases are in the /app/benchmarks directory. For BM25, the database is an inverted index composed of posting lists and is constructed using Pyserini. For ColBERT, Doc2Vec, and GTR, the databases are HNSW-based and constructed using Meta's FAISS.
Note! This process involves embedding all the passages and then constructing the corresponding databases. If you do not have GPUs available, this process can be extremely time-consuming (>6 hours). We have also uploaded the databases to Hugging Face, so you can run the following script to download them and populate the /app/benchmarks directory:
./download_databases.sh
The databases will be in the respective benchmark's folder within the /app/benchmarks directory.
Note! This process should only take 5-10 minutes. If the download seems "stuck", it may have completed and the progress bars may just be overlapping the terminal prompt; press Enter a few times to resolve this.
- Generate the BM25 database:
cd ragx/benchmarks/BM25/
python -m pyserini.index --collection JsonCollection --input ../../dataset/pubmed_500K --index bm25_pubmed_500K --generator DefaultLuceneDocumentGenerator --storePositions --storeDocvectors --storeRaw --storeContents
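As an optional sanity check (a sketch, not one of the artifact scripts), you can query the freshly built index with Pyserini's LuceneSearcher and confirm it returns hits:
# Quick check that the BM25 index answers queries.
from pyserini.search.lucene import LuceneSearcher

searcher = LuceneSearcher("bm25_pubmed_500K")
hits = searcher.search("breast cancer treatment", k=5)
for hit in hits:
    print(hit.docid, round(hit.score, 2))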
The SPLADEv2 setup is complex and requires additional setup from the SPLADE GitHub repository. Since SPLADE is challenging to set up, we do not consider it part of the functional artifact. However, if you choose to run it, here’s the command format:
python3 -m splade.index --config.index_dir=experiments/pubmed/index500K --data.COLLECTION_PATH=/home/santhanam/rag/baseline/pubmed500K
Note! Change the "corpus_file" path in create_hnsw_colbert.py to the location of pubmed_corpus_500K.jsonl, which is in the dataset/pubmed_500K/ folder.
- Generate the ColBERT HNSW database:
cd ragx/benchmarks/ColBERT
python3 create_hnsw_colbert.py
Note! Change the "corpus_file" path in create_hnsw_doc2vec.py to the location of pubmed_corpus_500K.jsonl, which is in the dataset/pubmed_500K/ folder.
- Generate the Doc2Vec HNSW database:
cd ragx/benchmarks/Doc2Vec
python3 create_hnsw_doc2vec.py
Note! Change the "corpus_file" path in create_hnsw_gtr.py to the location of pubmed_corpus_500K.jsonl, which is in the dataset/pubmed_500K/ folder.
- Generate the GTR HNSW database:
cd ragx/benchmarks/GTR
python3 create_hnsw_gtr.py
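Conceptually, each create_hnsw_* script reads the corpus, embeds every passage with that retriever's encoder, and writes a FAISS HNSW index. The sketch below shows the general pattern; the encoder, the JSON field name, and the output path are assumptions, and the artifact scripts remain authoritative.
# General pattern behind the create_hnsw_* scripts (a sketch with assumed names).
import json
import faiss
from sentence_transformers import SentenceTransformer

corpus_file = "dataset/pubmed_500K/pubmed_corpus_500K.jsonl"   # set to your local path
passages = [json.loads(line)["contents"] for line in open(corpus_file)]  # field name may differ

model = SentenceTransformer("sentence-transformers/gtr-t5-base")  # placeholder encoder
emb = model.encode(passages, batch_size=64, show_progress_bar=True).astype("float32")

index = faiss.IndexHNSWFlat(emb.shape[1], 32)   # 32 neighbors per node in the HNSW graph
index.add(emb)
faiss.write_index(index, "gtr_pubmed_500K.faiss")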
Note! Generating databases may take a long time depending on how many GPUs are available. To download the databases from our Hugging Face repository, use the following script (when not in a Docker container):
cd ragx/
./download_databases_manual.sh
To recreate the CPU-DRAM performance results, we provide scripts that run retrieval for each benchmark.
To run the CPU-DRAM baseline evaluation, run the following script:
./run_cpu_dram.sh
The generated results will be in /app/baseline-cpu-dram/cpu_dram_results/ and consist of four CSV files (one per retriever benchmark). Each file records the embedding and search latency for 100 queries from the BioASQ dataset.
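A quick way to inspect these results is to summarize the CSVs with pandas; this is only a convenience sketch, and the exact column names depend on your run.
# Summarize the per-query latency CSVs produced by the baseline run.
import glob
import pandas as pd

for path in glob.glob("/app/baseline-cpu-dram/cpu_dram_results/*.csv"):
    df = pd.read_csv(path)
    print(path)
    print(df.describe())   # distribution of embedding and search latencies across queries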
- Navigate to the baseline (CPU-DRAM) directory:
cd /app/baseline-cpu-dram/
- Open bm25_cpu_dram_retrieve.py and set queries_file to the path of the BioASQ dataset and index_path to the BM25 index.
- Run the retrieval process:
python3 bm25_cpu_dram_retrieve.py
For each benchmark, the process is similar:
- SPLADEv2: Not part of the functional artifact.
- ColBERT: Open colbert_cpu_dram_retrieve.py and set queries_file to the path of the BioASQ dataset and index_path to the ColBERT index (.faiss file). In addition, set model_name_custom to the path of the ColBERT model found in benchmark/ColBERT/. Then run:
python3 colbert_cpu_dram_retrieve.py
- Doc2Vec: Open doc2vec_cpu_dram_retrieve.py and set queries_file to the path of the BioASQ dataset and hnsw_path to the Doc2Vec index (.faiss file). Set model_path to the path of doc2vec_model found in benchmark/Doc2Vec, and corpus_file to the path of the pubmed_500K jsonl file. Then run:
python3 doc2vec_cpu_dram_retrieve.py
- GTR: Open gtr_cpu_dram_retrieve.py and set queries_file to the path of the BioASQ dataset and output_index_file to the GTR index (.faiss file). Then run:
python3 gtr_cpu_dram_retrieve.py
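For intuition, the dense-retriever baselines all follow the same measurement pattern: encode each query, search the FAISS index, and record the latency of both phases. The sketch below mirrors that pattern with placeholder paths, a placeholder encoder, and an assumed query field name; the artifact scripts handle the real details.
# Per-query embedding and search latency measurement (a sketch with placeholder names).
import json
import time
import faiss
from sentence_transformers import SentenceTransformer

queries_file = "dataset/bioasq_queries.jsonl"                     # placeholder path
index_path = "benchmarks/GTR/gtr_pubmed_500K.faiss"               # placeholder path

model = SentenceTransformer("sentence-transformers/gtr-t5-base")  # placeholder encoder
index = faiss.read_index(index_path)

for line in open(queries_file):
    query = json.loads(line)["question"]                          # field name is an assumption
    t0 = time.perf_counter()
    vec = model.encode([query]).astype("float32")
    t1 = time.perf_counter()
    _, ids = index.search(vec, 10)
    t2 = time.perf_counter()
    print(f"embed: {(t1 - t0) * 1e3:.1f} ms, search: {(t2 - t1) * 1e3:.1f} ms")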
To generate results for RAGX, we instrument the search functions for both keyword-based and embedding-based retrievers to produce trace files. These trace files record the search process through the index and are used by the RAGX simulator. For example, with HNSW-based databases, the trace file contains the nodes visited and scored during the graph traversal. For inverted index based databases, the trace file contains the posting lists scored and the sizes of these posting lists.
The trace files for the 500K dataset are located in the /app/baseline-cpu-dram/traces directory. These traces can be used to simulate results in the RAGX simulator.
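To inspect a trace before simulation, it can be loaded as ordinary JSON; the exact schema is defined by the artifact's instrumentation, so the access below is only illustrative.
# Peek at a trace file; the schema (keys, nesting) comes from the artifact's instrumentation.
import json

with open("/app/baseline-cpu-dram/traces/bm25_query1_500K_trace.json") as f:
    trace = json.load(f)
print(type(trace), len(trace))   # size of the top-level structure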
The RAGX simulator processes trace files generated from the baseline execution and estimates the latency for query embedding, search, and retrieval phases. The simulator requires compiled kernel files to execute its computations.
Navigate to the simulator directory and run the simulation script:
cd ragx.simulator/
./run_ragx_simulations.sh
The run_ragx_simulations.sh script automates the entire process of preparing the simulator, running the simulations, and analyzing the results. Here's what the script does step-by-step:
- Prepares the compiled kernels: The script first navigates to the ragx.simulator/ directory and extracts the compiled kernel files from the compiled_kernels.zip archive. These kernels are essential for the simulation to compute the latencies for different operations.
- Runs simulations for different configurations: The script defines a set of configuration files (e.g., bm25-500K.yaml, splade-500K.yaml, etc.) and their corresponding trace files (e.g., bm25_query1_500K_trace.json, spladev2_query1_500K_trace.json, etc.).
  - It loops over each configuration-trace pair and runs the simulation using the eurekastore.py script.
  - The configuration file specifies the simulator settings (e.g., model parameters, batch size), while the trace file contains the query and retrieval operations.
  - For each trace file, the script calculates the number of points or neighbors that need to be scored, invoking the simulator to estimate the compute latency for each query and retrieval operation.
  This process is repeated for all configurations and traces.
- Generates simulation logs: As the simulations run, detailed logs are saved in the simulation_logs/ directory. These logs contain information on the simulation's progress and output.
- Analyzes and collates the results: After all simulations are complete, the script automatically analyzes the generated logs using the analyze_simulated_logs.py script. This step processes the raw simulation data and compiles it into a final results file (ragx-results.csv) that contains the estimated latency measurements for each configuration and trace.
The simulation process will take approximately 1 hour to complete, depending on the size of the trace files. This is due to the following steps for each simulation:
- The script accesses the trace file to identify the number of points or neighbors that need to be scored.
- It invokes the simulator to estimate the compute latency for each query and retrieval operation.
- The process is repeated for all configurations and trace files, and because trace files are typically large, this can take a significant amount of time.
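In essence, the script pairs each configuration file with its trace and launches one simulation per pair. The loop below is a Python rendering of that idea; the file layout and the eurekastore.py arguments are assumptions, so consult run_ragx_simulations.sh for the actual invocation.
# One simulation per configuration-trace pair (argument style is an assumption).
import subprocess

pairs = [
    ("bm25-500K.yaml", "traces/bm25_query1_500K_trace.json"),
    ("splade-500K.yaml", "traces/spladev2_query1_500K_trace.json"),
    # ... remaining configuration-trace pairs
]

for config, trace in pairs:
    # Hypothetical positional arguments; mirror the shell script for the real flags.
    subprocess.run(["python3", "eurekastore.py", config, trace], check=True)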
To evaluate different configurations, follow these steps:
- Generate new execution traces: You can create new traces for different configurations (e.g., using different models or parameters).
- Compile each kernel/layer: Ensure that the necessary kernel files for the new configuration are compiled.
- Run the simulator: Execute the run_ragx_simulations.sh script to evaluate the performance of the new configuration.
The simulator will process these new traces and configurations, providing you with latency measurements that can be compared to baseline results.
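If you instrument your own retriever to produce new traces, the idea is to record, per query, which candidates were scored during the search and write that out as JSON. The field names below are hypothetical; match the schema of the provided traces in /app/baseline-cpu-dram/traces.
# Write a trace for one query (field names are hypothetical; follow the provided traces' schema).
import json

trace = {
    "query_id": 1,
    "visited_nodes": [1042, 877, 15603],     # e.g., HNSW nodes scored during graph traversal
    "posting_list_sizes": [1203, 57, 884],   # e.g., for keyword (inverted-index) retrievers
}
with open("my_new_trace.json", "w") as f:
    json.dump(trace, f)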
Then navigate back to the simulator directory and re-run the simulations:
cd ragx.simulator/
./run_ragx_simulations.sh
Once you are finished with the Docker container, exit the interactive mode by typing:
exit
This will close the Docker container's shell.
To stop the running Docker container, use the following command:
docker stop <container_name>
Replace <container_name> with the actual name of your running container (e.g., ragx-container). You can find the container name by running:
docker ps
This will display the currently running containers, and you can identify the name of the container to stop.
If you wish to completely remove the Docker container after stopping it, you can run:
docker rm <container_name>
This will remove the stopped container from your system. This step is optional and can be skipped if you plan to reuse the container later.
This artifact offers a framework for assessing in-storage acceleration in Retrieval-Augmented Generation. By following the provided guidelines, users can systematically benchmark different retrieval models and compare them against conventional baselines. The included simulator facilitates detailed performance analysis, enabling a comprehensive evaluation of the proposed accelerator.