We recommend using a clean Conda environment. Our experiments were conducted with Python 3.11, CUDA 12.4, and 8 NVIDIA A100-80GB GPUs.
```bash
conda create -n TRACE python=3.11
conda activate TRACE
```

All scripts are provided in the `scripts/` folder.
Run the following script to install required dependencies:
```bash
bash install_environment.sh
```

This script does the following:

```bash
cd ..
pip install -r requirements.txt
cd trl
pip install -e .
pip install flash-attn --no-build-isolation
cd ../open-r1
pip install -e ".[dev]"
```

Note:
- We use and modify the following open-source repositories in our codebase: `trl` and `open-r1`.
- Modified versions of these libraries are included in our repository and installed from source.
- If installing `flash-attn` fails, please manually download and install the correct version from: https://github.com/Dao-AILab/flash-attention/releases
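After installation you can quickly verify that the package imports correctly. A minimal sanity check (note: the pip package `flash-attn` installs the Python module `flash_attn`):

```python
def flash_attn_available() -> bool:
    """Return True if the flash_attn module can be imported."""
    try:
        import flash_attn  # noqa: F401  (pip package "flash-attn")
        return True
    except ImportError:
        return False

if __name__ == "__main__":
    print("flash-attn available:", flash_attn_available())
```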
⚠️ Important:
We recommend setting up your environment using the exact versions and procedures we provide.
Using different package versions or environments may lead to unexpected issues or failure to run the code properly.
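A quick interpreter check can catch version drift early. This is a minimal sketch, assuming you want a strict match against the Python 3.11 used in our experiments:

```python
import sys

def python_matches(expected=(3, 11)) -> bool:
    """Return True if the running interpreter matches the expected (major, minor) version."""
    return sys.version_info[:2] == expected

if __name__ == "__main__":
    if not python_matches():
        print(f"Warning: running Python {sys.version_info[0]}.{sys.version_info[1]}, "
              "but the code was tested with 3.11.")
```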
Run the following script to download the three multi-hop QA datasets used in our experiments:
```bash
bash download_data.sh
```

This internally runs:

```bash
python ../pikerag/main.py ../pikerag/data_process/config/datasets.yaml
```

We have modified PIKE-RAG's data downloader to correctly handle:
- HotpotQA
- 2WikiMultiHopQA
- MuSiQue
Run the following script to download the base LLMs. For example:
```bash
bash download_model.sh
```

This script uses ModelScope to download Qwen:

```bash
modelscope download --model Qwen/Qwen2.5-7B-Instruct --local_dir ../model/Qwen2.5-7B-Instruct
```

Run the following script to preprocess data for training:
```bash
bash run_data_processing.sh
```

This will process all datasets into the format required by our GRPO training setup:
```bash
# Inside run_data_processing.sh
names=("hotpotqa" "two_wiki" "musique")
train_limits=(10000 10000 5000)
test_limit=500

for i in "${!names[@]}"; do
    name=${names[$i]}
    train_limit=${train_limits[$i]}
    echo "Running data processing for $name..."
    python ../src/data_generator.py --name "$name" --train-limit "$train_limit" --test-limit "$test_limit"
    python ../src/datamaker_conversation.py --name "$name" --testfile-name "test"
done

python ../src/datamaker_grpo.py --name "hotpotqa" "two_wiki" "musique" --trainfile-name "train" --saved-name "grpo_25000"
```

Run the following script to start training with our GRPO framework:
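The saved name `grpo_25000` reflects the sum of the per-dataset train limits set in the script; a quick check:

```python
# Per-dataset train limits from run_data_processing.sh
train_limits = {"hotpotqa": 10_000, "two_wiki": 10_000, "musique": 5_000}

total = sum(train_limits.values())
print(total)  # 25000, hence the merged file name "grpo_25000"
```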
```bash
bash run_grpo.sh
```

The core command:
```bash
ACCELERATE_LOG_LEVEL=info accelerate launch --config_file ../configs/deepspeed_zero2.yaml \
    --num_processes=7 ../src/grpo.py \
    --config ../configs/grpo.yaml
```

We use DeepSpeed ZeRO-2 for efficient multi-GPU training. `--num_processes=7` launches seven training processes on the 8-GPU machine, presumably leaving the eighth GPU free (e.g., for the generation engine).
All inference and evaluation results are provided in the results/ folder.
To perform inference on the test set using the base model (before training), we use the vLLM engine.
Run the following script:
```bash
bash run_inference.sh
```

This internally calls:
```bash
python ../src/vllm_inference.py \
    --name "hotpotqa" \
    --testdata-name "test" \
    --saved-name "test_base_500" \
    --model-path "../model/Qwen2.5-7B-Instruct"
```
You can evaluate the inference results using either rule-based metrics or LLM-as-a-judge (LJ):
```bash
bash run_evaluate_rulebase.sh
```

This runs:

```bash
python ../src/evaluate.py \
    --name "hotpotqa" \
    --result-name "test_base_500"
```

```bash
bash run_evaluate_gpt.sh
```

This runs:
```bash
python ../src/gpt_eval.py \
    --name "hotpotqa" \
    --result-name "test_base_500"
```

Note:
For `gpt_eval.py`, make sure you have set your OpenAI API key via the `OPENAI_API_KEY` environment variable.
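For reference, rule-based QA evaluation typically reduces to exact match (EM) and token-level F1 over normalized answer strings. The sketch below uses standard SQuAD-style normalization; it is an illustration, not necessarily what `evaluate.py` implements:

```python
import re
import string
from collections import Counter

def normalize(text: str) -> str:
    """Lowercase, drop articles and punctuation, collapse whitespace (SQuAD-style)."""
    text = text.lower()
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    text = "".join(ch for ch in text if ch not in string.punctuation)
    return " ".join(text.split())

def exact_match(pred: str, gold: str) -> float:
    return float(normalize(pred) == normalize(gold))

def f1_score(pred: str, gold: str) -> float:
    pred_toks, gold_toks = normalize(pred).split(), normalize(gold).split()
    common = Counter(pred_toks) & Counter(gold_toks)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_toks)
    recall = overlap / len(gold_toks)
    return 2 * precision * recall / (precision + recall)

print(exact_match("The Eiffel Tower", "eiffel tower"))  # 1.0
print(round(f1_score("Paris, France", "Paris"), 3))     # 0.667
```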
We also provide code for reproducing the supervised fine-tuning (SFT) method described in our paper.
```bash
bash run_sft_prepare.sh
```

This script includes the following steps:
```bash
# For each dataset
names=("hotpotqa" "two_wiki" "musique")
train_limits=(10000 10000 5000)
test_limit=500

for i in "${!names[@]}"; do
    name=${names[$i]}
    train_limit=${train_limits[$i]}
    echo "Running sft preparation for $name..."
    python ../src/data_generator_sft.py --name "$name" --train-limit "$train_limit" --test-limit "$test_limit"
    python ../src/vllm_inference_sft.py --name "$name" --testdata-name "train_sft_first_step" --saved-name "train_sft_first_step" --model-path "../model/Qwen2.5-7B-Instruct"
done

# Merge into final training file
python ../src/datamaker_sft.py --name "hotpotqa" "two_wiki" "musique" --trainfile-name "train_sft_first_step" --saved-name "sft_25000"
```

Then run:

```bash
bash run_sft.sh
```

This launches supervised fine-tuning with DeepSpeed ZeRO-2:
```bash
accelerate launch \
    --config_file ../configs/deepspeed_zero2.yaml \
    ../src/sft.py \
    --model_name_or_path ../model/Qwen2.5-7B-Instruct \
    --dataset_name ../data/data_train/sft/sft_25000.jsonl \
    --per_device_train_batch_size 4 \
    --output_dir ../checkpoints/Qwen2.5-7B-Instruct-SFT_25000 \
    --bf16 True \
    --gradient_accumulation_steps 8 \
    --num_train_epochs 1 \
    --logging_steps 1 \
    --eval_strategy steps \
    --eval_steps 100 \
    --learning_rate 1e-5 \
    --max_grad_norm 0.3 \
    --warmup_ratio 0.1 \
    --torch_dtype bfloat16 \
    --gradient_checkpointing True
```

You can evaluate the trained SFT model using the same evaluation scripts as for the GRPO models (see the Evaluation section above).
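With these flags, the effective global batch size is the per-device batch size times the gradient-accumulation steps times the number of processes. Assuming all 8 GPUs participate in SFT (an assumption; the script does not pin `--num_processes`):

```python
per_device_train_batch_size = 4
gradient_accumulation_steps = 8
num_processes = 8  # assumption: all 8 A100s are used for SFT

effective_batch_size = (per_device_train_batch_size
                        * gradient_accumulation_steps
                        * num_processes)
print(effective_batch_size)  # 256 sequences per optimizer step
```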