We recommend using a clean Conda environment. Our experiments were conducted with Python 3.11, CUDA 12.4, and 8 NVIDIA A100-80GB GPUs.
```bash
conda create -n TRACE python=3.11
conda activate TRACE
```

All scripts are provided in the `scripts/` folder.
Run the following script to install required dependencies:
```bash
bash install_environment.sh
```

This script does the following:

```bash
cd ..
pip install -r requirements.txt
cd trl
pip install -e .
pip install flash-attn --no-build-isolation
cd ../open-r1
pip install -e ".[dev]"
```

Note:
- We use and modify the following open-source repositories in our codebase: `trl` and `open-r1`.
- Modified versions of these libraries are included in our repository and installed from source.
- If installing `flash-attn` fails, please manually download and install the correct version from: https://github.com/Dao-AILab/flash-attention/releases
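After installation you can quickly verify that the package imports correctly. A minimal sanity check (note: the pip package `flash-attn` installs the Python module `flash_attn`):

```python
def flash_attn_available() -> bool:
    """Return True if the flash_attn module can be imported."""
    try:
        import flash_attn  # noqa: F401  (pip package "flash-attn")
        return True
    except ImportError:
        return False

if __name__ == "__main__":
    print("flash-attn available:", flash_attn_available())
```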
⚠️ Important:
We recommend setting up your environment using the exact versions and procedures we provide.
Using different package versions or environments may lead to unexpected issues or failure to run the code properly.
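A quick interpreter check can catch version drift early. This is a minimal sketch, assuming you want a strict match against the Python 3.11 used in our experiments:

```python
import sys

def python_matches(expected=(3, 11)) -> bool:
    """Return True if the running interpreter matches the expected (major, minor) version."""
    return sys.version_info[:2] == expected

if __name__ == "__main__":
    if not python_matches():
        print(f"Warning: running Python {sys.version_info[0]}.{sys.version_info[1]}, "
              "but the code was tested with 3.11.")
```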
Run the following script to download the three multi-hop QA datasets used in our experiments:
```bash
bash download_data.sh
```

This internally runs:

```bash
python ../pikerag/main.py ../pikerag/data_process/config/datasets.yaml
```

We have modified PIKE-RAG's data downloader to correctly handle:
- HotpotQA
- 2WikiMultiHopQA
- MuSiQue
Run the following script to download the base LLMs. For example:
```bash
bash download_model.sh
```

This script uses ModelScope to download Qwen:

```bash
modelscope download --model Qwen/Qwen2.5-7B-Instruct --local_dir ../model/Qwen2.5-7B-Instruct
```

Run the following script to preprocess data for training:
```bash
bash run_data_processing.sh
```

This will process all datasets into the format required by our GRPO training setup:
```bash
# Inside run_data_processing.sh
names=("hotpotqa" "two_wiki" "musique")
train_limits=(10000 10000 5000)
test_limit=500

for i in "${!names[@]}"; do
    name=${names[$i]}
    train_limit=${train_limits[$i]}
    echo "Running data processing for $name..."
    python ../src/data_generator.py --name "$name" --train-limit "$train_limit" --test-limit "$test_limit"
    python ../src/datamaker_conversation.py --name "$name" --testfile-name "test"
done

python ../src/datamaker_grpo.py --name "hotpotqa" "two_wiki" "musique" --trainfile-name "train" --saved-name "grpo_25000"
```

Run the following script to start training with our GRPO framework:
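The saved name `grpo_25000` reflects the sum of the per-dataset train limits set in the script; a quick check:

```python
# Per-dataset train limits from run_data_processing.sh
train_limits = {"hotpotqa": 10_000, "two_wiki": 10_000, "musique": 5_000}

total = sum(train_limits.values())
print(total)  # 25000, hence the merged file name "grpo_25000"
```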
```bash
bash run_grpo.sh
```

The core command:
```bash
ACCELERATE_LOG_LEVEL=info accelerate launch --config_file ../configs/deepspeed_zero2.yaml \
    --num_processes=7 ../src/grpo.py \
    --config ../configs/grpo.yaml
```

We use DeepSpeed ZeRO-2 for efficient multi-GPU training. `--num_processes=7` launches seven training processes on the 8-GPU machine, presumably leaving the eighth GPU free (e.g., for the generation engine).
All inference and evaluation results are provided in the results/ folder.
To perform inference on the test set using the base model (before training), we use the vLLM engine.
Run the following script:
```bash
bash run_inference.sh
```

This internally calls:
```bash
python ../src/vllm_inference.py \
    --name "hotpotqa" \
    --testdata-name "test" \
    --saved-name "test_base_500" \
    --model-path "../model/Qwen2.5-7B-Instruct"
```
You can evaluate the inference results using either rule-based metrics or LLM-as-a-judge (LJ):
```bash
bash run_evaluate_rulebase.sh
```

This runs:

```bash
python ../src/evaluate.py \
    --name "hotpotqa" \
    --result-name "test_base_500"
```

```bash
bash run_evaluate_gpt.sh
```

This runs:
```bash
python ../src/gpt_eval.py \
    --name "hotpotqa" \
    --result-name "test_base_500"
```

Note:
For `gpt_eval.py`, make sure you have set your OpenAI API key via the `OPENAI_API_KEY` environment variable.
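For reference, rule-based QA evaluation typically reduces to exact match (EM) and token-level F1 over normalized answer strings. The sketch below uses standard SQuAD-style normalization; it is an illustration, not necessarily what `evaluate.py` implements:

```python
import re
import string
from collections import Counter

def normalize(text: str) -> str:
    """Lowercase, drop articles and punctuation, collapse whitespace (SQuAD-style)."""
    text = text.lower()
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    text = "".join(ch for ch in text if ch not in string.punctuation)
    return " ".join(text.split())

def exact_match(pred: str, gold: str) -> float:
    return float(normalize(pred) == normalize(gold))

def f1_score(pred: str, gold: str) -> float:
    pred_toks, gold_toks = normalize(pred).split(), normalize(gold).split()
    common = Counter(pred_toks) & Counter(gold_toks)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_toks)
    recall = overlap / len(gold_toks)
    return 2 * precision * recall / (precision + recall)

print(exact_match("The Eiffel Tower", "eiffel tower"))  # 1.0
print(round(f1_score("Paris, France", "Paris"), 3))     # 0.667
```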
We also provide code for reproducing the supervised fine-tuning (SFT) method described in our paper.
```bash
bash run_sft_prepare.sh
```

This script includes the following steps:
```bash
# For each dataset
names=("hotpotqa" "two_wiki" "musique")
train_limits=(10000 10000 5000)
test_limit=500

for i in "${!names[@]}"; do
    name=${names[$i]}
    train_limit=${train_limits[$i]}
    echo "Running sft preparation for $name..."
    python ../src/data_generator_sft.py --name "$name" --train-limit "$train_limit" --test-limit "$test_limit"
    python ../src/vllm_inference_sft.py --name "$name" --testdata-name "train_sft_first_step" --saved-name "train_sft_first_step" --model-path "../model/Qwen2.5-7B-Instruct"
done

# Merge into final training file
python ../src/datamaker_sft.py --name "hotpotqa" "two_wiki" "musique" --trainfile-name "train_sft_first_step" --saved-name "sft_25000"
```

Then run:

```bash
bash run_sft.sh
```

This launches supervised fine-tuning with DeepSpeed ZeRO-2:
```bash
accelerate launch \
    --config_file ../configs/deepspeed_zero2.yaml \
    ../src/sft.py \
    --model_name_or_path ../model/Qwen2.5-7B-Instruct \
    --dataset_name ../data/data_train/sft/sft_25000.jsonl \
    --per_device_train_batch_size 4 \
    --output_dir ../checkpoints/Qwen2.5-7B-Instruct-SFT_25000 \
    --bf16 True \
    --gradient_accumulation_steps 8 \
    --num_train_epochs 1 \
    --logging_steps 1 \
    --eval_strategy steps \
    --eval_steps 100 \
    --learning_rate 1e-5 \
    --max_grad_norm 0.3 \
    --warmup_ratio 0.1 \
    --torch_dtype bfloat16 \
    --gradient_checkpointing True
```

You can evaluate the trained SFT model using the same evaluation scripts as for the GRPO models (see the Evaluation section above).
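With these flags, the effective global batch size is the per-device batch size times the gradient-accumulation steps times the number of processes. Assuming all 8 GPUs participate in SFT (an assumption; the script does not pin `--num_processes`):

```python
per_device_train_batch_size = 4
gradient_accumulation_steps = 8
num_processes = 8  # assumption: all 8 A100s are used for SFT

effective_batch_size = (per_device_train_batch_size
                        * gradient_accumulation_steps
                        * num_processes)
print(effective_batch_size)  # 256 sequences per optimizer step
```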