🤗 HF Repo | 📄 Paper | 🐦 Twitter
This repo contains the resources (Code, Data, Models) for the paper "Laser: Learn to Reason Efficiently with Adaptive Length-based Reward Shaping"
Laser (Length-bAsed StEp Reward shaping) and its adaptive variants Laser-D and Laser-DE (Dynamic and Difficulty-aware Length-bAsed StEp Reward shaping) are three novel methods that improve both the effectiveness and the efficiency of reasoning. Laser-D and Laser-DE achieve an improvement of 6.1 points on AIME2024 while reducing token usage by 63%.
- 🔥 [05/2025] We are excited to release the resources for the paper "Laser: Learn to Reason Efficiently with Adaptive Length-based Reward Shaping".
In Laser, we propose a unified view of length-based reward shaping that unifies various reward-shaping and truncation methods. Building on this view, we propose a novel Length-bAsed StEp Reward shaping method (Laser), which employs a step reward function based on a target length. We further propose the adaptive versions of Laser, Laser-D and Laser-DE, based on two key intuitions:
- The reasoning behavior of the model evolves dynamically during training, necessitating reward specifications that are also adaptive and dynamic;
- Rather than uniformly encouraging shorter or longer chains of thought (CoT), we posit that length-based reward shaping should be difficulty-aware, i.e., it should penalize lengthy CoTs more for easy queries.
This approach facilitates a combination of fast and slow thinking, leading to a better overall tradeoff. Unlike methods that improve token efficiency at the expense of accuracy, our proposed approaches deliver substantial gains in both dimensions, even on the challenging AIME2024 benchmark.
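For intuition only, here is a minimal, hypothetical sketch of what a step-style length reward and a difficulty-aware target length might look like. The exact reward formulations for Laser, Laser-D, and Laser-DE are given in Section 4 of the paper; every name and threshold below is illustrative, not the paper's definition:

```python
# Illustrative sketch only -- NOT the exact reward from the paper (see Section 4).
# All function names, constants, and thresholds here are hypothetical.

def step_length_reward(is_correct: bool, length: int, target_length: int,
                       bonus: float = 1.0) -> float:
    """Correctness reward plus a step bonus for correct responses within the target length."""
    correctness = 1.0 if is_correct else 0.0
    length_bonus = bonus if (is_correct and length <= target_length) else 0.0
    return correctness + length_bonus


def difficulty_aware_target(rollout_pass_rate: float,
                            budgets=(1024, 2048, 4096, 8192)) -> int:
    """Hypothetical difficulty-aware schedule: queries the model currently solves often
    (easy) get a tight length budget; rarely solved (hard) queries get a loose one."""
    if rollout_pass_rate >= 0.75:
        return budgets[0]
    if rollout_pass_rate >= 0.50:
        return budgets[1]
    if rollout_pass_rate >= 0.25:
        return budgets[2]
    return budgets[3]
```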
We propose a unified framework for length-based reward shaping that unifies various reward-shaping and truncation methods. More details can be found in Section 4 of our paper.
Efficacy (accuracy) and efficiency (token efficiency) are two conflicting goals in practice. The goal of RL-based CoT compression should be to strike a better balance between the two and improve both.
Each point in the following figures represents an independent experiment, obtained from a different training run with a different parameter configuration. Benchmarks consist of MATH500, AIME2024, AMC2023, and OlympiadBench.
| Dataset Name | Description | Link |
|---|---|---|
| Laser-Deepscaler-Dataset | Training dataset | 🤗 HuggingFace |
1.5B Models (Based on DeepSeek-R1-Distill-Qwen-1.5B)
| Model Name | Adaptive Target Length (L) | Size | Link |
|---|---|---|---|
| Laser-L2048 | 2048 | 1.5B | 🤗 HuggingFace |
| Laser-L4096 | 4096 | 1.5B | 🤗 HuggingFace |
| Laser-L8192 | 8192 | 1.5B | 🤗 HuggingFace |
| Model Name | Adaptive Target Length (L) | Size | Link |
|---|---|---|---|
| Laser-D-L1024 | 1024 | 1.5B | 🤗 HuggingFace |
| Laser-D-L2048 | 2048 | 1.5B | 🤗 HuggingFace |
| Laser-D-L4096 | 4096 | 1.5B | 🤗 HuggingFace |
| Model Name | Adaptive Target Length (L) | Size | Link |
|---|---|---|---|
| Laser-DE-L1024 | 1024 | 1.5B | 🤗 HuggingFace |
| Laser-DE-L2048 | 2048 | 1.5B | 🤗 HuggingFace |
| Laser-DE-L4096 | 4096 | 1.5B | 🤗 HuggingFace |
7B Models (Based on DeepSeek-R1-Distill-Qwen-7B)
| Model Name | Adaptive Target Length (L) | Size | Link |
|---|---|---|---|
| Laser-D-L4096 | 4096 | 7B | 🤗 HuggingFace |
| Laser-DE-L4096 | 4096 | 7B | 🤗 HuggingFace |
Note: A smaller value of $L$ indicates more rapid compression during training, resulting in more concise Chains of Thought (CoTs) during inference.
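To run inference with one of the released checkpoints, a minimal `transformers` sketch is below. The repo id shown is an assumption based on the model names above; use the exact ids from the 🤗 HuggingFace links in the tables:

```python
# Minimal inference sketch. The repo id is hypothetical -- take the exact one
# from the HuggingFace links in the model tables above.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "hkust-nlp/Laser-DE-L4096-1.5B"  # hypothetical id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

messages = [{"role": "user", "content": "What is the sum of the first 10 positive integers?"}]
input_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True,
                                          return_tensors="pt").to(model.device)
output = model.generate(input_ids, max_new_tokens=1024)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```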
conda create -n laser python=3.10
conda activate laser
git clone https://github.com/hkust-nlp/Laser.git
cd Laser
pip install -r requirement.txt
pip install flash-attn==2.6.3 --no-build-isolation
pip install -e . --no-dependencies
python scripts/pull_from_hub.py --repo_id hkust-nlp/Laser-Deepscaler-Dataset --local_path ./data/deepscaler --repo_type dataset --ignore_patterns "global_step*"
Or you can download the dataset from 🤗 HuggingFace and put it in the `data/deepscaler` folder.
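If you prefer, the same download can be done directly with `huggingface_hub` (a minimal sketch mirroring the flags of `scripts/pull_from_hub.py` above):

```python
# Minimal sketch: download the training dataset directly with huggingface_hub,
# mirroring the repo id, local path, and ignore patterns used by the script above.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="hkust-nlp/Laser-Deepscaler-Dataset",
    repo_type="dataset",
    local_dir="./data/deepscaler",
    ignore_patterns=["global_step*"],  # skip training-state artifacts, as in the command above
)
```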
If you use Slurm to run the training with Ray, you can use the following command:
bash scripts/example/ray_start_slurm.sh $SCRIPT
# e.g. bash scripts/example/ray_start_slurm.sh scripts/training/laser-de-1.5b/laser-de-1.5B-l4096.sh
Otherwise, you can use the following command to run the training with Ray:
bash scripts/example/ray_start_sh.sh $SCRIPT
`SCRIPT` is the script you want to run, for example `scripts/training/laser-de-1.5b/laser-de-1.5B-l4096.sh`.
# Laser
scripts/training/laser-1.5b/laser-1.5b-l2048.sh
scripts/training/laser-1.5b/laser-1.5b-l4096.sh
scripts/training/laser-1.5b/laser-1.5b-l8192.sh
# Laser-D
scripts/training/laser-d-1.5b/laser-d-1.5b-l1024.sh
scripts/training/laser-d-1.5b/laser-d-1.5b-l2048.sh
scripts/training/laser-d-1.5b/laser-d-1.5b-l4096.sh
# Laser-DE
scripts/training/laser-de-1.5b/laser-de-1.5b-l1024.sh
scripts/training/laser-de-1.5b/laser-de-1.5b-l2048.sh
scripts/training/laser-de-1.5b/laser-de-1.5b-l4096.sh
RUNNAME=""
INIT_MODEL_PATH="" # path to the init model, or any hf model path
TPSIZE=1
STEPS="" # if empty, init model will be evaluated
bash Qwen2.5-Math/evaluation/sh/nodes/run_eval.sh $RUNNAME $INIT_MODEL_PATH $TPSIZE $STEPS
If you find the content of this project helpful, please cite our paper as follows:
@misc{liu2025learnreasonefficientlyadaptive,
title={Learn to Reason Efficiently with Adaptive Length-based Reward Shaping},
author={Wei Liu and Ruochen Zhou and Yiyun Deng and Yuzhen Huang and Junteng Liu and Yuntian Deng and Yizhe Zhang and Junxian He},
year={2025},
eprint={2505.15612},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2505.15612},
}