Laser

🤗 HF Repo | 📃 Paper | 🐦 Twitter

This repo contains the resources (code, data, and models) for the paper "Laser: Learn to Reason Efficiently with Adaptive Length-based Reward Shaping".

Laser (Length-bAsed StEp Reward shaping) and its adaptive variants Laser-D and Laser-DE (Dynamic and Difficulty-aware Length-bAsed StEp Reward shaping) are three novel methods that improve both the effectiveness and the efficiency of reasoning. Laser-D and Laser-DE achieve a 6.1-point accuracy improvement on AIME2024 while reducing token usage by 63%.

(Figure: Laser overview.)


News

  • 🔥 [05/2025] We are excited to release the resources for the paper "Laser: Learn to Reason Efficiently with Adaptive Length-based Reward Shaping".

Introduction

In Laser, we first propose a unified view of length-based reward shaping that subsumes various reward-shaping and truncation methods. Building on this view, we introduce Length-bAsed StEp Reward shaping (Laser), which employs a step reward function based on a target length. We further propose adaptive versions of Laser, Laser-D and Laser-DE, based on two key intuitions:

  1. The reasoning behavior of the model evolves dynamically during training, necessitating reward specifications that are also adaptive and dynamic;

  2. Rather than uniformly encouraging shorter or longer chains of thought (CoT), we posit that length-based reward shaping should be difficulty-aware, i.e., it should penalize lengthy CoTs more heavily for easy queries.

This approach facilitates a combination of fast and slow thinking, leading to a better overall trade-off. Unlike methods that improve token efficiency at the expense of accuracy, our proposed approaches deliver substantial gains in both dimensions, even on the challenging AIME2024 benchmark.
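To make these two intuitions concrete, here is a minimal sketch of a difficulty-aware step reward in the spirit of Laser-D/Laser-DE. This is our own illustration, not the exact reward from the paper: the function name, the bonus values, and the pass-rate bucketing below are all hypothetical (see the paper for the actual reward design).

```python
# Illustrative sketch only -- NOT the exact reward from the paper.
# It captures the two intuitions above: (1) the target length can be
# adapted as training progresses, and (2) easy queries get a tighter
# length budget than hard ones (difficulty-awareness).

def laser_d_reward(correct: bool, length: int, pass_rate: float,
                   base_target: int = 4096) -> float:
    """Hypothetical difficulty-aware step reward.

    correct:   whether the sampled CoT reaches the right answer
    length:    number of generated tokens
    pass_rate: empirical accuracy of the model on this query
               (a proxy for difficulty; high pass rate = easy)
    """
    # Easy queries (high pass rate) receive a smaller length budget,
    # so lengthy CoTs on easy problems are penalized more.
    if pass_rate > 0.75:
        target = base_target // 4
    elif pass_rate > 0.25:
        target = base_target // 2
    else:
        target = base_target

    if not correct:
        return 0.0
    # Step reward: full credit within the (difficulty-aware) target
    # length, reduced credit beyond it.
    return 1.0 if length <= target else 0.5
```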

Unified Framework for Length-based Reward Shaping

We propose a unified framework for length-based reward shaping that subsumes various reward-shaping and truncation methods as special cases; more details can be found in Section 4 of our paper. A loose illustration follows the figure below.

(Figure: the unified framework for length-based reward shaping.)
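As a rough sketch of this view (our own simplification, not the paper's notation), many length-control schemes can be written as a correctness term gated by a choice of length-reward function, with hard truncation and Laser's step reward as two special cases:

```python
# Illustrative only: different length-control methods as different
# choices of a length-reward function lam(length), gating correctness.
from typing import Callable

def shaped_reward(correct: bool, length: int,
                  lam: Callable[[int], float]) -> float:
    """Generic length-shaped reward: correctness gated by lam(length)."""
    return float(correct) * lam(length)

# Hard truncation: responses beyond the budget receive no reward.
def truncation(length: int, budget: int = 8192) -> float:
    return 1.0 if length <= budget else 0.0

# Laser-style step reward: full credit within the target length L,
# reduced credit beyond it (the step values here are hypothetical).
def step(length: int, L: int = 4096) -> float:
    return 1.0 if length <= L else 0.5

print(shaped_reward(True, 3000, step))        # 1.0
print(shaped_reward(True, 6000, step))        # 0.5
print(shaped_reward(True, 9000, truncation))  # 0.0
```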

Performance

Efficacy-Efficiency Trade-off

Efficacy (accuracy) and efficiency (token usage) are two inherently conflicting goals. The goal of RL-based CoT compression should therefore be to find a better balance between them and to improve both.

Each point in the figures below represents an independent experiment, obtained from a separate training run with a different parameter configuration. The benchmarks are MATH500, AIME2024, AMC2023, and OlympiadBench.

(Figures: average performance across all benchmarks; average AIME2024 performance.)

🚀 Resources

Datasets

| Dataset Name | Description | Link |
| --- | --- | --- |
| Laser-Deepscaler-Dataset | Training dataset | 🤗 HuggingFace |

Models

1.5B Models (Based on DeepSeek-R1-Distill-Qwen-1.5B)

Laser Models

| Model Name | Adaptive Target Length (L) | Size | Link |
| --- | --- | --- | --- |
| Laser-L2048 | 2048 | 1.5B | 🤗 HuggingFace |
| Laser-L4096 | 4096 | 1.5B | 🤗 HuggingFace |
| Laser-L8192 | 8192 | 1.5B | 🤗 HuggingFace |

Laser-D Models

| Model Name | Adaptive Target Length (L) | Size | Link |
| --- | --- | --- | --- |
| Laser-D-L1024 | 1024 | 1.5B | 🤗 HuggingFace |
| Laser-D-L2048 | 2048 | 1.5B | 🤗 HuggingFace |
| Laser-D-L4096 | 4096 | 1.5B | 🤗 HuggingFace |

Laser-DE Models

| Model Name | Adaptive Target Length (L) | Size | Link |
| --- | --- | --- | --- |
| Laser-DE-L1024 | 1024 | 1.5B | 🤗 HuggingFace |
| Laser-DE-L2048 | 2048 | 1.5B | 🤗 HuggingFace |
| Laser-DE-L4096 | 4096 | 1.5B | 🤗 HuggingFace |

7B Models (Based on DeepSeek-R1-Distill-Qwen-7B)

| Model Name | Adaptive Target Length (L) | Size | Link |
| --- | --- | --- | --- |
| Laser-D-L4096 | 4096 | 7B | 🤗 HuggingFace |
| Laser-DE-L4096 | 4096 | 7B | 🤗 HuggingFace |

Note: a smaller value of $L$ induces more rapid compression during training, resulting in more concise Chains of Thought (CoTs) at inference time.
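All checkpoints load with standard Hugging Face tooling. Below is a quick inference sanity check; note that the model id is a placeholder, so copy the exact repository name from the 🤗 links in the tables above.

```python
# Quick inference check with a released checkpoint.
# NOTE: the model id below is a placeholder -- use the exact repository
# name from the HuggingFace links in the tables above.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "hkust-nlp/Laser-DE-L4096-1.5B"  # placeholder id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = "What is the sum of the first 100 positive integers?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```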

How to Start 🏃?

Installation

```bash
conda create -n laser python=3.10
conda activate laser

git clone https://github.com/hkust-nlp/Laser.git
cd Laser

pip install -r requirement.txt
pip install flash-attn==2.6.3 --no-build-isolation
pip install -e . --no-dependencies
```

Data Preparation

```bash
python scripts/pull_from_hub.py --repo_id hkust-nlp/Laser-Deepscaler-Dataset --local_path ./data/deepscaler --repo_type dataset --ignore_patterns "global_step*"
```

Alternatively, you can download the dataset from 🤗 HuggingFace and place it in the data/deepscaler folder.
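For the manual route, assuming huggingface_hub is available in your environment, a short Python snippet mirrors the command above:

```python
# Manual download with huggingface_hub (equivalent to the command above).
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="hkust-nlp/Laser-Deepscaler-Dataset",
    repo_type="dataset",
    local_dir="./data/deepscaler",
)
```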

Training

If you use Slurm to run the training with Ray, you can use the following command:

```bash
bash scripts/example/ray_start_slurm.sh $SCRIPT
# e.g. bash scripts/example/ray_start_slurm.sh scripts/training/laser-de-1.5b/laser-de-1.5B-l4096.sh
```

Otherwise, you can run the training with Ray directly:

```bash
bash scripts/example/ray_start_sh.sh $SCRIPT
```

$SCRIPT is the training script you want to run, e.g. scripts/training/laser-de-1.5b/laser-de-1.5B-l4096.sh. The available training scripts are:

```bash
# Laser
scripts/training/laser-1.5b/laser-1.5b-l2048.sh
scripts/training/laser-1.5b/laser-1.5b-l4096.sh
scripts/training/laser-1.5b/laser-1.5b-l8192.sh

# Laser-D
scripts/training/laser-d-1.5b/laser-d-1.5b-l1024.sh
scripts/training/laser-d-1.5b/laser-d-1.5b-l2048.sh
scripts/training/laser-d-1.5b/laser-d-1.5b-l4096.sh

# Laser-DE
scripts/training/laser-de-1.5b/laser-de-1.5b-l1024.sh
scripts/training/laser-de-1.5b/laser-de-1.5b-l2048.sh
scripts/training/laser-de-1.5b/laser-de-1.5b-l4096.sh
```

Evaluation

```bash
RUNNAME=""
INIT_MODEL_PATH=""  # path to the init model, or any HF model path
TPSIZE=1            # tensor parallel size
STEPS=""            # checkpoint steps to evaluate; if empty, the init model is evaluated

bash Qwen2.5-Math/evaluation/sh/nodes/run_eval.sh $RUNNAME $INIT_MODEL_PATH $TPSIZE $STEPS
```

Citation

If you find the content of this project helpful, please cite our paper as follows:

```bibtex
@misc{liu2025learnreasonefficientlyadaptive,
      title={Learn to Reason Efficiently with Adaptive Length-based Reward Shaping},
      author={Wei Liu and Ruochen Zhou and Yiyun Deng and Yuzhen Huang and Junteng Liu and Yuntian Deng and Yizhe Zhang and Junxian He},
      year={2025},
      eprint={2505.15612},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2505.15612},
}
```

Acknowledgements

  • Laser is a sister project of SimpleRL; we thank the SimpleRL authors for their great work.
  • Our code is built on the great work of verl.
