Skip to content

akarinmoe/SRaR

Repository files navigation

SRaR: Step-wise Rubric Rewards for LLM Reasoning

arXiv  Datasets and Weights  Code

This repository contains the official implementation of SRaR (Step-wise Rubrics as Rewards), an RLVR framework that delivers fine-grained, step-level rubric supervision during reinforcement learning training for LLM reasoning.

Overview

Reinforcement Learning with Verifiable Rewards (RLVR) trains reasoning LLMs using only final-answer correctness, providing no supervision over intermediate reasoning steps. Rubric-based methods like RaR introduce finer-grained evaluation, but still aggregate rubric scores into a single trajectory-level scalar, leading to three structural weaknesses: loss of multi-criterion structure, indiscriminate step supervision (18.2% wrong steps positively rewarded, 49.9% correct steps penalized), and reward hacking via self-correction looping.

SRaR addresses these through three coordinated designs:

  • Step-attributed rubric judging: An LLM judge ties each rubric item (SUGGEST / PITFALL / BONUS) to the specific reasoning step it evaluates.
  • Per-step cross-rollout normalization: Each step's reward is normalized across rollouts so only steps whose quality varies produce a learning signal.
  • Decoupled advantage estimator: Outcome advantage + bounded per-step rubric offset, preventing rubric noise from entering the GRPO baseline.

Installation

git clone <repo_url>
cd verl
pip install -e .

Data Preparation

# Preprocess training data (adds step-format prompt, converts to verl format)
python recipe/SRaR/preprocess_data.py --mode train --input /path/to/rubric_data.parquet --output recipe/SRaR/data/train.parquet

# Preprocess validation data
python recipe/SRaR/preprocess_data.py --mode val --input /path/to/val_data.parquet --output recipe/SRaR/data/val.parquet

Training data requires columns: problem, rubric, ground_truth. The rubric column contains lines like:

<SUGGEST> Applies the parity indicator to rewrite the restricted sum.
<PITFALL> Omits the 1/2 factor in the parity projection.
<BONUS> Uses the compact GF projection form.
<ANSWER> The final reported m+n equals 37.

Training

SRaR

ray stop --force || true
ray start --head --num-gpus=8 --dashboard-host=0.0.0.0
export HOME=/path/to/your/home
cd verl && bash recipe/SRaR/run_srar.sh

RaR

ray stop --force || true
ray start --head --num-gpus=8 --dashboard-host=0.0.0.0
export HOME=/path/to/your/home
cd verl && bash recipe/SRaR/run_rar.sh

Key Hyperparameters

Hyperparameter Value
Train Batch Size 128
Rollout Group Size (n) 8
Learning Rate 1e-6
Training Steps 200
Max Response Length 8192
Clip Ratio High 0.28
R_SUG / R_PIT / R_BON 0.8 / -1.0 / 1.0
Format Weight (lambda) 0.1

An OpenAI-compatible LLM judge service is required. Set JUDGE_URL and JUDGE_MODEL in the run script.

Project Structure

recipe/SRaR/
  reward_manager.py         # RaR/SRaR reward managers with LLM judge
  srar_advantage.py         # Decoupled advantage estimator (SRaR) + GRPO (RaR)
  srar_ray_trainer.py       # Ray trainers
  main_srar.py / main_rar.py
  preprocess_data.py
  run_srar.sh / run_rar.sh
  config/                   # Hydra configs

Citation

@article{xie2026step,
  title={Step-wise Rubric Rewards for LLM Reasoning},
  author={Xie, Weichu and Zhao, Haozhe and Liu, Wenpu and Zhu, Yongfu and Chen, Liang and Ye, Minghao and Chen, Zirong and Xu, Yuqi and Dong, Shuai and Wang, Ziyue and others},
  journal={arXiv preprint arXiv:2605.17291},
  year={2026}
}

Acknowledgements

Built on verl and adapted from the DAPO training recipe.

About

Step-wise Rubric Rewards for LLM Reasoning

Resources

License

Contributing

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors