Description
Dear open-r1 community users,
We would like to share simpleR1 (https://github.com/yflyzhang/simpleR1), a simple framework for training R1-like models, built on Hugging Face's TRL GRPOTrainer and the open-r1 project (https://github.com/huggingface/open-r1). Designed for mathematical reasoning tasks, simpleR1 enhances the GRPO training pipeline with user-friendly features, most notably modular model generation and reward scoring, which enables custom behaviors such as user-defined resampling conditions, and simultaneous training and evaluation, which gives real-time insight into model performance.
- Enhanced GRPO trainer with multi-iteration support, precise time estimation (tqdm), a custom evaluate block, and wandb logging.
- Modularized generate, score, and log steps for completions, enabling more user-defined controls.
- A simple reject-sampling approach for generation (see the sketch after this list).
- Compatible with Hugging Face TRL and open-r1 workflows.
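To illustrate the kind of user-defined resampling condition the modular generate/score pipeline makes possible, here is a minimal, self-contained sketch of reject sampling for a GRPO-style completion group. This is not simpleR1's actual hook; `generate_fn`, `reward_fn`, and the all-equal-reward rejection rule are illustrative placeholders.

```python
# Illustrative sketch only: generate_fn and reward_fn are hypothetical callables,
# not part of simpleR1's or TRL's API.
from typing import Callable, List


def sample_with_rejection(
    prompt: str,
    generate_fn: Callable[[str, int], List[str]],  # returns `num_generations` completions
    reward_fn: Callable[[str, str], float],        # scores one completion for a prompt
    num_generations: int = 8,
    max_attempts: int = 4,
) -> List[str]:
    """Resample a completion group until its rewards are not all identical.

    In GRPO the advantage is computed relative to the group mean, so a group in
    which every completion receives the same reward (e.g. all wrong) carries no
    learning signal; rejecting and resampling such groups is one simple
    user-defined condition.
    """
    completions = generate_fn(prompt, num_generations)
    for _ in range(max_attempts - 1):
        rewards = [reward_fn(prompt, c) for c in completions]
        if max(rewards) > min(rewards):  # nonzero variance -> keep this group
            return completions
        completions = generate_fn(prompt, num_generations)  # reject and resample
    return completions
```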
Training Examples:
Train on MATH and evaluate on MATH-500:
We trained four Qwen2.5-1.5B* models and one Qwen3-0.6B* model on the MATH benchmark dataset (1000 samples for fast testing), with real-time evaluation on MATH-500 using the implemented evaluate function.
- Models: Qwen/Qwen2.5-1.5B, Qwen/Qwen2.5-1.5B-Instruct, Qwen/Qwen2.5-Math-1.5B, Qwen/Qwen2.5-Math-1.5B-Instruct, Qwen/Qwen3-0.6B.
- Setup: Trained for 1 epoch with 3 GRPO iterations on one A100-80G GPU; other parameters follow the usage example in the simpleR1 project. Train and evaluation accuracy and completion length were logged via wandb.
- Rewards: Rule-based rewards were used in this example, including an accuracy reward, a format reward, and a tag count reward, combined with different weights to form the final reward signal. For example, `reward = 8 * accuracy_reward + 1 * format_reward + 1 * tag_count_reward` (a sketch of this weighting appears after the results table).
- Results:
  - All models improved in accuracy, with math-specific models (Qwen2.5-Math-1.5B*) outperforming general models (Qwen2.5-1.5B*).
  - Completion length varied, with Instruct models often producing shorter outputs.
  - Qwen3-0.6B outperforms all tested Qwen2.5-1.5B* variants except Qwen2.5-Math-1.5B-Instruct, but this superior performance comes at the cost of generating significantly more tokens.
| Model | Initial Accuracy | Best Accuracy | Completion Length Trend | Token Efficiency | Runtime |
|---|---|---|---|---|---|
| Qwen2.5-1.5B | 0.150 | 0.478 | 2,044 → 550 tokens | ↑ | 3h 37m 55s |
| Qwen2.5-1.5B-Instruct | 0.368 | 0.492 | 455 → 552 tokens | ↓ | 2h 18m 43s |
| Qwen2.5-Math-1.5B | 0.432 | 0.606 | 1,425 → 812 tokens | ↑ | 3h 6m 20s |
| Qwen2.5-Math-1.5B-Instruct | 0.738 | 0.766 | 591 → 581 tokens | -- | 5h 39m 41s |
| Qwen3-0.6B | 0.576 | 0.612 | 2,096 → 1,799 tokens | ↑ | 11h 3m 10s |
Fig 1. SimpleR1 running example (eval on MATH-500).
📈 More train and eval logs are available at the WandB project: wandb log.
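As referenced above, here is a minimal sketch of how such a weighted rule-based reward could be composed. The individual reward functions below are simplified stand-ins (the boxed-answer check, tag set, and regex are assumptions), not simpleR1's exact implementations; only the 8/1/1 weighting mirrors the example above.

```python
# Illustrative weighted rule-based reward: 8 * accuracy + 1 * format + 1 * tag count.
# The component rewards are simplified placeholders, not simpleR1's implementations.
import re
from typing import List


def accuracy_reward(completion: str, answer: str) -> float:
    # Hypothetical check: 1.0 if the boxed answer matches the reference answer.
    match = re.search(r"\\boxed\{([^}]*)\}", completion)
    return 1.0 if match and match.group(1).strip() == answer.strip() else 0.0


def format_reward(completion: str) -> float:
    # 1.0 if the completion follows a <think>...</think><answer>...</answer> format.
    pattern = r"^<think>.*?</think>\s*<answer>.*?</answer>$"
    return 1.0 if re.match(pattern, completion, re.DOTALL) else 0.0


def tag_count_reward(completion: str) -> float:
    # Partial credit: 0.25 per expected tag that appears exactly once.
    tags = ["<think>", "</think>", "<answer>", "</answer>"]
    return sum(0.25 for t in tags if completion.count(t) == 1)


def combined_reward(completions: List[str], answers: List[str]) -> List[float]:
    # reward = 8 * accuracy_reward + 1 * format_reward + 1 * tag_count_reward
    return [
        8.0 * accuracy_reward(c, a) + 1.0 * format_reward(c) + 1.0 * tag_count_reward(c)
        for c, a in zip(completions, answers)
    ]
```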
Train on gsm8k and evaluate on MATH-500:
We also trained the models on gsm8k (1000 samples for fast testing) and evaluated on MATH-500. The evaluation results are shown below:
Fig 2. Training on gsm8k while evaluating on MATH-500.
📈 More train and eval logs are available at the WandB project: wandb log.
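For readers who want to reproduce a comparable setup with plain TRL, the sketch below wires a GSM8K training subset and the MATH-500 evaluation split into TRL's GRPOTrainer with in-training evaluation. The dataset column mapping, placeholder reward, and hyperparameter choices are assumptions and may differ from simpleR1's actual scripts and configs; please refer to the project's configs and scripts for the real entry point.

```python
# Rough, TRL-only sketch of "train on gsm8k, evaluate on MATH-500 in real time".
# Column mapping and the placeholder reward are illustrative assumptions.
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

# 1000-sample subset for fast testing, as in the example above.
train_ds = load_dataset("openai/gsm8k", "main", split="train").select(range(1000))
eval_ds = load_dataset("HuggingFaceH4/MATH-500", split="test")

# GRPOTrainer expects a "prompt" column.
train_ds = train_ds.map(lambda x: {"prompt": x["question"]})
eval_ds = eval_ds.map(lambda x: {"prompt": x["problem"]})


def placeholder_reward(completions, **kwargs):
    # Stand-in reward; in practice, plug in the weighted rule-based rewards above.
    return [float("\\boxed{" in c) for c in completions]


args = GRPOConfig(
    output_dir="simpler1-gsm8k",
    num_train_epochs=1,
    num_generations=8,
    num_iterations=3,        # multiple GRPO iterations per generation batch
    eval_strategy="steps",   # evaluate during training for real-time curves
    eval_steps=50,
    report_to="wandb",
)

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-1.5B-Instruct",
    reward_funcs=placeholder_reward,
    args=args,
    train_dataset=train_ds,
    eval_dataset=eval_ds,
)
trainer.train()
```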
We invite the open-r1 community to try simpleR1, leverage its real-time evaluation capabilities, and share feedback! It’s an ideal starting point for experimenting with GRPO-based reasoning models.
For further details, please explore the code, configs, and scripts at simpleR1: https://github.com/yflyzhang/simpleR1.