
🚀 Introducing simpleR1: A streamlined framework for training R1-like models based on TRL grpo_trainer #650

@yflyzhang


Dear open-r1 community users,

We would like to share simpleR1 (https://github.com/yflyzhang/simpleR1), a simple framework for training R1-like models, built on Hugging Face’s TRL GRPOTrainer and the open-r1 project (https://github.com/huggingface/open-r1). Designed for mathematical reasoning tasks, simpleR1 enhances the GRPO training pipeline with user-friendly features. Most notably, model generation and reward score estimation are modularized, which enables custom behaviors such as user-defined resampling conditions, and training and evaluation can run simultaneously, so results can be observed in real time for immediate performance insights.

  • Enhanced GRPO trainer with multi-iteration support, precise time estimation (tqdm), custom evaluate block, and wandb logging.

  • Modularized generate, score, and log steps for completions, enabling more user-defined control.

  • Implementation of a simple rejection sampling approach for generation.

  • Compatible with Hugging Face TRL and open-r1 workflows.
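
To give a rough idea of how the modular generate/score/log design, the user-defined resampling condition, and the interleaved evaluate block fit together, here is a minimal sketch. All names in it (`ModularGRPOLoop`, `accept_fn`, `maybe_evaluate`, etc.) are hypothetical illustrations of the pattern, not simpleR1's actual interfaces; please refer to the repository for the real code.

```python
# Minimal sketch of modular generate/score hooks with a user-defined
# rejection-sampling condition and periodic evaluation. Hypothetical names only.
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Completion:
    text: str
    reward: float


class ModularGRPOLoop:
    """Toy loop showing where user-defined hooks could plug in."""

    def __init__(
        self,
        generate_fn: Callable[[str], str],
        score_fn: Callable[[str, str], float],
        accept_fn: Callable[[Completion], bool],
        eval_fn: Optional[Callable[[], float]] = None,
        max_resample: int = 3,
        eval_every: int = 50,
    ):
        self.generate_fn = generate_fn
        self.score_fn = score_fn
        self.accept_fn = accept_fn        # user-defined resampling condition
        self.eval_fn = eval_fn            # optional "evaluate block"
        self.max_resample = max_resample
        self.eval_every = eval_every

    def sample(self, prompt: str) -> Completion:
        """Rejection sampling: resample until accept_fn passes or the budget runs out."""
        best: Optional[Completion] = None
        for _ in range(self.max_resample):
            text = self.generate_fn(prompt)
            cand = Completion(text, self.score_fn(prompt, text))
            if best is None or cand.reward > best.reward:
                best = cand
            if self.accept_fn(cand):
                return cand
        return best  # fall back to the best candidate seen so far

    def maybe_evaluate(self, step: int) -> Optional[float]:
        """Run the evaluate block every `eval_every` steps (train and eval interleaved)."""
        if self.eval_fn is not None and step % self.eval_every == 0:
            return self.eval_fn()
        return None
```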

Training Examples:

Train on MATH and evaluate on MATH-500:

We trained four Qwen2.5-1.5B* models and one Qwen3-0.6B model on the MATH benchmark dataset (1,000 samples for fast testing), with real-time evaluation on MATH-500 using the implemented evaluate function.

  • Models:
    Qwen/Qwen2.5-1.5B, Qwen/Qwen2.5-1.5B-Instruct, Qwen/Qwen2.5-Math-1.5B, Qwen/Qwen2.5-Math-1.5B-Instruct, Qwen/Qwen3-0.6B.

  • Setup: Trained for 1 epoch with 3 GRPO iterations on one A100-80G GPU. Other parameters follow the usage example in the simpleR1 project. Training and evaluation accuracy and completion length were logged via wandb.

  • Rewards: Rule-based rewards were used, including an accuracy reward, a format reward, and a tag count reward, combined with different weights to produce the final reward signal. For example, reward = 8 * accuracy_reward + 1 * format_reward + 1 * tag_count_reward (a minimal sketch of this weighted combination appears after the results table below).

  • Results:

    • All models improved accuracy, with Math-specific models (Qwen2.5-Math-1.5B*) outperforming general models (Qwen2.5-1.5B*).
    • Completion length varied, with Instruct models often producing shorter outputs.
    • Qwen3-0.6B outperformed all tested 1.5B Qwen2.5 variants except Qwen2.5-Math-1.5B-Instruct, but this superior performance came at the cost of generating significantly more tokens.
| Model | Initial Accuracy | Best Accuracy | Completion Length Trend | Token Efficiency | Runtime |
|---|---|---|---|---|---|
| Qwen2.5-1.5B | 0.150 | 0.478 | 2,044 → 550 tokens | | 3h 37m 55s |
| Qwen2.5-1.5B-Instruct | 0.368 | 0.492 | 455 → 552 tokens | | 2h 18m 43s |
| Qwen2.5-Math-1.5B | 0.432 | 0.606 | 1,425 → 812 tokens | | 3h 6m 20s |
| Qwen2.5-Math-1.5B-Instruct | 0.738 | 0.766 | 591 → 581 tokens | -- | 5h 39m 41s |
| Qwen3-0.6B | 0.576 | 0.612 | 2,096 → 1,799 tokens | | 11h 3m 10s |
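
To make the weighted reward combination from the Rewards bullet concrete, here is a minimal sketch using the example weights (8 / 1 / 1). The reward functions below are stubs for illustration only; the actual accuracy, format, and tag count rewards are implemented in the simpleR1 and open-r1 code.

```python
# Sketch of combining rule-based rewards with weights:
# reward = 8 * accuracy_reward + 1 * format_reward + 1 * tag_count_reward
from typing import Callable, Dict


def accuracy_reward(completion: str, answer: str) -> float:
    """Stub: 1.0 if the extracted final answer matches the reference, else 0.0."""
    return 0.0


def format_reward(completion: str, answer: str) -> float:
    """Stub: 1.0 if the completion follows the expected reasoning/answer format."""
    return 0.0


def tag_count_reward(completion: str, answer: str) -> float:
    """Stub: partial credit for each expected tag appearing exactly once."""
    return 0.0


REWARD_WEIGHTS: Dict[str, float] = {
    "accuracy_reward": 8.0,
    "format_reward": 1.0,
    "tag_count_reward": 1.0,
}

REWARD_FUNCS: Dict[str, Callable[[str, str], float]] = {
    "accuracy_reward": accuracy_reward,
    "format_reward": format_reward,
    "tag_count_reward": tag_count_reward,
}


def combined_reward(completion: str, answer: str) -> float:
    """Weighted sum of the rule-based rewards."""
    return sum(
        REWARD_WEIGHTS[name] * func(completion, answer)
        for name, func in REWARD_FUNCS.items()
    )
```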


Fig 1. simpleR1 running example (eval on MATH-500).
📈 More training and evaluation logs are available in the WandB project: wandb log.

Train on gsm8k and evaluate on MATH-500:

We also trained the models on gsm8k (1,000 samples for fast testing) and evaluated on MATH-500. The evaluation results are shown below:


Fig 2. Training on gsm8k while evaluating on MATH-500.
📈 More training and evaluation logs are available in the WandB project: wandb log.


We invite the open-r1 community to try simpleR1, leverage its real-time evaluation capabilities, and share feedback! It’s an ideal starting point for experimenting with GRPO-based reasoning models.

For further details, please explore the code, configs, and scripts at simpleR1: https://github.com/yflyzhang/simpleR1.
