
🚀 Introducing simpleR1: A streamlined framework for training R1-like models based on TRL grpo_trainer #650

@yflyzhang


Dear open-r1 community users,

We would like to share simpleR1 (https://github.com/yflyzhang/simpleR1), a simple framework for training R1-like models, built on Hugging Face’s TRL GRPOTrainer and the open-r1 project (https://github.com/huggingface/open-r1). Designed for mathematical reasoning tasks, simpleR1 enhances the GRPO training pipeline with user-friendly features. Most notably, model generation and reward score estimation are modularized, which enables custom behaviors such as user-defined resampling conditions, and training and evaluation can run simultaneously, so results can be observed in real time for immediate performance insights.

  • Enhanced GRPO trainer with multi-iteration support, precise time estimation (tqdm), custom evaluate block, and wandb logging.

  • Modularized generate, score, and log steps for completions, enabling more user-defined control.

  • Implementation of a simple rejection sampling approach for generation.

  • Compatible with Hugging Face TRL and open-r1 workflows.
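
To give a rough idea of how the modular generate/score/log design, the user-defined resampling condition, and the interleaved evaluate block fit together, here is a minimal sketch. All names in it (`ModularGRPOLoop`, `accept_fn`, `maybe_evaluate`, etc.) are hypothetical illustrations of the pattern, not simpleR1's actual interfaces; please refer to the repository for the real code.

```python
# Minimal sketch of modular generate/score hooks with a user-defined
# rejection-sampling condition and periodic evaluation. Hypothetical names only.
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Completion:
    text: str
    reward: float


class ModularGRPOLoop:
    """Toy loop showing where user-defined hooks could plug in."""

    def __init__(
        self,
        generate_fn: Callable[[str], str],
        score_fn: Callable[[str, str], float],
        accept_fn: Callable[[Completion], bool],
        eval_fn: Optional[Callable[[], float]] = None,
        max_resample: int = 3,
        eval_every: int = 50,
    ):
        self.generate_fn = generate_fn
        self.score_fn = score_fn
        self.accept_fn = accept_fn        # user-defined resampling condition
        self.eval_fn = eval_fn            # optional "evaluate block"
        self.max_resample = max_resample
        self.eval_every = eval_every

    def sample(self, prompt: str) -> Completion:
        """Rejection sampling: resample until accept_fn passes or the budget runs out."""
        best: Optional[Completion] = None
        for _ in range(self.max_resample):
            text = self.generate_fn(prompt)
            cand = Completion(text, self.score_fn(prompt, text))
            if best is None or cand.reward > best.reward:
                best = cand
            if self.accept_fn(cand):
                return cand
        return best  # fall back to the best candidate seen so far

    def maybe_evaluate(self, step: int) -> Optional[float]:
        """Run the evaluate block every `eval_every` steps (train and eval interleaved)."""
        if self.eval_fn is not None and step % self.eval_every == 0:
            return self.eval_fn()
        return None
```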

Training Examples:

Train on MATH and evaluate on MATH-500:

We trained four Qwen2.5-1.5B* models and one Qwen3-0.6B model on the MATH benchmark dataset (1,000 samples for fast testing), with real-time evaluation on MATH-500 using the implemented evaluate function.

  • Models:
    Qwen/Qwen2.5-1.5B, Qwen/Qwen2.5-1.5B-Instruct, Qwen/Qwen2.5-Math-1.5B, Qwen/Qwen2.5-Math-1.5B-Instruct, Qwen/Qwen3-0.6B.

  • Setup: Trained for 1 epoch with 3 GRPO iterations on one A100-80G GPU. Other parameters follow the usage example in the simpleR1 project. Training and evaluation accuracy and completion length were logged via wandb.

  • Rewards: Rule-based rewards were used, including an accuracy reward, a format reward, and a tag count reward, combined with different weights to produce the final reward signal. For example, reward = 8 * accuracy_reward + 1 * format_reward + 1 * tag_count_reward (a minimal sketch of this weighted combination appears after the results table below).

  • Results:

    • All models improved accuracy, with Math-specific models (Qwen2.5-Math-1.5B*) outperforming general models (Qwen2.5-1.5B*).
    • Completion length varied, with Instruct models often producing shorter outputs.
    • Qwen3-0.6B outperformed all tested 1.5B Qwen2.5 variants except Qwen2.5-Math-1.5B-Instruct, but this superior performance came at the cost of generating significantly more tokens.
| Model | Initial Accuracy | Best Accuracy | Completion Length Trend | Token Efficiency | Runtime |
|---|---|---|---|---|---|
| Qwen2.5-1.5B | 0.150 | 0.478 | 2,044 → 550 tokens | | 3h 37m 55s |
| Qwen2.5-1.5B-Instruct | 0.368 | 0.492 | 455 → 552 tokens | | 2h 18m 43s |
| Qwen2.5-Math-1.5B | 0.432 | 0.606 | 1,425 → 812 tokens | | 3h 6m 20s |
| Qwen2.5-Math-1.5B-Instruct | 0.738 | 0.766 | 591 → 581 tokens | -- | 5h 39m 41s |
| Qwen3-0.6B | 0.576 | 0.612 | 2,096 → 1,799 tokens | | 11h 3m 10s |
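
To make the weighted reward combination from the Rewards bullet concrete, here is a minimal sketch using the example weights (8 / 1 / 1). The reward functions below are stubs for illustration only; the actual accuracy, format, and tag count rewards are implemented in the simpleR1 and open-r1 code.

```python
# Sketch of combining rule-based rewards with weights:
# reward = 8 * accuracy_reward + 1 * format_reward + 1 * tag_count_reward
from typing import Callable, Dict


def accuracy_reward(completion: str, answer: str) -> float:
    """Stub: 1.0 if the extracted final answer matches the reference, else 0.0."""
    return 0.0


def format_reward(completion: str, answer: str) -> float:
    """Stub: 1.0 if the completion follows the expected reasoning/answer format."""
    return 0.0


def tag_count_reward(completion: str, answer: str) -> float:
    """Stub: partial credit for each expected tag appearing exactly once."""
    return 0.0


REWARD_WEIGHTS: Dict[str, float] = {
    "accuracy_reward": 8.0,
    "format_reward": 1.0,
    "tag_count_reward": 1.0,
}

REWARD_FUNCS: Dict[str, Callable[[str, str], float]] = {
    "accuracy_reward": accuracy_reward,
    "format_reward": format_reward,
    "tag_count_reward": tag_count_reward,
}


def combined_reward(completion: str, answer: str) -> float:
    """Weighted sum of the rule-based rewards."""
    return sum(
        REWARD_WEIGHTS[name] * func(completion, answer)
        for name, func in REWARD_FUNCS.items()
    )
```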


Fig 1. simpleR1 running example (eval on MATH-500).
📈 More training and evaluation logs are available in the WandB project: wandb log.

Train on gsm8k and evaluate on MATH-500:

We also trained the models on gsm8k (1,000 samples for fast testing) and evaluated on MATH-500. The evaluation results are shown below:


Fig 2. Training on gsm8k while evaluating on MATH-500.
📈 More training and evaluation logs are available in the WandB project: wandb log.


We invite the open-r1 community to try simpleR1, leverage its real-time evaluation capabilities, and share feedback! It’s an ideal starting point for experimenting with GRPO-based reasoning models.

For further details, please explore the code, configs, and scripts at simpleR1: https://github.com/yflyzhang/simpleR1.
