Intuitor ships in two self-contained variants: open-r1-intuitor and verl-intuitor. Each variant is a complete implementation of the Intuitor algorithm, allowing you to choose the one that best fits your needs. The results presented in our paper were obtained using the open-r1 variant.
Both variant folders retain their original Apache-2.0 LICENSE (and any accompanying NOTICE) files, as required by their respective upstream projects. See each variant's folder for more details.
First, cd into the desired variant folder and set up the environment as specified in that variant's README.md file. Then follow the instructions below to run the example training script.
For the open-r1 variant, modify WANDB_KEY in the run_intuitor.sh script to your own WANDB key, then run the following command:
bash run_intuitor.sh
To facilitate future research, we have enabled combining self-certainty with other reward signals. If the other rewards' weights are not set to 0, self-certainty and the other rewards are first normalized separately and then added together.
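The sketch below illustrates this combination rule, assuming a simple per-batch standardization; the function names, the normalization choice, and the example scores are illustrative and may differ from the repository's exact implementation.

```python
import numpy as np

def _normalize(scores, eps=1e-8):
    # Standardize a batch of scores to zero mean and unit variance.
    scores = np.asarray(scores, dtype=np.float64)
    return (scores - scores.mean()) / (scores.std() + eps)

def combine_rewards(self_certainty, other_reward, other_weight):
    # Illustrative combination: normalize each signal separately, then sum.
    # With other_weight == 0, only normalized self-certainty is used.
    combined = _normalize(self_certainty)
    if other_weight != 0:
        combined = combined + other_weight * _normalize(other_reward)
    return combined

# Example: three sampled completions scored by both signals.
print(combine_rewards([1.2, 0.8, 1.5], [0.0, 1.0, 1.0], other_weight=0.5))
```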
For the verl variant, first download the MATH dataset and prepare it using the following Python script:
python examples/data_preprocess/math_dataset.py
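Before launching training, you can optionally sanity-check the prepared data. The snippet below assumes the preprocessing script wrote a train.parquet file under ~/data/math; adjust the path to wherever the script actually placed its output.

```python
import os
import pandas as pd

# Assumed output location -- check the preprocessing script's arguments for the real path.
path = os.path.expanduser("~/data/math/train.parquet")
df = pd.read_parquet(path)
print(f"{len(df)} training examples")
print(df.columns.tolist())
```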
Then run the following command to start training:
bash math_intuitor.sh
(Modify WANDB_KEY in the math_intuitor.sh script to your own WANDB key first.)
This project builds upon the following open-source repositories:
open-r1
- Repository: open-r1
- License: Apache License 2.0
- Description: A community re-implementation of DeepSeek-R1 that provides transparent GRPO training.
verl
- Repository: verl
- License: Apache License 2.0
- Description: A high-throughput RL training library featuring hybrid-controller data-flow, FSDP, and vLLM back-ends for large-scale LLM reinforcement learning.
If you use Intuitor in your research, please cite our paper:
@article{zhao2025learning,
  title={Learning to Reason without External Rewards},
  author={Zhao, Xuandong and Kang, Zhewei and Feng, Aosong and Levine, Sergey and Song, Dawn},
  journal={arXiv preprint arXiv:2505.19590},
  year={2025}
}