Intuitor: Learning to Reason without External Rewards

Paper: arXiv:2505.19590

Intuitor ships in two self-contained variants: open-r1-intuitor and verl-intuitor. Each variant is a complete implementation of the Intuitor algorithm, allowing you to choose the one that best fits your needs. The results presented in our paper were obtained using the open-r1 variant.

Both variant folders retain their original Apache-2.0 LICENSE (and any accompanying NOTICE) files, as required by their respective upstream projects.

See the respective folder, open-r1-intuitor or verl-intuitor, for more details.

Overview
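
Intuitor trains the policy with GRPO but replaces the external reward with the model's own self-certainty: the KL divergence from a uniform distribution over the vocabulary to the model's next-token distribution, averaged over the tokens of the generated response. The sketch below is only an illustration of that quantity, not the code used in either variant; the function name and tensor shapes are our own assumptions.

import math
import torch
import torch.nn.functional as F

def self_certainty(logits: torch.Tensor) -> torch.Tensor:
    # logits: (seq_len, vocab_size) next-token logits for the response tokens.
    # Per-token KL(U || p) = -log(V) - mean_v log p(v | prefix). Averaging over
    # the response gives the self-certainty score: 0 for a uniform predictor,
    # larger the more confident the model is in its own continuations.
    vocab_size = logits.size(-1)
    log_probs = F.log_softmax(logits, dim=-1)
    kl_per_token = -math.log(vocab_size) - log_probs.mean(dim=-1)
    return kl_per_token.mean()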

Getting started


First, cd into the desired variant folder and set up the environment as specified in that variant's README.md. Then follow the instructions below to run the example training script.

open-r1-intuitor

Set the WANDB_KEY in the run_intuitor.sh script to your own Weights & Biases API key, then run the following command:

bash run_intuitor.sh

To facilitate future research, we also support combining self-certainty with other reward signals: if the other reward weights are not set to 0, self-certainty and the other rewards are first normalized separately and then summed (see the sketch below).
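
As an illustration, here is a minimal sketch of that combination, assuming z-score normalization over the batch; the exact normalization and weighting in the repository's code path may differ.

import numpy as np

def combine_rewards(self_certainty: np.ndarray,
                    other_reward: np.ndarray,
                    other_weight: float) -> np.ndarray:
    # Normalize each signal separately over the batch, then add them.
    def z_norm(x: np.ndarray) -> np.ndarray:
        return (x - x.mean()) / (x.std() + 1e-8)
    if other_weight == 0:
        return z_norm(self_certainty)  # pure Intuitor: self-certainty only
    return z_norm(self_certainty) + other_weight * z_norm(other_reward)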

verl-intuitor

First, download the MATH dataset and prepare it using the following Python script:

python examples/data_preprocess/math_dataset.py

Then, as with the open-r1 variant, set the WANDB_KEY in the math_intuitor.sh script to your own key, and run the following command to start training:

bash math_intuitor.sh


References

This project builds upon the following open-source repositories:

open-r1

  • Repository: open-r1
  • License: Apache License 2.0
  • Description: A community re-implementation of DeepSeek-R1 that provides transparent GRPO training.

verl

  • Repository: verl
  • License: Apache License 2.0
  • Description: A high-throughput RL training library featuring hybrid-controller data-flow, FSDP, and vLLM back-ends for large-scale LLM reinforcement learning.

📄 Citation

If you use Intuitor in your research, please cite our paper:

@article{zhao2025learning,
  title={Learning to Reason without External Rewards},
  author={Zhao, Xuandong and Kang, Zhewei and Feng, Aosong and Levine, Sergey and Song, Dawn},
  journal={arXiv preprint arXiv:2505.19590},
  year={2025}
}
