Skip to content

yongchao98/R1-Code-Interpreter

Repository files navigation

R1-Code-Interpreter: Training LLMs to Reason with Code via Supervised and Reinforcement Learning

Our code is based on Llama-factory/VeRL/Search-R1 for the SFT and RL training and SymBench/BIG-Bench-Hard/reasoning-gym for datasets/benchmarks of reasoning/planning tasks.

📝 Introduction

R1-Code-Interpreter is the first framework to train LLMs for step-by-step code reasoning using multi-turn supervised fine-tuning and reinforcement learning. By curating 144 diverse reasoning and planning tasks, we enable Qwen-2.5 models (3B/7B/14B) to autonomously decide when and how to invoke code. Our best model, R1-CI-14B, outperforms GPT-4o (text-only) and approaches GPT-4o with Code Interpreter, showing emergent self-checking behavior via code generation.

🏆 Performance

GRPO Training

🤖 Dataset

The implemented tasks are now available on huggingface-hub:

Model Name HF Link
R1-Code-Interpreter-Data 🤗 yongchao98/R1-Code-Interpreter-Data

🤖 Model

R1-CI-14B/7B/3B are now available on huggingface-hub:

Model Name HF Checkpoint Size
R1-Code-Interpreter-14B 🤗 yongchao98/R1-Code-Interpreter-14B 14B
R1-Code-Interpreter-7B 🤗 yongchao98/R1-Code-Interpreter-7B 7B
R1-Code-Interpreter-3B 🤗 yongchao98/R1-Code-Interpreter-3B 3B

🚀 Get Started

Direct usage (Inference)

First we create the environment for inference and SFT training.

git clone https://github.com/yongchao98/R1-Code-Interpreter.git
cd R1-Code-Interpreter
conda create -n llama_factory_infer python=3.11
conda activate llama_factory_infer
cd LLaMA-Factory
pip install -r requirements.txt
cd ..

(In benchmark_inference_test.py, fill your python local path of current directory in line 28 and choose desired model type in line 30; In generation_models.py and Search-R1/r1_code_inter/generation_models.py, fill in your OpenAI API for GPT-4o calling to extract the answer). Then we can run the testing R1-CI models with:

python benchmark_inference_test.py

SFT training

Then for SFT training, we'd better create another environment. We can do this by running the following command:

conda create -n llama_factory_SFT python=3.11
conda activate llama_factory_SFT
cd LLaMA-Factory
git clone --depth 1 https://github.com/hiyouga/LLaMA-Factory.git
cd LLaMA-Factory
pip install -e ".[torch,metrics]" --no-build-isolation
pip install deepspeed==0.15.2
cd ..
sh finetune_qwen_7b_1M.sh

GRPO training

Then for GRPO training, we'd better create another environment. We can do this by running the following command:

cd R1-Code-Interpreter
conda deactivate
conda create -n R1_code_inter python=3.11
conda activate R1_code_inter
pip install reasoning-gym
git clone https://github.com/volcengine/verl.git
cd verl
pip3 install -e .
pip install --upgrade huggingface_hub
huggingface-cli login
cd ../Search-R1
pip install -r requirements.txt
pip3 install flash-attn --no-build-isolation
cd ..

(In Search-R1/train_grpo_3B.sh, fill your wandb key and python local path in line 1 and line 2; In r1_code_inter/generation_models.py and ../generation_models.py, fill in your OpenAI API for GPT-4o calling to extract the answer):

cd Search-R1
sh train_grpo_3B.sh

✍️ Citation

@misc{chen2025r1codeinterpretertrainingllmsreason,
      title={R1-Code-Interpreter: Training LLMs to Reason with Code via Supervised and Reinforcement Learning}, 
      author={Yongchao Chen and Yueying Liu and Junwei Zhou and Yilun Hao and Jingquan Wang and Yang Zhang and Chuchu Fan},
      year={2025},
      eprint={2505.21668},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2505.21668}, 
}

About

R1-Code-Interpreter: Training LLMs to Reason with Code via Supervised and Reinforcement Learning

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages