[📖 Paper] [🤗 Video-R1-7B-model] [🤗 Video-R1-train-data] [🤖 Video-R1-7B-model] [🤖 Video-R1-train-data]
Inspired by DeepSeek-R1's success in eliciting reasoning abilities through rule-based RL, we introduce Video-R1 as the first work to systematically explore the R1 paradigm for eliciting video reasoning within MLLMs.
We introduce T-GRPO, an extension of GRPO that incorporates temporal modeling to explicitly promote temporal reasoning. In addition, we construct two datasets: Video-R1-COT-165k for SFT cold start and Video-R1-260k for RL training, both comprising image and video data.
Our Video-R1-7B obtains strong performance on several video reasoning benchmarks. For example, Video-R1-7B attains 35.8% accuracy on the video spatial reasoning benchmark VSI-Bench, surpassing the commercial proprietary model GPT-4o.
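As a rough sketch of the temporal reward behind T-GRPO (simplified; the function name and bonus scheme below are illustrative assumptions, not the exact formulation in the paper): the policy answers each question from both temporally ordered and shuffled frames, and correct ordered-frame rollouts receive an extra bonus only when the ordered group outperforms the shuffled one.

```python
# Illustrative sketch of T-GRPO's temporal bonus (simplified; see the paper for the exact formulation).
from typing import List

def temporal_bonus(correct_ordered: List[bool],
                   correct_shuffled: List[bool],
                   alpha: float = 0.3) -> List[float]:
    """Per-rollout bonus for the ordered-frame group.

    A bonus `alpha` is granted to correct ordered-frame rollouts only if the
    ordered group's accuracy exceeds the shuffled group's accuracy, rewarding
    answers that actually depend on temporal information.
    """
    p_ordered = sum(correct_ordered) / max(len(correct_ordered), 1)
    p_shuffled = sum(correct_shuffled) / max(len(correct_shuffled), 1)
    if p_ordered > p_shuffled:
        return [alpha if c else 0.0 for c in correct_ordered]
    return [0.0] * len(correct_ordered)

# Example: the ordered group answers better than the shuffled one, so its correct rollouts get the bonus.
print(temporal_bonus([True, True, False, True], [True, False, False, False]))  # [0.3, 0.3, 0.0, 0.3]
```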
Video-R1-7B can be easily trained using 4 H20 (96GB) GPUs or 5 A100 (80GB) GPUs.
- [2025/03/28] We release our paper, code, model weights, and two curated training datasets on Hugging Face🤗 and ModelScope🤖.
- [2025/02/23] We release the preliminary version of Video-R1; see ./previous_version for that version.
- Support Qwen2.5-VL
- Support vLLM training and inference
- Support Image-Video mixed training
- Support multiple answer output types (multiple choice, numerical, OCR, free-form, regression); see the reward sketch after this list
- Provide the full pipeline (dataset, CoT annotation, SFT training, RL training, evaluation, etc.)
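As a rough illustration of how mixed answer types can be scored with a rule-based accuracy reward (the type names and matching rules below are assumptions for the sketch, not the repo's exact implementation):

```python
# Illustrative rule-based accuracy reward for mixed answer types (assumed matching rules).
import re

def accuracy_reward(prediction: str, ground_truth: str, problem_type: str) -> float:
    """Return 1.0 for a correct answer, 0.0 otherwise (graded reward for regression)."""
    pred, gt = prediction.strip(), ground_truth.strip()
    if problem_type == "multiple choice":
        return float(pred.upper()[:1] == gt.upper()[:1])          # compare option letters
    if problem_type == "numerical":
        try:
            return float(abs(float(pred) - float(gt)) < 1e-3)     # small tolerance for rounding
        except ValueError:
            return 0.0
    if problem_type == "regression":
        try:
            rel_err = abs(float(pred) - float(gt)) / (abs(float(gt)) + 1e-6)
            return max(0.0, 1.0 - rel_err)                        # graded by relative error
        except ValueError:
            return 0.0
    # OCR / free-form: case-insensitive match after stripping punctuation
    normalize = lambda s: re.sub(r"[^a-z0-9 ]", "", s.lower()).strip()
    return float(normalize(pred) == normalize(gt))
```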
To overcome the scarcity of high-quality video reasoning training data, we strategically include image-based reasoning data in the training mix. We collect data from a variety of public datasets and carefully sample and balance the proportion of each subset.
To facilitate an effective SFT cold start, we leverage Qwen2.5-VL-72B to generate CoT rationales for the samples in Video-R1-260k. After applying basic rule-based filtering to remove low-quality or inconsistent outputs, we obtain a high-quality CoT dataset, Video-R1-COT-165k.
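The concrete filtering rules are not spelled out here; a minimal sketch of what such rule-based filtering might look like, assuming an R1-style <think>/<answer> output template, is:

```python
# Minimal sketch of rule-based CoT filtering (assumed rules, not the exact ones used for Video-R1-COT-165k).
import re

def keep_cot_sample(generation: str, ground_truth: str) -> bool:
    """Keep a generated rationale only if it follows the template and its answer is correct."""
    think = re.search(r"<think>(.*?)</think>", generation, re.DOTALL)
    answer = re.search(r"<answer>(.*?)</answer>", generation, re.DOTALL)
    if think is None or answer is None:
        return False                                  # malformed output: missing template tags
    if len(think.group(1).strip()) < 20:
        return False                                  # degenerate or empty reasoning
    return answer.group(1).strip().lower() == ground_truth.strip().lower()
```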
Video-R1 significantly outperforms previous models across most benchmarks. Notably, on VSI-Bench, which focuses on spatial reasoning in videos, Video-R1-7B achieves a new state-of-the-art accuracy of 35.8%, surpassing GPT-4o, a proprietary model, while using only 32 frames and 7B parameters.
This highlights the necessity of explicit reasoning capability in solving video tasks, and confirms the effectiveness of reinforcement learning for video tasks.
Besides, although the model is trained using only 16 frames, we find that evaluating on more frames (e.g., 64) generally leads to better performance, particularly on benchmarks with longer videos. These results indicate the importance of training models to reason over more frames.
One of the most intriguing outcomes of reinforcement learning in Video-R1 is the emergence of self-reflection reasoning behaviors, commonly referred to as “aha moments”. Some examples are as follows.
The accuracy reward exhibits a generally upward trend, indicating that the model continuously improves its ability to produce correct answers under RL.
Interestingly, the response length curve first drops at the beginning of RL training and then gradually increases. We hypothesize that the model initially discards its previous, potentially sub-optimal reasoning style, and then gradually converges to a better and more stable reasoning policy.
git clone https://github.com/tulerfeng/Video-R1
cd Video-R1
# build environment
conda create -n video-r1 python=3.11
conda activate video-r1
bash setup.sh
# qwen video extraction setting, e.g., max frames, resolutions
# Use the [decord] feature to improve speed
cd src/qwen-vl-utils
pip install -e .[decord]
cd ..
# download training dataset
git lfs install
git clone https://huggingface.co/datasets/Video-R1/Video-R1-data
Please put the downloaded dataset in src/r1-v/Video-R1-data/
Then, unzip the data
python ./src/unzip.py
The Video-R1-260k.json file is for RL training, while Video-R1-COT-165k.json is for SFT cold start.
Qwen2.5-VL has been frequently updated in the Transformers library, which may cause version-related bugs or inconsistencies. Our code is compatible with the following version; please download it here
Then install our provided version of transformers
unzip transformers-main.zip
cd ./transformers-main
pip install .
For the vLLM library, please use version 0.7.2.
For the trl library, please use version 0.16.0.
We first perform supervised fine-tuning on the Video-R1-COT-165k dataset for one epoch to obtain the Qwen2.5-VL-7B-SFT model. If you want to perform CoT annotation on your own data, please refer to src/generate_cot_vllm.py
bash ./src/scripts/run_sft_video.sh
If you want to skip the SFT process, we also provide one of our SFT models at 🤗Qwen2.5-VL-SFT.
This is followed by RL training on the Video-R1-260k dataset to produce the final Video-R1 model. Due to current computational resource limitations, we train the model for only 1.2k RL steps.
The script for training the obtained Qwen2.5-VL-7B-SFT model with T-GRPO or GRPO is as follows
bash ./src/scripts/run_grpo_video.sh
You can also use the following script to enable vLLM acceleration for RL training
bash ./src/scripts/run_grpo_vllm_qwen25vl.sh
For efficiency considerations, we limit the maximum number of video frames to 16 during training. Each frame is processed at a max resolution of 128 × 28 × 28. You can set this in src/qwen-vl-utils
Please keep per_device_train_batch_size=1, as in the previous work R1-V.
During inference, we increase the max frame resolution to 256 × 28 × 28 and max frames to 16/32/64 to enhance performance. You can easily set this in src/qwen-vl-utils
For all evaluations, we follow the decoding configuration used in the official Qwen2.5-VL demo, with top_p = 0.001 and temperature = 0.01. Setting a large top_p may lead to messy output during inference.
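A minimal end-to-end inference sketch under these evaluation settings. It uses the standard qwen-vl-utils message keys (max_pixels, nframes) to cap resolution and frame count instead of editing the patched src/qwen-vl-utils defaults; the model id, video path, and question are placeholders:

```python
# Minimal inference sketch with the evaluation settings above (assumptions noted in comments).
import torch
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
from qwen_vl_utils import process_vision_info

model_id = "Video-R1/Video-R1-7B"  # placeholder model id
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

messages = [{
    "role": "user",
    "content": [
        # 32 frames, each capped at 256 * 28 * 28 pixels, matching the inference setting above
        {"type": "video", "video": "file:///path/to/video.mp4",
         "max_pixels": 256 * 28 * 28, "nframes": 32},
        {"type": "text", "text": "How many people appear in the video?"},  # placeholder question
    ],
}]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(text=[text], images=image_inputs, videos=video_inputs,
                   padding=True, return_tensors="pt").to(model.device)

# Decoding configuration from the official Qwen2.5-VL demo
output_ids = model.generate(**inputs, max_new_tokens=1024, do_sample=True,
                            top_p=0.001, temperature=0.01)
answer = processor.batch_decode(output_ids[:, inputs["input_ids"].shape[1]:],
                                skip_special_tokens=True)[0]
print(answer)
```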
We recommend using our provided json files and scripts for easier evaluation.
The json files can be downloaded at [🤗 Video-R1-eval]; put them in /src/r1-v/Evaluation
Next, download the evaluation video data from each benchmark's official website, and place it in /src/r1-v/Evaluation as specified in the provided json files.
Finally, conduct evaluation on all benchmarks using the following scripts
bash ./src/eval_bench.sh
We sincerely appreciate the contributions of the open-source community. The related projects are as follows: R1-V, DeepSeek-R1
If you find our work helpful for your research, please consider citing our work.
@article{feng2025video,
title={Video-R1: Reinforcing Video Reasoning in MLLMs},
author={Feng, Kaituo and Gong, Kaixiong and Li, Bohao and Guo, Zonghao and Wang, Yibing and Peng, Tianshuo and Wang, Benyou and Yue, Xiangyu},
journal={arXiv preprint arXiv:2503.21776},
year={2025}
}