Current implementations of the DeepSeek R1 framework in multimodal settings have predominantly concentrated on unimodal task specialization (e.g., mathematical reasoning, visual-spatial analysis, referring expression comprehension (REC), or visual counting). This narrow focus induces catastrophic forgetting, where optimization for isolated task domains compromises model generalization through capacity competition in shared neural substrates. Furthermore, while existing reproductions prioritize base models due to stability concerns during reinforcement learning (RL) exploration, the critical question of metacognitive emergence ("Aha Moments") in instruction-tuned models such as Qwen2.5-VL-Instruct remains unaddressed.
To address these limitations, we propose Ocean-R1, a two-stage rule-based RL framework for enhancing multimodal intelligence: the first stage strengthens the model's reasoning ability, and the second stage improves its visual perception. Our experiments show that this approach successfully induces the emergence of metacognition in Qwen2.5-VL-Instruct (3B/7B) and achieves significant improvements on multiple tasks:
- Visual Math: MathVision (+2.7/+2.7), MathVerse (+3.2/+1.4), and MathVista (+4.9/+4.4),
- Geometric Reasoning: GeoQA (+17.5/+22.2),
- Visual Counting: SuperCLEVR (+23.2/+22.6),
- Referring Expression Comprehension (REC): RefCOCO/+/g Avg (+10.2/+1.7),
- Visual Spatial Reasoning: CVBench (+9.3/+6.5),
- OCR: OCR Bench (+9.9/+5.6).
🔥 We use the awesome verl framework to train our models. To foster further research in this area, we release all of our code, models, and datasets.
- 🤗 Ocean-R1-3B-Instruct
- 🤗 Ocean-R1-7B-Instruct
- 🤗 Ocean_R1_visual_data_stage1 (63k)
- 🤗 Ocean_R1_visual_data_stage2 (20k)
Note
These data are collected from the open-source community and have been cleaned and filtered.
- 2025-04-03: We release the latest Ocean-R1 repo, including codebase, model, and training datasets.
- 2025-03-10: We release the Ocean-R1 repo, including codebase, model, and training datasets.
| Model | SuperCLEVR | GeoQA | RefCOCO/+/g Avg | CVBench | OCR Bench | MathVision | MathVista | MathVerse |
|---|---|---|---|---|---|---|---|---|
| Qwen2.5-VL-3B-Instruct | 64.1 | 38.9 | 75.3 | 66.5 | 74.6 | 21.2 | 62.3 | 35.4 |
| Ocean-R1-3B-Instruct | 87.3 | 56.4 | 85.5 | 75.8 | 84.5 | 23.9 | 67.2 | 38.6 |
| + | +23.2 | +17.5 | +10.2 | +9.3 | +9.9 | +2.7 | +4.9 | +3.2 |
| Qwen2.5-VL-7B-Instruct | 72.0 | 47.5 | 85.1 | 74.5 | 82.3 | 25.1 | 68.2 | 47.9 |
| Ocean-R1-7B-Instruct | 94.6 | 69.7 | 86.8 | 81.0 | 87.9 | 27.8 | 72.6 | 49.3 |
| + | +22.6 | +22.2 | +1.7 | +6.5 | +5.6 | +2.7 | +4.4 | +1.4 |
Examples of Reflection Patterns on GeoQA |
---|
However, this calculation seems to have an error. Let's re-evaluate the problem. |
However, upon rechecking the problem constraints and the logical steps, it appears there was a misinterpretation in the problem statement or a need to re-evaluate the given information. Given the problem constraints and the standard geometric properties, the correct answer should be recalculated as follows: |
Since CP = CD - DP = 8 - 16 = -8, which is not possible, we need to re-evaluate the setup. |
However, this calculation is incorrect. Let's re-evaluate: |
Therefore, angle A = angle ADO. Given that angle D = 50°, we have angle A = angle ADO = 50°. Since DE is parallel to OA, we have angle AOD = angle A + angle ADO = 100°. Since CD is the diameter, we have angle C = 90° - angle AOD = 90° - 100° = -10°. However, this calculation is incorrect. Let's re-evaluate: Since DE is parallel to OA, we have angle AOD = 2 * angle D = 100°. Therefore, angle C = 90° - angle AOD = 90° - 100° = -10°. This is still incorrect. Let's re-evaluate correctly: |
However, this contradicts the given conditions. Therefore, we need to re-evaluate the correct proportion, which is ( \frac{AB}{AC} = \frac{AD}{AB} ). |
Therefore, ( PM = \frac{1}{2} AD ) and (QN = \frac{1}{2} BC ). Given that ( MP + NQ = 14 ) and ( AC + BC = 18 ), we have ( \frac{1}{2} AC + \frac{1}{2} BC = 14 ). Thus, ( AC + BC = 28 ). However, we are given that ( AC + BC = 18 ), which seems to be a contradiction. Revising the given conditions, we use the correct relationship: |
However, there seems to be a discrepancy with the provided options. Given the options, it appears there might be a misinterpretation or a different approach needed. But based on the standard geometric properties and calculations, the correct answer should be: |
Given the options, it seems there might be a misunderstanding in the problem statement or the options provided. |
However, there seems to be a miscalculation in the above reasoning. Let's re-evaluate using the properties of... |
- Visual Aha Moment: Our findings indicate that in multimodal reasoning scenarios, rule-based RL training of the Qwen2.5-VL-3B-Instruct and Qwen2.5-VL-7B-Instruct models can elicit emergent behaviors such as reflection and verification, akin to an "Aha Moment". Table 1 presents specific instances of these reflective and verification behaviors.
- The Training Dynamics of Reasoning Behaviors: Figure 1 depicts how the number of reasoning behaviors evolves during training on the GeoQA dataset. Notably, different models exhibit divergent trends. Accuracy rises as reflective behavior increases; once the model has learned to reason, reflective behavior declines. Note also that repeated reflection does not necessarily produce correct outcomes, as illustrated by the cases in Table 2.
- Response Length Variations Correlate with Task Complexity: Our experiments reveal distinct behavioral patterns across model scales during RL training. The Qwen2.5-VL-3B-Instruct model demonstrated simultaneous improvements in both answer accuracy and response length under GRPO training. Conversely, while the Qwen2.5-VL-7B-Instruct variant achieved comparable accuracy gains, its response length decreased. This pattern suggests a negative correlation between model capacity and verbosity: larger models may encode information more succinctly when they perceive a task as less challenging. Notably, when applying RL training exclusively to the complex reasoning subset, we observed consistent gains in both metrics, highlighting the method's effectiveness for cognitively demanding tasks.
- Cross-Task Knowledge Transfer via Compositional Reasoning: Following the enhancement of complex visual reasoning capabilities in stage 1, we achieved substantial improvements in the model's performance on visual counting, referring expression comprehension, and visual spatial reasoning tasks using merely 20k mixed samples. This suggests that augmenting complex visual reasoning abilities can positively transfer to simpler tasks.
[Figures: training curves of response length, reward score, GeoQA accuracy, and CVBench accuracy for Ocean-R1-3B-stage1, Ocean-R1-3B-stage2, Ocean-R1-7B-stage1, and Ocean-R1-7B-stage2.]
git clone https://github.com/VLM-RL/Ocean-R1
cd Ocean-R1
conda create -n ocean_r1 python==3.10
conda activate ocean_r1
pip3 install torch==2.4.0 --index-url https://download.pytorch.org/whl/cu124
pip3 install flash-attn --no-build-isolation
git clone https://github.com/volcengine/verl.git
cd verl
pip install -e .
Note
If you encounter bugs when running the scripts, refer to verl-install.
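To quickly sanity-check the environment before launching training, a snippet along these lines can help (illustrative only; not part of the official setup):

```python
# Quick environment sanity check (illustrative; not part of the official setup).
import torch

print("torch:", torch.__version__)             # expect 2.4.0 per the install step above
print("cuda available:", torch.cuda.is_available())

import flash_attn                               # raises ImportError if flash-attn failed to build
print("flash-attn:", flash_attn.__version__)
```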
You can download our training data from Ocean_R1_visual_data_stage1 and Ocean_R1_visual_data_stage2, and the validation data from geoqa_test and cvbench_test. Refer to download_data.py for downloading the datasets and converting them into parquet format. Each entry in our datasets is a dictionary organized in the following format.
data = {
    "data_source": data_source,
    "prompt": [{
        "role": "user",
        "content": prompt,
    }],
    "images": images,
    "ability": "math",  # default
    "reward_model": {
        "style": "rule",
        "ground_truth": answer,
    },
    "extra_info": {
        "index": idx,
        "answer": answer,
        "question": problem,
        "reward_func": reward_func,  # "acc" or "iou"
        "image_paths": image_paths,
    },
}
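download_data.py handles the conversion to parquet; as a rough illustration of what writing such records can look like (a minimal sketch assuming pandas with pyarrow, not the actual script):

```python
# Illustrative sketch (not the actual download_data.py): write records in the
# format above to a parquet file that the trainer can read.
import pandas as pd

def save_as_parquet(records, out_path):
    """records: a list of dicts in the format shown above."""
    pd.DataFrame(records).to_parquet(out_path)

# Minimal dummy record just to show the call; real fields come from the released datasets.
example = {
    "data_source": "geoqa",  # hypothetical value
    "prompt": [{"role": "user", "content": "How many cubes are there?\n..."}],
    "images": ["images/0001.png"],
    "ability": "math",
    "reward_model": {"style": "rule", "ground_truth": "3"},
    "extra_info": {"index": 0, "answer": "3", "question": "How many cubes are there?",
                   "reward_func": "acc", "image_paths": ["images/0001.png"]},
}
save_as_parquet([example], "train.parquet")
```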
The prompt follows this template:
"{Question}\nYou FIRST think about the reasoning process as an internal monologue and then provide the final answer. The reasoning process MUST BE enclosed within <think> </think> tags. The final answer MUST BE in <answer> </answer> tags."
We implement customized reward functions in a separate file and specify them via custom_reward_function.path and custom_reward_function.name. Please refer to ./verl/verl/utils/reward_score/custom_reward_fn.py for more details.
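As a rough illustration of what such a rule-based reward can look like (a sketch under our assumptions, not the code in custom_reward_fn.py): the model's final answer is extracted from the <answer> tags and scored, e.g. by exact-match accuracy for the "acc" reward_func shown in the data format above.

```python
# Illustrative rule-based reward (a sketch, not the actual custom_reward_fn.py).
import re

def extract_answer(response: str) -> str:
    """Pull the content of the <answer>...</answer> tags out of a model response."""
    match = re.search(r"<answer>(.*?)</answer>", response, re.DOTALL)
    return match.group(1).strip() if match else ""

def accuracy_reward(response: str, ground_truth: str) -> float:
    """1.0 if the extracted answer matches the ground truth exactly, else 0.0."""
    return 1.0 if extract_answer(response) == ground_truth.strip() else 0.0
```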
- For a single node:

## 3B
bash ./verl/examples/grpo_trainer/run_qwen25vl-3b_stage1.sh
bash ./verl/examples/grpo_trainer/run_qwen25vl-3b_stage2.sh
## 7B
bash ./verl/examples/grpo_trainer/run_qwen25vl-7b_stage1.sh
bash ./verl/examples/grpo_trainer/run_qwen25vl-7b_stage2.sh

- For multiple nodes:

## 3B
bash ./verl/examples/grpo_trainer/run_qwen25vl-3b_multinodes_stage1.sh
bash ./verl/examples/grpo_trainer/run_qwen25vl-3b_multinodes_stage2.sh
## 7B
bash ./verl/examples/grpo_trainer/run_qwen25vl-7b_multinodes_stage1.sh
bash ./verl/examples/grpo_trainer/run_qwen25vl-7b_multinodes_stage2.sh
cd ./eval_data
wget https://www.cs.jhu.edu/~zhuowan/zhuowan/SuperCLEVR/to_be_released/images.zip
unzip images.zip
# change image dir and the model path in the scripts
python ./eval/test_qwen2d5vl_counting_superclevr_5k.py
We provide an example script to evaluate on the test set (direct-answer form) of GeoQA.
# prepare images for testing
cd ./eval_data
git lfs install
git clone https://huggingface.co/datasets/Luckyjhg/Geo170K
cd Geo170K
unzip images.zip
# change image dir and the model path in the scripts
python ./eval/test_qwen2d5vl_geoqa.py
- Download the COCO Train2014 images and unzip them; we refer to the image directory as <your_image_root>.
- Download the RefCOCO/+/g annotation files and unzip them.
# Remember to change the model path, image root, and annotation path in the script
python ./eval/test_qwen2d5vl_rec.py
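REC predictions are typically scored by intersection-over-union (IoU) between the predicted and ground-truth boxes, with IoU >= 0.5 counted as correct; a minimal sketch of that metric (our assumption about the scoring convention, not a copy of the evaluation script):

```python
# Illustrative IoU computation for REC evaluation (a sketch; the script may differ).
def box_iou(box_a, box_b):
    """Boxes are (x1, y1, x2, y2); returns intersection-over-union in [0, 1]."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# e.g., a prediction counts as correct when box_iou(pred, gt) >= 0.5
```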
python ./eval/test_qwen2d5vl_cvbench.py
We use VLMEvalKit to evaluate the remaining benchmarks.
- Upload the arXiv paper
- Synthesize more high-quality and diverse multimodal data
- Scale up to larger models and more general tasks
We sincerely thank verl (our initial codebase), DeepSeek, Open-R1, QwenVL, Open-R1-Multimodal, R1-V, VLM-R1, CLEVR, SuperCLEVR, RefCOCO, and CVBench for providing the open-source resources used to build this project.
Contributors: Lingfeng Ming, Yadong Li, Song Chen, Jianhua Xu, Zenan Zhou, Weipeng Chen.
If you find this work useful, please cite it as follows:
@misc{ming2025oceanr1,
author = {Lingfeng Ming and Yadong Li and Song Chen and Jianhua Xu and Zenan Zhou and Weipeng Chen},
title = {Ocean-R1: An Open and Generalizable Large Vision-Language Model enhanced by Reinforcement Learning},
howpublished = {\url{https://github.com/VLM-RL/Ocean-R1}},
note = {Accessed: 2025-04-03},
year = {2025}
}