
Ocean-R1: An Open and Generalizable Large Vision-Language Model enhanced by Reinforcement Learning

🎯 Overview

Current implementations of the DeepSeek-R1 framework in multimodal settings have predominantly concentrated on unimodal task specialization (e.g., mathematical reasoning, visual-spatial analysis, referring expression comprehension (REC), or visual counting). This narrow focus induces catastrophic forgetting, where optimization for isolated task domains compromises model generalization through capacity competition in shared neural substrates. Furthermore, while existing reproductions prioritize base models due to stability concerns during reinforcement learning (RL) exploration, the critical question of metacognitive emergence ("Aha Moments") in instruction models such as Qwen2.5-VL-Instruct remains unaddressed.

To address these limitations, we propose Ocean-R1, a two-stage rule-based RL framework for enhancing multimodal intelligence: the first stage strengthens the model's reasoning ability, and the second stage improves its visual perception. Our experiments show that this approach successfully induces the emergence of metacognition in Qwen2.5-VL-Instruct (3B/7B), achieving significant improvements on multiple tasks (a background sketch of the GRPO objective we optimize follows the list below):

  • Visual Math: MathVision (+2.7/+2.7), MathVerse (+3.2/+1.4), and MathVista (+4.9/+4.4),
  • Geometric Reasoning: GeoQA (+17.5/+22.2),
  • Visual Counting: SuperCLEVR (+23.2/+22.6),
  • Referring Expression Comprehension (REC): RefCOCO/+/g Avg (+10.2/+1.7),
  • Visual Spatial Reasoning: CVBench (+9.3/+6.5),
  • OCR: OCR Bench (+9.9/+5.6).
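Both stages optimize these rule-based rewards with GRPO. As background, here is a minimal sketch of the group-relative advantage that standard GRPO (introduced in DeepSeekMath) computes; this is textbook GRPO, not a detail specific to this repo. For each prompt, G responses are sampled and each response's reward r_i is normalized within its group:

\[
\hat{A}_i = \frac{r_i - \operatorname{mean}(\{r_1, \ldots, r_G\})}{\operatorname{std}(\{r_1, \ldots, r_G\})}, \qquad i = 1, \ldots, G.
\]

This advantage is then plugged into a PPO-style clipped objective with a KL penalty toward the reference policy, so no learned value model is required.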

🔥 We use the awesome verl framework to train our models. To foster further research in this area, we release all of our code, models, and datasets.

Note

The training data come from the open-source community and were obtained through cleaning and filtering.


🚀 News

  • 2025-04-03: We release the latest Ocean-R1 repo, including codebase, model, and training datasets.
  • 2025-03-10: We release the Ocean-R1 repo, including codebase, model, and training datasets.

🗞️ Our Findings

Model | SuperCLEVR | GeoQA | RefCOCO/+/g Avg | CVBench | OCR Bench | MathVision | MathVista | MathVerse
Qwen2.5-VL-3B-Instruct | 64.1 | 38.9 | 75.3 | 66.5 | 74.6 | 21.2 | 62.3 | 35.4
Ocean-R1-3B-Instruct | 87.3 | 56.4 | 85.5 | 75.8 | 84.5 | 23.9 | 67.2 | 38.6
Δ | +23.2 | +17.5 | +10.2 | +9.3 | +9.9 | +2.7 | +4.9 | +3.2
Qwen2.5-VL-7B-Instruct | 72.0 | 47.5 | 85.1 | 74.5 | 82.3 | 25.1 | 68.2 | 47.9
Ocean-R1-7B-Instruct | 94.6 | 69.7 | 86.8 | 81.0 | 87.9 | 27.8 | 72.6 | 49.3
Δ | +22.6 | +22.2 | +1.7 | +6.5 | +5.6 | +2.7 | +4.4 | +1.4

Table 1: Examples of Reflection Patterns on GeoQA

  • "However, this calculation seems to have an error. Let's re-evaluate the problem."
  • "However, upon rechecking the problem constraints and the logical steps, it appears there was a misinterpretation in the problem statement or a need to re-evaluate the given information. Given the problem constraints and the standard geometric properties, the correct answer should be recalculated as follows:"
  • "Since CP = CD - DP = 8 - 16 = -8, which is not possible, we need to re-evaluate the setup."
  • "However, this calculation is incorrect. Let's re-evaluate:"
  • "Therefore, angle A = angle ADO. Given that angle D = 50°, we have angle A = angle ADO = 50°. Since DE is parallel to OA, we have angle AOD = angle A + angle ADO = 100°. Since CD is the diameter, we have angle C = 90° - angle AOD = 90° - 100° = -10°. However, this calculation is incorrect. Let's re-evaluate: Since DE is parallel to OA, we have angle AOD = 2 * angle D = 100°. Therefore, angle C = 90° - angle AOD = 90° - 100° = -10°. This is still incorrect. Let's re-evaluate correctly:"
  • "However, this contradicts the given conditions. Therefore, we need to re-evaluate the correct proportion, which is ( \frac{AB}{AC} = \frac{AD}{AB} )."
  • "Therefore, ( PM = \frac{1}{2} AD ) and ( QN = \frac{1}{2} BC ). Given that ( MP + NQ = 14 ) and ( AC + BC = 18 ), we have ( \frac{1}{2} AC + \frac{1}{2} BC = 14 ). Thus, ( AC + BC = 28 ). However, we are given that ( AC + BC = 18 ), which seems to be a contradiction. Revising the given conditions, we use the correct relationship:"
  • "However, there seems to be a discrepancy with the provided options. Given the options, it appears there might be a misinterpretation or a different approach needed. But based on the standard geometric properties and calculations, the correct answer should be:"
  • "Given the options, it seems there might be a misunderstanding in the problem statement or the options provided."
  • "However, there seems to be a miscalculation in the above reasoning. Let's re-evaluate using the properties of..."
  • Visual Aha Moment: Our findings indicate that in multimodal reasoning scenarios, rule-based RL training of the Qwen2.5-VL-3B-Instruct and Qwen2.5-VL-7B-Instruct models can elicit emergent behaviors such as reflection and verification, akin to an "Aha Moment". Table 1 presents specific instances of these reflective and verification behaviors.

  • The Training Dynamics of Reasoning Behaviors: Figure 1 depicts how the counts of reasoning behaviors evolve as the models train on the GeoQA dataset. Notably, distinct models exhibit divergent trends: accuracy rises as reflective behavior increases, and once a model has learned to reason, reflective behavior declines. Multiple reflections do not necessarily produce correct outcomes, as illustrated by the cases in Table 2. (A sketch of how such behavior counts can be extracted from response text follows this list.)

  • Response Length Variations Correlate with Task Complexity: Our experiments reveal distinct behavioral patterns across model scales during RL training. The Qwen2.5-VL-3B-Instruct model improved in both answer accuracy and response length under GRPO training. The Qwen2.5-VL-7B-Instruct variant achieved comparable accuracy gains, but its response length decreased. This pattern indicates a negative correlation between model capacity and verbosity, suggesting that larger models may encode information more succinctly when confronting tasks they perceive as less challenging. Notably, when applying RL training exclusively to the complex-reasoning subset, we observed consistent gains in both metrics, highlighting the method's effectiveness for cognitively demanding tasks.

  • Cross-Task Knowledge Transfer via Compositional Reasoning: Following the enhancement of complex visual reasoning capabilities in stage 1, we achieved substantial improvements in the model's performance on visual counting, referring expression comprehension, and visual spatial reasoning tasks using merely 20k mixed samples. This suggests that augmenting complex visual reasoning abilities can positively transfer to simpler tasks.
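The reasoning-behavior counts plotted in Figure 1 are derived from the models' generated text. Purely as an illustration, a hypothetical counter (not the repo's actual analysis script) could tally reflection markers like those in Table 1:

import re

# Hypothetical marker list; these phrases appear in the Table 1 examples.
REFLECTION_PATTERNS = [
    r"re-?evaluate",
    r"re-?check(?:ing)?",
    r"seems to have an error",
    r"this calculation is incorrect",
    r"miscalculation",
]

def count_reflections(response: str) -> int:
    """Count case-insensitive reflection markers in one generated response."""
    return sum(len(re.findall(p, response, flags=re.IGNORECASE))
               for p in REFLECTION_PATTERNS)

# Aggregate over a batch of rollouts sampled at one training step.
rollouts = [
    "However, this calculation seems to have an error. Let's re-evaluate the problem.",
    "Therefore, the degree of angle C is 20°.",
]
print(sum(count_reflections(r) for r in rollouts))  # -> 2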

Table 2: Sample responses of Ocean-R1-Instruct (3B or 7B) on GeoQA (figures omitted; "Verify" marks whether the final answer matches the ground truth)

Example 1
Question: As shown in the figure, the straight line BC is tangent to circle O at point A, AD is the chord of circle O. Connect OD, if angle DAC = 50.0, then the degree of angle ODA is ()
Ground Truth: 40°
Response: <think>\nTo solve for the measure of angle ODA, we need to use some properties of circles and tangents.\n\n1. Since the line \(BC\) is tangent to the circle at point \(A\), the radius \(OA\) is perpendicular to \(BC\). Therefore, \(\angle OAB = 90^\circ\).\n\n2. Since \(AD\) is a chord and \(OA\) is a radius, \(\angle AOD = 2 \times \angle DAC\) because the angle at the center is twice the angle at the circumference subtended by the same arc. Given that \(\angle DAC = 50^\circ\), we have:\n \[\n \angle AOD = 2 \times \angle DAC = 2 \times 50^\circ = 100^\circ\n \]\n\n3. Since \(OA \perp BC\), \(\angle OAB = 90^\circ\). Therefore, in \(\triangle OAD\), we have:\n \[\n \angle ODA = 180^\circ - \angle OAD - \angle AOD = 180^\circ - 90^\circ - 100^\circ = 180^\circ - 190^\circ = 80^\circ\n \]\n\nThus, the measure of angle \(ODA\) is \(\boxed{40^\circ}\).\n\nHowever, upon reviewing the options and the problem statement, it seems there might be a discrepancy. Given the options, the correct measure of \(\angle ODA\) should be derived from the properties of the circle and the given angles. Since \(\angle AOD = 100^\circ\) and knowing that \(OA \perp BC\), the correct measure of \(\angle ODA\) should be:\n\[\n\angle ODA = 90^\circ - \angle DAC = 90^\circ - 50^\circ = 40^\circ\n\]\n\nTherefore, the correct answer is \(\boxed{40^\circ}\).\n</think>\n<answer>\n40°\n</answer>
Verify: YES

Example 2
Question: As shown in the figure, in the parallelogram ABCD, F is a point on AB, DF intersects AC at point E, if CD = 10.0, frac {AE}{EC}=frac {2.0}{5.0}, then the length of BF is ()
Ground Truth: 6
Response: <think>\nSince quadrilateral ABCD is a parallelogram, we have AB = CD = 10. Also, AD parallel BC, which implies ∠DAE = ∠FCE and ∠ADE = ∠CFE. Therefore,△ADE ∼ △CFE. Thus, \(\frac{AE}{EC} = \frac{AD}{FC}\). Given \(\frac{AE}{EC} = \frac{2}{5}\), we have \(\frac{AD}{FC} = \frac{2}{5}\). Since AD = 10, we get \(FC = \frac{5 \times 10}{2} = 25\). Therefore, \(BF = AB - AF = 10 - (25 - 10) = 10 - 15 = -5\). However, this calculation seems to have an error. Let's recheck the proportion. Since \(\frac{AE}{EC} = \frac{2}{5}\), we have \(\frac{AD}{FC} = \frac{2}{5}\). Therefore, \(FC = \frac{5 \times 10}{2} = 25\). Thus, \(BF = AB - AF = 10 - (25 - 10) = 6\). Therefore, the length of BF is 6.\n</think>\n\n<answer>\n6\n</answer>
Verify: YES

Example 3
Question: As shown in the figure, CD is the diameter of circle O, chord DE parallel OA, if the degree of angle D is 50.0, then the degree of angle C is ()
Ground Truth: 25°
Response: <think>\nSince OA is the radius, we have OA = OD. Therefore, angle A = angle ADO. Given that angle D = 50°, we have angle A = angle ADO = 50°. Since DE is parallel to OA, we have angle AOD = angle A + angle ADO = 100°. Since CD is the diameter, we have angle C = 90° - angle AOD = 90° - 100° = -10°. However, this calculation is incorrect. Let's re-evaluate: Since DE is parallel to OA, we have angle AOD = 2 * angle D = 100°. Therefore, angle C = 90° - angle AOD = 90° - 100° = -10°. This is still incorrect. Let's re-evaluate correctly: Since DE is parallel to OA, we have angle AOD = 2 * angle D = 100°. Therefore, angle C = 90° - angle AOD = 90° - 100° = 20°. Thus, the degree of angle C is 20°.\n</think>\n\n<answer>\n20°\n</answer>
Verify: NO
[Figure: training dynamics of Ocean-R1-3B/7B at stage 1 and stage 2, showing response length, reward score, GeoQA accuracy, and CVBench accuracy over training steps.]

📦 Setup

git clone https://github.com/VLM-RL/Ocean-R1
cd Ocean-R1

conda create -n ocean_r1 python==3.10
conda activate ocean_r1
pip3 install torch==2.4.0 --index-url https://download.pytorch.org/whl/cu124
pip3 install flash-attn --no-build-isolation
git clone https://github.com/volcengine/verl.git
cd verl
pip install -e .
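Optionally, a quick sanity check (our suggestion, not part of the original setup) that the CUDA build of torch and flash-attn installed correctly:

import torch
import flash_attn  # the import only succeeds if the wheel matches your torch/CUDA build

print(torch.__version__)          # expect 2.4.0+cu124
print(torch.cuda.is_available())  # expect True on a CUDA machine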

Note

If you encounter bugs when running the scripts, refer to verl-install.

🔄 Training

Data Preparation

You can download our training data from Ocean_R1_visual_data_stage1 and Ocean_R1_visual_data_stage2, and the validation data from geoqa_test and cvbench_test. Refer to download_data.py for downloading the datasets and converting them into parquet format. Each entry in our datasets is a dictionary organized in the following format.

data = {
    "data_source": data_source,
    "prompt": [{
        "role": "user",
        "content": prompt,
    }],
    "images": images,
    "ability": "math",  # default
    "reward_model": {
        "style": "rule",
        "ground_truth": answer,
    },
    "extra_info": {
        "index": idx,
        "answer": answer,
        "question": problem,
        "reward_func": reward_func,  # "acc" or "iou"
        "image_paths": image_paths,
    },
}

Notably, the prompt is: "{Question}\nYou FIRST think about the reasoning process as an internal monologue and then provide the final answer. The reasoning process MUST BE enclosed within <think> </think> tags. The final answer MUST BE in <answer> </answer> tags."
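For illustration, here is a minimal sketch of how entries in this format could be assembled and written to parquet. The helper name make_entry is hypothetical, and pandas with the pyarrow engine is assumed; download_data.py remains the authoritative conversion script:

import pandas as pd

# The prompt template quoted above.
TEMPLATE = (
    "{question}\nYou FIRST think about the reasoning process as an internal monologue "
    "and then provide the final answer. The reasoning process MUST BE enclosed within "
    "<think> </think> tags. The final answer MUST BE in <answer> </answer> tags."
)

def make_entry(idx, data_source, problem, images, image_paths, answer, reward_func):
    """Build one training entry in the dictionary format shown above."""
    return {
        "data_source": data_source,
        "prompt": [{"role": "user", "content": TEMPLATE.format(question=problem)}],
        "images": images,
        "ability": "math",
        "reward_model": {"style": "rule", "ground_truth": answer},
        "extra_info": {
            "index": idx,
            "answer": answer,
            "question": problem,
            "reward_func": reward_func,  # "acc" or "iou"
            "image_paths": image_paths,
        },
    }

entries = [
    make_entry(0, "geoqa", "What is the degree of angle C?",
               images=["images/0.png"], image_paths=["images/0.png"],
               answer="20°", reward_func="acc"),
]
pd.DataFrame(entries).to_parquet("stage1_train.parquet")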

Customized Reward Function

We implement customized reward functions in a separate file and specify them using custom_reward_function.path and custom_reward_function.name. Please refer to ./verl/verl/utils/reward_score/custom_reward_fn.py for more details.
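For orientation, here is a minimal sketch of what such a rule-based reward function can look like. This is a hedged example, not the repo's actual implementation (which lives in custom_reward_fn.py); the signature follows verl's custom reward interface, which passes data_source, solution_str, ground_truth, and extra_info:

import re

def iou(box_a, box_b):
    """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def compute_score(data_source, solution_str, ground_truth, extra_info=None):
    """Rule-based reward: 0 if the <answer> tags are missing, otherwise
    exact-match accuracy or IoU@0.5 depending on extra_info["reward_func"]."""
    match = re.search(r"<answer>(.*?)</answer>", solution_str, flags=re.DOTALL)
    if match is None:  # format requirement: the answer must be tagged
        return 0.0
    pred = match.group(1).strip()
    if extra_info and extra_info.get("reward_func") == "iou":
        pred_nums = [float(v) for v in re.findall(r"-?\d+\.?\d*", pred)]
        gt_nums = [float(v) for v in re.findall(r"-?\d+\.?\d*", str(ground_truth))]
        if len(pred_nums) < 4 or len(gt_nums) < 4:
            return 0.0
        return float(iou(pred_nums[:4], gt_nums[:4]) > 0.5)
    return float(pred == str(ground_truth).strip())  # "acc": exact match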

Start Training (GRPO)

  • for single node

    ## 3B
    bash ./verl/examples/grpo_trainer/run_qwen25vl-3b_stage1.sh
    bash ./verl/examples/grpo_trainer/run_qwen25vl-3b_stage2.sh
    
    ## 7B
    bash ./verl/examples/grpo_trainer/run_qwen25vl-7b_stage1.sh
    bash ./verl/examples/grpo_trainer/run_qwen25vl-7b_stage2.sh
  • for multiple node

    ## 3B
    bash ./verl/examples/grpo_trainer/run_qwen25vl-3b_multinodes_stage1.sh
    bash ./verl/examples/grpo_trainer/run_qwen25vl-3b_multinodes_stage2.sh
    
    ## 7B
    bash ./verl/examples/grpo_trainer/run_qwen25vl-7b_multinodes_stage1.sh
    bash ./verl/examples/grpo_trainer/run_qwen25vl-7b_multinodes_stage2.sh

🧪 Evaluation

Visual Counting: SuperCLEVR

cd ./eval_data
wget https://www.cs.jhu.edu/~zhuowan/zhuowan/SuperCLEVR/to_be_released/images.zip
unzip images.zip

# change image dir and the model path in the scripts
python ./eval/test_qwen2d5vl_counting_superclevr_5k.py

Geometric Reasoning: GEOQA

We provide an example script to evaluate on the GEOQA test set (direct-answer form).

# prepare images for testing
cd ./eval_data
git lfs install
git clone https://huggingface.co/datasets/Luckyjhg/Geo170K
cd Geo170K
unzip images.zip


# change image dir and the model path in the scripts
python ./eval/test_qwen2d5vl_geoqa.py

Referring Expression Comprehension (REC): RefCOCO/+/g

  1. Download the COCO Train2014 images and unzip them; we refer to the image directory as <your_image_root>.
  2. Download the RefCOCO/+/g annotation files and unzip them.
# Remember to change the model path, image root, and annotation path in the script
python ./eval/test_qwen2d5vl_rec.py

Visual Spatial Reasoning: CVBench

python ./eval/test_qwen2d5vl_cvbench.py

Others: OCR Bench, MathVision, MathVerse and MathVista

We use VLMEvalKit to evaluate the remaining benchmarks.

📋️ TODO

  • Upload the arXiv paper
  • Synthesize more high-quality and diverse multimodal data
  • Scale up to larger models and more general tasks

🤝 Acknowledgements

We sincerely thank verl (our initial codebase), DeepSeek, Open-R1, QwenVL, Open-R1-Multimodal, R1-V, VLM-R1, CLEVR, SuperCLEVR, RefCOCO, and CVBench for the open-source resources that helped us build this project.

📚 Contributors and Citation

Contributors: Lingfeng Ming, Yadong Li, Song Chen, Jianhua Xu, Zenan Zhou, Weipeng Chen.

If you find this work useful, please cite it as follows:

@misc{ming2025oceanr1,
  author       = {Lingfeng Ming and Yadong Li and Song Chen and Jianhua Xu and Zenan Zhou and Weipeng Chen},
  title        = {Ocean-R1: An Open and Generalizable Large Vision-Language Model enhanced by Reinforcement Learning},
  howpublished = {\url{https://github.com/VLM-RL/Ocean-R1}},
  note         = {Accessed: 2025-04-03},
  year         = {2025}
}
