EvolvingLMMs-Lab/MGPO


High-Resolution Visual Reasoning via Multi-Turn Grounding-Based Reinforcement Learning


💡 Introduction

Inspired by the human visual system's top-down, task-driven search, we propose Multi-turn Grounding-based Policy Optimization (MGPO). MGPO equips LMMs with interpretable, iterative visual grounding: the model predicts key regions, crops sub-images, and reasons over both the original and focused views.

Key advantages:

  • Interpretable, Top-down Visual Reasoning: MGPO highlights which image regions are attended to at each step.
  • Breaks Pixel Limits: Even if the full image is blurry due to resizing, MGPO identifies and crops clear sub-images for further analysis.
  • No Extra Grounding Annotations Needed: MGPO is trained only with binary answer correctness, yet learns robust grounding.
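The crop step behind "Breaks Pixel Limits" can be illustrated with a minimal sketch. It assumes the model predicts a box in the pixel coordinates of the resized input and that the crop is taken from the original high-resolution image. The function name `scale_box_to_original`, the `(left, top, right, bottom)` convention, and the outward rounding are assumptions made for illustration, not the repository's actual implementation.

```python
import math
from typing import Tuple

Box = Tuple[float, float, float, float]  # (left, top, right, bottom), hypothetical convention


def scale_box_to_original(
    box: Box,
    resized_size: Tuple[int, int],
    original_size: Tuple[int, int],
) -> Tuple[int, int, int, int]:
    """Map a box predicted on the resized image to original-image pixels.

    Sizes are (width, height). The result is clamped to the original image.
    """
    resized_w, resized_h = resized_size
    original_w, original_h = original_size
    scale_x = original_w / resized_w
    scale_y = original_h / resized_h

    left, top, right, bottom = box
    # Tolerate swapped corners in the predicted box.
    left, right = sorted((left, right))
    top, bottom = sorted((top, bottom))

    # Round outward so the crop never loses part of the predicted region,
    # then clamp to the image bounds.
    return (
        max(0, math.floor(left * scale_x)),
        max(0, math.floor(top * scale_y)),
        min(original_w, math.ceil(right * scale_x)),
        min(original_h, math.ceil(bottom * scale_y)),
    )


# A 200x150 region on a 1000x500 input covers 800x600 pixels of a 4000x2000 original.
crop_box = scale_box_to_original((100, 50, 300, 200), (1000, 500), (4000, 2000))
print(crop_box)
```

The resulting tuple is in the form accepted by Pillow's `Image.crop`, so the sub-image returned to the model can keep the detail that was lost when the full image was downscaled.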

🚀 Training Code

Our code is based on verl. The training code and script are available at:

https://github.com/xinyu1205/verl/blob/mgpo/examples/grpo_trainer/run_qwen2_5_vl-7b_mgpo.sh

🧰 Experiments

Visualizations

(Examples from models trained with multi-turn grounding-based RL on high-resolution real-world tasks. The model first identifies key regions, which are then automatically cropped and returned as sub-images. Notably, although the only reward is a binary signal derived from the correctness of the final answer, robust grounding capability gradually emerges during RL training.)
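A binary reward of the kind mentioned above could look like the following minimal sketch. The function name `binary_answer_reward` and the whitespace and case normalization are assumptions for illustration; the answer-matching rule used in the actual training code may differ. The point is that the reward contains no term for box quality, so any grounding ability has to emerge from answer correctness alone.

```python
def binary_answer_reward(predicted: str, reference: str) -> float:
    """Return 1.0 if the final answer matches the reference, else 0.0.

    Hypothetical matching rule: compare after trimming, lowercasing,
    and collapsing runs of whitespace.
    """
    def normalize(text: str) -> str:
        return " ".join(text.strip().lower().split())

    return 1.0 if normalize(predicted) == normalize(reference) else 0.0


# Only the final answer is scored; predicted boxes are never compared to labels.
print(binary_answer_reward(" B ", "b"))
print(binary_answer_reward("A", "B"))
```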

Main Results

  • MGPO outperforms both SFT and GRPO on high-resolution tasks.
  • +5.4% on MME-Realworld (in-distribution) and +5.2% on V* Bench (out-of-distribution) over the GRPO baseline.
  • Surpasses OpenAI’s o1 and GPT-4o on V* Bench, despite using a smaller model and less data.

✒️ Citation

If you find our work useful for your research, please consider citing it.

@misc{huang2025highresolutionvisualreasoningmultiturn,
      title={High-Resolution Visual Reasoning via Multi-Turn Grounding-Based Reinforcement Learning}, 
      author={Xinyu Huang and Yuhao Dong and Weiwei Tian and Bo Li and Rui Feng and Ziwei Liu},
      year={2025},
      eprint={2507.05920},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2507.05920}, 
}
