The model checkpoint and training datasets are under review due to company policies and will be released soon. Thank you for your patience.
- [2025/9/16] The GUI-G1-3B-v1 model from our paper is released.
- [2025/5/22] Our code is released.
- [2025/5/22] Our paper is released.
This repository is based on VLM-R1, with several improvements and adaptations for our use case, particularly to the template, the reward functions, and the GRPO objective.
In this work, we build upon the original VLM-R1 framework and introduce GUI-G1, a VLM fine-tuned for GUI grounding.
- Introduced a Fast Thinking Template that requires no model reasoning, accelerating training and inference
- Utilized diverse reward functions (Hit, IoU, Box) to prevent reward hacking and achieve multi-objective optimization
- Removed length correction from the GRPO objective and added a difficulty coefficient to enhance model robustness
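As a rough illustration of why more than one reward signal helps, the sketch below implements a Hit reward and an IoU reward in plain Python. It is not the code in this repository: the function names, the `(x1, y1, x2, y2)` box format, and the definition of Hit as "the center of the predicted box lies inside the ground-truth box" are assumptions made for this example. The Box reward is not sketched because its exact form is not described here.

```python
from typing import Tuple

# Assumed box format for this sketch: (x1, y1, x2, y2) with x1 <= x2 and y1 <= y2.
Box = Tuple[float, float, float, float]


def hit_reward(pred: Box, gt: Box) -> float:
    """Return 1.0 if the center of `pred` lies inside `gt`, else 0.0 (assumed definition)."""
    cx = (pred[0] + pred[2]) / 2.0
    cy = (pred[1] + pred[3]) / 2.0
    inside = gt[0] <= cx <= gt[2] and gt[1] <= cy <= gt[3]
    return 1.0 if inside else 0.0


def iou_reward(pred: Box, gt: Box) -> float:
    """Return the intersection-over-union of the two boxes, in [0, 1]."""
    ix1, iy1 = max(pred[0], gt[0]), max(pred[1], gt[1])
    ix2, iy2 = min(pred[2], gt[2]), min(pred[3], gt[3])
    # Clamp at zero so disjoint boxes produce an empty intersection.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_pred = (pred[2] - pred[0]) * (pred[3] - pred[1])
    area_gt = (gt[2] - gt[0]) * (gt[3] - gt[1])
    union = area_pred + area_gt - inter
    return inter / union if union > 0 else 0.0


if __name__ == "__main__":
    gt = (4.0, 4.0, 6.0, 6.0)
    oversized = (0.0, 0.0, 10.0, 10.0)  # covers the target but is far too large
    print(hit_reward(oversized, gt), iou_reward(oversized, gt))
```

In this toy case the oversized prediction still earns the full Hit reward, because its center falls on the target, while its IoU is low. That gap is one way a single reward can be exploited, and it suggests why combining complementary rewards may discourage such shortcuts.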
conda create -n myproject python=3.10
conda activate myproject
bash setup.sh

Follow the steps below to prepare data and train the model:
- [Data preparation instructions customized for your setup]
- [Reference to your configuration files or modified scripts]
- Use the following command to launch training:
bash src/open-r1-multimodal/run_scripts/run.sh

| Model | ScreenSpot | ScreenSpot-Pro |
|---|---|---|
| InfiGUI-R1-3B | 87.5 | 35.7 |
| GUI-G1-3B | 90.3 | 37.1 |
This repository builds upon the great work from:
We thank the authors for their open-source contributions.
If you find our code or work useful for your research, please cite our work:
@article{zhou2025guig1,
title = {GUI-G1: Understanding R1-Zero-Like Training for Visual Grounding in GUI Agents},
author = {Zhou, Yuqi and Dai, Sunhao and Wang, Shuai and Zhou, Kaiwen and Jia, Qinglin and Xu, Jun},
journal = {arXiv preprint arXiv:2505.15810},
year = {2025}
}