GUI-R1: A Generalist R1-style Vision-Language Action Model For GUI Agents

The official repo for "GUI-R1: A Generalist R1-style Vision-Language Action Model For GUI Agents".

🤗 GUI-R1-3K   |   🤗 GUI-R1   |   📑 Paper  

News

Our Exploration

GUI-R1 is trained on a small amount of carefully curated, high-quality data spanning multiple platforms (Windows, Linux, macOS, Android, and Web), and the model is updated with a policy optimization algorithm such as group relative policy optimization (GRPO). Using only 0.02% of the data (3K vs. 13M), it outperforms previous state-of-the-art methods such as OS-Atlas across eight benchmarks spanning three platforms (mobile, desktop, and web). These results demonstrate the potential of reinforcement learning based on unified action space rule modeling for improving the execution capabilities of LVLMs on real-world GUI agent tasks.

Framework

Given the high-level instruction, action history, and visual image inputs, the policy model generates multiple responses containing reasoning steps. Verifiable rewards, such as the action type reward, click point reward, and input text reward, are then used with a policy gradient optimization algorithm to update the policy model.
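To make the reward step more concrete, here is a minimal Python sketch of rule-based rewards and a group-relative advantage of the kind GRPO uses. It is not the repository's implementation. The `Action` class, the function names, the in-box test for click points, the case-insensitive text match, and the unweighted sum of rewards are all assumptions made for illustration.

```python
from dataclasses import dataclass
from statistics import mean, pstdev
from typing import List, Optional, Sequence, Tuple


@dataclass
class Action:
    """Hypothetical parsed model action; field names are illustrative only."""
    action_type: str
    point: Optional[Tuple[float, float]] = None  # predicted (x, y) click point
    text: Optional[str] = None                   # typed text, if any


def action_type_reward(pred: Action, gt_type: str) -> float:
    # 1.0 when the predicted action type matches the ground truth, else 0.0.
    return 1.0 if pred.action_type == gt_type else 0.0


def click_point_reward(pred: Action, gt_box: Tuple[float, float, float, float]) -> float:
    # Assumption: a click is correct when it lands inside the ground-truth box
    # given as (x1, y1, x2, y2).
    if pred.point is None:
        return 0.0
    x, y = pred.point
    x1, y1, x2, y2 = gt_box
    return 1.0 if x1 <= x <= x2 and y1 <= y <= y2 else 0.0


def input_text_reward(pred: Action, gt_text: str) -> float:
    # Assumption: case-insensitive exact match after trimming whitespace.
    if pred.text is None:
        return 0.0
    return 1.0 if pred.text.strip().lower() == gt_text.strip().lower() else 0.0


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-6) -> List[float]:
    # Normalize each reward by the mean and standard deviation of its own group
    # of sampled responses; eps avoids division by zero for uniform groups.
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# One group of sampled responses for the same instruction (made-up values).
responses = [
    Action("click", point=(120, 40)),
    Action("click", point=(400, 300)),
    Action("type", text="hello"),
    Action("click", point=(130, 45)),
]
gt_type, gt_box = "click", (100, 20, 160, 60)

# Assumption: rewards are summed with equal weight.
rewards = [action_type_reward(a, gt_type) + click_point_reward(a, gt_box) for a in responses]
advantages = group_relative_advantages(rewards)
```

Responses that score above the group mean receive a positive advantage and those below receive a negative one, so the update needs no separate value model.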

Result

Requirements

We recommend using the pre-built Docker image from EasyR1.

# stable

docker pull hiyouga/verl:ngc-th2.5.1-cu120-vllm0.7.4-hotfix

# nightly

docker pull hiyouga/verl:ngc-th2.6.0-cu120-vllm0.8.2

Data preparation

Download the training and evaluation dataset GUI-R1-3K.

The structure of the directory should be:

│──Dataset
│	 ├──train.parquet
│	 ├──test.parquet
│	 ├──androidcontrol_high_test.parquet
│	 ├──androidcontrol_low_test.parquet
│	 ├──guiact_web_test.parquet
│	 ├──guiodyssey_test.parquet
│	 ├──omniact_web_test.parquet
│	 ├──omniact_desktop_test.parquet
│	 ├──screenspot_pro_test.parquet
│	 ├──screenspot_test.parquet
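Before starting training, it can help to confirm that the download matches the layout above. The following sketch only checks that each listed file exists. The helper name `missing_dataset_files` is hypothetical, and the `Dataset` path is an assumption about where the files were placed.

```python
from pathlib import Path
from typing import List

# File names taken from the directory listing above.
EXPECTED_FILES = [
    "train.parquet",
    "test.parquet",
    "androidcontrol_high_test.parquet",
    "androidcontrol_low_test.parquet",
    "guiact_web_test.parquet",
    "guiodyssey_test.parquet",
    "omniact_web_test.parquet",
    "omniact_desktop_test.parquet",
    "screenspot_pro_test.parquet",
    "screenspot_test.parquet",
]


def missing_dataset_files(dataset_dir: str) -> List[str]:
    """Return the expected parquet files that are not present in dataset_dir."""
    root = Path(dataset_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


if __name__ == "__main__":
    # Assumption: the dataset was downloaded into ./Dataset.
    missing = missing_dataset_files("Dataset")
    print("All files present." if not missing else f"Missing files: {missing}")
```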

RL Training

bash examples/qwen2_5_vl_3b_gui_grpo.sh
bash examples/qwen2_5_vl_7b_gui_grpo.sh

Inference and Evaluation

cd guir1
bash inference.sh
bash eval.sh

Acknowledgements

We would like to express our sincere gratitude to DeepSeek, VLM-R1, QwenVL, EasyR1, and OS-ATLAS for providing open-source resources that contributed to the development of this project.

Citation

If you find this repo useful for your research, please consider citing the paper:

@article{luo2025gui,
  title={GUI-R1: A Generalist R1-Style Vision-Language Action Model For GUI Agents},
  author={Luo, Run and Wang, Lu and He, Wanwei and Xia, Xiaobo},
  journal={arXiv preprint arXiv:2504.10458},
  year={2025}
}
