🔥🔥🔥 We have open-sourced our self-developed multimodal GUI visual understanding model, GUIExplorer, together with part of the DeskVision dataset used to train it (the complete dataset is being compiled and will be released later). The model is based on the LLaVA architecture; on open-source GUI understanding benchmarks it achieves visual understanding results comparable to or better than cutting-edge solutions, and in terms of GUI understanding functions it supports visual grounding and single-step instruction execution. We will continue developing the model so that it gains interactive dialogue capabilities and complete GUI agent functionality.
- Inference scripts
- Pre-trained model for GUI understanding (7B)
- Gradio demo (supporting the GUI understanding functions above)
- Technical report or paper
- Training data
- Complete agent model for multi-step execution of complex instructions
- Training scripts
1. Grounding
2. Single-step instruction execution
This project also provides code for deploying a local Gradio demo. You are welcome to deploy it and try it out.
1. Build Environment
git clone https://github.com/MooreThreads/GUIExplorer
cd GUIExplorer
conda create -n llava python=3.10 -y
conda activate llava
pip install --upgrade pip  # Enable PEP 660 support.
pip install -e ".[train]"
# If the command above downloads packages very slowly, you can use the Tsinghua mirror instead:
# pip install -i https://pypi.tuna.tsinghua.edu.cn/simple -e ".[train]"
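After installation, you can optionally run a quick sanity check to confirm that PyTorch was installed with GPU support; this is a generic check, not a script shipped with this repository:

# Optional sanity check (generic, not part of this repo's scripts)
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"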
2. Download weights
Please download the open-source weights from Hugging Face and place them in the ./pretrained_weights folder. Currently, a 7B pre-trained model is provided.
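If you use the Hugging Face CLI, the download might look like the sketch below; the repository id MooreThreads/GUIExplorer is a placeholder and should be replaced with the actual model repository id:

# Hypothetical example: replace the repo id with the actual Hugging Face model id
huggingface-cli download MooreThreads/GUIExplorer --local-dir ./pretrained_weights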
1. Inference commands
cd ./scripts/inference
python infer.py --task xx --input_text xx --input_image xx
Input parameter description:
Parameter | Type | Default | Description |
---|---|---|---|
task | str | "grounding" | Task type; currently "ocr", "grounding", and "instruction" are supported |
input_text | str | "" | Input text. For "ocr", the absolute coordinates "[x1,y1,x2,y2]" of the region to recognize; for "grounding", the content to locate; for "instruction", the instruction to execute |
input_image | str | "" | Path to the input image |
Output description:
A visualization of the OCR, grounding, or instruction-execution result.
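For illustration, hypothetical invocations for the three tasks might look like the lines below; the texts, coordinates, and image path are placeholders rather than files shipped with the repository:

# grounding: locate the element described by the text
python infer.py --task grounding --input_text "the Save button" --input_image /path/to/screenshot.png
# ocr: recognize the text inside the given absolute-coordinate box
python infer.py --task ocr --input_text "[100,200,400,260]" --input_image /path/to/screenshot.png
# instruction: execute a single-step instruction
python infer.py --task instruction --input_text "open the settings menu" --input_image /path/to/screenshot.png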
2. 🎨 Gradio Demo
cd ./scripts/inference
python demo.py
The demo provides several examples of "grounding" and "instruction".
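If demo.py uses Gradio's default launch settings, you can bind a fixed host and port through Gradio's standard environment variables, for example:

# Expose the demo on all interfaces and a fixed port (standard Gradio environment variables)
GRADIO_SERVER_NAME=0.0.0.0 GRADIO_SERVER_PORT=7860 python demo.py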
1. DeskVision
We have open-sourced the code for generating DeskVision data, which includes two tools, Detector and Captioner; for usage details, see ./scripts/DeskVision. We have also open-sourced part of the DeskVision data generated with these tools. For data-compliance reasons, the self-collected image data is provided as URLs; see [🤗Data] for details of the data content. More data will be added in the future. In addition, we have generated region captions for the open-source desktop full-screen images from OS-Atlas and released the corresponding annotations.
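As a minimal sketch, assuming you extract the image URLs from the released annotations into a plain-text file with one URL per line (the file name urls.txt is hypothetical, not part of the release), the images could be fetched locally with wget:

# urls.txt is a hypothetical file containing one image URL per line
wget --no-clobber -i urls.txt -P ./deskvision_images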
2. GUI Understanding Benchmarks
a. ScreenSpot
Models | Model Size | GUI Specific | Mobile Text | Mobile Icon/Widget | Desktop Text | Desktop Icon/Widget | Web Text | Web Icon/Widget | Average |
---|---|---|---|---|---|---|---|---|---|
MiniGPT-v2 | 7B | ❌ | 8.4 | 6.6 | 6.2 | 2.9 | 6.5 | 3.4 | 5.7 |
GPT-4V | - | ❌ | 22.6 | 24.5 | 20.2 | 11.8 | 9.2 | 8.8 | 16.2 |
Qwen2-VL | 7B | ✅ | 61.34 | 39.29 | 51.01 | 44.98 | 33.04 | 21.84 | 42.89 |
Fuyu | 8B | ✅ | 41.0 | 1.3 | 33.0 | 3.6 | 33.9 | 4.4 | 19.5 |
CogAgent | 18B | ✅ | 67.0 | 24.0 | 74.2 | 20.0 | 70.4 | 28.6 | 47.4 |
SeeClick | 9.6B | ✅ | 78.0 | 52.0 | 72.2 | 30.0 | 55.7 | 32.5 | 53.4 |
UGround | 7B | ✅ | 82.8 | 60.3 | 82.5 | 63.6 | 80.4 | 70.4 | 73.3 |
OmniParser(w. LS+ID) | - | ✅ | 93.9 | 57 | 91.3 | 63.6 | 81.3 | 51 | 73 |
OS-Atlas-Base | 7B | ✅ | 93.4 | 72.93 | 91.75 | 62.86 | 90.87 | 74.27 | 82.47 |
GUIExplorer | 7B | ✅ | 89.01 | 77.29 | 88.14 | 75.0 | 82.61 | 81.55 | 82.86 |
b. GUIEnv
Models | EM Score (Bbox2Text) | F1 Score (Bbox2Text) | IoU@0.2 (Text2Bbox) | IoU@0.5 (Text2Bbox) | IoU@0.7 (Text2Bbox) | Center@acc (Text2Bbox) |
---|---|---|---|---|---|---|
MiniCPM-GUI | 44.12 | 64.78 | 68.02 | 47.96 | 23.28 | - |
SeeClick | 5.19 | 8.59 | 53.34 | 24.58 | 5.55 | 56.85 |
UGround | - | - | - | - | - | 63.76 |
OS-Atlas-Base | 42.33 | 60.51 | 76.33 | 59.68 | 41.9 | 75.76 |
GUIExplorer | 54.60 | 78.71 | 88.51 | 82.56 | 62.17 | 87.66 |
If you use GUIExplorer or the DeskVision dataset in your research, please cite our [📝Paper]:
@misc{xu2025deskvisionlargescaledesktop,
title={DeskVision: Large Scale Desktop Region Captioning for Advanced GUI Agents},
author={Yibin Xu and Liang Yang and Hao Chen and Hua Wang and Zhi Chen and Yaohua Tang},
year={2025},
eprint={2503.11170},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2503.11170},
}
The open-source code, models, and datasets of this project are intended for academic research. We explicitly disclaim any responsibility for user-generated content. Users are solely responsible for their own actions when using the models and related datasets. Project contributors have no legal relationship with users and assume no responsibility for users' actions.
We are very grateful to LLaVA-OneVision, OS-Atlas, SeeClick, and other excellent open-source projects and datasets.