A simple screen parsing tool towards a pure vision-based GUI agent. Runs on CPU or GPU. The parent Microsoft version is hard to run because of incorrect commands; this repository works out of the box.


OmniParser: Screen Parsing tool for Pure Vision Based GUI Agent


📢 [Project Page] [Blog Post] [Models] [Huggingface demo]

OmniParser is a comprehensive method for parsing user interface screenshots into structured and easy-to-understand elements, which significantly enhances the ability of GPT-4V to generate actions that can be accurately grounded in the corresponding regions of the interface.
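To make the "structured elements" idea concrete, here is a minimal sketch of how parsed UI elements (bounding box plus functional description) could be serialized into a text prompt a vision-language model can ground its actions to. The element schema and function below are illustrative assumptions, not the repo's actual data structures.

```python
# Hypothetical sketch: OmniParser-style output is assumed to be a list of
# detected UI elements, each with a bounding box and a caption. A downstream
# agent can render that list as numbered lines for GPT-4V to reference.

def elements_to_prompt(elements):
    """Render parsed UI elements as numbered, groundable text lines."""
    lines = []
    for i, el in enumerate(elements):
        x1, y1, x2, y2 = el["bbox"]
        lines.append(f"[{i}] {el['caption']} at ({x1}, {y1}, {x2}, {y2})")
    return "\n".join(lines)

# Toy example with two made-up elements:
parsed = [
    {"bbox": (10, 10, 90, 40), "caption": "Search text box"},
    {"bbox": (100, 10, 140, 40), "caption": "Search button"},
]
print(elements_to_prompt(parsed))
```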

News

  • [2024/10] OmniParser is the #1 trending model on the Hugging Face model hub (starting 10/29/2024).
  • [2024/10] Feel free to check out our demo on Hugging Face Spaces! (stay tuned for OmniParser + Claude Computer Use)
  • [2024/10] Both the Interactive Region Detection model and the Icon Functional Description model are released! Hugging Face models
  • [2024/09] OmniParser achieves the best performance on Windows Agent Arena!

Install

Set up the environment on an Ubuntu machine (e.g. a cloud droplet):

sudo apt update
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O miniconda.sh
bash miniconda.sh
sudo apt-get install -y mesa-utils

# Adjust the path if conda was installed somewhere other than /root/miniconda3
/root/miniconda3/bin/conda init
source ~/.bashrc

conda create -n omni python=3.12
conda activate omni

Clone the repo:

git clone https://github.com/sayuj01/OmniParser_microsoft.git
cd OmniParser_microsoft

Then run the setup script:

chmod +x master.sh
bash master.sh

Examples:

We put together a few simple examples in demo.ipynb.

Flask API

cd api
pip install -r requirements.txt
python app.py
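Once the server is up, you can call it from Python. The snippet below is a hypothetical client: the `/parse` route, the port, and the base64-JSON payload format are assumptions — check `api/app.py` for the actual routes and request schema.

```python
# Hypothetical client for the Flask API started above. Endpoint name,
# port, and payload schema are assumptions, not the repo's documented API.
import base64
import json
import urllib.request

def build_payload(image_bytes):
    """Encode raw image bytes as base64 inside a JSON-serializable dict."""
    return {"image": base64.b64encode(image_bytes).decode("ascii")}

def parse_screenshot(path, url="http://localhost:8000/parse"):
    """POST a screenshot to the (assumed) parse endpoint and return JSON."""
    with open(path, "rb") as f:
        body = json.dumps(build_payload(f.read())).encode("utf-8")
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())
```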

Gunicorn for concurrent requests

Choose the number of workers based on available CPU cores.

gunicorn app:app --workers=4 --threads=1 --bind 0.0.0.0:8000

gunicorn app2:app --workers=4 --threads=1 --bind 0.0.0.0:8000  # app2 variant with improved API output
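For picking `--workers`, the Gunicorn documentation suggests `(2 × CPU cores) + 1` as a starting point for typical web apps; note that model inference is compute-heavy, so fewer workers may be appropriate here. A small helper to compute the rule of thumb:

```python
# Compute Gunicorn's suggested worker count: (2 x CPU cores) + 1.
import os

def suggested_workers(cores=None):
    """Return the rule-of-thumb worker count for a given core count."""
    if cores is None:
        cores = os.cpu_count() or 1  # fall back to 1 if undetectable
    return 2 * cores + 1

print(suggested_workers(4))  # -> 9
```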

Gradio Demo

To run the Gradio demo, simply run:

python gradio_demo.py

Model Weights License

For the model checkpoints on the Hugging Face model hub, please note that the icon_detect model is under the AGPL license, inherited from the original YOLO model, while icon_caption_blip2 and icon_caption_florence are under the MIT license. Please refer to the LICENSE file in each model's folder: https://huggingface.co/microsoft/OmniParser.

📚 Citation

Our technical report is available on arXiv. If you find our work useful, please consider citing it:

@misc{lu2024omniparserpurevisionbased,
      title={OmniParser for Pure Vision Based GUI Agent}, 
      author={Yadong Lu and Jianwei Yang and Yelong Shen and Ahmed Awadallah},
      year={2024},
      eprint={2408.00203},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2408.00203}, 
}
