> **⚠️ WARNING: This repository is deprecated!** Please visit the new repository at https://sagerpascal.github.io/agents-for-computer-use
An awesome list of computer control agents (GUI automation of desktop and mobile devices) 🚀.
Please have a look at our website for more information.

- 📄 Paper: [A Comprehensive Survey of Agents for Computer Use](https://arxiv.org/abs/2501.16150)
- 🌐 Website: https://sagerpascal.github.io/agents-for-computer-use
- 🤖 Agent Overview
- 📊 Datasets Overview

## Agents
 
- **Abukadah et al.** - [Mapping Natural Language Intents to User Interfaces through Vision-Language Models]
- **Bishop et al.** - [Latent State Estimation Helps UI Agents to Reason]
- **Bonatti et al.** - [Windows Agent Arena: Evaluating Multi-Modal OS Agents at Scale]
- **Branavan et al.** - [Reinforcement Learning for Mapping Instructions to Actions]
- **Chae et al.** - [Web Agents with World Models: Learning and Leveraging Environment Dynamics in Web Navigation]
- **Cheng et al.** - [SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents]
- **Cho et al.** - [CAAP: Context-Aware Action Planning Prompting to Solve Computer Tasks with Front-End UI Only]
- **Deng et al.** - [Mind2Web: Towards a Generalist Agent for the Web]
- **Deng et al.** - [Mobile-Bench: An Evaluation Benchmark for LLM-based Mobile Agents]
- **Deng et al.** - [On the Multi-turn Instruction Following for Conversational Web Agents]
- **Ding et al.** - [MobileAgent: enhancing mobile control via human-machine interaction and SOP integration]
- **Dorka et al.** - [Training a Vision Language Model as Smartphone Assistant]
- **Fereidouni et al.** - [Search Beyond Queries: Training Smaller Language Models for Web Interactions via Reinforcement Learning]
- **Furuta et al.** - [Exposing Limitations of Language Model Agents in Sequential-Task Compositions on the Web]
- **Furuta et al.** - [Multimodal Web Navigation with Instruction-Finetuned Foundation Models]
- **Gao et al.** - [ASSISTGUI: Task-Oriented Desktop Graphical User Interface Automation]
- **Guan et al.** - [Intelligent Virtual Assistants with LLM-based Process Automation]
- **Guo et al.** - [PPTC Benchmark: Evaluating Large Language Models for PowerPoint Task Completion]
- **Gur et al.** - [A Real-World WebAgent with Planning, Long Context Understanding, and Program Synthesis]
- **Gur et al.** - [Environment Generation for Zero-Shot Compositional Reinforcement Learning]
- **Gur et al.** - [Learning to Navigate the Web]
- **Gur et al.** - [Understanding HTML with Large Language Models]
- **He et al.** - [WebVoyager: Building an End-to-End Web Agent with Large Multimodal Models]
- **Hong et al.** - [CogAgent: A Visual Language Model for GUI Agents]
- **Humphreys et al.** - [A data-driven approach for learning to control computers]
- **Iki et al.** - [Do BERTs learn to use browser user interface? Exploring multi-step tasks with unified vision-and-language BERTs]
- **Jia et al.** - [DOM-Q-NET: Grounded RL on Structured Language]
- **Kil et al.** - [Dual-View Visual Contextualization for Web Navigation]
- **Kim et al.** - [Language Models can Solve Computer Tasks]
- **Koh et al.** - [Tree Search For Language Model Agents]
- **Lai et al.** - [AutoWebGLM: Bootstrap And Reinforce A Large Language Model-based Web Navigating Agent]
- **Lee et al.** - [Explore, Select, Derive, and Recall: Augmenting LLM with Human-like Memory for Mobile Task Automation]
- **Li** - [Learning UI Navigation through Demonstrations composed of Macro Actions]
- **Li et al.** - [A Zero-Shot Language Agent for Computer Control with Structured Reflection]
- **Li et al.** - [AppAgent v2: Advanced Agent for Flexible Mobile Interactions]
- **Li et al.** - [Glider: A Reinforcement Learning Approach to Extract UI Scripts from Websites]
- **Li et al.** - [Interactive Task Learning from GUI-Grounded Natural Language Instructions and Demonstrations]
- **Li et al.** - [Mapping Natural Language Instructions to Mobile UI Action Sequences]
- **Li et al.** - [On the Effects of Data Scale on Computer Control Agents]
- **Li et al.** - [UINav: A Practical Approach to Train On-Device Automation Agents]
- **Lin et al.** - [Automating Web-based Infrastructure Management via Contextual Imitation Learning]
- **Liu et al.** - [Reinforcement Learning on Web Interfaces Using Workflow-Guided Exploration]
- **Lu et al.** - [GUI Odyssey: A Comprehensive Dataset for Cross-App GUI Navigation on Mobile Devices]
- **Lu et al.** - [OmniParser for Pure Vision Based GUI Agent]
- **Lu et al.** - [WebLINX: Real-World Website Navigation with Multi-Turn Dialogue]
- **Lutz et al.** - [WILBUR: Adaptive In-Context Learning for Robust and Accurate Web Agents]
- **Ma et al.** - [CoCo-Agent: Comprehensive Cognitive LLM Agent for Smartphone GUI Automation]
- **Ma et al.** - [LASER: LLM Agent with State-Space Exploration for Web Navigation]
- **Mazumder et al.** - [FLIN: A Flexible Natural Language Interface for Web Navigation]
- **Murty et al.** - [BAGEL: Bootstrapping Agents by Guiding Exploration with Language]
- **Nakano et al.** - [WebGPT: Browser-assisted question-answering with human feedback]
- **Niu et al.** - [ScreenAgent: A Vision Language Model-driven Computer Control Agent]
- **Nong et al.** - [MobileFlow: A Multimodal LLM For Mobile GUI Agent]
- **Pan et al.** - [Autonomous Evaluation and Refinement of Digital Agents]
- **Putta et al.** - [Agent Q: Advanced Reasoning and Learning for Autonomous AI Agents]
- **Rahman et al.** - [V-Zen: Efficient GUI Understanding and Precise Grounding With A Novel Multimodal LLM]
- **Rawles et al.** - [Android in the Wild: A Large-Scale Dataset for Android Device Control]
- **Shaw et al.** - [From Pixels to UI Actions: Learning to Follow Instructions via Graphical User Interfaces]
- **Shi et al.** - [World of Bits: An Open-Domain Platform for Web-Based Agents]
- **Sodhi et al.** - [HeaP: Hierarchical Policies for Web Actions using LLMs]
- **Song et al.** - [MMAC-Copilot: Multi-modal Agent Collaboration Operating System Copilot]
- **Song et al.** - [Navigating Interfaces with AI for Enhanced User Interaction]
- **Song et al.** - [RestGPT: Connecting Large Language Models with Real-World RESTful APIs]
- **Song et al.** - [VisionTasker: Mobile Task Automation Using Vision Based UI Understanding and LLM Task Planning]
- **Lo et al.** - [Hierarchical Prompting Assists Large Language Model on Web Navigation]
- **Sun et al.** - [AdaPlanner: Adaptive Planning from Feedback with Language Models]
- **Sun et al.** - [META-GUI: Towards Multi-modal Conversational Agents on Mobile GUI]
- **Tao et al.** - [WebWISE: Web Interface Control and Sequential Exploration with Large Language Models]
- **Wang et al.** - [Enabling Conversational Interaction with Mobile UI using Large Language Models]
- **Wang et al.** - [Mobile-Agent: Autonomous Multi-Modal Mobile Device Agent with Visual Perception]
- **Wang et al.** - [OfficeBench: Benchmarking Language Agents across Multiple Applications for Office Automation]
- **Wen et al.** - [AutoDroid: LLM-powered Task Automation in Android]
- **Wen et al.** - [DroidBot-GPT: GPT-powered UI Automation for Android]
- **Wu et al.** - [MobileVLM: A Vision-Language Model for Better Intra- and Inter-UI Understanding]
- **Wu et al.** - [OS-Copilot: Towards Generalist Computer Agents with Self-Improvement]
- **Xu et al.** - [Grounding Open-Domain Instructions to Automate Web Support Tasks]

## Datasets
 
- **Shi et al.** - [World of Bits: An Open-Domain Platform for Web-Based Agents]
- **Liu et al.** - [Reinforcement Learning on Web Interfaces Using Workflow-Guided Exploration]
- **Xu et al.** - [Grounding Open-Domain Instructions to Automate Web Support Tasks]
- **Gur et al.** - [Environment Generation for Zero-Shot Compositional Reinforcement Learning]
- **Yao et al.** - [WebShop: Towards Scalable Real-World Web Interaction with Grounded Language Agents]
- **Deng et al.** - [Mind2Web: Towards a Generalist Agent for the Web]
- **Koroglu et al.** - [QBE: QLearning-Based Exploration of Android Applications]
- **Rawles et al.** - [Android in the Wild: A Large-Scale Dataset for Android Device Control]
- **Zhou et al.** - [WebArena: A Realistic Web Environment for Building Autonomous Agents]
- **Li et al.** - [Mapping Natural Language Instructions to Mobile UI Action Sequences]
- **Toyama et al.** - [AndroidEnv: A Reinforcement Learning Platform for Android]
- **Burns et al.** - [A Dataset for Interactive Vision-Language Navigation with Unknown Command Feasibility]
- **Xie et al.** - [OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments]
- **Shvo et al.** - [AppBuddy: Learning to Accomplish Tasks in Mobile Apps via Reinforcement Learning]
- **Sun et al.** - [META-GUI: Towards Multi-modal Conversational Agents on Mobile GUI]
- **Liu et al.** - [AgentBench: Evaluating LLMs as Agents]
- **Chen et al.** - [WebVLN: Vision-and-Language Navigation on Websites]
- **Song et al.** - [RestGPT: Connecting Large Language Models with Real-World RESTful APIs]
- **Koh et al.** - [VisualWebArena: Evaluating Multimodal Agents on Realistic Visually Grounded Web Tasks]
- **Deng et al.** - [On the Multi-turn Instruction Following for Conversational Web Agents]
- **Kapoor et al.** - [OmniACT: A Dataset and Benchmark for Enabling Multimodal Generalist Autonomous Agents for Desktop and Web]
- **Wen et al.** - [Empowering LLM to use Smartphone for Intelligent Task Automation]
- **Gao et al.** - [ASSISTGUI: Task-Oriented Desktop Graphical User Interface Automation]
- **Niu et al.** - [ScreenAgent: A Vision Language Model-driven Computer Control Agent]
- **Drouin et al.** - [WorkArena: How Capable Are Web Agents at Solving Common Knowledge Work Tasks?]
- **Lai et al.** - [AutoWebGLM: Bootstrap And Reinforce A Large Language Model-based Web Navigating Agent]
- **Zhang et al.** - [Android in the Zoo: Chain-of-Action-Thought for GUI Agents]
- **Chen et al.** - [GUICourse: From General Vision Language Models to Versatile GUI Agents]
- **Guo et al.** - [PPTC Benchmark: Evaluating Large Language Models for PowerPoint Task Completion]
- **Venkatesh et al.** - [UGIF: UI Grounded Instruction Following]
- **Zheng et al.** - [AgentStudio: A Toolkit for Building General Virtual Agents]
- **Zhang et al.** - [Mobile-Env: An Evaluation Platform and Benchmark for LLM-GUI Interaction]
- **Chen et al.** - [GUI-WORLD: A Dataset for GUI-oriented Multimodal LLM-based Agents]
- **Chai et al.** - [AMEX: Android Multi-annotation Expo Dataset for Mobile GUI Agents]
 
If helpful, please cite:

```bibtex
@misc{sager_acu_2025,
      title={A Comprehensive Survey of Agents for Computer Use: Foundations, Challenges, and Future Directions},
      author={Pascal J. Sager and Benjamin Meyer and Peng Yan and Rebekka von Wartburg-Kottler and Layan Etaiwi and Aref Enayati and Gabriel Nobel and Ahmed Abdulkadir and Benjamin F. Grewe and Thilo Stadelmann},
      year={2025},
      eprint={2501.16150},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2501.16150},
}
```
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
