
AppAgentX: Evolving GUI Agents as Proficient Smartphone Users

Wenjia Jiang1, 2    Yangyang Zhuang1    Chenxi Song1    Xu Yang3    Chi Zhang1
1AGI Lab, Westlake University,  2Henan University,  3Southeast University 
jiangwenjia@westlake.edu.cn
     


Abstract

Recent advancements in Large Language Models (LLMs) have led to the development of intelligent LLM-based agents capable of interacting with graphical user interfaces (GUIs). These agents demonstrate strong reasoning and adaptability, enabling them to perform complex tasks that traditionally required predefined rules. However, the reliance on step-by-step reasoning in LLM-based agents often results in inefficiencies, particularly for routine tasks. In contrast, traditional rule-based systems excel in efficiency but lack the intelligence and flexibility to adapt to novel scenarios. To address this challenge, we propose a novel evolutionary framework for GUI agents that enhances operational efficiency while retaining intelligence and flexibility. Our approach incorporates a memory mechanism that records the agent's task execution history. By analyzing this history, the agent identifies repetitive action sequences and evolves high-level actions that act as shortcuts, replacing these low-level operations and improving efficiency. This allows the agent to focus on tasks requiring more complex reasoning, while simplifying routine actions. Experimental results on multiple benchmark tasks demonstrate that our approach significantly outperforms existing methods in both efficiency and accuracy. The code will be open-sourced to support further research.

🚀Getting Started

Before diving into the setup, it's worth mentioning that DeepSeek can be used through an OpenAI-compatible API format. By modifying the configuration, you can use the OpenAI SDK to access the DeepSeek API.
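
As a minimal sketch of what this looks like, the snippet below points the OpenAI Python SDK at DeepSeek's OpenAI-compatible endpoint. The base URL, model name, and key placeholder are assumptions for illustration; match them to the corresponding settings in config.py.

from openai import OpenAI

# Sketch only: the endpoint and model name below are assumptions,
# not values taken from this repository's config.py.
client = OpenAI(
    api_key="YOUR_DEEPSEEK_API_KEY",
    base_url="https://api.deepseek.com",
)
response = client.chat.completions.create(
    model="deepseek-chat",
    messages=[{"role": "user", "content": "Hello"}],
)
print(response.choices[0].message.content)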

1. LLM Setup and Dependencies

This project utilizes LangChain and LangGraph to construct the agent framework. It is recommended to follow the installation methods suggested on their official websites. For other dependencies, please use pip install -r requirements.txt. For LLM configuration, adjust the relevant settings in the config.py file.
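
As a quick reference, installation typically amounts to the following commands (the LangChain and LangGraph package names follow their official documentation; defer to their websites if the recommended method changes):

pip install langchain langgraph
pip install -r requirements.txt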

2. Database Deployment and Connection

We use Neo4j as the memory storage for the agent, leveraging its Cypher query language to facilitate node retrieval. For vector storage, Pinecone is employed. Ensure that the necessary API keys and connection settings are configured in the config.py file. For more information, visit Neo4j's official site and Pinecone's official site.
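
As a rough illustration, connecting to both services from Python might look like the sketch below. The URI, credentials, and key are hypothetical placeholders; the real values belong in config.py.

# Sketch only: assumes the official neo4j driver and the current Pinecone SDK.
from neo4j import GraphDatabase
from pinecone import Pinecone

NEO4J_URI = "neo4j://localhost:7687"          # placeholder
NEO4J_AUTH = ("neo4j", "your-password")       # placeholder
PINECONE_API_KEY = "YOUR_PINECONE_API_KEY"    # placeholder

# Neo4j holds the agent's memory graph and is queried with Cypher.
driver = GraphDatabase.driver(NEO4J_URI, auth=NEO4J_AUTH)
with driver.session() as session:
    result = session.run("MATCH (n) RETURN count(n) AS nodes")
    print("Neo4j nodes:", result.single()["nodes"])
driver.close()

# Pinecone holds the vector embeddings.
pc = Pinecone(api_key=PINECONE_API_KEY)
print("Pinecone indexes:", pc.list_indexes().names())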

3. Screen Recognition and Feature Extraction Deployment

To simplify deployment, we use Docker to containerize the screen recognition and feature extraction services. Refer to the README in the backend folder of this project for instructions on starting the container. Note that this may require Docker's GPU support; please consult Docker's official documentation for configuration. This modular approach allows different screen parsing and feature extraction tools to be swapped in easily, significantly improving the framework's extensibility.
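
For orientation only, starting a GPU-enabled container generally follows this pattern; the image name and port here are placeholders, and the authoritative commands are in the backend README:

docker run --gpus all -p 8000:8000 appagentx-backend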

4. Launching the Demo

To use this project, you first need to configure ADB (Android Debug Bridge) to connect your Android device to your computer.

1. Setting Up ADB and Connecting Your Device

  1. Install ADB on your PC:
    Download and install Android Debug Bridge (adb), a command-line tool that enables communication between your PC and an Android device.

  2. Enable USB Debugging on your Android device:

    • Go to Settings > Developer Options and enable USB Debugging.

  3. Connect your device to the PC using a USB cable, then verify the connection as shown below.
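
You can confirm that ADB detects the device by running:

adb devices

Your device (or emulator) should appear in the list with the state "device". If it shows "unauthorized", accept the USB debugging prompt on the phone and run the command again.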

Using an Emulator Instead of a Physical Device (Optional)

If you do not have a physical Android device but still want to try AppAgentX, we recommend using the built-in emulator in Android Studio:

  1. Download and install Android Studio.
  2. Open Device Manager in Android Studio to create and launch an emulator.
  3. Install apps on the emulator by downloading APK files and dragging them into the emulator window.
  4. AppAgentX can detect and operate apps on an emulator just like on a real device.

Launching Gradio

Once your device or emulator is set up, you can start the project. We use Gradio as the front-end interface. Use one of the following commands to launch the demo:

python demo.py

or

gradio demo.py
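
The second form runs the script through Gradio's CLI, which enables auto-reload and is convenient while editing the demo; python demo.py launches it normally.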

Now, AppAgentX should be ready to use! 🚀

📷 Screenshots and Demo

Below are several screenshots of AppAgentX after deployment:

Initial Interface
Automatic Exploration
Chain Execution
Execution Results

Here are demonstration GIFs of AppAgentX:

Demo 1

Demo 2

Demo 3

Demo 4

🔗 Related Links

  • AppAgent - The first LLM-based intelligent smartphone application agent
  • OmniParser - Microsoft's multimodal interface parsing tool
  • LangChain - Framework for building LLM-powered applications

📚 BibTeX

If you find our repo helpful, please consider leaving a star or citing our paper :)

@misc{jiang2025appagentxevolvingguiagents,
      title={AppAgentX: Evolving GUI Agents as Proficient Smartphone Users},
      author={Wenjia Jiang and Yangyang Zhuang and Chenxi Song and Xu Yang and Chi Zhang},
      year={2025},
      eprint={2503.02268},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2503.02268},
}

📭 Contact

If you have any comments or questions, feel free to contact Wenjia Jiang.
