Wenjia Jiang¹,², Yangyang Zhuang¹, Chenxi Song¹, Xu Yang³, Chi Zhang¹

¹AGI Lab, Westlake University, ²Henan University, ³Southeast University

jiangwenjia@westlake.edu.cn
Recent advancements in Large Language Models (LLMs) have led to the development of intelligent LLM-based agents capable of interacting with graphical user interfaces (GUIs). These agents demonstrate strong reasoning and adaptability, enabling them to perform complex tasks that traditionally required predefined rules. However, the reliance on step-by-step reasoning in LLM-based agents often results in inefficiencies, particularly for routine tasks. In contrast, traditional rule-based systems excel in efficiency but lack the intelligence and flexibility to adapt to novel scenarios. To address this challenge, we propose a novel evolutionary framework for GUI agents that enhances operational efficiency while retaining intelligence and flexibility. Our approach incorporates a memory mechanism that records the agent's task execution history. By analyzing this history, the agent identifies repetitive action sequences and evolves high-level actions that act as shortcuts, replacing these low-level operations and improving efficiency. This allows the agent to focus on tasks requiring more complex reasoning, while simplifying routine actions. Experimental results on multiple benchmark tasks demonstrate that our approach significantly outperforms existing methods in both efficiency and accuracy. The code will be open-sourced to support further research.
Before diving into the setup, it's worth mentioning that DeepSeek can be used with an OpenAI-compatible API format. By modifying the configuration, you can use the OpenAI SDK to access the DeepSeek API.
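As a reference, here is a minimal sketch of that setup using the OpenAI Python SDK. The base URL and model name follow DeepSeek's documented OpenAI-compatible endpoint; the `DEEPSEEK_API_KEY` environment variable is an assumption, so adapt the names to whatever you set in `config.py`:

```python
# Minimal sketch: calling the DeepSeek API through the OpenAI SDK.
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["DEEPSEEK_API_KEY"],  # assumed environment variable
    base_url="https://api.deepseek.com",     # DeepSeek's OpenAI-compatible endpoint
)

response = client.chat.completions.create(
    model="deepseek-chat",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
```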
This project uses LangChain and LangGraph to construct the agent framework; please follow the installation instructions on their official websites. For the remaining dependencies, run `pip install -r requirements.txt`. For LLM configuration, adjust the relevant settings in the `config.py` file.
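The exact settings in `config.py` depend on the repository version; as a rough illustration, the LLM section might look like the following (all key names here are hypothetical):

```python
# Hypothetical excerpt from config.py -- the actual key names in the
# repository may differ; treat this as an illustration of what to fill in.
LLM_BASE_URL = "https://api.deepseek.com"  # any OpenAI-compatible endpoint works
LLM_API_KEY = "sk-..."                     # your API key
LLM_MODEL = "deepseek-chat"                # model the agent should call
```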
We use Neo4j as the agent's memory store, leveraging its Cypher query language for node retrieval; Pinecone is used for vector storage. Ensure the required endpoints and API keys are configured in the `config.py` file. For more information, visit Neo4j's official site and Pinecone's official site.
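Before launching the agent, it can help to verify that both stores are reachable. Below is a minimal sketch using the official `neo4j` and `pinecone` Python clients; the URI, credentials, and index name are placeholders, so substitute the values from your `config.py`:

```python
# Minimal connectivity check for the two storage backends.
from neo4j import GraphDatabase
from pinecone import Pinecone

# Placeholder URI and credentials -- use your config.py values.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "your-password"))
with driver.session() as session:
    count = session.run("MATCH (n) RETURN count(n) AS n").single()["n"]
    print(f"Neo4j reachable, {count} nodes stored")
driver.close()

pc = Pinecone(api_key="YOUR_PINECONE_API_KEY")
index = pc.Index("appagentx")  # hypothetical index name
print(index.describe_index_stats())
```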
To simplify deployment, we use Docker to containerize the screen recognition and feature extraction services; refer to the README in the `backend` folder for instructions on starting the container. Note that this may require Docker's GPU support; consult Docker's official documentation for configuration. This modular design makes it easy to swap in different screen parsing and feature extraction tools, significantly improving the system's extensibility.
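For reference, starting a GPU-enabled container generally looks like the following; the image name and port here are placeholders, and the backend README is authoritative:

```bash
# Placeholder image name and port -- follow the backend README for the real values.
docker run --gpus all -p 8000:8000 appagentx-backend:latest
```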
To use this project, you first need to configure ADB (Android Debug Bridge) to connect your Android device to your computer.
- **Install ADB on your PC:** Download and install Android Debug Bridge (adb), a command-line tool that enables communication between your PC and an Android device.
- **Enable USB Debugging on your Android device:** Go to Settings > Developer Options and enable USB Debugging.
- **Connect your device to the PC** using a USB cable, then verify the connection as shown below.
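Once connected, you can confirm that the device is visible to ADB:

```bash
adb devices
# Expected output lists your device, e.g.:
# List of devices attached
# emulator-5554   device
```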
If you do not have an actual Android device but still want to try AppAgentX, we recommend using the built-in emulator in Android Studio:
- Download and install Android Studio.
- Open Device Manager in Android Studio to create and launch an emulator.
- Install apps on the emulator by downloading APK files and dragging them into the emulator window.
- AppAgentX can detect and operate apps on an emulator just like on a real device.
Once your device or emulator is set up, you can start the project. We use Gradio as the front-end interface. Use one of the following commands to launch the demo:
```bash
python demo.py
```

or

```bash
gradio demo.py
```
Now, AppAgentX should be ready to use! 🚀
Below are several screenshots of AppAgentX after deployment:
Here are demonstration GIFs of AppAgentX:
- AppAgent - First LLM-based intelligent smartphone application agent
- OmniParser - Microsoft's multimodal interface parsing tool
- LangChain - Framework for building LLM-powered applications
If you find our repo helpful, please consider leaving a star or citing our paper :)
```bibtex
@misc{jiang2025appagentxevolvingguiagents,
  title={AppAgentX: Evolving GUI Agents as Proficient Smartphone Users},
  author={Wenjia Jiang and Yangyang Zhuang and Chenxi Song and Xu Yang and Chi Zhang},
  year={2025},
  eprint={2503.02268},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2503.02268},
}
```
If you have any comments or questions, feel free to contact Wenjia Jiang.