List of AI tools that can interact with user interfaces. PRs welcome.
Most of these are still largely text-based.
- Claude 3.5 Computer Use (Oct 2024): Version of the Claude 3.5 model that supports computer use: it takes structured text and screenshot (image) tool inputs and produces actionable text outputs such as mouse and keyboard commands.
- Llama 3.2 (Sep 2024): The two largest models of the Llama 3.2 collection (11B and 90B) support image reasoning: document-level understanding (including charts and graphs), image captioning, and visual grounding tasks such as pinpointing objects in images from natural-language descriptions.
- Molmo (Sep 2024): VLM that matches GPT-4V performance and can additionally point to locations in images.
- CogAgent (Dec 2023): Open-source visual language model that can identify regions and points of UIs to interact with.
- Florence 2 (Nov 2023): Vision foundation model with a unified, prompt-based representation for a variety of computer vision and vision-language tasks including producing bounding boxes.
- OpenAdapt.AI: AI-first process automation with large language (LLMs), action (LAMs), multimodal (LMMs), and visual language (VLMs) models
- Skyvern: Automates browser-based workflows using LLMs and computer vision
- ScreenAgent: VLM-driven agent that controls a computer via screenshots
- Mobile-Agent: Multimodal agent for operating mobile-device UIs
- UI-ACT: An AI agent for interacting with a computer using the graphical user interface
- OpenInterpreter: Lets language models run code locally to interact with the operating system.
- AIOS: LLM agent operating system that can interact with the OS as a backend.
- Adept: Company aiming to automate user-interface interaction through ML
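Most of the tools above share the same basic loop: capture a screenshot, ask a vision-language model what to do next, parse its reply into a concrete UI action, and repeat. Below is a minimal sketch of that loop; the model call is stubbed out, and `query_model` and the `"click <x> <y>"` action format are hypothetical illustrations, not any listed tool's real API.

```python
"""Sketch of the screenshot -> model -> action loop common to GUI agents.
The model call and action grammar here are hypothetical placeholders."""

from dataclasses import dataclass


@dataclass
class Action:
    kind: str       # "click", "type", or "done"
    x: int = 0      # screen coordinates for pointer actions
    y: int = 0
    text: str = ""  # payload for keyboard actions


def parse_action(raw: str) -> Action:
    """Parse the model's text output into a structured action.

    Hypothetical formats: "click <x> <y>", "type <text>", "done".
    """
    head, _, rest = raw.strip().partition(" ")
    if head == "click":
        x, y = (int(v) for v in rest.split())
        return Action("click", x=x, y=y)
    if head == "type":
        return Action("type", text=rest)
    return Action("done")


def query_model(screenshot: bytes, goal: str) -> str:
    """Stub standing in for a real VLM call (Claude, CogAgent, ...)."""
    return "click 120 340"


def agent_step(screenshot: bytes, goal: str) -> Action:
    """One iteration of the agent loop: observe, decide, act."""
    return parse_action(query_model(screenshot, goal))
```

In practice, tools like Claude Computer Use return structured tool calls rather than free-form text, which removes the need for fragile string parsing, but the observe/decide/act cycle is the same.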