🤖 AI Web Automation Agent System

Vision-Guided Browser Automation with GPT-4o


An intelligent web automation system that takes natural language task queries and automatically navigates live web applications using vision-guided decision making and Set-of-Marks visual grounding.

Features • Installation • Usage • Demo • Architecture


Overview

This system uses GPT-4o with Vision (OpenAI) to interpret natural language queries like "How do I create a project in Linear?" or "How do I create a table in Notion?" and automatically:

  1. Parses the query to understand the app and task
  2. Analyzes screenshots in real-time using Set-of-Marks visual grounding
  3. Decides next actions dynamically based on current UI state
  4. Navigates the live web application using Playwright
  5. Captures screenshots at each step
  6. Generates structured metadata for each workflow

Features

  • Natural Language Interface: Describe what you want to do in plain English
  • Vision-Guided Navigation: GPT-4o analyzes screenshots in real-time to decide next actions
  • Set-of-Marks (SoM) Visual Grounding:
    • Overlays numbered labels on interactive elements
    • Enables precise element identification by ID
    • Eliminates ambiguity in element selection
  • Real-Time Decision Making:
    • No pre-programmed workflows
    • Adapts to any UI state dynamically
    • Handles errors and unexpected states
  • Notion-Specific Intelligence:
    • Distinguishes between title and body content blocks
    • Understands slash command workflows
    • Matches slash commands to task goals (/table vs /database)
  • Comprehensive Screenshot Capture:
    • Labeled screenshots with element markers
    • Before and after each action
    • Final state documentation
  • Web UI & API: Flask web interface and REST API
  • Session Management: Saves browser sessions (no re-login needed)
  • Fully Generic: No hardcoded selectors - works across different applications

Architecture

┌─────────────────────────────────────────────────────────────────────┐
│                    Vision-Guided Agent                               │
│                 (vision_guided_agent.py)                             │
│                  Main Orchestrator                                   │
└─────────────────────────────┬───────────────────────────────────────┘
                              │
              ┌───────────────┼───────────────┐
              │               │               │
    ┌─────────▼──────┐  ┌────▼─────────┐  ┌─▼────────────────┐
    │ Vision Agent   │  │  Navigator   │  │ SoM Labeler      │
    │  (GPT-4o)      │  │ (Playwright) │  │ (Element Marker) │
    │ - Parse task   │  │ - Execute    │  │ - Extract elements│
    │ - Decide next  │  │   actions    │  │ - Label with IDs │
    │   action       │  │ - Manage     │  │ - Create labeled │
    │ - Analyze SoM  │  │   browser    │  │   screenshots    │
    └────────────────┘  └──────┬───────┘  └──────────────────┘
                               │
                    ┌──────────▼──────────┐
                    │ Screenshot Manager  │
                    │ - Capture states    │
                    │ - Save metadata     │
                    │ - Organize datasets │
                    └─────────────────────┘

                    ┌─────────────────────┐
                    │   Web UI (Flask)    │
                    │   web_app.py        │
                    │ - REST API          │
                    │ - Progress tracking │
                    │ - Dataset browser   │
                    └─────────────────────┘

Installation

Prerequisites

  • Python 3.10+
  • OpenAI API key (Get one here)
    • Requires GPT-4o with vision capabilities

Setup

  1. Clone the repository (or navigate to the project directory):

    cd softlight
  2. Install dependencies:

    pip install -r requirements.txt
  3. Install Playwright browsers:

    python -m playwright install
  4. Configure environment variables:

    cp .env.example .env

    Edit .env and add your OpenAI API key:

    OPENAI_API_KEY=your_api_key_here
    
  5. Initial login (one-time setup):

    For the agent to work, you need to manually log in to your target applications once:

    # This will open a browser where you can log in
    python -c "from playwright.sync_api import sync_playwright; p = sync_playwright().start(); browser = p.chromium.launch(headless=False); context = browser.new_context(); page = context.new_page(); page.goto('https://linear.app'); input('Log in, then press Enter...'); context.storage_state(path='browser_state/linear_state.json'); browser.close()"

    Repeat for Notion or other apps you want to automate.
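
    If you prefer something easier to read and edit than the one-liner in step 5, the same flow can be written as a small script (a sketch; the URL and state-file name are examples, use one file per app):

    # save_login.py: a readable version of the one-liner in step 5 above.
    from pathlib import Path
    from playwright.sync_api import sync_playwright

    APP_URL = "https://linear.app"
    STATE_PATH = Path("browser_state/linear_state.json")

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=False)
        context = browser.new_context()
        page = context.new_page()
        page.goto(APP_URL)
        input("Log in in the opened browser, then press Enter here...")
        STATE_PATH.parent.mkdir(parents=True, exist_ok=True)
        context.storage_state(path=str(STATE_PATH))  # persist cookies and local storage
        browser.close()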

Usage

There are four ways to run the agent:

1. Web UI (Recommended)

Start the Flask web server:

python web_app.py

Then open your browser to http://localhost:8080

The Web UI provides:

  • Simple text input for natural language queries
  • Real-time execution status
  • Dataset browser to view captured workflows
  • Download datasets as ZIP files

2. Command Line

Run a single task query directly:

python vision_guided_agent.py "How do I create a table in Notion?"

Examples:

# Notion tasks
python vision_guided_agent.py "How do I create a table in Notion?"
python vision_guided_agent.py "How do I create a database in Notion?"

# Linear tasks
python vision_guided_agent.py "How do I create a project in Linear?"
python vision_guided_agent.py "How do I create an issue in Linear?"

3. REST API

Start the web app and use the REST API:

# Start the server
python web_app.py

# In another terminal, make API requests
curl -X POST http://localhost:8080/api/capture/start \
  -H "Content-Type: application/json" \
  -d '{"query": "How do I create a table in Notion?"}'

4. Programmatic Usage

import asyncio
from vision_guided_agent import VisionGuidedAgent

async def main():
    # Create agent
    agent = VisionGuidedAgent()

    # Execute a query
    result = await agent.execute_query(
        "How do I create a table in Notion?",
        headless=False  # Set to True for headless mode
    )

    # Check results
    if result['success']:
        print(f"Task completed successfully!")
        print(f"Screenshots saved to: {result['dataset_location']}")
        print(f"Total steps: {result['execution']['total_steps']}")
    else:
        print(f"Task failed: {result['error']}")

asyncio.run(main())

Example Tasks

Here are example tasks the system can handle:

Notion Tasks

  1. Create a table: "How do I create a table in Notion?"
  2. Create a database: "How do I create a database in Notion?"
  3. Create a page: "How do I create a new page in Notion?"
  4. Add a to-do list: "How do I add a to-do list in Notion?"

Linear Tasks

  1. Create a project: "How do I create a project in Linear?"
  2. Create an issue: "How do I create a new issue in Linear?"
  3. Filter issues: "How do I filter issues by status in Linear?"
  4. Assign an issue: "How do I assign an issue to someone in Linear?"

The system is fully generic and can handle any task query for supported applications.

Output Structure

Screenshots and metadata are organized by app and task:

dataset/
├── linear/
│   ├── create_project/
│   │   ├── step_01_initial_initial_page_state.png
│   │   ├── step_01_labeled.png                      # Set-of-Marks labeled
│   │   ├── step_02_click_after_click_on_element_#42.png
│   │   ├── step_02_labeled.png                      # Set-of-Marks labeled
│   │   ├── step_03_fill_after_fill_on_element_#75.png
│   │   ├── step_04_click_after_click_on_element_#88.png
│   │   ├── step_01_final_final_state_after_task_completion.png
│   │   └── metadata.json                            # Complete execution metadata
│   └── create_issue/
│       └── ...
└── notion/
    ├── create_table/
    │   ├── step_01_initial_initial_page_state.png
    │   ├── step_01_labeled.png
    │   ├── step_02_click_after_click_on_element_#22.png
    │   ├── step_03_press_after_press_on_element_#75.png
    │   ├── step_04_fill_after_fill_on_element_#76.png
    │   ├── step_05_click_after_click_on_element_#75.png
    │   ├── step_01_final_final_state_after_task_completion.png
    │   └── metadata.json
    └── create_database/
        └── ...

Each task directory contains:

  • Labeled screenshots: With numbered element markers (Set-of-Marks)
  • Action screenshots: Captured after each action
  • Final state: Showing task completion
  • metadata.json: Complete execution details

Metadata Format

Each task generates a metadata.json file:

{
  "task": {
    "app": "Notion",
    "task_type": "create table",
    "description": "How do I create a new table in Notion?",
    "query": "How do I create a new table in Notion?"
  },
  "execution": {
    "success": true,
    "error": null,
    "start_time": "2025-11-07T01:39:39.587238",
    "end_time": "2025-11-07T01:40:17.954218",
    "total_steps": 4
  },
  "steps": [
    {
      "step_number": 2,
      "action": "click",
      "element_id": 22,
      "description": "To create a new table, we need to start by opening a new page where we can insert Notion content blocks.",
      "success": true,
      "error": null,
      "screenshot_path": "dataset/notion/create_table/step_02_click_after_click_on_element_#22.png",
      "timestamp": "2025-11-07T01:39:39.587238"
    },
    {
      "step_number": 3,
      "action": "press",
      "element_id": 75,
      "description": "To create a new table, first press Enter from the title to create a body content block.",
      "success": true,
      "screenshot_path": "dataset/notion/create_table/step_03_press_after_press_on_element_#75.png",
      "timestamp": "2025-11-07T01:39:49.217972"
    }
  ],
  "screenshots": [...],
  "metadata_version": "1.0",
  "generated_at": "2025-11-07T01:40:22.856501"
}

The element_id field references the numbered markers in the Set-of-Marks labeled screenshots.
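
Because the format is plain JSON, it is easy to inspect programmatically. For example (a small sketch using only the fields shown above):

# Sketch: load a generated metadata.json and print the recorded steps.
import json
from pathlib import Path

meta = json.loads(Path("dataset/notion/create_table/metadata.json").read_text())
print(meta["task"]["query"], "- success:", meta["execution"]["success"])
for step in meta["steps"]:
    print(f"step {step['step_number']}: {step['action']} element #{step['element_id']}")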

Configuration

Edit .env file to customize behavior:

Key Settings

  • OPENAI_API_KEY: Your OpenAI API key (required)
  • BROWSER_MODE: headed (visible) or headless (background)
  • SLOW_MO: Slow down actions by X milliseconds (default: 500)
  • SCREENSHOT_DIR: Where to save screenshots (default: ./dataset)
  • GPT_MODEL: GPT model to use (default: gpt-4o)
  • SAVE_SESSION: Save browser sessions to avoid re-login (default: true)
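
A minimal .env using the settings above might look like this (values are placeholders):

# Example .env
OPENAI_API_KEY=your_api_key_here
BROWSER_MODE=headed
SLOW_MO=500
SCREENSHOT_DIR=./dataset
GPT_MODEL=gpt-4o
SAVE_SESSION=true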

See .env.example for all available options.

How It Works

Vision-Guided Workflow

The system uses a vision-guided approach powered by GPT-4o and Set-of-Marks:

1. Task Parsing (vision_agent.py)

GPT-4o analyzes the natural language query:

  • Identifies the application (Linear, Notion, etc.)
  • Extracts the task goal
  • Understands the desired outcome

2. Set-of-Marks Element Labeling (set_of_marks_labeler.py)

For each UI state:

  • Extracts all interactive elements using JavaScript
  • Identifies element properties (text, labels, roles, position)
  • Creates labeled screenshots with numbered markers
  • Special handling for Notion:
    • Distinguishes TITLE vs BODY content blocks
    • Detects H1 elements as titles
    • Identifies slash command input areas
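
A minimal sketch of this extraction step using Playwright's page.evaluate() (the selector and returned fields are illustrative; the real labeler in set_of_marks_labeler.py also draws the numbered overlays and applies the Notion rules above):

# Sketch: collect visible interactive elements and the data needed to label them.
INTERACTIVE_SELECTOR = "a, button, input, textarea, select, [role='button'], [contenteditable='true']"

async def extract_interactive_elements(page):
    return await page.evaluate(
        """(selector) => Array.from(document.querySelectorAll(selector))
            .filter(el => el.offsetWidth > 0 && el.offsetHeight > 0)
            .map((el, i) => {
                const box = el.getBoundingClientRect();
                return {
                    id: i + 1,                                    // number drawn on the label
                    tag: el.tagName.toLowerCase(),
                    role: el.getAttribute('role'),
                    text: (el.innerText || el.value || '').slice(0, 80),
                    x: box.x, y: box.y, width: box.width, height: box.height,
                };
            })""",
        INTERACTIVE_SELECTOR,
    )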

3. Vision-Guided Decision Making (vision_agent.py)

GPT-4o analyzes the labeled screenshot:

  • Sees the numbered elements on the screenshot
  • Reads element descriptions (text, labels, roles)
  • Decides the next action based on:
    • Current UI state
    • Task goal
    • Previous actions taken
  • Selects element by ID (e.g., "Click element #42")
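
A sketch of what this decision call can look like with the OpenAI Python SDK (the prompt wording and JSON shape are illustrative, not the exact ones in vision_agent.py):

# Sketch: ask GPT-4o for the next action given the labeled screenshot.
import base64
import json
from openai import OpenAI

client = OpenAI()  # uses OPENAI_API_KEY from the environment

def decide_next_action(labeled_screenshot_path, task_goal, history):
    with open(labeled_screenshot_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode()
    response = client.chat.completions.create(
        model="gpt-4o",
        response_format={"type": "json_object"},
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": (
                    f"Task: {task_goal}\nPrevious actions: {history}\n"
                    "Every interactive element on the screenshot has a numbered label. "
                    'Reply with JSON such as {"action": "click", "element_id": 42, "reason": "..."} '
                    'or {"action": "done"} when the task is finished.')},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
    )
    return json.loads(response.choices[0].message.content)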

4. Action Execution (navigator.py)

Playwright executes the action:

  • click: Click element by ID
  • fill: Type text into element by ID
  • press: Press keyboard keys (Enter, Escape, etc.)
  • wait: Wait for UI updates
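
A sketch of dispatching the chosen action by clicking the labeled element's bounding-box center (the real navigator.py may resolve elements differently):

# Sketch: execute the decided action using the element geometry from the labeler.
# Acting on coordinates keeps the navigator free of hardcoded selectors.
async def execute_action(page, action, elements):
    if action["action"] == "wait":
        await page.wait_for_timeout(1000)          # give the UI time to settle
        return
    if action["action"] == "press":
        await page.keyboard.press(action.get("key", "Enter"))
        return
    target = next(e for e in elements if e["id"] == action["element_id"])
    x = target["x"] + target["width"] / 2
    y = target["y"] + target["height"] / 2
    if action["action"] == "click":
        await page.mouse.click(x, y)
    elif action["action"] == "fill":
        await page.mouse.click(x, y)               # focus the element first
        await page.keyboard.type(action["text"])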

5. Screenshot Capture (screenshot_manager.py)

After each action:

  • Captures the new UI state
  • Creates a new labeled screenshot
  • GPT-4o analyzes and decides next action
  • Repeats until task is complete
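
Tying the sketches above together, the outer loop looks roughly like this (illustrative; the real loop in vision_guided_agent.py also writes metadata and applies the configurable step limit):

# Sketch: the observe -> decide -> act loop, using the helpers sketched above.
async def run_steps(page, task_goal, max_steps=15):
    history = []
    for step in range(1, max_steps + 1):
        elements = await extract_interactive_elements(page)
        labeled_path = f"step_{step:02d}_labeled.png"
        await page.screenshot(path=labeled_path)   # drawn with SoM overlays in practice
        action = decide_next_action(labeled_path, task_goal, history)
        if action["action"] == "done":
            return True
        await execute_action(page, action, elements)
        await page.screenshot(path=f"step_{step:02d}_{action['action']}.png")
        history.append(action.get("reason", action["action"]))
    return False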

6. Metadata Generation

Saves complete workflow information:

  • Each step with element ID and reasoning
  • All screenshots (labeled and unlabeled)
  • Execution timeline and success status
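
A sketch of how that file might be assembled, matching the fields shown in the "Metadata Format" section above:

# Sketch: assemble and write metadata.json for a completed task.
import json
from datetime import datetime
from pathlib import Path

def save_metadata(task_dir, task, steps, success, error=None):
    metadata = {
        "task": task,                      # app, task_type, description, query
        "execution": {
            "success": success,
            "error": error,
            "total_steps": len(steps),
        },
        "steps": steps,                    # one entry per action, as shown above
        "metadata_version": "1.0",
        "generated_at": datetime.now().isoformat(),
    }
    task_dir = Path(task_dir)
    task_dir.mkdir(parents=True, exist_ok=True)
    (task_dir / "metadata.json").write_text(json.dumps(metadata, indent=2))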

Supported Applications

Currently optimized for:

  • Linear (project management)
  • Notion (knowledge management)

Can be extended to any web application - the system is fully generic and doesn't use hardcoded selectors.

Generalization

The system is fully generic and works across different applications:

  • No hardcoded selectors: Uses Set-of-Marks to identify elements dynamically
  • Vision-guided: GPT-4o sees the actual UI and decides actions in real-time
  • No pre-programming needed: Adapts to any UI structure automatically
  • Handles app-specific patterns: Special logic for Notion, Linear, etc.
  • Learns from context: Uses previous actions to inform next steps

Troubleshooting

"Could not find element" errors

  1. Slow down actions: Increase SLOW_MO in .env (e.g., SLOW_MO=1000)
  2. Run in headed mode to see what's happening: BROWSER_MODE=headed
  3. Check the labeled screenshots in the dataset folder to see what elements were detected

"OPENAI_API_KEY is required" error

Make sure you've created a .env file and added your API key:

cp .env.example .env
# Edit .env and add your OpenAI API key

API rate limit errors

If you hit OpenAI rate limits:

  1. Add delays between API calls in the code
  2. Reduce screenshot resolution
  3. Use a higher tier API plan
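
For option 1, a simple exponential backoff around the OpenAI call is usually enough (a sketch using the SDK's RateLimitError):

# Sketch: retry the OpenAI call with exponential backoff when rate-limited.
import time
from openai import OpenAI, RateLimitError

client = OpenAI()

def chat_with_backoff(max_retries=5, **kwargs):
    for attempt in range(max_retries):
        try:
            return client.chat.completions.create(**kwargs)
        except RateLimitError:
            time.sleep(2 ** attempt)       # wait 1s, 2s, 4s, ...
    raise RuntimeError("Still rate-limited after retries")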

Session/login issues

Manually log in to the application once:

python -m playwright codegen --save-storage=browser_state/linear_state.json https://linear.app

This opens a browser where you can log in. When you close the browser, the session state is written to the file passed via --save-storage, so the agent can reuse it on later runs.

Screenshots not capturing

Check:

  1. SCREENSHOT_DIR exists and is writable
  2. Playwright has permissions to write files
  3. Full page screenshots are enabled: FULL_PAGE_SCREENSHOT=true

Development

Project Structure

softlight/
├── vision_guided_agent.py      # Main vision-guided orchestrator
├── vision_agent.py             # GPT-4o task interpretation & decision making
├── set_of_marks_labeler.py     # Element extraction & labeling
├── navigator.py                # Playwright browser automation
├── screenshot_manager.py       # Screenshot capture & organization
├── web_app.py                  # Flask web UI & REST API
├── config.py                   # Configuration management
├── utils.py                    # Helper functions
├── requirements.txt            # Python dependencies
├── .env.example                # Environment variables template
├── dataset/                    # Generated screenshots & metadata
│   ├── linear/                 # Linear task datasets
│   └── notion/                 # Notion task datasets
├── browser_state/              # Saved browser sessions (for auto-login)
├── web/                        # Web UI static files (HTML/CSS/JS)
└── README.md                   # This file

Adding New Applications

The system is generic and can work with any web application:

  1. Add the app URL to config.py:

    APP_URLS = {
        "linear": "https://linear.app",
        "notion": "https://notion.so",
        "myapp": "https://myapp.com",
    }
  2. Log in manually once to save the browser session:

    python -m playwright codegen --save-storage=browser_state/myapp_state.json https://myapp.com
    # Log in, then close the browser to save the session
  3. Run a task query:

    python vision_guided_agent.py "How do I create something in MyApp?"

The vision-guided approach means no app-specific code is needed - GPT-4o figures out the UI dynamically.

Limitations

  • Requires manual login for each application (one-time setup)
  • GPT-4o API calls can be expensive for complex workflows
  • Vision analysis adds latency compared to pure DOM-based approaches
  • Limited to 15 steps per task (configurable in code)
  • May struggle with very dynamic SPAs that heavily mutate the DOM

Contributing

Contributions are welcome! Areas for improvement:

  • Additional application support (GitHub, Jira, Slack, etc.)
  • Improved element detection for complex UIs
  • Better error recovery and retry logic
  • Multi-tab/window support
  • Video recording of workflows
  • Caching of vision analysis results

License

MIT License - see LICENSE file for details

Credits

Built with Python, OpenAI GPT-4o, Playwright, and Flask.

Inspired by the Set-of-Marks (SoM) visual grounding approach.


Note: This tool is designed for authorized automation of web applications you have access to. Always respect websites' terms of service and robots.txt files.
