🤖 AI Web Automation Agent System

Vision-Guided Browser Automation with GPT-4o


An intelligent web automation system that takes natural language task queries and automatically navigates live web applications using vision-guided decision making and Set-of-Marks visual grounding.

Features • Installation • Usage • Demo • Architecture


Overview

This system uses GPT-4o with Vision (OpenAI) to interpret natural language queries like "How do I create a project in Linear?" or "How do I create a table in Notion?" and automatically:

  1. Parses the query to understand the app and task
  2. Analyzes screenshots in real-time using Set-of-Marks visual grounding
  3. Decides next actions dynamically based on current UI state
  4. Navigates the live web application using Playwright
  5. Captures screenshots at each step
  6. Generates structured metadata for each workflow

Features

  • Natural Language Interface: Describe what you want to do in plain English
  • Vision-Guided Navigation: GPT-4o analyzes screenshots in real-time to decide next actions
  • Set-of-Marks (SoM) Visual Grounding:
    • Overlays numbered labels on interactive elements
    • Enables precise element identification by ID
    • Eliminates ambiguity in element selection
  • Real-Time Decision Making:
    • No pre-programmed workflows
    • Adapts to any UI state dynamically
    • Handles errors and unexpected states
  • Notion-Specific Intelligence:
    • Distinguishes between title and body content blocks
    • Understands slash command workflows
    • Matches slash commands to task goals (/table vs /database)
  • Comprehensive Screenshot Capture:
    • Labeled screenshots with element markers
    • Before and after each action
    • Final state documentation
  • Web UI & API: Flask web interface and REST API
  • Session Management: Saves browser sessions (no re-login needed)
  • Fully Generic: No hardcoded selectors - works across different applications

Architecture

┌─────────────────────────────────────────────────────────────────────┐
│                    Vision-Guided Agent                               │
│                 (vision_guided_agent.py)                             │
│                  Main Orchestrator                                   │
└─────────────────────────────┬───────────────────────────────────────┘
                              │
              ┌───────────────┼───────────────┐
              │               │               │
    ┌─────────▼──────┐  ┌────▼─────────┐  ┌─▼────────────────┐
    │ Vision Agent   │  │  Navigator   │  │ SoM Labeler      │
    │  (GPT-4o)      │  │ (Playwright) │  │ (Element Marker) │
    │ - Parse task   │  │ - Execute    │  │ - Extract elements│
    │ - Decide next  │  │   actions    │  │ - Label with IDs │
    │   action       │  │ - Manage     │  │ - Create labeled │
    │ - Analyze SoM  │  │   browser    │  │   screenshots    │
    └────────────────┘  └──────┬───────┘  └──────────────────┘
                               │
                    ┌──────────▼──────────┐
                    │ Screenshot Manager  │
                    │ - Capture states    │
                    │ - Save metadata     │
                    │ - Organize datasets │
                    └─────────────────────┘

                    ┌─────────────────────┐
                    │   Web UI (Flask)    │
                    │   web_app.py        │
                    │ - REST API          │
                    │ - Progress tracking │
                    │ - Dataset browser   │
                    └─────────────────────┘

Installation

Prerequisites

  • Python 3.10+
  • OpenAI API key (Get one here)
    • Requires GPT-4o with vision capabilities

Setup

  1. Clone the repository (or navigate to the project directory):

    cd softlight
  2. Install dependencies:

    pip install -r requirements.txt
  3. Install Playwright browsers:

    python -m playwright install
  4. Configure environment variables:

    cp .env.example .env

    Edit .env and add your OpenAI API key:

    OPENAI_API_KEY=your_api_key_here
    
  5. Initial login (one-time setup):

    For the agent to work, you need to manually log in to your target applications once:

    # This will open a browser where you can log in
    python -c "from playwright.sync_api import sync_playwright; p = sync_playwright().start(); browser = p.chromium.launch(headless=False); context = browser.new_context(); page = context.new_page(); page.goto('https://linear.app'); input('Log in, then press Enter...'); context.storage_state(path='browser_state/linear_state.json'); browser.close()"

    Repeat for Notion or other apps you want to automate.
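
    If you prefer something easier to read and edit than the one-liner in step 5, the same flow can be written as a small script (a sketch; the URL and state-file name are examples, use one file per app):

    # save_login.py: a readable version of the one-liner in step 5 above.
    from pathlib import Path
    from playwright.sync_api import sync_playwright

    APP_URL = "https://linear.app"
    STATE_PATH = Path("browser_state/linear_state.json")

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=False)
        context = browser.new_context()
        page = context.new_page()
        page.goto(APP_URL)
        input("Log in in the opened browser, then press Enter here...")
        STATE_PATH.parent.mkdir(parents=True, exist_ok=True)
        context.storage_state(path=str(STATE_PATH))  # persist cookies and local storage
        browser.close()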

Usage

There are four ways to run the agent:

1. Web UI (Recommended)

Start the Flask web server:

python web_app.py

Then open your browser to http://localhost:8080

The Web UI provides:

  • Simple text input for natural language queries
  • Real-time execution status
  • Dataset browser to view captured workflows
  • Download datasets as ZIP files

2. Command Line

Run a single task query directly:

python vision_guided_agent.py "How do I create a table in Notion?"

Examples:

# Notion tasks
python vision_guided_agent.py "How do I create a table in Notion?"
python vision_guided_agent.py "How do I create a database in Notion?"

# Linear tasks
python vision_guided_agent.py "How do I create a project in Linear?"
python vision_guided_agent.py "How do I create an issue in Linear?"

3. REST API

Start the web app and use the REST API:

# Start the server
python web_app.py

# In another terminal, make API requests
curl -X POST http://localhost:8080/api/capture/start \
  -H "Content-Type: application/json" \
  -d '{"query": "How do I create a table in Notion?"}'

4. Programmatic Usage

import asyncio
from vision_guided_agent import VisionGuidedAgent

async def main():
    # Create agent
    agent = VisionGuidedAgent()

    # Execute a query
    result = await agent.execute_query(
        "How do I create a table in Notion?",
        headless=False  # Set to True for headless mode
    )

    # Check results
    if result['success']:
        print(f"Task completed successfully!")
        print(f"Screenshots saved to: {result['dataset_location']}")
        print(f"Total steps: {result['execution']['total_steps']}")
    else:
        print(f"Task failed: {result['error']}")

asyncio.run(main())

Example Tasks

Here are example tasks the system can handle:

Notion Tasks

  1. Create a table: "How do I create a table in Notion?"
  2. Create a database: "How do I create a database in Notion?"
  3. Create a page: "How do I create a new page in Notion?"
  4. Add a to-do list: "How do I add a to-do list in Notion?"

Linear Tasks

  1. Create a project: "How do I create a project in Linear?"
  2. Create an issue: "How do I create a new issue in Linear?"
  3. Filter issues: "How do I filter issues by status in Linear?"
  4. Assign an issue: "How do I assign an issue to someone in Linear?"

The system is fully generic and can handle any task query for supported applications.

Output Structure

Screenshots and metadata are organized by app and task:

dataset/
├── linear/
│   ├── create_project/
│   │   ├── step_01_initial_initial_page_state.png
│   │   ├── step_01_labeled.png                      # Set-of-Marks labeled
│   │   ├── step_02_click_after_click_on_element_#42.png
│   │   ├── step_02_labeled.png                      # Set-of-Marks labeled
│   │   ├── step_03_fill_after_fill_on_element_#75.png
│   │   ├── step_04_click_after_click_on_element_#88.png
│   │   ├── step_01_final_final_state_after_task_completion.png
│   │   └── metadata.json                            # Complete execution metadata
│   └── create_issue/
│       └── ...
└── notion/
    ├── create_table/
    │   ├── step_01_initial_initial_page_state.png
    │   ├── step_01_labeled.png
    │   ├── step_02_click_after_click_on_element_#22.png
    │   ├── step_03_press_after_press_on_element_#75.png
    │   ├── step_04_fill_after_fill_on_element_#76.png
    │   ├── step_05_click_after_click_on_element_#75.png
    │   ├── step_01_final_final_state_after_task_completion.png
    │   └── metadata.json
    └── create_database/
        └── ...

Each task directory contains:

  • Labeled screenshots: With numbered element markers (Set-of-Marks)
  • Action screenshots: Captured after each action
  • Final state: Showing task completion
  • metadata.json: Complete execution details

Metadata Format

Each task generates a metadata.json file:

{
  "task": {
    "app": "Notion",
    "task_type": "create table",
    "description": "How do I create a new table in Notion?",
    "query": "How do I create a new table in Notion?"
  },
  "execution": {
    "success": true,
    "error": null,
    "start_time": "2025-11-07T01:39:39.587238",
    "end_time": "2025-11-07T01:40:17.954218",
    "total_steps": 4
  },
  "steps": [
    {
      "step_number": 2,
      "action": "click",
      "element_id": 22,
      "description": "To create a new table, we need to start by opening a new page where we can insert Notion content blocks.",
      "success": true,
      "error": null,
      "screenshot_path": "dataset/notion/create_table/step_02_click_after_click_on_element_#22.png",
      "timestamp": "2025-11-07T01:39:39.587238"
    },
    {
      "step_number": 3,
      "action": "press",
      "element_id": 75,
      "description": "To create a new table, first press Enter from the title to create a body content block.",
      "success": true,
      "screenshot_path": "dataset/notion/create_table/step_03_press_after_press_on_element_#75.png",
      "timestamp": "2025-11-07T01:39:49.217972"
    }
  ],
  "screenshots": [...],
  "metadata_version": "1.0",
  "generated_at": "2025-11-07T01:40:22.856501"
}

The element_id field references the numbered markers in the Set-of-Marks labeled screenshots.
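
Because the format is plain JSON, it is easy to inspect programmatically. For example (a small sketch using only the fields shown above):

# Sketch: load a generated metadata.json and print the recorded steps.
import json
from pathlib import Path

meta = json.loads(Path("dataset/notion/create_table/metadata.json").read_text())
print(meta["task"]["query"], "- success:", meta["execution"]["success"])
for step in meta["steps"]:
    print(f"step {step['step_number']}: {step['action']} element #{step['element_id']}")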

Configuration

Edit .env file to customize behavior:

Key Settings

  • OPENAI_API_KEY: Your OpenAI API key (required)
  • BROWSER_MODE: headed (visible) or headless (background)
  • SLOW_MO: Slow down actions by X milliseconds (default: 500)
  • SCREENSHOT_DIR: Where to save screenshots (default: ./dataset)
  • GPT_MODEL: GPT model to use (default: gpt-4o)
  • SAVE_SESSION: Save browser sessions to avoid re-login (default: true)
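
A minimal .env using the settings above might look like this (values are placeholders):

# Example .env
OPENAI_API_KEY=your_api_key_here
BROWSER_MODE=headed
SLOW_MO=500
SCREENSHOT_DIR=./dataset
GPT_MODEL=gpt-4o
SAVE_SESSION=true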

See .env.example for all available options.

How It Works

Vision-Guided Workflow

The system uses a vision-guided approach powered by GPT-4o and Set-of-Marks:

1. Task Parsing (vision_agent.py)

GPT-4o analyzes the natural language query:

  • Identifies the application (Linear, Notion, etc.)
  • Extracts the task goal
  • Understands the desired outcome

2. Set-of-Marks Element Labeling (set_of_marks_labeler.py)

For each UI state:

  • Extracts all interactive elements using JavaScript
  • Identifies element properties (text, labels, roles, position)
  • Creates labeled screenshots with numbered markers
  • Special handling for Notion:
    • Distinguishes TITLE vs BODY content blocks
    • Detects H1 elements as titles
    • Identifies slash command input areas
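
A minimal sketch of this extraction step using Playwright's page.evaluate() (the selector and returned fields are illustrative; the real labeler in set_of_marks_labeler.py also draws the numbered overlays and applies the Notion rules above):

# Sketch: collect visible interactive elements and the data needed to label them.
INTERACTIVE_SELECTOR = "a, button, input, textarea, select, [role='button'], [contenteditable='true']"

async def extract_interactive_elements(page):
    return await page.evaluate(
        """(selector) => Array.from(document.querySelectorAll(selector))
            .filter(el => el.offsetWidth > 0 && el.offsetHeight > 0)
            .map((el, i) => {
                const box = el.getBoundingClientRect();
                return {
                    id: i + 1,                                    // number drawn on the label
                    tag: el.tagName.toLowerCase(),
                    role: el.getAttribute('role'),
                    text: (el.innerText || el.value || '').slice(0, 80),
                    x: box.x, y: box.y, width: box.width, height: box.height,
                };
            })""",
        INTERACTIVE_SELECTOR,
    )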

3. Vision-Guided Decision Making (vision_agent.py)

GPT-4o analyzes the labeled screenshot:

  • Sees the numbered elements on the screenshot
  • Reads element descriptions (text, labels, roles)
  • Decides the next action based on:
    • Current UI state
    • Task goal
    • Previous actions taken
  • Selects element by ID (e.g., "Click element #42")
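
A sketch of what this decision call can look like with the OpenAI Python SDK (the prompt wording and JSON shape are illustrative, not the exact ones in vision_agent.py):

# Sketch: ask GPT-4o for the next action given the labeled screenshot.
import base64
import json
from openai import OpenAI

client = OpenAI()  # uses OPENAI_API_KEY from the environment

def decide_next_action(labeled_screenshot_path, task_goal, history):
    with open(labeled_screenshot_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode()
    response = client.chat.completions.create(
        model="gpt-4o",
        response_format={"type": "json_object"},
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": (
                    f"Task: {task_goal}\nPrevious actions: {history}\n"
                    "Every interactive element on the screenshot has a numbered label. "
                    'Reply with JSON such as {"action": "click", "element_id": 42, "reason": "..."} '
                    'or {"action": "done"} when the task is finished.')},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
    )
    return json.loads(response.choices[0].message.content)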

4. Action Execution (navigator.py)

Playwright executes the action:

  • click: Click element by ID
  • fill: Type text into element by ID
  • press: Press keyboard keys (Enter, Escape, etc.)
  • wait: Wait for UI updates
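
A sketch of dispatching the chosen action by clicking the labeled element's bounding-box center (the real navigator.py may resolve elements differently):

# Sketch: execute the decided action using the element geometry from the labeler.
# Acting on coordinates keeps the navigator free of hardcoded selectors.
async def execute_action(page, action, elements):
    if action["action"] == "wait":
        await page.wait_for_timeout(1000)          # give the UI time to settle
        return
    if action["action"] == "press":
        await page.keyboard.press(action.get("key", "Enter"))
        return
    target = next(e for e in elements if e["id"] == action["element_id"])
    x = target["x"] + target["width"] / 2
    y = target["y"] + target["height"] / 2
    if action["action"] == "click":
        await page.mouse.click(x, y)
    elif action["action"] == "fill":
        await page.mouse.click(x, y)               # focus the element first
        await page.keyboard.type(action["text"])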

5. Screenshot Capture (screenshot_manager.py)

After each action:

  • Captures the new UI state
  • Creates a new labeled screenshot
  • GPT-4o analyzes and decides next action
  • Repeats until task is complete
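
Tying the sketches above together, the outer loop looks roughly like this (illustrative; the real loop in vision_guided_agent.py also writes metadata and applies the configurable step limit):

# Sketch: the observe -> decide -> act loop, using the helpers sketched above.
async def run_steps(page, task_goal, max_steps=15):
    history = []
    for step in range(1, max_steps + 1):
        elements = await extract_interactive_elements(page)
        labeled_path = f"step_{step:02d}_labeled.png"
        await page.screenshot(path=labeled_path)   # drawn with SoM overlays in practice
        action = decide_next_action(labeled_path, task_goal, history)
        if action["action"] == "done":
            return True
        await execute_action(page, action, elements)
        await page.screenshot(path=f"step_{step:02d}_{action['action']}.png")
        history.append(action.get("reason", action["action"]))
    return False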

6. Metadata Generation

Saves complete workflow information:

  • Each step with element ID and reasoning
  • All screenshots (labeled and unlabeled)
  • Execution timeline and success status
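
A sketch of how that file might be assembled, matching the fields shown in the "Metadata Format" section above:

# Sketch: assemble and write metadata.json for a completed task.
import json
from datetime import datetime
from pathlib import Path

def save_metadata(task_dir, task, steps, success, error=None):
    metadata = {
        "task": task,                      # app, task_type, description, query
        "execution": {
            "success": success,
            "error": error,
            "total_steps": len(steps),
        },
        "steps": steps,                    # one entry per action, as shown above
        "metadata_version": "1.0",
        "generated_at": datetime.now().isoformat(),
    }
    task_dir = Path(task_dir)
    task_dir.mkdir(parents=True, exist_ok=True)
    (task_dir / "metadata.json").write_text(json.dumps(metadata, indent=2))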

Supported Applications

Currently optimized for:

  • Linear (project management)
  • Notion (knowledge management)

Can be extended to any web application - the system is fully generic and doesn't use hardcoded selectors.

Generalization

The system is fully generic and works across different applications:

  • No hardcoded selectors: Uses Set-of-Marks to identify elements dynamically
  • Vision-guided: GPT-4o sees the actual UI and decides actions in real-time
  • No pre-programming needed: Adapts to any UI structure automatically
  • Handles app-specific patterns: Special logic for Notion, Linear, etc.
  • Learns from context: Uses previous actions to inform next steps

Troubleshooting

"Could not find element" errors

  1. Slow down actions: Increase SLOW_MO in .env (e.g., SLOW_MO=1000)
  2. Run in headed mode to see what's happening: BROWSER_MODE=headed
  3. Check the labeled screenshots in the dataset folder to see what elements were detected

"OPENAI_API_KEY is required" error

Make sure you've created a .env file and added your API key:

cp .env.example .env
# Edit .env and add your OpenAI API key

API rate limit errors

If you hit OpenAI rate limits:

  1. Add delays between API calls in the code
  2. Reduce screenshot resolution
  3. Use a higher tier API plan
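
For option 1, a simple exponential backoff around the OpenAI call is usually enough (a sketch using the SDK's RateLimitError):

# Sketch: retry the OpenAI call with exponential backoff when rate-limited.
import time
from openai import OpenAI, RateLimitError

client = OpenAI()

def chat_with_backoff(max_retries=5, **kwargs):
    for attempt in range(max_retries):
        try:
            return client.chat.completions.create(**kwargs)
        except RateLimitError:
            time.sleep(2 ** attempt)       # wait 1s, 2s, 4s, ...
    raise RuntimeError("Still rate-limited after retries")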

Session/login issues

Manually log in to the application once:

python -m playwright codegen --save-storage=browser_state/linear_state.json https://linear.app

This opens a browser where you can log in. When you close the browser, the session state is written to the file passed via --save-storage, so the agent can reuse it on later runs.

Screenshots not capturing

Check:

  1. SCREENSHOT_DIR exists and is writable
  2. Playwright has permissions to write files
  3. Full page screenshots are enabled: FULL_PAGE_SCREENSHOT=true

Development

Project Structure

softlight/
├── vision_guided_agent.py      # Main vision-guided orchestrator
├── vision_agent.py             # GPT-4o task interpretation & decision making
├── set_of_marks_labeler.py     # Element extraction & labeling
├── navigator.py                # Playwright browser automation
├── screenshot_manager.py       # Screenshot capture & organization
├── web_app.py                  # Flask web UI & REST API
├── config.py                   # Configuration management
├── utils.py                    # Helper functions
├── requirements.txt            # Python dependencies
├── .env.example                # Environment variables template
├── dataset/                    # Generated screenshots & metadata
│   ├── linear/                 # Linear task datasets
│   └── notion/                 # Notion task datasets
├── browser_state/              # Saved browser sessions (for auto-login)
├── web/                        # Web UI static files (HTML/CSS/JS)
└── README.md                   # This file

Adding New Applications

The system is generic and can work with any web application:

  1. Add the app URL to config.py:

    APP_URLS = {
        "linear": "https://linear.app",
        "notion": "https://notion.so",
        "myapp": "https://myapp.com",
    }
  2. Log in manually once to save the browser session:

    python -m playwright codegen --save-storage=browser_state/myapp_state.json https://myapp.com
    # Log in, then close the browser to save the session
  3. Run a task query:

    python vision_guided_agent.py "How do I create something in MyApp?"

The vision-guided approach means no app-specific code is needed - GPT-4o figures out the UI dynamically.

Limitations

  • Requires manual login for each application (one-time setup)
  • GPT-4o API calls can be expensive for complex workflows
  • Vision analysis adds latency compared to pure DOM-based approaches
  • Limited to 15 steps per task (configurable in code)
  • May struggle with very dynamic SPAs that heavily mutate the DOM

Contributing

Contributions are welcome! Areas for improvement:

  • Additional application support (GitHub, Jira, Slack, etc.)
  • Improved element detection for complex UIs
  • Better error recovery and retry logic
  • Multi-tab/window support
  • Video recording of workflows
  • Caching of vision analysis results

License

MIT License - see LICENSE file for details

Credits

Built with Python, OpenAI GPT-4o, Playwright, and Flask.

Inspired by the Set-of-Marks (SoM) visual grounding approach.


Note: This tool is designed for authorized automation of web applications you have access to. Always respect websites' terms of service and robots.txt files.
