An intelligent web automation system that takes natural language task queries and automatically navigates live web applications using vision-guided decision making and Set-of-Marks visual grounding.
Features • Installation • Usage • Demo • Architecture
This system uses GPT-4o with Vision (OpenAI) to interpret natural language queries like "How do I create a project in Linear?" or "How do I create a table in Notion?" and automatically:
- Parses the query to understand the app and task
- Analyzes screenshots in real-time using Set-of-Marks visual grounding
- Decides next actions dynamically based on current UI state
- Navigates the live web application using Playwright
- Captures screenshots at each step
- Generates structured metadata for each workflow
- Natural Language Interface: Describe what you want to do in plain English
- Vision-Guided Navigation: GPT-4o analyzes screenshots in real-time to decide next actions
- Set-of-Marks (SoM) Visual Grounding:
  - Overlays numbered labels on interactive elements
  - Enables precise element identification by ID
  - Eliminates ambiguity in element selection
- Real-Time Decision Making:
  - No pre-programmed workflows
  - Adapts to any UI state dynamically
  - Handles errors and unexpected states
- Notion-Specific Intelligence:
  - Distinguishes between title and body content blocks
  - Understands slash command workflows
  - Matches slash commands to task goals (/table vs /database)
- Comprehensive Screenshot Capture:
  - Labeled screenshots with element markers
  - Before and after each action
  - Final state documentation
- Web UI & API: Flask web interface and REST API
- Session Management: Saves browser sessions (no re-login needed)
- Fully Generic: No hardcoded selectors - works across different applications
┌─────────────────────────────────────────────────────────────────────┐
│ Vision-Guided Agent │
│ (vision_guided_agent.py) │
│ Main Orchestrator │
└─────────────────────────────┬───────────────────────────────────────┘
│
┌───────────────┼───────────────┐
│ │ │
┌─────────▼──────┐ ┌────▼─────────┐ ┌─▼────────────────┐
│ Vision Agent │ │ Navigator │ │ SoM Labeler │
│ (GPT-4o) │ │ (Playwright) │ │ (Element Marker) │
│ - Parse task │ │ - Execute │ │ - Extract elements│
│ - Decide next │ │ actions │ │ - Label with IDs │
│ action │ │ - Manage │ │ - Create labeled │
│ - Analyze SoM │ │ browser │ │ screenshots │
└────────────────┘ └──────┬───────┘ └──────────────────┘
│
┌──────────▼──────────┐
│ Screenshot Manager │
│ - Capture states │
│ - Save metadata │
│ - Organize datasets │
└─────────────────────┘
┌─────────────────────┐
│ Web UI (Flask) │
│ web_app.py │
│ - REST API │
│ - Progress tracking │
│ - Dataset browser │
└─────────────────────┘
- Python 3.10+
- OpenAI API key with access to GPT-4o (the agent relies on its vision capabilities)
1. Clone the repository (or navigate to the project directory):
   ```bash
   cd softlight
   ```
2. Install dependencies:
   ```bash
   pip install -r requirements.txt
   ```
3. Install Playwright browsers:
   ```bash
   python -m playwright install
   ```
4. Configure environment variables:
   ```bash
   cp .env.example .env
   ```
   Edit `.env` and add your OpenAI API key:
   ```
   OPENAI_API_KEY=your_api_key_here
   ```
5. Initial login (one-time setup). For the agent to work, you need to manually log in to your target applications once:
   ```python
   # This opens a browser where you can log in, then saves the session state
   from playwright.sync_api import sync_playwright

   with sync_playwright() as p:
       browser = p.chromium.launch(headless=False)
       context = browser.new_context()
       page = context.new_page()
       page.goto("https://linear.app")
       input("Log in, then press Enter...")
       context.storage_state(path="browser_state/linear_state.json")
       browser.close()
   ```
   Repeat for Notion or any other apps you want to automate.
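With `SAVE_SESSION=true` (the default), the agent reloads this saved state on later runs. For reference, reusing it in plain Playwright looks like this:

```python
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=False)
    # Restore the cookies/localStorage captured during the one-time login
    context = browser.new_context(storage_state="browser_state/linear_state.json")
    page = context.new_page()
    page.goto("https://linear.app")  # lands already logged in
```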
There are four ways to run the agent:
Start the Flask web server:
```bash
python web_app.py
```
Then open your browser to http://localhost:8080.
The Web UI provides:
- Simple text input for natural language queries
- Real-time execution status
- Dataset browser to view captured workflows
- Download datasets as ZIP files
Run a single task query directly:
```bash
python vision_guided_agent.py "How do I create a table in Notion?"
```
Examples:
```bash
# Notion tasks
python vision_guided_agent.py "How do I create a table in Notion?"
python vision_guided_agent.py "How do I create a database in Notion?"

# Linear tasks
python vision_guided_agent.py "How do I create a project in Linear?"
python vision_guided_agent.py "How do I create an issue in Linear?"
```
Start the web app and use the REST API:
```bash
# Start the server
python web_app.py

# In another terminal, make API requests
curl -X POST http://localhost:8080/api/capture/start \
  -H "Content-Type: application/json" \
  -d '{"query": "How do I create a table in Notion?"}'
```
Or drive the agent directly from Python:
```python
import asyncio
from vision_guided_agent import VisionGuidedAgent

async def main():
    # Create agent
    agent = VisionGuidedAgent()

    # Execute a query
    result = await agent.execute_query(
        "How do I create a table in Notion?",
        headless=False,  # Set to True for headless mode
    )

    # Check results
    if result["success"]:
        print("Task completed successfully!")
        print(f"Screenshots saved to: {result['dataset_location']}")
        print(f"Total steps: {result['execution']['total_steps']}")
    else:
        print(f"Task failed: {result['error']}")

asyncio.run(main())
```
Here are example tasks the system can handle:
- Create a table: "How do I create a table in Notion?"
- Create a database: "How do I create a database in Notion?"
- Create a page: "How do I create a new page in Notion?"
- Add a to-do list: "How do I add a to-do list in Notion?"
- Create a project: "How do I create a project in Linear?"
- Create an issue: "How do I create a new issue in Linear?"
- Filter issues: "How do I filter issues by status in Linear?"
- Assign an issue: "How do I assign an issue to someone in Linear?"
The system is fully generic and can handle any task query for supported applications.
Screenshots and metadata are organized by app and task:
dataset/
├── linear/
│ ├── create_project/
│ │ ├── step_01_initial_initial_page_state.png
│ │ ├── step_01_labeled.png # Set-of-Marks labeled
│ │ ├── step_02_click_after_click_on_element_#42.png
│ │ ├── step_02_labeled.png # Set-of-Marks labeled
│ │ ├── step_03_fill_after_fill_on_element_#75.png
│ │ ├── step_04_click_after_click_on_element_#88.png
│ │ ├── step_01_final_final_state_after_task_completion.png
│ │ └── metadata.json # Complete execution metadata
│ └── create_issue/
│ └── ...
└── notion/
├── create_table/
│ ├── step_01_initial_initial_page_state.png
│ ├── step_01_labeled.png
│ ├── step_02_click_after_click_on_element_#22.png
│ ├── step_03_press_after_press_on_element_#75.png
│ ├── step_04_fill_after_fill_on_element_#76.png
│ ├── step_05_click_after_click_on_element_#75.png
│ ├── step_01_final_final_state_after_task_completion.png
│ └── metadata.json
└── create_database/
└── ...
Each task directory contains:
- Labeled screenshots: With numbered element markers (Set-of-Marks)
- Action screenshots: Captured after each action
- Final state: Showing task completion
- metadata.json: Complete execution details
Each task generates a `metadata.json` file:
```json
{
"task": {
"app": "Notion",
"task_type": "create table",
"description": "How do I create a new table in Notion?",
"query": "How do I create a new table in Notion?"
},
"execution": {
"success": true,
"error": null,
"start_time": "2025-11-07T01:39:39.587238",
"end_time": "2025-11-07T01:40:17.954218",
"total_steps": 4
},
"steps": [
{
"step_number": 2,
"action": "click",
"element_id": 22,
"description": "To create a new table, we need to start by opening a new page where we can insert Notion content blocks.",
"success": true,
"error": null,
"screenshot_path": "dataset/notion/create_table/step_02_click_after_click_on_element_#22.png",
"timestamp": "2025-11-07T01:39:39.587238"
},
{
"step_number": 3,
"action": "press",
"element_id": 75,
"description": "To create a new table, first press Enter from the title to create a body content block.",
"success": true,
"screenshot_path": "dataset/notion/create_table/step_03_press_after_press_on_element_#75.png",
"timestamp": "2025-11-07T01:39:49.217972"
}
],
"screenshots": [...],
"metadata_version": "1.0",
"generated_at": "2025-11-07T01:40:22.856501"
}
```
The `element_id` field references the numbered markers in the Set-of-Marks labeled screenshots.
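To consume a captured dataset programmatically, a small loader can read this schema directly (a minimal sketch; the task path is illustrative):

```python
import json
from pathlib import Path

def load_workflow(task_dir):
    """Read a task's metadata.json and return its recorded steps."""
    meta = json.loads((Path(task_dir) / "metadata.json").read_text())
    return meta["steps"]

# e.g. replay the recorded reasoning for one task
for step in load_workflow("dataset/notion/create_table"):
    print(step["step_number"], step["action"], step.get("element_id"))
```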
Edit the `.env` file to customize behavior:
- `OPENAI_API_KEY`: Your OpenAI API key (required)
- `BROWSER_MODE`: `headed` (visible) or `headless` (background)
- `SLOW_MO`: Slow down actions by X milliseconds (default: `500`)
- `SCREENSHOT_DIR`: Where to save screenshots (default: `./dataset`)
- `GPT_MODEL`: GPT model to use (default: `gpt-4o`)
- `SAVE_SESSION`: Save browser sessions to avoid re-login (default: `true`)
See .env.example for all available options.
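For reference, a filled-in `.env` using the defaults above might look like this (the API key is a placeholder):

```
OPENAI_API_KEY=sk-your-key-here
BROWSER_MODE=headed
SLOW_MO=500
SCREENSHOT_DIR=./dataset
GPT_MODEL=gpt-4o
SAVE_SESSION=true
```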
The system uses a vision-guided approach powered by GPT-4o and Set-of-Marks:
GPT-4o analyzes the natural language query (see the sketch after this list):
- Identifies the application (Linear, Notion, etc.)
- Extracts the task goal
- Understands the desired outcome
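A minimal sketch of this parsing step, assuming the official `openai` Python package; the project's actual logic lives in vision_agent.py, so treat the function and prompt here as illustrative:

```python
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def parse_task(query: str) -> dict:
    """Ask GPT-4o to identify the target app and the task goal."""
    resp = client.chat.completions.create(
        model="gpt-4o",
        response_format={"type": "json_object"},  # force parseable output
        messages=[{
            "role": "user",
            "content": (
                "Identify the web app and the task in this query. "
                'Respond as JSON with keys "app" and "task_type". '
                f"Query: {query}"
            ),
        }],
    )
    return json.loads(resp.choices[0].message.content)

# parse_task("How do I create a table in Notion?")
# -> {"app": "Notion", "task_type": "create table"}  (illustrative)
```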
For each UI state:
- Extracts all interactive elements using JavaScript
- Identifies element properties (text, labels, roles, position)
- Creates labeled screenshots with numbered markers (see the sketch after this list)
- Special handling for Notion:
  - Distinguishes TITLE vs BODY content blocks
  - Detects H1 elements as titles
  - Identifies slash command input areas
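A minimal sketch of the extract-and-label step, assuming async Playwright and Pillow; the selector list and marker style are illustrative, not the project's actual set_of_marks_labeler.py:

```python
from io import BytesIO
from PIL import Image, ImageDraw

SELECTOR = ('a, button, input, textarea, select, '
            '[role="button"], [contenteditable="true"]')

EXTRACT_JS = """
(selector) => Array.from(document.querySelectorAll(selector))
  .filter(el => el.offsetWidth > 0 && el.offsetHeight > 0)  // visible only
  .map((el, i) => {
    const r = el.getBoundingClientRect();
    return {
      id: i,
      role: el.getAttribute('role') || el.tagName.toLowerCase(),
      text: (el.innerText || el.value || el.getAttribute('aria-label') || '').slice(0, 60),
      x: r.x, y: r.y, w: r.width, h: r.height,
    };
  })
"""

async def label_page(page):
    """Return (elements, labeled screenshot) for the current UI state."""
    elements = await page.evaluate(EXTRACT_JS, SELECTOR)
    shot = Image.open(BytesIO(await page.screenshot()))
    draw = ImageDraw.Draw(shot)
    for el in elements:  # draw a numbered marker over each element
        x, y = int(el["x"]), int(el["y"])
        draw.rectangle([x, y, x + int(el["w"]), y + int(el["h"])], outline="red")
        draw.text((x + 2, max(y - 12, 0)), str(el["id"]), fill="red")
    return elements, shot
```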
GPT-4o analyzes the labeled screenshot:
- Sees the numbered elements on the screenshot
- Reads element descriptions (text, labels, roles)
- Decides the next action based on:
  - Current UI state
  - Task goal
  - Previous actions taken
- Selects element by ID (e.g., "Click element #42"), as shown below
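Illustratively, a decision is a small structured action whose fields mirror the step records in metadata.json:

```json
{
  "action": "click",
  "element_id": 42,
  "description": "Open the project creation dialog."
}
```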
Playwright executes the action (see the dispatch sketch after this list):
- `click`: Click element by ID
- `fill`: Type text into element by ID
- `press`: Press keyboard keys (Enter, Escape, etc.)
- `wait`: Wait for UI updates
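A sketch of that dispatch, assuming async Playwright and the bounding boxes from the labeling sketch above; clicking the marker's center is one common SoM strategy, and the project's navigator.py may do this differently:

```python
async def execute_action(page, decision, elements):
    """Dispatch one GPT-4o decision to the browser (illustrative sketch)."""
    action = decision["action"]
    if action in ("click", "fill"):
        box = elements[decision["element_id"]]          # from label_page()
        cx = box["x"] + box["w"] / 2                    # center of the marker
        cy = box["y"] + box["h"] / 2
        await page.mouse.click(cx, cy)
        if action == "fill":
            await page.keyboard.type(decision["text"])  # type into the focused field
    elif action == "press":
        await page.keyboard.press(decision["key"])      # e.g. "Enter", "Escape"
    elif action == "wait":
        await page.wait_for_timeout(1000)               # give the UI time to settle
```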
After each action:
- Captures the new UI state
- Creates a new labeled screenshot
- GPT-4o analyzes and decides next action
- Repeats until the task is complete (see the loop sketch below)
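Putting the pieces together, the observe-decide-act loop might look like this sketch, reusing `label_page` and `execute_action` from above; `decide` stands in for the GPT-4o vision call and is hypothetical:

```python
async def run_task(page, task, decide, max_steps=15):
    """Iterate observe -> decide -> act until the task is reported done."""
    history = []
    for _ in range(max_steps):                                 # hard cap (see Limitations)
        elements, labeled_shot = await label_page(page)        # observe
        decision = decide(labeled_shot, elements, task, history)  # GPT-4o decides
        if decision["action"] == "done":                       # model reports completion
            break
        await execute_action(page, decision, elements)         # act
        history.append(decision)                               # context for the next call
    return history
```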
Saves complete workflow information:
- Each step with element ID and reasoning
- All screenshots (labeled and unlabeled)
- Execution timeline and success status
Currently optimized for:
- Linear (project management)
- Notion (knowledge management)
Can be extended to any web application - the system is fully generic and doesn't use hardcoded selectors.
The system is fully generic and works across different applications:
- No hardcoded selectors: Uses Set-of-Marks to identify elements dynamically
- Vision-guided: GPT-4o sees the actual UI and decides actions in real-time
- No pre-programming needed: Adapts to any UI structure automatically
- Handles app-specific patterns: Special logic for Notion, Linear, etc.
- Learns from context: Uses previous actions to inform next steps
- Slow down actions: increase `SLOW_MO` in `.env` (e.g., `SLOW_MO=1000`)
- Run in headed mode to see what's happening: `BROWSER_MODE=headed`
- Check the labeled screenshots in the dataset folder to see which elements were detected
Make sure you've created a `.env` file and added your API key:
```bash
cp .env.example .env
# Edit .env and add your OpenAI API key
```
If you hit OpenAI rate limits:
- Add delays between API calls in the code
- Reduce screenshot resolution
- Use a higher-tier API plan
Manually log in to the application once, saving the session state where the agent expects it:
```bash
python -m playwright codegen --save-storage=browser_state/linear_state.json https://linear.app
```
This opens a browser where you can log in; the session state is written when you close it.
Check that:
- `SCREENSHOT_DIR` exists and is writable
- Playwright has permission to write files
- Full-page screenshots are enabled: `FULL_PAGE_SCREENSHOT=true`
softlight/
├── vision_guided_agent.py # Main vision-guided orchestrator
├── vision_agent.py # GPT-4o task interpretation & decision making
├── set_of_marks_labeler.py # Element extraction & labeling
├── navigator.py # Playwright browser automation
├── screenshot_manager.py # Screenshot capture & organization
├── web_app.py # Flask web UI & REST API
├── config.py # Configuration management
├── utils.py # Helper functions
├── requirements.txt # Python dependencies
├── .env.example # Environment variables template
├── dataset/ # Generated screenshots & metadata
│ ├── linear/ # Linear task datasets
│ └── notion/ # Notion task datasets
├── browser_state/ # Saved browser sessions (for auto-login)
├── web/ # Web UI static files (HTML/CSS/JS)
└── README.md # This file
The system is generic and can work with any web application:
1. Add the app URL to `config.py`:
   ```python
   APP_URLS = {
       "linear": "https://linear.app",
       "notion": "https://notion.so",
       "myapp": "https://myapp.com",
   }
   ```
2. Log in manually once to save the browser session:
   ```bash
   # Log in, then close the browser to save the session
   python -m playwright codegen --save-storage=browser_state/myapp_state.json https://myapp.com
   ```
3. Run a task query:
   ```bash
   python vision_guided_agent.py "How do I create something in MyApp?"
   ```
The vision-guided approach means no app-specific code is needed - GPT-4o figures out the UI dynamically.
- Requires manual login for each application (one-time setup)
- GPT-4o API calls can be expensive for complex workflows
- Vision analysis adds latency compared to pure DOM-based approaches
- Limited to 15 steps per task (configurable in code)
- May struggle with very dynamic SPAs that heavily mutate the DOM
Contributions are welcome! Areas for improvement:
- Additional application support (GitHub, Jira, Slack, etc.)
- Improved element detection for complex UIs
- Better error recovery and retry logic
- Multi-tab/window support
- Video recording of workflows
- Caching of vision analysis results
MIT License - see LICENSE file for details
Built with:
- GPT-4o (OpenAI) - Vision-based decision making
- Playwright - Browser automation
- Flask - Web UI framework
- Python - Core language
Inspired by the Set-of-Marks (SoM) visual grounding approach.
Note: This tool is designed for authorized automation of web applications you have access to. Always respect websites' terms of service and robots.txt files.