Advanced AI-Powered Browser Automation SDK
CtrlAgent is a sophisticated browser automation framework that leverages Large Language Models (LLMs) to execute complex web tasks through natural language instructions. The system provides both imperative page methods and a programmatic task execution API, enabling developers to automate browser interactions without writing explicit selectors or navigation logic.
- Overview
- System Architecture
- Core Components
- Execution Flow
- Installation
- Quick Start
- API Reference
- LLM Providers
- Browser Providers
- Action System
- DOM Context Extraction
- MCP Integration
- Configuration
- Debug Mode
- Project Structure
- Contributing
- License
CtrlAgent transforms natural language instructions into precise browser actions through an agentic loop architecture. The system captures the current page state using accessibility trees and visual overlays, constructs prompts for LLM reasoning, and executes the resulting actions through Playwright or direct Chrome DevTools Protocol (CDP) calls.
- Natural Language Control: Execute browser tasks using plain English instructions
- Autonomous Navigation: Let the AI agent determine the optimal path to complete complex tasks
- Structured Data Extraction: Extract typed data from web pages with schema validation
- Multi-Provider LLM Support: Compatible with OpenAI, Anthropic, Google Gemini, and DeepSeek
- Anti-Detection: Built-in stealth features through Patchright integration
- MCP Integration: Extensible through Model Context Protocol for custom tool integration
- Local and Cloud Execution: Run browsers locally or through Hyperbrowser cloud service
The architecture follows a layered design pattern that separates concerns across five distinct layers:
+------------------------------------------------------------------+
|                            API LAYER                              |
|          page.ai()  |  page.extract()  |  executeTask()           |
+------------------------------------------------------------------+
                                 |
                                 v
+------------------------------------------------------------------+
|                          AGENT RUNTIME                            |
|         Task Loop  |  State Machine  |  Message Builder           |
+------------------------------------------------------------------+
                                 |
                                 v
+------------------------------------------------------------------+
|                        INTELLIGENCE LAYER                         |
|           OpenAI  |  Anthropic  |  Gemini  |  DeepSeek            |
+------------------------------------------------------------------+
                                 |
                                 v
+------------------------------------------------------------------+
|                        CONTEXT PROVIDERS                          |
|          A11y Tree  |  DOM Snapshot  |  Visual Overlay            |
+------------------------------------------------------------------+
                                 |
                                 v
+------------------------------------------------------------------+
|                         BROWSER CONTROL                           |
|          Playwright  |  CDP Protocol  |  Patchright               |
+------------------------------------------------------------------+
The system operates through bidirectional communication between core components:
- User/CLI sends task inputs to the CtrlAgent Core
- CtrlAgent Core orchestrates between LLM Providers and Browser Layer
- Browser Layer captures page state through DOM Context providers
- Actions Engine translates LLM decisions into browser commands
- DOM Context provides structured representations of page state
The CtrlAgent class is the main orchestrator: it manages the browser lifecycle, task execution, and agent coordination.
class CtrlAgent<T extends BrowserProviders = "Local"> {
// Core execution methods
async executeTask(task: string, params?: TaskParams, initPage?: Page): Promise<TaskOutput>
async executeTaskAsync(task: string, params?: TaskParams, initPage?: Page): Promise<Task>
// Page management
async getCurrentPage(): Promise<Page>
async newPage(): Promise<HyperPage>
async getPages(): Promise<HyperPage[]>
// Browser lifecycle
async initBrowser(): Promise<Browser>
async closeAgent(): Promise<void>
}

HyperPage extends the Playwright Page with AI-powered methods for autonomous interaction:
interface HyperPage extends Page {
// Execute task with AI agent
ai(task: string, params?: TaskParams): Promise<TaskOutput>
// Async task execution (non-blocking)
aiAsync(task: string, params?: TaskParams): Promise<Task>
// Structured data extraction
extract<T>(task?: string, outputSchema?: z.AnyZodObject, params?: TaskParams): Promise<T | string>
}

The agent operates through a continuous loop that observes, reasons, and acts until task completion:
1. CAPTURE DOM STATE
|-- Build accessibility tree
|-- Generate element mapping (encoded IDs)
|-- Capture screenshot (optional)
|-- Apply visual overlays (optional)
2. BUILD AGENT PROMPT
|-- System prompt with capabilities
|-- Task description
|-- Previous steps context
|-- Current DOM representation
|-- Screenshot (if visual mode enabled)
3. LLM INVOCATION
|-- Send structured prompt to LLM
|-- Receive action list with Zod schema validation
|-- Parse response into typed actions
4. ACTION EXECUTION
|-- For each action in response:
| |-- Resolve target element (if applicable)
| |-- Execute action via CDP or Playwright
| |-- Wait for DOM to settle
|-- Check for completion signal
5. LOOP OR COMPLETE
|-- If complete action received: return output
|-- If max steps reached: return current state
|-- Otherwise: return to step 1
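A simplified sketch of this loop (illustrative only; the helper and type names below are assumptions made for explanation, not the SDK's internal code):

```typescript
// Illustrative sketch of the observe-reason-act loop; not the SDK's internals.
interface AgentAction {
  type: string;            // e.g. 'goToURL', 'actElement', 'complete'
  output?: string;         // present on 'complete'
  [key: string]: unknown;  // action-specific parameters
}

interface LoopDeps {
  captureDomState(): Promise<unknown>;
  buildPrompt(task: string, history: AgentAction[][], dom: unknown): unknown;
  invokeLLM(prompt: unknown): Promise<AgentAction[]>;
  executeAction(action: AgentAction): Promise<void>;
  waitForDomToSettle(): Promise<void>;
}

async function runAgentLoop(task: string, deps: LoopDeps, maxSteps = 10): Promise<string> {
  const history: AgentAction[][] = [];
  for (let step = 0; step < maxSteps; step++) {
    const dom = await deps.captureDomState();             // 1. capture DOM state
    const prompt = deps.buildPrompt(task, history, dom);  // 2. build agent prompt
    const actions = await deps.invokeLLM(prompt);         // 3. schema-validated action list
    for (const action of actions) {                       // 4. execute each action
      if (action.type === 'complete') return action.output ?? '';
      await deps.executeAction(action);
      await deps.waitForDomToSettle();
    }
    history.push(actions);                                // 5. otherwise loop again
  }
  return 'Maximum steps reached without a complete action';
}
```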
The agent maintains state through the execution cycle:
- DOM Snapshot Cache: Cached for approximately 1 second when useDomCache is enabled
- Element Mapping: Encoded IDs in the format frameIndex-backendNodeId
- Action History: Previous steps and their outcomes for context
- Variables: User-defined key-value pairs for dynamic inputs (see the sketch below)
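A minimal sketch of passing variables and caching options through TaskParams (the { key, value } shape for HyperVariable is an assumption here; check the exported type before relying on it):

```typescript
// Sketch: supplying dynamic inputs and enabling DOM caching for a task.
// The { key, value } shape for HyperVariable is an assumption.
const result = await agent.executeTask(
  'Log in with the provided credentials and open the dashboard',
  {
    maxSteps: 15,
    useDomCache: true,
    variables: [
      { key: 'email', value: 'user@example.com' },
      { key: 'password', value: process.env.DEMO_PASSWORD ?? '' },
    ],
  }
);
console.log(result.output);
```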
Install the package:

# Using pnpm (recommended)
pnpm add browser-control-v2
# Using yarn
yarn add browser-control-v2
# Using npm
npm install browser-control-v2

Requirements:
- Node.js 18.0 or higher
- Chromium-based browser (installed automatically via Playwright)
import { CtrlAgent } from 'browser-control-v2';
const agent = new CtrlAgent({
llmConfig: {
provider: 'openai',
model: 'gpt-4o',
apiKey: process.env.OPENAI_API_KEY
}
});
// Initialize browser
await agent.initBrowser();
// Get AI-enhanced page
const page = await agent.newPage();
// Navigate and execute task
await page.goto('https://example.com');
const result = await page.ai('Click the login button and sign in with email test@example.com');
console.log(result.output);
await agent.closeAgent();

Structured data extraction with a Zod schema:

import { z } from 'zod';
const ProductSchema = z.object({
name: z.string(),
price: z.number(),
inStock: z.boolean()
});
const page = await agent.newPage();
await page.goto('https://shop.example.com/product/123');
const product = await page.extract(
'Extract the product information from this page',
ProductSchema
);
console.log(product); // { name: "Widget", price: 29.99, inStock: true }

CLI usage:

# Run with natural language command
yarn cli -c "Go to google.com and search for AI browser automation"
# Enable debug mode
yarn cli -c "Navigate to github.com" --debug
# Use cloud browser provider
yarn cli -c "Check my inbox on gmail.com" --hyperbrowserinterface CtrlAgentConfig {
// LLM Configuration
llmConfig: {
provider: 'openai' | 'anthropic' | 'gemini' | 'deepseek';
model: string;
apiKey: string;
};
// Browser Provider
browserProvider?: 'Local' | 'Hyperbrowser';
// Local browser settings
localConfig?: {
headless?: boolean;
proxy?: ProxySettings;
};
// Cloud browser settings
hyperbrowserConfig?: {
apiKey: string;
session?: SessionOptions;
};
// Agent behavior
debugOptions?: DebugOptions;
customActions?: CustomAction[];
}

Per-task behavior is controlled through TaskParams:

interface TaskParams {
maxSteps?: number; // Maximum agent loop iterations (default: 10)
timeout?: number; // Task timeout in milliseconds
variables?: HyperVariable[]; // Dynamic input values
outputSchema?: z.ZodSchema; // Expected output structure
enableVisualMode?: boolean; // Include screenshots in context
useDomCache?: boolean; // Enable DOM snapshot caching
}

CtrlAgent supports multiple LLM providers through a unified adapter interface:
| Provider | Models | Features |
|---|---|---|
| OpenAI | gpt-4o, gpt-4-turbo, gpt-3.5-turbo | Vision, function calling, JSON mode |
| Anthropic | claude-3-5-sonnet, claude-3-opus | Vision, extended context |
| Google Gemini | gemini-2.0-flash, gemini-pro | Vision, PDF processing |
| DeepSeek | deepseek-chat, deepseek-coder | Cost-effective reasoning |
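The llmConfig shape is identical across providers. The examples below cover OpenAI, Anthropic, and Gemini; a DeepSeek setup follows the same pattern (a sketch using the deepseek-chat model from the table and the DEEPSEEK_API_KEY variable listed under Configuration):

```typescript
// DeepSeek (sketch; same llmConfig shape as the other providers)
const agent = new CtrlAgent({
  llmConfig: {
    provider: 'deepseek',
    model: 'deepseek-chat',
    apiKey: process.env.DEEPSEEK_API_KEY
  }
});
```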
// OpenAI
const agent = new CtrlAgent({
llmConfig: {
provider: 'openai',
model: 'gpt-4o',
apiKey: process.env.OPENAI_API_KEY
}
});
// Anthropic
const agent = new CtrlAgent({
llmConfig: {
provider: 'anthropic',
model: 'claude-3-5-sonnet-20241022',
apiKey: process.env.ANTHROPIC_API_KEY
}
});
// Google Gemini
const agent = new CtrlAgent({
llmConfig: {
provider: 'gemini',
model: 'gemini-2.0-flash',
apiKey: process.env.GEMINI_API_KEY
}
});

The Local provider uses Patchright, a Playwright fork with anti-detection features:
const agent = new CtrlAgent({
browserProvider: 'Local',
localConfig: {
headless: false, // Visible browser window
channel: 'chrome' // Use system Chrome
}
});

The Hyperbrowser provider connects to managed cloud browser instances:
const agent = new CtrlAgent({
browserProvider: 'Hyperbrowser',
hyperbrowserConfig: {
apiKey: process.env.HYPERBROWSER_API_KEY,
session: {
persistent: true,
profile: 'default'
}
}
});

The agent can execute the following atomic actions:
| Action | Description | Parameters |
|---|---|---|
| goToURL | Navigate to a URL | url: string |
| actElement | Unified element interaction | index: number, action: ElementAction |
| extract | Extract data from page | objective: string |
| scroll | Scroll the page | direction: 'up' \| 'down' |
| wait | Wait for condition | duration: number |
| refreshPage | Reload current page | - |
| complete | End task execution | output?: string |
The actElement action supports these interactions (see the example after this list):
- click: Left-click on element
- type: Input text into field
- select: Choose dropdown option
- hover: Move cursor over element
- clear: Clear input field content
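These interactions are selected by the LLM rather than called directly; a single natural-language instruction typically maps onto several of them, for example:

```typescript
// The agent would typically resolve this instruction into actElement 'type'
// and 'click' actions on the indexed elements found in the DOM context.
await page.ai('Type "test@example.com" into the Email field, then click Sign In');
```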
Extend the agent with domain-specific actions:
const customActions = [{
name: 'screenshot',
description: 'Capture a screenshot of the current page',
schema: z.object({
filename: z.string().describe('Output filename')
}),
run: async (ctx, params) => {
await ctx.page.screenshot({ path: params.filename });
return { success: true, message: `Saved to ${params.filename}` };
}
}];
const agent = new CtrlAgent({
customActions,
// ... other config
});

The agent understands page state through multiple context providers:
The A11y Tree provider leverages the browser accessibility tree for semantic understanding:
[1] button "Sign In" (clickable)
[2] textbox "Email" (editable)
[3] textbox "Password" (editable, password)
[4] link "Forgot Password" (href="/reset")
[5] checkbox "Remember Me" (unchecked)
When visual mode is enabled, numbered overlays are composited onto screenshots:
- Elements are assigned sequential indices
- Bounding boxes drawn on captured screenshot
- Provides visual context for vision-enabled models (see the example below)
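Visual mode is off by default and is enabled per task through the enableVisualMode flag from TaskParams (a minimal sketch; the instruction text is only an example):

```typescript
// Include a screenshot with numbered overlays in the LLM context for this task.
const result = await page.ai('Click the third thumbnail in the image carousel', {
  enableVisualMode: true
});
```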
Elements are identified using encoded IDs:
Format: {frameIndex}-{backendNodeId}
Example: 0-157 (frame 0, node ID 157)
This encoding supports:
- Cross-frame element resolution
- Stable references across DOM updates
- CDP-based element interaction
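As an illustration of the encoding, a hypothetical helper (not part of the SDK's public API) that splits an encoded ID into its parts:

```typescript
// Hypothetical helper for illustration: split an encoded element ID such as
// "0-157" into its frame index and CDP backendNodeId.
function parseEncodedId(encodedId: string): { frameIndex: number; backendNodeId: number } {
  const [frameIndex, backendNodeId] = encodedId.split('-').map(Number);
  return { frameIndex, backendNodeId };
}

parseEncodedId('0-157'); // { frameIndex: 0, backendNodeId: 157 }
```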
Connect external tools through Model Context Protocol:
await agent.initializeMCPClient({
servers: [{
id: 'filesystem',
command: 'npx',
args: ['-y', '@modelcontextprotocol/server-filesystem', '/tmp']
}, {
id: 'database',
command: 'npx',
args: ['-y', '@modelcontextprotocol/server-postgres']
}]
});
// MCP tools are now available as agent actions
const result = await page.ai('Read the contents of config.json and update the settings');

Environment variables:

# LLM API Keys
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...
GEMINI_API_KEY=...
DEEPSEEK_API_KEY=...
# Browser Provider
HYPERBROWSER_API_KEY=...
# Optional Settings
BROWSER_HEADLESS=false
DEBUG=true

Agent behavior options can also be set in code:

const agent = new CtrlAgent({
llmConfig: { /* ... */ },
// Enable CDP-based actions (faster, more reliable)
cdpActions: true,
// DOM caching for performance
useDomCache: true,
// Visual mode for complex UIs
enableVisualMode: false,
// Debug output
debug: process.env.DEBUG === 'true',
debugOptions: {
saveScreenshots: true,
saveMessages: true,
outputDir: './debug'
}
});

Enable comprehensive debugging for development and troubleshooting:
const agent = new CtrlAgent({
debug: true,
debugOptions: {
saveScreenshots: true,
saveMessages: true,
traceCDP: true
}
});

Debug artifacts are written to the following directory structure:

debug/
  {taskId}/
    step-0/
      elems.txt          # DOM text representation
      screenshot.png     # Composite screenshot with overlays
      msgs.json          # LLM messages sent
      stepOutput.json    # Action execution results
    step-1/
      ...
    taskOutput.json      # Final task output
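For example, a small sketch (plain Node.js, not part of the SDK) that prints a finished task's final output and raw per-step results from these files:

```typescript
import { readFileSync, readdirSync } from 'node:fs';
import { join } from 'node:path';

// Sketch: inspect the debug artifacts of one task run.
const taskDir = join('debug', '<taskId>'); // replace with an actual task directory
console.log('Final output:', readFileSync(join(taskDir, 'taskOutput.json'), 'utf8'));

for (const step of readdirSync(taskDir).filter((d) => d.startsWith('step-'))) {
  console.log(`${step}:`, readFileSync(join(taskDir, step, 'stepOutput.json'), 'utf8'));
}
```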
Project structure:

src/
  agent/                 # Core agent implementation
    actions/             # Action definitions and handlers
    examine-dom/         # Element ranking for page.aiAction
    messages/            # Prompt construction
    mcp/                 # MCP client integration
    shared/              # DOM utilities and element finding
    tools/               # Agent runtime loop
    index.ts             # CtrlAgent class
  browser-providers/     # Browser abstraction layer
    local.ts             # Local Playwright provider
    hyperbrowser.ts      # Cloud provider integration
  cdp/                   # Chrome DevTools Protocol
    client.ts            # CDP session management
    frame-graph.ts       # Frame hierarchy tracking
    dispatch.ts          # Action dispatch via CDP
  context-providers/     # Page state extraction
    a11y-dom/            # Accessibility tree provider
    shared/              # Screenshot and overlay utilities
  llm/                   # LLM adapters
    providers/           # OpenAI, Anthropic, Gemini, DeepSeek
    types.ts             # Message and response types
  types/                 # TypeScript definitions
  utils/                 # Shared utilities
  index.ts               # Package entry point
Contributions are welcome. Please follow these guidelines:
- Code Style: Run yarn lint and yarn format before committing
- Testing: Add tests for new features, run yarn test to validate
- Documentation: Update this README for API changes
- Commits: Use imperative mood in commit messages
# Build the project
yarn build
# Run linting
yarn lint
# Format code
yarn format
# Run tests
yarn test
# Run CLI in development
yarn cli -c "your task here"
# Run example scripts
yarn example examples/basic-navigation.ts

MIT License - see LICENSE for details.
This project builds upon foundational work from:
- Playwright - Browser automation framework
- Patchright - Anti-detection Playwright fork
- Stagehand - Accessibility tree approaches
- Model Context Protocol - Tool integration standard


