AI Desktop Agent — Universal Smart Pipeline
Works with any AI provider · Runs free with local models · Self-healing doctor
Universal Pipeline, Multi-App Workflows, Provider-Agnostic.
- 🧠 LLM-based task pre-processor — one cheap text LLM call decomposes any command into structured intent. No more brittle regex parsing.
- 📋 Multi-app workflows — copy from Wikipedia, paste in Notepad? Works. 6-checkpoint tracking ensures every step completes (select → copy → switch app → click → paste → verify).
- ⌨️ Site-specific shortcuts — Reddit (j/k/a/c), Twitter/X, YouTube, Gmail, GitHub, Slack + generic hints. Vision LLM uses keyboard instead of slow mouse clicks.
- 🌐 OS-level browser detection — reads Windows registry or macOS LaunchServices for actual default browser. No hardcoded Edge/Safari.
- 🔄 3 smart verification retries — on failure, builds step log digest + checkpoint status so the vision LLM fixes the exact missed step.
- 🔌 Mixed-provider pipelines — kimi for text + anthropic for Computer Use, with per-layer API key resolution from OpenClaw auth-profiles.
- 🔧 Global install fix — config discovery now checks package dir first, then cwd.
- 🏗️ Provider-agnostic internals — no hardcoded model names, no hardcoded app lists, universal checkpoint detection.
Keyboard Shortcuts, Pipeline Fixes, Better URL Handling.
- ⌨️ Keyboard shortcuts registry — common actions (scroll, copy, reddit upvote) execute as direct keystrokes. Zero LLM calls, instant.
- 🔧 Pipeline gate fix — Action Router now always runs, even for browser-context tasks. Shortcuts work everywhere.
- 🌐 Smarter URL extraction — "open gmail and send email to foo@bar.com" correctly navigates to Gmail instead of bar.com.
- 🔄 CDP→UIDriver fallback — Smart Interaction falls back to accessibility tree when browser CDP fails.
- 🛑 Reliable force-stop — `clawdcursor stop` kills lingering processes.
- 📊 Provider label inference — startup logs show text/vision providers clearly.
Universal Provider Support, OpenClaw Integration, Security Hardening.
- 🔗 OpenClaw integration — auto-discovers all configured providers from OpenClaw's config. No separate API key needed when running as a skill.
- 🌐 Universal provider support — Anthropic, OpenAI, Groq, Together AI, DeepSeek, Kimi, Ollama, or any OpenAI-compatible endpoint. Provider auto-detected from API key format.
- 🧠 Mixed provider pipelines — use Ollama for text (free) + cloud for vision (best quality). Doctor picks the optimal split automatically.
- 🔒 Security hardened — sensitive app policy (agents must ask before email/banking/messaging), safety tiers enforced, no credentials stored in skill files.
- 🔧 Auto-detection as default — no hardcoded models or providers. Doctor dynamically picks the best available setup.
- 🧠 Fluid task decomposition — LLM reasons about what ANY app needs instead of matching hardcoded patterns.
- 🩺 Interactive doctor — scans all providers, detects GPU/VRAM, lets you pick TEXT and VISION LLMs.
- 🖥️ Smart vision fallback — remaining subtasks bundled and handed to vision when cheap layers fail midway.
- 🖥️ Web Dashboard — real-time logs, approve/reject safety confirmations, kill switch. Dark theme, zero dependencies.
- 🪟 Browser foreground focus — Playwright activates Chrome at OS level. No more invisible background tabs.
- Multi-provider — 7+ providers supported out of the box
- 95% cheaper — simple tasks run for $0 with local models
- Self-healing — if a model fails, the pipeline adapts automatically
| Task | v0.4 (single provider) | v0.5+ (local, $0) | v0.5+ (cloud) |
|---|---|---|---|
| Calculator (255*38=) | 43s | 2.6s | 20.1s |
| Notepad (type hello) | 73s | 2.0s | 54.2s |
| File Explorer | 53s | 1.9s | 22.1s |
| Gmail compose | 162s (18 LLM calls) | — | 21.7s (1 LLM call) |
Clawd Cursor ships as an OpenClaw skill. Install it and any OpenClaw agent — yours or community-built — can control your desktop through natural language.
The SKILL.md teaches agents when and how to use Clawd Cursor: REST API for full desktop control, CDP direct for fast browser reads. Agents learn to be independent — no more asking you to screenshot or copy-paste things they can do themselves.
For orchestration best practices (how to avoid overlap and keep OpenClaw + Clawd Cursor efficient), see docs/OPENCLAW-INTEGRATION-RECOMMENDATIONS.md.
# Install as OpenClaw skill
openclaw skills install clawd-cursor

git clone https://github.com/AmrDab/clawd-cursor.git
cd clawd-cursor
npm install
npm run setup # builds + registers 'clawdcursor' command globally
# Just install and start — auto-configures from OpenClaw or env vars
clawdcursor start
# Or specify any provider
clawdcursor start --base-url https://api.example.com/v1 --api-key KEY
# Fine-tune setup interactively (optional)
clawdcursor doctor

# macOS
git clone https://github.com/AmrDab/clawd-cursor.git
cd clawd-cursor && npm install && npm run setup
# Grant Accessibility permissions to your terminal first!
# System Settings → Privacy & Security → Accessibility → Add Terminal/iTerm
# Make macOS scripts executable
chmod +x scripts/mac/*.sh scripts/mac/*.jxa
# Just start — auto-detects available providers
clawdcursor start
# Or specify any provider
clawdcursor start --base-url https://api.example.com/v1 --api-key KEY

git clone https://github.com/AmrDab/clawd-cursor.git
cd clawd-cursor && npm install && npm run setup
# Linux: browser control via CDP only (no native desktop automation)
# Just start — auto-detects available providers
clawdcursor start
# Or specify any provider
clawdcursor start --base-url https://api.example.com/v1 --api-key KEY

📖 See docs/MACOS-SETUP.md for the full macOS onboarding guide.
First run auto-configuration will:
- Scan for AI providers from OpenClaw config, environment variables, and CLI flags
- Quick-test discovered providers (5s timeout per provider)
- Build the optimal pipeline automatically
- Save config and start immediately
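The discovery step in the list above can be sketched roughly as follows. This is a minimal illustration, not the project's code: `discoverProviders` and `KEY_VARS` are hypothetical names. `OPENAI_API_KEY`, `GROQ_API_KEY` and `TOGETHER_API_KEY` are taken from the doctor output in this README, and `AI_API_KEY` is the generic key from the quick start. Variable names for the other providers are not documented here, so they are left out.

```typescript
// Hypothetical sketch of provider discovery from environment variables.
type Env = Record<string, string | undefined>;

interface Discovered {
  provider: string;
  envVar: string;
}

// Only variable names that appear in this README are listed.
const KEY_VARS: ReadonlyArray<Discovered> = [
  { provider: "openai", envVar: "OPENAI_API_KEY" },
  { provider: "groq", envVar: "GROQ_API_KEY" },
  { provider: "together", envVar: "TOGETHER_API_KEY" },
  { provider: "generic", envVar: "AI_API_KEY" },
];

// Return every provider whose key variable is set to a non-blank value.
function discoverProviders(env: Env): Discovered[] {
  return KEY_VARS.filter(({ envVar }) => (env[envVar] ?? "").trim() !== "");
}

// Example with an explicit environment object (blank values are ignored).
const discovered = discoverProviders({ GROQ_API_KEY: "gsk_example", OPENAI_API_KEY: "  " });
console.log(discovered.map((d) => d.provider));
```

Passing the environment in as a parameter keeps the function easy to test. A real run would pass `process.env` and then quick-test each discovered provider.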
The optional doctor command provides interactive configuration:
- Tests your screen capture and accessibility bridge
- Scans all AI providers (Anthropic, OpenAI, Groq, Together, DeepSeek, Kimi, Ollama) and detects GPU/VRAM
- Tests each model and shows you what works with latency
- Lets you pick your TEXT LLM and VISION LLM (or accept the recommended defaults)
- Shows setup instructions for any unconfigured cloud providers
- Builds your optimal pipeline and saves it
Send a task:
clawdcursor task "Open Notepad and type hello world"
# Or via API:
curl http://localhost:3847/task -H "Content-Type: application/json" \
  -d '{"task": "Open Notepad and type hello world"}'

Note: `npm run setup` runs `npm run build && npm link`, which registers `clawdcursor` as a global command. If you prefer not to link globally, run `npm run build` instead and use `npx clawdcursor` or `node dist/index.js` to run commands.
Free (no API key needed):
# Just need Ollama running with any model
ollama pull <model> # e.g. qwen2.5:7b, llama3.2, gemma2
clawdcursor doctor
clawdcursor start

Any cloud provider:
echo "AI_API_KEY=your-key-here" > .env
clawdcursor doctor
clawdcursor start

Doctor auto-detects your provider from the key format. Supported out of the box:
| Provider | Key prefix | Vision | Computer Use |
|---|---|---|---|
| Anthropic | `sk-ant-` | ✅ | ✅ |
| OpenAI | `sk-` | ✅ | ❌ |
| Groq | `gsk_` | ✅ | ❌ |
| Together AI | — | ✅ | ❌ |
| DeepSeek | — | ✅ | ❌ |
| Kimi/Moonshot | `sk-` (long) | ❌ | ❌ |
| Any OpenAI-compatible | — | varies | ❌ |
For providers without key prefix detection, specify explicitly:
clawdcursor doctor --provider together --api-key YOUR_KEY

OpenClaw users: No setup needed. Clawd Cursor auto-discovers all your configured providers.
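Prefix-based detection of the kind listed in the provider table might look like the sketch below. The function name and return labels are hypothetical, and the real doctor may use additional signals. In particular, the table only says Kimi keys are "long" `sk-` keys, so this sketch does not try to separate them from OpenAI keys.

```typescript
// Hypothetical sketch: infer a provider family from an API key prefix.
type ProviderGuess = "anthropic" | "groq" | "sk-family" | "unknown";

function detectProvider(apiKey: string): ProviderGuess {
  const key = apiKey.trim();
  // Check the most specific prefix first: "sk-ant-" also starts with "sk-".
  if (key.indexOf("sk-ant-") === 0) return "anthropic";
  if (key.indexOf("gsk_") === 0) return "groq";
  // OpenAI and Kimi/Moonshot both use "sk-"; left undistinguished here.
  if (key.indexOf("sk-") === 0) return "sk-family";
  // Together AI, DeepSeek and others have no documented prefix.
  return "unknown";
}

console.log(detectProvider("sk-ant-example"));
console.log(detectProvider("gsk_example"));
```

An `"unknown"` result corresponds to the case where the provider has to be named explicitly with `--provider`.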
| OS | Status | Notes |
|---|---|---|
| Windows 10/11 | ✅ Full support | Native desktop automation via PowerShell + UI Automation scripts. |
| macOS 13+ | ✅ Full support | Native desktop automation via JXA/System Events scripts. |
| Linux | 🔶 Partial support | Browser/CDP flows work. Native desktop automation requires X11 native libs (for @nut-tree-fork/nut-js) and may still vary by distro/desktop environment. |
Linux prerequisites for native automation (Debian/Ubuntu example):
sudo apt-get update
sudo apt-get install -y libxtst6 libx11-xcb1 libxcomposite1 libxdamage1 libxfixes3 libxi6 libxrandr2 libxtst-dev

If these libraries are missing, `clawdcursor doctor` can fail on startup with errors like `libXtst.so.6: cannot open shared object file`.
Every task is pre-processed by a cheap text LLM, then flows through up to 5 layers. Each layer is cheaper and faster than the next. Most tasks never reach Layer 3.
┌─────────────────────────────────────────────────────┐
│ Pre-processor: LLM Task Decomposition (1 text call) │
│ Parses any natural language → {app, navigate, task, │
│ contextHints}. Opens app + navigates URL before │
│ pipeline starts. Detects multi-app workflows. │
├─────────────────────────────────────────────────────┤
│ Layer 0: Browser (Playwright — free, instant) │
│ Direct browser control via CDP. page.goto(), │
│ brings Chrome to foreground. Zero vision tokens. │
├─────────────────────────────────────────────────────┤
│ Layer 1: Action Router + Shortcuts (instant, free) │
│ Regex + UI Automation. "Open X", "type Y", "click Z"│
│ Includes keyboard shortcuts registry — common │
│ actions like scroll, copy, undo, reddit upvote │
│ execute as direct keystrokes. Zero LLM calls. │
├─────────────────────────────────────────────────────┤
│ Layer 1.5: Smart Interaction (1 LLM call) │
│ CDPDriver (browser) or UIDriver (desktop apps). │
│ LLM plans steps → executes via selectors/a11y. │
├─────────────────────────────────────────────────────┤
│ Layer 2: Accessibility Reasoner (fast, cheap/free) │
│ Reads the accessibility tree, sends to cheap LLM │
│ (Haiku, Qwen, GPT-4o-mini). No screenshots needed │
├─────────────────────────────────────────────────────┤
│ Layer 3: Computer Use / Vision (powerful, expensive) │
│ Full screenshot → vision LLM with site-specific │
│ shortcuts + scroll guidance + multi-app workflows. │
│ 3 smart verification retries with step log analysis. │
└─────────────────────────────────────────────────────┘
The doctor decides which layers are available based on your setup. No API key? Layers 0-2 with Ollama. Anthropic key? All layers with Computer Use.
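The fall-through behavior in the diagram can be illustrated with a small sketch. The `Layer` interface, `runPipeline` and the two demo layers are hypothetical and synchronous for brevity. The real layers are presumably asynchronous and their interfaces are not shown in this README.

```typescript
// Hypothetical sketch: try layers from cheapest to most expensive.
type LayerResult = { handled: true; detail: string } | { handled: false };

interface Layer {
  name: string;
  tryTask(task: string): LayerResult;
}

// Return the first layer that handles the task, or null if none does.
function runPipeline(task: string, layers: Layer[]): { layer: string; detail: string } | null {
  for (const layer of layers) {
    const result = layer.tryTask(task);
    if (result.handled) {
      return { layer: layer.name, detail: result.detail };
    }
    // Not handled: fall through to the next, more expensive layer.
  }
  return null;
}

const demoLayers: Layer[] = [
  {
    name: "Layer 1: Action Router",
    // Stand-in for the regex-based router: only recognizes "open X".
    tryTask: (t) =>
      /^open /i.test(t) ? { handled: true, detail: "launched via router" } : { handled: false },
  },
  {
    name: "Layer 3: Vision",
    // Stand-in for the screenshot-based fallback: accepts anything.
    tryTask: () => ({ handled: true, detail: "solved from screenshot" }),
  },
];

console.log(runPipeline("open Notepad", demoLayers));
console.log(runPipeline("book a flight", demoLayers));
```

Ordering the array by cost is what keeps simple tasks away from the vision layer: the loop stops at the first layer that reports success.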
Clawd Cursor ships with a keyboard shortcuts registry. Common actions execute as direct keystrokes — no LLM calls, no screenshots, instant.
| Category | Examples |
|---|---|
| Navigation | scroll up/down, page up/down, go back/forward |
| Editing | copy, paste, undo, redo, select all |
| Browser | new tab, close tab, refresh, find |
| Social | reddit upvote/downvote, next/prev post |
| System | minimize, maximize, switch window |
Custom shortcuts can be added to src/shortcuts.ts. The action router uses fuzzy matching — "scroll the page down" maps to the scroll-down shortcut automatically.
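One simple way to get the fuzzy behavior described above is keyword-overlap matching, sketched below. This is an assumption about the approach, not the contents of `src/shortcuts.ts`: the `Shortcut` shape, the registry entries and the key names are all illustrative.

```typescript
// Hypothetical sketch of a shortcuts registry with keyword matching.
interface Shortcut {
  id: string;
  keywords: string[]; // every keyword must appear in the command
  keys: string;       // keystroke to send (illustrative values)
}

const SHORTCUTS: Shortcut[] = [
  { id: "scroll-down", keywords: ["scroll", "down"], keys: "PageDown" },
  { id: "scroll-up", keywords: ["scroll", "up"], keys: "PageUp" },
  { id: "copy", keywords: ["copy"], keys: "Ctrl+C" },
  { id: "select-all", keywords: ["select", "all"], keys: "Ctrl+A" },
];

// Lowercase the command and split it into alphanumeric words.
function tokenize(command: string): string[] {
  return command.toLowerCase().split(/[^a-z0-9]+/).filter((w) => w.length > 0);
}

// Return the shortcut whose keywords all occur in the command, preferring
// the one with the most keywords. Returns null when nothing matches.
function matchShortcut(command: string): Shortcut | null {
  const words = tokenize(command);
  let best: Shortcut | null = null;
  for (const s of SHORTCUTS) {
    const hit = s.keywords.every((k) => words.indexOf(k) !== -1);
    if (hit && (best === null || s.keywords.length > best.keywords.length)) {
      best = s;
    }
  }
  return best;
}

console.log(matchShortcut("scroll the page down"));
```

A `null` result would mean the command is not a known shortcut and has to go to a later layer.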
| Provider | Layer 1 | Layer 2 (text) | Layer 3 (vision) | Computer Use |
|---|---|---|---|---|
| Anthropic | ✅ | Haiku | Sonnet | ✅ Native |
| OpenAI | ✅ | GPT-4o-mini | GPT-4o | ❌ |
| Groq | ✅ | Llama 3.3 70B | Llama 3.2 90B Vision | ❌ |
| Together AI | ✅ | Llama 3.1 70B | Llama 3.2 90B Vision | ❌ |
| DeepSeek | ✅ | DeepSeek Chat | DeepSeek Chat | ❌ |
| Kimi | ✅ | Moonshot-8k | Moonshot-8k | ❌ |
| Ollama | ✅ | Auto-detected | Auto-detected | ❌ |
| No key | ✅ | ❌ | ❌ | ❌ |
Mixed providers: Doctor can configure Ollama for text (free) + a cloud provider for vision (best quality). The pipeline picks the cheapest option for each layer automatically.
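The "cheapest option per layer" idea can be sketched as a filter plus a minimum. Everything here is hypothetical: the `ModelOption` shape and the relative cost numbers are made up for illustration, and the actual doctor also weighs latency and quality, which this sketch ignores. The model names are the ones shown in the sample doctor output.

```typescript
// Hypothetical sketch: pick the cheapest model that has a given capability.
type Capability = "text" | "vision";

interface ModelOption {
  provider: string;
  model: string;
  capability: Capability;
  costPerCall: number; // made-up relative cost; 0 for local models
}

function pickCheapest(options: ModelOption[], capability: Capability): ModelOption | null {
  const candidates = options.filter((o) => o.capability === capability);
  if (candidates.length === 0) return null;
  // Keep the earlier candidate on ties.
  return candidates.reduce((a, b) => (b.costPerCall < a.costPerCall ? b : a));
}

const available: ModelOption[] = [
  { provider: "Anthropic", model: "claude-haiku-4-5", capability: "text", costPerCall: 1 },
  { provider: "Ollama", model: "qwen2.5:7b", capability: "text", costPerCall: 0 },
  { provider: "Anthropic", model: "claude-sonnet-4", capability: "vision", costPerCall: 10 },
];

console.log(pickCheapest(available, "text"));
console.log(pickCheapest(available, "vision"));
```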
The pipeline adapts at runtime:
- Model fails? → Circuit breaker trips, falls to next layer
- API rate limited? → Exponential backoff + automatic retry
- Doctor detects issues? → Falls back to available alternatives (e.g., cloud model unavailable → local Ollama)
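The first two behaviors in the list above are standard resilience patterns. The sketch below shows a generic circuit breaker and an exponential backoff schedule. The threshold, base delay and cap are illustrative assumptions, since the README does not state the values Clawd Cursor uses.

```typescript
// Delay before retry number `attempt` (0-based): base * 2^attempt, capped.
function backoffDelayMs(attempt: number, baseMs = 500, maxMs = 8000): number {
  return Math.min(maxMs, baseMs * Math.pow(2, attempt));
}

// Minimal circuit breaker: opens after `threshold` consecutive failures.
class CircuitBreaker {
  private failures = 0;

  constructor(private readonly threshold: number) {}

  isOpen(): boolean {
    return this.failures >= this.threshold;
  }

  recordFailure(): void {
    this.failures += 1;
  }

  // Any success resets the failure count and closes the breaker.
  recordSuccess(): void {
    this.failures = 0;
  }
}

const breaker = new CircuitBreaker(3);
for (let attempt = 0; attempt < 3; attempt++) {
  console.log(`attempt ${attempt} failed, next delay ${backoffDelayMs(attempt)}ms`);
  breaker.recordFailure();
}
console.log(breaker.isOpen() ? "breaker open: fall through to next layer" : "keep retrying");
```

Backoff handles transient problems such as rate limits, while the breaker stops retrying a model that keeps failing so the task can move to the next layer.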
npm run doctor

🩺 Clawd Cursor Doctor - diagnosing your setup...
📸 Screen capture...
✅ 2560x1440, 110ms
♿ Accessibility bridge...
✅ 20 windows detected, 822ms
🔍 Scanning providers...
Anthropic: ✅ key found (sk-ant-a...)
OpenAI: ❌ no key
Groq: ❌ no key
Together AI: ❌ no key
DeepSeek: ❌ no key
Kimi (Moonshot): ❌ no key
Ollama (Local): ✅ running (qwen2.5:7b, llama3.2)
💡 Cloud providers not configured (add API keys to unlock):
OpenAI: set OPENAI_API_KEY — https://platform.openai.com
Groq: set GROQ_API_KEY — https://console.groq.com
Together AI: set TOGETHER_API_KEY — https://api.together.xyz
Testing models...
Text: claude-haiku-4-5 (Anthropic) ✅ 498ms
Vision: claude-sonnet-4 (Anthropic) ✅ 1217ms
Text: qwen2.5:7b (Ollama) ✅ 4117ms
🎮 GPU detected: NVIDIA GeForce RTX 3080 (10240 MB VRAM)
🧩 Choose your pipeline models (press Enter for recommended).
TEXT LLM (Layer 2):
1. claude-haiku-4-5 (Anthropic, 498ms)
2. qwen2.5:7b (Ollama, 4117ms) ★ recommended
Pick 1-2 (Enter=2):
VISION LLM (Layer 3):
1. claude-sonnet-4 (Anthropic, 1217ms) ★ recommended
Pick 1 (Enter=1):
🧠 Selected pipeline:
Layer 1: Action Router (offline) ✅
Layer 2: qwen2.5:7b via Ollama ✅
Layer 3: claude-sonnet-4 via Anthropic ✅
🖥️ Computer Use API: enabled
💾 Config saved to .clawd-config.json
Options:
--provider <name> Force a provider (anthropic|openai|ollama|kimi)
--api-key <key> Override API key
--no-save Don't save config to disk
http://localhost:3847
| Endpoint | Method | Description |
|---|---|---|
| `/` | GET | Web dashboard UI |
| `/task` | POST | Execute a task: `{"task": "Open Chrome"}` |
| `/status` | GET | Agent state and current task |
| `/logs` | GET | Last 200 log entries (JSON array) |
| `/confirm` | POST | Approve/reject pending action |
| `/abort` | POST | Stop the current task |
| `/stop` | POST | Graceful server shutdown |
| `/health` | GET | Server health + version |
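A client for these endpoints can be as small as a couple of request builders. The sketch below only constructs the requests, so it runs without a server. The helper names and the `ApiRequest` shape are hypothetical, and the response bodies are not documented in this README, so they are not modeled.

```typescript
// Hypothetical request builders for the REST API.
const BASE_URL = "http://localhost:3847"; // default port from this README

interface ApiRequest {
  url: string;
  method: "GET" | "POST";
  headers: Record<string, string>;
  body?: string;
}

// POST /task with the documented {"task": "..."} payload.
function taskRequest(task: string, baseUrl: string = BASE_URL): ApiRequest {
  return {
    url: baseUrl + "/task",
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ task }),
  };
}

// GET /status for the agent state and current task.
function statusRequest(baseUrl: string = BASE_URL): ApiRequest {
  return { url: baseUrl + "/status", method: "GET", headers: {} };
}

const req = taskRequest("Open Chrome");
console.log(req.method, req.url, req.body);
```

On Node.js 18+ the result can be passed to the built-in `fetch`, for example `fetch(req.url, { method: req.method, headers: req.headers, body: req.body })`.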
┌───────────────────────────────────────────────────┐
│ Your Desktop (Native Control) │
│ @nut-tree-fork/nut-js · Playwright · OS-level │
└──────────────────────┬────────────────────────────┘
│
┌──────────────────────┴────────────────────────────┐
│ Clawd Cursor Agent │
│ │
│ ┌────────┐ ┌────────┐ ┌───────┐ ┌─────┐ ┌─────┐│
│ │Layer 0 │ │Layer 1 │ │L 1.5 │ │ L2 │ │ L3 ││
│ │Browser │→│Action │→│Smart │→│A11y │→│Vision││
│ │Playwrt │ │Router+ │ │Interac│ │Tree │ │+CU ││
│ │(free) │ │Shortct │ │(1 LLM)│ │(cheap│ │(full)││
│ └────────┘ └────────┘ └───────┘ └─────┘ └─────┘│
│ ↑ │
│ ┌──────────┐ ┌────────────────┐ │
│ │ Doctor │ │ Web Dashboard │ │
│ │ Auto-cfg │ │ localhost:3847 │ │
│ └──────────┘ └────────────────┘ │
│ │
│ Safety Layer · REST API · Circuit Breaker │
└────────────────────────────────────────────────────┘
| Tier | Actions | Behavior |
|---|---|---|
| 🟢 Auto | Navigation, reading, opening apps | Runs immediately |
| 🟡 Preview | Typing, form filling | Logs before executing |
| 🔴 Confirm | Sending messages, deleting, purchases | Pauses for approval |
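The tier table maps naturally onto a lookup from action type to behavior. In the sketch below the action names are hypothetical labels for the categories in the table, and defaulting unknown actions to the strictest tier is an assumption of this example, not documented behavior.

```typescript
// Hypothetical sketch of safety-tier classification.
type Tier = "auto" | "preview" | "confirm";

function tierFor(action: string): Tier {
  switch (action) {
    case "navigate":
    case "read":
    case "open-app":
      return "auto";
    case "type":
    case "fill-form":
      return "preview";
    case "send-message":
    case "delete":
    case "purchase":
      return "confirm";
    default:
      // Unknown actions get the strictest tier (assumption of this sketch).
      return "confirm";
  }
}

function behaviorFor(tier: Tier): string {
  switch (tier) {
    case "auto":
      return "runs immediately";
    case "preview":
      return "logs before executing";
    case "confirm":
      return "pauses for approval";
  }
}

console.log(behaviorFor(tierFor("fill-form")));
console.log(behaviorFor(tierFor("purchase")));
```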
clawdcursor start Start the agent
clawdcursor doctor Diagnose and auto-configure
clawdcursor task <t> Send a task to running agent
clawdcursor dashboard Open the web dashboard in your browser
clawdcursor kill Stop the running server
clawdcursor stop Stop the running server
Options:
--port <port> API port (default: 3847)
--provider <provider> Auto-detected, or: anthropic|openai|ollama|groq|together|deepseek|kimi|...
--model <model> Override vision model
--api-key <key> AI provider API key
--debug Save screenshots to debug/ folder
| Platform | UI Automation | Browser (CDP) | Status |
|---|---|---|---|
| Windows | PowerShell + .NET UI Automation | ✅ Chrome/Edge | ✅ Full support |
| macOS | JXA + System Events (Accessibility API) | ✅ Chrome/Edge | ✅ Full support |
| Linux | — | ✅ Chrome/Edge (CDP only) | 🔶 Browser only |
- Windows: Uses `powershell.exe` + `.NET UIAutomationClient` for native app interaction. Shell chaining: `cd dir; npm start`
- macOS: Uses `osascript` + JXA (JavaScript for Automation) + System Events. Requires Accessibility permissions. Shell chaining: `cd dir && npm start`. See docs/MACOS-SETUP.md.
- Both: CDPDriver (browser automation) works identically and connects via WebSocket to `localhost:9222`.
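When a Chromium-based browser runs with remote debugging enabled on that port, its standard DevTools HTTP endpoint `/json/version` reports the WebSocket URL to connect to. The sketch below builds that endpoint URL and extracts `webSocketDebuggerUrl` from a response body. Whether CDPDriver discovers the socket this way is an assumption. The helper names are hypothetical and the sample body is abbreviated.

```typescript
// Hypothetical helpers for locating the CDP WebSocket endpoint.
const CDP_PORT = 9222;

function versionEndpoint(host: string = "localhost", port: number = CDP_PORT): string {
  return `http://${host}:${port}/json/version`;
}

// Extract the browser-level WebSocket URL from a /json/version response body.
// Returns null if the field is missing; throws if the body is not valid JSON.
function parseWebSocketUrl(body: string): string | null {
  const parsed: unknown = JSON.parse(body);
  if (typeof parsed === "object" && parsed !== null) {
    const url = (parsed as { webSocketDebuggerUrl?: unknown }).webSocketDebuggerUrl;
    if (typeof url === "string" && url.indexOf("ws://") === 0) {
      return url;
    }
  }
  return null;
}

// Abbreviated sample of what the endpoint returns.
const sampleBody = '{"Browser":"Chrome/120.0.0.0","webSocketDebuggerUrl":"ws://localhost:9222/devtools/browser/abc"}';
console.log(versionEndpoint());
console.log(parseWebSocketUrl(sampleBody));
```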
# Windows (PowerShell)
Start-Process chrome -ArgumentList "--remote-debugging-port=9222"
# macOS (Bash)
open -a "Google Chrome" --args --remote-debugging-port=9222
# Edge on macOS
open -a "Microsoft Edge" --args --remote-debugging-port=9222

- Node.js 18+ (20+ recommended)
- Windows: PowerShell (included with Windows)
- macOS 13+: osascript (included), Accessibility permissions granted
- AI API Key - optional. Works offline with Ollama or Action Router only.
TypeScript · Node.js · @nut-tree-fork/nut-js · sharp · Express · Any OpenAI-compatible API · Anthropic Computer Use · Windows UI Automation · macOS Accessibility (JXA) · Ollama
MIT