A framework to enable multimodal models to operate a computer using the same inputs and outputs as a human operator.
Using vision, speech, and text capabilities, the model views the screen and decides on a series of mouse and keyboard actions to reach an objective. Released in November 2023, the Self-Operating Computer Framework was one of the first projects to use a multimodal model to view the screen and operate a computer.
The Self-Operating Computer Framework enables AI models to autonomously interact with desktop environments by:
- Viewing and understanding screen content through computer vision
- Planning and executing mouse and keyboard actions to complete objectives
- Supporting multiple modalities including vision, speech, and text inputs
- Maintaining human-like interaction patterns for natural computer operation
Its guiding goals are to:

- Enable seamless AI-computer interaction across platforms (macOS, Windows, Linux)
- Support extensible multimodal model integrations
- Provide a robust foundation for computer automation research and applications
- Maintain security, privacy, and safety in AI-driven computer operations
Key features include:

- Cross-Platform Support: Native compatibility with macOS, Windows, and Linux
- Multiple AI Model Integration: GPT-4o, GPT-4.1, o1, Gemini Pro Vision, Claude 3, Qwen-VL, and LLaVA
- Advanced Computer Vision: YOLOv8-based object detection and EasyOCR text recognition
- Intelligent Click Targeting: OCR-based element mapping and Set-of-Mark (SoM) prompting
- Voice Input Support: Speech-to-text for natural language objectives
- Real-time Screen Analysis: Live screenshot processing and decision making
Supported input modalities:

- Vision: Screenshot analysis, object detection, text recognition
- Speech: Voice commands for objectives (with additional audio dependencies)
- Text: Direct command-line prompts and interactive input
- Hybrid: Combination of multiple modalities for enhanced accuracy
Supported output actions (a minimal automation sketch follows this list):

- Mouse Actions: Precise clicking, dragging, and scrolling
- Keyboard Input: Text typing, keyboard shortcuts, navigation
- System Integration: Cross-platform automation with native OS APIs
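To make this concrete, here is a minimal sketch of such an automation layer, using pyautogui purely for illustration (the framework's actual implementation lives in `operate/utils/operating_system.py` and may differ):

```python
import pyautogui

def click(x_pct: float, y_pct: float) -> None:
    """Click at a position given as a fraction of the screen size."""
    width, height = pyautogui.size()
    pyautogui.click(int(width * x_pct), int(height * y_pct))

def type_text(text: str) -> None:
    """Type a string with a human-like delay, then press Enter."""
    pyautogui.write(text, interval=0.05)
    pyautogui.press("enter")

def hotkey(*keys: str) -> None:
    """Press a keyboard shortcut, e.g. hotkey('command', 'space')."""
    pyautogui.hotkey(*keys)
```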
Demo video: `final-low.mp4`
Prerequisites:

- Python 3.7+ with the pip package manager
- Operating System: macOS, Windows, or Linux (with X server)
- API Keys: At least one of the following:
- OpenAI API key for GPT models
- Google AI Studio API key for Gemini
- Anthropic API key for Claude
- Qwen API key for Qwen-VL
- System Permissions: Screen recording and accessibility permissions (required on macOS)
To get started:

- Install via pip (recommended):

```bash
pip install self-operating-computer
```

- Or install from source:

```bash
git clone https://github.com/OthersideAI/self-operating-computer.git
cd self-operating-computer
pip install -e .
```

- Run the framework:

```bash
operate
```

- Configure API Key: On first run, you'll be prompted to enter your API key (see the API cost notes near the end of this document for where to obtain one).
- Grant System Permissions (macOS): The Terminal app will request "Screen Recording" and "Accessibility" permissions under System Preferences > Security & Privacy.
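If you prefer not to use the interactive prompt, keys can typically be supplied through the environment instead. The sketch below illustrates the idea; the exact variable names the framework reads are defined in `config.py`, so treat the ones here as common-convention assumptions:

```python
import os

# Hypothetical lookup mirroring what a config loader might do; check
# operate/config.py for the variable names the framework actually reads.
ENV_VARS = {
    "openai": "OPENAI_API_KEY",
    "google": "GOOGLE_API_KEY",
    "anthropic": "ANTHROPIC_API_KEY",
}

def get_api_key(provider: str) -> str:
    env_var = ENV_VARS[provider]
    key = os.environ.get(env_var)
    if not key:
        raise RuntimeError(f"Set {env_var} or enter the key at the first-run prompt.")
    return key
```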
For voice input capabilities, install additional dependencies:

```bash
# Install audio requirements
pip install -r requirements-audio.txt

# Install system dependencies
# macOS:
brew install portaudio
# Linux:
sudo apt install portaudio19-dev python3-pyaudio
```

The Self-Operating Computer Framework can automate a wide variety of desktop tasks:
- Navigate to websites and extract information
- Fill out forms and submit data
- Compare products across multiple sites
- Download files and organize content
- Create and edit documents in various applications
- Manage email and calendar appointments
- Organize files and folders
- Take screenshots and create presentations
- Automate software testing workflows
- Set up development environments
- Execute repetitive coding tasks
- Validate UI/UX across different applications
- Edit images and videos using desktop applications
- Create content across multiple platforms
- Manage digital asset libraries
- Automate social media posting workflows
Basic usage:

```bash
# Run with the default model (GPT-4 with OCR)
operate

# Run with voice input
operate --voice

# Run with a direct prompt
operate --prompt "Go to google.com"

# Run in verbose mode for debugging
operate --verbose
```

Model selection:

```bash
# Default: GPT-4 with OCR (recommended)
operate
operate -m gpt-4-with-ocr

# Latest GPT-4.1 model
operate -m gpt-4.1-with-ocr

# OpenAI's o1 model
operate -m o1-with-ocr

# GPT-4 with Set-of-Mark prompting
operate -m gpt-4-with-som
```

```bash
operate -m gemini-pro-vision
```

Setup: Requires a Google AI Studio API key and desktop application credentials.

```bash
operate -m claude-3
```

Setup: Requires an Anthropic API key.

```bash
operate -m qwen-vl
```

Setup: Requires a Qwen API key.
```bash
# Install Ollama from https://ollama.ai/download
ollama pull llava
ollama serve

# Run with LLaVA
operate -m llava
```

Note: LLaVA has high error rates and is experimental. Requires ~5GB of additional storage.
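Under the hood, a local LLaVA call amounts to posting a screenshot to Ollama's REST API. Here is a minimal sketch against the documented `/api/generate` endpoint; the framework's own client lives in `operate/models/apis.py` and may differ:

```python
import base64
import requests

def ask_llava(prompt: str, screenshot_path: str) -> str:
    """Send a prompt plus a screenshot to a locally running Ollama LLaVA model."""
    with open(screenshot_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")
    response = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": "llava",
            "prompt": prompt,
            "images": [image_b64],  # Ollama accepts base64-encoded images
            "stream": False,        # return one complete JSON object
        },
        timeout=120,
    )
    response.raise_for_status()
    return response.json()["response"]
```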
The default `gpt-4-with-ocr` mode provides enhanced accuracy by:
- Creating a hash map of clickable elements by coordinates
- Enabling text-based element selection
- Improving click precision through OCR analysis (see the sketch after this list)
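Conceptually, the OCR pass turns a screenshot into a text-to-coordinates lookup table. A simplified sketch using EasyOCR, which the framework lists among its vision dependencies (the real logic lives in `operate/utils/ocr.py` and is more involved):

```python
from typing import Dict, Tuple

import easyocr

reader = easyocr.Reader(["en"])  # loads English detection/recognition models once

def build_click_map(screenshot_path: str) -> Dict[str, Tuple[int, int]]:
    """Map each recognized text snippet to the center of its bounding box."""
    click_map = {}
    for bbox, text, confidence in reader.readtext(screenshot_path):
        if confidence < 0.5:
            continue  # skip low-confidence detections
        xs = [point[0] for point in bbox]  # bbox is four (x, y) corner points
        ys = [point[1] for point in bbox]
        click_map[text.lower()] = (int(sum(xs) / 4), int(sum(ys) / 4))
    return click_map
```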
The `gpt-4-with-som` mode uses visual prompting to enhance grounding:
- YOLOv8-based object detection for UI elements
- Visual markers overlaid on screenshots
- Improved spatial understanding for complex interfaces
Learn more: SoM Prompting Research Paper
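In outline, SoM prompting detects UI elements, numbers them directly on the screenshot, and lets the model answer with a mark ID instead of raw pixel coordinates. A hedged sketch using ultralytics YOLOv8 and Pillow; the `yolov8n.pt` weights are a stand-in (the framework ships its own weights in `operate/models/weights/`, and its implementation lives in `operate/utils/label.py`):

```python
from PIL import Image, ImageDraw
from ultralytics import YOLO

model = YOLO("yolov8n.pt")  # placeholder weights; the framework uses custom ones

def label_screenshot(path: str, out_path: str = "labeled.png") -> dict:
    """Overlay numbered marks on detected elements; return mark id -> center point."""
    image = Image.open(path)
    draw = ImageDraw.Draw(image)
    marks = {}
    detections = model(path)[0]  # run detection on the screenshot
    for i, box in enumerate(detections.boxes.xyxy.tolist()):
        x1, y1, x2, y2 = box
        marks[i] = (int((x1 + x2) / 2), int((y1 + y2) / 2))
        draw.rectangle(box, outline="red", width=2)
        draw.text((x1 + 2, y1 + 2), str(i), fill="red")  # the visible mark
    image.save(out_path)
    return marks
```

The model is then shown the labeled image and asked for a mark number, which the returned mapping converts back into a click target.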
Enable natural language voice commands:

```bash
operate --voice
```

This requires the additional audio dependencies described in the voice setup instructions above.
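For illustration, capturing a spoken objective can be done with any speech-to-text stack. The sketch below uses the SpeechRecognition library as an assumed stand-in; the framework's actual audio pipeline is defined by `requirements-audio.txt`:

```python
import speech_recognition as sr

def capture_objective() -> str:
    """Record a spoken objective from the microphone and return it as text."""
    recognizer = sr.Recognizer()
    with sr.Microphone() as source:
        print("Listening for your objective...")
        recognizer.adjust_for_ambient_noise(source)  # calibrate for background noise
        audio = recognizer.listen(source)
    return recognizer.recognize_google(audio)  # any speech-to-text backend works here
```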
Based on the project's technical requirements, the following constraints must be maintained:
- Modularity: Maintain modular design for adding/removing input/output modalities
- Human-like Simulation: Ensure all interactions can be simulated as a human operator would
- Extensibility: Support extensibility for new input/output types and AI models
- Security & Privacy: Prioritize security and privacy of user data and system access
- Minimal Dependencies: Keep dependencies minimal and well-documented
- API Documentation: Document all APIs and extension points thoroughly
- Error Handling: Provide robust error handling and comprehensive logging
- Cross-Platform: Ensure compatibility across macOS, Windows, and Linux
The framework follows established patterns for maintainability and extensibility:
- Modular AI Model Integration: Each model (GPT-4, Gemini, Claude, etc.) is implemented as a separate module in `/operate/models/` (a dispatch sketch follows this list)
- Utility-Based Architecture: Core functionality is separated into focused utility modules (`screenshot.py`, `ocr.py`, `operating_system.py`, etc.)
- Configuration Management: Centralized configuration handling in `config.py` for API keys and settings
- Cross-Platform Abstraction: Platform-specific code is abstracted in `operating_system.py`
- Prompt Engineering: Systematic prompt management in `prompts.py` for consistent AI interactions
- Computer Vision Pipeline: Integrated OCR and object detection for enhanced screen understanding
- Error Recovery: Graceful handling of API failures and system permission issues
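The modular pattern is easy to picture as a thin dispatch layer: every integration exposes the same call signature, and the orchestrator picks one by name. A hedged sketch with hypothetical function names (the real wiring is in `operate/models/apis.py`):

```python
from typing import Callable, Dict

# Every integration takes (objective, screenshot_path) and returns the next action.
ModelFn = Callable[[str, str], dict]

def call_gpt4_with_ocr(objective: str, screenshot: str) -> dict:
    raise NotImplementedError  # stub for illustration

def call_gemini_pro_vision(objective: str, screenshot: str) -> dict:
    raise NotImplementedError  # stub for illustration

MODELS: Dict[str, ModelFn] = {
    "gpt-4-with-ocr": call_gpt4_with_ocr,
    "gemini-pro-vision": call_gemini_pro_vision,
}

def get_next_action(model_name: str, objective: str, screenshot: str) -> dict:
    """Dispatch to the selected model; adding a model means adding one entry."""
    model = MODELS.get(model_name)
    if model is None:
        raise ValueError(f"Unknown model: {model_name}")
    return model(objective, screenshot)
```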
```
self-operating-computer/
├── operate/                     # Main package
│   ├── main.py                  # CLI entry point
│   ├── operate.py               # Core orchestration logic
│   ├── config.py                # Configuration management
│   ├── models/                  # AI model integrations
│   │   ├── apis.py              # API clients for AI services
│   │   ├── prompts.py           # System prompts and templates
│   │   └── weights/             # YOLOv8 model weights
│   └── utils/                   # Utility modules
│       ├── operating_system.py  # Cross-platform automation
│       ├── screenshot.py        # Screen capture utilities
│       ├── ocr.py               # Text recognition
│       ├── label.py             # Object detection
│       └── style.py             # Terminal styling
├── evaluate.py                  # Automated testing framework
├── requirements.txt             # Core dependencies
├── requirements-audio.txt       # Voice mode dependencies
└── SWARM-NOTES.md               # Technical guidelines
```
The framework includes an automated evaluation system to ensure consistent performance:
```bash
# Run all test cases
python evaluate.py
```

- Basic Navigation: "Go to Github.com"
- Interactive Tasks: "Go to Youtube.com and play a video"
- Custom Evaluations: GPT-4 evaluates screenshots against success criteria (a sketch of such a case follows below)

Before submitting changes:

- Run `python evaluate.py` to ensure all test cases pass
- Include evaluation screenshots in PRs that could impact performance
- Test across multiple platforms when possible
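An evaluation case pairs an objective with a success criterion that a vision model can check against the final screenshot. The field names below are hypothetical; see `evaluate.py` for the real structure:

```python
from dataclasses import dataclass

@dataclass
class EvalCase:
    objective: str         # the prompt given to the operating model
    success_criteria: str  # what GPT-4 should verify in the final screenshot

CASES = [
    EvalCase("Go to Github.com", "The GitHub homepage is visible."),
    EvalCase("Go to Youtube.com and play a video",
             "A YouTube video player is visible and playing."),
]
```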
We welcome contributions to improve the Self-Operating Computer Framework!
- Fork the repository and create a feature branch
- Make your changes following the coding conventions in SWARM-NOTES.md
- Test your changes using `python evaluate.py`
- Submit a Pull Request with evaluation screenshots for performance-impacting changes
See CONTRIBUTING.md for detailed guidelines.
Areas where help is especially welcome:

- Performance Optimization: Improve the screenshot grid overlay for better click accuracy
- Cross-Platform Compatibility: Fix remaining Linux and Windows compatibility issues
- New Model Integration: Add support for additional multimodal models
- Enhanced Security: Implement confirmation prompts for potentially harmful actions (see the sketch after this list)
- Prompt Engineering: Improve system prompts for better model performance
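One way to approach the confirmation-prompt item above is to gate risky operations behind an explicit yes/no check before execution. A minimal sketch; the action taxonomy here is hypothetical:

```python
RISKY_OPERATIONS = {"delete", "submit", "purchase", "send"}  # hypothetical taxonomy

def confirm_if_risky(action: dict) -> bool:
    """Ask the user before executing an action flagged as potentially harmful."""
    operation = action.get("operation", "")
    if operation not in RISKY_OPERATIONS:
        return True  # safe actions proceed without interruption
    answer = input(f"About to perform '{operation}'. Proceed? [y/N] ")
    return answer.strip().lower() == "y"
```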
```bash
# Clone and set up the development environment
git clone https://github.com/OthersideAI/self-operating-computer.git
cd self-operating-computer
pip install -e .

# Make changes and test
python evaluate.py

# Follow the conventional commit format
git commit -m "feat: add new model integration"
```

Important: This framework executes arbitrary mouse and keyboard actions on your computer. Use it with caution and awareness:
- Supervised Operation: Monitor the framework during operation, especially in production environments
- API Key Security: Store API keys securely and never commit them to version control
- Permission Management: Grant only necessary system permissions
- Data Privacy: Be aware that screenshots and actions may be sent to AI model APIs
- Testing Environment: Use in isolated or test environments when possible
Platform support:

- macOS: Full support with native permission handling
- Windows: Supported (some compatibility issues being addressed)
- Linux: Supported with X server (some compatibility issues being addressed)
System requirements:

- RAM: Minimum 4GB (8GB recommended for local models)
- Storage: 1GB for framework + 5GB additional for local LLaVA model
- Network: Internet connection required for cloud-based AI models
- Display: Standard desktop display (multi-monitor setups supported)
Known limitations:

- Linux/Windows: Some platform-specific issues exist (contributions welcome)
- LLaVA/Ollama: High error rates, intended as experimental foundation
- API Rate Limits: OpenAI requires $5 minimum spend for GPT-4 access
- Performance: Screenshot processing may have latency on slower systems
Further reading:

- SWARM-NOTES.md - Technical guidelines and conventions
- CONTRIBUTING.md - Contribution guidelines and workflow
- Set-of-Mark Research Paper - SoM prompting methodology
Community:

- Discord: Join our Discord Server for real-time discussions in #self-operating-computer
- Twitter: Follow @HyperWriteAI for updates
- LinkedIn: Connect with HyperWriteAI
- Feedback: Reach out to Josh for project input
API costs:

- OpenAI: Requires $5 minimum API spend for GPT-4 access. Learn more
- Google AI: Free tier available with rate limits. Get API key
- Anthropic: Usage-based pricing. Get API key
On the roadmap:

- Enhanced security features with user confirmation for sensitive actions
- Improved Linux and Windows compatibility
- Additional multimodal model integrations
- Performance optimization for screenshot processing
- Enhanced error handling and recovery mechanisms
- Mobile platform support exploration
- Web UI for remote operation (currently out of scope)
Open research directions:

- Optimal screenshot grid configuration for improved click accuracy
- Advanced prompt engineering for better model performance
- Local model performance improvements
- Multi-step task planning and execution
- Integration with specialized computer vision models
For technical implementation details, development patterns, and safety guidelines, please review SWARM-NOTES.md before contributing.



