A comprehensive AI agent system for World of Warcraft automation that combines computer vision, reinforcement learning, and multimodal AI models to create an intelligent gaming assistant.
This project implements an embodied AI agent capable of playing World of Warcraft through visual perception, object tracking, and intelligent action selection. The system integrates multiple state-of-the-art AI models including SAM2 for object segmentation, MineCLIP for video-language understanding, and custom reinforcement learning agents.
- Vision System
  - Real-time screen capture and processing
  - Object detection and segmentation using FastSAM and SAM2
- AI Models
  - MineCLIP: Video-language model for understanding game contexts
  - SAM2: Segment Anything Model 2 for object segmentation and tracking
  - CLIP: Contrastive language-image pre-training for image understanding
  - FastSAM: Fast Segment Anything Model for real-time segmentation
- Agent System
  - Based on OpenAI's [Video PreTraining (VPT): Learning to Act by Watching Unlabeled Online Videos](https://github.com/openai/Video-Pre-Training)
  - Reinforcement learning agents with action prediction
  - Multimodal policy networks combining vision and language
  - Autoregressive action modeling
- Data Pipeline
  - Video and input recording system
  - Data preprocessing for model training
  - Frame-action synchronization
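
Frame-action synchronization can be sketched as below, assuming every frame and every input event carries a timestamp in seconds on a shared clock. The function name `align_events_to_frames` and the `(timestamp, payload)` event layout are hypothetical and are not taken from this repository's recorder.

```python
from bisect import bisect_right


def align_events_to_frames(frame_times, events):
    """Group input events by the frame that was on screen when they occurred.

    frame_times: sorted list of frame timestamps in seconds.
    events: iterable of (timestamp, payload) tuples, in any order.
    Returns one list of payloads per frame.
    """
    per_frame = [[] for _ in frame_times]
    for timestamp, payload in sorted(events, key=lambda e: e[0]):
        # Index of the last frame captured at or before the event.
        idx = bisect_right(frame_times, timestamp) - 1
        if idx >= 0:  # events before the first frame are dropped
            per_frame[idx].append(payload)
    return per_frame


# Hypothetical recording: three frames at 20 FPS and four input events.
frame_times = [0.00, 0.05, 0.10]
events = [(0.01, "key:w"), (0.06, "mouse:left"), (0.07, "key:space"), (0.20, "key:q")]
aligned = align_events_to_frames(frame_times, events)
```

Assigning each event to the most recent frame keeps the pairing causal: an action is never attached to a frame the player had not yet seen.
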
├── vision/                 # Computer vision modules
│   ├── vision_agent.py     # Screen capture and window management
│   └── tracking_agent.py   # Object tracking implementation
├── env/                    # Environment interfaces
│   └── wow_env.py          # World of Warcraft environment wrapper
├── MineCLIP/               # MineCLIP model implementation
├── SAM2/                   # SAM2 model and training code
├── FastSAM/                # FastSAM model implementation
├── data_loader.py          # Data loading and preprocessing
├── bbox_selector.py        # Bounding box selection GUI
├── config.py               # Configuration management
└── recordings/             # Recorded gameplay data
- Real-time Screen Capture: Efficient screen recording with window-specific targeting
- Object Detection: YOLO-based detection for game entities
- Object Segmentation: SAM2-powered precise object segmentation
- Multi-object Tracking: Persistent tracking across video frames
- Bounding Box Selection: Interactive GUI for target selection
- MineCLIP Integration: Video-language understanding for contextual gameplay
- SAM2 Real-time Tracking: Live object segmentation and tracking
- CLIP Analysis: Image-text similarity for decision making
- Custom RL Agents: Trained agents for specific game tasks
- Input Recording: Synchronized recording of video, keyboard, and mouse inputs
- Data Visualization: Tools for analyzing recorded gameplay data
- Preprocessing Pipeline: Automated data preparation for model training
- Autoregressive Data Loading: Temporal sequence processing for RL training
- WoW Environment: OpenAI Gym-compatible environment wrapper
- Action Mapping: Translation between AI decisions and game controls
- Health Monitoring: OCR-based health and status tracking
- Quest Management: Automated quest detection and completion
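
The environment wrapper and action mapping listed above could look roughly like the following. This is a minimal sketch that mimics the classic Gym `reset`/`step` interface (with the four-value `step` return) without importing `gym`. The class name `WoWEnvSketch`, the action table, and the key bindings are hypothetical placeholders; the real `wow_env.py` may differ.

```python
from dataclasses import dataclass, field

# Hypothetical discrete action table; the real key bindings are not given here.
ACTION_TO_KEYS = {
    0: [],          # no-op
    1: ["w"],       # move forward
    2: ["a"],       # turn or strafe left
    3: ["d"],       # turn or strafe right
    4: ["space"],   # jump
    5: ["1"],       # first action-bar slot
}


@dataclass
class WoWEnvSketch:
    """Gym-style interface (reset/step) with placeholder observations and rewards."""
    max_steps: int = 100
    steps: int = 0
    sent_keys: list = field(default_factory=list)

    def reset(self):
        self.steps = 0
        self.sent_keys.clear()
        return self._observe()

    def step(self, action):
        if action not in ACTION_TO_KEYS:
            raise ValueError(f"unknown action id: {action}")
        keys = ACTION_TO_KEYS[action]
        self.sent_keys.append(keys)  # a real wrapper would inject these key presses
        self.steps += 1
        reward = 0.0                 # placeholder; no reward signal is modeled here
        done = self.steps >= self.max_steps
        return self._observe(), reward, done, {"keys": keys}

    def _observe(self):
        # Placeholder observation; a real wrapper would return the captured frame.
        return {"step": self.steps}


env = WoWEnvSketch(max_steps=2)
first_obs = env.reset()
obs1, reward1, done1, info1 = env.step(1)
obs2, reward2, done2, info2 = env.step(4)
```

Keeping the action table separate from the environment class lets the same policy outputs be remapped to different key bindings without touching the step logic.
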
- Python 3.9+
- PyTorch with MPS/CUDA support
- OpenCV
- macOS (for ScreenCaptureKit integration)
# Install tesseract for OCR (health bar reading)
brew install tesseract
# Install MMSegmentation for DinoV2 (if using); the PyPI package is mmsegmentation, imported as mmseg
pip install mmsegmentation
# Clone the repository
git clone <repository-url>
cd Embodied_Agent_WoW
# Install dependencies
pip install -r requirements.txt
# Download model checkpoints
cd SAM2
./download_checkpoints.sh
# Setup MineCLIP (go back up from SAM2/ first)
cd ../MineCLIP
pip install -e .
- SAM2 Checkpoints: Download from the official SAM2 repository
- MineCLIP Models: Available variants include `attn` and `avg`
- FastSAM Weights: Automatically downloaded on first use
# Start the main agent
python 15.01.main_agent_w_tracking.py
# Available commands:
# v - Start vision agent
# t - Start tracking agent
# s - Show tracking visualization
# a - Take automated actions
# q - Quit
# Run agent with MineCLIP integration
python 19.wow_agent_with_mineclip.py variant=avg ckpt.path=MineCLIP/CKPT/avg.pth
# Record gameplay data
python 00.10.01.save_video_click.py
# Visualize recorded data
python 23.data_loader_visualizer.py
# Train action prediction model
python 21.ft_inverse_dynamics_model.py
# Test trained model
python 28.test_trained_model.py
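
The `variant=avg ckpt.path=...` arguments in the MineCLIP command above look like dotted `key=value` overrides in the style of Hydra. Whether the script actually uses Hydra is an assumption. The sketch below only illustrates how such dotted overrides map onto a nested configuration; `parse_overrides` is a hypothetical helper, not part of this project.

```python
def parse_overrides(args):
    """Turn ['a=1', 'b.c=x'] into {'a': '1', 'b': {'c': 'x'}}; values stay strings."""
    config = {}
    for arg in args:
        key, sep, value = arg.partition("=")
        if not sep or not key:
            raise ValueError(f"expected key=value, got {arg!r}")
        node = config
        *parents, leaf = key.split(".")
        for name in parents:
            # Walk (or create) the nested dictionaries named by the dotted path.
            node = node.setdefault(name, {})
        node[leaf] = value
    return config


overrides = parse_overrides(["variant=avg", "ckpt.path=MineCLIP/CKPT/avg.pth"])
```

Here `ckpt.path` ends up as the `path` key inside a `ckpt` section, while `variant` stays at the top level.
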
| Model | Purpose | Performance | Notes |
|---|---|---|---|
| SAM2 | Object Segmentation | Real-time | Multiple size variants |
| MineCLIP | Video-Language | Context-aware | Pre-trained on Minecraft videos |
| FastSAM | Fast Segmentation | 150+ FPS | Lightweight alternative |
| CLIP | Image-Text | High accuracy | Fine-tuned variants |
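
The image-text scoring that CLIP-style models provide can be illustrated with plain Python. This is a toy sketch under stated assumptions: the three-dimensional embeddings, the prompt texts, and the temperature value are made up for illustration, and a real pipeline would obtain embeddings from the model's image and text encoders.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_prompts(image_embedding, prompt_embeddings, temperature=0.07):
    """Softmax over temperature-scaled cosine similarities, one score per prompt."""
    names = list(prompt_embeddings)
    logits = [
        cosine_similarity(image_embedding, prompt_embeddings[name]) / temperature
        for name in names
    ]
    peak = max(logits)  # subtract the maximum for numerical stability
    weights = [math.exp(logit - peak) for logit in logits]
    total = sum(weights)
    return {name: w / total for name, w in zip(names, weights)}


# Toy embeddings standing in for encoder outputs.
image = [0.9, 0.1, 0.0]
prompts = {
    "an enemy nearby": [1.0, 0.0, 0.0],
    "a quest giver": [0.0, 1.0, 0.0],
}
probs = rank_prompts(image, prompts)
best = max(probs, key=probs.get)
```

The prompt with the highest probability can then feed into action selection, for example by switching between combat and dialogue behaviors.
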
The project includes checkpoints for various model training epochs, with performance metrics tracked for:
- Action prediction accuracy
- Object tracking precision
- Segmentation IoU scores
- Policy reward optimization
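
Of the metrics above, segmentation IoU has a compact definition: the number of pixels where both masks are set, divided by the number of pixels where either mask is set. The sketch below computes it for binary masks stored as nested lists. The function name `mask_iou` and the choice to score two empty masks as 1.0 are assumptions made for illustration.

```python
def mask_iou(mask_a, mask_b):
    """Intersection over union of two same-shaped binary masks (nested lists of 0/1)."""
    same_rows = len(mask_a) == len(mask_b)
    if not same_rows or any(len(ra) != len(rb) for ra, rb in zip(mask_a, mask_b)):
        raise ValueError("masks must have the same shape")
    intersection = 0
    union = 0
    for row_a, row_b in zip(mask_a, mask_b):
        for a, b in zip(row_a, row_b):
            intersection += 1 if (a and b) else 0
            union += 1 if (a or b) else 0
    # Two empty masks agree perfectly; this sketch defines their IoU as 1.0.
    return intersection / union if union else 1.0


predicted = [[1, 1, 0],
             [0, 1, 0]]
ground_truth = [[1, 0, 0],
                [0, 1, 1]]
score = mask_iou(predicted, ground_truth)  # 2 shared pixels / 4 covered pixels
```
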
# conf_mineclip.yaml
mineclip:
  variant: avg # or attn
  resolution: [256, 160]
  device: mps # or cuda/cpu
# config.py
AGENT_RESOLUTION = (224, 224)
MINECLIP_CONFIG = {
    'resolution': [256, 160],
    'fps': 60
}
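
Because the `device` setting may name a backend that is missing on a given machine, a small fallback helper can be useful. The sketch below is hypothetical: `resolve_device` and its fallback order (CUDA, then MPS, then CPU) are assumptions, not behavior documented for this project.

```python
def resolve_device(requested, mps_available, cuda_available):
    """Return the requested device if usable, otherwise fall back to cuda, mps, then cpu."""
    available = {"mps": mps_available, "cuda": cuda_available, "cpu": True}
    if requested not in available:
        raise ValueError(f"unsupported device: {requested!r}")
    if available[requested]:
        return requested
    for candidate in ("cuda", "mps", "cpu"):
        if available[candidate]:
            return candidate


# The config asks for MPS, but this hypothetical machine only has CUDA.
device = resolve_device("mps", mps_available=False, cuda_available=True)
```

In a PyTorch setup the two flags would typically come from `torch.backends.mps.is_available()` and `torch.cuda.is_available()`.
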
- Screen Capture: Uses macOS ScreenCaptureKit for efficient capture
- Object Detection: YOLO-based detection with FastSAM segmentation
- Tracking: SAM2-powered persistent object tracking
- Action Recognition: MineCLIP-based understanding of game states
- Video Processing: Frame extraction and preprocessing
- Action Synchronization: Keyboard/mouse event alignment
- Sequence Generation: Temporal sequences for RL training
- Augmentation: Data augmentation for robust training
- Multi-modal Fusion: Combining vision and language models
- Real-time Inference: Optimized for live gameplay
- Transfer Learning: Pre-trained models adapted for gaming
- Ensemble Methods: Multiple models for robust predictions
- Autoregressive Action Modeling: Predicting action sequences
- Multi-object Tracking: Tracking multiple game entities
- Quest Automation: Automated quest completion
- Combat Optimization: AI-driven combat strategies
- Exploration Strategies: Intelligent map exploration
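
Sequence generation for autoregressive action modeling can be sketched as a sliding window over synchronized frames and actions. This is a minimal illustration, assuming one recorded action per frame. The function name `make_windows` and the `context` parameter are hypothetical and do not describe the project's actual data loader.

```python
def make_windows(frames, actions, context):
    """Yield (frame_window, past_actions, target_action) training samples.

    For each step t >= context, the sample holds frames[t-context..t], the
    actions taken on the context steps before t, and actions[t] as the target.
    """
    if len(frames) != len(actions):
        raise ValueError("frames and actions must have the same length")
    for t in range(context, len(frames)):
        yield frames[t - context:t + 1], actions[t - context:t], actions[t]


# Toy recording with string placeholders instead of image arrays.
frames = ["f0", "f1", "f2", "f3"]
actions = ["noop", "w", "w", "space"]
samples = list(make_windows(frames, actions, context=2))
```

Each sample conditions the prediction only on frames up to the current step and on earlier actions, which is what makes the setup autoregressive.
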
Contributions are welcome! Areas for improvement:
- Additional game environment support
- Model optimization and efficiency improvements
- New vision algorithms integration
- Enhanced data collection tools
This project is for research and educational purposes. Please ensure compliance with game terms of service.
- SAM2: Segment Anything in Images and Videos
- MineCLIP: Foundation Model for MineDojo
- FastSAM: Fast Segment Anything
- CLIP: Contrastive Language-Image Pre-training
- [Video PreTraining (VPT): Learning to Act by Watching Unlabeled Online Videos](https://github.com/openai/Video-Pre-Training)
For questions and collaboration opportunities, please open an issue or contact the development team.
This project is a research effort in embodied AI and multimodal learning applied to games.