Embodied AI agent for World of Warcraft combining computer vision (SAM2, FastSAM), video-language models (MineCLIP), and RL for intelligent game automation.

Bourn23/wow_bot

Embodied Agent for World of Warcraft

A comprehensive AI agent system for World of Warcraft automation that combines computer vision, reinforcement learning, and multimodal AI models to create an intelligent gaming assistant.

🎯 Project Overview

This project implements an embodied AI agent capable of playing World of Warcraft through visual perception, object tracking, and intelligent action selection. The system integrates multiple state-of-the-art AI models including SAM2 for object segmentation, MineCLIP for video-language understanding, and custom reinforcement learning agents.

🏗️ Architecture

Core Components

  1. Vision System

    • Real-time screen capture and processing
    • Object detection and segmentation using FastSAM and SAM2
  2. AI Models

    • MineCLIP: Video-language model for understanding game contexts
    • SAM2: Segment Anything Model 2 for object segmentation and tracking
    • CLIP: Contrastive language-image pre-training for image understanding
    • FastSAM: Fast Segment Anything Model for real-time segmentation
  3. Agent System

    • Based on OpenAI's Video PreTraining (VPT): Learning to Act by Watching Unlabeled Online Videos (https://github.com/openai/Video-Pre-Training)
    • Reinforcement learning agents with action prediction
    • Multimodal policy networks combining vision and language
    • Autoregressive action modeling
  4. Data Pipeline

    • Video and input recording system
    • Data preprocessing for model training
    • Frame-action synchronization

📁 Project Structure

├── vision/                         # Computer vision modules
│   ├── vision_agent.py             # Screen capture and window management
│   └── tracking_agent.py           # Object tracking implementation
├── env/                            # Environment interfaces
│   └── wow_env.py                  # World of Warcraft environment wrapper
├── MineCLIP/                       # MineCLIP model implementation
├── SAM2/                           # SAM2 model and training code
├── FastSAM/                        # FastSAM model implementation
├── data_loader.py                  # Data loading and preprocessing
├── bbox_selector.py                # Bounding box selection GUI
├── config.py                       # Configuration management
└── recordings/                     # Recorded gameplay data

🚀 Features

Computer Vision

  • Real-time Screen Capture: Efficient screen recording with window-specific targeting
  • Object Detection: YOLO-based detection for game entities
  • Object Segmentation: SAM2-powered precise object segmentation
  • Multi-object Tracking: Persistent tracking across video frames
  • Bounding Box Selection: Interactive GUI for target selection

AI Models

  • MineCLIP Integration: Video-language understanding for contextual gameplay
  • SAM2 Real-time Tracking: Live object segmentation and tracking
  • CLIP Analysis: Image-text similarity for decision making
  • Custom RL Agents: Trained agents for specific game tasks
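
The image-text similarity item above can be illustrated without loading any model. CLIP-style scoring takes the cosine similarity between an image embedding and each text prompt embedding, divides by a temperature, and applies a softmax. The embeddings, prompts, and `temperature` value below are made-up stand-ins for illustration, not outputs or settings of this project's models.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_prompts(image_emb, prompt_embs, temperature=0.07):
    """Return a softmax distribution over prompts for one image embedding.

    `temperature` is an illustrative value, not a setting taken from the repo.
    """
    logits = {p: cosine(image_emb, e) / temperature for p, e in prompt_embs.items()}
    peak = max(logits.values())  # subtract the max for numerical stability
    exps = {p: math.exp(v - peak) for p, v in logits.items()}
    total = sum(exps.values())
    return {p: v / total for p, v in exps.items()}


# Hypothetical 3-d embeddings; real CLIP embeddings have hundreds of dimensions.
image_emb = [0.9, 0.1, 0.2]
prompt_embs = {
    "an enemy wolf": [1.0, 0.0, 0.1],
    "a quest giver": [0.0, 1.0, 0.0],
}
probs = rank_prompts(image_emb, prompt_embs)
best = max(probs, key=probs.get)
```

An agent could use `best` (or the full distribution) as one input to its decision logic; how the project actually consumes these scores is not specified here.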

Data Management

  • Input Recording: Synchronized recording of video, keyboard, and mouse inputs
  • Data Visualization: Tools for analyzing recorded gameplay data
  • Preprocessing Pipeline: Automated data preparation for model training
  • Autoregressive Data Loading: Temporal sequence processing for RL training
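
As a rough illustration of temporal sequence loading, the sketch below cuts a recording into overlapping windows. Each sample pairs a window of frames and the actions taken before the last frame with the action to predict at the last frame. The function name `make_windows` and its parameters are hypothetical and do not come from `data_loader.py`.

```python
def make_windows(frames, actions, context_len, stride=1):
    """Split aligned frame/action lists into autoregressive training samples.

    Each sample is (frame_window, previous_actions, target_action), where
    target_action is the action recorded at the last frame of the window.
    """
    if len(frames) != len(actions):
        raise ValueError("frames and actions must have the same length")
    if context_len < 1:
        raise ValueError("context_len must be at least 1")

    samples = []
    for start in range(0, len(frames) - context_len + 1, stride):
        end = start + context_len
        samples.append((frames[start:end], actions[start:end - 1], actions[end - 1]))
    return samples


# Stand-in data: integers instead of image arrays, key names instead of encoded actions.
frames = [0, 1, 2, 3, 4]
actions = ["w", "w", "a", "d", "s"]
windows = make_windows(frames, actions, context_len=3)
```

A larger `stride` reduces overlap between samples, which trades dataset size against redundancy.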

Game Integration

  • WoW Environment: OpenAI Gym-compatible environment wrapper
  • Action Mapping: Translation between AI decisions and game controls
  • Health Monitoring: OCR-based health and status tracking
  • Quest Management: Automated quest detection and completion
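
One way to picture the action mapping layer is a lookup table from discrete action ids, as a policy might emit them, to low-level input events. The ids, key bindings, and hold durations below are invented for illustration; the project's real mapping in `env/wow_env.py` may look quite different.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InputEvent:
    device: str    # "keyboard" or "mouse"
    control: str   # key name or mouse button
    hold_s: float  # how long to hold the control, in seconds


# Hypothetical action table; none of these bindings are taken from the repo.
ACTION_TABLE = {
    0: [],                                      # no-op
    1: [InputEvent("keyboard", "w", 0.2)],      # move forward
    2: [InputEvent("keyboard", "a", 0.1)],      # turn left
    3: [InputEvent("keyboard", "d", 0.1)],      # turn right
    4: [InputEvent("keyboard", "1", 0.05)],     # first hotbar slot
    5: [InputEvent("keyboard", "tab", 0.05),    # target, then use first slot
        InputEvent("keyboard", "1", 0.05)],
}


def decode_action(action_id):
    """Translate a discrete action id into a list of input events."""
    try:
        return list(ACTION_TABLE[action_id])
    except KeyError:
        raise ValueError(f"unknown action id: {action_id}") from None


events = decode_action(5)
```

Keeping the table as plain data makes it easy to log, diff, and swap per character class without touching the policy code.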

🛠️ Installation

Prerequisites

  • Python 3.9+
  • PyTorch with MPS/CUDA support
  • OpenCV
  • macOS (for ScreenCaptureKit integration)

System Dependencies

# Install tesseract for OCR (health bar reading)
brew install tesseract

# Install MMSegmentation for DinoV2 (if using); the PyPI package is mmsegmentation, imported as mmseg
pip install mmsegmentation

Setup

# Clone the repository
git clone <repository-url>
cd Embodied_Agent_WoW

# Install dependencies
pip install -r requirements.txt

# Download model checkpoints
cd SAM2
./download_checkpoints.sh
cd ..

# Set up MineCLIP (run from the repository root)
cd MineCLIP
pip install -e .
cd ..

Model Downloads

  • SAM2 Checkpoints: Download from the official SAM2 repository
  • MineCLIP Models: Available variants include attn and avg
  • FastSAM Weights: Automatically downloaded on first use

🎮 Usage

Basic Agent Operation

# Start the main agent
python 15.01.main_agent_w_tracking.py

# Available commands:
# v - Start vision agent
# t - Start tracking agent  
# s - Show tracking visualization
# a - Take automated actions
# q - Quit

MineCLIP-Powered Agent

# Run agent with MineCLIP integration
python 19.wow_agent_with_mineclip.py variant=avg ckpt.path=MineCLIP/CKPT/avg.pth
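
The `key=value` arguments in the command above resemble Hydra-style command-line overrides, where a dotted key such as `ckpt.path` addresses a nested config field. Whether the script actually uses Hydra is an assumption here. The hand-rolled parser below only illustrates how such arguments map onto a nested configuration; it is not how Hydra or this script is implemented.

```python
def parse_overrides(args):
    """Turn ["a=1", "b.c=2"] into {"a": "1", "b": {"c": "2"}}.

    Values are kept as strings; a real config system would also handle types.
    """
    config = {}
    for arg in args:
        key, sep, value = arg.partition("=")
        if not sep or not key:
            raise ValueError(f"expected key=value, got: {arg!r}")
        *parents, leaf = key.split(".")
        node = config
        for part in parents:
            node = node.setdefault(part, {})  # create nested dicts on demand
        node[leaf] = value
    return config


overrides = parse_overrides(["variant=avg", "ckpt.path=MineCLIP/CKPT/avg.pth"])
```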

Data Recording

# Record gameplay data
python 00.10.01.save_video_click.py

# Visualize recorded data
python 23.data_loader_visualizer.py

Model Training

# Train action prediction model
python 21.ft_inverse_dynamics_model.py

# Test trained model
python 28.test_trained_model.py

📊 Models and Performance

Supported Models

| Model    | Purpose             | Performance   | Notes                      |
|----------|---------------------|---------------|----------------------------|
| SAM2     | Object Segmentation | Real-time     | Multiple size variants     |
| MineCLIP | Video-Language      | Context-aware | Pre-trained on gaming data |
| FastSAM  | Fast Segmentation   | 150+ FPS      | Lightweight alternative    |
| CLIP     | Image-Text          | High accuracy | Fine-tuned variants        |

Training Results

The project includes checkpoints for various model training epochs, with performance metrics tracked for:

  • Action prediction accuracy
  • Object tracking precision
  • Segmentation IoU scores
  • Policy reward optimization
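
For reference, segmentation IoU is the number of pixels where the predicted and ground-truth masks are both set, divided by the number of pixels where either is set. The sketch below computes it on small nested lists; real masks would typically be NumPy arrays or tensors. Returning 1.0 when both masks are empty is a convention chosen for this example, not something the project specifies.

```python
def mask_iou(pred, target):
    """Intersection over union of two binary masks given as nested lists."""
    if len(pred) != len(target) or any(len(p) != len(t) for p, t in zip(pred, target)):
        raise ValueError("masks must have the same shape")

    intersection = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p and t:
                intersection += 1
            if p or t:
                union += 1
    # Two empty masks agree perfectly under the convention used here.
    return intersection / union if union else 1.0


pred = [[1, 1, 0],
        [0, 1, 0]]
target = [[1, 0, 0],
          [0, 1, 1]]
score = mask_iou(pred, target)  # 2 shared pixels out of 4 covered pixels
```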

🔧 Configuration

Model Configuration

# conf_mineclip.yaml
mineclip:
  variant: avg  # or attn
  resolution: [256, 160]
  device: mps  # or cuda/cpu

Agent Configuration

# config.py
AGENT_RESOLUTION = (224, 224)
MINECLIP_CONFIG = {
    'resolution': [256, 160],
    'fps': 60
}

📈 Components Detail

Vision Pipeline

  1. Screen Capture: Uses macOS ScreenCaptureKit for efficient capture
  2. Object Detection: YOLO-based detection with FastSAM segmentation
  3. Tracking: SAM2-powered persistent object tracking
  4. Action Recognition: MineCLIP-based understanding of game states

Data Processing

  1. Video Processing: Frame extraction and preprocessing
  2. Action Synchronization: Keyboard/mouse event alignment
  3. Sequence Generation: Temporal sequences for RL training
  4. Augmentation: Data augmentation for robust training
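
The action synchronization step can be sketched as a timestamp lookup: each keyboard or mouse event is assigned to the most recent frame captured at or before the event time. The event tuples and the function name `align_events_to_frames` are assumptions made for this example; the recorder's actual file format is not described here.

```python
from bisect import bisect_right


def align_events_to_frames(frame_times, events):
    """Group input events by the frame that was on screen when they occurred.

    frame_times: ascending capture timestamps, one per frame.
    events: iterable of (timestamp, name) tuples in any order.
    Events that precede the first frame are dropped.
    """
    per_frame = [[] for _ in frame_times]
    for timestamp, name in sorted(events):
        index = bisect_right(frame_times, timestamp) - 1  # latest frame <= timestamp
        if index >= 0:
            per_frame[index].append(name)
    return per_frame


# Stand-in timestamps in seconds (about 10 fps) and hypothetical event names.
frame_times = [0.0, 0.1, 0.2]
events = [(0.05, "w_down"), (0.1, "click"), (0.25, "w_up"), (-0.01, "early")]
aligned = align_events_to_frames(frame_times, events)
```

Binary search keeps the alignment at O(log n) per event, which matters for long recordings with dense mouse movement.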

Model Integration

  1. Multi-modal Fusion: Combining vision and language models
  2. Real-time Inference: Optimized for live gameplay
  3. Transfer Learning: Pre-trained models adapted for gaming
  4. Ensemble Methods: Multiple models for robust predictions

🧪 Experimental Features

  • Autoregressive Action Modeling: Predicting action sequences
  • Multi-object Tracking: Tracking multiple game entities
  • Quest Automation: Automated quest completion
  • Combat Optimization: AI-driven combat strategies
  • Exploration Strategies: Intelligent map exploration
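
Autoregressive action modeling means each predicted action is fed back as context for the next prediction. The rollout loop below shows that feedback with a toy rule-based policy standing in for a trained model. `rollout`, `toy_policy`, the observation fields, and the action names are all hypothetical.

```python
from collections import deque


def rollout(policy, observations, context_len=4, start_action="noop"):
    """Run a policy step by step, feeding its own past actions back in."""
    history = deque([start_action] * context_len, maxlen=context_len)
    actions = []
    for obs in observations:
        action = policy(obs, tuple(history))
        actions.append(action)
        history.append(action)  # the oldest action falls out of the window
    return actions


def toy_policy(obs, history):
    """Stand-in for a learned model: attack a visible enemy, but never twice in a row."""
    if obs["enemy_visible"] and history[-1] != "attack":
        return "attack"
    return "move_forward"


observations = [
    {"enemy_visible": False},
    {"enemy_visible": True},
    {"enemy_visible": True},
    {"enemy_visible": True},
]
predicted = rollout(toy_policy, observations)
```

Because the third decision depends on the second action rather than only on the current observation, the same observation yields different actions at different steps.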

🤝 Contributing

Contributions are welcome! Areas for improvement:

  • Additional game environment support
  • Model optimization and efficiency improvements
  • New vision algorithms integration
  • Enhanced data collection tools

📄 License

This project is for research and educational purposes. Please ensure compliance with game terms of service.

🔗 Related Work

📞 Contact

For questions and collaboration opportunities, please open an issue or contact the development team.


This project represents cutting-edge research in embodied AI and multimodal learning for gaming applications.