owl_wms

Basic world models

Training

Deploy Training Codebase and Train on SkyPilot

Set Up SkyPilot

# Install SkyPilot
python3 -m pip install -U skypilot

# Authenticate
sky api login -e https://owlskypilot:<password>@cluster.openworldlabs.ai

Create Docker Image for your Current Codebase

Ensure you've configured your Docker registry settings first. See "Docker Build & Deploy (For multinode training)" below.

# Dockerizes the codebase and sets the $IMAGE_REF environment variable used by SkyPilot
./build_and_push.sh

Launch your Trainer

export EXPERIMENT_NAME=new-attention-pattern-v2
export TRAIN_CONFIG=skypilot/config.yaml

# Provision Single Node
sky launch --infra kubernetes --gpus H200:8 --num-nodes 1 --name $EXPERIMENT_NAME $TRAIN_CONFIG

# **OR** Provision Multiple Nodes
sky launch --infra kubernetes --gpus H200:8 --num-nodes 2 --name $EXPERIMENT_NAME $TRAIN_CONFIG
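The two launch commands above differ only in the node count. A minimal sketch of a wrapper that prints the command for review before you run it is shown here; `launch_cmd` is a hypothetical helper and not part of the repository, and it uses only the flags already shown above.

```shell
# Hypothetical helper (not part of the repository): prints the launch command
# for a given node count so it can be reviewed before running.
launch_cmd() {
  local num_nodes="${1:-1}"
  # Accept only positive integers for the node count.
  case "$num_nodes" in
    ''|*[!0-9]*) echo "num_nodes must be a positive integer" >&2; return 1 ;;
  esac
  if [ "$num_nodes" -lt 1 ]; then
    echo "num_nodes must be a positive integer" >&2
    return 1
  fi
  # Both variables are exported in the step above.
  if [ -z "${EXPERIMENT_NAME:-}" ] || [ -z "${TRAIN_CONFIG:-}" ]; then
    echo "set EXPERIMENT_NAME and TRAIN_CONFIG first" >&2
    return 1
  fi
  echo "sky launch --infra kubernetes --gpus H200:8 --num-nodes $num_nodes --name $EXPERIMENT_NAME $TRAIN_CONFIG"
}

export EXPERIMENT_NAME=new-attention-pattern-v2
export TRAIN_CONFIG=skypilot/config.yaml

launch_cmd 1   # single node
launch_cmd 2   # two nodes
```

Printing rather than executing keeps the sketch safe to paste; once the output looks right, run the printed line directly.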

SkyPilot Basic Commands

# Launch multi-node training on SkyPilot
sky launch skypilot/config.yaml

# Check job status
sky status

# View logs
sky logs <cluster_name>

Train On Other Hosts

Set up the model, trainer, and requirements

export REPO=https://github.com/wayfarer-labs/owl-wms
export EXPERIMENT_COMMIT=20bb0973336ab8696ad60e26bf1a7d5004191c70

git clone --recursive -j8 $REPO
cd owl-wms
git fetch && git checkout $EXPERIMENT_COMMIT
pip install -r requirements.txt
pip install -r owl-vaes/requirements.txt

# Configure WandB
wandb login

Then run training:

# Single GPU
python train.py --config_path configs/basic.yml

# Multi-GPU (single node)
torchrun --nproc_per_node=8 train.py --config_path configs/basic.yml
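The choice between the two commands depends only on how many GPUs the host has. As a rough sketch, the hypothetical helper `train_cmd` below (not part of the repository) prints the matching command for a given GPU count; the default config path is the one used above.

```shell
# Hypothetical helper: prints the training command for a given GPU count.
# On an NVIDIA host the count could come from `nvidia-smi -L | wc -l`.
train_cmd() {
  local gpus="${1:-1}" config="${2:-configs/basic.yml}"
  if [ "$gpus" -le 1 ]; then
    # One GPU: plain Python is enough.
    echo "python train.py --config_path $config"
  else
    # Several GPUs on one node: one torchrun worker per GPU.
    echo "torchrun --nproc_per_node=$gpus train.py --config_path $config"
  fi
}

train_cmd 1
train_cmd 8
```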

Docker Build & Deploy (For multinode training)

Setup

Copy the environment template:
```
cp .env.example .env
```

Configure your Docker registry settings in .env:

# Docker Registry Configuration
PROJECT_ID=your-project-id
REGISTRY=us-central1-docker.pkg.dev
REPOSITORY=your-repository
IMAGE_NAME=your-image-name
DEFAULT_TAG=latest

# Local build configuration
LOCAL_REGISTRY=us-central1-docker.pkg.dev
LOCAL_PROJECT=your-local-project
LOCAL_REPOSITORY=your-local-repo
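To see how these values fit together, the sketch below composes an image reference in the usual Artifact Registry layout (`REGISTRY/PROJECT_ID/REPOSITORY/IMAGE_NAME:TAG`). This is an assumption about how the settings are combined; `build_and_push.sh` may assemble the reference differently, and `image_ref` is a hypothetical helper.

```shell
# Hypothetical helper: composes an image reference from the .env values.
# In practice the values could be loaded with: set -a; . ./.env; set +a
image_ref() {
  local name value
  for name in REGISTRY PROJECT_ID REPOSITORY IMAGE_NAME; do
    # Look up the variable whose name is stored in $name (names are fixed above).
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing $name in .env" >&2
      return 1
    fi
  done
  # An explicit tag argument wins over DEFAULT_TAG, which falls back to "latest".
  echo "${REGISTRY}/${PROJECT_ID}/${REPOSITORY}/${IMAGE_NAME}:${1:-${DEFAULT_TAG:-latest}}"
}

# Placeholder values copied from the template above.
REGISTRY=us-central1-docker.pkg.dev
PROJECT_ID=your-project-id
REPOSITORY=your-repository
IMAGE_NAME=your-image-name
DEFAULT_TAG=latest

image_ref          # uses DEFAULT_TAG
image_ref v1.0.0   # uses the given tag
```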

Build and Deploy

# Build, tag, and push with default tag
./build_and_push.sh

# Build, tag, and push with custom tag
./build_and_push.sh v1.0.0

The script will:

1. Build the Docker image locally
2. Tag it for your remote registry
3. Push to the configured registry
4. Update skypilot/config.yaml with the new image tag
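The first three steps correspond to standard Docker commands. The sketch below prints them as a dry run; it is not the contents of `build_and_push.sh`. The helper `plan_build_and_push` and the local image name `owl-wms` are hypothetical, and the config update is left out because the source does not show how the script edits that file.

```shell
# Hypothetical dry run of the build, tag, and push steps.
# Usage: plan_build_and_push <remote-repo> [tag]
plan_build_and_push() {
  local remote_repo="${1:-}" tag="${2:-latest}"
  if [ -z "$remote_repo" ]; then
    echo "usage: plan_build_and_push <remote-repo> [tag]" >&2
    return 1
  fi
  local local_image="owl-wms:$tag"   # assumed local image name
  echo "docker build -t $local_image ."
  echo "docker tag $local_image $remote_repo:$tag"
  echo "docker push $remote_repo:$tag"
}

plan_build_and_push us-central1-docker.pkg.dev/your-project-id/your-repository/your-image-name v1.0.0
```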

Multi-Node Training with SkyPilot

Setup

Edit skypilot/config.yaml to specify your training configuration:

# Change this line to point to your config file
train.py --config_path configs/YOUR_CONFIG.yml

Optionally adjust the number of nodes and GPU type:

resources:
  accelerators: H200:8  # 8 H200s per node
num_nodes: 2            # Number of nodes

Prerequisites

Make sure you're authenticated with Google Cloud:

gcloud auth login
gcloud auth application-default login

Also make sure your Google Cloud project ID is set.
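One way to set and confirm the active project is sketched below. It assumes the project is the same `PROJECT_ID` used in `.env`; the placeholder value is illustrative only.

```shell
# Assumed to match PROJECT_ID in .env; replace with your real project ID.
export PROJECT_ID=your-project-id

if command -v gcloud >/dev/null 2>&1; then
  # Set the active project, then print it back to confirm.
  gcloud config set project "$PROJECT_ID"
  gcloud config get-value project
else
  echo "gcloud not found; install the Google Cloud CLI first" >&2
fi
```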
