Basic world models
Set up SkyPilot
# Install SkyPilot
python3 -m pip install -U skypilot
# Authenticate
sky api login -e https://owlskypilot:<password>@cluster.openworldlabs.ai

Create Docker Image for your Current Codebase
Ensure you've configured your Docker Registry settings first. See "Docker Build & Deploy (For multinode training)"
# Builds the Docker image and sets the $IMAGE_REF environment variable used by SkyPilot
./build_and_push.sh

Launch your Trainer
export EXPERIMENT_NAME=new-attention-pattern-v2
export TRAIN_CONFIG=skypilot/config.yaml
# Provision Single Node
sky launch --infra kubernetes --gpus H200:8 --num-nodes 1 --name $EXPERIMENT_NAME $TRAIN_CONFIG
# **OR** Provision Multiple Nodes
sky launch --infra kubernetes --gpus H200:8 --num-nodes 2 --name $EXPERIMENT_NAME $TRAIN_CONFIG

SkyPilot Basic Commands
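The `sky launch` invocations in this README differ only in the node count, experiment name, and config path. The sketch below assembles the same command line from those values so it can be printed or reviewed before running. The helper name `sky_launch_argv` is hypothetical and not part of SkyPilot; only flags that already appear in this README are used.

```python
import shlex


def sky_launch_argv(name, config, gpus="H200:8", num_nodes=1, infra="kubernetes"):
    """Build the argument list for `sky launch` (hypothetical helper).

    Uses only the flags shown in this README: --infra, --gpus,
    --num-nodes and --name, followed by the task config path.
    """
    if num_nodes < 1:
        raise ValueError("num_nodes must be at least 1")
    return [
        "sky", "launch",
        "--infra", infra,
        "--gpus", gpus,
        "--num-nodes", str(num_nodes),
        "--name", name,
        config,
    ]


if __name__ == "__main__":
    argv = sky_launch_argv("new-attention-pattern-v2", "skypilot/config.yaml", num_nodes=2)
    # Print the command instead of running it, so it can be checked first.
    print(shlex.join(argv))
```

To actually launch, the list could be passed to `subprocess.run(argv, check=True)` on a machine where the `sky` CLI is installed and authenticated.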
# Launch multi-node training on SkyPilot
sky launch skypilot/config.yaml
# Check job status
sky status
# View logs
sky logs <cluster_name>

Set up the model / trainer / requirements
export REPO=https://github.com/wayfarer-labs/owl-wms
export EXPERIMENT_COMMIT=20bb0973336ab8696ad60e26bf1a7d5004191c70
git clone --recursive -j8 $REPO
cd owl-wms
git fetch && git checkout $EXPERIMENT_COMMIT
pip install -r requirements.txt
pip install -r owl-vaes/requirements.txt
# Configure WandB
wandb login

Then run training:
# Single GPU
python train.py --config_path configs/basic.yml
# Multi-GPU (single node)
torchrun --nproc_per_node=8 train.py --config_path configs/basic.yml

Docker Build & Deploy (For multinode training)
- Copy the environment template:
  cp .env.example .env
- Configure your Docker registry settings in .env:
  # Docker Registry Configuration
  PROJECT_ID=your-project-id
  REGISTRY=us-central1-docker.pkg.dev
  REPOSITORY=your-repository
  IMAGE_NAME=your-image-name
  DEFAULT_TAG=latest
  # Local build configuration
  LOCAL_REGISTRY=us-central1-docker.pkg.dev
  LOCAL_PROJECT=your-local-project
  LOCAL_REPOSITORY=your-local-repo
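As a sanity check on the `.env` settings above, the following sketch parses the file contents and composes a full image reference. It assumes the usual Artifact Registry layout `REGISTRY/PROJECT_ID/REPOSITORY/IMAGE_NAME:TAG`; whether `build_and_push.sh` composes the reference in exactly this way is an assumption, and `parse_env` and `image_ref` are hypothetical helper names.

```python
REQUIRED_KEYS = ("PROJECT_ID", "REGISTRY", "REPOSITORY", "IMAGE_NAME", "DEFAULT_TAG")


def parse_env(text):
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        settings[key.strip()] = value.strip().strip('"').strip("'")
    return settings


def image_ref(settings, tag=None):
    """Compose REGISTRY/PROJECT_ID/REPOSITORY/IMAGE_NAME:TAG (assumed layout)."""
    missing = [key for key in REQUIRED_KEYS if not settings.get(key)]
    if missing:
        raise ValueError("missing .env settings: " + ", ".join(missing))
    base = "{REGISTRY}/{PROJECT_ID}/{REPOSITORY}/{IMAGE_NAME}".format(**settings)
    # Fall back to DEFAULT_TAG when no explicit tag is given.
    return base + ":" + (tag or settings["DEFAULT_TAG"])


# Placeholder values copied from the template above.
EXAMPLE_ENV = """\
# Docker Registry Configuration
PROJECT_ID=your-project-id
REGISTRY=us-central1-docker.pkg.dev
REPOSITORY=your-repository
IMAGE_NAME=your-image-name
DEFAULT_TAG=latest
"""

if __name__ == "__main__":
    print(image_ref(parse_env(EXAMPLE_ENV)))
```

Running this against a real `.env` (for example `parse_env(open(".env").read())`) surfaces missing keys before a build is attempted.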
# Build, tag, and push with default tag
./build_and_push.sh
# Build, tag, and push with custom tag
./build_and_push.sh v1.0.0
The script will:
- Build the Docker image locally
- Tag it for your remote registry
- Push to the configured registry
- Update skypilot/config.yaml with the new image tag
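The four steps listed above can be sketched as a dry-run plan. This is not the contents of `build_and_push.sh`, which is not shown here. The `docker build`, `docker tag`, and `docker push` commands are standard Docker CLI; the assumption that the config stores the image on an `image_id: docker:<ref>` line, and the helper names `build_plan` and `update_image_line`, are illustrative only.

```python
import re


def build_plan(local_tag, remote_ref, context="."):
    """Return the Docker commands for build, tag and push, without running them."""
    return [
        ["docker", "build", "-t", local_tag, context],
        ["docker", "tag", local_tag, remote_ref],
        ["docker", "push", remote_ref],
    ]


def update_image_line(config_text, remote_ref):
    """Point the first `image_id:` line at remote_ref.

    Assumes the config stores the image as `image_id: docker:<ref>`;
    the real config file may be laid out differently.
    """
    pattern = r"(?m)^([ \t]*image_id:[ \t]*).*$"
    # A function replacement avoids backslash-escape surprises in remote_ref.
    return re.sub(pattern, lambda m: m.group(1) + "docker:" + remote_ref, config_text, count=1)


if __name__ == "__main__":
    ref = "us-central1-docker.pkg.dev/your-project-id/your-repository/your-image-name:v1.0.0"
    for cmd in build_plan("your-image-name:v1.0.0", ref):
        print(" ".join(cmd))
    print(update_image_line("resources:\n  image_id: docker:old\n", ref), end="")
```

Printing the plan first makes it easy to confirm the target registry and tag before anything is pushed.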
- Edit skypilot/config.yaml to specify your training configuration:
  # Change this line to point to your config file
  train.py --config_path configs/YOUR_CONFIG.yml
- Optionally adjust the number of nodes and GPU type:
  resources:
    accelerators: H200:8  # 8 H200s per node
  num_nodes: 2  # Number of nodes
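When changing the accelerator and node settings, it helps to know the total GPU count the job will request. The sketch below parses an accelerator string such as `H200:8` and multiplies by the node count. The helper names are hypothetical, and treating a bare accelerator name without `:N` as a count of 1 follows SkyPilot's documented default.

```python
def parse_accelerators(spec):
    """Split 'NAME:COUNT' into (name, count); a bare name counts as 1."""
    name, _, count = spec.partition(":")
    return name, int(count) if count else 1


def total_gpus(spec, num_nodes):
    """Total GPUs across the job: GPUs per node times number of nodes."""
    if num_nodes < 1:
        raise ValueError("num_nodes must be at least 1")
    _, per_node = parse_accelerators(spec)
    return per_node * num_nodes


if __name__ == "__main__":
    # Values from the config fragment above: 8 H200s per node, 2 nodes.
    print(total_gpus("H200:8", 2))
```

With the values shown above (`H200:8`, `num_nodes: 2`) the job requests 16 GPUs in total, 8 per node, which matches the `--nproc_per_node=8` used in the single-node `torchrun` command.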
- Make sure you're authenticated with Google Cloud:
  gcloud auth login
  gcloud auth application-default login
- Make sure your Google Cloud project ID is set.