A starter repository for Vector Institute researchers working on the Bon Echo and Killarney high-performance computing clusters. This playbook covers what you need to run machine learning experiments at scale, from basic cluster usage to advanced distributed training workflows.
This repository provides two main components, getting-started resources and training templates.
Getting started:
- Cluster Introduction: Complete guide to connecting to and using Vector compute resources
- Slurm Examples: Real-world examples showing how to submit jobs, run distributed training, and use cluster services
- Migration Guide: Instructions for moving from the legacy Bon Echo cluster to the new Killarney cluster
Templates:
- Ready-to-run examples for different ML domains (LLM, VLM, MLP, RL)
- Hydra + Submitit integration for configurable experiments and hyperparameter sweeps
- Cluster-optimized configs for different hardware setups (A40, A100, H100, L40S)
- Checkpointing & requeue support for long-running jobs
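To illustrate the checkpointing and requeue idea mentioned above, here is a minimal, standard-library-only sketch of a resumable loop. It is not the playbook's implementation: the `train` function, the JSON state format, and the `stop_after` parameter (which simulates a preemption) are all hypothetical names chosen for this example.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Optional


def train(checkpoint_path: Path, total_steps: int, stop_after: Optional[int] = None) -> dict:
    """Toy training loop that resumes from checkpoint_path if it exists.

    stop_after simulates a preemption: this run exits after that many steps.
    """
    # Resume from the last saved state, or start fresh.
    if checkpoint_path.exists():
        state = json.loads(checkpoint_path.read_text())
    else:
        state = {"step": 0, "loss": 1.0}

    steps_this_run = 0
    while state["step"] < total_steps:
        if stop_after is not None and steps_this_run >= stop_after:
            break  # simulated preemption
        state["step"] += 1
        state["loss"] *= 0.9  # stand-in for a real optimization step
        steps_this_run += 1

        # Write to a temporary file, then rename, so an interruption
        # mid-write does not leave a truncated checkpoint behind.
        tmp_path = checkpoint_path.with_suffix(".tmp")
        tmp_path.write_text(json.dumps(state))
        os.replace(tmp_path, checkpoint_path)
    return state


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        ckpt = Path(tmp) / "ckpt.json"
        first = train(ckpt, total_steps=10, stop_after=4)  # interrupted run
        second = train(ckpt, total_steps=10)  # requeued run picks up at step 4
        print(first["step"], second["step"])  # prints: 4 10
```

The point of the pattern is that a requeued job runs the same entry point again and finds its own progress on disk, so no work before the last checkpoint is repeated.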
- Access to Vector Institute compute clusters (Bon Echo or Killarney)
- uv package manager installed
# Clone the repository
git clone https://github.com/VectorInstitute/vec-playbook.git
cd vec-playbook
# Install dependencies
uv sync
Edit templates/configs/user.yaml with your Slurm account details:
user:
  slurm:
    account: YOUR_ACCOUNT
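Before launching anything, it can help to confirm that the placeholder was actually replaced. The following is a rough sketch that assumes the file keeps the `account:` key shown above; it does a simple line-based check rather than real YAML parsing, and the helper names `slurm_account` and `account_is_set` are hypothetical.

```python
import re
from pathlib import Path
from typing import Optional

PLACEHOLDER = "YOUR_ACCOUNT"


def slurm_account(config_text: str) -> Optional[str]:
    """Return the value of the first `account:` key, or None if there is none."""
    match = re.search(r"^\s*account:\s*([^\s#]+)", config_text, flags=re.MULTILINE)
    if match is None:
        return None
    return match.group(1).strip("'\"")  # tolerate quoted values


def account_is_set(config_text: str) -> bool:
    """True when an account is present and is not the template placeholder."""
    account = slurm_account(config_text)
    return account is not None and account != PLACEHOLDER


if __name__ == "__main__":
    config_path = Path("templates/configs/user.yaml")
    if config_path.exists():
        ok = account_is_set(config_path.read_text())
        print("Slurm account configured" if ok else "Edit user.yaml first")
```

A proper YAML loader would be more robust than a regular expression; the regex keeps this sketch dependency-free.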
# Simple MLP training on Killarney L40S
uv run python -m mlp.single.launch compute=killarney/l40s_1x requeue=off --multirun
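The launch command above has three parts: a launcher module (`mlp.single.launch`), Hydra-style `key=value` overrides, and the `--multirun` flag. As a sketch of how the same command could be assembled from a script, the hypothetical helper below builds the argument list; only the module name and override values shown in the command above come from this repository.

```python
import shlex
from typing import Dict, List


def launch_command(module: str, overrides: Dict[str, str], multirun: bool = True) -> List[str]:
    """Build the argument list for a `uv run python -m <module>` launch."""
    cmd = ["uv", "run", "python", "-m", module]
    # Hydra overrides are passed as plain key=value tokens.
    cmd += [f"{key}={value}" for key, value in overrides.items()]
    if multirun:
        cmd.append("--multirun")
    return cmd


if __name__ == "__main__":
    cmd = launch_command(
        "mlp.single.launch",
        {"compute": "killarney/l40s_1x", "requeue": "off"},
    )
    # Print the command for inspection; pass `cmd` to subprocess.run to execute it.
    print(shlex.join(cmd))
```

Building the command as a list and printing it first gives a dry run that can be checked before anything is submitted to the cluster.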
- Start here: Getting Started Documentation - Learn the basics of Vector compute
- Try examples: Slurm Examples - Run simple jobs to get familiar
- Use templates: Templates - Run ML training experiments
- Templates: templates/ - Training workflows
- Configs: templates/configs/ - Cluster and experiment configurations
- Advanced: templates/README.md - Detailed usage instructions
- A40 GPUs: 1x, 4x configurations
- A100 GPUs: 1x, 4x configurations
- H100 GPUs: 1x, 8x configurations
- L40S GPUs: 1x, 2x configurations
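The list above can be captured as a small lookup table, which makes it easy to reject an unsupported request before submitting a job. This is only an illustrative sketch: the `GPU_CONFIGS` table mirrors the list above, while the function name `is_listed` is hypothetical and not part of the playbook.

```python
from typing import Dict, Tuple

# GPU type -> GPU counts listed in this README.
GPU_CONFIGS: Dict[str, Tuple[int, ...]] = {
    "A40": (1, 4),
    "A100": (1, 4),
    "H100": (1, 8),
    "L40S": (1, 2),
}


def is_listed(gpu: str, count: int) -> bool:
    """True if the (GPU type, count) pair appears in the list above."""
    return count in GPU_CONFIGS.get(gpu.upper(), ())


if __name__ == "__main__":
    print(is_listed("l40s", 2))  # prints: True
    print(is_listed("H100", 4))  # prints: False
```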
vec-playbook/
├── getting-started/ # 📖 Learning resources
│ ├── introduction-to-vector-compute/ # Cluster basics
│ └── slurm-examples/ # 🧪 Hands-on examples
├── templates/ # 🧬 ML training templates
│ ├── src/ # Template source code
│ └── configs/ # Cluster & experiment configs
└── README.md # This file
We welcome contributions, including:
- New training templates
- Additional cluster configurations
- Documentation improvements
- Bug fixes