Slide: https://docs.google.com/presentation/d/1y15fzHoBhjBFpmy44Q3VrW0gJMS7ePhCVsRgXNmqC-g/edit?usp=sharing
Allocate node:
- Single Node: salloc --nodes 1 --qos interactive --time 01:00:00 --constraint gpu --gpus-per-node 4 --account=<acc_name>_g
- Multinode: salloc --nodes 2 --qos interactive --time 00:30:00 --ntasks-per-node=4 --cpus-per-task=32 --constraint gpu --gpus-per-node 4 --account=<acc_name>_g
- Run (multinode scripts 9-11): ./multinode_setup_and_run.sh <filename>.py
Enable PyTorch: module load pytorch/2.6.0
Deep profiling:
- pip install -U tensorboard torch-tb-profiler
- **Add a profiling context in the code** (a minimal sketch follows this list)
- tensorboard --logdir=./log
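The profiling context is the torch.profiler API writing traces into the directory you later point tensorboard --logdir at. A minimal sketch, using a placeholder model and random data rather than the workshop's MNIST pipeline:

```python
import torch
from torch.profiler import ProfilerActivity, profile, schedule, tensorboard_trace_handler

# Placeholder model/optimizer; swap in your own training objects.
model = torch.nn.Linear(10, 10).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    schedule=schedule(wait=1, warmup=1, active=3, repeat=1),
    on_trace_ready=tensorboard_trace_handler("./log"),  # same dir as --logdir above
    record_shapes=True,
    profile_memory=True,
) as prof:
    for step in range(8):
        x = torch.randn(32, 10, device="cuda")
        loss = model(x).sum()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        prof.step()  # tell the profiler one training step has finished
```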
Debug Tips:
- Check for slow imports: python -X importtime file.py
- When using print, add flush=True so output appears immediately, e.g. print('test', flush=True)
Example files:
- 0_torch_dist.py: Launches 4 processes and performs all_reduce
- 1_torch_dist_gpu.py: Launches 4 processes and performs all_reduce on GPU
- 2_mnist_training.py: A simple MNIST classification pipeline using a single GPU
- 3_mnist_distributed.py: 4 identical processes performing exactly the same task - the simple MNIST classification pipeline on GPU
- 4_mnist_manual_ddp.py: A manual implementation of DDP on the previous pipeline, with a distributed sampler
- 5_mnist_manual_ddp_profile.py: TensorBoard profiler on the previous code, with fewer epochs for smaller profile data. Output saved to ./log_ddp; inspect with tensorboard --logdir=./log_ddp
- 6_mnist_ddp_pt.py: PyTorch wrapper for DDP; no manual all_reduce needed
- 7_mnist_ddp_pt_timing.py: PyTorch DDP code with arg = number of GPUs. Prints the time taken for model training; arg = 4 should give a lower runtime
- 8_mnist_ddp_pt_lr.py: Adjustment of the learning rate so that the loss curve matches the single-GPU run
- test-profile.py: A smaller DDP ML pipeline with dummy data and TensorBoard profiling for easier inspection. Output saved to ./log; inspect with tensorboard --logdir=./log
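For reference, the core pattern these scripts build toward (process-group init, a DistributedSampler, and the DDP wrapper) looks roughly like the sketch below. The model and dataset here are dummy placeholders, not the MNIST pipeline, and the environment variables are assumed to be set by the launcher:

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

def main():
    # Assumes the launcher (e.g. torchrun or the repo's launch script) sets
    # RANK, WORLD_SIZE, and LOCAL_RANK for each process.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Dummy data in place of MNIST.
    dataset = TensorDataset(torch.randn(1024, 784), torch.randint(0, 10, (1024,)))
    sampler = DistributedSampler(dataset)  # each rank sees a disjoint shard
    loader = DataLoader(dataset, batch_size=64, sampler=sampler)

    model = DDP(torch.nn.Linear(784, 10).cuda(local_rank), device_ids=[local_rank])
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = torch.nn.CrossEntropyLoss()

    for epoch in range(2):
        sampler.set_epoch(epoch)  # keep shuffling consistent across ranks
        for x, y in loader:
            x, y = x.cuda(local_rank), y.cuda(local_rank)
            optimizer.zero_grad()
            loss_fn(model(x), y).backward()  # DDP averages gradients across ranks
            optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```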
Useful Commands:
- Multinode allocation: salloc --nodes 2 --qos interactive --time 00:30:00 --ntasks-per-node=4 --cpus-per-task=32 --constraint gpu --gpus-per-node 4 --account=<acc_name>_g
- module load pytorch/2.6.0
- Run: ./multinode_setup_and_run.sh <filename>.py
Multinode example files:
- 9_torch_dist_multi_node.py: all_reduce example for multinode
- 10_mnist_ddp_pt_multinode.py: MNIST training pipeline for multinode DDP
- 11_mnist_ddp_pt_multinode_streaming.py: MNIST training pipeline for multinode DDP with a streaming dataloader from disk
- create_mnist_h5.py: Creates a chunked MNIST dataset for streaming
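As a rough illustration of what a multinode all_reduce exercise boils down to (the environment variables named here are the standard torch.distributed ones, assumed to be exported by the setup script, not something defined in this repo):

```python
import os
import torch
import torch.distributed as dist

# Assumes MASTER_ADDR, MASTER_PORT, RANK, WORLD_SIZE, and LOCAL_RANK are set
# for every process across all nodes.
dist.init_process_group(backend="nccl", init_method="env://")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

# Each rank contributes its rank id; after all_reduce every rank holds the sum.
t = torch.tensor([dist.get_rank()], dtype=torch.float32, device="cuda")
dist.all_reduce(t, op=dist.ReduceOp.SUM)
print(f"rank {dist.get_rank()} sees {t.item()}", flush=True)

dist.destroy_process_group()
```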