Multi-GPU Training with PyTorch: Data and Model Parallelism

About

This repository demonstrates multi-GPU training with PyTorch. Part 1 covers how to optimize single-GPU training. It then shows the code changes needed to enable multi-GPU training with the data-parallel and model-parallel approaches.

Software Environment Setup

If this is your first time using Conda, first follow this guide on building Conda environments: https://www.carc.usc.edu/user-guides/data-science/building-conda-environment

$ ssh <YourNetID>@discovery.usc.edu  # VPN required if off-campus
$ salloc --partition=gpu --gres=gpu:1 --cpus-per-task=8 --mem=32GB --time=1:00:00
$ mamba create --name torch-env
$ mamba activate torch-env
$ mamba install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia
$ mamba install line_profiler --channel conda-forge
$ git clone https://github.com/uschpc/multi_gpu_training.git
$ cd multi_gpu_training
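
Once the environment is built, it can help to confirm that PyTorch imports and can see the allocated GPU before running any training scripts. The following is a minimal sketch of such a check, not part of the repository: the function name `describe_environment` and the report keys are hypothetical, and it assumes the script is run inside the activated `torch-env` environment on the compute node obtained with `salloc`.

```python
import importlib.util


def describe_environment():
    """Return a small report on whether PyTorch and CUDA GPUs are usable.

    Hypothetical helper for illustration; the keys below are not defined
    by PyTorch or by this repository.
    """
    report = {
        "torch_installed": False,
        "torch_version": None,
        "cuda_available": False,
        "gpu_count": 0,
    }
    # find_spec returns None instead of raising when torch is not installed.
    if importlib.util.find_spec("torch") is None:
        return report

    import torch

    report["torch_installed"] = True
    report["torch_version"] = torch.__version__
    # True only if a CUDA-enabled build of PyTorch can reach a GPU driver.
    report["cuda_available"] = torch.cuda.is_available()
    # Number of GPUs visible to this process (0 when CUDA is unavailable).
    report["gpu_count"] = torch.cuda.device_count()
    return report


if __name__ == "__main__":
    for key, value in describe_environment().items():
        print(f"{key}: {value}")
```

If `cuda_available` is reported as false on a GPU node, the likely causes are a CPU-only PyTorch build or a session that was not started on the allocated node; the sketch only reports the state and does not diagnose which.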
