Performance analysis of PyTorch
The PyTorch port to ROCm is under active development, especially with regard to performance. We are focusing our efforts on server-grade accelerators (MI25/MI60/...), but the following applies to all supported AMD hardware.
We supply a small microbenchmarking script for PyTorch training on ROCm. To use it, download micro_benchmarking_pytorch.py and fp16util.py.
To execute:
python micro_benchmarking_pytorch.py --network <network name> [--batch-size <batch size>] [--iterations <number of iterations>] [--fp16 <0 or 1>]
Possible network names are: alexnet, densenet121, inception_v3, resnet50, resnet101, SqueezeNet, and vgg16.
The defaults are 10 training iterations, fp16 off (i.e., 0), and a batch size of 64.
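When the benchmark is driven from another Python script (for example, to sweep over several networks), the command line above can be assembled programmatically. The following is a minimal sketch, not part of the benchmark itself: the helper name `build_benchmark_command` is hypothetical, and it assumes micro_benchmarking_pytorch.py and fp16util.py are in the current working directory. Only the flags documented above are used; an option left as `None` is omitted so the script falls back to its own default.

```python
import shlex
import sys


def build_benchmark_command(network, batch_size=None, iterations=None, fp16=None,
                            script="micro_benchmarking_pytorch.py"):
    """Return the argv list for one benchmark run (hypothetical helper).

    Options left as None are not passed, so the script's defaults apply.
    """
    cmd = [sys.executable, script, "--network", network]
    if batch_size is not None:
        cmd += ["--batch-size", str(batch_size)]
    if iterations is not None:
        cmd += ["--iterations", str(iterations)]
    if fp16 is not None:
        cmd += ["--fp16", str(int(fp16))]
    return cmd


# Build and show the command for a resnet50 run with fp16 enabled.
example = build_benchmark_command("resnet50", batch_size=64, iterations=10, fp16=1)
print(shlex.join(example))
```

To launch the run, pass the returned list to `subprocess.run(example, check=True)`.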
If performance on a specific card and/or model is found to be lacking, some gains can typically be made by tuning MIOpen. To do so, export MIOPEN_FIND_ENFORCE=3 prior to running the model. If untuned configurations are encountered, this will take some time and write the results to a local performance database. More information on this can be found in the MIOpen documentation.
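If the benchmark is launched from Python rather than from a shell, the same variable can be set in the environment of the child process. This is a minimal sketch under that assumption: the helper names `miopen_tuning_env` and `run_with_tuning` are hypothetical, and the only setting taken from this page is MIOPEN_FIND_ENFORCE=3.

```python
import os
import subprocess


def miopen_tuning_env(base_env=None):
    """Return a copy of the environment with MIOpen tuning enabled.

    Hypothetical helper: copies the given mapping (or os.environ) so the
    caller's environment is left unchanged.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["MIOPEN_FIND_ENFORCE"] = "3"
    return env


def run_with_tuning(cmd):
    """Run the command list `cmd` with the tuning variable set for that process only."""
    return subprocess.run(cmd, env=miopen_tuning_env(), check=True)
```

Setting the variable on a copy of the environment confines the tuning to the benchmark run and leaves the parent process unchanged.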