Description
🐛 Bug
Downloading the pre-trained weights for the following models (AlexNet, ResNet-152, ResNet-18, SqueezeNet, VGG11) and trying to load them on any GPU other than cuda:0 throws an error.
To Reproduce
wget https://download.pytorch.org/models/alexnet-owt-4df8aa71.pth

import torch
from torchvision.models.alexnet import AlexNet

class ImageClassifier(AlexNet):
    def __init__(self):
        super(ImageClassifier, self).__init__()

device1 = 'cuda:0'
device2 = 'cuda:2'

model = ImageClassifier()
state_dict = torch.load("alexnet-owt-4df8aa71.pth", map_location=device2)
model.load_state_dict(state_dict)
model = model.to(device2)
Error
File "test_device.py", line 16, in
state_dict = torch.load("alexnet-owt-4df8aa71.pth", map_location=device2)........
RuntimeError: Attempted to set the storage of a tensor on device "cuda:0" to a storage on different device "cuda:2". This is no longer allowed; the devices must match
Expected behavior
Be able to load the state_dict on any cuda device using map_location.
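For comparison, a minimal sketch of the workaround pattern I currently have to fall back to: deserialize the checkpoint onto the CPU first and only then move the whole model to the target GPU. The checkpoint file and the cuda:2 target are the same as in the repro above; that this avoids the storage-device check for all of the listed models is an assumption, not something I have verified for every checkpoint.

import torch
from torchvision.models.alexnet import AlexNet

class ImageClassifier(AlexNet):
    def __init__(self):
        super(ImageClassifier, self).__init__()

device2 = 'cuda:2'

# Deserialize every storage onto the CPU so nothing is placed on cuda:0.
state_dict = torch.load("alexnet-owt-4df8aa71.pth", map_location="cpu")

model = ImageClassifier()
model.load_state_dict(state_dict)

# Only now move the parameters to the intended GPU.
model = model.to(device2)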
Environment
- PyTorch / torchvision Version (e.g., 1.0 / 0.4.0): 1.7.1, 1.8.0, 1.8.1
- OS (e.g., Linux): Ubuntu 18.04
- How you installed PyTorch / torchvision (conda, pip, source): pip
- Build command you used (if compiling from source):
- Python version: 3.7
- CUDA/cuDNN version: 10.2
- GPU models and configuration: NVIDIA Tesla K80
- Any other relevant information:
Additional context
These models are used in the TorchServe examples, and in a multi-GPU setting they fail to load on any CUDA device other than cuda:0. As a workaround, TorchServe loads the state_dict on cuda:0 first and then moves the model to the target device (cuda:<id>); this results in duplicated processes on two GPUs and adds to the memory footprint.
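An alternative TorchServe-style workaround that never touches cuda:0 would be torch.load's documented dict (or callable) form of map_location, which remaps storage location tags at load time. Whether these forms also trip the reported RuntimeError on 1.7.1/1.8.x is something I have not verified, so treat this as a sketch:

import torch

target = 'cuda:2'

# Remap any storage tagged cuda:0 in the checkpoint directly to the target
# device, so nothing is ever materialized on cuda:0.
state_dict = torch.load(
    "alexnet-owt-4df8aa71.pth",
    map_location={'cuda:0': target},
)

# Equivalent callable form: remap every storage regardless of its saved tag.
# state_dict = torch.load(
#     "alexnet-owt-4df8aa71.pth",
#     map_location=lambda storage, loc: storage.cuda(2),
# )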