Failing to load the pre-trained weights on multi-gpus. #3767

Closed
@HamidShojanazeri

🐛 Bug

After downloading the pre-trained weights for the following models (AlexNet, ResNet-152, ResNet-18, SqueezeNet, VGG11), trying to load them on any GPU other than cuda:0 throws an error.

To Reproduce

wget https://download.pytorch.org/models/alexnet-owt-4df8aa71.pth

import torch
from torchvision.models.alexnet import AlexNet


class ImageClassifier(AlexNet):
    def __init__(self):
        super().__init__()


device1 = 'cuda:0'  # default device (not used below)
device2 = 'cuda:2'  # target device for the weights

model = ImageClassifier()
# The error is raised here, while mapping the checkpoint onto device2
state_dict = torch.load("alexnet-owt-4df8aa71.pth", map_location=device2)
model.load_state_dict(state_dict)
model = model.to(device2)

Error

File "test_device.py", line 16, in <module>
state_dict = torch.load("alexnet-owt-4df8aa71.pth", map_location=device2)........

RuntimeError: Attempted to set the storage of a tensor on device "cuda:0" to a storage on different device "cuda:2". This is no longer allowed; the devices must match

Expected behavior

The state_dict should load onto any CUDA device selected with map_location.

Environment

  • PyTorch / torchvision Version (e.g., 1.0 / 0.4.0): 1.7.1, 1.8.0, 1.8.1
  • OS (e.g., Linux): Ubuntu 18.04
  • How you installed PyTorch / torchvision (conda, pip, source): pip
  • Build command you used (if compiling from source):
  • Python version: 3.7
  • CUDA/cuDNN version: 10.2
  • GPU models and configuration: NVIDIA Tesla K80
  • Any other relevant information:

Additional context

These models are used in the TorchServe examples, which fail in a multi-GPU setting when a model has to be loaded on a CUDA device other than cuda:0. As a workaround, TorchServe first loads the state dicts on cuda:0 and then moves the model to the target device (cuda:<id>). This creates duplicate processes on two GPUs and increases the memory footprint.
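For comparison, here is a minimal sketch of a different workaround that keeps cuda:0 out of the picture: deserialize the checkpoint on the CPU, then move the model to the target device. The helper name load_weights_via_cpu is hypothetical and not part of PyTorch, torchvision, or TorchServe. Whether this avoids the error for every affected checkpoint is an assumption that has not been verified here.

```python
import torch
from torch import nn


def load_weights_via_cpu(model: nn.Module, checkpoint_path: str, device: str) -> nn.Module:
    """Hypothetical helper: load a state_dict on the CPU, then move the model.

    Assumption: the checkpoint at `checkpoint_path` is a plain state_dict,
    like the .pth files published for the torchvision models.
    """
    # map_location="cpu" keeps all tensors in host memory while deserializing,
    # so torch.load itself does not place anything on a CUDA device.
    state_dict = torch.load(checkpoint_path, map_location="cpu")
    model.load_state_dict(state_dict)
    # Only the requested target device receives the parameters.
    return model.to(device)
```

With the reproduction script above, the call would look like model = load_weights_via_cpu(ImageClassifier(), "alexnet-owt-4df8aa71.pth", "cuda:2"). The trade-off is one extra host-to-device copy of the weights in exchange for not allocating anything on cuda:0.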
