
🐛 [Bug] Loading Torch-TensorRT models (.ts) on multiple GPUs (in TorchServe) #1888

@emilwallner

Bug Description

Everything works well when I'm using 1 GPU, but as soon as I try to load a model on 4 separate GPUs, I get this error:

MODEL_LOG - RuntimeError: [Error thrown at core/runtime/TRTEngine.cpp:42] Expected most_compatible_device to be true but got false
MODEL_LOG - No compatible device was found for instantiating TensorRT engine

To Reproduce

Steps to reproduce the behavior:

Create a (.ts) model and load it on 4 different GPUs. I don't know if this is specific to TorchServe, or a general issue.

Here's the simple version (TorchServe Handler):

def initialize(self, ctx):
    properties = ctx.system_properties
    self.device = torch.device("cuda:" + str(properties.get("gpu_id")) if torch.cuda.is_available() else "cpu")
    self.model = torch.jit.load('model.ts')

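For clarity, the gpu_id-to-device mapping in the handler above can be sketched as a standalone function (the name `pick_device_string` and the harness are my own illustration, not part of the TorchServe API):

```python
# Minimal sketch of the handler's device-selection logic, factored out so it
# can be exercised without a GPU. The function name is hypothetical.
def pick_device_string(gpu_id, cuda_available):
    """Mirror the handler's choice: "cuda:<id>" when CUDA is usable and a
    worker GPU id was assigned, otherwise fall back to the CPU."""
    if cuda_available and gpu_id is not None:
        return "cuda:" + str(gpu_id)
    return "cpu"

print(pick_device_string(2, True))    # → cuda:2
print(pick_device_string(2, False))   # → cpu
```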
I'm not sure if it relates to this issue. From what I can tell, it seems I need to restrict the CUDA context; however, the GPU is already assigned in the handler. I tried the following, but it still gives me the same error.

def initialize(self, ctx):
    properties = ctx.system_properties
    self.device = torch.device("cuda:" + str(properties.get("gpu_id")) if torch.cuda.is_available() else "cpu")
    torch.cuda.set_device(self.device)
    torch_tensorrt.set_device(int(properties.get("gpu_id")))

    with torch.cuda.device(int(properties.get("gpu_id"))):
        self.model = torch.jit.load('model.ts')
        self.model.to(self.device)
        self.model.eval()
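Another way to "restrict the CUDA context" is to mask devices at the process level with CUDA_VISIBLE_DEVICES before CUDA initializes, so each worker only ever sees one GPU. A minimal sketch of the idea (the helper name is my own, and I haven't verified this fixes the TorchServe case):

```python
import os

# Hedged sketch: pin a worker process to a single GPU by setting
# CUDA_VISIBLE_DEVICES *before* torch/CUDA initializes. Inside the process,
# the chosen GPU is then addressed as device index 0. The function name
# `pin_worker_to_gpu` is hypothetical.
def pin_worker_to_gpu(gpu_id):
    os.environ["CUDA_VISIBLE_DEVICES"] = str(gpu_id)
    # After masking, the only visible device is "cuda:0" from the
    # worker's point of view.
    return "cuda:0"

print(pin_worker_to_gpu(3))                 # → cuda:0
print(os.environ["CUDA_VISIBLE_DEVICES"])   # → 3
```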

I also tried mapping the model straight to the GPU on load (passing the target device as `map_location` to `torch.jit.load`), but hit the same problem.

Expected behavior

A .ts model should load without errors on whichever GPU is specified by its ID.

Environment

Build information about Torch-TensorRT can be found by turning on debug messages

Official PyTorch image: nvcr.io/nvidia/pytorch:22.12-py3
GPUs: 4x NVIDIA A10G
PyTorch: 1.14.0a0+410ce96
NVIDIA CUDA 11.8.0
TensorRT 8.5.1
Ubuntu 20.04 including Python 3.8
NVIDIA cuBLAS 11.11.3.6
NVIDIA cuDNN 8.7.0.84
NVIDIA NCCL 2.15.5 (optimized for NVIDIA NVLink®)
NVIDIA RAPIDS™ 22.10.01 (for x86, only these libraries are included: cudf, xgboost, rmm, cuml, and cugraph)
Apex
rdma-core 36.0
NVIDIA HPC-X 2.13
OpenMPI 4.1.4+
GDRCopy 2.3
TensorBoard 2.9.0
Nsight Compute 2022.3.0.0
Nsight Systems 2022.4.2.1
Torch-TensorRT 1.1.0a0
NVIDIA DALI® 1.20.0
MAGMA 2.6.2
JupyterLab 2.3.2 including Jupyter-TensorBoard
TransformerEngine 0.3.0

Additional context
