
GPU inference in Docker container fails due to missing libdevice directory #2201

Closed
@MartijnVanbiervliet

Description


Bug Report

System information

  • OS Platform and Distribution: RockyLinux 9.2
  • TensorFlow Serving installed from (source or binary): Docker image
  • TensorFlow Serving version: tensorflow/serving:2.14.1-gpu
  • Docker version: 24.0.5
  • NVIDIA driver: R535 (535.86.10)
  • NVIDIA Container Toolkit: 1.13.5

Describe the problem

With the latest Docker image, tensorflow/serving:2.14.1-gpu, I cannot run GPU inference of my model. The following error appears in the logs:

2024-01-30 10:15:19.247458: W external/org_tensorflow/tensorflow/compiler/xla/service/gpu/llvm_gpu_backend/gpu_backend_lib.cc:521] Can't find libdevice directory ${CUDA_DIR}/nvvm/libdevice. This may result in compilation or runtime failures, if the program we try to run uses routines from libdevice.
Searched for CUDA in the following directories:
  ./cuda_sdk_lib
  /usr/local/cuda-11.8
  /usr/local/cuda
  /usr/bin/../nvidia/cuda_nvcc
  /usr/bin/../../nvidia/cuda_nvcc
  .
You can choose the search directory by setting xla_gpu_cuda_data_dir in HloModule's DebugOptions.  For most apps, setting the environment variable XLA_FLAGS=--xla_gpu_cuda_data_dir=/path/to/cuda will work.
2024-01-30 10:15:19.259311: I external/org_tensorflow/tensorflow/tsl/platform/default/subprocess.cc:304] Start cannot spawn child process: No such file or directory
2024-01-30 10:15:19.260050: I external/org_tensorflow/tensorflow/tsl/platform/default/subprocess.cc:304] Start cannot spawn child process: No such file or directory
2024-01-30 10:15:19.260095: W external/org_tensorflow/tensorflow/compiler/xla/stream_executor/gpu/asm_compiler.cc:109] Couldn't get ptxas version : FAILED_PRECONDITION: Couldn't get ptxas/nvlink version string: INTERNAL: Couldn't invoke ptxas --version
...
2024-01-30 10:15:19.770155: W external/org_tensorflow/tensorflow/compiler/xla/service/gpu/llvm_gpu_backend/gpu_backend_lib.cc:559] libdevice is required by this HLO module but was not found at ./libdevice.10.bc
error: libdevice not found at ./libdevice.10.bc

It appears that the CUDA toolkit is only partially installed: the libdevice directory does not exist in the Docker image. I expected the image to ship a complete enough CUDA installation to serve models on GPU.

I encounter no problems with tensorflow/serving:2.11.0-gpu.

I considered the following solutions before raising this issue:

Workaround

Install the cuda-toolkit package in the Docker image.

FROM tensorflow/serving:2.14.1-gpu
RUN apt-get update && apt-get install -y cuda-toolkit-11-8

This increases the size of the Docker image by ~4GB (uncompressed).
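A smaller middle ground might be to install only the nvcc component rather than the full toolkit. This is a sketch I have not verified: it assumes the NVIDIA apt repository is already configured in the base image and that the cuda-nvcc-11-8 package ships both ptxas and the nvvm/libdevice directory (which matches my understanding of the CUDA 11.8 apt packaging, but please double-check).

```Dockerfile
# Unverified sketch: assumes cuda-nvcc-11-8 provides ptxas and
# nvvm/libdevice, and that the NVIDIA apt repo is configured in the base image.
FROM tensorflow/serving:2.14.1-gpu
RUN apt-get update && \
    apt-get install -y --no-install-recommends cuda-nvcc-11-8 && \
    rm -rf /var/lib/apt/lists/*
```

If that package does contain both pieces, the resulting image should be considerably smaller than one with the full cuda-toolkit-11-8 installed.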

Alternatively, the tensorflow/serving:2.14.1-devel-gpu Docker image also works, but that image is even larger.
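Another option suggested by the log's XLA_FLAGS hint would be to bind-mount a host CUDA installation into the container. This is an untested sketch; the host path /usr/local/cuda-11.8 and the model name/path are hypothetical placeholders that depend on the host setup:

```shell
# Untested sketch: /usr/local/cuda-11.8 is a hypothetical host path; point it
# at wherever the host's CUDA toolkit (containing nvvm/libdevice) is installed.
docker run --gpus all -p 8501:8501 \
  -v /usr/local/cuda-11.8/nvvm:/usr/local/cuda/nvvm:ro \
  -e XLA_FLAGS=--xla_gpu_cuda_data_dir=/usr/local/cuda \
  -e MODEL_NAME=my_model \
  -v /path/to/my_model:/models/my_model \
  tensorflow/serving:2.14.1-gpu
```

Note that this only addresses the libdevice lookup; the "Couldn't invoke ptxas" warnings suggest the ptxas binary is also missing from the image, so it would still need to be provided some other way (e.g. via the cuda-toolkit install above).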

Exact Steps to Reproduce

docker run -u root:root -ti --entrypoint bash tensorflow/serving:2.14.1-gpu

ptxas is not available:

$ ptxas --version
bash: ptxas: command not found

Searching for a directory nvvm or libdevice returns nothing:

find / -type d -name nvvm 2>/dev/null
find / -type d -name libdevice 2>/dev/null

When using 2.11.0, it does work:

docker run -u root:root -ti --entrypoint bash tensorflow/serving:2.11.0-gpu

ptxas is available:

$ ptxas --version
ptxas: NVIDIA (R) Ptx optimizing assembler
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Sun_Feb_14_21:12:21_PST_2021
Cuda compilation tools, release 11.2, V11.2.152
Build cuda_11.2.r11.2/compiler.29618528_0

Searching for a directory nvvm returns the directory in the cuda installation directory:

$ find / -type d -name nvvm 2>/dev/null
/usr/local/cuda-11.2/nvvm

Labels

stale, stat:awaiting response, type:bug
