Description
Bug Report
System information
- OS Platform and Distribution: RockyLinux 9.2
- TensorFlow Serving installed from (source or binary): Docker image
- TensorFlow Serving version: tensorflow/serving:2.14.1-gpu
- Docker 24.0.5
- NVIDIA R535 drivers 535.86.10
- NVIDIA Container Toolkit 1.13.5
Describe the problem
With the latest Docker image tensorflow/serving:2.14.1-gpu, I cannot run GPU inference with my model. The following error is shown in the logs:
```
2024-01-30 10:15:19.247458: W external/org_tensorflow/tensorflow/compiler/xla/service/gpu/llvm_gpu_backend/gpu_backend_lib.cc:521] Can't find libdevice directory ${CUDA_DIR}/nvvm/libdevice. This may result in compilation or runtime failures, if the program we try to run uses routines from libdevice.
Searched for CUDA in the following directories:
  ./cuda_sdk_lib
  /usr/local/cuda-11.8
  /usr/local/cuda
  /usr/bin/../nvidia/cuda_nvcc
  /usr/bin/../../nvidia/cuda_nvcc
  .
You can choose the search directory by setting xla_gpu_cuda_data_dir in HloModule's DebugOptions. For most apps, setting the environment variable XLA_FLAGS=--xla_gpu_cuda_data_dir=/path/to/cuda will work.
2024-01-30 10:15:19.259311: I external/org_tensorflow/tensorflow/tsl/platform/default/subprocess.cc:304] Start cannot spawn child process: No such file or directory
2024-01-30 10:15:19.260050: I external/org_tensorflow/tensorflow/tsl/platform/default/subprocess.cc:304] Start cannot spawn child process: No such file or directory
2024-01-30 10:15:19.260095: W external/org_tensorflow/tensorflow/compiler/xla/stream_executor/gpu/asm_compiler.cc:109] Couldn't get ptxas version : FAILED_PRECONDITION: Couldn't get ptxas/nvlink version string: INTERNAL: Couldn't invoke ptxas --version
...
2024-01-30 10:15:19.770155: W external/org_tensorflow/tensorflow/compiler/xla/service/gpu/llvm_gpu_backend/gpu_backend_lib.cc:559] libdevice is required by this HLO module but was not found at ./libdevice.10.bc
error: libdevice not found at ./libdevice.10.bc
```
It appears that the CUDA installation in the image is incomplete: the libdevice directory does not exist. I expected CUDA to be installed completely enough to serve models on GPU.
I encounter no problems with tensorflow/serving:2.11.0-gpu.
I considered the following solutions before raising this issue:
- Fix for "Couldn't invoke ptxas --version" with cuda-11.3 and jaxlib 0.1.66+cuda111
- Can’t find libdevice directory ${CUDA_DIR}/nvvm/libdevice
- Bug: Unable to find libdevice directory (${CUDA_DIR}/nvvm/libdevice) in GPU Images, Fix: Install cuda-nvcc to GPU images and update XLA_FLAGS
- Can't find libdevice directory ${CUDA_DIR}/nvvm/libdevice.
- Can't find libdevice directory ${CUDA_DIR}/nvvm/libdevice
- New optimizers fail to load CUDA installed through conda
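As both the log's own hint and the linked fix suggest, XLA_FLAGS can be pointed at a CUDA installation via the container environment. A sketch (the model name and host path are placeholders; this only helps once a CUDA toolkit with nvvm/libdevice is actually present in the image, which is not the case for the stock 2.14.1-gpu image):

```shell
# Point XLA at a CUDA installation inside the container.
# /usr/local/cuda-11.8 must actually contain nvvm/libdevice for this to work.
docker run --gpus all -p 8501:8501 \
  -e XLA_FLAGS=--xla_gpu_cuda_data_dir=/usr/local/cuda-11.8 \
  -e MODEL_NAME=my_model \
  -v "$PWD/my_model:/models/my_model" \
  tensorflow/serving:2.14.1-gpu
```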
Workaround
Install the cuda-toolkit package in the Docker image:
```Dockerfile
FROM tensorflow/serving:2.14.1-gpu
RUN apt-get update && apt-get install -y cuda-toolkit-11-8
```
This increases the size of the Docker image by ~4GB (uncompressed).
Alternatively, the tensorflow/serving:2.14.1-devel-gpu Docker image also works, but it is even larger.
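For completeness, this is how the two-line Dockerfile above can be built into a local image and used for serving; a sketch, assuming it is saved as `Dockerfile` in the current directory (the image tag, model name, and host path are placeholders chosen here for illustration):

```shell
# Build a patched serving image that includes cuda-toolkit-11-8
docker build -t tfserving-gpu-cuda:2.14.1 .

# Serve a model from the patched image with GPU access
docker run --gpus all -p 8501:8501 \
  -e MODEL_NAME=my_model \
  -v "$PWD/my_model:/models/my_model" \
  tfserving-gpu-cuda:2.14.1
```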
Exact Steps to Reproduce
```shell
docker run -u root:root -ti --entrypoint bash tensorflow/serving:2.14.1-gpu
```
ptxas is not available:
```shell
$ ptxas --version
bash: ptxas: command not found
```
Searching for a directory nvvm or libdevice returns nothing:
```shell
$ find / -type d -name nvvm 2>/dev/null
```
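The candidate directories from the XLA log output can also be probed in one pass. A small sketch, runnable inside or outside the container, that reports whether the libdevice bitcode file exists under each location XLA searches:

```shell
# Probe the CUDA search directories listed in the XLA log output.
# On the stock 2.14.1-gpu image, all four probes report "no libdevice".
for dir in ./cuda_sdk_lib /usr/local/cuda-11.8 /usr/local/cuda .; do
  if [ -e "$dir/nvvm/libdevice/libdevice.10.bc" ]; then
    echo "libdevice found under $dir"
  else
    echo "no libdevice under $dir"
  fi
done
```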
When using 2.11.0, it does work:
```shell
docker run -u root:root -ti --entrypoint bash tensorflow/serving:2.11.0-gpu
```
ptxas is available:
```shell
$ ptxas --version
ptxas: NVIDIA (R) Ptx optimizing assembler
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Sun_Feb_14_21:12:21_PST_2021
Cuda compilation tools, release 11.2, V11.2.152
Build cuda_11.2.r11.2/compiler.29618528_0
```
Searching for a directory nvvm returns the directory inside the CUDA installation:
```shell
$ find / -type d -name nvvm 2>/dev/null
/usr/local/cuda-11.2/nvvm
```