
missing libcuda.so #1448

Open
@garfieldnate

Description


cog version: cog version 0.8.6 (built 2023-08-07T21:51:56Z)
docker version: Docker version 24.0.7, build afdd53b
machine: LambdaLabs gpu_1x_a10, created fresh today
model code: https://github.com/garfieldnate/whisper-ts-cog
Replicate failure link: https://replicate.com/wordscenes/whisper-stable-ts/versions/5a2cde593e684640cd0f9b951ff727b65c69fb4bfe93126296209e072ca1d4fe?prediction=pzvetarbp3yshr4srqu4rhebae

After building my model and deploying it to Replicate, every prediction fails with this error:

File "/root/.pyenv/versions/3.11.7/lib/python3.11/site-packages/cog/server/worker.py", line 217, in _predict
result = predict(**payload)
^^^^^^^^^^^^^^^^^^
File "/src/predict.py", line 59, in predict
result = self.model.transcribe(
^^^^^^^^^^^^^^^^^^^^^^
File "/root/.pyenv/versions/3.11.7/lib/python3.11/site-packages/stable_whisper/whisper_word_level.py", line 554, in transcribe_stable
add_word_timestamps_stable(
File "/root/.pyenv/versions/3.11.7/lib/python3.11/site-packages/stable_whisper/timing.py", line 259, in add_word_timestamps_stable
align()
File "/root/.pyenv/versions/3.11.7/lib/python3.11/site-packages/stable_whisper/timing.py", line 225, in align
alignment = find_alignment_stable(model, tokenizer, text_tokens, mel, num_samples,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/.pyenv/versions/3.11.7/lib/python3.11/site-packages/stable_whisper/timing.py", line 79, in find_alignment_stable
weights = median_filter(weights, medfilt_width)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/.pyenv/versions/3.11.7/lib/python3.11/site-packages/whisper/timing.py", line 40, in median_filter
result = median_filter_cuda(x, filter_width)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/.pyenv/versions/3.11.7/lib/python3.11/site-packages/whisper/triton_ops.py", line 107, in median_filter_cuda
kernel[(grid,)](y, x, x.stride(-2), y.stride(-2), BLOCK_SIZE=BLOCK_SIZE)
File "<string>", line 63, in kernel
File "/root/.pyenv/versions/3.11.7/lib/python3.11/site-packages/triton/compiler/compiler.py", line 425, in compile
so_path = make_stub(name, signature, constants)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/.pyenv/versions/3.11.7/lib/python3.11/site-packages/triton/compiler/make_launcher.py", line 39, in make_stub
so = _build(name, src_path, tmpdir)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/.pyenv/versions/3.11.7/lib/python3.11/site-packages/triton/common/build.py", line 61, in _build
cuda_lib_dirs = libcuda_dirs()
^^^^^^^^^^^^^^
File "/root/.pyenv/versions/3.11.7/lib/python3.11/site-packages/triton/common/build.py", line 30, in libcuda_dirs
assert any(os.path.exists(os.path.join(path, 'libcuda.so')) for path in dirs), msg
AssertionError: libcuda.so cannot found!

I ran some diagnostic commands within the container:

find / -name libcuda.so
/usr/local/cuda-11.8/targets/x86_64-linux/lib/stubs/libcuda.so
/usr/local/cuda-11.8/compat/libcuda.so

ldconfig -p | grep libcuda
        libcudart.so.11.0 (libc6,x86-64) => /usr/local/cuda/targets/x86_64-linux/lib/libcudart.so.11.0
        libcudart.so (libc6,x86-64) => /usr/local/cuda/targets/x86_64-linux/lib/libcudart.so

export | grep nvidia
declare -x LD_LIBRARY_PATH="/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/lib/x86_64-linux-gnu:/usr/local/nvidia/lib64:/usr/local/nvidia/bin"
declare -x NVIDIA_REQUIRE_CUDA="cuda>=11.8 brand=tesla,driver>=470,driver<471 brand=unknown,driver>=470,driver<471 brand=nvidia,driver>=470,driver<471 brand=nvidiartx,driver>=470,driver<471 brand=geforce,driver>=470,driver<471 brand=geforcertx,driver>=470,driver<471 brand=quadro,driver>=470,driver<471 brand=quadrortx,driver>=470,driver<471 brand=titan,driver>=470,driver<471 brand=titanrtx,driver>=470,driver<471"
declare -x PATH="/root/.pyenv/shims:/root/.pyenv/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"

ls /usr/local/nvidia/lib64
ls: cannot access '/usr/local/nvidia/lib64': No such file or directory

So libcuda.so does exist in the image (as a toolkit stub and a compat copy), but the nvidia directory that is added to LD_LIBRARY_PATH does not exist.
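If the failure is just linker visibility, a possible workaround (untested here; the path is the compat copy found by the `find` above) would be to put that directory on the linker path before the model loads:

```shell
# Hypothetical workaround, not verified on Replicate: make the compat copy of
# libcuda.so visible to the dynamic linker so triton's libcuda lookup can find it.
export LD_LIBRARY_PATH="/usr/local/cuda-11.8/compat:${LD_LIBRARY_PATH}"

# Alternative sketch: register the directory with the linker cache instead
# (requires root inside the container):
#   echo /usr/local/cuda-11.8/compat > /etc/ld.so.conf.d/cuda-compat.conf
#   ldconfig
```

This only papers over the missing /usr/local/nvidia directories; it doesn't explain why they are absent.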

I have definitely specified "gpu: true" in cog.yaml, but I wonder if cog somehow concluded that this model does not need a GPU.
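For reference, the relevant fragment of the config (a minimal sketch of the usual Cog layout; see the linked repo for the full file):

```yaml
# Minimal cog.yaml excerpt requesting a GPU; the real file in the linked
# repo also pins Python and package versions under build:.
build:
  gpu: true
predict: "predict.py:Predictor"
```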

Labels: Backend (Issues with the replicate backend)
