Description
cog version: 0.8.6 (built 2023-08-07T21:51:56Z)
docker version: Docker version 24.0.7, build afdd53b
machine: LambdaLabs gpu_1x_a10, created fresh today
model code: https://github.com/garfieldnate/whisper-ts-cog
Replicate failure link: https://replicate.com/wordscenes/whisper-stable-ts/versions/5a2cde593e684640cd0f9b951ff727b65c69fb4bfe93126296209e072ca1d4fe?prediction=pzvetarbp3yshr4srqu4rhebae
After building my model and deploying it to Replicate, every prediction fails with the following error:
File "/root/.pyenv/versions/3.11.7/lib/python3.11/site-packages/cog/server/worker.py", line 217, in _predict
result = predict(**payload)
^^^^^^^^^^^^^^^^^^
File "/src/predict.py", line 59, in predict
result = self.model.transcribe(
^^^^^^^^^^^^^^^^^^^^^^
File "/root/.pyenv/versions/3.11.7/lib/python3.11/site-packages/stable_whisper/whisper_word_level.py", line 554, in transcribe_stable
add_word_timestamps_stable(
File "/root/.pyenv/versions/3.11.7/lib/python3.11/site-packages/stable_whisper/timing.py", line 259, in add_word_timestamps_stable
align()
File "/root/.pyenv/versions/3.11.7/lib/python3.11/site-packages/stable_whisper/timing.py", line 225, in align
alignment = find_alignment_stable(model, tokenizer, text_tokens, mel, num_samples,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/.pyenv/versions/3.11.7/lib/python3.11/site-packages/stable_whisper/timing.py", line 79, in find_alignment_stable
weights = median_filter(weights, medfilt_width)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/.pyenv/versions/3.11.7/lib/python3.11/site-packages/whisper/timing.py", line 40, in median_filter
result = median_filter_cuda(x, filter_width)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/.pyenv/versions/3.11.7/lib/python3.11/site-packages/whisper/triton_ops.py", line 107, in median_filter_cuda
kernel[(grid,)](y, x, x.stride(-2), y.stride(-2), BLOCK_SIZE=BLOCK_SIZE)
File "<string>", line 63, in kernel
File "/root/.pyenv/versions/3.11.7/lib/python3.11/site-packages/triton/compiler/compiler.py", line 425, in compile
so_path = make_stub(name, signature, constants)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/.pyenv/versions/3.11.7/lib/python3.11/site-packages/triton/compiler/make_launcher.py", line 39, in make_stub
so = _build(name, src_path, tmpdir)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/.pyenv/versions/3.11.7/lib/python3.11/site-packages/triton/common/build.py", line 61, in _build
cuda_lib_dirs = libcuda_dirs()
^^^^^^^^^^^^^^
File "/root/.pyenv/versions/3.11.7/lib/python3.11/site-packages/triton/common/build.py", line 30, in libcuda_dirs
assert any(os.path.exists(os.path.join(path, 'libcuda.so')) for path in dirs), msg
AssertionError: libcuda.so cannot found!
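Judging by the bottom frames, `libcuda_dirs()` collects a set of candidate directories and asserts that at least one of them contains `libcuda.so`. Here is a rough standalone reproduction of that check (the exact search logic is my guess and varies between triton versions, but I believe it scans the dynamic linker cache and `LD_LIBRARY_PATH`):

```python
# Rough reproduction of triton's libcuda.so lookup (assumed logic; the real
# check lives in triton/common/build.py and differs between versions).
import os
import subprocess

# Candidate directories: libcuda entries in the linker cache, plus LD_LIBRARY_PATH.
cache = subprocess.check_output(["ldconfig", "-p"]).decode()
cache_dirs = {
    os.path.dirname(line.split()[-1])
    for line in cache.splitlines()
    if "libcuda.so" in line
}
env_dirs = set(os.environ.get("LD_LIBRARY_PATH", "").split(":"))

hits = [d for d in cache_dirs | env_dirs
        if d and os.path.exists(os.path.join(d, "libcuda.so"))]
print(hits)  # [] inside this container, hence the assertion failure
```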
I ran some diagnostic commands within the container:
```
$ find / -name libcuda.so
/usr/local/cuda-11.8/targets/x86_64-linux/lib/stubs/libcuda.so
/usr/local/cuda-11.8/compat/libcuda.so

$ ldconfig -p | grep libcuda
	libcudart.so.11.0 (libc6,x86-64) => /usr/local/cuda/targets/x86_64-linux/lib/libcudart.so.11.0
	libcudart.so (libc6,x86-64) => /usr/local/cuda/targets/x86_64-linux/lib/libcudart.so

$ export | grep nvidia
declare -x LD_LIBRARY_PATH="/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/lib/x86_64-linux-gnu:/usr/local/nvidia/lib64:/usr/local/nvidia/bin"
declare -x NVIDIA_REQUIRE_CUDA="cuda>=11.8 brand=tesla,driver>=470,driver<471 brand=unknown,driver>=470,driver<471 brand=nvidia,driver>=470,driver<471 brand=nvidiartx,driver>=470,driver<471 brand=geforce,driver>=470,driver<471 brand=geforcertx,driver>=470,driver<471 brand=quadro,driver>=470,driver<471 brand=quadrortx,driver>=470,driver<471 brand=titan,driver>=470,driver<471 brand=titanrtx,driver>=470,driver<471"
declare -x PATH="/root/.pyenv/shims:/root/.pyenv/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"

$ ls /usr/local/nvidia/lib64
ls: cannot access '/usr/local/nvidia/lib64': No such file or directory
```
So libcuda.so does exist in the image (as a stub and as a compat library), but the /usr/local/nvidia directories added to LD_LIBRARY_PATH do not exist, and the linker cache only knows about libcudart, not libcuda itself. My understanding is that /usr/local/nvidia is normally populated by the NVIDIA container runtime when a container is started with GPU access, so its absence suggests the host driver libraries aren't being injected at all. I have definitely specified `gpu: true` in cog.yaml, but I wonder if somehow cog thinks I don't need a GPU for this model.
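For reference, the relevant part of my cog.yaml (excerpted; the `gpu` flag sits under `build:` as described in the cog docs):

```yaml
build:
  gpu: true
  python_version: "3.11"
```

As a possible stopgap, I'm considering exposing the compat library that does exist in the image before anything triggers triton compilation. A sketch of what that might look like in `setup()` in predict.py (untested; it assumes both that triton consults `LD_LIBRARY_PATH` at compile time and that the compat libcuda.so actually works with the host driver):

```python
# Untested workaround sketch: the find output above shows a libcuda.so under
# the CUDA compat dir, so prepend that dir to LD_LIBRARY_PATH before triton
# compiles its kernels. Whether the compat library is usable here is an
# assumption I haven't verified.
import os

COMPAT_DIR = "/usr/local/cuda-11.8/compat"

def expose_compat_libcuda() -> None:
    if os.path.exists(os.path.join(COMPAT_DIR, "libcuda.so")):
        current = os.environ.get("LD_LIBRARY_PATH", "")
        os.environ["LD_LIBRARY_PATH"] = (
            COMPAT_DIR + (":" + current if current else "")
        )
```

But that feels like papering over the real problem, which is that the driver libraries the NVIDIA runtime should inject are missing entirely.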