Misc. bug: NVIDIA CUB implementations of GGML argsort and top-k cause CUDA graph capture failure for some tensor shapes #21682

@fairydreaming

Name and Version

$ ./bin/llama-cli --version
ggml_cuda_init: found 1 CUDA devices (Total VRAM: 97247 MiB):
Device 0: NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition, compute capability 12.0, VMM: yes, VRAM: 97247 MiB
version: 8737 (009a113)
built with GNU 13.3.0 for Linux x86_64

Operating systems

Linux

Which llama.cpp modules do you know to be affected?

Other (Please specify in the next section)

Command line

./bin/test-backend-ops perf -o "TOP_K(type=f32,ne=[2048,500,1,1],k=2048,ties=0)"
./bin/test-backend-ops perf -o "TOP_K(type=f32,ne=[2048,501,1,1],k=2048,ties=0)"
./bin/test-backend-ops perf -o "TOP_K(type=f32,ne=[4096,500,1,1],k=2048,ties=0)"
./bin/test-backend-ops perf -o "TOP_K(type=f32,ne=[4096,501,1,1],k=2048,ties=0)"
./bin/test-backend-ops perf -o "ARGSORT(type=f32,ne=[2048,500,1,1],order=1)"
./bin/test-backend-ops perf -o "ARGSORT(type=f32,ne=[2048,501,1,1],order=1)"
./bin/test-backend-ops perf -o "ARGSORT(type=f32,ne=[4096,500,1,1],order=1)"
./bin/test-backend-ops perf -o "ARGSORT(type=f32,ne=[4096,501,1,1],order=1)"

Problem description & steps to reproduce

Important: this problem affects only the GGML CUDA backend when it is compiled with NVIDIA CUB enabled. I built llama.cpp with:

cmake .. -DGGML_CUDA=1 -DGGML_CUDA_USE_CUB=1

When profiling GGML ops for my DeepSeek DSA implementation, I noticed that test-backend-ops perf fails when testing ggml_argsort() and ggml_top_k(), but only for some tensor shapes. Specifically, it works when the second dimension is less than or equal to 500 and fails when it is greater than 500. For example, these performance tests work correctly:

./bin/test-backend-ops perf -o "TOP_K(type=f32,ne=[2048,500,1,1],k=2048,ties=0)"
./bin/test-backend-ops perf -o "TOP_K(type=f32,ne=[4096,500,1,1],k=2048,ties=0)"
./bin/test-backend-ops perf -o "ARGSORT(type=f32,ne=[2048,500,1,1],order=1)"
./bin/test-backend-ops perf -o "ARGSORT(type=f32,ne=[4096,500,1,1],order=1)"

while these fail with "CUDA error: operation not permitted when stream is capturing" (500 seems to be a magic number here):

./bin/test-backend-ops perf -o "TOP_K(type=f32,ne=[2048,501,1,1],k=2048,ties=0)"
./bin/test-backend-ops perf -o "TOP_K(type=f32,ne=[4096,501,1,1],k=2048,ties=0)"
./bin/test-backend-ops perf -o "ARGSORT(type=f32,ne=[2048,501,1,1],order=1)"
./bin/test-backend-ops perf -o "ARGSORT(type=f32,ne=[4096,501,1,1],order=1)"

They all work fine when CUDA graphs are disabled (GGML_CUDA_DISABLE_GRAPHS=1 environment variable set).

This happens on CUDA 12.8, CUDA 12.9 and CUDA 13.1. With CUDA 13.2, GGML top_k uses different CUB methods, so top_k works, but argsort still fails.

I did some research, and apparently CUDA graph capture support is not guaranteed for all CUB routines (NVIDIA/cccl#321). So I'm not entirely sure whether this is a bug in CUB or simply undocumented behavior. Either way, I'm documenting it here to make it a known problem.
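
For readers mapping the tensor shapes to the CUB call: the failing call in the log output below passes ncols * nrows as the item count and nrows as the segment count, so the "second dimension" of ne is the number of segments handed to DeviceSegmentedSort. Below is a minimal CPU-only sketch of the same segmented descending argsort semantics. It is not the ggml or CUB code. The function name segmented_argsort_desc is hypothetical, and I am assuming that segment i covers [i * ncols, (i + 1) * ncols) and that the output holds row-local indices (the definition of offset_iterator is not shown in the log).

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

// CPU reference sketch (hypothetical helper, not part of ggml or CUB).
// Sorts each row of a row-major [nrows x ncols] buffer independently and
// returns, per row, the column indices ordered by descending value.
// Assumption: segment `row` spans [row * ncols, (row + 1) * ncols).
std::vector<int> segmented_argsort_desc(const std::vector<float> & src, int ncols, int nrows) {
    std::vector<int> dst(static_cast<std::size_t>(ncols) * nrows);
    for (int row = 0; row < nrows; ++row) {            // one segment per row
        const std::size_t begin = static_cast<std::size_t>(row) * ncols;
        int * idx = dst.data() + begin;
        std::iota(idx, idx + ncols, 0);                // row-local indices 0..ncols-1
        // stable_sort keeps the original column order among equal values
        std::stable_sort(idx, idx + ncols, [&](int a, int b) {
            return src[begin + a] > src[begin + b];
        });
    }
    return dst;
}
```

For example, with ncols = 4 and nrows = 2 (that is, ne=[4,2,1,1]) and rows {0.1, 0.9, 0.5, 0.3} and {2, 1, 4, 3}, this returns {1, 2, 3, 0, 2, 3, 0, 1}. In the failing tests, nrows = 501 means 501 such segments.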

Patch with test cases:

diff --git a/tests/test-backend-ops.cpp b/tests/test-backend-ops.cpp
index c7129d47a..2b07039fa 100644
--- a/tests/test-backend-ops.cpp
+++ b/tests/test-backend-ops.cpp
@@ -8913,6 +8913,13 @@ static std::vector<std::unique_ptr<test_case>> make_test_cases_perf() {
     test_cases.emplace_back(new test_argsort(GGML_TYPE_F32, {200000, 1,  1, 1}));
     test_cases.emplace_back(new test_argsort(GGML_TYPE_F32, {200000, 16, 1, 1}));
 
+    test_cases.emplace_back(new test_argsort(GGML_TYPE_F32, {2048, 500, 1, 1}, GGML_SORT_ORDER_DESC));
+    test_cases.emplace_back(new test_argsort(GGML_TYPE_F32, {2048, 501, 1, 1}, GGML_SORT_ORDER_DESC));
+    test_cases.emplace_back(new test_argsort(GGML_TYPE_F32, {4096, 500, 1, 1}, GGML_SORT_ORDER_DESC));
+    test_cases.emplace_back(new test_argsort(GGML_TYPE_F32, {4096, 501, 1, 1}, GGML_SORT_ORDER_DESC));
+    test_cases.emplace_back(new test_argsort(GGML_TYPE_F32, {8192, 500, 1, 1}, GGML_SORT_ORDER_DESC));
+    test_cases.emplace_back(new test_argsort(GGML_TYPE_F32, {8192, 501, 1, 1}, GGML_SORT_ORDER_DESC));
+
     test_cases.emplace_back(new test_top_k(GGML_TYPE_F32, {2, 1, 1, 1}, 1));
     for (auto k : {1, 10, 40, 400}) {
         for (auto nrows : {1, 16}) {
@@ -8922,6 +8929,13 @@ static std::vector<std::unique_ptr<test_case>> make_test_cases_perf() {
         }
     }
 
+    test_cases.emplace_back(new test_top_k(GGML_TYPE_F32, {2048, 500, 1, 1}, 2048));
+    test_cases.emplace_back(new test_top_k(GGML_TYPE_F32, {2048, 501, 1, 1}, 2048));
+    test_cases.emplace_back(new test_top_k(GGML_TYPE_F32, {4096, 500, 1, 1}, 2048));
+    test_cases.emplace_back(new test_top_k(GGML_TYPE_F32, {4096, 501, 1, 1}, 2048));
+    test_cases.emplace_back(new test_top_k(GGML_TYPE_F32, {8192, 500, 1, 1}, 2048));
+    test_cases.emplace_back(new test_top_k(GGML_TYPE_F32, {8192, 501, 1, 1}, 2048));
+
     for (auto nrows : {1, 4, 8, 16}) {
         for (auto cols : {128, 1024, 4096, 8192, 16384, 32768, 65536, 131072, 200000, 2000000}) {
             test_cases.emplace_back(new test_cumsum(GGML_TYPE_F32, {cols, nrows, 1, 1}));

First Bad Commit

I think it was always broken.

Relevant log output

Note: to see the error message you need PR #21676

$ ./bin/test-backend-ops perf -o "TOP_K(type=f32,ne=[2048,501,1,1],k=2048,ties=0)"
ggml_cuda_init: found 1 CUDA devices (Total VRAM: 97247 MiB):
  Device 0: NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition, compute capability 12.0, VMM: yes, VRAM: 97247 MiB
Testing 2 devices

Backend 1/2: CUDA0
  Device description: NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition
  Device memory: 97247 MB (96640 MB free)

ggml_backend_cuda_graph_compute: CUDA graph warmup complete
CUDA error: operation not permitted when stream is capturing
  current device: 0, in function argsort_f32_i32_cuda_cub at /home/phm/projects/llama.cpp/ggml/src/ggml-cuda/argsort.cu:102
  DeviceSegmentedSort::SortPairsDescending(d_temp_storage, temp_storage_bytes, temp_keys, temp_keys, temp_indices, dst, ncols * nrows, nrows, offset_iterator, offset_iterator + 1, stream)
/home/phm/projects/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:97: CUDA error
[New LWP 68449]
[New LWP 68448]
[New LWP 68444]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
0x000073f6bcb10813 in __GI___wait4 (pid=68450, stat_loc=0x0, options=0, usage=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:30
warning: 30	../sysdeps/unix/sysv/linux/wait4.c: No such file or directory
#0  0x000073f6bcb10813 in __GI___wait4 (pid=68450, stat_loc=0x0, options=0, usage=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:30
30	in ../sysdeps/unix/sysv/linux/wait4.c
#1  0x000073f6bd0d47f3 in ggml_print_backtrace () from /home/phm/projects/llama.cpp/build-cuda/bin/libggml-base.so.0
#2  0x000073f6bd0d499b in ggml_abort () from /home/phm/projects/llama.cpp/build-cuda/bin/libggml-base.so.0
#3  0x000073f6ba3bc677 in ggml_cuda_error(char const*, char const*, char const*, int, char const*) () from /home/phm/projects/llama.cpp/build-cuda/bin/libggml-cuda.so.0
#4  0x000073f6ba29a8ab in argsort_f32_i32_cuda_cub(ggml_cuda_pool&, float const*, int*, int, int, ggml_sort_order, CUstream_st*) () from /home/phm/projects/llama.cpp/build-cuda/bin/libggml-cuda.so.0
#5  0x000073f6ba5251b2 in ggml_cuda_op_top_k(ggml_backend_cuda_context&, ggml_tensor*) () from /home/phm/projects/llama.cpp/build-cuda/bin/libggml-cuda.so.0
#6  0x000073f6ba3cecc0 in ggml_cuda_compute_forward(ggml_backend_cuda_context&, ggml_tensor*) () from /home/phm/projects/llama.cpp/build-cuda/bin/libggml-cuda.so.0
#7  0x000073f6ba3d439b in ggml_cuda_graph_evaluate_and_capture(ggml_backend_cuda_context*, ggml_cgraph*, bool, bool, void const*) () from /home/phm/projects/llama.cpp/build-cuda/bin/libggml-cuda.so.0
#8  0x000073f6ba3d602e in ggml_backend_cuda_graph_compute(ggml_backend*, ggml_cgraph*) () from /home/phm/projects/llama.cpp/build-cuda/bin/libggml-cuda.so.0
#9  0x000073f6bd0ebcf2 in ggml_backend_graph_compute () from /home/phm/projects/llama.cpp/build-cuda/bin/libggml-base.so.0
#10 0x00005762cf66671e in test_backend(ggml_backend*, test_mode, char const*, char const*, printer*, char const*) ()
#11 0x00005762cf634ea0 in main ()
[Inferior 1 (process 68443) detached]
Aborted (core dumped)
$ ./bin/test-backend-ops perf -o "ARGSORT(type=f32,ne=[2048,501,1,1],order=1)"
ggml_cuda_init: found 1 CUDA devices (Total VRAM: 97247 MiB):
  Device 0: NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition, compute capability 12.0, VMM: yes, VRAM: 97247 MiB
Testing 2 devices

Backend 1/2: CUDA0
  Device description: NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition
  Device memory: 97247 MB (96640 MB free)

ggml_backend_cuda_graph_compute: CUDA graph warmup complete
CUDA error: operation not permitted when stream is capturing
  current device: 0, in function argsort_f32_i32_cuda_cub at /home/phm/projects/llama.cpp/ggml/src/ggml-cuda/argsort.cu:102
  DeviceSegmentedSort::SortPairsDescending(d_temp_storage, temp_storage_bytes, temp_keys, temp_keys, temp_indices, dst, ncols * nrows, nrows, offset_iterator, offset_iterator + 1, stream)
/home/phm/projects/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:97: CUDA error
[New LWP 69290]
[New LWP 69289]
[New LWP 69285]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
0x00007a47b5d10813 in __GI___wait4 (pid=69291, stat_loc=0x0, options=0, usage=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:30
warning: 30	../sysdeps/unix/sysv/linux/wait4.c: No such file or directory
#0  0x00007a47b5d10813 in __GI___wait4 (pid=69291, stat_loc=0x0, options=0, usage=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:30
30	in ../sysdeps/unix/sysv/linux/wait4.c
#1  0x00007a47b63347f3 in ggml_print_backtrace () from /home/phm/projects/llama.cpp/build-cuda/bin/libggml-base.so.0
#2  0x00007a47b633499b in ggml_abort () from /home/phm/projects/llama.cpp/build-cuda/bin/libggml-base.so.0
#3  0x00007a47b35bc677 in ggml_cuda_error(char const*, char const*, char const*, int, char const*) () from /home/phm/projects/llama.cpp/build-cuda/bin/libggml-cuda.so.0
#4  0x00007a47b349a8ab in argsort_f32_i32_cuda_cub(ggml_cuda_pool&, float const*, int*, int, int, ggml_sort_order, CUstream_st*) () from /home/phm/projects/llama.cpp/build-cuda/bin/libggml-cuda.so.0
#5  0x00007a47b349b8c3 in ggml_cuda_op_argsort(ggml_backend_cuda_context&, ggml_tensor*) () from /home/phm/projects/llama.cpp/build-cuda/bin/libggml-cuda.so.0
#6  0x00007a47b35cecd0 in ggml_cuda_compute_forward(ggml_backend_cuda_context&, ggml_tensor*) () from /home/phm/projects/llama.cpp/build-cuda/bin/libggml-cuda.so.0
#7  0x00007a47b35d439b in ggml_cuda_graph_evaluate_and_capture(ggml_backend_cuda_context*, ggml_cgraph*, bool, bool, void const*) () from /home/phm/projects/llama.cpp/build-cuda/bin/libggml-cuda.so.0
#8  0x00007a47b35d602e in ggml_backend_cuda_graph_compute(ggml_backend*, ggml_cgraph*) () from /home/phm/projects/llama.cpp/build-cuda/bin/libggml-cuda.so.0
#9  0x00007a47b634bcf2 in ggml_backend_graph_compute () from /home/phm/projects/llama.cpp/build-cuda/bin/libggml-base.so.0
#10 0x00005b4d1ca54a3e in test_backend(ggml_backend*, test_mode, char const*, char const*, printer*, char const*) ()
#11 0x00005b4d1ca23ea0 in main ()
[Inferior 1 (process 69284) detached]
Aborted (core dumped)
