@jakekarnes42
Contributor

Problem

IM2COL_3D fails on the CUDA backend when the input is a non-contiguous view (v=1); the same test cases pass on the CPU backend. The CUDA kernel computed source addresses as if the input had a compact layout (strides equal to products of the dimensions) and ignored the actual byte strides in nb[].
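To make the failure mode concrete, here is a minimal host-side C++ sketch (not code from the patch). `ViewDesc`, `compact_byte_offset`, `strided_byte_offset`, and `make_demo_view` are hypothetical names, and the parent shape [40,20,10,3] is an illustrative assumption. The descriptor is only loosely modeled on ggml's `ne[]`/`nb[]` fields.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical descriptor, loosely modeled on ggml's ne[] / nb[] fields.
struct ViewDesc {
    int64_t ne[4]; // elements per dimension
    size_t  nb[4]; // stride per dimension, in bytes
};

// Byte offset under the compact-layout assumption: strides are products of dims.
inline size_t compact_byte_offset(const ViewDesc & t, int64_t i0, int64_t i1, int64_t i2, int64_t i3) {
    assert(i0 >= 0 && i0 < t.ne[0] && i1 >= 0 && i1 < t.ne[1]);
    assert(i2 >= 0 && i2 < t.ne[2] && i3 >= 0 && i3 < t.ne[3]);
    const int64_t idx = ((i3 * t.ne[2] + i2) * t.ne[1] + i1) * t.ne[0] + i0;
    return static_cast<size_t>(idx) * t.nb[0];
}

// Byte offset using the strides the tensor actually carries.
inline size_t strided_byte_offset(const ViewDesc & t, int64_t i0, int64_t i1, int64_t i2, int64_t i3) {
    assert(i0 >= 0 && i0 < t.ne[0] && i1 >= 0 && i1 < t.ne[1]);
    assert(i2 >= 0 && i2 < t.ne[2] && i3 >= 0 && i3 < t.ne[3]);
    return static_cast<size_t>(i0) * t.nb[0] + static_cast<size_t>(i1) * t.nb[1]
         + static_cast<size_t>(i2) * t.nb[2] + static_cast<size_t>(i3) * t.nb[3];
}

// Illustrative f32 view [20,20,10,3] cut out of a contiguous [40,20,10,3] parent:
// the view keeps the parent's byte strides, so rows are 40 floats apart, not 20.
inline ViewDesc make_demo_view() {
    return ViewDesc{{20, 20, 10, 3}, {4, 160, 3200, 32000}};
}
```

For element (0,1,0,0) of this view, the compact formula yields byte offset 80 while the strided formula yields 160, so a kernel using the compact formula reads the wrong float for every row after the first.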

Reproduction steps:

C:\Users\jakek\llama-dev>git clone git@github.com:ggml-org/llama.cpp.git
C:\Users\jakek\llama-dev>cd llama.cpp
C:\Users\jakek\llama-dev\llama.cpp>cmake -S . -B build-cuda -DGGML_CUDA=ON -DCMAKE_BUILD_TYPE=Release -DCMAKE_CUDA_ARCHITECTURES=120 -DCMAKE_TOOLCHAIN_FILE=%VCPKG_ROOT%\scripts\buildsystems\vcpkg.cmake -DVCPKG_TARGET_TRIPLET=x64-windows -DCMAKE_FIND_PACKAGE_PREFER_CONFIG=ON
C:\Users\jakek\llama-dev\llama.cpp>cmake --build build-cuda --config Release --parallel
C:\Users\jakek\llama-dev\llama.cpp>build-cuda\bin\Release\llama-cli.exe --list-devices
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 5090, compute capability 12.0, VMM: yes
Available devices:
  CUDA0: NVIDIA GeForce RTX 5090 (32606 MiB, 30841 MiB free)
C:\Users\jakek\llama-dev\llama.cpp>cd build-cuda
C:\Users\jakek\llama-dev\llama.cpp\build-cuda>ctest --output-on-failure -C Release -j 8
  ...
22: [IM2COL_3D] NMSE = 1.987277064 > 0.000000100 IM2COL_3D(type_input=f32,type_kernel=f32,dst_type=f32,ne_input=[20,20,10,3],ne_kernel=[3,3,3,3],IC=1,s0=1,s1=3,s2=1,p0=0,p1=3,p2=3,d0=1,d1=1,d2=1,v=1): FAIL 
22: IM2COL_3D(type_input=f32,type_kernel=f32,dst_type=f32,ne_input=[20,20,10,3],ne_kernel=[3,3,3,3],IC=3,s0=1,s1=3,s2=1,p0=0,p1=3,p2=3,d0=1,d1=1,d2=1,v=0): OK 
22: IM2COL_3D(type_input=f32,type_kernel=f32,dst_type=f32,ne_input=[20,20,10,3],ne_kernel=[3,3,3,3],IC=3,s0=1,s1=3,s2=1,p0=0,p1=3,p2=3,d0=1,d1=1,d2=1,v=1): OK 
22: IM2COL_3D(type_input=f32,type_kernel=f32,dst_type=f32,ne_input=[20,20,10,3],ne_kernel=[3,3,3,3],IC=1,s0=1,s1=3,s2=1,p0=0,p1=3,p2=3,d0=1,d1=1,d2=3,v=0): OK 
22: [IM2COL_3D] NMSE = 1.991386136 > 0.000000100 IM2COL_3D(type_input=f32,type_kernel=f32,dst_type=f32,ne_input=[20,20,10,3],ne_kernel=[3,3,3,3],IC=1,s0=1,s1=3,s2=1,p0=0,p1=3,p2=3,d0=1,d1=1,d2=3,v=1): FAIL 
22: IM2COL_3D(type_input=f32,type_kernel=f32,dst_type=f32,ne_input=[20,20,10,3],ne_kernel=[3,3,3,3],IC=3,s0=1,s1=3,s2=1,p0=0,p1=3,p2=3,d0=1,d1=1,d2=3,v=0): OK 
22: IM2COL_3D(type_input=f32,type_kernel=f32,dst_type=f32,ne_input=[20,20,10,3],ne_kernel=[3,3,3,3],IC=3,s0=1,s1=3,s2=1,p0=0,p1=3,p2=3,d0=1,d1=1,d2=3,v=1): OK 
22: IM2COL_3D(type_input=f32,type_kernel=f32,dst_type=f32,ne_input=[20,20,10,3],ne_kernel=[3,3,3,3],IC=1,s0=1,s1=3,s2=1,p0=0,p1=3,p2=3,d0=1,d1=3,d2=1,v=0): OK 
22: [IM2COL_3D] NMSE = 1.985351196 > 0.000000100 IM2COL_3D(type_input=f32,type_kernel=f32,dst_type=f32,ne_input=[20,20,10,3],ne_kernel=[3,3,3,3],IC=1,s0=1,s1=3,s2=1,p0=0,p1=3,p2=3,d0=1,d1=3,d2=1,v=1): FAIL 
  ...
  13959/14471 tests passed
  Backend CUDA0: FAIL
Backend 2/2: CPU
  Skipping CPU backend
1/2 backends passed
FAIL


97% tests passed, 1 tests failed out of 30

Label Time Summary:
curl             =   0.73 sec*proc (1 test)
eval-callback    =   0.73 sec*proc (1 test)
main             =  99.60 sec*proc (27 tests)
model            =   0.16 sec*proc (2 tests)

Total Test time (real) =  70.36 sec

The following tests FAILED:
         22 - test-backend-ops (Failed)                         main
Errors while running CTest
C:\Users\jakek\llama-dev\llama.cpp\build-cuda>.\bin\Release\test-backend-ops.exe test -b CUDA0
  ...
  13959/14471 tests passed
  Backend CUDA0: FAIL
Backend 2/2: CPU
  Skipping
1/2 backends passed
FAIL
C:\Users\jakek\llama-dev\llama.cpp\build-cuda>.\bin\Release\test-backend-ops.exe test -b CPU
 ...
   14471/14471 tests passed
  Backend CPU: OK
2/2 backends passed
OK

Fix

This patch switches the im2col_3d source indexing to use the tensor's actual strides, taken from src1->nb[] and converted from bytes to elements, mirroring the approach used in the 2D CUDA im2col path. Destination indexing is unchanged.
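The idea of stride-based source indexing can be sketched on the host in plain C++. This is an illustration under stated assumptions, not the CUDA kernel from the patch: `SrcStrides`, `Im2Col3dGeom`, `im2col_3d_fetch`, and `demo_fetch` are hypothetical names, and the strides are assumed to already be expressed in elements.

```cpp
#include <cassert>
#include <cstdint>

// Element strides of the source tensor (hypothetical names; in ggml these
// would be derived from nb[] divided by the element size).
struct SrcStrides {
    int64_t w; // step between neighbours along width
    int64_t h; // step between rows
    int64_t d; // step between depth slices
    int64_t c; // step between channels
};

// Convolution geometry: stride s*, padding p*, dilation d* per axis, plus input extents.
struct Im2Col3dGeom {
    int s0, s1, s2;
    int p0, p1, p2;
    int d0, d1, d2;
    int64_t IW, IH, ID;
};

// Fetch the source sample that output position (iow, ioh, iod) sees through
// kernel tap (ikw, ikh, ikd) in channel ic. Taps that land in padding read 0.
inline float im2col_3d_fetch(const float * src, const SrcStrides & st, const Im2Col3dGeom & g,
                             int64_t ic, int64_t iow, int64_t ioh, int64_t iod,
                             int64_t ikw, int64_t ikh, int64_t ikd) {
    assert(src != nullptr);
    const int64_t iiw = iow * g.s0 + ikw * g.d0 - g.p0;
    const int64_t iih = ioh * g.s1 + ikh * g.d1 - g.p1;
    const int64_t iid = iod * g.s2 + ikd * g.d2 - g.p2;
    if (iiw < 0 || iiw >= g.IW || iih < 0 || iih >= g.IH || iid < 0 || iid >= g.ID) {
        return 0.0f;
    }
    // Strided addressing: no assumption that rows are IW elements apart.
    return src[ic * st.c + iid * st.d + iih * st.h + iiw * st.w];
}

// Demo: a 2x2x1 view over the left half of a 4x2x1 parent holding 0..7.
inline float demo_fetch(int64_t iow, int64_t ioh, int p0) {
    static const float parent[8] = {0, 1, 2, 3, 4, 5, 6, 7};
    const SrcStrides   st = {1, 4, 8, 8};
    const Im2Col3dGeom g  = {1, 1, 1, p0, 0, 0, 1, 1, 1, 2, 2, 1};
    return im2col_3d_fetch(parent, st, g, 0, iow, ioh, 0, 0, 0, 0);
}
```

In the demo, `demo_fetch(0, 1, 0)` reads `parent[4]`, the first element of the view's second row. A compact formula with a row stride of IW = 2 would read `parent[2]` instead, which belongs to the part of the parent outside the view.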

After the patch, the tests pass for both CUDA and CPU backends.

C:\Users\jakek\llama-dev\llama.cpp>cmake --build build-cuda --config Release --parallel
C:\Users\jakek\llama-dev\llama.cpp>build-cuda\bin\Release\llama-cli.exe --list-devices
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 5090, compute capability 12.0, VMM: yes
Available devices:
  CUDA0: NVIDIA GeForce RTX 5090 (32606 MiB, 30841 MiB free)
C:\Users\jakek\llama-dev\llama.cpp>cd build-cuda
C:\Users\jakek\llama-dev\llama.cpp\build-cuda>ctest --output-on-failure -C Release -j 8
Test project C:/Users/jakek/llama-dev/llama.cpp/build-cuda
      Start 22: test-backend-ops
      ...
30/30 Test #22: test-backend-ops ..................   Passed   71.29 sec

100% tests passed, 0 tests failed out of 30

Label Time Summary:
curl             =   0.79 sec*proc (1 test)
eval-callback    =   0.79 sec*proc (1 test)
main             = 102.28 sec*proc (27 tests)
model            =   0.17 sec*proc (2 tests)

Total Test time (real) =  71.31 sec
C:\Users\jakek\llama-dev\llama.cpp\build-cuda>.\bin\Release\test-backend-ops.exe test -b CUDA0
  ...
    14471/14471 tests passed
  Backend CUDA0: OK
Backend 2/2: CPU
  Skipping
2/2 backends passed
OK
C:\Users\jakek\llama-dev\llama.cpp\build-cuda>.\bin\Release\test-backend-ops.exe test -b CPU
 ...
   14471/14471 tests passed
  Backend CPU: OK
2/2 backends passed
OK

@github-actions bot added the "Nvidia GPU" (issues specific to Nvidia GPUs) and "ggml" (changes relating to the ggml tensor library for machine learning) labels on Sep 13, 2025
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
@jakekarnes42
Contributor Author

@JohannesGaessler - Thanks for the review. I agree with your suggestion to use ggml_element_size rather than a fixed value and I've applied that change.
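As a rough illustration of why the element size should be queried rather than fixed, the sketch below converts byte strides to element strides. It assumes the "fixed value" was a hard-coded element size such as sizeof(float); the thread does not show the original line. `element_strides` and `demo_f16_nb` are hypothetical helpers, and the `elem_size` argument stands in for the value that ggml_element_size would report for the tensor.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Convert byte strides to element strides by dividing by the element size.
// elem_size is a stand-in for the size queried from the tensor's type.
inline std::array<int64_t, 4> element_strides(const std::array<size_t, 4> & nb, size_t elem_size) {
    assert(elem_size > 0);
    std::array<int64_t, 4> out{};
    for (size_t i = 0; i < 4; ++i) {
        assert(nb[i] % elem_size == 0); // strides are assumed to be whole elements
        out[i] = static_cast<int64_t>(nb[i] / elem_size);
    }
    return out;
}

// Byte strides of a contiguous tensor with 2-byte elements (e.g. f16) and ne = [40, 20, 10, 3].
inline std::array<size_t, 4> demo_f16_nb() {
    return {2, 80, 1600, 16000};
}
```

With the element size of 2, the strides come out as {1, 40, 800, 8000}. Dividing the same byte strides by a fixed 4 would truncate the innermost stride to 0 and halve the others, so every source address would be wrong for any type that is not 4 bytes wide.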

@JohannesGaessler merged commit 3d4053f into ggml-org:master on Sep 15, 2025
47 of 48 checks passed
@jakekarnes42 deleted the im2col_3d_stride_fix branch on September 17, 2025, 02:00