CUDA: fix im2col_3d to respect non-contiguous inputs (views) #15956

jakekarnes42 · 2025-09-13T05:07:39Z

Problem

IM2COL_3D fails on CUDA when the input is a non‑contiguous view (v=1), while the CPU backend passes. The CUDA kernel computed source addresses assuming a compact layout (products of dims) and ignored the nb[] strides.
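
To make the failure mode concrete, here is a minimal host-side C++ sketch (not the kernel code from this PR). `TensorDesc`, `make_row_view`, `compact_offset`, and `strided_offset` are hypothetical names for illustration; only the ne[]/nb[] convention (extents in elements, strides in bytes) is taken from ggml. It assumes a view that keeps the first `w` columns of each row of a wider parent tensor.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for the ne[]/nb[] fields of a ggml tensor.
struct TensorDesc {
    int64_t ne[4]; // extent of each dimension, in elements
    size_t  nb[4]; // stride of each dimension, in bytes
};

// Assumed scenario: a view over the first `w` columns of a parent whose
// rows are `parent_w` elements wide. Extents shrink, strides stay the parent's.
TensorDesc make_row_view(int64_t w, int64_t parent_w, int64_t h, int64_t d,
                         int64_t c, size_t elem_size) {
    assert(w <= parent_w);
    TensorDesc t;
    t.ne[0] = w; t.ne[1] = h; t.ne[2] = d; t.ne[3] = c;
    t.nb[0] = elem_size;
    t.nb[1] = t.nb[0] * static_cast<size_t>(parent_w);
    t.nb[2] = t.nb[1] * static_cast<size_t>(h);
    t.nb[3] = t.nb[2] * static_cast<size_t>(d);
    return t;
}

// Element offset computed from products of dims (only valid for compact tensors).
int64_t compact_offset(const TensorDesc& t, int64_t i0, int64_t i1, int64_t i2, int64_t i3) {
    return i0 + t.ne[0] * (i1 + t.ne[1] * (i2 + t.ne[2] * i3));
}

// Element offset computed from the byte strides in nb[].
int64_t strided_offset(const TensorDesc& t, size_t elem_size,
                       int64_t i0, int64_t i1, int64_t i2, int64_t i3) {
    const int64_t bytes = i0 * static_cast<int64_t>(t.nb[0])
                        + i1 * static_cast<int64_t>(t.nb[1])
                        + i2 * static_cast<int64_t>(t.nb[2])
                        + i3 * static_cast<int64_t>(t.nb[3]);
    return bytes / static_cast<int64_t>(elem_size);
}
```

With `make_row_view(20, 40, 20, 10, 3, sizeof(float))`, element (0, 1, 0, 0) lives at element offset 40 in the parent buffer, but the compact formula returns 20. The two formulas agree only within the first row, which is consistent with a kernel that reads mostly wrong data for views.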

Reproduction steps:

C:\Users\jakek\llama-dev>git clone git@github.com:ggml-org/llama.cpp.git
C:\Users\jakek\llama-dev>cd llama.cpp
C:\Users\jakek\llama-dev\llama.cpp>cmake -S . -B build-cuda -DGGML_CUDA=ON -DCMAKE_BUILD_TYPE=Release -DCMAKE_CUDA_ARCHITECTURES=120 -DCMAKE_TOOLCHAIN_FILE=%VCPKG_ROOT%\scripts\buildsystems\vcpkg.cmake -DVCPKG_TARGET_TRIPLET=x64-windows -DCMAKE_FIND_PACKAGE_PREFER_CONFIG=ON
C:\Users\jakek\llama-dev\llama.cpp>cmake --build build-cuda --config Release --parallel
C:\Users\jakek\llama-dev\llama.cpp>build-cuda\bin\Release\llama-cli.exe --list-devices
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 5090, compute capability 12.0, VMM: yes
Available devices:
  CUDA0: NVIDIA GeForce RTX 5090 (32606 MiB, 30841 MiB free)
C:\Users\jakek\llama-dev\llama.cpp>cd build-cuda
C:\Users\jakek\llama-dev\llama.cpp\build-cuda>ctest --output-on-failure -C Release -j 8
  ...
22: [IM2COL_3D] NMSE = 1.987277064 > 0.000000100 IM2COL_3D(type_input=f32,type_kernel=f32,dst_type=f32,ne_input=[20,20,10,3],ne_kernel=[3,3,3,3],IC=1,s0=1,s1=3,s2=1,p0=0,p1=3,p2=3,d0=1,d1=1,d2=1,v=1): FAIL 
22: IM2COL_3D(type_input=f32,type_kernel=f32,dst_type=f32,ne_input=[20,20,10,3],ne_kernel=[3,3,3,3],IC=3,s0=1,s1=3,s2=1,p0=0,p1=3,p2=3,d0=1,d1=1,d2=1,v=0): OK 
22: IM2COL_3D(type_input=f32,type_kernel=f32,dst_type=f32,ne_input=[20,20,10,3],ne_kernel=[3,3,3,3],IC=3,s0=1,s1=3,s2=1,p0=0,p1=3,p2=3,d0=1,d1=1,d2=1,v=1): OK 
22: IM2COL_3D(type_input=f32,type_kernel=f32,dst_type=f32,ne_input=[20,20,10,3],ne_kernel=[3,3,3,3],IC=1,s0=1,s1=3,s2=1,p0=0,p1=3,p2=3,d0=1,d1=1,d2=3,v=0): OK 
22: [IM2COL_3D] NMSE = 1.991386136 > 0.000000100 IM2COL_3D(type_input=f32,type_kernel=f32,dst_type=f32,ne_input=[20,20,10,3],ne_kernel=[3,3,3,3],IC=1,s0=1,s1=3,s2=1,p0=0,p1=3,p2=3,d0=1,d1=1,d2=3,v=1): FAIL 
22: IM2COL_3D(type_input=f32,type_kernel=f32,dst_type=f32,ne_input=[20,20,10,3],ne_kernel=[3,3,3,3],IC=3,s0=1,s1=3,s2=1,p0=0,p1=3,p2=3,d0=1,d1=1,d2=3,v=0): OK 
22: IM2COL_3D(type_input=f32,type_kernel=f32,dst_type=f32,ne_input=[20,20,10,3],ne_kernel=[3,3,3,3],IC=3,s0=1,s1=3,s2=1,p0=0,p1=3,p2=3,d0=1,d1=1,d2=3,v=1): OK 
22: IM2COL_3D(type_input=f32,type_kernel=f32,dst_type=f32,ne_input=[20,20,10,3],ne_kernel=[3,3,3,3],IC=1,s0=1,s1=3,s2=1,p0=0,p1=3,p2=3,d0=1,d1=3,d2=1,v=0): OK 
22: [IM2COL_3D] NMSE = 1.985351196 > 0.000000100 IM2COL_3D(type_input=f32,type_kernel=f32,dst_type=f32,ne_input=[20,20,10,3],ne_kernel=[3,3,3,3],IC=1,s0=1,s1=3,s2=1,p0=0,p1=3,p2=3,d0=1,d1=3,d2=1,v=1): FAIL 
  ...
  13959/14471 tests passed
  Backend CUDA0: FAIL
Backend 2/2: CPU
  Skipping CPU backend
1/2 backends passed
FAIL


97% tests passed, 1 tests failed out of 30

Label Time Summary:
curl             =   0.73 sec*proc (1 test)
eval-callback    =   0.73 sec*proc (1 test)
main             =  99.60 sec*proc (27 tests)
model            =   0.16 sec*proc (2 tests)

Total Test time (real) =  70.36 sec

The following tests FAILED:
         22 - test-backend-ops (Failed)                         main
Errors while running CTest
C:\Users\jakek\llama-dev\llama.cpp\build-cuda>.\bin\Release\test-backend-ops.exe test -b CUDA0
  ...
  13959/14471 tests passed
  Backend CUDA0: FAIL
Backend 2/2: CPU
  Skipping
1/2 backends passed
FAIL
C:\Users\jakek\llama-dev\llama.cpp\build-cuda>.\bin\Release\test-backend-ops.exe test -b CPU
 ...
   14471/14471 tests passed
  Backend CPU: OK
2/2 backends passed
OK

Fix

This patch switches im2col_3d source indexing to use true strides derived from src1->nb[] (in elements), mirroring the approach used in the 2D CUDA im2col path. Destination indexing is unchanged.

After the patch, the tests pass for both CUDA and CPU backends.

C:\Users\jakek\llama-dev\llama.cpp>cmake --build build-cuda --config Release --parallel
C:\Users\jakek\llama-dev\llama.cpp>build-cuda\bin\Release\llama-cli.exe --list-devices
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 5090, compute capability 12.0, VMM: yes
Available devices:
  CUDA0: NVIDIA GeForce RTX 5090 (32606 MiB, 30841 MiB free)
C:\Users\jakek\llama-dev\llama.cpp>cd build-cuda
C:\Users\jakek\llama-dev\llama.cpp\build-cuda>ctest --output-on-failure -C Release -j 8
Test project C:/Users/jakek/llama-dev/llama.cpp/build-cuda
      Start 22: test-backend-ops
      ...
30/30 Test #22: test-backend-ops ..................   Passed   71.29 sec

100% tests passed, 0 tests failed out of 30

Label Time Summary:
curl             =   0.79 sec*proc (1 test)
eval-callback    =   0.79 sec*proc (1 test)
main             = 102.28 sec*proc (27 tests)
model            =   0.17 sec*proc (2 tests)

Total Test time (real) =  71.31 sec
C:\Users\jakek\llama-dev\llama.cpp\build-cuda>.\bin\Release\test-backend-ops.exe test -b CUDA0
  ...
    14471/14471 tests passed
  Backend CUDA0: OK
Backend 2/2: CPU
  Skipping
2/2 backends passed
OK
C:\Users\jakek\llama-dev\llama.cpp\build-cuda>.\bin\Release\test-backend-ops.exe test -b CPU
 ...
   14471/14471 tests passed
  Backend CPU: OK
2/2 backends passed
OK

ggml/src/ggml-cuda/im2col.cu

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

jakekarnes42 · 2025-09-14T21:38:02Z

@JohannesGaessler - Thanks for the review. I agree with your suggestion to use ggml_element_size rather than a fixed value and I've applied that change.
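
A small sketch of why the element size matters when converting byte strides to element strides. `to_elem_stride` is a hypothetical helper, not a ggml function; its `elem_size` parameter stands in for the value that `ggml_element_size()` would report for the source tensor, and the 2-byte element type is an assumption chosen for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Converts a byte stride (as stored in nb[]) into an element stride.
// The byte stride is expected to be a whole multiple of the element size.
int64_t to_elem_stride(size_t nb_bytes, size_t elem_size) {
    assert(elem_size > 0);
    assert(nb_bytes % elem_size == 0);
    return static_cast<int64_t>(nb_bytes / elem_size);
}
```

For a row stride of 80 bytes over 2-byte elements, `to_elem_stride(80, 2)` gives the correct 40 elements, whereas dividing by a fixed `sizeof(float)` would give 20 and index the wrong data.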

github-actions bot added the "Nvidia GPU" and "ggml" labels Sep 13, 2025

JohannesGaessler approved these changes Sep 14, 2025

ggml/src/ggml-cuda/im2col.cu (outdated, resolved)

use ggml_element_size() for src strides

96c8e79

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

JohannesGaessler merged commit 3d4053f into ggml-org:master Sep 15, 2025
47 of 48 checks passed

jakekarnes42 deleted the im2col_3d_stride_fix branch September 17, 2025 02:00
