Add support for CUMSUM and TRI for CUDA. #17584
Conversation
For cumsum we should use https://nvidia.github.io/cccl/cub/api/structcub_1_1DeviceScan.html and use this kernel as a fallback
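For reference, a minimal sketch of the CUB path being suggested (not code from this PR; the temp-storage handling is simplified here, and in ggml it would go through the pool allocator rather than cudaMallocAsync):

```cpp
#include <cub/device/device_scan.cuh>

// Device-wide inclusive prefix sum over one contiguous row.
// cub::DeviceScan is called twice: the first call with a null temp pointer
// only reports how much scratch space the scan needs.
static void cumsum_row_cub(const float * d_in, float * d_out, int n, cudaStream_t stream) {
    void * d_temp     = nullptr;
    size_t temp_bytes = 0;
    cub::DeviceScan::InclusiveSum(d_temp, temp_bytes, d_in, d_out, n, stream);

    cudaMallocAsync(&d_temp, temp_bytes, stream); // ggml would use its pool allocator instead
    cub::DeviceScan::InclusiveSum(d_temp, temp_bytes, d_in, d_out, n, stream);
    cudaFreeAsync(d_temp, stream);
}
```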
I have a small optimization for the tri kernel (;

Benchmark results:
1. llama.cpp benchmark (50 runs each)
2. Profiler statistics, RTX 2070 (Nsight)
```diff
@@ -1,16 +1,7 @@
 #include "tri.cuh"
 #include "ggml.h"
-// Triangle type comparison - determines which elements to keep
-__device__ static inline bool tri_compare(const int i, const int r, const ggml_tri_type type) {
-    switch (type) {
-        case GGML_TRI_TYPE_LOWER:      return i < r;
-        case GGML_TRI_TYPE_LOWER_DIAG: return i <= r;
-        case GGML_TRI_TYPE_UPPER:      return i > r;
-        case GGML_TRI_TYPE_UPPER_DIAG: return i >= r;
-        default:                       return false;
-    }
-}
+
 template<typename T>
 static __global__ void tri_kernel(
@@ -31,10 +22,22 @@ static __global__ void tri_kernel(
     const T * src_row = (const T *) ((const char *) src + i1*nb01 + i2*nb02 + i3*nb03);
     T * dst_row = (T *) ((char *) dst + i1*nb1 + i2*nb2 + i3*nb3);
+    // Optimization: Avoid control flow (switch) inside the hot loop.
+    // Map the 4 triangle types to a generic "split point" and "keep direction" logic.
+    // LOWER / UPPER_DIAG: Split at 'r' (i1). LOWER_DIAG / UPPER: Split at 'r + 1'.
+    int add_to_split = 0;
+    if (ttype == GGML_TRI_TYPE_LOWER_DIAG || ttype == GGML_TRI_TYPE_UPPER) {
+        add_to_split = 1;
+    }
+    int64_t split_point = i1 + add_to_split;
+    bool prefix_keep = (ttype == GGML_TRI_TYPE_LOWER || ttype == GGML_TRI_TYPE_LOWER_DIAG);
+
     // Each thread processes elements at stride blockDim.x
     for (int64_t i0 = threadIdx.x; i0 < ne00; i0 += blockDim.x) {
-        dst_row[i0] = tri_compare(i0, i1, ttype)
-            ? src_row[i0] : static_cast<T>(0.f);
+        // If prefix_keep is true, keep (i0 < split_point). Else, keep (i0 >= split_point).
+        bool keep = ((i0 < split_point) == prefix_keep);
+        dst_row[i0] = keep ? src_row[i0] : T(0);
    }
 }
```
ggml/src/ggml-cuda/cumsum.cu
```cpp
// Load value and compute prefix sum within warp
float val = static_cast<float>(src_row[i0]);
val = warp_prefix_inclusive_sum(val);
dst_row[i0] = static_cast<T>(val);
```
It would be much preferable to store the temporary results in registers or shared memory rather than global memory.
Isn't val here already stored in a register though? I'm afraid I'll need some more guidance here.
dst_row is in global memory. With this code you are writing data to VRAM on this line, only to later read this data again, add a value to it, and write it back. So you have 3x as much I/O to the comparatively slow VRAM vs. the comparatively faster SRAM or registers where you could be storing it instead until you write the data once at the end of the kernel.
Now I get it, thanks!
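To make the point concrete, here is a rough sketch (my own illustration, not the kernel that ended up in the PR) of how a single block can keep the running sums in registers and shared memory and write each element to global memory only once. It assumes a 1D block whose size is a multiple of 32 and a warp_prefix_inclusive_sum() helper like the one quoted above:

```cpp
static __global__ void cumsum_block_kernel(const float * src_row, float * dst_row, const int64_t ne00) {
    __shared__ float warp_totals[32];                 // one slot per warp in the block
    const int lane      = threadIdx.x % 32;
    const int warp_id   = threadIdx.x / 32;
    const int num_warps = blockDim.x / 32;

    float carry = 0.0f;                               // sum of all previous chunks, kept in a register
    for (int64_t start = 0; start < ne00; start += blockDim.x) {
        const int64_t i0 = start + threadIdx.x;
        const bool in_bounds = i0 < ne00;

        // Scan the current chunk: warp-level prefix sums stay in registers.
        float val = in_bounds ? src_row[i0] : 0.0f;
        val = warp_prefix_inclusive_sum(val);

        if (lane == 31) {
            warp_totals[warp_id] = val;               // each warp publishes its chunk total
        }
        __syncthreads();

        float offset      = 0.0f;                     // totals of the lower-indexed warps
        float chunk_total = 0.0f;                     // total of the whole chunk
        for (int w = 0; w < num_warps; ++w) {
            chunk_total += warp_totals[w];
            offset      += (w < warp_id) ? warp_totals[w] : 0.0f;
        }

        if (in_bounds) {
            dst_row[i0] = val + carry + offset;       // the only write to global memory
        }
        carry += chunk_total;
        __syncthreads();                              // warp_totals is reused in the next iteration
    }
}
```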
Regarding the implementation proposed by @wsbagnsv1: if one were to do something like that, then the in my opinion correct way to do it would be to calculate start and end points for copying and for zeroing and to then simply do two loops over those areas. If at all possible, a conditional statement inside the loop should be avoided. But that would potentially make the kernel less flexible if other patterns for …
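A sketch of how I read the two-loop shape (an assumption, not the merged kernel), shown for the lower/lower-diag cases; the upper cases would swap which range is copied and which is zeroed:

```cpp
template <typename T>
static __device__ void tri_row_lower(const T * src_row, T * dst_row,
                                     const int64_t ne00, const int64_t i1, const bool diag) {
    // Row i1 keeps elements [0, keep_end) and zeroes [keep_end, ne00);
    // the boundary is computed once, so neither loop contains a conditional.
    int64_t keep_end = i1 + (diag ? 1 : 0);
    if (keep_end > ne00) {
        keep_end = ne00;
    }

    for (int64_t i0 = threadIdx.x; i0 < keep_end; i0 += blockDim.x) {
        dst_row[i0] = src_row[i0];
    }
    for (int64_t i0 = keep_end + threadIdx.x; i0 < ne00; i0 += blockDim.x) {
        dst_row[i0] = T(0);
    }
}
```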
Okay, when adding in @JohannesGaessler's remarks about not calculating the comparison in kernel code, @wsbagnsv1's optimizations just flowed naturally, so I just combined them. EDIT: nvm, had wrong strides
Okay, I implemented the double-loop algorithm. I think those cases that are now templated are the only cases that will be supported, so it's probably fine this way.
@gabe-l-hart I would be grateful if you could look at the HIP code fixes; I have completely no idea what I'm doing there (and I'm not able to test either, aside from the CI).
Unfortunately, I'm not much use here as I also don't have any background with HIP. I just tried installing it on my GB10 device, but haven't had any luck.
ggml/src/ggml-cuda/common.cuh
```cpp
static __device__ __forceinline__ unsigned int get_warp_mask() {
#ifdef __HIP_PLATFORM_AMD__
    return __ballot(1); // HIP equivalent
```
I know basically nothing about HIP, but according to this doc, it seems like __activemask() should be supported? The main difference referenced there is the warp size of 64 vs 32, which I could absolutely imagine being accidentally hard-coded somewhere.
Specifically, I see #define WARP_SIZE 32 at the top of this file.
cc/ @IMbackK
WARP_SIZE is deprecated and the remaining uses should only be in places affecting performance, not correctness; the non-deprecated equivalent is ggml_cuda_get_physical_warp_size.
__activemask is indeed supported and works, but I will need to check for how long - will do that later.
We will need to change the return type of this and the kernel below; @pwilkin, you can do so or skip the kernel on HIP and I will fix it in a follow-up.
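For illustration, one possible direction for the return-type change (my own sketch, not the actual fix; it keeps the #ifdef from the quoted snippet, which the comments further down suggest replacing with a compile-time warp-size constant):

```cpp
// Widen the mask type together with the warp size so a full 64-lane ballot
// is not truncated to 32 bits.
template <int warp_size> struct lane_mask     { using type = unsigned int;       };
template <>              struct lane_mask<64> { using type = unsigned long long; };

template <int warp_size>
static __device__ __forceinline__ typename lane_mask<warp_size>::type get_warp_mask() {
#ifdef __HIP_PLATFORM_AMD__
    return __ballot(1);      // 64-bit ballot on wave64 AMD hardware
#else
    return __activemask();   // 32-bit active mask on NVIDIA hardware
#endif
}
```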
@IMbackK okay, I'll comment it out then and add a TODO; I prefer to leave it to someone who knows what they're doing rather than leave an untested vibe-coded patch :)
@pwilkin not sure if you missed my comment, but CUB should be superior for most cases
Ah, completely forgot about that one! Yeah, will do.
All right, implemented the CUB-compatible version per @am17an's request and removed the global memory access per @JohannesGaessler's request (I'd be lying if I said I figured all of that out on my own; fortunately, it turns out the new DeepSeek 3.2 Speciale is quite good at both optimizing kernels and explaining them). After all the optimizations, especially the biggest case improved a lot; also, the fallback implementation is performance-wise very similar to the BlockScan implementation.
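For context, a minimal illustration of how cub::BlockScan is typically used for a row-wise inclusive sum (my own sketch, not a reproduction of the PR's kernel; it assumes one block per row and the block size as a template parameter):

```cpp
#include <cub/block/block_scan.cuh>

template <int block_size>
static __global__ void cumsum_blockscan_kernel(const float * src_row, float * dst_row, const int64_t ne00) {
    using BlockScan = cub::BlockScan<float, block_size>;
    __shared__ typename BlockScan::TempStorage temp_storage;

    float carry = 0.0f;                                  // running sum of previous chunks
    for (int64_t start = 0; start < ne00; start += block_size) {
        const int64_t i0 = start + threadIdx.x;
        float val = (i0 < ne00) ? src_row[i0] : 0.0f;

        float chunk_total;
        BlockScan(temp_storage).InclusiveSum(val, val, chunk_total);
        __syncthreads();                                 // temp_storage is reused in the next iteration

        if (i0 < ne00) {
            dst_row[i0] = val + carry;
        }
        carry += chunk_total;
    }
}
```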
Performance-wise it could be better by adjusting to the physical warp layout, although not that much better since it's so IO-bound anyhow, but it does test fine on HIP/warp64. Anyhow, I'll do a follow-up PR, also to add rocprim to the CUB path, if you don't want to do it yourself.
@IMbackK please check if this is the correct way.
Please use something like: […] instead of preprocessor macros. Also, that can't work as-is without also making warp_prefix_inclusive_sum handle 64-wide warps.
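Since the actual snippet in this comment did not survive the quote, here is my own illustration of the kind of change being asked for (not the suggested code): derive the warp width from ggml_cuda_get_physical_warp_size() as a compile-time constant rather than branching on preprocessor macros, so warp_prefix_inclusive_sum can also handle 64-wide warps. It assumes ggml's HIP layer maps __shfl_up_sync onto the corresponding HIP shuffle for widths up to 64:

```cpp
template <int warp_size>
static __device__ __forceinline__ float warp_prefix_inclusive_sum(float x) {
    const int lane = threadIdx.x % warp_size;
#pragma unroll
    for (int offset = 1; offset < warp_size; offset <<= 1) {
        // Hillis-Steele step: pull the partial sum from 'offset' lanes below.
        const float y = __shfl_up_sync(0xFFFFFFFF, x, offset, warp_size);
        if (lane >= offset) {
            x += y;
        }
    }
    return x;
}

// Call sites pick the width once, at compile time, e.g.:
//   constexpr int warp_size = ggml_cuda_get_physical_warp_size();
//   val = warp_prefix_inclusive_sum<warp_size>(val);
```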
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
Okay, I've added the requested changes and ran the tests, but need @IMbackK and @JohannesGaessler to look at the final shape of things to verify that I haven't done anything stupid.
IMbackK left a comment:
looks reasonable to me now, tests ok on a warp64 gpu. I am fine with this.
Alright then, going to wait for CI just in case and then merge.
Aight, the key tests have passed so I'm merging this.
looks good:
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
build: 96fe9ba (7276)
Extracted and adapted kernels by @gabe-l-hart from #16623