CUDA: add attention sinks for tile and wmma #15178

am17an · 2025-08-08T17:16:07Z

Adds attention sink support to the tile and wmma kernels used on older GPUs (Volta and below). This would complete support for attention sinks in the flash attention code.
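For readers unfamiliar with attention sinks, here is a minimal host-side sketch of the idea, not the kernel code from this PR. It assumes the gpt-oss style formulation in which each head has one extra logit (the sink) that joins the softmax denominator but has no value vector, so it only scales the output down. `attend_with_sink` is a hypothetical name, and masked (`-inf`) logits are not handled.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Single-query attention with an attention sink (reference sketch).
// scores: KQ logits, one per key (assumed finite). V: one value row per key.
// sink:   extra per-head logit that takes part in the softmax normalisation
//         but contributes no value row (assumption stated in the lead-in).
std::vector<float> attend_with_sink(const std::vector<float>& scores,
                                    const std::vector<std::vector<float>>& V,
                                    float sink) {
    assert(scores.size() == V.size());
    const std::size_t d = V.empty() ? 0 : V[0].size();

    std::vector<float> acc(d, 0.0f); // unnormalised weighted sum of value rows
    float run_max = -INFINITY;       // running maximum logit
    float run_sum = 0.0f;            // running softmax denominator

    // Online softmax over the keys, as a flash attention kernel would do per tile.
    for (std::size_t i = 0; i < scores.size(); ++i) {
        const float new_max = std::max(run_max, scores[i]);
        const float scale   = std::exp(run_max - new_max); // 0 on the first key
        const float p       = std::exp(scores[i] - new_max);
        run_sum = run_sum * scale + p;
        for (std::size_t j = 0; j < d; ++j) {
            acc[j] = acc[j] * scale + p * V[i][j];
        }
        run_max = new_max;
    }

    // Fold the sink in after the loop: it can raise the maximum and it adds
    // one term to the denominator, but nothing is added to the accumulator.
    const float new_max = std::max(run_max, sink);
    const float scale   = std::exp(run_max - new_max);
    run_sum = run_sum * scale + std::exp(sink - new_max);
    for (float& a : acc) {
        a = a * scale / run_sum;
    }
    return acc;
}
```

With two keys of equal logit 0, value rows `{1}` and `{3}`, and a sink of 0, the denominator is 3 rather than 2, so the result is 4/3 instead of the plain softmax result of 2.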

on P100
master

ggml_cuda_init: found 1 CUDA devices:
  Device 0: Tesla P100-SXM2-16GB, compute capability 6.0, VMM: yes
| model                          |       size |     params | backend    | ngl | n_ubatch | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -------: | -: | --------------: | -------------------: |
| gpt-oss ?B MXFP4 MoE           |  11.27 GiB |    20.91 B | CUDA       |  99 |      512 |  1 |          pp8192 |        443.35 ± 1.08 |
| gpt-oss ?B MXFP4 MoE           |  11.27 GiB |    20.91 B | CUDA       |  99 |      512 |  1 |           tg128 |         52.81 ± 0.05 |
| gpt-oss ?B MXFP4 MoE           |  11.27 GiB |    20.91 B | CUDA       |  99 |     1024 |  1 |          pp8192 |        501.63 ± 0.68 |
| gpt-oss ?B MXFP4 MoE           |  11.27 GiB |    20.91 B | CUDA       |  99 |     1024 |  1 |           tg128 |         52.77 ± 0.04 |

PR

  Device 0: Tesla P100-SXM2-16GB, compute capability 6.0, VMM: yes
| model                          |       size |     params | backend    | ngl | n_ubatch | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -------: | -: | --------------: | -------------------: |
| gpt-oss ?B MXFP4 MoE           |  11.27 GiB |    20.91 B | CUDA       |  99 |      512 |  1 |          pp8192 |        687.26 ± 2.64 |
| gpt-oss ?B MXFP4 MoE           |  11.27 GiB |    20.91 B | CUDA       |  99 |      512 |  1 |           tg128 |         52.83 ± 0.03 |
| gpt-oss ?B MXFP4 MoE           |  11.27 GiB |    20.91 B | CUDA       |  99 |     1024 |  1 |          pp8192 |        823.87 ± 1.32 |
| gpt-oss ?B MXFP4 MoE           |  11.27 GiB |    20.91 B | CUDA       |  99 |     1024 |  1 |           tg128 |         52.76 ± 0.05 |

on V100

master (with fix): at the moment this model appears to be broken only on Volta, because it goes through the wmma path even though attention sinks are not supported there

  Device 0: Tesla V100-PCIE-16GB, compute capability 7.0, VMM: yes
| model                          |       size |     params | backend    | ngl | n_ubatch | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -------: | -: | --------------: | -------------------: |
| gpt-oss ?B MXFP4 MoE           |  11.27 GiB |    20.91 B | CUDA       |  99 |      512 |  1 |          pp8192 |       1081.62 ± 2.53 |
| gpt-oss ?B MXFP4 MoE           |  11.27 GiB |    20.91 B | CUDA       |  99 |      512 |  1 |           tg128 |        117.00 ± 0.20 |
| gpt-oss ?B MXFP4 MoE           |  11.27 GiB |    20.91 B | CUDA       |  99 |     1024 |  1 |          pp8192 |       1189.98 ± 3.06 |
| gpt-oss ?B MXFP4 MoE           |  11.27 GiB |    20.91 B | CUDA       |  99 |     1024 |  1 |           tg128 |        117.38 ± 0.29 |

PR

  Device 0: Tesla V100-PCIE-16GB, compute capability 7.0, VMM: yes
| model                          |       size |     params | backend    | ngl | n_ubatch | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -------: | -: | --------------: | -------------------: |
| gpt-oss ?B MXFP4 MoE           |  11.27 GiB |    20.91 B | CUDA       |  99 |      512 |  1 |          pp8192 |      2231.48 ± 15.04 |
| gpt-oss ?B MXFP4 MoE           |  11.27 GiB |    20.91 B | CUDA       |  99 |      512 |  1 |           tg128 |        117.85 ± 0.10 |
| gpt-oss ?B MXFP4 MoE           |  11.27 GiB |    20.91 B | CUDA       |  99 |     1024 |  1 |          pp8192 |      2801.53 ± 29.66 |
| gpt-oss ?B MXFP4 MoE           |  11.27 GiB |    20.91 B | CUDA       |  99 |     1024 |  1 |           tg128 |        117.79 ± 0.13 |
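To make the comparison easier to read, the pp8192 rows of the four tables above can be reduced to speedup ratios. This is only a small sketch: the numbers are copied from the tables, while `Bench`, `kBench` and `speedup` are names introduced here for illustration. Note that the V100 baseline is master with the fix applied.

```cpp
// Prompt-processing (pp8192) throughput in t/s, copied from the tables above.
struct Bench {
    const char* gpu;
    int         n_ubatch;
    double      master; // t/s on master (V100: master with fix)
    double      pr;     // t/s with this PR
};

constexpr Bench kBench[] = {
    {"P100",  512,  443.35,  687.26},
    {"P100", 1024,  501.63,  823.87},
    {"V100",  512, 1081.62, 2231.48},
    {"V100", 1024, 1189.98, 2801.53},
};

// Ratio of PR throughput to master throughput.
constexpr double speedup(const Bench& b) {
    return b.pr / b.master;
}
```

The ratios work out to about 1.55x and 1.64x on the P100 and about 2.06x and 2.35x on the V100, while tg128 is essentially unchanged on both cards.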

JohannesGaessler

This PR should produce correct results, but I think some of the synchronizations can be optimized out. In addition to the usual correctness tests, please also run `compute-sanitizer --tool=racecheck ./tests/test-backend-ops -o FLASH_ATTN_EXT`. The compute sanitizer should come with the CUDA installation, but it may not be on the PATH (on my system it's under `/opt/cuda/bin/compute-sanitizer`).
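As background on why some synchronizations may be unnecessary: a reduction that stays inside one warp can be done with register shuffles and needs no block-wide barrier, whereas exchanges through shared memory are what racecheck inspects for hazards. The following is a host-side C++ simulation of a butterfly max reduction over 32 lanes, assuming the usual XOR-shuffle pattern; `warp_reduce_max_sim` is a hypothetical name, and the actual ggml helper may differ in detail.

```cpp
#include <algorithm>
#include <array>

constexpr int kWarpSize = 32;

// Simulates a butterfly max reduction across one warp on the host.
// On the GPU each step would be a shuffle exchange between lane i and
// lane (i ^ offset); here a snapshot of the previous step plays that role.
// After log2(32) = 5 steps every lane holds the maximum of all 32 inputs.
std::array<float, kWarpSize> warp_reduce_max_sim(std::array<float, kWarpSize> lanes) {
    for (int offset = kWarpSize / 2; offset > 0; offset >>= 1) {
        const std::array<float, kWarpSize> prev = lanes; // values before this exchange
        for (int lane = 0; lane < kWarpSize; ++lane) {
            lanes[lane] = std::max(prev[lane], prev[lane ^ offset]);
        }
    }
    return lanes;
}
```

Because every lane ends up with the same maximum, no lane has to broadcast the result afterwards, which is one reason this pattern avoids extra barriers.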

ggml/src/ggml-cuda/fattn-tile-f16.cu

ggml/src/ggml-cuda/fattn-tile-f32.cu

ggml/src/ggml-cuda/fattn-wmma-f16.cu

am17an · 2025-08-09T11:30:55Z

@JohannesGaessler the compute-sanitizer tests are all green. Tested on P100 and V100.

IMbackK · 2025-08-09T19:47:18Z

If possible, I would like to be tagged on PRs that touch the wmma code.

CUDA: add attention sinks for tile and wmma

4946c19

am17an requested a review from JohannesGaessler as a code owner August 8, 2025 17:16

github-actions bot added the "Nvidia GPU" (issues specific to Nvidia GPUs) and "ggml" (changes relating to the ggml tensor library for machine learning) labels Aug 8, 2025

JohannesGaessler reviewed Aug 9, 2025

Review: formatting changes + remove syncthreads from tile + remove warp_reduce_max from wmma

1ef7fd0

JohannesGaessler approved these changes Aug 9, 2025

am17an merged commit 34c9d76 into ggml-org:master Aug 9, 2025
47 checks passed

am17an deleted the cuda_fattn_tile_wmma branch August 9, 2025 12:00

Thireus added a commit to Thireus/ik_llama.cpp that referenced this pull request Aug 11, 2025

CUDA: add attention sinks for tile and wmma

f71ef6b

Port of ggml-org/llama.cpp#15178
