
CUDA: 4D FlashAttention support #14628


Merged

Conversation

JohannesGaessler
Collaborator

This PR adds 4-dimensional CUDA FlashAttention support for #14363. The data layout for the fixup was changed, but there should be no change in performance. As discussed in #14505 (comment), the CUDA code requires mask->ne[2] == 1; otherwise additional complexity would be needed to ensure that the GQA-specific optimizations in fattn-mma-f16.cuh produce correct results.
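
A minimal sketch of the guard this constraint implies (illustrative only; the helper name is hypothetical and this is not the actual dispatch code in ggml-cuda):

    #include "ggml.h"

    // Hypothetical helper mirroring the constraint described above: the CUDA
    // FlashAttention path assumes the mask is not broadcast along dimension 2,
    // i.e. mask->ne[2] == 1. A mask with ne[2] > 1 would interfere with the
    // GQA-specific tiling in fattn-mma-f16.cuh.
    static bool cuda_fattn_mask_is_supported(const struct ggml_tensor * mask) {
        return mask == NULL || mask->ne[2] == 1;
    }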

@github-actions github-actions bot added the testing, Nvidia GPU, and ggml labels on Jul 11, 2025
Member

@ggerganov ggerganov left a comment

Tests are passing on RTX 2060

@ggerganov ggerganov force-pushed the gg/llama-high-throughput branch from f23950a to ab82dc2 on July 11, 2025 08:27
@JohannesGaessler
Collaborator Author

There was an issue with the WMMA kernel (which is now fixed); merge whenever it is convenient for you.

@ggerganov ggerganov merged commit c43f275 into ggml-org:gg/llama-high-throughput Jul 11, 2025
47 checks passed
ggerganov pushed a commit that referenced this pull request Jul 12, 2025
* CUDA: 4D FlashAttention support

* CUDA: fix WMMA FA kernel
@CISC
Collaborator

CISC commented Jul 13, 2025

Something is wrong; I'm getting a ton of failures on a 3090 Ti (CUDA 12.9):

[...]
  FLASH_ATTN_EXT(hsk=128,hsv=128,nh=4,nr23=[16,1],kv=512,nb=1,mask=1,max_bias=0.000000,logit_softcap=0.000000,prec=def,type_KV=q8_0,permute=[0,1,2,3]): OK
  FLASH_ATTN_EXT(hsk=128,hsv=128,nh=4,nr23=[16,1],kv=512,nb=1,mask=1,max_bias=0.000000,logit_softcap=0.000000,prec=def,type_KV=q8_0,permute=[0,2,1,3]): OK
  FLASH_ATTN_EXT(hsk=128,hsv=128,nh=4,nr23=[16,1],kv=512,nb=1,mask=1,max_bias=0.000000,logit_softcap=0.000000,prec=def,type_KV=q4_0,permute=[0,1,2,3]): OK
  FLASH_ATTN_EXT(hsk=128,hsv=128,nh=4,nr23=[16,1],kv=512,nb=1,mask=1,max_bias=0.000000,logit_softcap=0.000000,prec=def,type_KV=q4_0,permute=[0,2,1,3]): OK
[FLASH_ATTN_EXT] NMSE = 0.421540541 > 0.000500000   FLASH_ATTN_EXT(hsk=128,hsv=128,nh=4,nr23=[16,1],kv=512,nb=3,mask=1,max_bias=0.000000,logit_softcap=0.000000,prec=f32,type_KV=f16,permute=[0,1,2,3]): FAIL
[FLASH_ATTN_EXT] NMSE = 0.471500105 > 0.000500000   FLASH_ATTN_EXT(hsk=128,hsv=128,nh=4,nr23=[16,1],kv=512,nb=3,mask=1,max_bias=0.000000,logit_softcap=0.000000,prec=f32,type_KV=f16,permute=[0,2,1,3]): FAIL
  FLASH_ATTN_EXT(hsk=128,hsv=128,nh=4,nr23=[16,1],kv=512,nb=3,mask=1,max_bias=0.000000,logit_softcap=0.000000,prec=f32,type_KV=bf16,permute=[0,1,2,3]): not supported [CUDA0] 
  FLASH_ATTN_EXT(hsk=128,hsv=128,nh=4,nr23=[16,1],kv=512,nb=3,mask=1,max_bias=0.000000,logit_softcap=0.000000,prec=f32,type_KV=bf16,permute=[0,2,1,3]): not supported [CUDA0] 
[FLASH_ATTN_EXT] NMSE = 0.458659731 > 0.000500000   FLASH_ATTN_EXT(hsk=128,hsv=128,nh=4,nr23=[16,1],kv=512,nb=3,mask=1,max_bias=0.000000,logit_softcap=0.000000,prec=f32,type_KV=q8_0,permute=[0,1,2,3]): FAIL
[FLASH_ATTN_EXT] NMSE = 0.460324585 > 0.000500000   FLASH_ATTN_EXT(hsk=128,hsv=128,nh=4,nr23=[16,1],kv=512,nb=3,mask=1,max_bias=0.000000,logit_softcap=0.000000,prec=f32,type_KV=q8_0,permute=[0,2,1,3]): FAIL
[FLASH_ATTN_EXT] NMSE = 0.445988407 > 0.000500000   FLASH_ATTN_EXT(hsk=128,hsv=128,nh=4,nr23=[16,1],kv=512,nb=3,mask=1,max_bias=0.000000,logit_softcap=0.000000,prec=f32,type_KV=q4_0,permute=[0,1,2,3]): FAIL
[FLASH_ATTN_EXT] NMSE = 0.465820280 > 0.000500000   FLASH_ATTN_EXT(hsk=128,hsv=128,nh=4,nr23=[16,1],kv=512,nb=3,mask=1,max_bias=0.000000,logit_softcap=0.000000,prec=f32,type_KV=q4_0,permute=[0,2,1,3]): FAIL
[FLASH_ATTN_EXT] NMSE = 0.409744725 > 0.000500000   FLASH_ATTN_EXT(hsk=128,hsv=128,nh=4,nr23=[16,1],kv=512,nb=3,mask=1,max_bias=0.000000,logit_softcap=0.000000,prec=def,type_KV=f16,permute=[0,1,2,3]): FAIL
[FLASH_ATTN_EXT] NMSE = 0.420985664 > 0.000500000   FLASH_ATTN_EXT(hsk=128,hsv=128,nh=4,nr23=[16,1],kv=512,nb=3,mask=1,max_bias=0.000000,logit_softcap=0.000000,prec=def,type_KV=f16,permute=[0,2,1,3]): FAIL
[...]
  6407/6551 tests passed
  Backend CUDA0: FAIL
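
For reference, NMSE here is a normalized mean squared error of the backend output against the CPU reference; a rough sketch of that metric, assuming the usual definition rather than quoting the exact test-backend-ops implementation:

    #include <cstddef>
    #include <vector>

    // Rough sketch (assumed definition): squared error of the backend output
    // relative to the squared magnitude of the reference output. Values above
    // the 5e-4 threshold show up as FAIL in the log above.
    static double nmse(const std::vector<float> & out, const std::vector<float> & ref) {
        double err = 0.0, norm = 0.0;
        for (size_t i = 0; i < ref.size(); ++i) {
            const double d = (double) out[i] - (double) ref[i];
            err  += d*d;
            norm += (double) ref[i] * (double) ref[i];
        }
        return norm > 0.0 ? err/norm : err;
    }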

@ggerganov
Member

You are testing master. This was merged into another branch.

@CISC
Collaborator

CISC commented Jul 13, 2025

You are testing master. This was merged into another branch.

Ah, LOL, sorry. :)

Why is master failing though?

@JohannesGaessler
Collaborator Author

If master is failing, can you do a git bisect to determine since when?

@ggerganov
Member

It's failing the mask->ne[2] != 1 tests. These are not relevant.
