Vulkan with FlashAttention: Extreme performance degradation #12629
Unanswered
remon-nashid asked this question in Q&A
Replies: 1 comment 2 replies
-
Right now flash attention on Vulkan is only supported on some NVIDIA drivers via the coopmat2 extension. On any other GPU, using flash attention causes the computation to be offloaded to the CPU, which is the reason for the performance loss.
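One way to confirm whether a given driver exposes that extension at all (assuming the Vulkan SDK's vulkaninfo tool is installed) is something like:

```
# Sketch: list the cooperative matrix extensions the installed driver reports.
# Per the comment above, the Vulkan FA path needs the coopmat2 extension
# (VK_NV_cooperative_matrix2), which is currently only on some NVIDIA drivers.
vulkaninfo | grep -i cooperative_matrix
```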
-
Is this expected? I'm using the latest llama.cpp-vulkan build on an AMD 7900 XTX card. Below are the llama-bench results without and with flash attention (a sample invocation is sketched after them).
Note that I've reproduced these results with various models from 3B up to 32B.
FA disabled: (llama-bench results table omitted)
FA enabled: (llama-bench results table omitted)
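For reference, a minimal llama-bench comparison would look something like the following; the model path is a placeholder, not the exact setup used for the tables above:

```
# Sketch: benchmark the same model with flash attention off and on.
./llama-bench -m models/your-model.gguf -fa 0
./llama-bench -m models/your-model.gguf -fa 1
```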
I couldn't find this reported elsewhere; usually people complain about FA with ROCm rather than Vulkan. Please let me know if this has already been reported, or if there are any ongoing efforts I could follow.
Thanks