Vulkan with FlashAttention: Extreme performance degradation #12629
Unanswered
remon-nashid asked this question in Q&A
Replies: 1 comment 2 replies
-
Right now flash attention on Vulkan is only supported on some NVIDIA drivers via the coopmat2 extension. On any other GPU, using flash attention causes the computation to be offloaded to the CPU, which is the reason for the performance loss.
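One way to confirm whether a given driver exposes that extension at all (assuming the Vulkan SDK's vulkaninfo tool is installed) is something like:

```
# Sketch: list the cooperative matrix extensions the installed driver reports.
# Per the comment above, the Vulkan FA path needs the coopmat2 extension
# (VK_NV_cooperative_matrix2), which is currently only on some NVIDIA drivers.
vulkaninfo | grep -i cooperative_matrix
```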
-
Is this expected? I'm using the latest llama.cpp-vulkan build on an AMD 7900 XTX card. Below are the llama-bench results without and with flash attention (a sample invocation is sketched after them).
Note that I've reproduced these results with various models from 3B up to 32B.
FA disabled: (llama-bench results table omitted)
FA enabled: (llama-bench results table omitted)
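For reference, a minimal llama-bench comparison would look something like the following; the model path is a placeholder, not the exact setup used for the tables above:

```
# Sketch: benchmark the same model with flash attention off and on.
./llama-bench -m models/your-model.gguf -fa 0
./llama-bench -m models/your-model.gguf -fa 1
```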
I couldn't find this reported elsewhere; usually people complain about FA with ROCm rather than Vulkan. Please let me know if this has already been reported, or if there are any ongoing efforts I could follow.
Thanks