CUDA: add FP32 FlashAttention vector kernel #7188
Conversation
Fixes #7055.
This happens regularly, but it's never going to be ok to add backend-specific functions to the tests.
(force-pushed from 29e01c3 to de85f90)
The TG speedup is significant, but PP is quite a bit slower; I don't know why.
There simply isn't yet a kernel optimized for large batch sizes. |
ggml-cuda.cu (Outdated)
for (int id = 0; id < ggml_backend_cuda_get_device_count(); ++id) {
    if (ggml_cuda_info().devices[id].cc < CC_VOLTA) {
        return false;
    }
}
I don't think it is necessary to check every device here; instead, get the context and check only the device for this context. Something like this:
ggml_backend_cuda_context * cuda_ctx = (ggml_backend_cuda_context *) backend->context;
if (ggml_cuda_info().devices[cuda_ctx->device].cc < CC_VOLTA) {
    return false;
}
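For reference, here is a minimal, self-contained sketch of what such a per-device check boils down to when written against the plain CUDA runtime API rather than ggml's cached device info; `device_is_volta_or_newer` is a hypothetical helper name, and the comparison against compute capability 7.0 mirrors the CC_VOLTA threshold above:

```cuda
// Minimal sketch, assuming only the CUDA runtime API: gate a feature on the
// compute capability of one specific device (analogous to checking only the
// device of the backend context) instead of iterating over every device.
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: true if the device has compute capability >= 7.0 (Volta).
static bool device_is_volta_or_newer(int device) {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess) {
        return false;
    }
    return prop.major >= 7;
}

int main() {
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess) {
        return 1;
    }
    for (int id = 0; id < count; ++id) {
        printf("device %d: compute capability %s 7.0\n",
               id, device_is_volta_or_newer(id) ? ">=" : "<");
    }
    return 0;
}
```

Either way, the point of the suggestion is the same: only the device that actually backs this context needs to pass the check, not every device in the system.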
(force-pushed from de85f90 to e0d1184)
Just adding a small data point: with KoboldCPP compiled with this and a Q8_K 11B model on a 2x 1080 Ti (Pascal) setup, I see a significant improvement, whereas with FP16 FA I saw a decrease. So it definitely has utility for a subset of users.
(force-pushed from e0d1184 to 41f5f3a)
I don't have any ALiBi models set up for testing, but according to …
Hi, I get an error when trying to run with -fa on my P100. Is support dropped?
Pascal is still supported, make an issue.
This PR adds an FP32 FlashAttention kernel that is very similar to the FP16 kernel. It enables using FlashAttention on NVIDIA GPUs without fast FP16 and without tensor cores. It should also provide a speedup on more recent NVIDIA GPUs for batch size 1 and FP32 precision. I have moved the FP16 and FP32 FlashAttention vector kernels to separate files in order to speed up compilation. I also added a function ggml_backend_cuda_get_device_cc to ggml-cuda.h in order to avoid breaking tests/test-backend-ops on NVIDIA GPUs without tensor cores. Unlike with the FP16 kernel, there are no weird issues with arrays of size 1 vs. regular variables.

Performance on 1x P40:
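To make the description above more concrete, here is a deliberately unoptimized, single-thread sketch of the numerics a batch-size-1 FlashAttention vector kernel has to get right: a streaming ("online") softmax over the KV sequence with all accumulation kept in FP32. The kernel and variable names are my own; this is an illustration of the technique, not the kernel from this PR, which parallelizes over the head dimension and KV positions.

```cuda
#include <cstdio>
#include <cmath>
#include <cuda_runtime.h>

// Single-query attention with online softmax, everything in FP32.
// Runs in one thread purely to show the arithmetic.
__global__ void flash_attn_vec_f32_ref(
        const float * q,   // [D]     single query
        const float * K,   // [N * D] keys, row-major
        const float * V,   // [N * D] values, row-major
        float       * out, // [D]     attention output
        int N, int D, float scale) {
    float m = -INFINITY; // running maximum of the scores
    float l = 0.0f;      // running softmax denominator
    for (int d = 0; d < D; ++d) {
        out[d] = 0.0f;   // running (unnormalized) output accumulator
    }
    for (int i = 0; i < N; ++i) {
        float s = 0.0f;  // score of KV position i
        for (int d = 0; d < D; ++d) {
            s += q[d] * K[i*D + d];
        }
        s *= scale;

        const float m_new = fmaxf(m, s);
        const float alpha = expf(m - m_new); // rescales everything accumulated so far
        const float p     = expf(s - m_new); // weight of the current KV position

        l = l*alpha + p;
        for (int d = 0; d < D; ++d) {
            out[d] = out[d]*alpha + p*V[i*D + d];
        }
        m = m_new;
    }
    for (int d = 0; d < D; ++d) {
        out[d] /= l; // final softmax normalization
    }
}

int main() {
    const int N = 8, D = 4;
    float q[D], K[N*D], V[N*D], out[D];
    for (int d = 0; d < D; ++d) { q[d] = 0.1f*(d + 1); }
    for (int i = 0; i < N*D; ++i) { K[i] = 0.01f*i; V[i] = 1.0f; }

    float *dq, *dK, *dV, *dout;
    cudaMalloc(&dq, sizeof(q)); cudaMalloc(&dK, sizeof(K));
    cudaMalloc(&dV, sizeof(V)); cudaMalloc(&dout, sizeof(out));
    cudaMemcpy(dq, q, sizeof(q), cudaMemcpyHostToDevice);
    cudaMemcpy(dK, K, sizeof(K), cudaMemcpyHostToDevice);
    cudaMemcpy(dV, V, sizeof(V), cudaMemcpyHostToDevice);

    flash_attn_vec_f32_ref<<<1, 1>>>(dq, dK, dV, dout, N, D, 1.0f/sqrtf((float) D));
    cudaMemcpy(out, dout, sizeof(out), cudaMemcpyDeviceToHost);

    // With V filled with ones, every output element should be exactly 1.
    for (int d = 0; d < D; ++d) {
        printf("out[%d] = %f\n", d, out[d]);
    }
    cudaFree(dq); cudaFree(dK); cudaFree(dV); cudaFree(dout);
    return 0;
}
```

Because m, l, and the output accumulator stay in FP32 throughout, no FP16 arithmetic is needed anywhere, which is what makes this approach usable on GPUs without fast FP16.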