CUDA: quantized KV support for FA vec #7527
Conversation
Considering the results you obtained, I'd say that KV q8_0/q5_1 at 7.25 BPW is, in my opinion, a must due to the big gap between 7.00 BPW and 8.50 BPW, and most importantly due to the very sizeable benefit q8_0/q5_1 brings over q8_0/q5_0 in terms of smaller quality loss relative to q8_0/q8_0, for a difference of only 0.25 BPW. q8_0/q5_1 is the compromise I'd likely use daily, and I think it should be part of the "short list"!
I have a prototype for restricting the compilation only to those types listed in the OP, which brings the compile time on my desktop down considerably. @ggerganov @slaren can you share some insights regarding what maximum compile time you would still find acceptable and whether you think a full or a reduced build should be the default? Another approach would be to split the compilation across more files, which would help on e.g. user desktop machines but maybe not in CI.
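For illustration, a minimal sketch (not the actual prototype) of how a compile-time switch could gate which K/V type combinations get instantiated; the macro name, type list, and function names below are assumptions made up for the example:

```cpp
// Sketch only: restrict which FlashAttention K/V type combinations are
// compiled unless the user opts into the full set via a build flag that
// ends up as a preprocessor define (FA_ALL_QUANTS here is illustrative).
#include <cstdio>

enum class kv_type { F16, Q4_0, Q5_1, Q8_0 };

template <kv_type K, kv_type V>
static void flash_attn_vec() {
    // stand-in for the real kernel launch
    std::printf("FA vec with K=%d V=%d\n", static_cast<int>(K), static_cast<int>(V));
}

// Returns false if the requested combination was not compiled in.
bool flash_attn_dispatch(kv_type K, kv_type V) {
    // Example "short list" that is always available.
    if (K == kv_type::F16  && V == kv_type::F16)  { flash_attn_vec<kv_type::F16,  kv_type::F16>();  return true; }
    if (K == kv_type::Q8_0 && V == kv_type::Q8_0) { flash_attn_vec<kv_type::Q8_0, kv_type::Q8_0>(); return true; }
    if (K == kv_type::Q4_0 && V == kv_type::Q4_0) { flash_attn_vec<kv_type::Q4_0, kv_type::Q4_0>(); return true; }
#ifdef FA_ALL_QUANTS
    // Only instantiated (and compiled) in the opt-in "all quants" build.
    if (K == kv_type::Q8_0 && V == kv_type::Q5_1) { flash_attn_vec<kv_type::Q8_0, kv_type::Q5_1>(); return true; }
    // ... remaining combinations ...
#endif
    return false;
}
```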
My opinion is that we should probably keep just 3 options:
The compile time even on
Force-pushed from fab0e7b to bb5fd6d.
Which scenario are you talking about? Multi-threaded or single-threaded?
Force-pushed from bb5fd6d to 9ecde9f.
The type combination I chose for "the fastest" is q4_0/q4_0; there are still performance issues with the FP32 -> q4_0 conversion code though, so end-to-end performance is still suboptimal.
The default multi-threaded build (i.e. using the subset of 3 KV quants) takes the following amount of time:

```
time LLAMA_NO_CCACHE=1 LLAMA_CUDA=1 make -j main

real    0m35,972s
user    3m57,118s
sys     0m28,550s
```

This is on an AMD Ryzen 9 5950X 16-Core Processor. The build using all quants is indeed quite slow, but since it is an opt-in option I think it is acceptable:

```
time LLAMA_NO_CCACHE=1 LLAMA_CUDA=1 LLAMA_CUDA_FA_ALL_QUANTS=1 make -j main

real    6m18,667s
user    11m51,335s
sys     0m32,225s
```
If you're concerned about compilation time for developers in particular, it would definitely be possible to speed that up. A lot of the code is not strictly needed in terms of correctness but only exists for better performance. There is also a relatively recent NVCC option that can help with compilation time.
Yes, I think any reasonable reorganization of the build that reduces the compile time improves the developer experience. I build the CUDA backend to run some tests from time to time - my main development environment is macOS, so it does not affect me a lot. But I can see it becoming a problem if the build time keeps increasing, so we have to be mindful and look for opportunities to improve. The multi-file reorganization of the CUDA code is already a big win and it's paying off a lot.
These changes would only be useful when working on the affected files, e.g. when working on the mmq kernels. But we could not use them when testing the performance of the CUDA backend, and it also wouldn't help when working on other parts of the code since these files do not need to be recompiled.
I think the build time with only the 3 options is already very good. Lately I am far more concerned about the build time of llama.cpp (the file): a full CUDA build (this PR) takes 26 seconds for me, while building llama.cpp alone takes 13 seconds. However, the solution for that is easy; there is a lot of code in there that could be split into multiple files.
I was also thinking about the compilation time in terms of the CI though; I feel like it would be useful if the latency between opening a PR and getting feedback about a failed run were lower. And for that, parallelization probably won't do much since (I assume) the CI is working with very limited resources.
@slaren Yes, we should do that. And also for the @JohannesGaessler
Strictly speaking they would also be useful when working on e.g.
We should have tests for this in
This is not really what I am seeing; with q8_0 I see a 15-20% drop in performance. Is this the expected performance?

build: a5436a0 (2981)
When I refactored the code I disabled the compilation of most kernel variants, including the one for batch size 1 since I was only checking for correctness. But I accidentally committed this change so the tg performance was bad. This is the performance you should be seeing:
The pp performance is expected to be worse because there are no efficient kernels for batch sizes >> 1.
Yes, that's close to what I see now. It would be fairly easy to parallelize the compilation of the different template instantiations across multiple source files by using extern templates. Essentially you would have to remove the static from the templates, declare each instantiation as an extern template, and provide the explicit instantiations in separate source files.
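For reference, a minimal sketch of the extern-template idea described above (file boundaries shown as comments; all names are illustrative, not the actual llama.cpp files):

```cpp
// ---- fattn-vec.cuh (shared header) ----------------------------------------
// The template is defined here and is NOT static, so an explicit instantiation
// in another translation unit can provide the symbol.
template <int head_size, int type_K, int type_V>
void flash_attn_vec_case(float * dst, const void * K, const void * V) {
    // ... the real kernel launch would go here ...
    (void) dst; (void) K; (void) V;
}

// Explicit instantiation *declarations*: every file that includes this header
// sees them and skips compiling the template body for these cases.
extern template void flash_attn_vec_case<128, 0, 0>(float *, const void *, const void *);
extern template void flash_attn_vec_case<128, 8, 8>(float *, const void *, const void *);

// ---- fattn-vec-hs128-k8-v8.cu (one small file per instantiation) ----------
// #include "fattn-vec.cuh"
// Explicit instantiation *definition*: this is the only translation unit that
// actually compiles this variant, so the variants can be built in parallel.
template void flash_attn_vec_case<128, 8, 8>(float *, const void *, const void *);
```

The per-instantiation files could then be produced by a small script, which fits the split-by-K-type/V-type/head-size idea discussed later in this thread.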
Let's add this to disable CUDA graphs with quantized KV for now:

```diff
diff --git a/ggml-cuda.cu b/ggml-cuda.cu
index b82167cb..cd8b2e31 100644
--- a/ggml-cuda.cu
+++ b/ggml-cuda.cu
@@ -2585,6 +2585,14 @@ GGML_CALL static enum ggml_status ggml_backend_cuda_graph_compute(ggml_backend_t
 #endif
         }

+        if (node->op == GGML_OP_FLASH_ATTN_EXT && (ggml_is_quantized(node->src[1]->type) || ggml_is_quantized(node->src[2]->type))) {
+            // disable CUDA graphs for quantized FLASH_ATTN_EXT for now
+            use_cuda_graph = false;
+#ifndef NDEBUG
+            GGML_CUDA_LOG_WARN("%s: disabling CUDA graphs due to quantized FLASH_ATTN_EXT\n", __func__);
+#endif
+        }
+
         if (node->op == GGML_OP_CPY) {
             // store the copy op parameter which changes with each token.
             cuda_ctx->cuda_graph->updated_kernel_arg.push_back((char **) &(node->src[1]->data));
```
Force-pushed from 9c370ae to a62d7cb.
Force-pushed from 4d9ed09 to f4003cf.
I can't manage to wrangle the compiler into doing what I want; I think I'm doing something subtly wrong. @slaren can you post the discussed example? The way I would split the template variants across files is via K type, V type, and head size, since you can treat each tuple of those three values in the same way.
Here it is: 1ca802a. Probably the first macro is unnecessary.
Thank you very much for the example; I pushed a prototype for what I have in mind. Specifically:

Thoughts?
Looks good to me. I compared the compilation time with the previous commit:

Before:

After:

So it seems quite effective, although compilation with a single core will be significantly slower.
How are you getting the compilation times?
I modified the Makefile like this:

```diff
diff --git a/Makefile b/Makefile
index 278a06d2..db64b76b 100644
--- a/Makefile
+++ b/Makefile
@@ -504,7 +504,7 @@ define NVCC_COMPILE
 endef # NVCC_COMPILE
 else
 define NVCC_COMPILE
-	$(NVCC) $(NVCCFLAGS) $(CPPFLAGS) -Xcompiler "$(CUDA_CXXFLAGS)" -c $< -o $@
+	/bin/time -f "$< %e s" $(NVCC) $(NVCCFLAGS) $(CPPFLAGS) -Xcompiler "$(CUDA_CXXFLAGS)" -c $< -o $@
 endef # NVCC_COMPILE
 endif # JETSON_EOL_MODULE_DETECT
```
Force-pushed from 69515d0 to 84d9277.
I pushed an implementation for split compilation where each individual FlashAttention compilation job takes roughly as long as the other CUDA compilation units. The compile time is as follows:
I didn't test it in the same detail but the compilation time for
Shouldn't the generated files be added to the repository? The Python script is also missing the executable permission.
I was thinking it would be better to add the autogenerated files towards the end of the review process in case there are requests for changes.
The CI says:
I don't think it's a timeout, and it was cancelled before the compilation actually started, so I think it is unlikely to be caused by these changes. Let's see if it happens again.
We should add a check in
After 9b59641, flash attention doesn't work with phi-3 medium.
This PR adds CUDA support for quantizing the KV cache. Only the kernels optimized for small batch sizes are implemented, so prompt processing speed is still bad. Also, as of right now I have not added any mechanisms for reducing the compilation time, which currently sits at ~7 minutes on my desktop machine and is definitely too long. I think the way to go about this is to just not compile all possible combinations of KV cache types but rather only those that actually make sense to use based on the results in #7412. In my opinion compiling the reduced version should be the default, but I don't have a strong opinion on this. I would then add a compile flag to compile all combinations if one wants to. I consider a solution for the long compile time the only thing still missing from this PR.
I think the following type combinations make sense:
That would reduce the number of possible combinations from 36 to 10. Compilation may still take too long, so maybe adding yet another compile flag for fast compilation would make sense.
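To make the counting concrete, here is a small illustrative check; the exact set of cache types is an assumption, only the arithmetic matters:

```cpp
// Sketch: with 6 candidate cache types, every ordered (K, V) pair needs its
// own kernel instantiation, which is where the 36 comes from; hand-picking a
// short list of pairs is what brings it down to 10.
#include <array>
#include <cstdio>

int main() {
    constexpr std::array<const char *, 6> kv_types = {
        "f16", "q4_0", "q4_1", "q5_0", "q5_1", "q8_0", // assumed type set
    };
    constexpr std::size_t all_pairs = kv_types.size() * kv_types.size();
    std::printf("all (K, V) combinations: %zu\n", all_pairs); // 36
    return 0;
}
```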
Note: currently there is a bug on master that causes incorrect results when quantizing the K cache, see #7492. A workaround is to disable CUDA graphs via the environment variable `GGML_CUDA_DISABLE_GRAPHS=1`, but obviously that is going to result in worse performance.

Performance stays mostly the same with a quantized KV cache; quantizing K makes the performance slightly better, quantizing V makes the performance slightly worse.
I was unfortunately not able to recycle much of the existing code. The vector FlashAttention kernels are run with the same number of threads as the head size, so the existing code for dequantizing values that produces 2 values is not a good fit for V. Similarly for K the existing q4/q5 dot products are also a poor fit because those dot products work on chunks of `8*WARP_SIZE == 256` values in parallel while the head size of e.g. LLaMA 3 is only 128.
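As a rough illustration of that size mismatch (the constant names are made up for the sketch; the numbers are the ones from the text above):

```cpp
// Sketch: the vector FA kernels use one thread per element of the head, so a
// head size of 128 gives 128 threads per row, while the existing q4/q5 dot
// products consume 8*WARP_SIZE = 256 values per step, twice as many.
#include <cstdio>

int main() {
    constexpr int WARP_SIZE         = 32;
    constexpr int DOT_PRODUCT_CHUNK = 8 * WARP_SIZE; // 256 values per step
    constexpr int HEAD_SIZE_LLAMA3  = 128;           // threads per row in the vec kernel
    std::printf("chunk = %d, head size = %d, chunk/head = %d\n",
                DOT_PRODUCT_CHUNK, HEAD_SIZE_LLAMA3, DOT_PRODUCT_CHUNK / HEAD_SIZE_LLAMA3);
    return 0;
}
```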