[CUDA backend ONLY] Use just K-cache for MLA + FA: 47% saving on KV-cache size #13529
base: master
Conversation
Wow, this is pretty huge. Would go a long way towards supporting long contexts on potato devices.
You'll need to use something like this to get the merge of the 2 PRs:

git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
git fetch origin pull/13435/head:pr-13435
git fetch origin pull/13529/head:pr-13529
git merge pr-13435 --no-edit
git merge pr-13529 --no-edit

(there may be a better way, but I'm pretty dumb when it comes to using git)
Yeah, it's nuts: I can now get 160k context, with a […]
#13435 got merged, so this could be converted back to a PR now, I think.
It really needs people to test if the other back-ends can handle a non-contiguous V-cache view like this first, or preferably do as @JohannesGaessler suggested and make the other back-ends' FA implementation use just the K-cache like he did.
The CPU implementation does not seem to support it. This command generates junk for me:
Maybe look into fixing it and adding a test.
Does this have any speed implications by not storing the values?
Yeah, I think having a non-cont V-cache is likely to break most of the backends. I'll try and have a look at fixing this and adding the test today or tomorrow.
If the backends are rewritten to only use the K-cache, then it could have a big performance improvement for some of the backends due to not having to access the V-cache in memory (eg: a CPU's L3-cache will only have to store K-cache elements). If the backends are just fixed to use the non-contiguous view of the K-cache in place of the V-cache, there will likely be some degree of performance reduction though.
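To make the "non-contiguous view" option concrete, here is a minimal sketch (not the PR's actual code) of how V could be expressed as a view into the MLA K-cache using ggml's existing view API, assuming each K row is 576 F16 values of which the last 512 double as V:

// k_cache is assumed to be the per-layer MLA K-cache, shape [576, n_kv, n_head_kv].
// The view keeps K's row strides, so the resulting "V" tensor is non-contiguous.
const int64_t n_v    = 512;   // latent part reused as V
const int64_t n_rope = 64;    // RoPE'd prefix that V skips

struct ggml_tensor * v_view = ggml_view_3d(ctx, k_cache,
        n_v, k_cache->ne[1], k_cache->ne[2],
        k_cache->nb[1], k_cache->nb[2],
        n_rope*ggml_element_size(k_cache));

No separate V-cache is allocated at all; anything downstream has to respect v_view->nb[1] (which equals the full 576-element K row stride) instead of assuming 512 contiguous values per row.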
I changed the title to make it clear that it will only work for the CUDA backend for now. I'm away until Thursday so will try and get this working for CPU when I get back home then.
Just looking to see if I can revive this PR, but not sure about how to proceed: 1. Carry on using […]
The KV cache can have a flag […]. The other thing that is missing is to update the CPU implementation to support a non-contiguous V.
One thing to keep in mind is that (for the memory savings to be automatic) the allocation code would already need to check whether or not the non-contiguous V cache is supported. Unless, that is, the list of backends with FA support for Deepseek is very short and we could simply add support to those backends that don't yet have it (I don't have a good overview).
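A rough sketch of what such an allocation-time capability check could look like, purely illustrative (q, k, v_view, mask, kq_scale and backend are assumed to already exist; ggml_flash_attn_ext and ggml_backend_supports_op are ggml's existing public API):

// Build the flash-attention node with the strided V view and ask the backend
// whether it can actually execute it; if not, fall back to allocating a full
// contiguous V-cache so the memory saving is simply skipped rather than
// producing garbage output.
struct ggml_tensor * fa = ggml_flash_attn_ext(ctx, q, k, v_view, mask, kq_scale, 0.0f, 0.0f);
if (!ggml_backend_supports_op(backend, fa)) {
    // ... allocate a separate contiguous V-cache instead of reusing the K-cache ...
}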
Unless I did the implementation wrong, it should work even if V is null.
I'm still not sure if we should just leave the […]. Leaving it as a […]
271560e to 7d51644 (Compare)
Something seems to be reading […]. I think I've managed to get this back to the same state as before (ie: CUDA only, context-shifting disabled, and no tests yet).
Got a bit further figuring out where the CPU backend might be going wrong: llama.cpp/ggml/src/ggml-cpu/ops.cpp line 7101 (in ed52f36), which then gets treated as contiguous for these: lines 7117, 7133 and 7137 of the same file.
BUT, not 100% sure it's here, as there is this test at the top (llama.cpp/ggml/src/ggml-cpu/ops.cpp line 6978 in ed52f36) which should have asserted out before then...?

I can't really see a better fix than just making a temp copy of this using the stride:

const char * v_data = ((const char *) v->data + (ic*nbv1 + iv2*nbv2 + iv3*nbv3));

to pass as the parameters of the helper at llama.cpp/ggml/src/ggml-cpu/vec.h line 262 (in ed52f36).
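A minimal sketch of that temp-copy idea (illustrative only, not the actual ops.cpp change; the helper name is made up and the F16 cache type is an assumption):

#include <string.h>   // memcpy
#include "ggml.h"

// Gather one logical V row from the strided K-cache view into a contiguous
// scratch buffer, so the existing ggml_vec_* helpers (which assume a flat,
// contiguous V) can be reused unchanged. ic/iv2/iv3 are the same row/head/
// batch indices used in the surrounding flash-attention loop, and DV is the
// V head dimension.
static void gather_v_row_f16(ggml_fp16_t * dst, const struct ggml_tensor * v,
                             int64_t ic, int64_t iv2, int64_t iv3, int64_t DV) {
    const char * v_row = (const char *) v->data + ic*v->nb[1] + iv2*v->nb[2] + iv3*v->nb[3];
    memcpy(dst, v_row, DV*sizeof(ggml_fp16_t));   // each row itself is contiguous
}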
There is also the potential to store the 64-element RoPEed part separately and then […]
From #13435:
I've only tested this to work with #13435 for now, but it should still work with the other backends' flash attention implementations so long as they don't assume that the V-cache they are passed is contiguous.
The full context of 160k tokens now takes up less than 11GB! 😮
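For reference, a rough back-of-the-envelope check (assuming DeepSeek-R1/V3's 61 layers, an F16 MLA K-cache row of 576 values per token (512 latent + 64 RoPE'd), and 160k ≈ 163840 tokens):

163840 tokens x 576 values x 2 bytes x 61 layers ≈ 10.7 GiB   (K-cache only)
163840 tokens x (576 + 512) values x 2 bytes x 61 layers ≈ 20.3 GiB   (separate K-cache + V-cache)

which is consistent with the "less than 11GB" figure above and with the 47% saving in the title (512/1088 ≈ 47%).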
I've just disabled context shifting for now as, like I said in the other post, I'm not at all confident I can cleanly change all that is required to deal with the empty V-cache.
I will leave it as a draft for now, as the other backends' flash attention implementations need to either:
A. Check they can work with the non-contiguous V-cache passed to them.
B. Copy @JohannesGaessler's strategy of only using the last 512 elements of the K-cache in place of the V-cache.
I think (B) is preferable, as there are likely to be some significant gains possible regarding CPU cache (re-)use, etc.
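For illustration, a rough sketch of what (B) could look like inside a CPU-style FA inner loop, again assuming the 64 RoPE'd + 512 latent row layout described above (none of the names below are from the actual kernels):

// No V tensor is read at all: the "V" row is just the tail of the matching
// K-cache row (the last 512 of its 576 F16 values), so the values being
// accumulated are already hot in cache from the Q·K dot product.
const ggml_fp16_t * k_row = (const ggml_fp16_t *)((const char *) k->data + ik*k->nb[1]);
const ggml_fp16_t * v_row = k_row + 64;   // reuse the latent part in place of V
// ... compute the softmax weight from dot(q, k_row) and accumulate
//     acc[d] += weight * v_row[d] for d in [0, 512) as usual ...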