Faster MLA prompt processing #205
Conversation
It is either the standard KV cache or MLA cache, not both.
Much easier to follow, at least for my brain, when we have `X_rope` (rotational position encoding) and `X_nope` (no position encoding) instead of `X_pe` and `X_nope`, where I was wondering wtf 'pe' and 'nope' are.
@@ -3178,33 +3178,30 @@ static bool llama_kv_cache_init(
        ggml_tensor * k;
        ggml_tensor * v;
        if (cparams.mla_attn && model.layers[i].wk_b && model.layers[i].wv_b) {
We might want to print something if mla_attn is requested but not able to be run, instead of just silently failing over to standard attention. I just saw a report of a user who did not realize this was happening and could not tell why MLA was not giving any performance difference.
Thanks. Added a hopefully visible warning.
Cuts the KV cache size nearly in half at the expense of slower TG performance for long contexts (it becomes similar to no-MLA). The PR also adds a compile-time option to disable the transposed KV cache when using MLA (simply look for
This PR speeds up prompt processing (PP) when MLA is enabled. It is still slower than no-MLA, so I'm making this a draft for now to try some more. Still, it would be great if somebody else tested it to confirm that (a) I did not introduce bugs, and (b) it is indeed faster on their systems.

The PR also adds the changes suggested by @saood06 in the review of #188.
Speedup is achieved by concatenating the no- and rotational position encoding parts of `K` and `Q` (this also eliminates the `k_r` cache), which allows us to combine the former `kq_nope` and `kq_pe` matrix multiplications into a single matrix multiplication. This also eliminates the fairly expensive addition of `kq_nope` and `kq_pe`.

Here is a comparison of PP performance between the main branch and this PR for DeepSeek-Lite quantized with `IQ4_XS`, running on a Ryzen-7950X and using `Q8_0` for the K-cache.

TG performance (the whole point of MLA) is not sacrificed. Here are the results of `llama-bench -gp Np,64` for different prompt lengths `Np`.

Not sure if the ~9% improvement at 16k tokens is real. It may just be due to less thermal throttling because the prompt processing part finishes quicker.