Conversation

@ikawrakow (Owner) commented Feb 12, 2025

This PR speeds up prompt processing (PP) when MLA is enabled. It is still slower than no-MLA, so I'm making this a draft for now while I try some more. Still, it would be great if somebody else tested it to confirm that a) I did not introduce bugs, and b) it is indeed faster on their systems.

The PR also adds the changes suggested by @saood06 in the review of #188.

Speedup is achieved by concatenating the no-position-encoding (nope) and rotational-position-encoding (rope) parts of K and Q (this also eliminates the k_r cache), which allows us to combine the former kq_nope and kq_pe matrix multiplications into a single matrix multiplication. This also eliminates the fairly expensive addition of kq_nope and kq_pe.
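This works because the dot product of two concatenated vectors decomposes into the sum of the dot products of their parts, so the add gets absorbed into one matmul. A minimal ggml-style sketch of the idea (the function name, tensor names q_nope/q_rope/k_nope/k_rope, and the layout with the per-head feature dimension in dim 0 are assumptions for illustration, not the actual PR code):

```c
#include "ggml.h"

// Sketch of the fused attention-score computation.
static struct ggml_tensor * attn_scores_fused(struct ggml_context * ctx,
        struct ggml_tensor * q_nope, struct ggml_tensor * q_rope,
        struct ggml_tensor * k_nope, struct ggml_tensor * k_rope) {
    // before (conceptually): two matmuls over the split parts, then an add
    //   kq = ggml_add(ctx, ggml_mul_mat(ctx, k_nope, q_nope),   // kq_nope
    //                      ggml_mul_mat(ctx, k_rope, q_rope));  // kq_rope
    // after: concatenate once along dim 0, then a single matmul; since
    // dot(cat(a,b), cat(c,d)) = dot(a,c) + dot(b,d) the result is identical,
    // and k_rope no longer needs its own cache (k_r).
    struct ggml_tensor * q = ggml_concat(ctx, q_nope, q_rope, 0);
    struct ggml_tensor * k = ggml_concat(ctx, k_nope, k_rope, 0);
    return ggml_mul_mat(ctx, k, q);
}
```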

Here is a comparison of PP performance on the main branch and this PR for DeepSeek-Lite quantized with IQ4_XS, running on a Ryzen-7950X with Q8_0 for the K-cache:

| model | test | t/s (main) | t/s (PR) | Speedup |
| --- | --- | --- | --- | --- |
| deepseek2 16B IQ4_XS | pp512 | 478.58 ± 5.14 | 489.40 ± 1.08 | 1.023 |
| deepseek2 16B IQ4_XS | pp1024 | 438.56 ± 0.75 | 458.37 ± 1.51 | 1.045 |
| deepseek2 16B IQ4_XS | pp2048 | 378.95 ± 1.40 | 407.83 ± 2.07 | 1.076 |
| deepseek2 16B IQ4_XS | pp4096 | 294.71 ± 2.86 | 327.88 ± 0.18 | 1.113 |
| deepseek2 16B IQ4_XS | pp8192 | 204.52 ± 0.27 | 234.17 ± 0.37 | 1.145 |
| deepseek2 16B IQ4_XS | pp16384 | 126.31 ± 0.13 | 148.35 ± 0.38 | 1.174 |

TG performance (the whole point of MLA) is not sacrificed. Here are the results of `llama-bench -gp Np,64` for different prompt lengths Np:

| model | test | t/s (main) | t/s (PR) | Speedup |
| --- | --- | --- | --- | --- |
| deepseek2 16B IQ4_XS | tg64@pp128 | 33.58 ± 0.06 | 33.80 ± 0.00 | 1.007 |
| deepseek2 16B IQ4_XS | tg64@pp256 | 32.67 ± 0.00 | 32.76 ± 0.01 | 1.003 |
| deepseek2 16B IQ4_XS | tg64@pp512 | 32.38 ± 0.08 | 32.68 ± 0.05 | 1.009 |
| deepseek2 16B IQ4_XS | tg64@pp1024 | 31.50 ± 0.02 | 32.02 ± 0.00 | 1.017 |
| deepseek2 16B IQ4_XS | tg64@pp2048 | 30.01 ± 0.01 | 30.31 ± 0.03 | 1.010 |
| deepseek2 16B IQ4_XS | tg64@pp4096 | 27.08 ± 0.03 | 27.54 ± 0.10 | 1.017 |
| deepseek2 16B IQ4_XS | tg64@pp8192 | 22.82 ± 0.00 | 23.12 ± 0.01 | 1.013 |
| deepseek2 16B IQ4_XS | tg64@pp16384 | 17.24 ± 0.00 | 18.74 ± 0.09 | 1.087 |

Not sure if the ~9% improvement at 16k tokens is real. It may just be due to less thermal throttling because the prompt processing part finishes quicker.

Iwan Kawrakow added 6 commits February 12, 2025 07:41
It is either the standard KV cache or MLA cache, not both.
Much easier to follow, at least for my brain, when we have
  X_rope : rotational position encoding
  X_nope :         no position encoding
instead of X_pe and X_nope, where I was wondering wtf is 'pe'
and 'nope'.
```diff
@@ -3178,33 +3178,30 @@ static bool llama_kv_cache_init(
         ggml_tensor * k;
         ggml_tensor * v;
         if (cparams.mla_attn && model.layers[i].wk_b && model.layers[i].wv_b) {
```
Collaborator
We might want to print something if mla_attn is requested but not able to run, instead of just silently falling back to standard attention. I just saw a report from a user who did not realize this was happening and could not figure out why MLA was not giving any performance difference.
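A hypothetical sketch of such a warning (the tensor names match the hunk above; the message text and exact placement are assumptions, not necessarily what was committed):

```c
// Illustrative only: warn instead of silently falling back when MLA is
// requested but the model lacks the decomposed attention tensors it needs.
if (cparams.mla_attn && !(model.layers[i].wk_b && model.layers[i].wv_b)) {
    LLAMA_LOG_WARN("%s: mla_attn is enabled, but the model is missing the "
                   "wk_b/wv_b tensors required for MLA; "
                   "falling back to standard attention\n", __func__);
}
```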

ikawrakow (Owner, Author)
Thanks. Added a hopefully visible warning.

@ikawrakow marked this pull request as ready for review February 13, 2025 06:49
Cuts KV cache size nearly in half at the expense of slower
TG performance for long contexts (it becomes similar to
no-MLA).
@ikawrakow (Owner, Author)
The PR also adds a compile-time option to disable the transposed KV cache when using MLA (simply look for MLA_USE_TRANSPOSED_CACHE and set it to 0). This cuts KV cache size nearly in half at the expense of lower TG performance with long contexts. PP performance stays about the same. Here is a comparison between MLA with and without the transposed cache:

| model | test | t/s (with c^T) | t/s (without c^T) |
| --- | --- | --- | --- |
| deepseek2 16B IQ4_XS | tg64@pp128 | 33.58 ± 0.06 | 33.05 ± 0.05 |
| deepseek2 16B IQ4_XS | tg64@pp256 | 32.67 ± 0.00 | 31.54 ± 0.07 |
| deepseek2 16B IQ4_XS | tg64@pp512 | 32.38 ± 0.08 | 30.26 ± 0.33 |
| deepseek2 16B IQ4_XS | tg64@pp1024 | 31.50 ± 0.02 | 28.50 ± 0.01 |
| deepseek2 16B IQ4_XS | tg64@pp2048 | 30.01 ± 0.01 | 24.75 ± 0.01 |
| deepseek2 16B IQ4_XS | tg64@pp4096 | 27.08 ± 0.03 | 20.67 ± 0.09 |
| deepseek2 16B IQ4_XS | tg64@pp8192 | 22.82 ± 0.00 | 14.89 ± 0.01 |
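For illustration, the switch follows the usual compile-time pattern below (MLA_USE_TRANSPOSED_CACHE is the macro named above; the function, tensor names, and shapes are a hedged sketch, not the actual implementation):

```c
#include "ggml.h"

#define MLA_USE_TRANSPOSED_CACHE 1  // set to 0 to nearly halve MLA cache size

// Sketch only: gate the allocation of the transposed latent-cache copy.
static void init_mla_cache(struct ggml_context * ctx, enum ggml_type type,
                           int64_t rank, int64_t kv_size) {
    // the regular latent cache is always allocated
    struct ggml_tensor * kv = ggml_new_tensor_2d(ctx, type, rank, kv_size);
    (void) kv;   // stored in the cache structure in the real code
#if MLA_USE_TRANSPOSED_CACHE
    // the transposed copy c^T keeps TG fast at long contexts,
    // at the cost of storing the latent cache roughly twice
    struct ggml_tensor * kvt = ggml_new_tensor_2d(ctx, type, kv_size, rank);
    (void) kvt;
#endif
}
```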

@ikawrakow merged commit 05242ff into main Feb 13, 2025