
MLA KV cache: fix split graph backend assignment when KV cache is stored on CPU #13648


Open
xiang1guo wants to merge 1 commit into master

Conversation


xiang1guo commented May 20, 2025

Background

A possible minor fix based on #12801, if I understand correctly.

With the new DS-V3 GGUF file for the reduced KV cache, if the user disables KV offload with LLAMA_ARG_NO_KV_OFFLOAD=1, the MLA computation is still assigned to the GPU backend (tested with the SYCL backend), which is unexpected in my understanding.

Solution

The root cause is that the output node of the last SV_absorb matmul for MLA is not controlled by the environment variable and gets assigned to the GPU backend during graph splitting (the expand-up stage). The node was added in #12801 for MLA with the reduced KV cache, so the existing setting here is bypassed by the new node. This PR forces that node onto the CPU backend when the user sets LLAMA_ARG_NO_KV_OFFLOAD=1, so that the computation between the KV cache store and the attention output is assigned to the CPU.
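
For illustration, a minimal sketch of the idea (not the exact diff): the helper below is a hypothetical wrapper around the existing graph-build callback, and the node name it checks is a placeholder for the SV_absorb output added in #12801.

#include <cstring>
#include "ggml-backend.h"

// Sketch only: when KV offload is disabled (LLAMA_ARG_NO_KV_OFFLOAD=1, i.e.
// cparams.offload_kqv == false), pin the output of the MLA SV_absorb matmul to
// the CPU backend so the scheduler keeps the whole KV-store -> attention-output
// chain on the CPU when it expands the graph splits.
static void pin_mla_output_to_cpu(ggml_backend_sched_t sched,
                                  ggml_backend_t       backend_cpu,
                                  struct ggml_tensor * cur,
                                  const char *         name,
                                  bool                 offload_kqv) {
    if (!offload_kqv && strcmp(name, "kqv_out_placeholder") == 0) { // hypothetical node name
        ggml_backend_sched_set_tensor_backend(sched, cur, backend_cpu);
    }
}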

Testing results (SYCL backend)

The model I used is the newly updated DeepSeek-V3-0324, Q4_K_M quantization:
https://huggingface.co/unsloth/DeepSeek-V3-0324-GGUF-UD/tree/main/Q4_K_M

Functionality-wise

LLAMA_ARG_NO_KV_OFFLOAD=1 forces the KV cache onto the CPU. Before the change:

## SPLIT #247: CPU # 2 inputs: [Kcur-28 (   1M)] [Vcur-28 (reshaped) (transposed) (   1M)] 
node #2242 (       CPY): cache_k_l28 (view) ( ( 306K) [  CPU         ]:        CPU#Kcur-28#0 (   1M) [ NULL         ]   cache_k_l28 (view) ( 306K) [  CPU         ]
node #2246 (       CPY): cache_v_l28 (view) ( (   1M) [  CPU         ]: CPU#Vcur-28 (reshape (   1M) [ NULL         ]   cache_v_l28 (view) (   1M) [  CPU         ]

## SPLIT #248: SYCL0 # 2 inputs: [cache_k_l28 (view) (permuted) (   1M)] [cache_v_l28 (view) (permuted) (   2M)] 
node #2252 (   MUL_MAT):            node_2252 ( 512M) [SYCL0         ]: SYCL0#cache_k_l28 (v (   1M) [ NULL         ]   Qcur-28 (permuted) ( 144M) [SYCL0         ]
node #2253 (  SOFT_MAX):            node_2253 ( 512M) [SYCL0         ]:            node_2252 ( 512M) [SYCL0         ]      SYCL0#leaf_13#0 (   4M) [ NULL         ]
node #2254 (   MUL_MAT):            node_2254 ( 128M) [SYCL0         ]: SYCL0#cache_v_l28 (v (   2M) [ NULL         ]            node_2253 ( 512M) [SYCL0         ]
node #2255 (   MUL_MAT):            node_2255 (  32M) [SYCL0         ]: blk.28.attn_v_b.weig (   4M) [SYCL0         ]            node_2254 ( 128M) [SYCL0         ]

## SPLIT #249: CPU # 1 inputs: [ (permuted) (  32M)] 
node #2257 (      CONT):           kqv_out-28 (  32M) [  CPU         ]:    CPU# (permuted)#0 (  32M) [ NULL         ]

LLAMA_ARG_NO_KV_OFFLOAD=1 forces the KV cache onto the CPU. After the change:

## SPLIT #14: CPU # 3 inputs: [Kcur-2 (   1M)] [Vcur-2 (reshaped) (transposed) (   1M)] [Qcur-2 (permuted) ( 144M)] 
node #119 (       CPY): cache_k_l2 (view) (c ( 306K) [  CPU         ]:         CPU#Kcur-2#0 (   1M) [ NULL         ]    cache_k_l2 (view) ( 306K) [  CPU         ]
node #123 (       CPY): cache_v_l2 (view) (c (   1M) [  CPU         ]: CPU#Vcur-2 (reshaped (   1M) [ NULL         ]    cache_v_l2 (view) (   1M) [  CPU         ]
node #129 (   MUL_MAT):             node_129 ( 512M) [  CPU         ]: cache_k_l2 (view) (p (   1M) [  CPU         ] CPU#Qcur-2 (permuted ( 144M) [ NULL         ]
node #130 (  SOFT_MAX):             node_130 ( 512M) [  CPU         ]:             node_129 ( 512M) [  CPU         ]              leaf_13 (   4M) [  CPU         ]
node #131 (   MUL_MAT):             node_131 ( 128M) [  CPU         ]: cache_v_l2 (view) (p (   2M) [  CPU         ]             node_130 ( 512M) [  CPU         ]

Performance-wise

The speedup depends on the CPU/GPU kernel implementations for MLA, and profiling shows low efficiency for the first matmul in the SYCL backend, so the numbers are not very meaningful; I include them for reference.
Prefill speedup: 7%
Decoding speedup: 40%

Testing results (CUDA backend, tested with a 4070)

The model is the same DeepSeek-V3-0324 Q4_K_M build as above.

Functionality-wise

LLAMA_ARG_NO_KV_OFFLOAD=1 forces the KV cache onto the CPU. Before the change:

## SPLIT #9: CPU # 2 inputs: [Kcur-0 (   1M)] [Vcur-0 (reshaped) (transposed) (   1M)] 
node # 23 (       CPY): cache_k_l0 (view) (c ( 306K) [  CPU         ]:         CPU#Kcur-0#0 (   1M) [ NULL         ]    cache_k_l0 (view) ( 306K) [  CPU         ]
node # 27 (       CPY): cache_v_l0 (view) (c (   1M) [  CPU         ]: CPU#Vcur-0 (reshaped (   1M) [ NULL         ]    cache_v_l0 (view) (   1M) [  CPU         ]

## SPLIT #10: CUDA0 # 3 inputs: [cache_k_l0 (view) (permuted) (   1M)] [leaf_13 (   4M)] [cache_v_l0 (view) (permuted) (   2M)] 
node # 33 (   MUL_MAT):              node_33 ( 512M) [CUDA0         ]: CUDA0#cache_k_l0 (vi (   1M) [ NULL         ]    Qcur-0 (permuted) ( 144M) [CUDA0         ]
node # 34 (  SOFT_MAX):              node_34 ( 512M) [CUDA0         ]:              node_33 ( 512M) [CUDA0         ]      CUDA0#leaf_13#0 (   4M) [ NULL         ]
node # 35 (   MUL_MAT):              node_35 ( 128M) [CUDA0         ]: CUDA0#cache_v_l0 (vi (   2M) [ NULL         ]              node_34 ( 512M) [CUDA0         ]

## SPLIT #11: CUDA0 # 1 inputs: [blk.0.attn_v_b.weight (   4M)] 
node # 36 (   MUL_MAT):              node_36 (  32M) [CUDA0         ]: CUDA0#blk.0.attn_v_b (   4M) [ NULL         ]              node_35 ( 128M) [CUDA0         ]

## SPLIT #12: CPU # 1 inputs: [ (permuted) (  32M)] 
node # 38 (      CONT):            kqv_out-0 (  32M) [  CPU         ]:    CPU# (permuted)#0 (  32M) [ NULL         ]

LLAMA_ARG_NO_KV_OFFLOAD=1 forces the KV cache onto the CPU. After the change:

## SPLIT #9: CPU # 3 inputs: [Kcur-0 (   1M)] [Vcur-0 (reshaped) (transposed) (   1M)] [Qcur-0 (permuted) ( 144M)] 
node # 23 (       CPY): cache_k_l0 (view) (c ( 306K) [  CPU         ]:         CPU#Kcur-0#0 (   1M) [ NULL         ]    cache_k_l0 (view) ( 306K) [  CPU         ]
node # 27 (       CPY): cache_v_l0 (view) (c (   1M) [  CPU         ]: CPU#Vcur-0 (reshaped (   1M) [ NULL         ]    cache_v_l0 (view) (   1M) [  CPU         ]
node # 33 (   MUL_MAT):              node_33 ( 512M) [  CPU         ]: cache_k_l0 (view) (p (   1M) [  CPU         ] CPU#Qcur-0 (permuted ( 144M) [ NULL         ]
node # 34 (  SOFT_MAX):              node_34 ( 512M) [  CPU         ]:              node_33 ( 512M) [  CPU         ]              leaf_13 (   4M) [  CPU         ]
node # 35 (   MUL_MAT):              node_35 ( 128M) [  CPU         ]: cache_v_l0 (view) (p (   2M) [  CPU         ]              node_34 ( 512M) [  CPU         ]
node # 36 (   MUL_MAT):              node_36 (  32M) [  CPU         ]: blk.0.attn_v_b.weigh (   4M) [  CPU         ]              node_35 ( 128M) [  CPU         ]
node # 38 (      CONT):            kqv_out-0 (  32M) [  CPU         ]:           (permuted) (  32M) [  CPU         ]

Performance-wise

Prefill: 94% of the master branch.
Decoding: 98.6% of the master branch.


xiang1guo commented May 20, 2025

@jukofyork Could you please take a look at this change? Thanks!

xiang1guo (Author)

@ggerganov Could you please take a look at this PR?

ggerganov (Member)

I think the change is good. I would like to see some reports on how much it affects performance on some CUDA configurations.

xiang1guo (Author)

> I think the change is good. I would like to see some reports on how much it affects performance on some CUDA configurations.

@ggerganov Thanks for the reply! I will try to find one and collect some data.

jukofyork (Collaborator)

> @jukofyork Could you please take a look at this change? Thanks!

Sorry for the slow reply - I'm away until late tomorrow, but will try to look at it then.

xiang1guo (Author)

> @jukofyork Could you please take a look at this change? Thanks!

> Sorry for the slow reply - I'm away until late tomorrow, but will try to look at it then.

Yeah, take your time on this one; I noticed your comment in the last PR that you are on leave :)

xiang1guo (Author)

> I think the change is good. I would like to see some reports on how much it affects performance on some CUDA configurations.

> @ggerganov Thanks for the reply! I will try to find one and collect some data.

@ggerganov I updated the CUDA backend performance data in the PR description; it is slightly slower than the master branch, which I think depends on the CPU and GPU kernel implementations.


slaren commented May 22, 2025

If using a CPU KV cache with MLA, I would suggest using --override-tensor to keep attn_v_b on the CPU. Otherwise, even with this change, this tensor is going to be stored in VRAM and copied back to the CPU on every evaluation.
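
For example, passing something like --override-tensor "attn_v_b=CPU" (a hypothetical pattern, assuming the flag's usual <tensor-name regex>=<buffer type> syntax) should keep those weights in host memory.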
