MLA kv cache: fix split graph backend assignment when kv cache store on CPU #13648
Background
A possible minor fix based on #12801, if I understand correctly.
With the new DS-V3 GGUF files for the reduced KV cache, if the user disables KV offload with `LLAMA_ARG_NO_KV_OFFLOAD=1`, the MLA computation is still assigned to the GPU backend (tested with the SYCL backend), which is unexpected in my understanding.
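For context, a minimal sketch of what that setting controls through the C API, assuming the current `llama.h` names (`offload_kqv` is the underlying context flag; the model path is a placeholder):

```cpp
// Keep the KV cache, and the attention ops that read it, on the CPU.
// LLAMA_ARG_NO_KV_OFFLOAD=1 (or --no-kv-offload) in the common CLI tools
// sets the equivalent of this context flag.
#include "llama.h"

int main() {
    llama_model_params mparams = llama_model_default_params();
    llama_model * model = llama_model_load_from_file("model.gguf", mparams); // placeholder path

    llama_context_params cparams = llama_context_default_params();
    cparams.offload_kqv = false; // same effect as LLAMA_ARG_NO_KV_OFFLOAD=1

    llama_context * ctx = llama_init_from_model(model, cparams);
    // ... run inference; the ops between the KV cache store and the
    // attention output are expected to stay on the CPU ...
    llama_free(ctx);
    llama_model_free(model);
    return 0;
}
```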
Solution
The root cause is that the output node of the last SV_absorb matmul for MLA is not controlled by the env setting and is assigned to the GPU backend during graph splitting (the expand-up stage). That node was added in #12801 for MLA with the reduced KV cache, and the existing pinning logic here is bypassed by the new node. This PR forces that node onto the CPU backend when the user sets `LLAMA_ARG_NO_KV_OFFLOAD=1`, so that the computation between the KV cache store and the attention output can be assigned to the CPU.
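For illustration, a minimal sketch of the pinning mechanism under discussion, not the exact PR diff: `ggml_backend_sched_set_tensor_backend()` is the real ggml API that the scheduler honors when splitting the graph, while the node name `mla_sv_absorb_out` is a hypothetical stand-in for the output of the last SV_absorb matmul.

```cpp
#include <string.h>
#include "ggml-backend.h"

// Pin a node to the CPU backend when KV offload is disabled. The scheduler
// respects this assignment during graph splitting, so the nodes between the
// KV cache store and the attention output stay on the CPU instead of
// triggering CPU<->GPU copies of the KV data on every split.
static void pin_mla_output_to_cpu(ggml_backend_sched_t sched,
                                  ggml_backend_t backend_cpu,
                                  struct ggml_tensor * node,
                                  bool offload_kqv) {
    if (offload_kqv) {
        return; // KV cache lives on the GPU; no pinning needed
    }
    if (strcmp(node->name, "mla_sv_absorb_out") == 0) { // hypothetical node name
        ggml_backend_sched_set_tensor_backend(sched, node, backend_cpu);
    }
}
```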
Testing results (SYCL backend)
The model I used is the newly updated DeepSeek-V3-0324, Q4_K_M version: https://huggingface.co/unsloth/DeepSeek-V3-0324-GGUF-UD/tree/main/Q4_K_M
Functionality-wise
With `LLAMA_ARG_NO_KV_OFFLOAD=1` (KV cache forced to CPU), before the change: (screenshot)
With `LLAMA_ARG_NO_KV_OFFLOAD=1` (KV cache forced to CPU), after the change: (screenshot)
Performance-wise
The speedup depends on the CPU/GPU kernel implementations for MLA, and profiling shows low efficiency in the first matmul on the SYCL backend, so the speedup numbers are not very meaningful; I put them here only for reference.
Prefill speedup: 7%
Decoding speedup: 40%
Testing results (CUDA backend, tested with an RTX 4070)
The model is the same newly updated DeepSeek-V3-0324, Q4_K_M version: https://huggingface.co/unsloth/DeepSeek-V3-0324-GGUF-UD/tree/main/Q4_K_M
Functionality-wise
With `LLAMA_ARG_NO_KV_OFFLOAD=1` (KV cache forced to CPU), before the change: (screenshot)
With `LLAMA_ARG_NO_KV_OFFLOAD=1` (KV cache forced to CPU), after the change: (screenshot)
Performance-wise
Prefill: 94% of master.
Decoding: 98.6% of master.