kv-cache : prepare K/V buffers for separation #14517
                
from #14363
Currently, the K and V buffers in the unified KV cache are shared among all the participating sequences (hence the name "unified"). With the upcoming change #14363, the buffers can become separate from each other in order to increase the throughput for parallel decoding use cases. This PR is a preparation step to support that.
There should be no functional changes.
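To illustrate the direction this prepares for, here is a conceptual sketch only; the struct names and the "stream" terminology below are illustrative and not the actual `llama-kv-cache-unified` types. The unified cache keeps a single K and a single V buffer per layer shared by all sequences, while the planned separation would keep one K/V pair per sequence (stream) so parallel sequences no longer share a buffer:

```cpp
// Illustrative sketch only - not the actual llama.cpp structures.
#include <vector>

struct ggml_tensor; // opaque ggml tensor handle

// today: one K and one V buffer per layer, shared ("unified") across all sequences
struct kv_layer_unified {
    ggml_tensor * k; // [n_embd_k_gqa, kv_size]
    ggml_tensor * v; // [n_embd_v_gqa, kv_size]
};

// after #14363: one K/V pair per sequence/stream, so sequences can occupy separate buffers
struct kv_layer_separated {
    std::vector<ggml_tensor *> k_stream; // k_stream[s] holds K for stream s
    std::vector<ggml_tensor *> v_stream; // v_stream[s] holds V for stream s
};
```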
Handling of variable V heads is also done when `ggml_set_rows()` is used:

```sh
LLAMA_SET_ROWS=1 ./bin/llama-cli -hf mradermacher/OpenELM-3B-Instruct-GGUF:Q8_0 \
    -p "I believe the meaning of life is" -no-cnv -n 32 -t 1 -s 2 --top-k 1
```
The only new restriction is that we require the number of KV heads for all layers to be equal:
llama.cpp/src/llama-kv-cache-unified.cpp, lines 70 to 77 @ 40f8c48
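The referenced lines are not reproduced here. As a rough, self-contained sketch of what such a check amounts to (in llama.cpp the per-layer count would presumably come from something like `hparams.n_head_kv(il)`; the helper below is hypothetical):

```cpp
// Hypothetical sketch: require that every layer reports the same KV head count.
#include <cassert>
#include <cstdint>
#include <vector>

static void check_uniform_kv_heads(const std::vector<uint32_t> & n_head_kv_per_layer) {
    for (size_t il = 1; il < n_head_kv_per_layer.size(); ++il) {
        // the prepared K/V buffers assume a single KV head count across all layers
        assert(n_head_kv_per_layer[il] == n_head_kv_per_layer[0]);
    }
}
```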
Support for a varying number of KV heads should be simple - just need to make the correct view of `v_idxs` when FA is disabled. But leaving this for when we actually need it.