[WIP] Llama 4: Hybrid KV buffer (disable radix attention) #5853
Motivation
We set KV buffers of different sizes for the global and local attention layers in Llama 4 to improve memory usage.
Modifications
We modify the KV buffer sizes in MHATokenToKVPool and set up two TokenToKVPoolAllocator instances, one for the global attention layers and one for the local attention layers.
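A minimal sketch of the per-layer dispatch idea, not the PR's actual code; names such as HybridKVAllocators, allocator_for_layer, and local_layer_ids are illustrative and do not correspond to SGLang identifiers:

```python
# Hedged sketch only: the general shape of holding two KV pool allocators
# and selecting one per layer. All names here are illustrative.
class HybridKVAllocators:
    def __init__(self, global_alloc, local_alloc, local_layer_ids):
        self.global_alloc = global_alloc    # full-size pool for global-attention layers
        self.local_alloc = local_alloc      # smaller pool for chunked local-attention layers
        self.local_layer_ids = set(local_layer_ids)

    def allocator_for_layer(self, layer_id):
        # Local-attention layers draw KV slots from the smaller pool;
        # global-attention layers keep the full-size pool.
        if layer_id in self.local_layer_ids:
            return self.local_alloc
        return self.global_alloc
```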
The hybrid ratio, a value between 0 and 1, can be set with --enable-hybrid-kvcache; the default is 1.0. A ratio of 0 means purely uniform allocation (local_size / global_size = 1), while 1.0 means fully hybrid allocation (local_size / global_size = local_attention_size / context_length).
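A rough sketch of how the ratio could map to the local pool size. The linear interpolation between the two endpoints is an assumption (the description above only specifies the behavior at ratio 0 and ratio 1), and all names below (compute_local_kv_tokens, global_num_tokens, etc.) are illustrative, not the PR's actual code:

```python
# Hedged sketch: translate the hybrid ratio into a local-attention KV pool size.
# Linear interpolation between the two endpoints is an assumption on our part.
def compute_local_kv_tokens(global_num_tokens: int,
                            local_attention_size: int,
                            context_length: int,
                            hybrid_ratio: float) -> int:
    # ratio 0 -> local pool equals the global pool (uniform allocation)
    # ratio 1 -> local pool shrinks to local_attention_size / context_length of the global pool
    hybrid_fraction = local_attention_size / context_length
    fraction = (1.0 - hybrid_ratio) * 1.0 + hybrid_ratio * hybrid_fraction
    return int(global_num_tokens * fraction)

# Example: with a 1M-token context and an 8K local-attention window, ratio 1.0
# makes the local pool roughly 0.8% of the global pool.
print(compute_local_kv_tokens(1_000_000, 8_192, 1_000_000, 1.0))  # -> 8192
```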
Currently, only page size = 1 is supported, and both radix attention and CUDA graph must be disabled.
Checklist