[WIP] Llama 4: Hybrid KV buffer (disable radix attention) #5853
Motivation
We set KV buffers of different sizes for the global and local attention layers in Llama 4 to improve memory usage.
Modifications
We modify the KV buffer sizes in MHATokenToKVPool and set up two TokenToKVPoolAllocator instances, one for the global attention layers and one for the local attention layers.
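A minimal sketch of the per-layer dispatch idea, not the PR's actual code; names such as HybridKVAllocators, allocator_for_layer, and local_layer_ids are illustrative and do not correspond to SGLang identifiers:

```python
# Hedged sketch only: the general shape of holding two KV pool allocators
# and selecting one per layer. All names here are illustrative.
class HybridKVAllocators:
    def __init__(self, global_alloc, local_alloc, local_layer_ids):
        self.global_alloc = global_alloc    # full-size pool for global-attention layers
        self.local_alloc = local_alloc      # smaller pool for chunked local-attention layers
        self.local_layer_ids = set(local_layer_ids)

    def allocator_for_layer(self, layer_id):
        # Local-attention layers draw KV slots from the smaller pool;
        # global-attention layers keep the full-size pool.
        if layer_id in self.local_layer_ids:
            return self.local_alloc
        return self.global_alloc
```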
The hybrid ratio, a value between 0 and 1, can be set with --enable-hybrid-kvcache; the default is 1.0. A ratio of 0 means purely uniform allocation (local_size / global_size = 1), while 1.0 means fully hybrid allocation (local_size / global_size = local_attention_size / context_length).
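A rough sketch of how the ratio could map to the local pool size. The linear interpolation between the two endpoints is an assumption (the description above only specifies the behavior at ratio 0 and ratio 1), and all names below (compute_local_kv_tokens, global_num_tokens, etc.) are illustrative, not the PR's actual code:

```python
# Hedged sketch: translate the hybrid ratio into a local-attention KV pool size.
# Linear interpolation between the two endpoints is an assumption on our part.
def compute_local_kv_tokens(global_num_tokens: int,
                            local_attention_size: int,
                            context_length: int,
                            hybrid_ratio: float) -> int:
    # ratio 0 -> local pool equals the global pool (uniform allocation)
    # ratio 1 -> local pool shrinks to local_attention_size / context_length of the global pool
    hybrid_fraction = local_attention_size / context_length
    fraction = (1.0 - hybrid_ratio) * 1.0 + hybrid_ratio * hybrid_fraction
    return int(global_num_tokens * fraction)

# Example: with a 1M-token context and an 8K local-attention window, ratio 1.0
# makes the local pool roughly 0.8% of the global pool.
print(compute_local_kv_tokens(1_000_000, 8_192, 1_000_000, 1.0))  # -> 8192
```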
Currently, only page size = 1 is supported, and both radix attention and CUDA graph must be disabled.
Checklist