[WIP] Hybrid Memory Allocator #16178

WoosukKwon · 2025-04-07T09:28:10Z

Key differences from #16101:

Create only one specialized manager per attention type. For instance, Gemma 3 uses 2 managers (one full-attention manager and one SWA manager) instead of 6 (one full-attention manager and five SWA managers).
Do not use group_id when hashing a block.
Introduce hybrid allocators that each support a specific combination of attention types, instead of implementing generic logic that works for all possible combinations.

Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
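To make the "2 managers instead of 6" point concrete, here is a minimal sketch of grouping layers by attention type so that each distinct type gets one manager. It is not the code in this PR: `LayerSpec`, `group_by_attention_type`, the layer names, and the window size of 1024 are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class LayerSpec:
    """Hypothetical description of one attention layer (not a vLLM class)."""
    name: str
    attn_type: str                      # "full" or "swa"
    sliding_window: Optional[int] = None


def group_by_attention_type(
    layers: List[LayerSpec],
) -> Dict[Tuple[str, Optional[int]], List[str]]:
    """Map each distinct (attention type, window) pair to its layer names.

    One manager would then be created per key, rather than one per layer group.
    """
    groups: Dict[Tuple[str, Optional[int]], List[str]] = defaultdict(list)
    for layer in layers:
        groups[(layer.attn_type, layer.sliding_window)].append(layer.name)
    return dict(groups)


# Toy pattern: five SWA layers followed by one full-attention layer.
layers = [LayerSpec(f"layers.{i}", "swa", 1024) for i in range(5)]
layers.append(LayerSpec("layers.5", "full"))

managers = group_by_attention_type(layers)
for key, names in managers.items():
    print(key, names)
```

With this grouping the toy model needs two managers: one keyed by `("swa", 1024)` that covers five layers, and one keyed by `("full", None)` that covers the remaining layer.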

github-actions · 2025-04-07T09:28:18Z

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run fastcheck CI which starts running only a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on Buildkite UI (linked in the PR checks section) and unblock them. If you do not have permission to unblock, ping simon-mo or khluu to add you in our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

🚀

heheda12345 · 2025-04-07T12:46:19Z

This implementation is quite simple, and some of its design choices inspire me a lot. But I think it may be too specialized.
The design of this PR assumes that the block_size of all groups is the same, which does not hold when we want to extend the hybrid allocator to Jamba-like models. I think it's OK to specialize to the Gemma 3 and Llama 4 cases for now to simplify some logic, but we need to make sure that the design can be extended to more general cases.

For the key differences you mentioned:

2 managers vs. 6 managers: I prefer to implement 6 managers in the first hybrid allocator PR to keep things simple, and to use a follow-up PR to support 2 managers as efficiently as possible. The current implementation doesn't reduce the time complexity compared with the 6-manager design.
Hashing: I agree that, for now, we can assume that all groups have the same block_size so that the hashing logic stays unchanged.
"Introduce Hybrid Allocators that support specific combinations of attention": I think this is unnecessary, as the general implementation in [v1] Implement HybridKVCacheManager to support hybrid models with different KV cache type #16101 is clean and can achieve the same speed as your specialized version.
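The hashing point can be illustrated with a small sketch. It assumes a chained per-block hash over token IDs and a cache that maps each hash to the blocks held by each group. The names `hash_block`, `hash_request`, and `cache` are hypothetical, and SHA-256 is used only to keep the example deterministic; this is not vLLM's actual hashing code.

```python
import hashlib
from typing import Dict, List, Optional, Sequence


def hash_block(parent_hash: Optional[bytes], token_ids: Sequence[int]) -> bytes:
    """Hash one full block from its parent's hash and its tokens (no group_id)."""
    h = hashlib.sha256()
    h.update(parent_hash or b"")
    h.update(repr(tuple(token_ids)).encode())
    return h.digest()


def hash_request(token_ids: Sequence[int], block_size: int) -> List[bytes]:
    """Return one chained hash per full block of the request."""
    hashes: List[bytes] = []
    parent: Optional[bytes] = None
    for i in range(len(token_ids) // block_size):
        chunk = token_ids[i * block_size:(i + 1) * block_size]
        parent = hash_block(parent, chunk)
        hashes.append(parent)
    return hashes


# Because the hash ignores group_id, a single list of hashes serves every
# group. This only works if all groups share the same block_size.
block_hashes = hash_request(list(range(10)), block_size=4)

# Hypothetical cache layout: block hash -> {group_id: block_id}.
cache: Dict[bytes, Dict[int, int]] = {
    block_hashes[0]: {0: 11, 1: 27},  # cached for both groups
    block_hashes[1]: {0: 12},         # cached only for group 0
}
```

If groups had different block sizes, as in the Jamba-like case mentioned above, the block boundaries would differ per group and one shared hash list would no longer be enough.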

comaniac

My two cents to the discussion points:

What isn't clear to me in this PR is how we will initialize the hybrid memory allocator, specifically how we will pass arguments to SingleMemoryAllocator and FullAndSwaMemoryAllocator, given that they expect different arguments. If we could unify the arguments of HybridMemoryAllocator, it seems to me that we could support both 2 and 6 managers easily, so it wouldn't matter whether FullAndSwaMemoryAllocator in this PR has 2 or 6 managers.
Out of curiosity, is complexity the only reason for not hashing group IDs, or are there other motivations?
So the structure becomes KVCacheManager.HybridMemoryAllocator, IIUC. It seems reasonable to me. One point worth discussing is how people understand the scope of "manager" and "allocator". Intuitively, an "allocator" is in charge of allocating blocks, but this is not what the current allocator is doing. On the other hand, if we move all allocation logic to the allocator, then it seems unnecessary to introduce the allocator at all (just like Chen's PR). In short, if the current logic in the allocator is the only specialized logic for hybrid memory, then I'd prefer the solution in this PR (but with a name other than "allocator"); otherwise it might need more discussion.

comaniac · 2025-04-07T18:54:25Z

vllm/v1/core/specialized_manager.py

+            if all(group_id in cached_blocks for group_id in group_ids):
+                computed_blocks.append(cached_blocks)


Note: Probably need a fast path for the case of len(group_ids)==1.
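A sketch of what such a fast path could look like is below. It assumes a cache that maps each block hash to a dict of group_id to block, and it assumes the lookup stops at the first miss, which the quoted diff does not show. The function name `find_computed_blocks` and the cache layout are hypothetical.

```python
from typing import Dict, Hashable, List, Sequence


def find_computed_blocks(
    block_hashes: Sequence[Hashable],
    cache: Dict[Hashable, Dict[int, int]],
    group_ids: Sequence[int],
) -> List[Dict[int, int]]:
    """Return the cached entries for the longest prefix that hits in all groups."""
    computed_blocks: List[Dict[int, int]] = []

    if len(group_ids) == 1:
        # Fast path: a single membership test per block instead of all(...).
        group_id = group_ids[0]
        for block_hash in block_hashes:
            cached_blocks = cache.get(block_hash)
            if cached_blocks is None or group_id not in cached_blocks:
                break
            computed_blocks.append(cached_blocks)
        return computed_blocks

    # General path: every group must have the block cached.
    for block_hash in block_hashes:
        cached_blocks = cache.get(block_hash)
        if cached_blocks is None:
            break
        if not all(group_id in cached_blocks for group_id in group_ids):
            break  # assumed behavior on a partial hit
        computed_blocks.append(cached_blocks)
    return computed_blocks


cache = {"a": {0: 1, 1: 2}, "b": {0: 3}}
print(find_computed_blocks(["a", "b"], cache, [0]))
print(find_computed_blocks(["a", "b"], cache, [0, 1]))
```

Both paths return the same result for a single group; the fast path only avoids building a generator and calling `all` for every block.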


mergify · 2025-04-25T16:50:50Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @WoosukKwon.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

gemini-code-assist · 2025-06-11T03:18:42Z

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

WoosukKwon added 5 commits April 6, 2025 18:17

tmp

10e7965

Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>

merge main

9bc35db

Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>

minor

ddf5ae9

Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>

prototype

57b3e86

Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>

minor

a395714

Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>

WoosukKwon requested review from robertgshaw2-redhat, njhill, ywang96, comaniac and alexm-redhat as code owners April 7, 2025 09:28

WoosukKwon marked this pull request as draft April 7, 2025 09:28

mergify bot added the v1 label Apr 7, 2025

comaniac reviewed Apr 7, 2025


mergify bot added tpu Related to Google TPUs and removed tpu Related to Google TPUs labels Apr 9, 2025

WoosukKwon added 13 commits April 13, 2025 21:34

merge

eda8acc

Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>

Merge branch 'main' into woosuk-hybrid-mem

6804c5e

Merge branch 'main' into woosuk-hybrid-mem

6626828

Merge branch 'main' into woosuk-hybrid-mem

41b7d5d

wip

2cdcd22

Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>

wip

44c3d85

Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>

wip

f91b50f

Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>

minor

ea4bc01

Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>

minor

7d9e93b

Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>

minor

3e97ae4

Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>

wip

c454765

Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>

Merge branch 'main' into woosuk-hybrid-mem

8221c64

fix

fbfe9f2

Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>

WoosukKwon added 3 commits April 16, 2025 10:28

add get_hybrid

13274da

Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>

Merge branch 'main' into woosuk-hybrid-mem

8359f83

fix

0fa9747

Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>

mergify bot added the needs-rebase label Apr 25, 2025

heheda12345 mentioned this pull request Apr 30, 2025

[v1] Implement HybridKVCacheManager to support hybrid models with different KV cache type #16101

Closed

WoosukKwon closed this Jun 11, 2025
