Skip to content

[WIP] Hybrid Memory Allocator #16178

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 21 commits into from
Closed

[WIP] Hybrid Memory Allocator #16178

wants to merge 21 commits into from

Conversation

WoosukKwon
Copy link
Collaborator

@WoosukKwon WoosukKwon commented Apr 7, 2025

Key differences from #16101:

  1. Only create one specialize manager for each type of attention. For instance, Gemma 3 uses 2 managers (a full attention manager and a SWA manager) instead of 6 (one full attention manager and five SWA managers).
  2. Do not use group_id in hashing the block.
  3. Introduce Hybrid Allocators that support specific combinations of attention, instead of implementing a generic logic that works for all possible combinations.

Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Copy link

github-actions bot commented Apr 7, 2025

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run fastcheck CI which starts running only a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on Buildkite UI (linked in the PR checks section) and unblock them. If you do not have permission to unblock, ping simon-mo or khluu to add you in our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

🚀

@WoosukKwon WoosukKwon marked this pull request as draft April 7, 2025 09:28
@mergify mergify bot added the v1 label Apr 7, 2025
@heheda12345
Copy link
Collaborator

This implementation is quite simple and some designs inspire me a lot. But I think it may be too specialized.
The design of this PR is based on the assumption that the block_size of all groups are the same, which is not true when we want to extend the hybrid allocator to Jamba-like models. I think it's OK to specialize to gemma-3 and llama-4 cases at this moment to simplify some logic, but we need to make sure that the design can be extended to more general cases.

For the key differences you mentioned:

  1. 2 managers vs 6 managers, I prefer to implement 6 managers in the first hybrid allocator PR to keep things simple and use a follow-up PR to support 2 managers as efficient as possible. The current implementation doesn't reduce the time complexity compared with the 6 manager design.
  2. Hashing. I agree that we can make the assumption that all groups have the same block_size at this moment to keep the hashing logic unchanged.
  3. "Introduce Hybrid Allocators that support specific combinations of attention" I think it is unnecessary as the general implementation in [v1] Implement HybridKVCacheManager to support hybrid models with different KV cache type #16101 is clean and can achieve the same speed as your specialized version.

Copy link
Collaborator

@comaniac comaniac left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My two cents to the discussion points:

  1. What isn't clear to me in this PR is how will we initialize the hybrid memory allocator, specifically how will we pass arguments to SingleMemoryAllocator and FullAndSwaMemoryAllocator given they expect different arguments. If we could unify the arguments of HybridMemoryAllocator, it seems to me that we are able to support both 2 and 6 managers easily, so that it doesn't matter whether FullAndSwaMemoryAllocator in this PR has 2 or 6 managers.
  2. ooc, is the reason of not hashing group IDs only because of the complexity, or is there any other motivations?
  3. So the structure becomes KVCacheManager.HybridMemoryAllocator IIUC. It seems reasonable to me. One point worth to discuss is how people understand the scope of "manager" and "allocator". Intuitively, "allocator" is in charge of allocating blocks, but this is not the current allocator is doing. On the other hand, if we move all allocation logic to the allocator, then it seems unnecessary to introduce the allocator at all (just like Chen's PR). In short, if the current logic in the allocator is the only specialized logic for hybrid memory, then I'd prefer the solution in this PR (but with another name than the allocator); otherwise it might need more discussions.

Comment on lines 92 to 93
if all(group_id in cached_blocks for group_id in group_ids):
computed_blocks.append(cached_blocks)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: Probably need a fast path for the case of len(group_ids)==1.

@mergify mergify bot added tpu Related to Google TPUs and removed tpu Related to Google TPUs labels Apr 9, 2025
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Copy link

mergify bot commented Apr 25, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @WoosukKwon.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

Copy link
Contributor

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants