[kv_offload+HMA][1/N]: Worker-side support for multiple HMA groups #34680
Open
orozery wants to merge 1 commit into vllm-project:main
This commit enables the kv_offload worker to load KV data for multiple KV cache groups.

The worker still receives a single list of block IDs to load, ordered by group (group0_block_ids... group1_block_ids...). We add a field, group_sizes, which encodes the number of block IDs in each group. This lets the worker determine the positions at which each new group of block IDs starts. The worker needs these positions to decide whether the corresponding data on the offloaded medium (e.g. CPU) should be skipped when offloaded_block_size > logical_block_size (gpu_block_size).

Signed-off-by: Or Ozeri <oro@il.ibm.com>
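The role of group_sizes described above can be illustrated with a minimal sketch. It is not the code from this PR: the helper name `split_by_group` and the sample block IDs are hypothetical, and it only shows how a flat, group-ordered list can be cut at the group boundaries.

```python
from itertools import accumulate


def split_by_group(block_ids: list[int], group_sizes: list[int]) -> list[list[int]]:
    """Split a flat, group-ordered list of block IDs into per-group lists.

    Hypothetical helper for illustration; not part of vLLM.
    """
    if sum(group_sizes) != len(block_ids):
        raise ValueError("group_sizes must sum to the number of block IDs")
    # Start offset of each group within the flat list: [0, s0, s0 + s1, ...]
    starts = [0, *accumulate(group_sizes)]
    return [block_ids[starts[i]:starts[i + 1]] for i in range(len(group_sizes))]


# Three block IDs belong to group 0, two to group 1 (made-up values).
flat_block_ids = [10, 11, 12, 40, 41]
groups = split_by_group(flat_block_ids, group_sizes=[3, 2])
print(groups)  # [[10, 11, 12], [40, 41]]
```

Keeping one flat list plus a list of sizes leaves the single-group case unchanged: a single entry in group_sizes covers the whole list.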
Force-pushed from 8ea97a9 to 13fc036
Code Review
The pull request successfully introduces support for multiple KV cache groups in the kv_offload worker. This is achieved by adding a group_sizes field to the load/store specifications, allowing the worker to correctly handle unaligned transfers (where offloaded block size exceeds logical block size) for each group independently. The changes include updates to the metadata classes, the core offloading handler logic, and comprehensive test coverage. The implementation maintains backward compatibility for single-group transfers and includes robust assertions to ensure data integrity during the transfer process.
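To see why the unaligned case has to be evaluated per group, consider the following sketch. It rests on an assumption not stated in this PR: that each offloaded block holds `block_size_factor` logical blocks, and that a load covering `n` logical blocks skips the leading `-n % block_size_factor` sub-blocks of its first offloaded block. The function name and the numbers are hypothetical.

```python
def sub_blocks_to_skip(group_sizes: list[int], block_size_factor: int) -> list[int]:
    """Per-group count of leading sub-blocks to skip (illustrative assumption).

    block_size_factor stands for offloaded_block_size // gpu_block_size.
    """
    if block_size_factor <= 0:
        raise ValueError("block_size_factor must be positive")
    # -n % f is the distance from n up to the next multiple of f.
    return [-n % block_size_factor for n in group_sizes]


# Two groups with 3 and 2 logical blocks; 4 logical blocks per offloaded block.
per_group = sub_blocks_to_skip([3, 2], block_size_factor=4)
# Treating the same 5 block IDs as one group gives a different answer.
as_one_group = sub_blocks_to_skip([5], block_size_factor=4)
print(per_group, as_one_group)  # [1, 2] [3]
```

Under that assumption, ignoring the group boundaries would skip the wrong amount of offloaded data, which is why the worker needs the group start positions.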