[V1][P/D] Local attention optimization for NIXL #18170
Conversation
I think this PR is not needed once #17996 and a follow-up PR (for Llama 4 support) are merged.
Agree with @WoosukKwon. You can get the block_ids that need to be transferred from the P node to the D node with #17996. Let's discuss more after that PR.
It makes sense to use the hybrid memory allocator eventually; however, we needed this optimization now, which is why I worked on it.
Not critical, but you can actually transfer less than `math.ceil(chunk_size / self.block_size)` blocks in most cases: for local attention you only need to transfer `math.ceil((seqlen % chunk_size) / self.block_size)`.
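A minimal sketch of the two block counts the reviewer is contrasting, using hypothetical `block_size` and `seqlen` values that are not taken from the PR:

```python
import math

chunk_size = 8192   # attention_chunk_size for Llama 4 local-attention layers
block_size = 16     # hypothetical KV-cache block size
seqlen = 100_000    # hypothetical prompt length on the prefill node

# What the PR transfers for a local-attention layer: one full chunk's worth.
blocks_full_chunk = math.ceil(chunk_size / block_size)             # 512 blocks

# The reviewer's point: only the tokens in the last (partial) chunk are
# attended to, so this smaller count usually suffices.
blocks_last_chunk = math.ceil((seqlen % chunk_size) / block_size)  # 106 blocks
```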
@mgoin Got it. I'm comfortable merging this PR as a temporary fix. Please just add a …
Optimizes the V1 NixlConnector for Llama 4 models by implementing layer-specific KV cache transfers.
During the worker's `register_kv_caches` phase, it pre-calculates the attention window size (in blocks) for each layer based on the Llama 4 config (`no_rope_layers` and `attention_chunk_size`). For RoPE layers (local attention), this identifies the specific chunk of KV cache needed, while NoPE layers (global attention) are marked to use their full cache.
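A rough sketch of how such a per-layer pre-calculation could look (not the PR's actual code): `no_rope_layers` and `attention_chunk_size` come from the Hugging Face Llama 4 text config, while the helper name and the assumption that a truthy `no_rope_layers[i]` marks a RoPE/local-attention layer are mine.

```python
import math
from typing import Optional


def compute_layer_window_blocks(hf_config, block_size: int) -> list[Optional[int]]:
    """Hypothetical helper: attention window per layer, in KV-cache blocks.

    Returns None for NoPE (global-attention) layers, which transfer their
    full KV cache, and a block count for RoPE (local-attention) layers.
    """
    windows: list[Optional[int]] = []
    for layer_idx in range(hf_config.num_hidden_layers):
        # Assumption: a truthy entry in no_rope_layers means the layer uses
        # RoPE, i.e. chunked local attention.
        if hf_config.no_rope_layers[layer_idx]:
            windows.append(math.ceil(hf_config.attention_chunk_size / block_size))
        else:
            windows.append(None)  # NoPE layer: global attention, full cache
    return windows
```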
When a KV transfer is initiated in `start_load_kv`, the connector iterates through each layer. If a layer is identified as having a local attention window, the list of physical block IDs to be transferred is truncated to match this window, selecting only the most recent, relevant blocks. The `_get_block_descs_ids` method now takes a `layer_idx` to generate NIXL descriptors corresponding only to the memory regions of that specific layer and the selected (potentially chunked) block IDs.

This reduces the data transferred for Llama 4's RoPE layers (local attention with `chunk_size=8192`), leading to improved TTFT. Thanks to @tlrmchlsmth for benchmarking, which showed a reduction in multi-node NIXL P/D overhead from ~10% to ~5% on 4xH200 Llama-4-Scout for 100k input tokens, compared to TTFT on a single node.
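A minimal sketch of the two pieces described above, with illustrative names (`window_blocks`, `num_blocks_per_layer`) rather than the connector's actual attributes, and assuming NIXL descriptors are registered layer-major:

```python
def select_block_ids_for_layer(block_ids: list[int],
                               window_blocks: int | None) -> list[int]:
    """Keep only the most recent blocks for a local-attention layer."""
    if window_blocks is None or len(block_ids) <= window_blocks:
        # Global attention, or the whole prompt already fits in the window.
        return block_ids
    # Local attention only needs the tail of the sequence.
    return block_ids[-window_blocks:]


def get_block_descs_ids_for_layer(layer_idx: int,
                                  block_ids: list[int],
                                  num_blocks_per_layer: int) -> list[int]:
    """Map (layer, block) pairs onto flat descriptor indices."""
    base = layer_idx * num_blocks_per_layer
    return [base + block_id for block_id in block_ids]
```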
For a prefill of 100k tokens, we are able to transfer 69% less data, since 3/4 of the layers have an `attention_chunk_size` of 8192: `(100000 * 1/4) + (8192 * 3/4) = 31,144` token-equivalents instead of 100,000.

Calculating per-layer transfer limits and using layer-indexed descriptor generation provides an example for supporting other attention types, like sliding window attention, in the future.
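A quick check of the arithmetic above, as a tiny Python snippet:

```python
prompt_tokens = 100_000
chunk_size = 8192

# 1/4 of the layers are NoPE (global) and still move the full prompt;
# 3/4 are local-attention layers capped at one chunk.
effective_tokens = prompt_tokens * (1 / 4) + chunk_size * (3 / 4)  # 31,144.0
reduction = 1 - effective_tokens / prompt_tokens                   # ~0.69
print(f"{effective_tokens:,.0f} token-equivalents -> {reduction:.0%} less data")
```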