
[V1][P/D] Local attention optimization for NIXL #18170


Merged
mgoin merged 7 commits into vllm-project:main on May 17, 2025

Conversation

mgoin
Member

@mgoin mgoin commented May 14, 2025

Optimizes the V1 NixlConnector for Llama 4 models by implementing layer-specific KV cache transfers.

During the worker's register_kv_caches phase, it pre-calculates the attention window size (in blocks) for each layer based on the Llama 4 config (no_rope_layers and attention_chunk_size).
For RoPE layers (local attention), this identifies the specific chunk of KV cache needed, while NoPE layers (global attention) are marked to use their full cache.
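
As a rough illustration of this pre-calculation, here is a minimal sketch (not the actual NixlConnector code; the helper name and the exact semantics of `no_rope_layers` are assumptions based on the description above):

```python
import math
from typing import Optional

def compute_layer_block_windows(
    no_rope_layers: list[int],   # per-layer flags from the Llama 4 config (assumed: truthy = RoPE layer)
    attention_chunk_size: int,   # e.g. 8192 for Llama-4-Scout
    block_size: int,             # KV cache block size in tokens, e.g. 16
) -> list[Optional[int]]:
    """For each layer, return the max number of KV blocks to transfer,
    or None if the layer uses global attention and needs its full cache."""
    windows: list[Optional[int]] = []
    for uses_rope in no_rope_layers:
        if uses_rope:
            # RoPE layer -> chunked local attention: only the most recent
            # ceil(attention_chunk_size / block_size) blocks are relevant.
            windows.append(math.ceil(attention_chunk_size / block_size))
        else:
            # NoPE layer -> global attention: no truncation.
            windows.append(None)
    return windows
```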

When a KV transfer is initiated in start_load_kv, the connector iterates through each layer. If a layer is identified as having a local attention window, the list of physical block IDs to be transferred is truncated to match this window, selecting only the most recent, relevant blocks. The _get_block_descs_ids method now uses a layer_idx to generate NIXL descriptors corresponding only to the memory regions of that specific layer and the selected (potentially chunked) block IDs.
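
A minimal sketch of the per-layer truncation step (hypothetical helper; the real connector operates on NIXL descriptors rather than bare block ID lists):

```python
from typing import Optional

def select_blocks_for_layer(
    block_ids: list[int],
    window_blocks: Optional[int],
) -> list[int]:
    """Keep only the most recent blocks when the layer has a local window."""
    if window_blocks is None or len(block_ids) <= window_blocks:
        return block_ids
    return block_ids[-window_blocks:]

# Conceptually, during start_load_kv the descriptors are then generated per layer
# from (layer_idx, selected block IDs) instead of one flat list for all layers.
```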

This reduces the data transferred for Llama 4's RoPE layers (local attention with chunk_size=8192), leading to improved TTFT. Thanks to @tlrmchlsmth for benchmarking, which showed a reduction in multi-node NIXL P/D overhead (relative to single-node TTFT) from ~10% to ~5% on 4xH200 Llama-4-Scout with 100k input tokens.

For a prefill of 100k tokens, we are able to transfer ~69% less data, since 3/4 of the layers have an attention_chunk_size of 8192: (100,000 × 1/4) + (8,192 × 3/4) = 31,144 tokens per layer on average.
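
A quick back-of-the-envelope check of that number (assuming 1/4 NoPE layers transferring the full 100k tokens and 3/4 RoPE layers capped at the 8192-token chunk):

```python
prompt_tokens = 100_000
chunk_size = 8192

# Average effective tokens transferred per layer.
effective = 0.25 * prompt_tokens + 0.75 * chunk_size   # 31_144.0
reduction = 1 - effective / prompt_tokens              # ~0.689, i.e. ~69% less data
```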

Calculating per-layer transfer limits and using layer-indexed descriptor generation also provides a template for supporting other attention types, such as sliding window attention, in the future.

Signed-off-by: mgoin <mgoin64@gmail.com>

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, they only run fastcheck CI, which runs a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

mgoin added 3 commits May 14, 2025 22:33
Signed-off-by: mgoin <mgoin64@gmail.com>
Signed-off-by: mgoin <mgoin64@gmail.com>
Signed-off-by: mgoin <mgoin64@gmail.com>
@mgoin mgoin marked this pull request as ready for review May 14, 2025 22:57
@mgoin mgoin added the v1 label May 14, 2025
@mgoin mgoin changed the title [WIP] Local attention optimization for NIXL [V1][P/D] Local attention optimization for NIXL May 14, 2025
@WoosukKwon WoosukKwon requested a review from heheda12345 May 15, 2025 01:48
@WoosukKwon
Collaborator

I think this PR will not be needed once #17996 and a follow-up PR (for Llama 4 support) are merged.
cc @heheda12345 for confirmation.

@heheda12345
Collaborator

Agree with @WoosukKwon. You can get the block_ids that need to be transferred from the P node to the D node with #17996. Let's discuss more after that PR.

@mgoin
Member Author

mgoin commented May 15, 2025

It makes sense to use the hybrid memory allocator eventually; however, we needed this optimization now, which is why I worked on it.
Either way, the logic to get the block descriptors will need to change similarly. Since the changes are local to the NIXL connector, I don't see much harm in landing this with the intent to remove it in the long term, but I understand.

Collaborator

@LucasWilkinson LucasWilkinson left a comment


Not critical, but you can actually transfer less than this in most cases:

math.ceil(chunk_size / self.block_size)

for local attention you only need to transfer

math.ceil((seqlen % chunk_size) / self.block_size)
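
As a rough numeric illustration of this suggestion (assuming block_size = 16 and the 100k-token prefill from the description; edge cases such as seqlen being an exact multiple of chunk_size are ignored):

```python
import math

seqlen, chunk_size, block_size = 100_000, 8192, 16

# Fixed upper bound: the full chunk, rounded up to whole blocks.
full_chunk_blocks = math.ceil(chunk_size / block_size)        # 512 blocks
# Tail of the current chunk, which is all the local layer attends to.
tail_blocks = math.ceil((seqlen % chunk_size) / block_size)   # 106 blocks
```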

@WoosukKwon
Collaborator

@mgoin Got it. I'm comfortable merging this PR as a temporary fix. Please just add a TODO comment.

mgoin added 3 commits May 16, 2025 15:16
Signed-off-by: mgoin <mgoin64@gmail.com>
Signed-off-by: mgoin <mgoin64@gmail.com>
@WoosukKwon WoosukKwon added the ready ONLY add when PR is ready to merge/full CI is needed label May 16, 2025
@mgoin mgoin merged commit fd195b1 into vllm-project:main May 17, 2025
76 checks passed
zzzyq pushed a commit to zzzyq/vllm that referenced this pull request May 24, 2025
Signed-off-by: mgoin <mgoin64@gmail.com>
Signed-off-by: Yuqi Zhang <yuqizhang@google.com>
minpeter pushed a commit to minpeter/vllm that referenced this pull request Jun 24, 2025
Signed-off-by: mgoin <mgoin64@gmail.com>
Signed-off-by: minpeter <kali2005611@gmail.com>
Labels
ready ONLY add when PR is ready to merge/full CI is needed v1
5 participants