[v1] Pass BlockTable and KVCacheSpec to AttentionMetadataBuilders #17483
Signed-off-by: Chen Zhang <zhangch99@outlook.com>
…kens Signed-off-by: Chen Zhang <zhangch99@outlook.com>
…er_layer_attn_metadata
…tn_metadata Signed-off-by: Chen Zhang <zhangch99@outlook.com>
… into slot_mapping
LGTM!
@heheda12345 Please fix the CI failure 😅
@heheda12345 May I know whether the pre-commit check caught this (the rename of the …)? If it didn't, it may be worth investigating why CI does not catch it.
#806)

### What this PR does / why we need it?
1. Fix V1 error found by [nightly_ci](https://github.com/vllm-project/vllm-ascend/actions/runs/14950004754/job/41998136610), broken by [[v1] Pass BlockTable and KVCacheSpec to AttentionMetadataBuilders #17483](vllm-project/vllm#17483); make the `InputBatch` parameter consistent with vllm.
2. Disable benchmark and fix it in upstream.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
Co-authored-by: Yikun Jiang <yikunkero@gmail.com>
…lm-project#17483) Signed-off-by: Chen Zhang <zhangch99@outlook.com> Signed-off-by: Mu Huai <tianbowen.tbw@antgroup.com>
…me in PR vllm-project#17483 (vllm-project#17961) Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com> Signed-off-by: Mu Huai <tianbowen.tbw@antgroup.com>
@tjtanaa An interesting question. I think pre-commit can pass because we have
It looks like this commit caused a regression in the FlashInfer backend (with FlashInfer's latest commit, 25fb40), at least on GB200 and B200. With this commit, the following command fails with CUDA out-of-memory errors:
It worked fine right before this commit.
The same CUDA out-of-memory failure also occurred on H100 with the same command above, i.e.
BTW, I was using CUDA 12.8. Thanks!
Should be merged after #17394.
The hybrid allocator will need to build attention metadata for each KV cache group, because different KV cache groups may have different attention types and block tables. To achieve that, we will introduce one AttentionMetadataBuilder and one BlockTable per group.
To prepare for this, this PR lets each AttentionMetadataBuilder access its own block_table and KVCacheSpec directly, instead of reading them from model_runner.
Because slot_mapping will also differ across KV cache groups, this PR moves the slot_mapping_cpu tensor from the runner to BlockTable.
Split from #16101.
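For readers unfamiliar with the layout, here is a minimal, self-contained sketch of the idea described above. It is not the vLLM implementation: `ToyKVCacheSpec`, `ToyBlockTable`, `ToyMetadataBuilder`, the group names, and the block ids are all hypothetical, and plain Python lists stand in for tensors. It only illustrates why the slot mapping belongs to the block table once each KV cache group has its own block size and block assignments.

```python
from dataclasses import dataclass, field


@dataclass
class ToyKVCacheSpec:
    """Hypothetical stand-in for a per-group KV cache spec."""
    block_size: int


@dataclass
class ToyBlockTable:
    """Hypothetical per-group block table that also owns its slot mapping."""
    block_size: int
    # rows[i] lists the physical block ids assigned to request i.
    rows: list = field(default_factory=list)
    # Filled by compute_slot_mapping; stored here rather than on the runner.
    slot_mapping: list = field(default_factory=list)

    def compute_slot_mapping(self, req_indices, positions):
        """Map each (request, token position) pair to a physical cache slot."""
        self.slot_mapping = [
            self.rows[req][pos // self.block_size] * self.block_size
            + pos % self.block_size
            for req, pos in zip(req_indices, positions)
        ]
        return self.slot_mapping


class ToyMetadataBuilder:
    """Hypothetical builder that receives its spec and block table up front."""

    def __init__(self, kv_cache_spec, block_table):
        self.kv_cache_spec = kv_cache_spec
        self.block_table = block_table

    def build(self, req_indices, positions):
        slots = self.block_table.compute_slot_mapping(req_indices, positions)
        return {
            "block_size": self.kv_cache_spec.block_size,
            "slot_mapping": slots,
        }


# One (spec, block table, builder) triple per KV cache group (names are made up).
groups = {}
for name, block_size, rows in [
    ("full_attn", 4, [[7, 2]]),
    ("sliding_window", 2, [[5, 9, 1]]),
]:
    spec = ToyKVCacheSpec(block_size)
    table = ToyBlockTable(block_size, rows)
    groups[name] = ToyMetadataBuilder(spec, table)

# The same three tokens (request 0, positions 3..5) map to different slots per group.
metadata = {name: b.build([0, 0, 0], [3, 4, 5]) for name, b in groups.items()}
print(metadata["full_attn"]["slot_mapping"])       # [31, 8, 9]
print(metadata["sliding_window"]["slot_mapping"])  # [19, 2, 3]
```

Since the builder never reaches back into a runner object, adding another group in this sketch only means constructing another spec, block table, and builder.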