[Frontend] Add prefix sorting as a precursor to BatchLLM optimization #13762
Conversation
…tion Co-authored-by: xinji1 <xinji1@microsoft.com> Co-authored-by: Fanghao Zhou <fanghaozhou@microsoft.com> Co-authored-by: Zhen Zheng <zhengzhen@microsoft.com> Signed-off-by: Taosong Fang <constfrost@foxmail.com>
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a limited subset of checks runs automatically. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.
This pull request has merge conflicts that must be resolved before it can be merged.
cc @WoosukKwon @comaniac for the next step.
Signed-off-by: Taosong Fang <constfrost@foxmail.com>
Signed-off-by: Taosong Fang <constfrost@foxmail.com>
This pull request has merge conflicts that must be resolved before it can be merged.
Modify some code comments.
@fangtaosong Hello, I noticed that you haven't maintained this PR recently; perhaps you are busy. I am interested in BatchLLM and would like to continue working on this PR. Is that OK?
Thank you for your interest in continuing this PR and BatchLLM. Please go ahead and work on it; I'll be happy to see the progress being made. You can get more information from the PR at #12641. At the same time, I highly recommend that you share your work plan (such as adding new modules, performing a rebase, etc.) publicly or privately.
This pull request has merge conflicts that must be resolved before it can be merged.
This pull request has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this pull request should remain open. Thank you!
This PR implements request sorting to maximize prefix reuse when both chunked prefill and prefix caching are enabled. This serves as the first step towards the full BatchLLM optimization proposed in our RFC.
Motivation:
Currently, vLLM performs implicit (or just-in-time) shared prefix identification and metadata collection, and then performs cascade attention when there's a single shared prefix for all requests, as described in PR #11635. However, as suggested by WoosukKwon, this approach does not fully utilize the shared prefix in offline scenarios where there are many requests with different shared prefixes.
In offline settings, all requests are available before inference begins, making implicit prefix identification suboptimal. By explicitly sorting requests based on their shared prefixes, we can better maximize prefix reuse, improve KV-cache management, and significantly enhance throughput for batched requests.
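As a rough illustration of the idea (a hedged sketch, not the code added by this PR), lexicographically sorting the incoming prompts is the simplest way to cluster requests that share a prefix, so that consecutive scheduling lets the prefix cache reuse the shared KV blocks:

```python
# Minimal sketch of prefix-based request sorting (illustrative only).
# Prompts that share a long common prefix become neighbors after sorting,
# which improves prefix-cache hit rates when requests are scheduled in order.

prompts = [
    "SYSTEM: You are a helpful assistant.\nUSER: Translate 'hello' to French.",
    "Write a haiku about the ocean.",
    "SYSTEM: You are a helpful assistant.\nUSER: Summarize the article below.",
    "Write a haiku about mountains.",
]

# Sort by prompt content while remembering the original positions, so that
# generated outputs can be restored to the caller's order afterwards.
order = sorted(range(len(prompts)), key=lambda i: prompts[i])
sorted_prompts = [prompts[i] for i in order]

for original_index, prompt in zip(order, sorted_prompts):
    print(original_index, repr(prompt[:45]))
```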
Changes:
- Add an `--enable-prefix-sorting` flag to control prefix sorting (see the usage sketch after this list)
- Prefix sorting takes effect only when both `--enable-chunked-prefill` and `--enable-prefix-caching` are enabled
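A hedged usage sketch of the new flag, assuming it maps to an `enable_prefix_sorting` engine argument in the offline Python API (the exact argument name in this PR may differ); `enable_prefix_caching` and `enable_chunked_prefill` are existing vLLM engine arguments, and the model name is only a placeholder:

```python
from vllm import LLM, SamplingParams

# enable_prefix_caching and enable_chunked_prefill are existing vLLM engine
# arguments; enable_prefix_sorting is assumed here to mirror the new
# --enable-prefix-sorting flag introduced by this PR (name may differ).
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model
    enable_prefix_caching=True,
    enable_chunked_prefill=True,
    enable_prefix_sorting=True,
)

# Many requests sharing the same long system prompt benefit most from
# prefix sorting, since their shared KV blocks can be reused back to back.
shared_system_prompt = "SYSTEM: You are a helpful assistant.\n"
questions = ["What is vLLM?", "Explain prefix caching.", "What is chunked prefill?"]
outputs = llm.generate(
    [shared_system_prompt + q for q in questions],
    SamplingParams(max_tokens=64),
)
```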
Performance improvement:
Test setup:
Test Script:
Test Commands:
Results:
The results show that:
This is the first part of the BatchLLM optimization, focusing on request sorting only. Support for more complex prefix sharing patterns will be addressed in a separate PR.
Important Notes:
This optimization is currently only recommended when chunked prefill is enabled. With the current FlashInfer Cascade implementation in the default mode, prefix clustering can actually lead to a ~20% performance degradation. To achieve optimal performance across all modes, please refer to our original BatchLLM implementation in PR #12641.
The current PR serves as an initial step towards full BatchLLM optimization, focusing on request sorting only. Support for more complex prefix sharing patterns in the `Scheduler`, `ModelRunner`, and `Kernel` will be addressed in separate PRs.