[Frontend] Add prefix sorting as a precursor to BatchLLM optimization #13740
This PR implements request sorting to maximize prefix reuse when both chunked prefill and prefix caching are enabled. This serves as the first step towards the full BatchLLM optimization proposed in our RFC.
Motivation:
Currently, vLLM performs implicit (just-in-time) shared-prefix identification and metadata collection, and then applies cascade attention when there is a single prefix shared by all requests, as described in PR #11635. However, as suggested by WoosukKwon, this approach does not fully exploit shared prefixes in offline scenarios where many requests carry different shared prefixes.
In offline settings, all requests are available before inference begins, making implicit prefix identification suboptimal. By explicitly sorting requests based on their shared prefixes, we can better maximize prefix reuse, improve KV-cache management, and significantly enhance throughput for batched requests.
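To illustrate the idea, here is a minimal sketch (not the code added in this PR): sorting requests lexicographically by their prompt token ids is enough to cluster requests that share a prefix, so the KV blocks cached for one request's prefix can be reused by the requests scheduled immediately after it. The function name and example data below are illustrative only.

```python
# Minimal sketch (not the PR's implementation): cluster requests that share a
# prefix by sorting them on their token-id prefixes, so requests with a common
# prefix are scheduled back-to-back and hit the prefix cache.
from typing import List, Sequence


def sort_by_shared_prefix(token_ids_per_request: List[Sequence[int]]) -> List[int]:
    """Return request indices ordered so that shared prefixes are adjacent.

    A simple lexicographic sort over token ids is enough to group requests
    that start with the same tokens; the scheduler can then reuse the cached
    KV blocks of the shared prefix for each following request.
    """
    indices = list(range(len(token_ids_per_request)))
    # Sort by the token sequence itself; ties keep the original order (stable sort).
    indices.sort(key=lambda i: tuple(token_ids_per_request[i]))
    return indices


# Example: requests 0 and 2 share the prefix [1, 2, 3] and end up adjacent.
requests = [[1, 2, 3, 7], [9, 9], [1, 2, 3, 8]]
print(sort_by_shared_prefix(requests))  # -> [0, 2, 1]
```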
Changes:
- Added an `--enable-prefix-sorting` flag to control prefix sorting
- Prefix sorting is only applied when both `--enable-chunked-prefill` and `--enable-prefix-caching` are enabled (a usage sketch is included under Important Notes below)

Performance improvement:
Test setup:
Test Script:
Test Commands:
Results:
The results show that:
This is the first part of the BatchLLM optimization, focusing on request sorting only. Support for more complex prefix sharing patterns will be addressed in a separate PR.
Important Notes:
This optimization is currently only recommended when chunked prefill is enabled. With the current FlashInfer Cascade implementation in the default mode, prefix clustering can actually lead to a ~20% performance degradation. To achieve optimal performance across all modes, please refer to our original BatchLLM implementation in PR #12641.
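For context, an offline usage sketch combining the three options from the Changes section might look like the following. `enable_prefix_sorting` is assumed here as the engine-argument counterpart of the new `--enable-prefix-sorting` flag; `enable_chunked_prefill` and `enable_prefix_caching` are existing vLLM options, and the model name and prompts are placeholders.

```python
# Hypothetical offline usage combining chunked prefill, prefix caching, and
# the prefix sorting added in this PR. `enable_prefix_sorting` is assumed to
# mirror the new --enable-prefix-sorting CLI flag; the other two arguments are
# existing vLLM engine options.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # example model only
    enable_chunked_prefill=True,   # required for prefix sorting to take effect
    enable_prefix_caching=True,    # required for prefix sorting to take effect
    enable_prefix_sorting=True,    # assumed kwarg corresponding to this PR's flag
)

# Workloads with many requests sharing a few long prompts benefit most:
shared_prefix = "You are a helpful assistant. " * 50
prompts = [shared_prefix + f"Question {i}: ..." for i in range(64)]
outputs = llm.generate(prompts, SamplingParams(max_tokens=32))
```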
The current PR serves as an initial step towards the full BatchLLM optimization, focusing on request sorting only. Support for more complex prefix-sharing patterns involving the `Scheduler`, `ModelRunner`, and `Kernel` will be addressed in separate PRs.