[V1][PP] Continue scheduling prefill chunks #13637
Conversation
Thanks! The main part looks good; I have not reviewed the tests yet.
LG
Thanks for the PR! I got the overall idea of the PR but didn't get the full details. Please check out my comments.
@WoosukKwon thanks for the review. The challenging points in this PR that make the code complicated are summarized as follows:
This pull request has merge conflicts that must be resolved before it can be merged.
Benchmark with PP=2 on 2xL4 GPUs:
Example Serving Command
Benchmark Command
@WoosukKwon I refactored the PR a bit to reduce possible confusion and added the benchmark results above. PTAL.
Thanks for the PR! Left some comments.
This pull request has merge conflicts that must be resolved before it can be merged.
Thanks for the optimization. A couple of minor issues/questions.
LGTM!
@comaniac I see some performance regression in the ShareGPT benchmark.
@comaniac It actually hangs at the end of the benchmark... 😭 Can you please check again?
Signed-off-by: Cody Yu <hao.yu.cody@gmail.com>
Fixed a bug that doesn't consider the [...]. Please try again.
As we discussed offline, let's hold off on this PR until we make the v0.8.0 release.
@WoosukKwon should this be good to go now that the 0.8.0 branch has been cut?
This pull request has merge conflicts that must be resolved before it can be merged.
Signed-off-by: Cody Yu <hao.yu.cody@gmail.com>
@comaniac thanks for the PR! Looks good to me in general, but I still have a few questions on scheduled_req_ids. Please check out my comments.
vllm/v1/core/sched/scheduler.py
Outdated
# With PP, when the input prompt is divided into chunks, we can
# schedule a new chunk even before the previous chunk has completed
# the full pipeline stages. This helps reduce TTFT.
self.scheduled_req_ids: dict[str, int] = defaultdict(int)
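A minimal sketch (editorial, not the actual vLLM scheduler code) of what a per-request counter like this could track: it would be incremented each time a chunk of the request is scheduled and decremented when that batch's output is processed, so a value above one means several prefill chunks are in flight through the pipeline at once. The helper names below are hypothetical.

```python
from collections import defaultdict

# Hypothetical sketch: count how many scheduled batches of each request
# are still in flight through the pipeline stages.
scheduled_req_ids: dict[str, int] = defaultdict(int)

def on_chunk_scheduled(req_id: str) -> None:
    # Another chunk of this request was scheduled before earlier chunks
    # finished all pipeline stages.
    scheduled_req_ids[req_id] += 1

def on_batch_output(req_id: str) -> None:
    # The output for one previously scheduled chunk has been processed.
    scheduled_req_ids[req_id] -= 1
    if scheduled_req_ids[req_id] == 0:
        del scheduled_req_ids[req_id]
```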
Dumb question: What is scheduled_req_ids used for? IIUC, this PR eliminates its usage.
This may be a good catch. Since we now simply check whether num_new_tokens == 0 to determine whether we can continue scheduling a request, we may not need this anymore...
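As a rough illustration of the check being described (helper and variable names are assumptions, not the actual scheduler code): once num_computed_tokens is advanced at schedule time, whether a request can take another prefill chunk follows directly from how many prompt tokens remain, so no separate scheduled_req_ids bookkeeping is needed.

```python
def can_schedule_more(num_prompt_tokens: int, num_computed_tokens: int) -> bool:
    # num_new_tokens == 0 means every prompt token has already been
    # scheduled in some batch, so there is nothing left to schedule.
    num_new_tokens = num_prompt_tokens - num_computed_tokens
    return num_new_tokens > 0

# Example: a 1000-token prompt with 512 tokens already scheduled can take
# another chunk; once all 1000 are scheduled, it cannot.
assert can_schedule_more(1000, 512)
assert not can_schedule_more(1000, 1000)
```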
@comaniac If so, can you please remove it?
Updated.
Signed-off-by: Cody Yu <hao.yu.cody@gmail.com>
This pull request has merge conflicts that must be resolved before it can be merged.
This PR supports pipelining prefill chunks. The key changes are:

Schedule
- We used to advance num_computed_tokens in update_from_output, but this prevents the scheduler from picking up the next chunk of a request until its previous batch finishes (i.e., it had to schedule conservatively), so we now advance num_computed_tokens right after scheduling (see the sketch below).
- scheduled_req_ids is now Dict[str, int], where the value is the number of batches in which this request was scheduled. The value can be larger than one only when we schedule multiple prefill chunks.
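A simplified sketch of the first point above, using made-up names rather than the real scheduler API: num_computed_tokens is advanced immediately when a chunk is scheduled, so the next scheduling step can pick up the following chunk without waiting for update_from_output.

```python
from dataclasses import dataclass

@dataclass
class Request:
    req_id: str
    num_prompt_tokens: int
    num_computed_tokens: int = 0

def schedule_chunk(request: Request, chunk_size: int) -> int:
    """Schedule the next prefill chunk and advance num_computed_tokens
    right away, so the next step can schedule the following chunk without
    waiting for update_from_output."""
    num_new_tokens = min(chunk_size,
                         request.num_prompt_tokens - request.num_computed_tokens)
    if num_new_tokens == 0:
        return 0
    request.num_computed_tokens += num_new_tokens  # advanced at schedule time
    return num_new_tokens

# Example: a 1000-token prompt with chunk size 512 has both chunks scheduled
# back-to-back, before any model output has come back.
req = Request("r0", 1000)
assert schedule_chunk(req, 512) == 512
assert schedule_chunk(req, 512) == 488
assert schedule_chunk(req, 512) == 0
```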
Cached Request Data

Update from output
- update_from_output should not use request.num_computed_tokens, because this value may have been advanced after this batch was scheduled. Instead, it should refer to scheduler_output.scheduled_xxx_reqs[req_id].num_computed_tokens, which is the right number for when this batch was scheduled (see the sketch below).

cc @WoosukKwon @ruisearch42
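A hedged illustration of the last point (the classes and fields below are made up for the example, not vLLM's actual types): because num_computed_tokens on the request may already include chunks scheduled after this batch, update_from_output should read the snapshot recorded in the scheduler output for this batch.

```python
from dataclasses import dataclass

@dataclass
class Request:
    req_id: str
    num_computed_tokens: int = 0

@dataclass
class ScheduledBatch:
    req_id: str
    num_computed_tokens: int   # snapshot taken when this batch was scheduled
    num_scheduled_tokens: int

req = Request("r0")

# Two prefill chunks scheduled back-to-back; the request-level counter is
# advanced immediately after each scheduling step.
batch0 = ScheduledBatch("r0", req.num_computed_tokens, 512)
req.num_computed_tokens += 512
batch1 = ScheduledBatch("r0", req.num_computed_tokens, 488)
req.num_computed_tokens += 488

# By the time batch0's output is processed, req.num_computed_tokens is
# already 1000, so the per-batch snapshot is the value to use.
assert batch0.num_computed_tokens == 0
assert batch1.num_computed_tokens == 512
assert req.num_computed_tokens == 1000
```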