[V1] Eagerly remove finished requests from the batch #14388
@@ -31,7 +31,8 @@
 from vllm.v1.engine.mm_input_cache import MMInputCacheClient
 from vllm.v1.kv_cache_interface import (FullAttentionSpec, KVCacheConfig,
                                         KVCacheSpec)
-from vllm.v1.outputs import LogprobsTensors, ModelRunnerOutput
+from vllm.v1.outputs import (EMPTY_MODEL_RUNNER_OUTPUT, LogprobsTensors,
+                             ModelRunnerOutput)
 from vllm.v1.sample.metadata import SamplingMetadata
 from vllm.v1.sample.rejection_sampler import INVALID_TOKEN_ID, RejectionSampler
 from vllm.v1.spec_decode.ngram_proposer import NgramProposer
@@ -867,6 +868,9 @@ def execute_model(
         intermediate_tensors: Optional[IntermediateTensors] = None,
     ) -> Union[ModelRunnerOutput, torch.Tensor]:
         self._update_states(scheduler_output)
+        if not scheduler_output.total_num_scheduled_tokens:
+            # Return empty ModelRunnerOutput if there's no work to do.
+            return EMPTY_MODEL_RUNNER_OUTPUT

         if self.is_multimodal_model:
             # Run the multimodal encoder if any.

Review comment: Just wondering: why not return
Review comment: Also, just curious: is this compatible with PP? cc @comaniac
Author reply: I could do but I think it would actually be more intrusive, i.e. change the
Author reply: I admit I haven't fully digested the PP impl ... so yes, it would be great if @comaniac could comment on this!
Reply from @comaniac: It seems compatible but may be further cleaned up. In short, when PP>1, I guarantee only the batches with
Specifically since
@@ -1013,15 +1017,14 @@ def execute_model(
             spec_token_ids = self.generate_draft_token_ids(
                 valid_sampled_token_ids)

-        model_runner_output = ModelRunnerOutput(
+        return ModelRunnerOutput(
             req_ids=self.input_batch.req_ids,
             req_id_to_index=self.input_batch.req_id_to_index,
             sampled_token_ids=valid_sampled_token_ids,
             spec_token_ids=spec_token_ids,
             logprobs=logprobs_lists,
             prompt_logprobs_dict=prompt_logprobs_dict,
         )
-        return model_runner_output

     def generate_draft_token_ids(
         self,
Review comment: Can you please document here why we should do `has_requests` rather than `has_unfinished_requests` (like the PR description)?
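For reference, a hedged sketch of the distinction this comment asks to document. The method names come from the comment itself; the bodies below are an assumption about the intent of this PR (eagerly removing finished requests means a request can be finished but not yet propagated to the workers, so the engine must keep stepping until that happens), not the actual vLLM scheduler implementation:

```python
# Hedged sketch, not the actual vLLM scheduler: with eager removal, the
# scheduler may still owe the model runner a notification about requests
# that just finished, even when no unfinished requests remain.
class Scheduler:
    def __init__(self) -> None:
        self.waiting: list = []
        self.running: list = []
        # Finished request IDs not yet sent to the workers for removal
        # from the persistent batch (assumed bookkeeping).
        self.finished_req_ids: set[str] = set()

    def has_unfinished_requests(self) -> bool:
        # Requests that still need prefill or decode work.
        return bool(self.waiting or self.running)

    def has_requests(self) -> bool:
        # True while there is *any* pending work, including telling the
        # workers about requests that just finished.
        return self.has_unfinished_requests() or bool(self.finished_req_ids)
```

Under this reading, the engine loop conditions on `has_requests` so that a final step still runs to flush the finished-request IDs to the workers, rather than stopping as soon as `has_unfinished_requests` becomes false.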