[V1] AsyncLLMEngine #9826 (Draft)
robertgshaw2-neuralmagic wants to merge 65 commits into vllm-project:main from neuralmagic:rework-rs-proto
+991 −680
Commits (65)
All commits by robertgshaw2-neuralmagic unless noted otherwise.

8f8662e  prototype
01c4ca8  revert spurious 2.5 changes
1ad8a48  stash
f9084f6  cleanup
72bccd9  add MQLLMEnginev1
a6cab52  work with MQLLMEngine
885ed16  format
3ed66cf  cleanup formatting
8ae8ce9  revert exmple change
5c72515  update comment
f9b33fa  formatting
82539b9  updated
d42a54e  stash
3a2d02a  format
6028ee1  Merge branch 'main' into rs-prototype-2
6bd37c1  update
196d822  revert bind/connect
a089cd1  revert comment
974aa06  formatting
fe1e1b4  formatting tweaks
9c27fbb  move detokenizer into engine
95b5af1  format
3999279  stash
b4dd571  revert bad import
f01f992  format
be333fa  format
aefb498  add files
6d7f473  stash
f431f8a  update
be431e4  update
36b7fa5  fix api client example to work with v1
3a5ce74  formatting
0d0251e  updated
046d78f  update
34c0665  update
52b790f  stash
4f9a86e  Stash
697b98f  stash
fa5c01d  LLMEngineWorking
0ca42d8  format
b6497d5  updated
ae88c73  updated
2161152  update
6a57297  aded processor
3665602  udpated
ed567ca  updated
f4005da  updated formats
67a53ed  revert
458b54f  finished
75ff707  updated
669648f  split core process into separate class (njhill)
127f09c  stash
99f683e  Merge pull request #22 from njhill/rework-splitcore
dc6163c  updated
d21cb8f  updated
565ffa6  working again
2960fbc  format
5d23709  updated
f2f2e40  updated
c10c9d8  better interface
b8767a9  formatting
ab783e1  format
423f47d  update
7c977d3  updated
3c14bdf  format
@@ -1,12 +1,13 @@
 from collections import deque
 from dataclasses import dataclass
-from typing import Deque, Dict, Iterable, List, Optional, Set, Tuple, Union
+from typing import Deque, Dict, Iterable, List, Optional, Set, Union

 from vllm.config import CacheConfig, LoRAConfig, SchedulerConfig
 from vllm.logger import init_logger
 from vllm.multimodal import MultiModalDataDict
 from vllm.sampling_params import SamplingParams
 from vllm.v1.core.kv_cache_manager import KVCacheManager
+from vllm.v1.engine import EngineCoreOutput
 from vllm.v1.outputs import ModelRunnerOutput
 from vllm.v1.request import Request, RequestStatus
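The EngineCoreOutput type imported above is not defined in this diff. Judging only from the constructor call further down, it is presumably a small per-request dataclass along the lines of the sketch below; the field names come from the diff, but the field types and defaults are assumptions, not the actual definition in vllm.v1.engine.

# Hedged sketch of EngineCoreOutput, inferred from the constructor call in
# this diff; field types and defaults are assumptions for illustration only.
from dataclasses import dataclass
from typing import List, Optional, Union

@dataclass
class EngineCoreOutput:
    request_id: str
    new_token_ids: List[int]
    finished: bool
    finish_reason: Optional[str] = None
    stop_reason: Union[int, str, None] = None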
@@ -227,13 +228,12 @@ def update_from_output(
         self,
         scheduler_output: "SchedulerOutput",
         model_runner_output: "ModelRunnerOutput",
-    ) -> List[Tuple[Request, int]]:
+    ) -> List[EngineCoreOutput]:
         # NOTE(woosuk): This method doesn't consider speculative decoding.
         sampled_token_ids = model_runner_output.sampled_token_ids_cpu.tolist()
         num_scheduled_tokens = scheduler_output.num_scheduled_tokens
         new_running: List[Request] = []
-        # (request, num_sampled_tokens)
-        sampled: List[Tuple[Request, int]] = []
+        engine_core_outputs: List[EngineCoreOutput] = []
         for request in self.running:
             req_id = request.request_id
             request.num_computed_tokens += num_scheduled_tokens[req_id]

Review comment on the changed return type: "Im not sure it makes sense for this method to be in [...] The only item related to setting the scheduler here is updating which [...]"
@@ -247,17 +247,30 @@ def update_from_output(
             # generates at most one token at each step.
             token_id = sampled_token_ids[req_index]
             request.output_token_ids.append(token_id)
-            sampled.append((request, 1))
+            num_new_tokens = 1

             # TODO: Update the KV cache manager for prefix caching.

-            # Check if the request is finished.
+            # Check for stop and update request state.
+            # This must be called before we make the EngineCoreOutput.
             stopped = self._check_stop(request)

+            # Add EngineCoreOutput for this Request.
+            output = EngineCoreOutput(
+                request_id=req_id,
+                new_token_ids=request.output_token_ids[-num_new_tokens:],
+                finished=request.is_finished(),
+                finish_reason=request.get_finished_reason(),
+                stop_reason=request.stop_reason)
+            engine_core_outputs.append(output)
+
+            # Breakout of the loop.
             if stopped:
                 continue

             new_running.append(request)
         self.running = new_running
-        return sampled
+        return engine_core_outputs

     def _check_stop(self, request: Request) -> bool:
         if (request.num_tokens >= self.max_model_len
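The net effect of this hunk is that update_from_output now emits one EngineCoreOutput per running request (the newly sampled token ids plus finish state) instead of (request, num_sampled_tokens) tuples, so the engine core can forward results directly to the detokenizer or async front-end. Below is a minimal sketch of how a caller might consume the returned list; the schedule, execute_model, and send names are illustrative assumptions, not the exact vLLM API in this PR.

# Illustrative engine-core step, assuming hypothetical scheduler, executor,
# and detokenizer objects; only update_from_output() is taken from this diff.
def engine_core_step(scheduler, model_executor, detokenizer):
    scheduler_output = scheduler.schedule()
    model_runner_output = model_executor.execute_model(scheduler_output)
    outputs = scheduler.update_from_output(scheduler_output,
                                           model_runner_output)
    for output in outputs:
        # Each EngineCoreOutput carries only the new token ids and finish
        # state, so the detokenizer can stream per-request deltas.
        detokenizer.send(output)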
Review comment: "@njhill - im starting to think this stuff should be wrapped into LLMEngine"
Reply: "@robertgshaw2-neuralmagic yeah I was thinking we might have a completely separate LLM class, but that may be tricky if we want to be able to switch existing code with the env var."
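The env-var switch mentioned in this reply would presumably choose between the existing engine and the V1 engine at construction time. A rough sketch of that idea follows; the environment variable name and the V1 import path are placeholders for illustration, not the actual mechanism in this PR.

# Hypothetical illustration of switching engine implementations via an
# environment variable; the env var name and the V1 import path are
# assumptions for this sketch.
import os

def get_async_engine_class():
    if os.environ.get("VLLM_USE_V1", "0") == "1":
        from vllm.v1.engine.async_llm_engine import AsyncLLMEngine as V1AsyncEngine
        return V1AsyncEngine
    from vllm.engine.async_llm_engine import AsyncLLMEngine
    return AsyncLLMEngine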