[Speculative decoding 6/9] Integrate speculative decoding with LLMEngine #3894
Conversation
Thanks for the PR! Just some minor comments.
# Remove the parent sequence if it is not selected for next
# iteration
seq_group.remove(seq.seq_id)
self.scheduler.free_seq(seq)
Currently, the output processor owns the scheduler and calls the scheduler's functions while processing the output. This is a bit weird. Is it possible to move the scheduler part out of the output processor? Or is it too much work?
Agree! Note this weirdness is not introduced in this PR, it already existed in the LLMEngine. If it had unit tests we could easily refactor it, but I'm afraid without good unit tests it's risky to refactor things without a good reason.
""" | ||
|
||
def __init__(self, scheduler, scheduler_config, get_tokenizer_for_seq): | ||
self.scheduler = scheduler |
- scheduler is not used in this class, why do we need scheduler here?
- we need to read max_model_len from scheduler config, can we pass max_model_len directly?

In general, we might want to avoid the dependency on scheduler/scheduler_config in this class.
- removed scheduler dependency and added type annotations (good catch!)
- passed max_model_len directly (see the sketch below)
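To make the new shape concrete, here is a rough sketch of the constructor without the scheduler dependency (illustrative only; the class and parameter names are assumptions based on this discussion, not the exact PR code):

from typing import Callable

from transformers import PreTrainedTokenizer

from vllm.sequence import Sequence


class MultiStepOutputProcessor:

    def __init__(
        self,
        max_model_len: int,
        get_tokenizer_for_seq: Callable[[Sequence], PreTrainedTokenizer],
    ) -> None:
        # Keep only what this class actually reads: no scheduler or
        # scheduler_config, just the value and the callback it needs.
        self.max_model_len = max_model_len
        self.get_tokenizer_for_seq = get_tokenizer_for_seq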
  # If no spec tokens, call the proposer and scorer workers normally.
  # Used for prefill.
- if num_spec_tokens == 0 or len(seq_group_metadata_list) == 0:
+ if num_lookahead_slots == 0 or len(seq_group_metadata_list) == 0:
When can len(seq_group_metadata_list) be 0?
I don't have any use in mind for it, but I want to maintain compatibility with the other workers.
cpu worker
vllm/vllm/worker/cpu_worker.py
Lines 284 to 286 in 92cd2e2
# If there is no input, we don't need to execute the model.
if num_seq_groups == 0:
    return {}
gpu worker
Lines 231 to 233 in 92cd2e2
# If there is no input, we don't need to execute the model.
if num_seq_groups == 0:
    return {}
neuron worker
vllm/vllm/worker/neuron_worker.py
Lines 79 to 81 in 92cd2e2
# If there is no input, we don't need to execute the model.
if num_seq_groups == 0:
    return {}
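For completeness, a sketch of how the same guard can look on the speculative decoding side (illustrative only; the helper names _run_no_spec and _run_speculative_decoding_step are assumptions for this sketch, not necessarily the PR's exact helpers):

def execute_model(self, seq_group_metadata_list, num_lookahead_slots):
    # Mirror the other workers: with no input or no lookahead slots
    # (e.g. prefill), skip speculation and run the workers normally.
    if num_lookahead_slots == 0 or len(seq_group_metadata_list) == 0:
        return self._run_no_spec(seq_group_metadata_list)
    return self._run_speculative_decoding_step(seq_group_metadata_list,
                                               num_lookahead_slots)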
@@ -0,0 +1,117 @@
from typing import Callable, Iterable, List
Another question:
From my understanding, the multi_step processor is used for the target model, which is not a multi_step_worker.
The single_step processor is used for the draft model, which is a multi_step_worker.
Correct me if I'm wrong. If this is the case, can we have a better name for the multi_step output processor?
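To make the distinction concrete, a sketch of how the engine could choose between the two processors (illustrative; the factory name, import paths, and arguments are assumptions, not the PR's exact API):

from vllm.engine.output_processor.multi_step import MultiStepOutputProcessor
from vllm.engine.output_processor.single_step import SingleStepOutputProcessor


def create_output_processor(num_lookahead_slots: int, **kwargs):
    if num_lookahead_slots == 0:
        # Normal decoding: each engine step appends one token per sequence.
        return SingleStepOutputProcessor(**kwargs)
    # Speculative decoding: the target model can emit several accepted
    # tokens per engine step, which the multi-step processor appends at once.
    return MultiStepOutputProcessor(**kwargs)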
update: seems there is some conflict with #3884 and block manager v2 that only shows up in the spec decode e2e test added in this PR. will investigate.
^ sorry for the inconvenience @cadedaniel!
np, glad chunked prefill got in
Hi @cadedaniel, great job enabling this feature! However, I see that in https://github.com/vllm-project/vllm/blob/main/vllm/spec_decode/multi_step_worker.py#L347 you append -1 to the proposal to pad it to num_lookahead_slots. Isn't that a bit wasteful for the verification stage, since the proposal may not always be num_lookahead_slots long? Is there any room for improvement here? Thx~
Thanks for testing it out.
Yes, it would be wasteful if we verified these padding token ids, so we avoid doing that. This means the padded token ids never actually go through verification. In the future, we can support per-sequence proposal lens.
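A small sketch of the idea (names and shapes are assumptions for illustration; the real logic lives in the proposer/scorer code):

import torch

num_lookahead_slots = 4
# One sequence proposes a full k tokens, another proposes nothing.
raw_proposals = [[11, 12, 13, 14], []]

proposal_token_ids = torch.full((len(raw_proposals), num_lookahead_slots),
                                fill_value=-1, dtype=torch.long)
proposal_lens = torch.zeros(len(raw_proposals), dtype=torch.long)
for i, tokens in enumerate(raw_proposals):
    if tokens:
        proposal_token_ids[i, :len(tokens)] = torch.tensor(tokens,
                                                           dtype=torch.long)
        proposal_lens[i] = len(tokens)

# Only rows with a nonzero proposal_len are scored by the target model,
# so the -1 padding in the remaining rows is never verified.
rows_to_verify = proposal_token_ids[proposal_lens > 0]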
This PR integrates the existing speculative decoding components with the LLMEngine. With this PR, one can run speculative decoding with a draft model / target model end-to-end. Note, however, that the output is garbage, as the e2e correctness work remains (PR 7/9 in the open sourcing plan).
Notes:
- execute_model now returns List[SamplerOutput] instead of just SamplerOutput (see the sketch below).
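A hedged sketch of what this change means for the engine side (names are illustrative; the SamplerOutput import path may differ depending on the vLLM version):

from typing import List

from vllm.sequence import SamplerOutput  # import path assumed


def process_model_outputs(outputs: List[SamplerOutput]) -> None:
    # With speculative decoding, one engine step can yield several sampler
    # outputs (one per accepted token position), so the engine iterates
    # over the list instead of handling a single output.
    for step_output in outputs:
        ...  # append tokens, check stop conditions, free finished sequences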