[Misc] [Core] Implement RFC "Augment BaseExecutor interfaces to enable hardware-agnostic speculative decoding" #3837
Conversation
LGTM! Just some minor questions.
vllm/worker/cpu_worker.py
# Note: To reuse the cache management procedure,
# use cpu cache as 'gpu cache'.
num_cpu_blocks = num_gpu_blocks
del num_gpu_blocks
Confused about lines 209-212: why del here? Should it be self.cache_config.num_gpu_blocks = num_gpu_blocks?
I'm trying to make the code readable given the awkwardness of num_gpu_blocks actually being num_cpu_blocks, and num_cpu_blocks being ignored/always zero. But yeah, the del is a bit extreme for this. I refactored out the checks; the code is more readable now.
self.cache_config.num_gpu_blocks = num_cpu_blocks
self.cache_config.num_cpu_blocks = 0

if num_cpu_blocks <= 0:
Nit: move the check before we read num_cpu_blocks
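A minimal sketch of how the refactored CPU worker code might look with the nit applied, i.e. with the validity check moved before num_cpu_blocks is read. This is not the actual vLLM source; CacheConfig here is a hypothetical stand-in for illustration only.

```python
class CacheConfig:
    """Hypothetical stand-in for vLLM's cache config (illustration only)."""

    def __init__(self):
        self.num_gpu_blocks = None
        self.num_cpu_blocks = None


def initialize_cache(cache_config: CacheConfig, num_cpu_blocks: int) -> None:
    # Review nit applied: validate num_cpu_blocks before it is read.
    if num_cpu_blocks <= 0:
        raise ValueError(
            "No available cache blocks; try a larger cache or smaller model.")
    # On the CPU backend the 'gpu cache' slot actually holds the CPU
    # blocks, so the swap space (num_cpu_blocks) is always zero.
    cache_config.num_gpu_blocks = num_cpu_blocks
    cache_config.num_cpu_blocks = 0
```

This keeps the cache-management code path shared with the GPU backend while avoiding the earlier del.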
…e hardware-agnostic speculative decoding" (vllm-project#3837)
This PR implements the BaseWorker interface described in #3809. This enables speculative decoding to treat all workers in the same way, allowing future work to enable speculative decoding for CPU/Neuron/other vLLM backends.
I will write up docs on how to do this after spec decode is merged; the TL;DR is that one needs to implement the rejection sampler (currently implemented only in PyTorch) and add plumbing between the proposal method and the verification model that works for the hardware backend (currently only top-1 fixed speculation is implemented for torch).
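To make the idea concrete, here is a hedged sketch of a hardware-agnostic worker interface in the spirit of the RFC (#3809). This is not the vLLM source; the class names, method names, and the toy CPUWorker are all assumptions for illustration, showing only why a uniform interface lets speculative decoding treat GPU/CPU/Neuron workers the same way.

```python
from abc import ABC, abstractmethod
from typing import List, Tuple


class BaseWorker(ABC):
    """Hypothetical backend-agnostic worker contract (illustration only)."""

    @abstractmethod
    def determine_num_available_blocks(self) -> Tuple[int, int]:
        """Profile memory; return (num_gpu_blocks, num_cpu_blocks)."""

    @abstractmethod
    def initialize_cache(self, num_gpu_blocks: int,
                         num_cpu_blocks: int) -> None:
        """Allocate the KV cache with the given block counts."""

    @abstractmethod
    def execute_model(self, seq_group_metadata_list: List[dict]) -> List[dict]:
        """Run one model step and return per-sequence sampler output."""


class CPUWorker(BaseWorker):
    """Toy CPU backend: 'gpu' blocks are really CPU blocks, swap is zero."""

    def determine_num_available_blocks(self) -> Tuple[int, int]:
        return 128, 0  # CPU cache reported in the 'gpu' slot, no swap

    def initialize_cache(self, num_gpu_blocks: int,
                         num_cpu_blocks: int) -> None:
        self.num_blocks = num_gpu_blocks

    def execute_model(self, seq_group_metadata_list):
        # Dummy output; a real backend would run the model here.
        return [{"token_ids": []} for _ in seq_group_metadata_list]
```

A speculative-decoding driver written against BaseWorker alone never needs to know which backend it is proposing and verifying on.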
Notes
Closes #3809