
[Misc] [Core] Implement RFC "Augment BaseExecutor interfaces to enable hardware-agnostic speculative decoding" #3837

Merged: 56 commits into vllm-project:main on Apr 9, 2024

Conversation

@cadedaniel (Collaborator) commented on Apr 3, 2024

This PR implements the BaseWorker interface described in #3809. This enables speculative decoding to treat all workers in the same way, allowing future work to enable speculative decoding for CPU/Neuron/other vLLM backends.

I will write up docs on how to do this after spec decode is merged. The TL;DR: a backend needs to implement the rejection sampler (currently implemented only in PyTorch) and add plumbing between the proposal method and the verification model that works for that hardware backend (currently only top-1 fixed speculation is implemented for torch).
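For context, here is a rough sketch of the kind of hardware-agnostic worker interface the RFC describes. The method names and signatures are assumptions sketched from the RFC's direction, not the exact merged vLLM API:

from abc import ABC, abstractmethod
from typing import List, Tuple

class BaseWorker(ABC):
    # Illustrative hardware-agnostic worker interface. Names and
    # signatures here are assumptions, not the final vLLM API.

    @abstractmethod
    def determine_num_available_blocks(self) -> Tuple[int, int]:
        """Profile memory and return (num_device_blocks, num_swap_blocks)."""

    @abstractmethod
    def initialize_cache(self, num_gpu_blocks: int,
                         num_cpu_blocks: int) -> None:
        """Allocate the KV cache, applying this backend's size checks."""

    @abstractmethod
    def execute_model(self, seq_group_metadata_list: List) -> List:
        """Run one model step and return sampler output."""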

Notes

  • This PR moves some logic from each of the executors back to their respective workers, for example the logic that checks whether the cache size is valid. This lets speculative decoding benefit from these checks, since the speculative worker composes workers within itself (not executors); see the sketch below. Notably, the checks differ between CPU and GPU workers.
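A minimal sketch of that composition point, loosely following the PR's SpecDecodeWorker; the names and bodies here are illustrative, not the exact merged code:

class SpecDecodeWorker:
    # Composes two workers (not executors), so validation that lives only
    # at the executor layer would never run for these inner workers.
    def __init__(self, proposer_worker, scorer_worker):
        self.proposer_worker = proposer_worker  # small draft model worker
        self.scorer_worker = scorer_worker      # large target model worker

    def initialize_cache(self, num_gpu_blocks, num_cpu_blocks):
        # Each inner worker runs its own hardware-specific cache-size
        # checks (CPU and GPU workers validate differently).
        for worker in (self.proposer_worker, self.scorer_worker):
            worker.initialize_cache(num_gpu_blocks, num_cpu_blocks)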

Closes #3809

@cadedaniel cadedaniel changed the title [Draft] [Misc] [Core] Implement RFC "Augment BaseExecutor interfaces to enable hardware-agnostic speculative decoding" [Misc] [Core] Implement RFC "Augment BaseExecutor interfaces to enable hardware-agnostic speculative decoding" Apr 5, 2024
@cadedaniel cadedaniel marked this pull request as ready for review April 5, 2024 05:59
@LiuXiaoxuanPKU LiuXiaoxuanPKU self-assigned this Apr 5, 2024
@LiuXiaoxuanPKU (Collaborator) left a comment

LGTM! Just some minor questions.

# Note: To reuse the cache management procedure,
# use cpu cache as 'gpu cache'.
num_cpu_blocks = num_gpu_blocks
del num_gpu_blocks
@LiuXiaoxuanPKU (Collaborator)

Confused about lines 209-212: why del here? Should this be self.cache_config.num_gpu_blocks = num_gpu_blocks?

@cadedaniel (Collaborator, Author)

I'm trying to make the code readable given the awkwardness of num_gpu_blocks actually being num_cpu_blocks, with num_cpu_blocks being ignored/always zero. But yeah, the del is a bit extreme for this.

I refactored out the checks; code is more readable now

self.cache_config.num_gpu_blocks = num_cpu_blocks
self.cache_config.num_cpu_blocks = 0

if num_cpu_blocks <= 0:
@LiuXiaoxuanPKU (Collaborator)

Nit: move the check before we read num_cpu_blocks
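For illustration, the reordering this nit suggests might look like the following; the surrounding context is reconstructed from the snippets quoted above, not the exact diff:

# Validate before the value is consumed.
if num_cpu_blocks <= 0:
    raise ValueError("insufficient available memory for CPU KV cache blocks")

# Reuse the GPU cache-management path by treating the CPU cache as the
# 'gpu cache'; the real CPU block count is stored in num_gpu_blocks here.
self.cache_config.num_gpu_blocks = num_cpu_blocks
self.cache_config.num_cpu_blocks = 0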

@cadedaniel cadedaniel enabled auto-merge (squash) April 9, 2024 04:49
@cadedaniel cadedaniel merged commit e7c7067 into vllm-project:main Apr 9, 2024
35 checks passed
SageMoore pushed a commit to neuralmagic/nm-vllm that referenced this pull request Apr 11, 2024
andy-neuma pushed a commit to neuralmagic/nm-vllm that referenced this pull request Apr 12, 2024
z103cb pushed a commit to z103cb/opendatahub_vllm that referenced this pull request Apr 22, 2024
Temirulan pushed a commit to Temirulan/vllm-whisper that referenced this pull request Sep 6, 2024