Description
Background
Speculative decoding leverages the ability to cheaply generate proposals and cheaply verify them to achieve speedup for memory-bound inference. Different speculative decoding methods explore the frontier between proposal cost, alignment with the target model, and verification cost.
For example, Medusa produces very cheap proposals, but their quality is strictly lower than EAGLE's because the Medusa heads do not have access to the previous proposals. EAGLE, on the other hand, pays more for its proposals by sampling autoregressively instead of in one shot, but gets higher-quality proposals in return.
At the end of the day, what the user cares about dictates which speculative technique to use. vLLM's job is to offer the option that gives the best speedup for their use case.
Draft-model, EAGLE, and MLPSpeculator all rely on autoregressive proposals. Their top-1 proposals are therefore higher-quality than Medusa's, which gives vLLM an ITL reduction that is more FLOPs-efficient than Medusa's. This is what our speculative decoding efforts focus on first; afterward, we can support top-k proposals with Medusa so users who care more about ITL reduction can use vLLM.
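To make the propose/verify split concrete, here is a minimal, self-contained sketch (toy models and greedy acceptance only, not vLLM's implementation; `draft_next` and `target_next_tokens` are hypothetical stand-ins): the draft model proposes k tokens autoregressively, the target verifies them in a single batched pass, and we keep the matching prefix plus the target's token at the first mismatch.

```python
# Toy sketch of draft-then-verify speculative decoding (greedy acceptance,
# no bonus-token / rejection-sampling details).
from typing import List

VOCAB = 11  # tiny toy vocabulary


def draft_next(ctx: List[int]) -> int:
    """Cheap, approximate draft model: one next token per call."""
    return (sum(ctx) + 1) % VOCAB


def target_next_tokens(ctx: List[int], proposal: List[int]) -> List[int]:
    """Expensive target model: scores all proposal positions in one pass."""
    out, cur = [], list(ctx)
    for tok in proposal:
        out.append((3 * sum(cur) + 1) % VOCAB)  # target's own next token here
        cur.append(tok)
    return out


def speculative_step(ctx: List[int], k: int = 4) -> List[int]:
    # 1) Propose k tokens autoregressively with the cheap draft model.
    proposal, cur = [], list(ctx)
    for _ in range(k):
        tok = draft_next(cur)
        proposal.append(tok)
        cur.append(tok)

    # 2) Verify all k proposals with one target forward pass.
    verified = target_next_tokens(ctx, proposal)

    # 3) Accept the matching prefix; at the first mismatch, take the
    #    target's token and stop.
    accepted = []
    for d, t in zip(proposal, verified):
        if d == t:
            accepted.append(d)
        else:
            accepted.append(t)
            break
    return ctx + accepted


print(speculative_step([1, 2, 3], k=4))
```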
Speed up autoregressive proposal methods
This issue is to speed up autoregressive proposal methods by optimizing the sampler. Specifically, the sampler performs wasted work by copying sampled values to the CPU and serializing them into Python objects. In speculative decoding, we never use the Python objects because we consume the raw sampled token ids / probabilities directly from their GPU tensors. This means the copy and CPU serialization are pure overhead in speculative decoding.
How much overhead?
In profiling vLLM, I found that the copy + serialization in the draft model take ~441µs (cell J30). Note that the actual forward pass and sampling math of the draft model take 220µs + 639µs = 859µs per step. So by removing the unnecessary copy and serialization, we can generate ~50% more draft tokens in the same time (859µs vs. ~1300µs per draft step with the copy and serialization enabled).
This difference has a large impact on the overall performance of speculative decoding.
Furthermore, the subsequent draft-model forward pass must consume the output of the previous step. Keeping the sampled token ids on the GPU also lets us reduce the time spent in `prepare_inputs`. I don't have numbers here, but I expect a further ~150µs reduction per draft-model step from this (~300µs down to ~150µs).
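To make the "pure overhead" concrete, here is a hedged sketch (not vLLM's actual Sampler code) of where that ~441µs goes: the `.cpu()` / `.tolist()` calls force a blocking device-to-host copy and Python object construction, even though the spec-decode path only ever reads the GPU tensors.

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

# Per-request next-token distributions; the sampling math stays on the GPU.
probs = torch.rand(32, 32000, device=device).softmax(dim=-1)
sampled_token_ids = torch.multinomial(probs, num_samples=1)

# Pure overhead for speculative decoding: blocking GPU->CPU copy plus
# serialization into Python ints that the spec-decode worker never reads.
sampled_token_ids_cpu = sampled_token_ids.cpu()
python_token_ids = sampled_token_ids_cpu.flatten().tolist()
```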
The work
This issue is to:
- Make the CPU copy and CPU serialization optional in vLLM's sampler (thus leaving sampled token ids on the GPU), and then
- pass those sampled token ids to `prepare_inputs` of the next draft-model forward pass.
1. Make CPU serialization optional
Warmup task: a good warmup task for getting familiar with the Sampler is to add an option to disable logprobs for a given Worker. This will also provide some speedup to spec decode (~2ms off the e2e step time), but isn't part of this issue.
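As an illustration only (the flag name and `SamplerOutput` fields below are hypothetical, not vLLM's actual API), the warmup task amounts to gating logprob computation behind a per-worker flag:

```python
from dataclasses import dataclass
from typing import Optional

import torch


@dataclass
class SamplerOutput:
    sampled_token_ids: torch.Tensor          # [num_seqs, 1]
    logprobs: Optional[torch.Tensor] = None  # None when disabled


def sample(logits: torch.Tensor, disable_logprobs: bool = False) -> SamplerOutput:
    probs = logits.softmax(dim=-1)
    token_ids = torch.multinomial(probs, num_samples=1)
    # Skip the log-softmax entirely for workers that never read logprobs
    # (e.g. spec-decode proposer workers).
    logprobs = None if disable_logprobs else logits.log_softmax(dim=-1)
    return SamplerOutput(sampled_token_ids=token_ids, logprobs=logprobs)


out = sample(torch.randn(4, 32000), disable_logprobs=True)
```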
Code pointers:
- The `sample` method, which performs the CPU-GPU synchronization and serialization
- The actual for loop that performs the synchronization and serialization
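A rough sketch of the change, with hypothetical names (`include_cpu_outputs` and the `SamplerOutput` fields are illustrative, not vLLM's real signatures): gate the GPU->CPU copy and the Python serialization loop behind a flag, and always return the raw GPU tensors so the spec-decode worker can consume them directly.

```python
from dataclasses import dataclass
from typing import List, Optional

import torch


@dataclass
class SamplerOutput:
    sampled_token_ids_gpu: torch.Tensor                 # stays on the GPU
    sampled_token_probs_gpu: torch.Tensor               # stays on the GPU
    sampled_token_ids_cpu: Optional[List[int]] = None   # only when requested


def sample(logits: torch.Tensor, include_cpu_outputs: bool = True) -> SamplerOutput:
    probs = logits.softmax(dim=-1)
    token_ids = torch.multinomial(probs, num_samples=1)

    cpu_ids = None
    if include_cpu_outputs:
        # The path this issue makes optional: blocking copy + Python objects.
        cpu_ids = token_ids.flatten().cpu().tolist()

    return SamplerOutput(
        sampled_token_ids_gpu=token_ids,
        sampled_token_probs_gpu=probs,
        sampled_token_ids_cpu=cpu_ids,
    )


out = sample(torch.randn(4, 32000), include_cpu_outputs=False)
```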
2. Allow the `prepare_inputs` method to work on-device
The on-GPU sampled token ids should be appended to the batch built by the next `prepare_inputs` call.
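A minimal sketch, assuming the sampler now returns an on-GPU `[num_seqs, 1]` tensor of token ids (the names below are hypothetical): the next draft step's input ids can be taken straight from that tensor, with no CPU list and no host-to-device copy.

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

# Output of the previous draft-model step's sampler: [num_seqs, 1], on GPU.
prev_sampled_token_ids = torch.tensor([[101], [202], [303]], device=device)


def prepare_decode_input_ids(prev_sampled_token_ids: torch.Tensor) -> torch.Tensor:
    # Each decode step feeds exactly the last sampled token per sequence.
    # Reading it straight from the GPU tensor replaces the old path of
    # Python lists -> torch.tensor(...) -> host-to-device copy.
    return prev_sampled_token_ids.flatten()


input_ids = prepare_decode_input_ids(prev_sampled_token_ids)
```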