Description
Proposal to improve performance
With the end-to-end correctness tests merged in #3951, the next step is to optimize the implementation to get a ~50% speedup on a 70B model with temperature 1.0.
Work required:
- P0/P1 -- priority
- (Small/Medium/Large) -- relative size estimate
- Optimizing proposal time
  - P0 (Large) Reduce draft model control-plane communication from O(num_steps) to O(1)
  - P0 (Medium) Support draft model on a different tensor-parallel size than the target model (#4632)
- Optimizations for scoring time
  - P0 (Medium) Re-enable bonus tokens to increase % accepted tokens (#4212)
  - P1 (Large) Replace CPU-based batch expansion with a multi-query attention kernel call (see the sketch after this list)
  - P1 (Medium) Automate speculative decoding (#4565)
- Optimizations for both proposal and scoring time (#5561)
  - P0 (Medium) Decouple sampling serialization from sampling
  - P1 (Large) Amortize `prepare_inputs` over multiple forward passes
- Optimizations for scheduling time
  - P0 (Medium) Profile & optimize BlockManagerV2 (#4536)
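To make the batch-expansion item concrete, here is a small illustrative sketch in plain Python. It is not vLLM code, and the names (`Proposal`, `expand_for_scoring`, `multi_query_scoring_shapes`) are hypothetical: it only shows how CPU-based batch expansion turns a batch of B proposals with k draft tokens each into B * (k + 1) single-query scoring sequences, whereas a multi-query attention kernel could keep one sequence per proposal and score all k + 1 query positions in a single pass.

```python
from dataclasses import dataclass


@dataclass
class Proposal:
    prompt: list[int]        # token ids already in the sequence
    draft_tokens: list[int]  # k tokens proposed by the draft model


def expand_for_scoring(proposals: list[Proposal]) -> list[list[int]]:
    """Batch expansion: one single-query scoring sequence per proposed prefix.

    A batch of B proposals with k draft tokens each becomes B * (k + 1)
    sequences, all assembled on the CPU before the target forward pass.
    """
    expanded = []
    for p in proposals:
        for i in range(len(p.draft_tokens) + 1):
            expanded.append(p.prompt + p.draft_tokens[:i])
    return expanded


def multi_query_scoring_shapes(proposals: list[Proposal]) -> list[tuple[int, int]]:
    """Multi-query scoring: each proposal stays one sequence; the target model
    scores its k + 1 query positions in a single forward pass (no expansion)."""
    return [(1, len(p.draft_tokens) + 1) for p in proposals]


if __name__ == "__main__":
    batch = [Proposal(prompt=[1, 2, 3], draft_tokens=[7, 8, 9]) for _ in range(8)]
    print(len(expand_for_scoring(batch)))        # 32 sequences after expansion
    print(multi_query_scoring_shapes(batch)[0])  # (1, 4): one sequence, 4 query positions
```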
FAQ
**What should the target configuration be for 50% speedup?**
In the Anyscale fork we saw a 50% speedup at bs=8 with a 68m-parameter draft model (TP1) against a 70B target model (TP8), and with a 7B draft model (TP1 or TP8) against a 70B target model (TP8). This was with the optimizations listed above as P0.
Note we can do much better than this with multi-query scoring (P1), GQA for target model scoring, and a dynamic speculation policy. This is just the starting point!
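For reference, a minimal sketch of how such a setup might be expressed with vLLM's offline API, assuming the speculative decoding engine arguments available around this time (`speculative_model`, `num_speculative_tokens`, `use_v2_block_manager`); the model paths are placeholders and exact argument names may differ across versions. Draft-model tensor-parallel placement is not shown, since that is what #4632 addresses.

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-2-70b-hf",      # 70B target model
    tensor_parallel_size=8,                 # target model on TP8
    speculative_model="path/to/68m-draft",  # placeholder for a small draft model
    num_speculative_tokens=5,               # draft tokens proposed per step
    use_v2_block_manager=True,              # BlockManagerV2, referenced above
)

outputs = llm.generate(
    ["The future of speculative decoding is"],
    SamplingParams(temperature=1.0, max_tokens=128),
)
print(outputs[0].outputs[0].text)
```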
**Why not implement Medusa / tree-attention?**
We should implement this! The work here lays the foundation for future improvements in speculative decoding. For example, Eagle uses the Medusa approach (fine-tuned heads plus tree attention) and even claims to beat Medusa. But for Eagle to work well in vLLM, we need to optimize the sampler as listed above.
The north star is a configurable tree size (from top-k down to top-1) that uses multi-query attention for scoring (no batch expansion). This issue is about optimizing vLLM in the top-1 speculation case to get a 50% speedup with draft models.
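As a rough illustration of the "configurable tree size" point (plain Python, not vLLM code): a top-1 chain of depth k proposes k candidate tokens, while a tree with branching factor b proposes b + b^2 + ... + b^k candidates, all of which could be scored in one multi-query attention pass rather than via batch expansion.

```python
def num_tree_candidates(branching: int, depth: int) -> int:
    """Candidate tokens in a full speculation tree with the given branching factor and depth."""
    return sum(branching ** level for level in range(1, depth + 1))

print(num_tree_candidates(branching=1, depth=5))  # 5  -> top-1 chain (the case this issue targets)
print(num_tree_candidates(branching=3, depth=3))  # 39 -> Medusa/Eagle-style token tree
```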