## Motivation
Speculative decoding can speed up generation by more than 2x. That degree of speedup is an important feature for a production-grade LM deployment library, and the methods now seem mature enough to be making their way into frameworks like TGI and vLLM, so it might be a good time for LMDeploy to consider adding support for a popular, established speculative decoding method.
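For context, the core idea behind most of the methods listed below is draft-and-verify: a cheap draft model proposes several tokens, and the target model checks them in one pass, keeping the longest accepted prefix. A minimal greedy-decoding sketch (toy next-token functions stand in for real models; in practice verification is a single batched target forward, not a loop):

```python
# Minimal sketch of greedy draft-and-verify speculative decoding.
# target_next / draft_next are toy stand-ins for the large target model
# and the small draft model; a real implementation verifies all k
# proposals in one batched target forward pass.

def speculative_decode(target_next, draft_next, prompt, num_new, k=4):
    """Generate num_new tokens, proposing k draft tokens per target call."""
    tokens = list(prompt)
    while len(tokens) < len(prompt) + num_new:
        # 1. Draft model proposes k tokens autoregressively (cheap).
        proposal, ctx = [], list(tokens)
        for _ in range(k):
            t = draft_next(ctx)
            proposal.append(t)
            ctx.append(t)
        # 2. Target model verifies the proposals left to right.
        accepted, ctx = [], list(tokens)
        for t in proposal:
            if target_next(ctx) == t:       # greedy acceptance test
                accepted.append(t)
                ctx.append(t)
            else:
                # First mismatch: emit the target's own token and stop.
                accepted.append(target_next(ctx))
                break
        else:
            # All k accepted: the same verification pass yields a bonus token.
            accepted.append(target_next(ctx))
        tokens.extend(accepted)
    return tokens[:len(prompt) + num_new]
```

Under greedy decoding this reproduces exactly what the target model alone would generate; the speedup comes from amortizing one target pass over multiple accepted tokens.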
## Related resources
- TGI (supports Medusa and MLPSpeculator as of writing):
- vLLM (groundwork for several speculation methods in progress as of writing):
  - [WIP] Speculative decoding using a draft model vllm-project/vllm#2188
  - [Model] MLPSpeculator speculative decoding support vllm-project/vllm#4947
  - [Speculative Decoding] Medusa Implementation with Top-1 proposer vllm-project/vllm#4978
  - [Speculative Decoding] Enable arbitrary model inputs vllm-project/vllm#5101
  - [Speculative Decoding 1/2 ] Add typical acceptance sampling as one of the sampling techniques in the verifier vllm-project/vllm#5131
- MLC-LLM (supports only EAGLE as of writing):
  - [SpecDecode] Support Eagle in speculative decoding mlc-ai/mlc-llm#2080
  - [Eagle] Make eagle disco compatible mlc-ai/mlc-llm#2197
  - [Eagle] Avoid worker - engine transfer for hidden states mlc-ai/mlc-llm#2256
  - [Eagle] Fix token shifting for prefill step mlc-ai/mlc-llm#2266
  - [Eagle] Run additional decode for draft model when all proposals are accepted mlc-ai/mlc-llm#2294
  - [Eagle] Fix the requests for additional decode in eagle verify mlc-ai/mlc-llm#2336
Below is a copy-paste of benchmark results from a neat project called Spec-Bench. The ranking when running 33B models is similar; please see the linked repo for the latest data.
- Device: a single NVIDIA GeForce RTX 3090 GPU (24GB) with 12 CPU cores
- Testing environment: PyTorch 2.0.1, under CUDA 11.8
- Experimental settings: Vicuna-7B-v1.3, greedy decoding, FP16 precision, batch size = 1
| Models | Multi-turn Conversation | Translation | Summarization | Question Answering | Mathematical Reasoning | Retrieval-aug. Generation | #Mean Accepted Tokens | Overall |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| EAGLE🏅 | 2.44x | 1.81x | 2.13x | 2.11x | 2.54x | 1.82x | 3.57 | 2.16x |
| SpS🥈 | 1.98x | 1.37x | 2.00x | 1.95x | 1.89x | 1.76x | 2.29 | 1.83x |
| Hydra🥉 | 2.04x | 1.67x | 1.56x | 1.81x | 2.16x | 1.48x | 3.26 | 1.80x |
| PLD | 1.57x | 1.07x | 2.31x | 1.25x | 1.62x | 1.56x | 1.74 | 1.55x |
| Medusa | 1.60x | 1.38x | 1.28x | 1.46x | 1.64x | 1.22x | 2.32 | 1.44x |
| REST | 1.49x | 1.18x | 1.21x | 1.46x | 1.35x | 1.27x | 1.63 | 1.32x |
| Lookahead | 1.13x | 0.97x | 1.05x | 1.07x | 1.29x | 0.98x | 1.65 | 1.08x |
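The "#Mean Accepted Tokens" column explains much of the ranking: a rough first-order model (an assumption for intuition, not Spec-Bench's methodology) is that each verification round costs one target forward pass plus `k` draft passes at some relative cost, and yields the mean accepted tokens, so:

```python
# Back-of-envelope model: speedup from mean accepted tokens per round.
# Assumes one target forward per verification round plus k draft passes
# at relative cost draft_cost each; ignores sampling/kernel overheads,
# so real measurements (like the table above) will differ.

def estimated_speedup(mean_accepted, k, draft_cost):
    """Tokens per unit of target-model time, relative to plain decoding."""
    time_per_round = 1.0 + k * draft_cost   # 1 target pass + k draft passes
    return mean_accepted / time_per_round
```

This is why methods with cheap drafters and high acceptance (e.g. EAGLE's 3.57 mean accepted tokens) top the table: the numerator grows while the denominator stays close to 1.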
Note that MLPSpeculator is not included in the benchmark since it is newer. Another new method that isn't included in Spec-Bench as of writing: