[Feature] Speculative Decoding

### Motivation

Speculative decoding can speed up generation more than 2x. This degree of speedup is an important feature for a production-grade LM deployment library, and it seems the methods are starting to mature enough to make their way into frameworks like TGI and vLLM, so might be a good time for LMDeploy to consider adding support for a popular/established speculative decoding method.

### Related resources

* TGI (supports Medusa and MLPSpeculator as of writing):
  * https://huggingface.co/docs/text-generation-inference/basic_tutorials/train_medusa
  * https://github.com/huggingface/text-generation-inference/pull/1865
* vLLM (groundwork for several speculation methods in progress as of writing):
  * https://github.com/vllm-project/vllm/pull/2188
  * https://github.com/vllm-project/vllm/pull/4947
  * https://github.com/vllm-project/vllm/pull/4978
  * https://github.com/vllm-project/vllm/pull/5101
  * https://github.com/vllm-project/vllm/pull/5131
 * MLC-LLM (supports only EAGLE as of writing):
   * https://github.com/mlc-ai/mlc-llm/pull/2080
   * https://github.com/mlc-ai/mlc-llm/pull/2197
   * https://github.com/mlc-ai/mlc-llm/pull/2256
   * https://github.com/mlc-ai/mlc-llm/pull/2266
   * https://github.com/mlc-ai/mlc-llm/pull/2294
   * https://github.com/mlc-ai/mlc-llm/pull/2336


Below is a copy-paste from a neat project called [Spec-Bench](https://github.com/hemingkx/Spec-Bench/blob/main/Leaderboard.md). The ranking when running 33B models is similar. Please see the linked repo for latest data.

- Device: a single NVIDIA GeForce RTX 3090 GPU (24GB) with 12 CPU cores
- Testing environment: Pytorch 2.0.1, under CUDA 11.8
- Experimental Settings: Vicuna-7B-v1.3, greedy decoding, FP16 precision, batch size = 1

| Models                                                       | Multi-turn Conversation | Translation | Summa-rization | Question Answering | Mathematical Reasoning | Retrieval-aug. Generation | #Mean Accepted Tokens |  Overall  |
| ------------------------------------------------------------ | :---------------------: | :---------: | :------------: | :----------------: | :--------------------: | :-----------------------: | :-------------------: | :-------: |
| [EAGLE](https://sites.google.com/view/eagle-llm)🏅            |        **2.44x**        |  **1.81x**  |     2.13x      |     **2.11x**      |       **2.54x**        |         **1.82x**         |       **3.57**        | **2.16x** |
| [SpS](https://huggingface.co/blog/assisted-generation)🥈      |          1.98x          |    1.37x    |     2.00x      |       1.95x        |         1.89x          |           1.76x           |         2.29          |   1.83x   |
| [Hydra](https://github.com/zankner/hydra)🥉                   |          2.04x          |    1.67x    |     1.56x      |       1.81x        |         2.16x          |           1.48x           |         3.26          |   1.80x   |
| [PLD](https://github.com/apoorvumang/prompt-lookup-decoding) |          1.57x          |    1.07x    |   **2.31x**    |       1.25x        |         1.62x          |           1.56x           |         1.74          |   1.55x   |
| [Medusa](https://sites.google.com/view/medusa-llm)           |          1.60x          |    1.38x    |     1.28x      |       1.46x        |         1.64x          |           1.22x           |         2.32          |   1.44x   |
| [REST](https://sites.google.com/view/rest-llm)               |          1.49x          |    1.18x    |     1.21x      |       1.46x        |         1.35x          |           1.27x           |         1.63          |   1.32x   |
| [Lookahead](https://lmsys.org/blog/2023-11-21-lookahead-decoding/) |          1.13x          |    0.97x    |     1.05x      |       1.07x        |         1.29x          |           0.98x           |         1.65          |   1.08x   |

Note that MLPSpeculator is not included in the benchmark since it is newer. Another new method that isn't included in Spec-Bench as of writing:

* https://github.com/apple/ml-recurrent-drafter

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[Feature] Speculative Decoding #1738

Motivation

Related resources

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Models	Multi-turn Conversation	Translation	Summa-rization	Question Answering	Mathematical Reasoning	Retrieval-aug. Generation	#Mean Accepted Tokens	Overall
EAGLE🏅	2.44x	1.81x	2.13x	2.11x	2.54x	1.82x	3.57	2.16x
SpS🥈	1.98x	1.37x	2.00x	1.95x	1.89x	1.76x	2.29	1.83x
Hydra🥉	2.04x	1.67x	1.56x	1.81x	2.16x	1.48x	3.26	1.80x
PLD	1.57x	1.07x	2.31x	1.25x	1.62x	1.56x	1.74	1.55x
Medusa	1.60x	1.38x	1.28x	1.46x	1.64x	1.22x	2.32	1.44x
REST	1.49x	1.18x	1.21x	1.46x	1.35x	1.27x	1.63	1.32x
Lookahead	1.13x	0.97x	1.05x	1.07x	1.29x	0.98x	1.65	1.08x

[Feature] Speculative Decoding #1738

Description

Motivation

Related resources

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions