
[Feature] Speculative Decoding #1738

Open
@josephrocca

Description

Motivation

Speculative decoding can speed up generation by more than 2x. That degree of speedup is an important feature for a production-grade LM deployment library, and the methods now seem mature enough that they are making their way into frameworks like TGI and vLLM, so this might be a good time for LMDeploy to consider adding support for a popular/established speculative decoding method.
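
For reference, the core draft-and-verify loop behind these methods is small. Below is a minimal, hypothetical Python sketch of greedy speculative decoding; the `GreedyLM` stand-ins and the `speculative_decode` helper are illustrative only, not LMDeploy (or vLLM/TGI) APIs. A real engine would verify all drafted positions in one batched forward pass of the target model rather than per-position calls.

```python
from typing import Callable, List

Token = int
GreedyLM = Callable[[List[Token]], Token]  # maps a token sequence to its greedy next token


def speculative_decode(
    target: GreedyLM,
    draft: GreedyLM,
    prompt: List[Token],
    max_new_tokens: int = 32,
    k: int = 4,  # tokens the draft model proposes per verification step
) -> List[Token]:
    tokens = list(prompt)
    generated = 0
    while generated < max_new_tokens:
        # 1. The cheap draft model proposes k tokens autoregressively.
        ctx = list(tokens)
        proposal = []
        for _ in range(k):
            t = draft(ctx)
            proposal.append(t)
            ctx.append(t)

        # 2. The expensive target model verifies the proposals. A real engine
        #    scores all k positions in a single batched forward pass; the
        #    per-position calls here are only for clarity.
        for i, t in enumerate(proposal):
            expected = target(tokens + proposal[:i])
            if expected != t:
                # First mismatch: keep the target's own token and re-draft.
                tokens.append(expected)
                generated += 1
                break
            tokens.append(t)
            generated += 1
        else:
            # Every proposal accepted: the target's verification pass also
            # yields one extra "bonus" token for free.
            tokens.append(target(tokens))
            generated += 1
    return tokens[: len(prompt) + max_new_tokens]


if __name__ == "__main__":
    # Toy demo: both "models" just count upward, so every draft is accepted
    # and each loop iteration emits k + 1 tokens for one target pass.
    count_up: GreedyLM = lambda seq: seq[-1] + 1
    print(speculative_decode(count_up, count_up, [0], max_new_tokens=8))
    # -> [0, 1, 2, 3, 4, 5, 6, 7, 8]
```

The methods benchmarked below differ mainly in how the draft tokens are produced (a separate small model, extra decoding heads, retrieval, or prompt lookup), while the verify step stays essentially the same.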

Related resources

Below is a copy-paste from a neat project called Spec-Bench. The ranking when running 33B models is similar. Please see the linked repo for the latest data.

  • Device: a single NVIDIA GeForce RTX 3090 GPU (24GB) with 12 CPU cores
  • Testing environment: PyTorch 2.0.1, under CUDA 11.8
  • Experimental Settings: Vicuna-7B-v1.3, greedy decoding, FP16 precision, batch size = 1
| Models | Multi-turn Conversation | Translation | Summarization | Question Answering | Mathematical Reasoning | Retrieval-aug. Generation | #Mean Accepted Tokens | Overall |
|---|---|---|---|---|---|---|---|---|
| EAGLE 🏅 | 2.44x | 1.81x | 2.13x | 2.11x | 2.54x | 1.82x | 3.57 | 2.16x |
| SpS 🥈 | 1.98x | 1.37x | 2.00x | 1.95x | 1.89x | 1.76x | 2.29 | 1.83x |
| Hydra 🥉 | 2.04x | 1.67x | 1.56x | 1.81x | 2.16x | 1.48x | 3.26 | 1.80x |
| PLD | 1.57x | 1.07x | 2.31x | 1.25x | 1.62x | 1.56x | 1.74 | 1.55x |
| Medusa | 1.60x | 1.38x | 1.28x | 1.46x | 1.64x | 1.22x | 2.32 | 1.44x |
| REST | 1.49x | 1.18x | 1.21x | 1.46x | 1.35x | 1.27x | 1.63 | 1.32x |
| Lookahead | 1.13x | 0.97x | 1.05x | 1.07x | 1.29x | 0.98x | 1.65 | 1.08x |

Note that MLPSpeculator is not included in the benchmark since it is newer. Another new method that isn't included in Spec-Bench as of writing:
