[Discussion] Will vLLM consider using Speculative Sampling to accelerating LLM decoding?

Sampling is an already known bottleneck of vLLM(see #421 and #670 ). Last weekend I saw a project named [Medusa](https://github.com/FasterDecoding/Medusa), in it's [blog](https://sites.google.com/view/medusa-llm), it introduce a new simple decoding way to accelerate LLM generation and reach a good performance. As far as I known, [lepton.ai](https://www.lepton.ai/docs) is alreay use this method.
Adopting Medusa Heads is not difficult, since there is no seperate model. But tree attention and typical acceptance scheme is not a standard process for most LLM inference framework and should take a huge effort.

Any advice or comments?

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

[Discussion] Will vLLM consider using Speculative Sampling to accelerating LLM decoding? #1171

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Uh oh!

[Discussion] Will vLLM consider using Speculative Sampling to accelerating LLM decoding? #1171

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions