Closed
Description
Sampling is a known bottleneck in vLLM (see #421 and #670). Last weekend I came across a project named Medusa. Its blog post introduces a simple new decoding method that accelerates LLM generation and reports good performance. As far as I know, lepton.ai is already using this method.
Adopting Medusa heads is not difficult, since there is no separate model. However, tree attention and the typical acceptance scheme are not standard in most LLM inference frameworks, so supporting them would take significant effort.
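To make the two non-standard pieces concrete, here is a minimal pure-Python sketch of how I understand them from the Medusa blog. It is not Medusa's or vLLM's actual code. The function names, the `parents` encoding of the candidate tree, and the default values of `epsilon` and `delta` are all assumptions chosen for illustration. Tree attention lets each candidate token attend only to itself and its ancestors in the candidate tree, and typical acceptance keeps a candidate token when the base model's probability for it exceeds `min(epsilon, delta * exp(-entropy))`.

```python
import math


def tree_attention_mask(parents):
    """Build a boolean attention mask for a tree of candidate tokens.

    parents[i] is the index of node i's parent, or -1 if node i hangs
    directly off the already-accepted prefix (hypothetical encoding).
    mask[i][j] is True when node i may attend to node j, i.e. when j is
    node i itself or one of its ancestors.
    """
    n = len(parents)
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        j = i
        while j != -1:  # walk up to the root, marking every ancestor
            mask[i][j] = True
            j = parents[j]
    return mask


def typical_accept_length(candidate_tokens, step_probs, epsilon=0.09, delta=0.3):
    """Count how many leading candidate tokens pass typical acceptance.

    step_probs[k] is the base model's probability distribution (a list of
    floats) at candidate position k. The epsilon and delta defaults are
    illustrative assumptions, not tuned values.
    """
    accepted = 0
    for token, probs in zip(candidate_tokens, step_probs):
        entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
        threshold = min(epsilon, delta * math.exp(-entropy))
        if probs[token] > threshold:
            accepted += 1
        else:
            break  # acceptance must form a contiguous prefix
    return accepted


# Node 0 is a first-step candidate, nodes 1 and 2 are its children,
# and node 3 is a child of node 1.
mask = tree_attention_mask([-1, 0, 0, 1])

# One candidate path of three tokens over a toy two-token vocabulary.
n_accepted = typical_accept_length(
    [0, 1, 1],
    [[0.9, 0.1], [0.5, 0.5], [0.99, 0.01]],
)
```

In a real engine the mask would be fed to the attention kernel for a single forward pass over all candidates, and the acceptance check would be run per candidate path to pick the longest accepted one. Both parts would need to fit into vLLM's paged attention and sampler, which is where I expect most of the effort to go.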
Any advice or comments?