
[Feature]: Support for a draft model that takes inputs from base model (to support Medusa/EAGLE/Hydra) #4669

Closed
abhigoyal1997 opened this issue May 8, 2024 · 5 comments · Fixed by #4978

Comments

@abhigoyal1997
Contributor

🚀 The feature, motivation and pitch

In approaches like Medusa/EAGLE/Hydra, the speculative model uses the last hidden states from the base model to propose candidates. This feature would allow any such approach to be implemented with ease. One idea is to store the required base model outputs along with the sequence and then use them while generating candidates for the next iteration.
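As a rough illustration of the idea above (all names here are hypothetical, not actual vllm interfaces): the base model's last hidden state is carried alongside each sequence, and Medusa-style heads consume it to propose draft tokens for the next iteration.

```python
from dataclasses import dataclass

import numpy as np

# Hypothetical sketch: store the base model's last hidden state with
# each sequence so a Medusa-style draft model can use it to propose
# candidates. None of these names are vllm APIs.

@dataclass
class SequenceState:
    token_ids: list
    last_hidden: np.ndarray = None  # set after each base-model forward pass

class MedusaStyleDrafter:
    """One small linear head per speculative position (top-1 only)."""

    def __init__(self, heads):
        # heads: list of (hidden_dim, vocab_size) weight matrices,
        # one per future position to speculate.
        self.heads = heads

    def propose(self, seq: SequenceState):
        assert seq.last_hidden is not None, "run the base model first"
        # Each head maps the same hidden state to logits for one future
        # position; take the argmax as that position's draft token.
        return [int(np.argmax(seq.last_hidden @ w)) for w in self.heads]
```

After verification, the base model's forward pass over the accepted tokens would refresh `last_hidden` for the next round.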

Alternatives

No response

Additional context

No response

@abhigoyal1997
Contributor Author

abhigoyal1997 commented May 8, 2024

I have implemented Medusa using this. If this makes sense and can be accepted as a contribution, I would love to create a PR (including the implementation of Medusa). I am also working on implementing the EAGLE approach.

@abhigoyal1997 abhigoyal1997 changed the title [Feature]: Support for a proposal model that takes inputs from base model [Feature]: Support for a draft model that takes inputs from base model May 8, 2024
@abhigoyal1997 abhigoyal1997 changed the title [Feature]: Support for a draft model that takes inputs from base model [Feature]: Support for a draft model that takes inputs from base model (to support Medusa/EAGLE/Hydra) May 8, 2024
@KexinFeng

KexinFeng commented May 11, 2024

@abhigoyal1997 This is indeed an important feature that people have been looking for. It's also on my exploration radar, and I look forward to its implementation in vllm.
Here is a detailed question: I know that for Medusa, tree-draft tokens play an essential role, and for EAGLE they are also important. In your implementation, did you enable tree-draft tokens, or is it still single-sequence draft tokens?

I'm asking because I'm developing this tree-style speculation, and it would be a perfect match with Medusa/EAGLE/Hydra here. We could perhaps combine efforts and see how much performance improves when the two techniques are put together. #4565 (comment)

@abhigoyal1997
Contributor Author

Hi @KexinFeng
Currently, what I've implemented only takes the top-1 predictions to get single-sequence draft tokens. I agree that tree-style speculation is essential for significant acceleration. I've observed this in a torch.compile-based implementation I worked on (based on gpt-fast), but I haven't tried implementing it in vllm yet, as it looked more complicated at the time and I knew it was already being worked on.

As for the current implementation of Medusa and EAGLE using a single sequence, I'll create a PR as soon as I've tested it a bit more and have company approvals.
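For what it's worth, the single-sequence vs. tree-style contrast discussed above can be sketched abstractly (illustrative code only, not tied to any vllm interface): top-1 drafting yields one chain of draft tokens, while tree-style drafting keeps the top-k candidates at each speculative step.

```python
import numpy as np

def draft_chain(step_logits):
    """Single-sequence drafting: take the top-1 token at every step."""
    return [int(np.argmax(logits)) for logits in step_logits]

def draft_tree(step_logits, k=2):
    """Tree-style drafting: keep the top-k candidates per step.

    Verifying all root-to-leaf paths of the resulting tree in one
    batched pass is what requires tree attention on the base model.
    """
    return [list(map(int, np.argsort(logits)[::-1][:k]))
            for logits in step_logits]
```

The chain is the k=1 special case of the tree; the win from the tree comes from accepting the longest matching path across many candidates per step.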

@youkaichao
Member

cc @cadedaniel @LiuXiaoxuanPKU for visibility.

@Siegfried-qgf

> @abhigoyal1997 This is indeed an important feature that people have been looking for. It's also on my exploration radar, and I look forward to its implementation in vllm. Here is a detailed question: I know that for Medusa, tree-draft tokens play an essential role, and for EAGLE they are also important. In your implementation, did you enable tree-draft tokens, or is it still single-sequence draft tokens?
>
> I'm asking because I'm developing this tree-style speculation, and it would be a perfect match with Medusa/EAGLE/Hydra here. We could perhaps combine efforts and see how much performance improves when the two techniques are put together. #4565 (comment)

I'm excited that you're working on this. I'm also considering adding tree attention to vllm to adapt it to EAGLE. How is your work going? Could you point me to where I should focus my changes, and do you have plans to open-source it?

4 participants