
[Feature]: Support for a draft model that takes inputs from base model (to support Medusa/EAGLE/Hydra) #4669

Closed
abhigoyal1997 opened this issue May 8, 2024 · 5 comments · Fixed by #4978

Comments

@abhigoyal1997
Contributor

🚀 The feature, motivation and pitch

In approaches like Medusa/EAGLE/Hydra, the speculative model uses the last hidden states from the base model to propose candidates. This feature would allow any such approach to be implemented with ease. One idea is to store the required base model outputs along with the sequence and then use them while generating candidates for the next iteration.
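As a rough illustration of the idea above (all names here are hypothetical, not actual vllm interfaces): the base model's last hidden state is carried alongside each sequence, and Medusa-style heads consume it to propose draft tokens for the next iteration.

```python
from dataclasses import dataclass

import numpy as np

# Hypothetical sketch: store the base model's last hidden state with
# each sequence so a Medusa-style draft model can use it to propose
# candidates. None of these names are vllm APIs.

@dataclass
class SequenceState:
    token_ids: list
    last_hidden: np.ndarray = None  # set after each base-model forward pass

class MedusaStyleDrafter:
    """One small linear head per speculative position (top-1 only)."""

    def __init__(self, heads):
        # heads: list of (hidden_dim, vocab_size) weight matrices,
        # one per future position to speculate.
        self.heads = heads

    def propose(self, seq: SequenceState):
        assert seq.last_hidden is not None, "run the base model first"
        # Each head maps the same hidden state to logits for one future
        # position; take the argmax as that position's draft token.
        return [int(np.argmax(seq.last_hidden @ w)) for w in self.heads]
```

After verification, the base model's forward pass over the accepted tokens would refresh `last_hidden` for the next round.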

Alternatives

No response

Additional context

No response

@abhigoyal1997
Contributor Author

abhigoyal1997 commented May 8, 2024

I have implemented Medusa using this. If this makes sense and can be accepted as a contribution, I would love to create a PR (including the implementation of Medusa). I am also working on implementing the EAGLE approach.

@abhigoyal1997 abhigoyal1997 changed the title [Feature]: Support for a proposal model that takes inputs from base model [Feature]: Support for a draft model that takes inputs from base model May 8, 2024
@abhigoyal1997 abhigoyal1997 changed the title [Feature]: Support for a draft model that takes inputs from base model [Feature]: Support for a draft model that takes inputs from base model (to support Medusa/EAGLE/Hydra) May 8, 2024
@KexinFeng

KexinFeng commented May 11, 2024

@abhigoyal1997 This is indeed an important feature that people have been looking for. It's also on my exploration radar, and I look forward to its implementation in vllm.
Here is a detailed question: I know that for Medusa, tree-draft tokens play an essential role, and for EAGLE they are also important. In your implementation, did you enable tree-draft tokens, or is it still single-sequence draft tokens?

I'm asking because I'm developing this tree-style speculation, and it would be a perfect match with Medusa/EAGLE/Hydra here. We could perhaps combine efforts and see how much performance improves when the two techniques are put together. #4565 (comment)

@abhigoyal1997
Contributor Author

Hi @KexinFeng
Currently, what I've implemented only takes the top-1 predictions to get single-sequence draft tokens. I agree that tree-style speculation is essential for significant acceleration. I've observed this in a torch.compile-based implementation I worked on (based on gpt-fast), but I haven't tried implementing it in vllm yet, as it looked more complicated at the time and I knew it was already being worked on.

As for the current implementation of Medusa and EAGLE using a single sequence, I'll create a PR as soon as I've tested it a bit more and have company approvals.
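For what it's worth, the single-sequence vs. tree-style contrast discussed above can be sketched abstractly (illustrative code only, not tied to any vllm interface): top-1 drafting yields one chain of draft tokens, while tree-style drafting keeps the top-k candidates at each speculative step.

```python
import numpy as np

def draft_chain(step_logits):
    """Single-sequence drafting: take the top-1 token at every step."""
    return [int(np.argmax(logits)) for logits in step_logits]

def draft_tree(step_logits, k=2):
    """Tree-style drafting: keep the top-k candidates per step.

    Verifying all root-to-leaf paths of the resulting tree in one
    batched pass is what requires tree attention on the base model.
    """
    return [list(map(int, np.argsort(logits)[::-1][:k]))
            for logits in step_logits]
```

The chain is the k=1 special case of the tree; the win from the tree comes from accepting the longest matching path across many candidates per step.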

@youkaichao
Member

cc @cadedaniel @LiuXiaoxuanPKU for visibility.

@Siegfried-qgf

> @abhigoyal1997 This is indeed an important feature that people have been looking for. It's also on my exploration radar, and I look forward to its implementation in vllm. Here is a detailed question: I know that for Medusa, tree-draft tokens play an essential role, and for EAGLE they are also important. In your implementation, did you enable tree-draft tokens, or is it still single-sequence draft tokens?
>
> I'm asking because I'm developing this tree-style speculation, and it would be a perfect match with Medusa/EAGLE/Hydra here. We could perhaps combine efforts and see how much performance improves when the two techniques are put together. #4565 (comment)

I'm excited that you're working on this. I'm also considering adding tree attention to vllm to adapt it to EAGLE. How is your work going? Could you point me to where I should focus my changes, and do you have plans to open-source it?

4 participants