
[Performance] [Speculative decoding]: Support draft model on different tensor-parallel size than target model #4632

Closed
cadedaniel opened this issue May 6, 2024 · 5 comments · Fixed by #5414 · May be fixed by #4933 or #5856
Labels
help wanted · performance · speculative-decoding

Comments

@cadedaniel
Collaborator

cadedaniel commented May 6, 2024

Overview

Speculative decoding allows a speedup for memory-bound LLMs by using a fast proposal method to propose tokens that are then verified in a single forward pass by the larger LLM. Papers report 2-3x speedup at bs=1; in Anyscale's fork we see up to 2x speedup with a small draft model at bs=8 (30% at bs=16). We can improve this! See #4630 if you want to help.
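To make the propose-then-verify loop concrete, here is a toy sketch. The `draft_next` and `target_preds` functions are stand-ins for real models, and the greedy match-prefix acceptance rule is a simplification of the rejection-sampling verification used in practice:

```python
# Toy sketch of one speculative decoding step. draft_next and target_preds
# are illustrative stand-ins, not vLLM APIs; acceptance here is greedy.

def draft_next(tokens):
    # Cheap draft "model": usually agrees with the target, sometimes not.
    nxt = (tokens[-1] + 1) % 10
    return nxt if len(tokens) % 5 else (nxt + 3) % 10  # inject a disagreement

def target_preds(tokens):
    # One target "forward pass": greedy next-token prediction for every
    # prefix of `tokens`, which is what a single verification pass yields.
    return [(t + 1) % 10 for t in tokens]

def spec_decode_step(prefix, k=4):
    ctx = list(prefix)
    for _ in range(k):                  # k cheap autoregressive draft steps
        ctx.append(draft_next(ctx))
    proposal = ctx[len(prefix):]

    preds = target_preds(ctx)           # one target pass scores all positions
    accepted = []
    for i, tok in enumerate(proposal):
        if preds[len(prefix) + i - 1] != tok:  # target disagrees: stop here
            break
        accepted.append(tok)
    else:
        accepted.append(preds[-1])      # all k accepted: free "bonus" token
    return prefix + accepted

print(spec_decode_step([0, 1, 2], k=4))  # accepts 2 of the 4 proposed tokens
```

The speedup comes from the single target pass verifying up to k tokens at once; proposal latency is pure overhead, which is why the draft model's tensor-parallel configuration matters so much below.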

A key optimization for small draft models (the 68m/160m range) is to use tensor-parallel degree 1, even if the target model uses tensor-parallel degree 4 or 8. In our fork, this reduces proposal time from 5ms/tok to 1.5ms/tok. This will allow a well-aligned 68m draft model to get a 2x per-user throughput improvement on a 70B target model.

Furthermore, a 1B/7B proposer model may ideally be placed on TP=2 or TP=4, while the larger model is placed on TP=8. vLLM should support these configurations so the community can use whichever configuration is best for their draft model.
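For illustration, the user-facing configuration could look something like the sketch below. The `speculative_draft_tensor_parallel_size` argument is an assumed name for the knob this issue requests (it did not exist in vLLM when this issue was filed), and the model names are just examples:

```python
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-2-70b-hf",       # target model on TP=8
    tensor_parallel_size=8,
    speculative_model="JackFram/llama-68m",  # small, well-aligned draft model
    num_speculative_tokens=4,
    # Assumed new argument: run the draft model on its own, smaller TP group.
    speculative_draft_tensor_parallel_size=1,
)
```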

Design suggestions

I implemented a Worker which patches the tensor-parallel group to TP1 in our fork. The code is dumped here. We should use this approach in vLLM; however, we can improve it by using @youkaichao's tensor-parallel group improvements.
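For reference, a minimal sketch of the patched-group idea, assuming a process-global TP group that parallel layers read from; `TpGroup`, `_TP_GROUP`, `patched_tp_group`, and `DraftWorker` are illustrative stand-ins, not vLLM's actual internals:

```python
# Sketch of running the draft model under a smaller tensor-parallel group.
# All names here are hypothetical, not vLLM's real distributed internals.
from contextlib import contextmanager

class TpGroup:
    def __init__(self, ranks):
        self.ranks = ranks  # ranks participating in tensor parallelism

_TP_GROUP = TpGroup(ranks=[0, 1, 2, 3])  # target model group, e.g. TP=4

@contextmanager
def patched_tp_group(group):
    """Temporarily swap the process-global TP group so that model code
    (e.g. column/row-parallel layers) sees the smaller draft group."""
    global _TP_GROUP
    old, _TP_GROUP = _TP_GROUP, group
    try:
        yield
    finally:
        _TP_GROUP = old  # the target model always sees the full group again

class DraftWorker:
    """Runs the draft model on a TP=1 subgroup; other ranks stay idle
    during proposal (or receive the proposals via broadcast later)."""
    def __init__(self, rank):
        self.rank = rank
        self.draft_group = TpGroup(ranks=[0])

    def propose(self, run_draft_model):
        if self.rank not in self.draft_group.ranks:
            return None
        with patched_tp_group(self.draft_group):
            return run_draft_model()
```

Scoping the patch with a context manager keeps the smaller group local to the draft model's forward pass, which is the part @youkaichao's group abstractions should make cleaner.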

@youkaichao
Member

I can work on this after the major refactor of distributed (#4591) lands.

@wooyeonlee0
Contributor

wooyeonlee0 commented Jun 5, 2024

@cadedaniel
Can I contribute my code that already implements this feature on v0.4.2?
I've referred to your code in #2188.

I'm aware that #4933 is in progress, so I want to confirm that it's okay to do it.

@GeauxEric
Contributor

@wooyeonlee0
pls go ahead.

@cadedaniel
Collaborator Author

cadedaniel commented Jun 6, 2024

yep, my policy is to review the PRs in the order that they're initially ready for review. go ahead @wooyeonlee0.

@wooyeonlee0
Contributor

Thanks for the answer :)
I'll send a PR maybe next week.
