[Speculative decoding] [Help wanted] [Performance] Optimize draft-model speculative decoding #4630
Comments
This should be doable. Just need to figure out the UX change of how users use it. Do spec workers and non-spec workers share process/device? e.g. when we have tp=8 in current code, and want to add another tp=2 for spec decoding, do we want tp=2 to be another 2 processes, or from the subset of the tp=8 processes? |
See the code linked here @youkaichao : #4632. The spec worker and non-spec workers share the same process. |
About tree-attention/Medusa/Eagle: one of the core pieces of the implementation will be a tree attention mask in flash attention, which is not ready yet. I'd like to draw your attention to Dao-AILab/flash-attention#924. It would be great if anyone would like to contribute to it. |
Hi @cadedaniel, I have tried the current main branch to evaluate the acceleration of speculative decoding, but encountered the following assertion error: vllm/vllm/executor/ray_gpu_executor.py Lines 28 to 32 in 190bc83
I'm wondering how the 50% speedup is measured. Are there still pending PRs? And, as the draft model looks so small (68m-sized), may I know if the 50% speedup is measured with greedy sampling or random sampling? Thanks! |
@LiuXiaoxuanPKU has more on this |
@sighingnow this issue is for getting the 50% speedup. once the P0s are done we will get it with temperature 1.0. |
I have met the same problem. Is there a solution? By the way, is there any documentation on how to evaluate the acceleration of speculative decoding? Thanks! |
May I know more about the accept rate when we get the 50% speedup? Thanks! |
On llama2 7b / llama2 70b, the acceptance rate was like 80% (no fine tuning). we trained a 68m draft model at anyscale that gets ~50% acceptance rate. btw you can run acceptance rate experiments today (I will push a PR tomorrow for TP>1 support)
Thanks @ChuanhongLi -- FYI there is no acceleration yet. we'll share documentation once there is a useful speedup. |
Thanks for the information! Looking forward to the complete speculative decoding support! |
Thanks for your reply! |
I noticed there's a feature request related to Medusa/Eagle at #4669 |
@cadedaniel May I know how you calculated the acceptance rate? On llama2 7b / llama2 70b, this acceptance rate seems a little high for just a 50% speedup. |
Hi @cadedaniel @LiuXiaoxuanPKU, I have pushed a multi-query scorer implementation in #6185. Could you please take a look at it and let me know what you think about it? Thanks! |
Thanks everyone for the help! We hit a 45% latency reduction. Big thanks to @sroy745 @alexm-neuralmagic @comaniac @wooyeonlee0 @zifeitong @LiuXiaoxuanPKU @rkooo567 @ruisearch42 and everyone else who has helped reduce vLLM overheads! I expect there to be more performance gains once we move the API server outside of the worker; we can re-run evals then. |
@cadedaniel thanks for leading this project! |
@cadedaniel Thanks for leading this effort. |
Is speculative decoding optimized on vLLM now? The doc on speculative decoding mentions that it is not currently optimized and links this PR to track progress, but this PR was closed as complete on Aug 5. |
I believe the ongoing optimisation work is the checklist in the PR description @ishan-scribe |
Many of the tasks mentioned in the PR description have been addressed, and the improvements from these changes have been reported by Cade here: #4630 (comment). There are still ongoing improvements to speculative decoding, e.g. enabling chunked_prefill with speculative decoding, MQA support for speculative decoding, etc. |
Hi @cadedaniel |
How much performance improvement has THPT achieved? |
what is THPT? @v-lmn |
Could you check that chunked prefill is disabled in your baseline? For llama3.1 70B, chunked prefill might be turned on by default because the context length is long. You could disable chunked prefill and compare TTFT again. Spec decode should not help with TTFT. It will actually slow down TTFT because you need to run the draft model's prefill. |
Proposal to improve performance
With the end-to-end correctness tests merged in #3951, we will now optimize the implementation to get a ~50% speedup on a 70B model with temperature 1.0.
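For readers wondering why temperature 1.0 matters here: with random sampling, draft tokens are usually verified with the standard speculative sampling accept/reject rule rather than a simple greedy match. The following is a minimal pure-Python sketch of that rule for a single token. The function name `accept_or_resample` and the list-based distributions are illustrative assumptions, not vLLM code.

```python
import random
from typing import List, Tuple


def accept_or_resample(
    draft_token: int,
    p: List[float],  # target-model distribution over the vocabulary
    q: List[float],  # draft-model distribution over the vocabulary
    rng: random.Random,
) -> Tuple[bool, int]:
    """Return (accepted, token) for one speculated token.

    Hypothetical helper: accepts the draft token with probability
    min(1, p[x] / q[x]); on rejection, samples a replacement from the
    normalized residual max(0, p - q).
    """
    # q[draft_token] > 0 because the draft token was sampled from q.
    ratio = p[draft_token] / q[draft_token]
    if rng.random() < min(1.0, ratio):
        return True, draft_token

    # Rejection implies p[x] < q[x], so the residual has positive mass.
    residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    token = rng.choices(range(len(p)), weights=residual, k=1)[0]
    return False, token
```

The rule keeps the output distribution equal to the target model's distribution, which is why acceptance rates under temperature 1.0 depend on how closely the draft distribution matches the target, not only on the top-1 token.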
Work required:
P0/P1 -- priority
(Small/Medium/Large) -- relative size estimate
prepare_inputs over multiple forward passes
FAQ
What should the target configuration be for 50% speedup?
In the Anyscale fork we saw a 50% speedup at bs=8 in two configurations: a 68m-sized draft model on TP1 with a 70B target model on TP8, and a 7B draft model on TP(1|8) with a 70B target model on TP8. This was with the optimizations listed above as "P0".
Note we can do much better than this, with multi-query scoring (P1), GQA for target model scoring, and a dynamic speculation policy. This is just the starting point!
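As a rough way to reason about how acceptance rate relates to speedup, here is a small back-of-the-envelope model. It assumes each draft token is accepted independently with probability `alpha`, and that draft cost and per-step overhead can be expressed as fractions of one target forward pass. These function names and parameters are illustrative assumptions, not measurements from vLLM.

```python
def expected_tokens_per_step(alpha: float, k: int) -> float:
    """Expected tokens emitted per target forward pass.

    Idealized model: k draft tokens, each accepted independently with
    probability alpha, plus one token from the target model itself.
    This is the geometric sum 1 + alpha + ... + alpha**k.
    """
    if alpha >= 1.0:
        return float(k + 1)
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


def estimated_speedup(
    alpha: float,
    k: int,
    draft_cost_ratio: float,
    overhead_ratio: float = 0.0,
) -> float:
    """Estimated speedup over plain decoding (one token per target pass).

    draft_cost_ratio: cost of one draft forward pass relative to one
        target forward pass (assumed constant).
    overhead_ratio: extra per-step cost (sampling, input preparation,
        scoring overhead) relative to one target forward pass.
    """
    step_cost = 1.0 + k * draft_cost_ratio + overhead_ratio
    return expected_tokens_per_step(alpha, k) / step_cost
```

Under this model, `alpha=0.8` with `k=3` gives about 2.95 expected tokens per target pass, so an end-to-end speedup near 1.5x would point to substantial per-step overhead rather than a low acceptance rate.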
Why not implement Medusa / tree-attention?
We should implement this! The work here will lay the foundation for future improvements in speculative decoding. For example, Eagle uses the Medusa approach (fine-tuned heads plus tree attention) and even claims to beat Medusa. But for Eagle to work well in vLLM we need to optimize the sampler as listed above.
The north star should be: configurable tree size (top-k .. top-1), which uses multi-query attention for scoring (no batch expansion). This issue is about optimizing vLLM in the top-1 speculation case to get 50% speedup with draft models.
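To make the "no batch expansion" point concrete, the toy sketch below contrasts the two ways of scoring `k` top-1 draft tokens. It assumes batch expansion means one single-query sequence per scored position, and multi-query scoring means one sequence with `k + 1` query positions. The helper names are hypothetical and the token IDs are plain Python lists, not vLLM data structures.

```python
from typing import List, Tuple


def batch_expansion_inputs(prefix: List[int], draft: List[int]) -> List[List[int]]:
    """Expand one sequence into k + 1 sequences, one per position to score.

    Each expanded sequence is scored with a single query token (its last
    token), so the batch grows by a factor of k + 1.
    """
    return [prefix + draft[:i] for i in range(len(draft) + 1)]


def multi_query_input(prefix: List[int], draft: List[int]) -> Tuple[List[int], List[int]]:
    """Keep one sequence and mark k + 1 query positions instead.

    The query positions are the last prefix token and every draft token.
    Assumes a non-empty prefix.
    """
    tokens = prefix + draft
    query_positions = list(range(len(prefix) - 1, len(tokens)))
    return tokens, query_positions


expanded = batch_expansion_inputs([11, 12], [21, 22, 23])
tokens, query_positions = multi_query_input([11, 12], [21, 22, 23])
# Both approaches score the same number of positions (k + 1), but batch
# expansion does so with k + 1 sequences instead of one.
assert len(expanded) == len(query_positions)
```

In the tree case (top-k rather than top-1), the single sequence would additionally need a tree attention mask so each query attends only to its own ancestors; that part is not shown here.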