
[RFC]: Automate Speculative Decoding #4565

LiuXiaoxuanPKU opened this issue May 2, 2024 · 16 comments

@LiuXiaoxuanPKU
Collaborator

LiuXiaoxuanPKU commented May 2, 2024

Motivation.

Speculative Decoding is a crucial feature for reducing latency, currently supported by vLLM (credit to @cadedaniel !). However, when deploying Speculative Decoding in real online LLM serving systems that use continuous batching, improvements are not always observed. Paradoxically, under conditions of high request rates or low speculation accuracy, latency may actually increase.

We propose to address these issues by intelligently determining the optimal speculation length for each request, ranging from zero (no speculation) to multiple tokens. This determination is based on the concept of goodput, which reflects the current observed load across the entire system, thus allowing for the most effective speculative execution.

The method is designed for versatility, compatible with various speculative decoding styles, from traditional, model-based approaches to model-free methods such as prompt lookup and tree-style decoding. This innovation builds on recent research by the vLLM team. We plan to release the detailed paper shortly.

Proposed Change.

Milestone 1: Implement a mechanism to disable speculative decoding (proposed length = verified length = 0), allowing users to manually decide when to cease speculative decoding. Based on prior empirical studies, we can initiate this process by monitoring the running_queue size. Speculative decoding will be suspended for incoming requests once the running_queue exceeds a predefined threshold. Cody will assist with this implementation, thanks @comaniac!
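
As an illustrative sketch of this milestone (the threshold constant and function name are placeholders, not existing vLLM APIs):

```python
# Milestone 1 sketch: suspend speculation when the running queue is too long.
# DISABLE_SPEC_THRESHOLD and proposal_len_for_new_request are illustrative
# placeholders, not part of the vLLM code base.

DISABLE_SPEC_THRESHOLD = 32  # hypothetical cutoff on running_queue size

def proposal_len_for_new_request(num_running: int, default_proposal_len: int) -> int:
    """Return 0 (speculation off) under heavy load, otherwise the configured length."""
    if num_running >= DISABLE_SPEC_THRESHOLD:
        return 0  # proposed length = verified length = 0
    return default_proposal_len

# Example: with 40 running requests, speculation is turned off.
assert proposal_len_for_new_request(40, 5) == 0
assert proposal_len_for_new_request(8, 5) == 5
```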

Milestone 2: Dynamically determine the proposed length for speculative decoding. We will utilize runtime information, such as batch size, in conjunction with profiled parameters like token acceptance rate and the comparative costs of running the draft versus the target model. This approach allows us to adjust the proposed length in real-time, optimizing performance based on current system conditions.
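
As an illustrative sketch of the idea (the constant per-token draft cost, the batch-size-independent target cost, and the closed-form expected-token formula below are simplifying assumptions, not the exact cost model we will use):

```python
# Milestone 2 sketch: pick the proposed length k that maximizes goodput,
# i.e. expected accepted tokens per unit time, given an acceptance rate and
# profiled draft/target step costs.

def expected_generated_tokens(alpha: float, k: int) -> float:
    """Expected tokens produced per step when proposing k draft tokens with
    per-token acceptance rate alpha (includes the bonus token from the target)."""
    if alpha >= 1.0:
        return k + 1.0
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)

def choose_proposal_len(alpha: float, t_draft: float, t_target: float,
                        max_k: int = 8) -> int:
    """Pick k in [0, max_k] with the highest estimated goodput."""
    def goodput(k: int) -> float:
        step_time = k * t_draft + t_target  # k draft passes + one verification pass
        return expected_generated_tokens(alpha, k) / step_time
    return max(range(max_k + 1), key=goodput)

# Example: a high acceptance rate and a cheap draft model favor longer proposals,
# while a low acceptance rate pushes k toward 0 or 1.
print(choose_proposal_len(alpha=0.8, t_draft=1.0, t_target=10.0))
print(choose_proposal_len(alpha=0.2, t_draft=1.0, t_target=10.0))
```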

Milestone 3: Eliminate reliance on pre-profiled parameters and gather necessary information directly from runtime. We will collect data such as the token acceptance rate and the execution times for both the draft and target models from previous steps. This data will then be integrated into the goodput calculation, allowing for a more dynamic and responsive system configuration.
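
A sketch of the runtime bookkeeping this milestone needs (the class name, priors, and update hook are placeholders for illustration):

```python
# Milestone 3 sketch: maintain exponential moving averages of the acceptance
# rate and the draft/target step times, so no offline profiling is required.

class SpecDecodeStats:
    def __init__(self, decay: float = 0.9):
        self.decay = decay              # EMA smoothing factor
        self.acceptance_rate = 0.7      # prior before any observations
        self.t_draft_per_token = 1e-3   # seconds per proposed draft token (prior)
        self.t_target_per_step = 1e-2   # seconds per target verification step (prior)

    def _ema(self, old: float, new: float) -> float:
        return self.decay * old + (1.0 - self.decay) * new

    def update(self, num_proposed: int, num_accepted: int,
               draft_time_s: float, target_time_s: float) -> None:
        """Fold one decoding step's observations into the running estimates."""
        if num_proposed > 0:
            self.acceptance_rate = self._ema(self.acceptance_rate,
                                             num_accepted / num_proposed)
            self.t_draft_per_token = self._ema(self.t_draft_per_token,
                                               draft_time_s / num_proposed)
        self.t_target_per_step = self._ema(self.t_target_per_step, target_time_s)
```

These online estimates would then replace the offline-profiled acceptance rate and draft/target costs in the goodput calculation sketched under Milestone 2.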

Feedback Period.

No response

CC List.

No response

Any Other Things.

  1. We will implement modifications after the scheduler allocates the slots, which may result in some memory inefficiency. For instance, if num_lookahead_slots is set to 5 but the proposed length is only 3, then 2 slots would go unused.
  2. Currently, we support proposing lengths at the batch level, meaning all requests within the same batch share the same proposed length. In the future, we could consider supporting finer-grained proposed lengths as needed.
@cade

cade commented May 2, 2024

As much as I would love to take credit for bringing Speculative Decoding to vLLM, I'm relatively certain the praise belongs to @cadedaniel. 😁

@KexinFeng

KexinFeng commented May 6, 2024

It's indeed a good idea to make the speculative system smarter, so that it can automatically adjust to the serving load and serving data. Along the same direction, there is one more thing that is not mentioned but is worth doing: dynamic candidate tree topology. This is a generalization of the dynamic speculation length mentioned above, and it will become possible once tree-based speculative decoding is enabled in vllm. We are actively exploring this direction.

Another good thing is that it is orthogonal to the roadmap above and thus compatible with it, as you mentioned. At the same time, this direction also falls under the title of this RFC, dynamic speculative decoding, so I mention it here to bring the community's attention to it, and I hope this implementation can become a contribution to vllm as a next step.

More specifically, in 1D sequential spec-decoding, the spec_length can be set dynamically according to the predicted acceptance rate. In tree-style spec-decoding, which is a generalization of the 1D case, the tree topology, including the tree size, can be set dynamically according to an acceptance rate vector, and a further speedup can then be expected.
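
As a rough sketch of what I mean (the depth-wise topology encoding, the independence assumption, and the cost proxy below are purely illustrative, not an existing implementation):

```python
# Score candidate tree topologies with an acceptance-rate vector.
# A topology is a list of branching factors per depth, e.g. [3, 2]; alpha[i]
# is the estimated acceptance probability of the rank-(i+1) draft candidate.

def level_accept_prob(num_candidates: int, alpha: list[float]) -> float:
    """Probability that at least one of the top-`num_candidates` drafts at a
    tree level is accepted, assuming independent acceptance events."""
    p_reject_all = 1.0
    for rank in range(num_candidates):
        p_reject_all *= 1.0 - alpha[rank]
    return 1.0 - p_reject_all

def expected_accepted_tokens(branching: list[int], alpha: list[float]) -> float:
    """Expected accepted tokens for a tree keeping branching[d] candidates at
    depth d; a level only counts if every shallower level was accepted."""
    total, p_reach = 0.0, 1.0
    for width in branching:
        p_reach *= level_accept_prob(width, alpha)
        total += p_reach
    return total

def choose_topology(candidates: list[list[int]], alpha: list[float],
                    cost_per_node: float, cost_verify: float) -> list[int]:
    """Pick the topology with the best expected-tokens-per-cost ratio."""
    def score(branching: list[int]) -> float:
        num_nodes = sum(branching)  # crude proxy for draft + verification work
        return expected_accepted_tokens(branching, alpha) / (
            num_nodes * cost_per_node + cost_verify)
    return max(candidates, key=score)

# Example: with a sharply decaying acceptance-rate vector, a deeper, narrower
# tree can beat a wide shallow one.
alpha = [0.6, 0.2, 0.1, 0.05]
print(choose_topology([[1, 1, 1], [3, 2], [4], [2, 2, 2]], alpha, 1.0, 10.0))
```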

@YuCheng-Qi

@LiuXiaoxuanPKU
Thanks a lot for the super helpful info! I am very interested in the Dynamic Speculative Decoding mentioned above, and also found that the existing vllm framework cannot load two LLM models at the same time (one of which is used as a draft model and the other as a target model). I now have 3 questions for you:

  1. Can the vllm framework now support loading two LLM models at the same time?
  2. Will Dynamic Speculative Decoding-related functions be developed on the vllm framework?
  3. If the above two functions are supported by the vllm community, when will they be implemented?

@LiuXiaoxuanPKU
Collaborator Author

It's indeed a good idea to make the speculative system smarter, so that it can automatically adjust to the serving load and serving data. Along the same direction, there is one more thing that is not mentioned but is worth doing: dynamic candidate tree topology. This is a generalization of the dynamic speculation length mentioned above, and it will become possible once tree-based speculative decoding is enabled in vllm. We are actively exploring this direction.

Another good thing is that it is orthogonal to the roadmap above and thus compatible with it, as you mentioned. At the same time, this direction also falls under the title of this RFC, dynamic speculative decoding, so I mention it here to bring the community's attention to it, and I hope this implementation can become a contribution to vllm as a next step.

More specifically, in 1D sequential spec-decoding, the spec_length can be set dynamically according to the predicted acceptance rate. In tree-style spec-decoding, which is a generalization of the 1D case, the tree topology, including the tree size, can be set dynamically according to an acceptance rate vector, and a further speedup can then be expected.

Yes! In the research, we also explore the idea of dynamically adjusting top-k for tree-style speculation. Our preliminary results are promising, but they are based on simulation. Once we have tree-style speculative decoding in vllm, we can add that as well.

@LiuXiaoxuanPKU
Collaborator Author

LiuXiaoxuanPKU commented May 9, 2024

@LiuXiaoxuanPKU Thanks a lot for the super helpful info! I am very interested in the Dynamic Speculative Decoding mentioned above, and also found that the existing vllm framework cannot load two LLM models at the same time (one of which is used as a draft model and the other as a target model). I now have 3 questions for you:

  1. Can the vllm framework now support loading two LLM models at the same time?
  2. Will Dynamic Speculative Decoding-related functions be developed on the vllm framework?
  3. If the above two functions are supported by the vllm community, when will they be implemented?

Thanks for the interest!

  1. If the two models are used for speculative decoding, yes, vllm already supports that. Take a look at this worker, which contains a draft worker and a target worker. The draft worker is responsible for loading and executing the draft model, while the target worker is used for the target model (see the usage sketch after this list).
  2. Yes, it will be integrated into vllm.
  3. Currently, we are in the process of optimizing speculative decoding performance, because dynamically adjusting it will not be interesting if the native speculative decoding performance is not good. Once we think the native speculative decoding performance is reasonable, we will add our method on top of it quickly. I am not sure how long this step will take; @cade might have more context here.
  4. For the timeline, since our method is very lightweight, Milestone 2 (pre-collecting some system numbers and supporting a limited set of models such as llama-7b and llama-70b) can be done within one week. Milestone 3 is to fully automate the speculation, which will take longer, around 1-2 months.
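
For reference, here is a minimal offline example of loading a draft model and a target model together. The argument names follow the speculative decoding docs around this time and may differ in other vLLM versions.

```python
from vllm import LLM, SamplingParams

# Load a target model plus a draft model for speculative decoding.
llm = LLM(
    model="facebook/opt-6.7b",              # target model
    speculative_model="facebook/opt-125m",  # draft model
    num_speculative_tokens=5,               # static proposed length
    use_v2_block_manager=True,
)

outputs = llm.generate(
    ["The future of AI is"],
    SamplingParams(temperature=0.0, max_tokens=64),
)
print(outputs[0].outputs[0].text)
```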

@YuCheng-Qi

@LiuXiaoxuanPKU Thanks for your response, and best wishes to you as well!

LiuXiaoxuanPKU changed the title from [RFC]: Dynamic Speculative Decoding to [RFC]: Automate Speculative Decoding on May 11, 2024
@KexinFeng

KexinFeng commented May 11, 2024

@LiuXiaoxuanPKU It's great to know that vllm is looking into tree-style 2D speculation. I'm actually developing an implementation of this tree-style 2D speculation that works on any tree topology. Similar to what you mentioned, my estimates also suggest promising results, so we can expect a further boost in this direction. When I finish the implementation, I would like to create a PR and integrate it into vllm's speculation.

Some updates: I just noticed that #4669 now has Medusa/Eagle/Hydra implementations. Tree-style speculation will be a good match for them.

@keyboardAnt

keyboardAnt commented Jun 22, 2024

We recently showed that even a relatively simple speculation lookahead controller can speed up decoding.

Paradoxically, under conditions of high request rates or low speculation accuracy, latency may actually increase.

Yes, speculative decoding leads to slowdowns if the accuracy is too low. Our proposed alternative (the DSI algorithm) is always faster than speculative decoding and never slower than traditional autoregression (nonspeculative). We proved it mathematically and provided supporting experiments.

I'm open to collaborations to push it forward.

@wooyeonlee0
Contributor

@LiuXiaoxuanPKU Thank you (and your collaborators) for the great work!
I checked the SmartSpec paper on arXiv and the slides from the recent meetup, and it looks great :)
I'm looking forward to seeing it in the vllm repo.
Please let me know if there's anything I can help with. 👍

@jon-chuang
Contributor

@LiuXiaoxuanPKU what is the status of this issue?

@brotherchen

Amazing work! I would like to know whether lookahead decoding is on vllm's roadmap, because it can speed up inference without the need for an additional draft model or any additional training.

@cadedaniel
Collaborator

@brotherchen More info here on lookahead decoding in vLLM; it is currently not being worked on as far as I know.

@smart-lty


@LiuXiaoxuanPKU Very interesting work! Our recent work on speculative decoding shows that executing the draft model and the target model in parallel can achieve an adaptive draft length, which can significantly improve speculative decoding performance. I would like to know: can the draft worker and the target worker execute in parallel within vllm?

@LiuXiaoxuanPKU
Collaborator Author


Currently no. The draft model and target model are executed sequentially. I imagine asynchronous execution would be a big change to vllm's current architecture, but any discussion and contribution is welcome!

@TechxGenus
Contributor

This HF blog looks great; can it be easily integrated into vllm?
https://huggingface.co/blog/dynamic_speculation_lookahead

@gopalsarda
Contributor

@LiuXiaoxuanPKU Amazing work! I have 2 questions regarding the implementation of Milestone 2 (dynamically determining the proposed length for speculative decoding):

  1. Are you planning to provide an abstraction so that individual spec decoding algorithms (EAGLE/Medusa) can implement their own policy for dynamically determining the proposed length or a stopping criterion? This abstraction would help implement features like what is proposed in EAGLE-2 for dynamically adjusting the shape of the draft tree based on the context. This would be helpful even for a top-1 proposer.
  2. I was curious how determining the proposed length using runtime information like token acceptance rate would work for cases where the same endpoint has to serve requests for different tasks (with different acceptance rates). Wouldn't the acceptance rate from requests of one task adversely affect the proposal length for requests of another task?
