Speculative Decoding #2607
Conversation
Hi @ymwangg, thanks a lot for this PR, I learned a few things from it. While trying to test it out, I did a pip install -e . on a fresh container forked from nvidia/cuda:12.1.0-devel-ubuntu22.04 with a few other things, but I hit a weird error: when I tried to import it, I couldn't see any objects or classes from vllm:
import vllm
print(dir(vllm)) # doesn't have the usual LLM, ...
Any idea if I need to build this differently? Curious to know in what environment you got it to work. |
@wongjingping this PR doesn't introduce any other dependencies, so it should work as long as vllm works. Are you able to build and run vanilla vllm? |
Sure, I'll run some tests on it. Their performance should be very similar. This PR's MQA kernel makes a few simplifications based on the assumption that the total query length won't exceed 16 and that the KV caches already contain all tokens. I did find flash-attention's flash_attn_varlen_func to be a little faster than the Triton kernel, but it doesn't make the end-to-end run faster due to the extra overhead of preparing the inputs (flattening the KV cache, cu_seqlens_k) and the lack of CUDA graph support. |
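To make the overhead being described more concrete, here is a rough, hypothetical sketch of what a flash-attention-based verification call involves; this is not code from this PR, the helper and tensor layout are assumptions, and the point is that the cu_seqlens bookkeeping and the gather that flattens the paged KV cache happen outside the kernel:

```python
import torch
import torch.nn.functional as F
from flash_attn import flash_attn_varlen_func  # flash-attention 2

def verify_with_flash_attn(q, flat_k, flat_v, query_lens, kv_lens):
    # q:        (total_q, num_heads, head_dim) -- stacked draft tokens (<= 16 per sequence)
    # flat_k/v: (total_k, num_heads, head_dim) -- KV cache already gathered ("flattened")
    #           out of the paged cache; that gather is part of the extra overhead noted above
    cu_seqlens_q = F.pad(torch.cumsum(query_lens, 0, dtype=torch.int32), (1, 0))
    cu_seqlens_k = F.pad(torch.cumsum(kv_lens, 0, dtype=torch.int32), (1, 0))
    return flash_attn_varlen_func(
        q, flat_k, flat_v,
        cu_seqlens_q=cu_seqlens_q, cu_seqlens_k=cu_seqlens_k,
        max_seqlen_q=int(query_lens.max()), max_seqlen_k=int(kv_lens.max()),
        causal=True,  # each draft token attends to the prefix plus earlier draft tokens
    )
```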
@ymwangg thanks for the tip - I got it to work after building it directly within my updated Dockerfile. It works great! The only thing I noticed is that the memory consumption is slightly higher, and one might need to reduce the gpu_memory_utilization setting. |
@wongjingping glad to hear it now works. Yes, I also observed that gpu_memory_utilization needs to be <= 0.8 for non-A100 GPUs like the A10G (23GB HBM). |
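For reference, a minimal illustration of the setting being discussed; the model name below is just a placeholder, and any draft-model arguments specific to this PR are omitted since their exact names aren't shown in this thread:

```python
from vllm import LLM

# gpu_memory_utilization defaults to 0.9; dropping it to 0.8 leaves more headroom
# on ~24GB cards such as the A10G or the RTX 3090/4090 mentioned elsewhere in this thread.
llm = LLM(model="facebook/opt-125m", gpu_memory_utilization=0.8)
```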
Hi, I ran your code, but it reports an import vllm._C error. Can you provide the file? |
You probably didn't install it correctly. The simplest way to install vllm is to use Docker and run the following inside the container:
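The exact commands from this comment were not preserved; the following is a hedged reconstruction that assumes the nvidia/cuda:12.1.0-devel-ubuntu22.04 image mentioned earlier and the specdec_v0.1.2 branch linked later in the thread:

```bash
docker run --gpus all -it --rm nvidia/cuda:12.1.0-devel-ubuntu22.04 bash
# inside the container:
apt-get update && apt-get install -y git python3 python3-pip
git clone -b specdec_v0.1.2 https://github.com/ymwangg/vllm.git
cd vllm
pip install -e .
```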
|
Thanks, it works! |
Run / Benchmark / Error: (the commands, benchmark output, and error trace posted here were not preserved) |
@xunfeng1980 preemption is currently not supported. You can probably reduce the request rate. It works on my machine, though it's slower than without speculative decoding, which yields 3.75 requests/s. |
@xunfeng1980 Thanks for reporting this issue. It should be fixed now. Pre-emption by recompute was erroneously disabled.
RTX 4090, vllm with speculative decoding
RTX 4090, vllm without speculative decoding
Speculative decoding is slower. |
@mansur20478 Thanks for the good catch. Yes, it's supposed to use "can_append_multiple_slots". It slipped through during rebasing. |
A few questions regarding this PR:
|
Thanks for the question!
|
Thanks for replying. Could you please rebase this PR to |
Force-pushed with:
Fix greedy sampling in speculative decoding
Add back pre-emption by recompute support
Add logprobs support for speculative decoding
Fix prompt_logprobs and add stop_str support
Co-authored-by: Jie Wang <holawj@gmail.com>
Hi, I find that after combining the speculative decoding method with vLLM, the engine is initialized with both the target model and the draft model, and then the default generate method of the LLMEngine always runs in speculative mode. Would it be possible to decide whether to use speculative mode via a parameter, for the case of concurrent calls to the LLMEngine's generate method? |
I have a question: how does speculative decoding work in the case of a larger batch size? For example, suppose my batch size is 4 and the speculation length is 7. If the sequences are dependent on each other, then after a speculative decoding step: |
Each sequence accepts/rejects tokens independently; the code is here: https://github.com/ymwangg/vllm/blob/specdec_v0.1.2/vllm/model_executor/layers/sampler.py#L727-L740. |
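To make the independence concrete, below is a simplified sketch of per-sequence accept/reject logic; it is not the PR's sampler code (see the link above for that) and it omits resampling from the adjusted distribution after a rejection:

```python
import torch

def num_accepted_per_sequence(draft_probs, target_probs):
    # draft_probs / target_probs: (batch, spec_len) probabilities each model assigned
    # to the proposed draft tokens. Acceptance is evaluated row by row, so every
    # sequence in the batch stops at its own first rejection, independently.
    accept_prob = torch.clamp(target_probs / draft_probs, max=1.0)
    accepted = torch.rand_like(accept_prob) < accept_prob
    first_reject = (~accepted).int().argmax(dim=1)   # index of first rejected token per row
    all_accepted = accepted.all(dim=1)               # rows with no rejection at all
    spec_len = accepted.size(1)
    return torch.where(all_accepted, torch.full_like(first_reject, spec_len), first_reject)
```

With batch size 4 and speculation length 7, this can return e.g. [7, 2, 5, 0]: each sequence advances by its own accepted count, regardless of what the other sequences did.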
Hello! Thank you very much for your work! I was very interested in it, so I fetched #2607 locally for research, but I encountered a problem similar to #1391 when running pip install -e . Have you encountered similar problems? My CUDA version is 12.1, the torch version is 2.2.0, and the GPU is an RTX 3090. |
[Question on Increasing Single Decoding Time with Speculative Decoding as Batch Size Increases]

I am exploring the impact of speculative decoding on the efficiency of vLLM and have observed some intriguing behavior regarding decoding times. From my experiments, I noticed that when speculative decoding is not used, the single decoding time remains relatively stable across different batch sizes. However, when speculative decoding is used, there is a significant increase in single decoding time as the batch size increases.

For context, the speculative decoding setup I am using involves a draft model with a speculation length of 4. I have attached a table below that illustrates these observations. I understand that the overall computational load increases with speculative decoding due to the use of a draft model. However, I am curious about the specific reasons why the increase in single decoding time is notably pronounced at larger batch sizes. Could this be related to the overhead from running verification in parallel on the target model?

I used LLaMA-13B as the target model and LLaMA-68M as the draft model. Four A100 (80GB) GPUs are used with TP degree 4. Any insights or explanations would be greatly appreciated. Thank you for your time and assistance. |
@dutsc it looks like this issue is specific to Windows. Sorry, I don't have access to a Windows setup. Maybe other folks in the community can help you. |
Hi @Heelim-Hong, it's expected that the speedup from speculative decoding keeps decreasing as you increase the batch size. Yes, this is related to how verification works. At a low level, you can think about multiplying two matrices of shape |
Hi @ymwangg. Thank you very much for your response. I see that b represents the batch size and m represents the speculation length, but what do k and n represent? |
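The thread does not spell out the remaining shapes, so the following back-of-the-envelope sketch is only one reading: treating k and n as the input and output widths of one of the target model's linear layers is an assumption here, not the author's answer. Under that assumption the verification matmul has b*m rows instead of b, so per-step work grows with both batch size and speculation length:

```python
# Hedged illustration only; interpreting k and n as hidden sizes is an assumption.
def matmul_flops(rows, k, n):
    return 2 * rows * k * n  # multiply-adds for a (rows x k) @ (k x n) product

b, m = 32, 4          # batch size, speculation length
k = n = 5120          # LLaMA-13B hidden size, used here only for scale
regular = matmul_flops(b, k, n)       # ordinary decode: one new token per sequence
verify = matmul_flops(b * m, k, n)    # verification: m draft tokens per sequence at once
print(verify / regular)               # ~= m; at small b the GPU is underutilized and the
                                      # extra rows are nearly free, at large b they are not
```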
Introduction
Similar to #1797 and #2188, this PR implements speculative decoding. The goal of this PR is to facilitate research in this direction, such as developing new draft-token generation mechanisms, new sampling methods, and optimized CUDA kernels, while the vLLM community is settling on the infrastructure part.
Example Usage
You can find two example scripts: examples/api_client_spec_dec.py and examples/offline_inference_spec_dec.py. The fastest way to try out this feature is to run the following commands:
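The commands themselves are not preserved on this page; as a hedged placeholder, running the two example scripts named above would look roughly like this (any speculative-decoding flags added by this PR are omitted because their exact names aren't shown here):

```bash
# Offline batch inference with speculative decoding:
python examples/offline_inference_spec_dec.py

# Or, with an API server already running (launched with this PR's draft-model options),
# query it using the speculative-decoding client:
python examples/api_client_spec_dec.py
```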
Demo
Below is a demo of Llama-70b (using TinyLlama-1.1b as the draft model) running on 4 NVIDIA A100-80G GPUs.
Demo without speculative decoding:
Demo with speculative decoding:
Limitations
This feature is still experimental, so use it with caution. The following features (not an exhaustive list) are currently not supported:
Acknowledgement:
I'd like to thank @whbldhwj for sharing valuable data and code, especially the first MQA kernel, @YuchengT, @li2haipeng, @KexinFeng, @Vatshank for helpful discussions, @harishneit and @lanking520 for their leadership.