[Feature]: Multi-Proposers support for speculative decoding. #6300
Comments
These are great ideas! Contributions welcome :)
Thanks! Once we have verified the performance improvement in our internal version, I will submit a PR to begin integrating this feature into the open-source repository, and will keep you updated about our progress. Also, if there is any progress on the integration of SmartSpec, please let me know. cc @LiuXiaoxuanPKU Feel free to contact me anytime if there are any changes or additions you would like!
One other idea you should consider is using a multi-LoRA draft model.
Brilliant! The design philosophy of multi-proposers is similar to that of multiple-LoRA support. Also, the choice should not be set through […]. Although SpecDecodeWorker does not support LoRA at this stage, I will keep the combination of spec decode and LoRA in mind and advance it step by step :)
Sounds good. Btw, I don't think we should let users decide the spec method, as it gives them too much flexibility to impact other users -- it should be set by the service provider.
This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!
🚀 The feature, motivation and pitch
Speculative decoding has demonstrated significant potential for efficiently generating proposals and using idle computing power to speed up auto-regressive decoding, particularly under lightweight workloads. Thanks to the remarkable work by @cadedaniel, we have verified the latency benefits brought by speculative decoding on the latest version of vLLM.
We have made the following observations, which we believe point to ways of further enhancing the utility of speculative decoding:
Ngram Proposer: While the 'Ngram' proposer can offer a 2x to 3x performance improvement in Retrieval-Augmented Generation (RAG) scenarios, its performance diminishes when the RAG module retrieves no relevant data for a query.
Draft-Model-Based Proposers: In contrast, draft-model-based proposers have exhibited higher acceptance rates when the RAG module retrieves no relevant data or the task is more creative. However, this type of proposer is not yet fully optimized ([Speculative decoding] [Help wanted] [Performance] Optimize draft-model speculative decoding #4630, [Performance] [Speculative decoding] Speed up autoregressive proposal methods by making sampler CPU serialization optional #5561), so the current performance gains are limited. We sincerely thank the open-source community for its efforts and hope this work progresses smoothly.
Creative Tasks with High Temperature: We have noticed that both proposer methods perform worse than non-speculative decoding on creative tasks characterized by a high temperature or a large top_k. Perhaps speculative decoding should be disabled in such cases.
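For reference, a minimal sketch of how the two proposer types are configured today through engine-level arguments, assuming the flag names current around the time of this issue (speculative_model, num_speculative_tokens, ngram_prompt_lookup_max, use_v2_block_manager); the model names are only illustrative and the flags may have changed in newer releases.

```python
from vllm import LLM

# Draft-model-based proposer: a small draft model generates the speculative
# tokens. JackFram/llama-68m is used here purely as an illustrative draft.
llm_draft = LLM(
    model="meta-llama/Llama-2-13b-chat-hf",
    speculative_model="JackFram/llama-68m",
    num_speculative_tokens=5,
    use_v2_block_manager=True,
)

# Ngram (prompt-lookup) proposer: proposals are copied from matching spans in
# the prompt, which is why it shines in RAG-style workloads.
llm_ngram = LLM(
    model="meta-llama/Llama-2-13b-chat-hf",
    speculative_model="[ngram]",
    num_speculative_tokens=5,
    ngram_prompt_lookup_max=4,
    use_v2_block_manager=True,
)
# In practice you would create only one engine per process; both are shown
# here only to contrast the two configurations.
```

Note that the proposer is fixed when the engine is created, so all requests served by one engine share the same speculation strategy; that is precisely the limitation this proposal aims to relax.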
Apart from these observations, we are particularly interested in your latest work on speculative-length scheduling for different workload scenarios (#5886, Optimizing Speculative Decoding for Serving Large Language Models Using Goodput).
This led us to wonder whether vLLM could be enhanced to support multiple proposers and to provide the flexibility to schedule among them appropriately. Alternatively, enabling users to specify the proposer for individual requests via SamplingParams could also be a viable solution.
We believe this enhancement could unlock greater potential and adaptivity for vLLM's speculative decoding capabilities. We are working on an internal fork to verify whether we can achieve higher goodput; a rough sketch of the per-request scheduling idea is given below.
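To make the proposal more concrete, here is a purely hypothetical sketch of per-request proposer scheduling. None of these names (SpecDecodeHint, select_proposer, the proposer field) exist in vLLM; the thresholds are placeholders, and a real implementation would live inside the SpecDecodeWorker/scheduler rather than in user code.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical per-request hint; NOT part of vLLM's SamplingParams today.
@dataclass
class SpecDecodeHint:
    proposer: Optional[str] = None  # "ngram", "draft_model", or None = let the scheduler decide

def select_proposer(hint: SpecDecodeHint,
                    temperature: float,
                    rag_context_found: bool) -> Optional[str]:
    """Toy scheduling policy combining the observations above.

    Returns the proposer to use for this request, or None to skip
    speculative decoding entirely.
    """
    if hint.proposer is not None:
        # Explicit per-request override (or a service-provider policy hook).
        return hint.proposer
    if temperature > 0.9:
        # Creative, high-temperature request: acceptance rates drop, so
        # plain auto-regressive decoding may well be faster.
        return None
    # Prefer prompt-lookup when the RAG module found relevant context,
    # otherwise let the draft model propose.
    return "ngram" if rag_context_found else "draft_model"
```

Whether such a hint should be exposed to end users or restricted to the service provider is an open question, as discussed in the comments above.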
Thanks! Feel free to leave a message to let us know what you think.
Alternatives
No response
Additional context
No response