
[Feature]: Multi-Proposers support for speculative decoding. #6300

Open
ShangmingCai opened this issue Jul 10, 2024 · 6 comments

@ShangmingCai
Contributor

🚀 The feature, motivation and pitch

Speculative decoding has demonstrated significant potential in efficiently generating proposals and utilizing idle computing power to expedite the auto-regressive decoding process, particularly under lightweight workloads. Thanks to the remarkable work by @cadedaniel, we have verified the latency benefits brought by speculative decoding on the latest version of vllm.

We have observed the following points that we believe could further enhance the utility of speculative decoding:

  • Ngram Proposer: While the 'Ngram' proposer can offer a 2x to 3x performance improvement in Retrieval-Augmented Generation (RAG) scenarios, its performance diminishes when the RAG module retrieves no relevant data for a query.

  • Draft-Model-Based Proposers: In contrast, draft-model-based proposers have exhibited higher acceptance rates when the RAG module retrieves no relevant data or the task is more creative. However, this type of implementation is not yet fully optimized ([Speculative decoding] [Help wanted] [Performance] Optimize draft-model speculative decoding #4630, [Performance] [Speculative decoding] Speed up autoregressive proposal methods by making sampler CPU serialization optional #5561), so the current performance gains are limited. We sincerely thank the open-source community for its efforts and hope this progress continues smoothly. (A configuration sketch for both proposer types follows this list.)

  • Creative Tasks with High Temperature: We have noticed that both proposer methods perform worse than non-speculative decoding on creative tasks characterized by a high temperature or a large top_k. Perhaps speculative decoding should be disabled in such cases.
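For context, both proposer types are currently selected engine-wide at startup rather than per request. Below is a minimal sketch of the two configurations, with placeholder model names and the knobs as documented around the vLLM releases of this period (exact flags may differ across versions):

```python
from vllm import LLM, SamplingParams

# Draft-model-based proposer: a small draft model proposes k tokens per step,
# which the target model then verifies in parallel.
llm = LLM(
    model="meta-llama/Llama-2-13b-chat-hf",                  # placeholder target model
    speculative_model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",  # placeholder draft model
    num_speculative_tokens=5,
    use_v2_block_manager=True,
)

# Ngram proposer: set speculative_model="[ngram]" and add
#   ngram_prompt_lookup_max=4,
# so that proposals come from prompt lookup instead of a draft model; this is
# why it shines on RAG prompts that already contain the answer verbatim.

outputs = llm.generate(
    ["The retrieved passage says the capital of France is Paris. Question: what is the capital of France?"],
    SamplingParams(temperature=0.0, max_tokens=32),
)
print(outputs[0].outputs[0].text)
```

Because the proposer is fixed at engine construction, a deployment that serves both RAG-heavy and creative traffic has to pick one proposer for everything, which is exactly the limitation described above.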

Apart from these observations, we were particularly interested in your latest work on speculative length scheduling for different workload scenarios (#5886, Optimizing Speculative Decoding for Serving Large Language Models Using Goodput).

This led us to wonder if vllm could be enhanced to support multiple proposers and provide the flexibility to schedule them appropriately. Alternatively, enabling users to specify the proposer for different requests via SamplingParams could also be a viable solution.
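To illustrate the per-request alternative, here is a purely hypothetical sketch: SamplingParams has no `proposer` field today, and the `proposer` and `speculative_proposers` names below are assumptions used only to show the intended shape, not an existing vLLM API (the hypothetical parts are commented out so the snippet still runs against the current API).

```python
from vllm import LLM, SamplingParams

# HYPOTHETICAL sketch of per-request proposer selection. Neither a "proposer"
# field in SamplingParams nor a "speculative_proposers" engine argument exists
# in vLLM today; both are assumptions for illustration.
llm = LLM(
    model="meta-llama/Llama-2-13b-chat-hf",  # placeholder target model
    # speculative_proposers=["[ngram]", "TinyLlama/TinyLlama-1.1B-Chat-v1.0"],  # hypothetical
)

rag_params = SamplingParams(
    temperature=0.0,
    max_tokens=128,
    # proposer="[ngram]",  # hypothetical: prompt lookup for RAG-style requests
)
creative_params = SamplingParams(
    temperature=1.0,
    top_k=100,
    max_tokens=128,
    # proposer=None,       # hypothetical: skip speculation for high-temperature tasks
)

# vLLM already accepts one SamplingParams per prompt, so per-request proposer
# hints would ride along with the existing request metadata.
outputs = llm.generate(
    ["Summarize the retrieved passage: ...", "Write a short story about a lighthouse."],
    [rag_params, creative_params],
)
```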

We believe this enhancement could unlock greater potential and adaptability for vllm's speculative decoding capabilities. We are working on an internal fork to verify whether we can achieve higher goodput.

Thanks! Feel free to leave a message to let us know what you think.

Alternatives

No response

Additional context

No response

@cadedaniel
Collaborator

These are great ideas! Contributions welcome :)

@ShangmingCai
Contributor Author

> These are great ideas! Contributions welcome :)

Thanks! Once we have successfully verified the performance improvement in the internal version, I will submit a PR to begin integrating this feature into the open-source repository. I will keep you updated on our progress.

Also, if there is any progress in the integration of SmartSpec, please let me know. cc @LiuXiaoxuanPKU

Feel free to contact me anytime if there are any changes or additions you would like!

@cadedaniel
Collaborator

One other idea you should consider is using a multi-LoRA draft model.

@ShangmingCai
Contributor Author

> One other idea you should consider is using a multi-LoRA draft model.

Brilliant! The design philosophy of multi-proposers is similar to that of multi-LoRA support. Also, the proposer choice should not be set through sampling_params; it should instead be left to the service provider for autonomous scheduling in the generate() function, just like LoRA (see the sketch after this comment).

Although SpecDecodeWorker does not support LoRA at this stage, I will keep the combination of spec decode and LoRA in mind and advance it step by step :)
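To make the analogy concrete, here is a minimal sketch: the lora_request usage below follows the existing vLLM pattern (model name and adapter path are placeholders), while the spec_decode_request argument and SpecDecodeRequest type are purely hypothetical and do not exist in vLLM.

```python
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

llm = LLM(model="meta-llama/Llama-2-7b-hf", enable_lora=True)  # placeholder base model

# Existing pattern: the serving layer, not the end user, attaches a LoRA
# adapter to each request at generate() time.
outputs = llm.generate(
    ["Translate to SQL: list all users created this week"],
    SamplingParams(temperature=0.0, max_tokens=64),
    lora_request=LoRARequest("sql_adapter", 1, "/path/to/sql_lora"),  # placeholder adapter path
)

# HYPOTHETICAL analog for multi-proposers (no such argument exists in vLLM):
# outputs = llm.generate(
#     prompts,
#     sampling_params,
#     spec_decode_request=SpecDecodeRequest(proposer="[ngram]"),  # hypothetical type
# )
```

Keeping the choice at this layer mirrors the LoRA design: the serving layer decides which proposer (if any) each request gets, so one user's settings cannot degrade throughput for others.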

@cadedaniel
Collaborator

Sounds good. Btw I don't think we should let users decide the spec method as it gives too much flexibility to impact other users -- should be set by service provider

github-actions (bot)

This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!

@github-actions github-actions bot added the stale label Oct 25, 2024