
[Feature]: Multi-Proposers support for speculative decoding. #6300

Open
ShangmingCai opened this issue Jul 10, 2024 · 6 comments

@ShangmingCai
Contributor

🚀 The feature, motivation and pitch

Speculative decoding has demonstrated significant potential in efficiently generating proposals and utilizing idle computing power to expedite the auto-regressive decoding process, particularly under lightweight workloads. Thanks to the remarkable work by @cadedaniel, we have verified the latency benefits brought by speculative decoding on the latest version of vllm.

We have observed the following points that we believe could further enhance the utility of speculative decoding:

  • Ngram Proposer: While the 'Ngram' proposer can offer a 2x to 3x performance improvement in Retrieval-Augmented Generation (RAG) scenarios, its performance diminishes when the RAG module retrieves no relevant data for a query.

  • Draft-Model-Based Proposers: In contrast, draft-model-based proposers have exhibited higher acceptance rates when the RAG module retrieves no relevant data or the task is more creative. However, this type of implementation is not yet fully optimized ([Speculative decoding] [Help wanted] [Performance] Optimize draft-model speculative decoding #4630, [Performance] [Speculative decoding] Speed up autoregressive proposal methods by making sampler CPU serialization optional #5561), so the current performance gains are limited. We sincerely thank the open-source community for its efforts and hope this progress continues smoothly. (A configuration sketch for both proposer types follows this list.)

  • Creative Tasks with High Temperature: We have noticed that both proposer methods perform worse than non-speculative decoding on creative tasks characterized by a high temperature or a large top_k. Perhaps speculative decoding should be disabled in such cases.
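For context, both proposer types are currently selected engine-wide at startup rather than per request. Below is a minimal sketch of the two configurations, with placeholder model names and the knobs as documented around the vLLM releases of this period (exact flags may differ across versions):

```python
from vllm import LLM, SamplingParams

# Draft-model-based proposer: a small draft model proposes k tokens per step,
# which the target model then verifies in parallel.
llm = LLM(
    model="meta-llama/Llama-2-13b-chat-hf",                  # placeholder target model
    speculative_model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",  # placeholder draft model
    num_speculative_tokens=5,
    use_v2_block_manager=True,
)

# Ngram proposer: set speculative_model="[ngram]" and add
#   ngram_prompt_lookup_max=4,
# so that proposals come from prompt lookup instead of a draft model; this is
# why it shines on RAG prompts that already contain the answer verbatim.

outputs = llm.generate(
    ["The retrieved passage says the capital of France is Paris. Question: what is the capital of France?"],
    SamplingParams(temperature=0.0, max_tokens=32),
)
print(outputs[0].outputs[0].text)
```

Because the proposer is fixed at engine construction, a deployment that serves both RAG-heavy and creative traffic has to pick one proposer for everything, which is exactly the limitation described above.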

Apart from these observations, we were particularly interested in your latest work on speculative length scheduling for different workload scenarios (#5886, Optimizing Speculative Decoding for Serving Large Language Models Using Goodput).

This led us to wonder if vllm could be enhanced to support multiple proposers and provide the flexibility to schedule them appropriately. Alternatively, enabling users to specify the proposer for different requests via SamplingParams could also be a viable solution.
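To illustrate the per-request alternative, here is a purely hypothetical sketch: SamplingParams has no `proposer` field today, and the `proposer` and `speculative_proposers` names below are assumptions used only to show the intended shape, not an existing vLLM API (the hypothetical parts are commented out so the snippet still runs against the current API).

```python
from vllm import LLM, SamplingParams

# HYPOTHETICAL sketch of per-request proposer selection. Neither a "proposer"
# field in SamplingParams nor a "speculative_proposers" engine argument exists
# in vLLM today; both are assumptions for illustration.
llm = LLM(
    model="meta-llama/Llama-2-13b-chat-hf",  # placeholder target model
    # speculative_proposers=["[ngram]", "TinyLlama/TinyLlama-1.1B-Chat-v1.0"],  # hypothetical
)

rag_params = SamplingParams(
    temperature=0.0,
    max_tokens=128,
    # proposer="[ngram]",  # hypothetical: prompt lookup for RAG-style requests
)
creative_params = SamplingParams(
    temperature=1.0,
    top_k=100,
    max_tokens=128,
    # proposer=None,       # hypothetical: skip speculation for high-temperature tasks
)

# vLLM already accepts one SamplingParams per prompt, so per-request proposer
# hints would ride along with the existing request metadata.
outputs = llm.generate(
    ["Summarize the retrieved passage: ...", "Write a short story about a lighthouse."],
    [rag_params, creative_params],
)
```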

We believe this enhancement could unlock greater potential and adaptability for vllm's speculative decoding capabilities. We are working on an internal fork to verify whether we can achieve higher goodput.

Thanks! Feel free to leave a message to let us know what you think.

Alternatives

No response

Additional context

No response

@cadedaniel
Collaborator

These are great ideas! Contributions welcome :)

@ShangmingCai
Contributor Author

> These are great ideas! Contributions welcome :)

Thanks! Once we have successfully verified the performance improvement in the internal version, I will submit a PR to begin integrating this feature into the open-source repository. I will keep you updated on our progress.

Also, if there is any progress in the integration of SmartSpec, please let me know. cc @LiuXiaoxuanPKU

Feel free to contact me anytime if there are any changes or additions you would like!

@cadedaniel
Collaborator

One other idea you should consider is using a multi-LoRA draft model.

@ShangmingCai
Contributor Author

> One other idea you should consider is using a multi-LoRA draft model.

Brilliant! The design philosophy of multi-proposers is similar to that of multi-LoRA support. Also, the proposer choice should not be set through sampling_params; it should instead be left to the service provider for autonomous scheduling in the generate() function, just like LoRA (see the sketch after this comment).

Although SpecDecodeWorker does not support LoRA at this stage, I will keep the combination of spec decode and LoRA in mind and advance it step by step :)
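To make the analogy concrete, here is a minimal sketch: the lora_request usage below follows the existing vLLM pattern (model name and adapter path are placeholders), while the spec_decode_request argument and SpecDecodeRequest type are purely hypothetical and do not exist in vLLM.

```python
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

llm = LLM(model="meta-llama/Llama-2-7b-hf", enable_lora=True)  # placeholder base model

# Existing pattern: the serving layer, not the end user, attaches a LoRA
# adapter to each request at generate() time.
outputs = llm.generate(
    ["Translate to SQL: list all users created this week"],
    SamplingParams(temperature=0.0, max_tokens=64),
    lora_request=LoRARequest("sql_adapter", 1, "/path/to/sql_lora"),  # placeholder adapter path
)

# HYPOTHETICAL analog for multi-proposers (no such argument exists in vLLM):
# outputs = llm.generate(
#     prompts,
#     sampling_params,
#     spec_decode_request=SpecDecodeRequest(proposer="[ngram]"),  # hypothetical type
# )
```

Keeping the choice at this layer mirrors the LoRA design: the serving layer decides which proposer (if any) each request gets, so one user's settings cannot degrade throughput for others.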

@cadedaniel
Collaborator

Sounds good. Btw I don't think we should let users decide the spec method as it gives too much flexibility to impact other users -- should be set by service provider

github-actions (bot)

This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!

@github-actions github-actions bot added the stale label Oct 25, 2024