
[Spec Decode] Make speculative decoding compatible with pipeline parallelism #15173


Open
xyang16 wants to merge 6 commits into main from specdec

Conversation

@xyang16 (Contributor) commented Mar 20, 2025

This PR aims to make speculative decoding compatible with pipeline parallelism.

  • Introduce the speculative_draft_pipeline_parallel_size config, allowing the draft model to run with a pipeline-parallel size different from the target model's (for eagle, only draft tp=1 and pp=1 are supported).
  • Introduce the num_virtual_engine config. Currently the number of virtual engines equals pipeline_parallel_size, which causes an error when the draft pipeline_parallel_size differs from the target pipeline_parallel_size. The new num_virtual_engine config is therefore set to the target pipeline_parallel_size.
  • Run the draft model to get proposals only on pp stage 0, then broadcast the proposals to the other pp stages (see the sketch after this list).
  • The target model is processed by the existing WorkerBase: the first n layers run on the first pipeline stage, the intermediate result is sent to the next stage, which processes the next n layers, and so on.
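The proposal broadcast in the third bullet can be illustrated with a minimal sketch. This is not the PR's actual code: draft_worker.get_proposals and the tensor shape are illustrative assumptions, and in vLLM the pipeline-parallel process group would come from the engine's distributed state rather than being passed in directly.

import torch
import torch.distributed as dist

def propose_and_broadcast(draft_worker, batch_size: int, k: int,
                          pp_group: dist.ProcessGroup,
                          device: torch.device) -> torch.Tensor:
    """Run the draft model on pp stage 0 only, then broadcast its proposals."""
    if dist.get_rank(pp_group) == 0:
        # Stage 0 owns the draft model (draft tp=1, pp=1) and produces
        # k speculative tokens per sequence (hypothetical helper call).
        proposal_token_ids = draft_worker.get_proposals(batch_size, k)
    else:
        # The other pp stages allocate a buffer of the agreed-upon shape.
        proposal_token_ids = torch.empty((batch_size, k), dtype=torch.long,
                                         device=device)
    # broadcast() expects a global rank, so translate group-rank 0 first.
    src = dist.get_global_rank(pp_group, 0)
    dist.broadcast(proposal_token_ids, src=src, group=pp_group)
    return proposal_token_ids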

Sample commands:

  • eagle
vllm serve unsloth/llama-3-8b-Instruct \
    --tensor-parallel-size 2 \
    --pipeline-parallel-size 2 \
    --speculative_config '{"model": "yuhuili/EAGLE-LLaMA3-Instruct-8B", "num_speculative_tokens": 3, "draft_tensor_parallel_size": 1, "draft_pipeline_parallel_size": 1}' \
    --gpu-memory-utilization 0.85
  • mlp
vllm serve unsloth/llama-3-8b-Instruct \
    --tensor-parallel-size 2 \
    --pipeline-parallel-size 2 \
    --speculative_config '{"model": "ibm-ai-platform/llama3-8b-accelerator", "num_speculative_tokens": 3, "draft_tensor_parallel_size": 1, "draft_pipeline_parallel_size": 1}' \
    --gpu-memory-utilization 0.85
  • medusa
vllm serve JackFram/llama-68m \
    --tensor-parallel-size 2 \
    --pipeline-parallel-size 2 \
    --speculative_config '{"model": "abhigoyal/vllm-medusa-llama-68m-random", "num_speculative_tokens": 3, "draft_tensor_parallel_size": 1, "draft_pipeline_parallel_size": 1}'
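
For reference, here is a roughly equivalent offline-inference sketch using the Python LLM API for the eagle example, assuming the LLM constructor accepts speculative_config the same way the CLI flag does; the prompt and sampling settings are placeholders.

from vllm import LLM, SamplingParams

llm = LLM(
    model="unsloth/llama-3-8b-Instruct",
    tensor_parallel_size=2,
    pipeline_parallel_size=2,
    gpu_memory_utilization=0.85,
    speculative_config={
        "model": "yuhuili/EAGLE-LLaMA3-Instruct-8B",
        "num_speculative_tokens": 3,
        "draft_tensor_parallel_size": 1,
        "draft_pipeline_parallel_size": 1,  # added by this PR
    },
)

outputs = llm.generate(["San Francisco is"], SamplingParams(max_tokens=64))
print(outputs[0].outputs[0].text)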

Benchmark

We benchmarked the deepseek-ai/DeepSeek-R1 model with a private eagle draft model on a 2-node cluster. With pipeline parallelism enabled (tp=8, pp=2), we see a more than 30% improvement in tokens per second compared with the baseline (tp=16, pp=1).

  • Instances: p5e.48xlarge (x2)
  • Target model: deepseek-ai/DeepSeek-R1
  • Draft model: eagle
[Screenshot: benchmark results]

cc @LiuXiaoxuanPKU @comaniac Would appreciate your review. Thanks!


👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, they only run fastcheck CI, which runs a small, essential subset of CI tests to catch errors quickly. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@xyang16 xyang16 changed the title [WIP][Spec Decode] Making speculative decoding compatible with pipeline parallelism [WIP][Spec Decode] Make speculative decoding compatible with pipeline parallelism Mar 20, 2025
@xyang16 xyang16 force-pushed the specdec branch 8 times, most recently from 005e1a9 to ecd9f20 Compare March 20, 2025 21:26
@xyang16 xyang16 changed the title [WIP][Spec Decode] Make speculative decoding compatible with pipeline parallelism [Spec Decode] Make speculative decoding compatible with pipeline parallelism Mar 20, 2025
@DefTruth (Contributor) commented:

Any plan to merge it? I am looking for a similar feature for R1 with PP + TP + MTP.


mergify bot commented Mar 26, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @xyang16.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

…llelism

Signed-off-by: Xin Yang <xyangx@amazon.com>
@xyang16 xyang16 force-pushed the specdec branch 5 times, most recently from c43f156 to 74137f8 Compare April 1, 2025 07:25
@David-lm commented Apr 1, 2025

I trust this message finds you in good spirits. I am writing to commend the exceptional work that has been accomplished and to seek clarification on a couple of important matters.

  1. Are there any inherent limitations when configuring the tp_size and pp_size for both the target model and the draft model?
  2. Has the configuration with draft_pp=1 and target_pp>1 been successfully executed on NVIDIA hardware without any issues to date?

@xyang16 xyang16 force-pushed the specdec branch 7 times, most recently from 0076f69 to 6ca6537 Compare April 2, 2025 04:12
@xyang16 xyang16 force-pushed the specdec branch 2 times, most recently from 4e4c85c to 061ee3d Compare April 2, 2025 19:41
@xyang16 (Contributor, Author) commented Apr 2, 2025

I trust this message finds you in good spirits. I am writing to commend the exceptional work that has been accomplished and to seek clarification on a couple of important matters.

  1. Are there any inherent limitations when configuring the tp_size and pp_size for both the target model and the draft model?
  2. Has the configuration with draft_pp=1 and target_pp>1 been successfully executed on NVIDIA hardware without any issues to date?

  1. There are no limitations on the target model's tp_size and pp_size. For the draft model, we currently support only draft tp_size=1 and draft pp_size=1.
  2. We have done some benchmarks on p5e.48xlarge instances. See the Benchmark section in the description.
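
For context, a hedged sketch of how the draft-size restriction described in answer 1 could be enforced at config-validation time; the function name and error message are assumptions, not the PR's exact code.

def _verify_draft_parallel_sizes(draft_tp_size: int, draft_pp_size: int) -> None:
    # Only draft tp=1 and pp=1 are supported alongside pipeline parallelism.
    if draft_tp_size != 1 or draft_pp_size != 1:
        raise ValueError(
            "Speculative decoding with pipeline parallelism currently "
            f"requires draft tp=1 and pp=1, got tp={draft_tp_size}, "
            f"pp={draft_pp_size}.")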

…llelism

Signed-off-by: Xin Yang <xyangx@amazon.com>
xyang16 added 2 commits April 2, 2025 14:55
Signed-off-by: Xin Yang <xyangx@amazon.com>
Signed-off-by: Xin Yang <xyangx@amazon.com>
@simon-mo (Collaborator) commented Apr 3, 2025

Thank you for the PR. Given that we have turned on V1 by default, we would like any PR to work with V1 when it is merged. However, I do recognize that V1 doesn't support draft models yet. Leaving that to @WoosukKwon to decide and @LiuXiaoxuanPKU and @ruisearch42 to review.

@DeepTecher commented, quoting the exchange between @David-lm and @xyang16 above:

Will you support the case where draft tp>1? In some scenarios, the draft model will also be huge.

xyang16 added 2 commits April 9, 2025 09:08
Signed-off-by: Xin Yang <xyangx@amazon.com>
Signed-off-by: Xin Yang <xyangx@amazon.com>

mergify bot commented Apr 11, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @xyang16.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Apr 11, 2025
@aleksanderk-cerebras commented Apr 24, 2025

@xyang16, thanks for the PR! Can you please share the parameters you used to run DeepSeek on the two-node system? Also, the build of this PR fails on my end; after rebasing your branch the build succeeds, but vLLM fails at startup.
This is the error I got:

WARNING 04-24 10:37:17 [arg_utils.py:1737] Speculative Decoding is not supported by the V1 Engine. Falling back to V0. 
Traceback (most recent call last):
... 
 line 1219, in create_engine_config
    parallel_config = ParallelConfig(
                      ^^^^^^^^^^^^^^^
TypeError: ParallelConfig.__init__() got an unexpected keyword argument 'num_virtual_engine'
