
[Spec Decode] Make speculative decoding compatible with pipeline parallelism #15173


Open
xyang16 wants to merge 6 commits into main from specdec

Conversation

@xyang16 (Contributor) commented Mar 20, 2025

This PR aims to make speculative decoding compatible with pipeline parallelism.

  • Introduce the speculative_draft_pipeline_parallel_size config, allowing the draft model to run with a pipeline-parallel size different from the target model's (for eagle, only draft tp=1 and pp=1 are supported).
  • Introduce the num_virtual_engine config. Currently the number of virtual engines equals pipeline_parallel_size, which causes an error when the draft pipeline_parallel_size differs from the target pipeline_parallel_size. The new num_virtual_engine config is therefore set to the target pipeline_parallel_size.
  • Run the draft model to get proposals only on pp stage 0, then broadcast the proposals to the other pp stages (see the sketch after this list).
  • The target model is processed by the existing WorkerBase: the first n layers run on the first pipeline stage, the intermediate result is sent to the next stage, which processes the next n layers, and so on.
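The proposal broadcast in the third bullet can be illustrated with a minimal sketch. This is not the PR's actual code: draft_worker.get_proposals and the tensor shape are illustrative assumptions, and in vLLM the pipeline-parallel process group would come from the engine's distributed state rather than being passed in directly.

import torch
import torch.distributed as dist

def propose_and_broadcast(draft_worker, batch_size: int, k: int,
                          pp_group: dist.ProcessGroup,
                          device: torch.device) -> torch.Tensor:
    """Run the draft model on pp stage 0 only, then broadcast its proposals."""
    if dist.get_rank(pp_group) == 0:
        # Stage 0 owns the draft model (draft tp=1, pp=1) and produces
        # k speculative tokens per sequence (hypothetical helper call).
        proposal_token_ids = draft_worker.get_proposals(batch_size, k)
    else:
        # The other pp stages allocate a buffer of the agreed-upon shape.
        proposal_token_ids = torch.empty((batch_size, k), dtype=torch.long,
                                         device=device)
    # broadcast() expects a global rank, so translate group-rank 0 first.
    src = dist.get_global_rank(pp_group, 0)
    dist.broadcast(proposal_token_ids, src=src, group=pp_group)
    return proposal_token_ids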

Sample commands:

  • eagle
vllm serve unsloth/llama-3-8b-Instruct \
    --tensor-parallel-size 2 \
    --pipeline-parallel-size 2 \
    --speculative_config '{"model": "yuhuili/EAGLE-LLaMA3-Instruct-8B", "num_speculative_tokens": 3, "draft_tensor_parallel_size": 1, "draft_pipeline_parallel_size": 1}' \
    --gpu-memory-utilization 0.85
  • mlp
vllm serve unsloth/llama-3-8b-Instruct \
    --tensor-parallel-size 2 \
    --pipeline-parallel-size 2 \
    --speculative_config '{"model": "ibm-ai-platform/llama3-8b-accelerator", "num_speculative_tokens": 3, "draft_tensor_parallel_size": 1, "draft_pipeline_parallel_size": 1}' \
    --gpu-memory-utilization 0.85
  • medusa
vllm serve JackFram/llama-68m \
    --tensor-parallel-size 2 \
    --pipeline-parallel-size 2 \
    --speculative_config '{"model": "abhigoyal/vllm-medusa-llama-68m-random", "num_speculative_tokens": 3, "draft_tensor_parallel_size": 1, "draft_pipeline_parallel_size": 1}'
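
For reference, here is a roughly equivalent offline-inference sketch using the Python LLM API for the eagle example, assuming the LLM constructor accepts speculative_config the same way the CLI flag does; the prompt and sampling settings are placeholders.

from vllm import LLM, SamplingParams

llm = LLM(
    model="unsloth/llama-3-8b-Instruct",
    tensor_parallel_size=2,
    pipeline_parallel_size=2,
    gpu_memory_utilization=0.85,
    speculative_config={
        "model": "yuhuili/EAGLE-LLaMA3-Instruct-8B",
        "num_speculative_tokens": 3,
        "draft_tensor_parallel_size": 1,
        "draft_pipeline_parallel_size": 1,  # added by this PR
    },
)

outputs = llm.generate(["San Francisco is"], SamplingParams(max_tokens=64))
print(outputs[0].outputs[0].text)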

Benchmark

We benchmarked the deepseek-ai/DeepSeek-R1 model with a private eagle draft model on a 2-node cluster. With pipeline parallelism enabled (tp=8, pp=2), we see a more than 30% improvement in tokens per second compared with the baseline (tp=16, pp=1).

  • Instances: p5e.48xlarge (x2)
  • Target model: deepseek-ai/DeepSeek-R1
  • Draft model: eagle
[Screenshot: benchmark results]

cc @LiuXiaoxuanPKU @comaniac Would appreciate your review. Thanks!


👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, they only run fastcheck CI, which runs a small, essential subset of CI tests to catch errors quickly. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@xyang16 xyang16 changed the title [WIP][Spec Decode] Making speculative decoding compatible with pipeline parallelism [WIP][Spec Decode] Make speculative decoding compatible with pipeline parallelism Mar 20, 2025
@xyang16 xyang16 force-pushed the specdec branch 8 times, most recently from 005e1a9 to ecd9f20 Compare March 20, 2025 21:26
@xyang16 xyang16 changed the title [WIP][Spec Decode] Make speculative decoding compatible with pipeline parallelism [Spec Decode] Make speculative decoding compatible with pipeline parallelism Mar 20, 2025
@DefTruth (Contributor) commented:

Any plan to merge it? I am looking for a similar feature for R1 with PP + TP + MTP.


mergify bot commented Mar 26, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @xyang16.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

…llelism

Signed-off-by: Xin Yang <xyangx@amazon.com>
@xyang16 xyang16 force-pushed the specdec branch 5 times, most recently from c43f156 to 74137f8 Compare April 1, 2025 07:25
@David-lm commented Apr 1, 2025

I trust this message finds you in good spirits. I am writing to commend the exceptional work that has been accomplished and to seek clarification on a couple of important matters.

  1. Are there any inherent limitations when configuring the tp_size and pp_size for both the target model and the draft model?
  2. Has the configuration with draft_pp=1 and target_pp>1 been successfully executed on NVIDIA hardware without any issues to date?

@xyang16 xyang16 force-pushed the specdec branch 7 times, most recently from 0076f69 to 6ca6537 Compare April 2, 2025 04:12
@xyang16 xyang16 force-pushed the specdec branch 2 times, most recently from 4e4c85c to 061ee3d Compare April 2, 2025 19:41
@xyang16 (Contributor, Author) commented Apr 2, 2025

I trust this message finds you in good spirits. I am writing to commend the exceptional work that has been accomplished and to seek clarification on a couple of important matters.

  1. Are there any inherent limitations when configuring the tp_size and pp_size for both the target model and the draft model?
  2. Has the configuration with draft_pp=1 and target_pp>1 been successfully executed on NVIDIA hardware without any issues to date?

  1. There are no limitations on the target model's tp_size and pp_size. For the draft model, we currently support only draft tp_size=1 and draft pp_size=1.
  2. We have done some benchmarks on p5e.48xlarge instances. See the Benchmark section in the description.
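
For context, a hedged sketch of how the draft-size restriction described in answer 1 could be enforced at config-validation time; the function name and error message are assumptions, not the PR's exact code.

def _verify_draft_parallel_sizes(draft_tp_size: int, draft_pp_size: int) -> None:
    # Only draft tp=1 and pp=1 are supported alongside pipeline parallelism.
    if draft_tp_size != 1 or draft_pp_size != 1:
        raise ValueError(
            "Speculative decoding with pipeline parallelism currently "
            f"requires draft tp=1 and pp=1, got tp={draft_tp_size}, "
            f"pp={draft_pp_size}.")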

…llelism

Signed-off-by: Xin Yang <xyangx@amazon.com>
xyang16 added 2 commits April 2, 2025 14:55
Signed-off-by: Xin Yang <xyangx@amazon.com>
Signed-off-by: Xin Yang <xyangx@amazon.com>
@simon-mo (Collaborator) commented Apr 3, 2025

Thank you for the PR. Given that we have turned on V1 by default, we would like any PR to work with V1 when it is merged. However, I do recognize that V1 doesn't support draft models yet. Leaving that to @WoosukKwon to decide and @LiuXiaoxuanPKU and @ruisearch42 to review.

@DeepTecher commented, quoting the exchange between @David-lm and @xyang16 above:

Will you support the case where draft tp>1? In some scenarios, the draft model will also be huge.

xyang16 added 2 commits April 9, 2025 09:08
Signed-off-by: Xin Yang <xyangx@amazon.com>
Signed-off-by: Xin Yang <xyangx@amazon.com>

mergify bot commented Apr 11, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @xyang16.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Apr 11, 2025
@aleksanderk-cerebras commented Apr 24, 2025

@xyang16, thanks for the PR! Can you please share the parameters you used to run DeepSeek on the two-node system? Also, the build of this PR fails on my end; after rebasing your branch the build succeeds, but vLLM fails at startup.
This is the error I got:

WARNING 04-24 10:37:17 [arg_utils.py:1737] Speculative Decoding is not supported by the V1 Engine. Falling back to V0. 
Traceback (most recent call last):
... 
 line 1219, in create_engine_config
    parallel_config = ParallelConfig(
                      ^^^^^^^^^^^^^^^
TypeError: ParallelConfig.__init__() got an unexpected keyword argument 'num_virtual_engine'
