[V1][PP] Support PP for MultiprocExecutor #14219


Merged
WoosukKwon merged 2 commits into vllm-project:main from v1_pp on May 6, 2025

Conversation

bigPYJ1151 (Member)

By offloading MQ reading to an IO thread, MultiprocExecutor can receive multiple batches in flight and support V1 PP seamlessly.
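
A minimal, self-contained sketch of the pattern (illustrative only, not vLLM's actual code; all names here are stand-ins): a dedicated IO thread is the sole reader of the response message queue and completes one Future per request, so the caller can keep several batches in flight.

    import queue
    import threading
    from concurrent.futures import Future

    request_mq: "queue.Queue[tuple[int, int]]" = queue.Queue()   # (req_id, payload)
    response_mq: "queue.Queue[tuple[int, int]]" = queue.Queue()  # (req_id, result)
    pending: dict[int, Future] = {}

    def fake_worker() -> None:
        # Stand-in for a worker process that reads its MQ and replies.
        while True:
            req_id, payload = request_mq.get()
            response_mq.put((req_id, payload * 2))

    def io_thread() -> None:
        # Offloaded MQ reading: complete the matching future as replies arrive.
        while True:
            req_id, result = response_mq.get()
            pending.pop(req_id).set_result(result)

    def submit(req_id: int, payload: int) -> Future:
        fut: Future = Future()
        pending[req_id] = fut              # register before sending to avoid a race
        request_mq.put((req_id, payload))
        return fut                         # returned immediately, without blocking

    threading.Thread(target=fake_worker, daemon=True).start()
    threading.Thread(target=io_thread, daemon=True).start()

    futures = [submit(i, i) for i in range(3)]   # several batches in flight
    print([f.result() for f in futures])         # [0, 2, 4]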

github-actions bot commented Mar 4, 2025

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small and essential subset of tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@mergify mergify bot added the v1 label Mar 4, 2025
@comaniac comaniac self-assigned this Mar 4, 2025
@comaniac comaniac (Collaborator) left a comment

Thanks for adding this!
Please see the comments on aligning with the v1 PP interface and design.
Meanwhile, do you have any benchmark numbers?

@comaniac (Collaborator)

comaniac commented Mar 4, 2025

Also cc @ruisearch42

@@ -89,8 +89,8 @@ def detailed(
chunked_prefill=False),
],
# only ray is supported for V1
distributed_backends=["mp", "ray", "ray"],
vllm_major_versions=["0", "0", "1"],
distributed_backends=["mp", "mp", "ray", "ray"],
Collaborator

update comment above

    def collective_rpc(self,
                       method: Union[str, Callable],
                       timeout: Optional[float] = None,
                       args: tuple = (),
-                      kwargs: Optional[dict] = None) -> list[Any]:
+                      kwargs: Optional[dict] = None,
+                      non_block: bool = False) -> list[Any]:
Member

Since @comaniac and @ruisearch42 do not want to add non_block at the collective_rpc level, you can have a _run_workers function with the ability to be non-blocking, and then at the execute_model level of this executor, call _run_workers with non-blocking if PP is enabled.

Member Author

Good idea :)
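
For reference, a toy, self-contained illustration of that split (the class and names below are stand-ins for the discussion, not vLLM's actual API): collective_rpc stays blocking, _run_workers can hand back futures, and execute_model opts in when more than one batch can be in flight.

    from concurrent.futures import Future, ThreadPoolExecutor
    from typing import Any, Union

    class ToyExecutor:
        def __init__(self, pipeline_parallel_size: int) -> None:
            self.max_concurrent_batches = pipeline_parallel_size
            self._pool = ThreadPoolExecutor(max_workers=4)

        def _run_workers(self, method: str, non_block: bool = False) -> list[Any]:
            futures = [self._pool.submit(lambda: f"{method} done") for _ in range(2)]
            # Blocking mode resolves the futures before returning.
            return futures if non_block else [f.result() for f in futures]

        def collective_rpc(self, method: str) -> list[Any]:
            return self._run_workers(method)  # always blocking

        def execute_model(self, batch: Any) -> Union[str, "Future[str]"]:
            out = self._run_workers("execute_model",
                                    non_block=self.max_concurrent_batches > 1)
            return out[0]

    print(ToyExecutor(pipeline_parallel_size=1).execute_model(None))           # plain result
    print(ToyExecutor(pipeline_parallel_size=2).execute_model(None).result())  # future, resolved here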

Comment on lines 242 to 243
driver_worker_rank = (vllm_config.parallel_config.world_size -
                      vllm_config.parallel_config.tensor_parallel_size)
Member

why is it calculated this way?

Member Author

I thought the driver should be the first rank in the last TP group.

I have now changed the is_driver check to self.rank % tp_size == 0, matching the definition in V0.
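
A small standalone example of what that check selects (assuming the rank layout used here, with TP ranks contiguous within each PP stage, i.e. rank = pp_rank * tp_size + tp_rank):

    tp_size, pp_size = 2, 2

    for rank in range(tp_size * pp_size):
        pp_rank, tp_rank = divmod(rank, tp_size)
        is_driver = rank % tp_size == 0  # TP rank 0 of every PP stage
        print(f"rank={rank} pp_rank={pp_rank} tp_rank={tp_rank} is_driver={is_driver}")
    # rank=0 pp_rank=0 tp_rank=0 is_driver=True
    # rank=1 pp_rank=0 tp_rank=1 is_driver=False
    # rank=2 pp_rank=1 tp_rank=0 is_driver=True
    # rank=3 pp_rank=1 tp_rank=1 is_driver=False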

@youkaichao (Member)

also cc @njhill and @tlrmchlsmth for the mp-related change.

@bigPYJ1151 (Member Author)

Thanks to @comaniac, @ruisearch42, and @youkaichao for the comments; I have resolved them all.

I have some initial benchmark results. On a 4xL40 platform:

# TP=2 PP=1
VLLM_USE_V1=1 python3 benchmark_throughput.py --backend=vllm --dataset=./ShareGPT_V3_unfiltered_cleaned_split.json --model=meta-llama/Meta-Llama-3-8B-Instruct --n=1 --num-prompts=1000 --trust-remote-code --disable-log-stats -tp=2 -pp=1 --max-model-len=8192 
Throughput: 5.60 requests/s, 2314.67 total tokens/s, 1110.17 output tokens/s

# TP=1 PP=2
VLLM_USE_V1=1 python3 benchmark_throughput.py --backend=vllm --dataset=./ShareGPT_V3_unfiltered_cleaned_split.json --model=meta-llama/Meta-Llama-3-8B-Instruct --n=1 --num-prompts=1000 --trust-remote-code --disable-log-stats -tp=1 -pp=2 --max-model-len=8192 
Throughput: 5.74 requests/s, 2374.38 total tokens/s, 1138.81 output tokens/s

# TP=4 PP=1
VLLM_USE_V1=1 python3 benchmark_throughput.py --backend=vllm --dataset=./ShareGPT_V3_unfiltered_cleaned_split.json --model=meta-llama/Meta-Llama-3-8B-Instruct --n=1 --num-prompts=1000 --trust-remote-code --disable-log-stats -tp=4 -pp=1 --max-model-len=8192 
Throughput: 7.93 requests/s, 3277.62 total tokens/s, 1572.03 output tokens/s

# TP=1 PP=4
VLLM_USE_V1=1 python3 benchmark_throughput.py --backend=vllm --dataset=./ShareGPT_V3_unfiltered_cleaned_split.json --model=meta-llama/Meta-Llama-3-8B-Instruct --n=1 --num-prompts=1000 --trust-remote-code --disable-log-stats -tp=1 -pp=4 --max-model-len=8192 
Throughput: 9.28 requests/s, 3838.89 total tokens/s, 1841.22 output tokens/s

# TP=2 PP=2
VLLM_USE_V1=1 python3 benchmark_throughput.py --backend=vllm --dataset=./ShareGPT_V3_unfiltered_cleaned_split.json --model=meta-llama/Meta-Llama-3-8B-Instruct --n=1 --num-prompts=1000 --trust-remote-code --disable-log-stats -tp=2 -pp=2 --max-model-len=8192 
Throughput: 9.69 requests/s, 4005.70 total tokens/s, 1921.23 output tokens/s

) -> Union[ModelRunnerOutput, Future[ModelRunnerOutput]]:
    output = self._run_workers("execute_model",
                               args=(scheduler_output, ),
                               non_block=self.max_concurrent_batches > 1)
Collaborator

nit: We should be able to eliminate non_block here and just have

def _run_workers(self, ...):
    non_block = self.max_concurrent_batches > 1
    ...

Member Author

Now collective_rpc uses _run_workers. I think this will make all collective_rpc calls return future objects when using PP 🤔

Collaborator

I see. It's a bit confusing though. Could we add more comments to help future development? For example:

    def _run_workers(self,
                     method: Union[str, Callable],
                     timeout: Optional[float] = None,
                     args: tuple = (),
                     kwargs: Optional[dict] = None,
                     non_block: bool = False) -> list[Any]:
        """Run the method on workers of all ranks and get their responses.
        Note that when non_block=True (PP > 1), this function immediately
        returns future objects to unblock the pipeline.

        Args:
            ...
        
        Returns:
            A list of responses from workers of all ranks. The response will be
            future objects if non_block=True.
        """

Comment on lines 230 to 261
# Note: only returns ModelRunnerOutput from the driver worker with the
# last PP rank
return output[self.world_size -
              self.parallel_config.tensor_parallel_size]
Collaborator

This calculation may have issues when other parallelism is used (e.g., EP, DP attention)?

Suggested change
- # Note: only returns ModelRunnerOutput from the driver worker with the
- # last PP rank
- return output[self.world_size -
-               self.parallel_config.tensor_parallel_size]
+ # Only returns ModelRunnerOutput from TP rank=0 and PP rank=-1
+ # (the first TP worker of the last PP stage).
+ return output[self.world_size -
+               self.parallel_config.tensor_parallel_size]

Member

self.world_size - self.parallel_config.tensor_parallel_size is TP rank 0 with PP rank = 1. I think it is still not correct.

Collaborator

Assuming TP=8, PP=4, then the world size is 32. IIUC, in this case the PP ranks would be
0-7, PP rank 0
8-15, PP rank 1
16-23, PP rank 2
24-31, PP rank 3

so world_size - tp_size = 32 - 8 = 24 should be PP rank = -1 (i.e. 3)?
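
A standalone sanity check of that calculation (not part of the PR; it just restates the arithmetic above under the same layout assumption, rank = pp_rank * tp_size + tp_rank):

    tp_size, pp_size = 8, 4
    world_size = tp_size * pp_size          # 32
    output_rank = world_size - tp_size      # 24
    pp_rank, tp_rank = divmod(output_rank, tp_size)
    assert (pp_rank, tp_rank) == (pp_size - 1, 0)  # last PP stage, first TP worker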

Member Author

EP uses the TP group, so it will not have an issue.

DP will have an issue because the rank map shape is (DP, PP, TP). However, right now the MP executor assumes there are only TP and PP and checks this during initialization.

Member

Oh, my bad, this is indeed correct then. We should add more comments here, e.g. including the calculation from @comaniac; it is quite easy to mess things up here.

@bigPYJ1151 bigPYJ1151 force-pushed the v1_pp branch 2 times, most recently from 64fc61e to 5ab24c7 on March 7, 2025 18:02
@comaniac comaniac (Collaborator) left a comment

LGTM.
Since we are in the process of making v1 the default and this PR changes the executor, perhaps it's better to wait a few more days until other issues are resolved.

Also cc @njhill @robertgshaw2-redhat @WoosukKwon for awareness.

@WoosukKwon (Collaborator)

Hi @bigPYJ1151, can you please rebase the PR and resolve merge conflicts?

@bigPYJ1151 bigPYJ1151 force-pushed the v1_pp branch 2 times, most recently from 9b72ac2 to 44610ab on April 29, 2025 12:48
@bigPYJ1151 (Member Author)

bigPYJ1151 commented Apr 29, 2025

Hi @bigPYJ1151, can you please rebase the PR and resolve merge conflicts?

@WoosukKwon Sure, updated; I also verified the unit tests. Please take a look :)

@WoosukKwon (Collaborator)

WoosukKwon commented Apr 29, 2025

@ruisearch42 @comaniac @youkaichao Can you please take a final look by any chance?

@ruisearch42 ruisearch42 (Collaborator) left a comment

Otherwise LGTM. Thanks for the PR.
Would be nice if the MP backend owners could review too.

@@ -350,6 +349,11 @@ def _compare_tp(
# Temporary. Currently when zeromq + SPMD is used, it does not properly
# terminate because of a Ray Compiled Graph issue.
common_args.append("--disable-frontend-multiprocessing")
elif distributed_backend == "mp":
Collaborator

We only support "ray" and "mp", so this should be an else branch that asserts the backend is "mp".
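
A minimal sketch of the suggested control flow (the backend_args helper below is hypothetical and simplified; the real test helper takes more arguments and conditions):

    def backend_args(distributed_backend: str) -> list[str]:
        # Hypothetical helper, not the actual test code.
        args: list[str] = []
        if distributed_backend == "ray":
            # Temporary workaround for the Ray Compiled Graph issue noted above.
            args.append("--disable-frontend-multiprocessing")
        else:
            # Only "ray" and "mp" are supported, so anything else is a test bug.
            assert distributed_backend == "mp"
        return args

    assert backend_args("mp") == []
    assert backend_args("ray") == ["--disable-frontend-multiprocessing"]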

Comment on lines 59 to 60
f"tensor_parallel_size ({tensor_parallel_size}) x pipeline"
f" parallel_size ({pp_parallel_size}). ")
Collaborator

nit: use a single word pipeline_parallel_size in the message

@ruisearch42 ruisearch42 (Collaborator) left a comment

Please fix the other minor issue as well

bigPYJ1151 added 2 commits May 6, 2025 07:18
Signed-off-by: jiang1.li <jiang1.li@intel.com>
Signed-off-by: jiang.li <jiang1.li@intel.com>
@WoosukKwon WoosukKwon added the ready ONLY add when PR is ready to merge/full CI is needed label May 6, 2025
@WoosukKwon (Collaborator)

@bigPYJ1151 I've just started the CI test. Will merge once it becomes green.

@bigPYJ1151 (Member Author)

@WoosukKwon All required checks are green; please help merge, thanks :)

@WoosukKwon WoosukKwon merged commit a6fed02 into vllm-project:main May 6, 2025
73 checks passed
robertgshaw2-redhat added a commit to neuralmagic/vllm that referenced this pull request May 6, 2025
* [Model] Add GraniteMoeHybrid 4.0 model (vllm-project#17497)

Signed-off-by: Thomas Ortner <boh@zurich.ibm.com>
Signed-off-by: Stanislaw Wozniak <stw@zurich.ibm.com>
Co-authored-by: Thomas Ortner <boh@zurich.ibm.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
Co-authored-by: Tyler Michael Smith <tysmith@redhat.com>

* [easy] Fix logspam on PiecewiseBackend errors (vllm-project#17138)

Signed-off-by: rzou <zou3519@gmail.com>

* [Bugfix] Fixed prompt length for random dataset (vllm-project#17408)

Signed-off-by: Mikhail Podvitskii <podvitskiymichael@gmail.com>

* [Doc] Update notes for H2O-VL and Gemma3 (vllm-project#17219)

Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>

* [Misc] Fix ScalarType float4 naming  (vllm-project#17690)

Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>

* Fix `dockerfilegraph` pre-commit hook (vllm-project#17698)

Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>

* [Bugfix] Fix triton import with local TritonPlaceholder (vllm-project#17446)

Signed-off-by: Mengqing Cao <cmq0113@163.com>

* [V1] Enable TPU V1 backend by default (vllm-project#17673)

Signed-off-by: mgoin <mgoin64@gmail.com>

* [V1][PP] Support PP for MultiprocExecutor (vllm-project#14219)

Signed-off-by: jiang1.li <jiang1.li@intel.com>
Signed-off-by: jiang.li <jiang1.li@intel.com>

* [v1] AttentionMetadata for each layer (vllm-project#17394)

Signed-off-by: Chen Zhang <zhangch99@outlook.com>

* [Feat] Add deprecated=True to CLI args (vllm-project#17426)

Signed-off-by: Aaron Pham <contact@aarnphm.xyz>

* [Docs] Use gh-file to add links to tool_calling.md (vllm-project#17709)

Signed-off-by: windsonsea <haifeng.yao@daocloud.io>

* [v1] Introduce KVCacheBlocks as interface between Scheduler and KVCacheManager (vllm-project#17479)

Signed-off-by: Chen Zhang <zhangch99@outlook.com>

* [doc] Add RAG Integration example (vllm-project#17692)

Signed-off-by: reidliu41 <reid201711@gmail.com>
Co-authored-by: reidliu41 <reid201711@gmail.com>

* [Bugfix] Fix modality limits in vision language example (vllm-project#17721)

Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>

* Make right sidebar more readable in "Supported Models" (vllm-project#17723)

Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>

* [TPU] Increase block size and reset block shapes (vllm-project#16458)

* [Misc] Add Next Edit Prediction (NEP) datasets support in `benchmark_serving.py` (vllm-project#16839)

Signed-off-by: dtransposed <damian@damian-ml-machine.europe-west3-b.c.jetbrains-grazie.internal>
Signed-off-by: dtransposed <>
Co-authored-by: dtransposed <damian@damian-ml-machine.europe-west3-b.c.jetbrains-grazie.internal>

* [Bugfix] Fix for the condition to accept empty encoder inputs for mllama (vllm-project#17732)

Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>

* [Kernel] Unified Triton kernel that doesn't distinguish between prefill + decode (vllm-project#16828)

Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>
Co-authored-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>

---------

Signed-off-by: Thomas Ortner <boh@zurich.ibm.com>
Signed-off-by: Stanislaw Wozniak <stw@zurich.ibm.com>
Signed-off-by: rzou <zou3519@gmail.com>
Signed-off-by: Mikhail Podvitskii <podvitskiymichael@gmail.com>
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Signed-off-by: Mengqing Cao <cmq0113@163.com>
Signed-off-by: mgoin <mgoin64@gmail.com>
Signed-off-by: jiang1.li <jiang1.li@intel.com>
Signed-off-by: jiang.li <jiang1.li@intel.com>
Signed-off-by: Chen Zhang <zhangch99@outlook.com>
Signed-off-by: Aaron Pham <contact@aarnphm.xyz>
Signed-off-by: windsonsea <haifeng.yao@daocloud.io>
Signed-off-by: reidliu41 <reid201711@gmail.com>
Signed-off-by: dtransposed <damian@damian-ml-machine.europe-west3-b.c.jetbrains-grazie.internal>
Signed-off-by: dtransposed <>
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com>
Co-authored-by: Stan Wozniak <77159600+s3woz@users.noreply.github.com>
Co-authored-by: Thomas Ortner <boh@zurich.ibm.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
Co-authored-by: Tyler Michael Smith <tysmith@redhat.com>
Co-authored-by: Richard Zou <zou3519@users.noreply.github.com>
Co-authored-by: Mikhail Podvitskii <podvitskiymichael@gmail.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
Co-authored-by: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Co-authored-by: Mengqing Cao <cmq0113@163.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
Co-authored-by: Li, Jiang <jiang1.li@intel.com>
Co-authored-by: Chen Zhang <zhangch99@outlook.com>
Co-authored-by: Aaron Pham <contact@aarnphm.xyz>
Co-authored-by: Michael Yao <haifeng.yao@daocloud.io>
Co-authored-by: Reid <61492567+reidliu41@users.noreply.github.com>
Co-authored-by: reidliu41 <reid201711@gmail.com>
Co-authored-by: Jevin Jiang <jevin0change@gmail.com>
Co-authored-by: d.transposed <damian.bogunowicz@gmail.com>
Co-authored-by: dtransposed <damian@damian-ml-machine.europe-west3-b.c.jetbrains-grazie.internal>
Co-authored-by: Gregory Shtrasberg <156009573+gshtras@users.noreply.github.com>
Co-authored-by: Thomas Parnell <tpa@zurich.ibm.com>
Co-authored-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>
RichardoMrMu pushed a commit to RichardoMrMu/vllm that referenced this pull request May 12, 2025
Signed-off-by: jiang1.li <jiang1.li@intel.com>
Signed-off-by: jiang.li <jiang1.li@intel.com>
Signed-off-by: Mu Huai <tianbowen.tbw@antgroup.com>
mawong-amd pushed a commit to ROCm/vllm that referenced this pull request May 14, 2025
Signed-off-by: jiang1.li <jiang1.li@intel.com>
Signed-off-by: jiang.li <jiang1.li@intel.com>
zzzyq pushed a commit to zzzyq/vllm that referenced this pull request May 24, 2025
Signed-off-by: jiang1.li <jiang1.li@intel.com>
Signed-off-by: jiang.li <jiang1.li@intel.com>
Signed-off-by: Yuqi Zhang <yuqizhang@google.com>
minpeter pushed a commit to minpeter/vllm that referenced this pull request Jun 24, 2025
Signed-off-by: jiang1.li <jiang1.li@intel.com>
Signed-off-by: jiang.li <jiang1.li@intel.com>
Signed-off-by: minpeter <kali2005611@gmail.com>
Labels
ready (ONLY add when PR is ready to merge/full CI is needed), v1