[BugFix] [DP/EP] Fix slow execution when BS <= DP #25407
Conversation
Code Review
This pull request rolls back the 'uniform decode with mixed batch in cudagraph capture' feature by adding an assertion, and it includes a critical fix to the token-distribution logic in _dummy_run for uniform-decode scenarios. The previous logic could raise an IndexError because it computed the request count incorrectly, and it could also assign the wrong number of tokens to the last request. The new implementation uses ceiling division to compute the number of requests and assigns the remaining tokens to the last request, resolving both issues. The changes are correct and improve the robustness of the code. I approve this pull request.
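For illustration, here is a minimal sketch of the corrected token distribution described above. The `cdiv` helper mirrors a ceiling-division utility; the standalone function and its names are illustrative, not the exact _dummy_run code.

```python
# Minimal sketch of the corrected uniform-decode token distribution.
# `cdiv` mirrors a ceiling-division helper; names are illustrative only.
def cdiv(a: int, b: int) -> int:
    return -(-a // b)

def distribute_tokens(num_tokens: int, max_query_len: int) -> list[int]:
    # Ceiling division: a trailing remainder still gets its own request slot.
    num_reqs = cdiv(num_tokens, max_query_len)
    num_scheduled_tokens = [max_query_len] * num_reqs
    # The last request receives only the leftover tokens.
    remainder = num_tokens % max_query_len
    if remainder:
        num_scheduled_tokens[-1] = remainder
    assert sum(num_scheduled_tokens) == num_tokens
    return num_scheduled_tokens

# 10 tokens with max_query_len=4 -> 3 requests scheduled as [4, 4, 2];
# the previous floor division would compute only 2 requests for the same inputs.
print(distribute_tokens(10, 4))
```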
Head branch was pushed to by a user without write access
-        num_reqs = num_tokens // max_query_len
+        assert not create_mixed_batch
+        num_reqs = cdiv(num_tokens, max_query_len)
+        assert num_reqs <= max_num_reqs, \
We need to consider the case of a single request with a long context greater than max_num_reqs (say, a request with 2048 tokens but max_num_reqs=1024). It will fail here: the DP-padded num_tokens will be 2048, and the other DP ranks will fail directly because they still go into the uniform_decode branch. @MatthewBonanni
In what case would we run into this situation in the dummy run, i.e. a single request with a long context but uniform_decode?
I just hit this assertion error with a DeepSeek DP/EP run.
(EngineCore_DP1 pid=1540) ERROR 09-25 22:16:40 [core.py:710] EngineCore encountered a fatal error.
Traceback (most recent call last):
  File "/packages/smart.inference_platform_sp.llm_predictor_gpu_vllm.persistent/service-inplace#link-tree/vllm/v1/engine/core.py", line 701, in run_engine_core
    engine_core.run_busy_loop()
  File "/packages/smart.inference_platform_sp.llm_predictor_gpu_vllm.persistent/service-inplace#link-tree/vllm/v1/engine/core.py", line 1056, in run_busy_loop
    self.execute_dummy_batch()
  File "/packages/smart.inference_platform_sp.llm_predictor_gpu_vllm.persistent/service-inplace#link-tree/vllm/v1/engine/core.py", line 387, in execute_dummy_batch
    self.model_executor.execute_dummy_batch()
  File "/packages/smart.inference_platform_sp.llm_predictor_gpu_vllm.persistent/service-inplace#link-tree/vllm/v1/executor/abstract.py", line 109, in execute_dummy_batch
    self.collective_rpc("execute_dummy_batch")
  File "/packages/smart.inference_platform_sp.llm_predictor_gpu_vllm.persistent/service-inplace#link-tree/vllm/executor/uniproc_executor.py", line 83, in collective_rpc
    return [run_method(self.driver_worker, method, args, kwargs)]
  File "/packages/smart.inference_platform_sp.llm_predictor_gpu_vllm.persistent/service-inplace#link-tree/vllm/utils/__init__.py", line 3010, in run_method
    return func(*args, **kwargs)
  File "/packages/smart.inference_platform_sp.llm_predictor_gpu_vllm.persistent/service-inplace#link-tree/vllm/v1/worker/gpu_worker.py", line 490, in execute_dummy_batch
    self.model_runner._dummy_run(1, uniform_decode=True)
  File "/packages/smart.inference_platform_sp.llm_predictor_gpu_vllm.persistent/service-inplace#link-tree/torch/utils/_contextlib.py", line 120, in decorate_context
    return func(*args, **kwargs)
  File "/packages/smart.inference_platform_sp.llm_predictor_gpu_vllm.persistent/service-inplace#link-tree/vllm/v1/worker/gpu_model_runner.py", line 2918, in _dummy_run
    assert num_reqs <= max_num_reqs, \
AssertionError: Do not capture num_reqs > max_num_reqs for uniform batch
As @luccafong said, unless we are running PD disaggregation, the padded token count can exceed max_num_reqs whenever another DP rank is running prefill.
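To make the failure mode concrete, here is the arithmetic for the hypothetical numbers from the comment above (a 2048-token prefill on one rank, max_num_reqs=1024, uniform decode on the idle rank); the variable names are illustrative.

```python
# Hypothetical scenario from the discussion: DP padding forces every rank to
# run 2048 tokens, while the idle rank's uniform-decode dummy run assumes one
# token per request (max_query_len = 1) and max_num_reqs = 1024.
num_tokens_padded = 2048
max_query_len = 1
max_num_reqs = 1024

num_reqs = -(-num_tokens_padded // max_query_len)  # cdiv(2048, 1) == 2048
print(num_reqs > max_num_reqs)  # True -> the assert in _dummy_run fires, as in the traceback
```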
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com>
Co-authored-by: Chris Bamford <chrisbam4d@gmail.com>
NOTE: This PR is a duplicate of #24963, authored by @Bam4d; it is high-priority and must be merged shortly.
When using DP and CUDA graphs but running a batch that does not fill all DP ranks, execute_dummy_batch is used to give the idle ranks work for the MoE layers.
Without CUDA graphs enabled for it, execute_dummy_batch is significantly slower than the normal batch execution, which slows down all the ranks because of the EP synchronization.
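As a rough illustration of the synchronization effect (not vLLM code; all timings below are invented), each EP step effectively completes only when the slowest DP rank finishes, so an eager dummy run on an idle rank drags every rank down:

```python
# Toy model: EP all-to-all acts like a barrier, so a step takes as long as the
# slowest DP rank. Per-step times are made up purely for illustration.
GRAPH_STEP_MS = 10.0  # hypothetical time when the forward hits a CUDA graph
EAGER_STEP_MS = 35.0  # hypothetical time for an eager dummy forward

def step_time(per_rank_ms: list[float]) -> float:
    # The step completes only once every rank has reached the collective.
    return max(per_rank_ms)

dp_size = 8
# BS1 DP8 before the fix: one rank has a real request, seven run eager dummies.
before = step_time([GRAPH_STEP_MS] + [EAGER_STEP_MS] * (dp_size - 1))
# After the fix: dummy runs also hit the CUDA graph path.
after = step_time([GRAPH_STEP_MS] * dp_size)
print(f"before: {before} ms/step, after: {after} ms/step")
```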
Before: benchmark screenshots for BS1 DP8 and BS32 DP8.
After: benchmark screenshots for BS1 DP8 and BS32 DP8.