
Conversation

Contributor

@MatthewBonanni MatthewBonanni commented Sep 22, 2025

NOTE: This PR is a duplicate of #24963, authored by @Bam4d, opened because this fix is high-priority and must be merged shortly.

When using DP and CUDA graphs, but running with a batch that does not fill all DP ranks, execute_dummy is used to give the empty ranks work for the MoE layers.

Without CUDA graphs enabled, execute_dummy is significantly slower than the normal execute_batch, which slows down all ranks because of the EP synchronization.
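
For context, a minimal runnable sketch (illustrative only, not vLLM code; the group setup, function names, and tensor shapes are assumptions) of why an idle DP rank still has to push a dummy batch through the MoE layers, and why a slow dummy pass stalls every rank:

import os
import torch
import torch.distributed as dist

def moe_layer(hidden: torch.Tensor) -> torch.Tensor:
    # Stand-in for an expert-parallel MoE block: the collective only
    # completes once every rank has entered it, including ranks that
    # have no real requests this step.
    dist.all_reduce(hidden)
    return hidden

def step(batch, hidden_size: int = 16) -> torch.Tensor:
    if batch is None:
        # Idle DP rank: run a dummy batch so the collective can complete.
        # If this dummy pass executes eagerly instead of replaying a
        # captured CUDA graph, it is slower than the busy ranks' forward
        # pass, and EP synchronization makes every rank wait for it.
        batch = torch.zeros(1, hidden_size)
    return moe_layer(batch)

if __name__ == "__main__":
    # Single-process gloo group, just to make the sketch runnable.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("gloo", rank=0, world_size=1)
    step(None)                # idle rank path (dummy batch)
    step(torch.ones(4, 16))   # rank with real work
    dist.destroy_process_group()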

Before

BS1 DP8

Mean ITL (ms):                           137.62    
Median ITL (ms):                         131.40    
P99 ITL (ms):                            194.54

BS32 DP8

Mean ITL (ms):                           30.49     
Median ITL (ms):                         30.09     
P99 ITL (ms):                            46.84

After

BS1 DP8

Mean ITL (ms):                           31.70     << Significantly faster ITL at BS1
Median ITL (ms):                         30.82     
P99 ITL (ms):                            49.60  

BS32 DP8

Mean ITL (ms):                           30.10     
Median ITL (ms):                         29.82     
P99 ITL (ms):                            35.32

Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request rolls back the 'uniform decode with mixed batch in cudagraph capture' feature by adding an assertion. It also includes a critical bug fix for the token distribution logic within _dummy_run for uniform decode scenarios. The previous logic could lead to an IndexError due to incorrect request count calculation and could also assign an incorrect number of tokens to the last request. The new implementation correctly uses ceiling division for calculating the number of requests and properly assigns the remainder of tokens, resolving these issues. The changes are correct and improve the robustness of the code. I approve this pull request.
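
To make the described fix concrete, here is a minimal sketch of the corrected token-distribution logic (the function name and signature are invented for illustration; the actual change lives in _dummy_run in gpu_model_runner.py and uses vLLM's cdiv helper):

def split_uniform_decode_tokens(num_tokens: int, max_query_len: int,
                                max_num_reqs: int) -> list[int]:
    # Ceiling division (what vLLM's cdiv does): enough requests to cover
    # all tokens, each holding max_query_len tokens...
    num_reqs = -(-num_tokens // max_query_len)
    assert num_reqs <= max_num_reqs, \
        "Do not capture num_reqs > max_num_reqs for uniform batch"
    tokens_per_req = [max_query_len] * num_reqs
    # ...except the last request, which takes only the remainder.
    if num_tokens % max_query_len:
        tokens_per_req[-1] = num_tokens % max_query_len
    assert sum(tokens_per_req) == num_tokens
    return tokens_per_req

# Example: 10 tokens with max_query_len=4 -> 3 requests holding [4, 4, 2].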

@ProExpertProg ProExpertProg enabled auto-merge (squash) September 22, 2025 19:12
@github-actions github-actions bot added the ready label Sep 22, 2025
@robertgshaw2-redhat robertgshaw2-redhat changed the title Roll back uniform decode with mixed batch cudagraph [Bugfix] Roll back uniform decode with mixed batch cudagraph Sep 22, 2025
@MatthewBonanni MatthewBonanni changed the title [Bugfix] Roll back uniform decode with mixed batch cudagraph [BugFix] [DP/EP] Fix slow execution when BS <= DP Sep 22, 2025
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
auto-merge was automatically disabled September 22, 2025 20:24

Head branch was pushed to by a user without write access

@ProExpertProg ProExpertProg enabled auto-merge (squash) September 22, 2025 20:26
@vllm-bot vllm-bot merged commit ac0048c into vllm-project:main Sep 23, 2025
38 of 40 checks passed
@MatthewBonanni MatthewBonanni deleted the rollback_no_mixed_batch branch September 23, 2025 01:43
Review thread on vllm/v1/worker/gpu_model_runner.py, _dummy_run:

-    num_reqs = num_tokens // max_query_len
+    assert not create_mixed_batch
+    num_reqs = cdiv(num_tokens, max_query_len)
+    assert num_reqs <= max_num_reqs, \
+        "Do not capture num_reqs > max_num_reqs for uniform batch"
Collaborator

@luccafong luccafong Sep 24, 2025


We need to consider the case of a single request with a long context that is greater than max_num_reqs (say, a 2048-token request with max_num_reqs=1024). It will fail here, since the DP-padded num_tokens will be 2K and the other DP ranks will fail immediately, because they still go into the uniform_decode branch. @MatthewBonanni
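
Worked through with the numbers from the comment above (a sketch only: max_query_len=1 is assumed for the plain-decode dummy run, and this local cdiv stands in for vLLM's helper):

def cdiv(a: int, b: int) -> int:
    return -(-a // b)

# One DP rank runs a 2048-token prefill, so DP padding raises num_tokens
# to 2048 on the idle ranks as well. Their dummy run still takes the
# uniform_decode branch, where max_query_len is 1.
num_tokens, max_query_len, max_num_reqs = 2048, 1, 1024
num_reqs = cdiv(num_tokens, max_query_len)   # 2048
assert num_reqs <= max_num_reqs, \
    "Do not capture num_reqs > max_num_reqs for uniform batch"  # raises here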

Contributor

In what case would we run into this situation in the dummy run, i.e. a single request with a long context but uniform_decode?

Contributor

I just hit this assertion error with a DeepSeek DP EP run.

(EngineCore_DP1 pid=1540) ERROR 09-25 22:16:40 [core.py:710] EngineCore encountered a fatal error.
(EngineCore_DP1 pid=1540) ERROR 09-25 22:16:40 [core.py:710] Traceback (most recent call last):
(EngineCore_DP1 pid=1540) ERROR 09-25 22:16:40 [core.py:710]   File "/packages/smart.inference_platform_sp.llm_predictor_gpu_vllm.persistent/service-inplace#link-tree/vllm/v1/engine/core.py", line 701, in run_engine_core
(EngineCore_DP1 pid=1540) ERROR 09-25 22:16:40 [core.py:710]     engine_core.run_busy_loop()
(EngineCore_DP1 pid=1540) ERROR 09-25 22:16:40 [core.py:710]   File "/packages/smart.inference_platform_sp.llm_predictor_gpu_vllm.persistent/service-inplace#link-tree/vllm/v1/engine/core.py", line 1056, in run_busy_loop
(EngineCore_DP1 pid=1540) ERROR 09-25 22:16:40 [core.py:710]     self.execute_dummy_batch()
(EngineCore_DP1 pid=1540) ERROR 09-25 22:16:40 [core.py:710]   File "/packages/smart.inference_platform_sp.llm_predictor_gpu_vllm.persistent/service-inplace#link-tree/vllm/v1/engine/core.py", line 387, in execute_dummy_batch
(EngineCore_DP1 pid=1540) ERROR 09-25 22:16:40 [core.py:710]     self.model_executor.execute_dummy_batch()
(EngineCore_DP1 pid=1540) ERROR 09-25 22:16:40 [core.py:710]   File "/packages/smart.inference_platform_sp.llm_predictor_gpu_vllm.persistent/service-inplace#link-tree/vllm/v1/executor/abstract.py", line 109, in execute_dummy_batch
(EngineCore_DP1 pid=1540) ERROR 09-25 22:16:40 [core.py:710]     self.collective_rpc("execute_dummy_batch")
(EngineCore_DP1 pid=1540) ERROR 09-25 22:16:40 [core.py:710]   File "/packages/smart.inference_platform_sp.llm_predictor_gpu_vllm.persistent/service-inplace#link-tree/vllm/executor/uniproc_executor.py", line 83, in collective_rpc
(EngineCore_DP1 pid=1540) ERROR 09-25 22:16:40 [core.py:710]     return [run_method(self.driver_worker, method, args, kwargs)]
(EngineCore_DP1 pid=1540) ERROR 09-25 22:16:40 [core.py:710]   File "/packages/smart.inference_platform_sp.llm_predictor_gpu_vllm.persistent/service-inplace#link-tree/vllm/utils/__init__.py", line 3010, in run_method
(EngineCore_DP1 pid=1540) ERROR 09-25 22:16:40 [core.py:710]     return func(*args, **kwargs)
(EngineCore_DP1 pid=1540) ERROR 09-25 22:16:40 [core.py:710]   File "/packages/smart.inference_platform_sp.llm_predictor_gpu_vllm.persistent/service-inplace#link-tree/vllm/v1/worker/gpu_worker.py", line 490, in execute_dummy_batch
(EngineCore_DP1 pid=1540) ERROR 09-25 22:16:40 [core.py:710]     self.model_runner._dummy_run(1, uniform_decode=True)
(EngineCore_DP1 pid=1540) ERROR 09-25 22:16:40 [core.py:710]   File "/packages/smart.inference_platform_sp.llm_predictor_gpu_vllm.persistent/service-inplace#link-tree/torch/utils/_contextlib.py", line 120, in decorate_context
(EngineCore_DP1 pid=1540) ERROR 09-25 22:16:40 [core.py:710]     return func(*args, **kwargs)
(EngineCore_DP1 pid=1540) ERROR 09-25 22:16:40 [core.py:710]   File "/packages/smart.inference_platform_sp.llm_predictor_gpu_vllm.persistent/service-inplace#link-tree/vllm/v1/worker/gpu_model_runner.py", line 2918, in _dummy_run
(EngineCore_DP1 pid=1540) ERROR 09-25 22:16:40 [core.py:710]     assert num_reqs <= max_num_reqs, \
(EngineCore_DP1 pid=1540) ERROR 09-25 22:16:40 [core.py:710] AssertionError: Do not capture num_reqs > max_num_reqs for uniform batch

As @luccafong said, unless we are running PD disagg, we can have token padding that exceeds max_num_reqs whenever some DP rank is running prefill.

FeiDaLI pushed a commit to FeiDaLI/vllm that referenced this pull request Sep 25, 2025
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com>
Co-authored-by: Chris Bamford <chrisbam4d@gmail.com>
charlifu pushed a commit to ROCm/vllm that referenced this pull request Sep 25, 2025
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com>
Co-authored-by: Chris Bamford <chrisbam4d@gmail.com>
Signed-off-by: charlifu <charlifu@amd.com>
yewentao256 pushed a commit that referenced this pull request Oct 3, 2025
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com>
Co-authored-by: Chris Bamford <chrisbam4d@gmail.com>
Signed-off-by: yewentao256 <zhyanwentao@126.com>
gjc0824 pushed a commit to gjc0824/vllm that referenced this pull request Oct 10, 2025
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com>
Co-authored-by: Chris Bamford <chrisbam4d@gmail.com>
Signed-off-by: gaojc <1055866782@qq.com>
xuebwang-amd pushed a commit to xuebwang-amd/vllm that referenced this pull request Oct 10, 2025
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com>
Co-authored-by: Chris Bamford <chrisbam4d@gmail.com>
Signed-off-by: xuebwang-amd <xuebwang@amd.com>
choprahetarth pushed a commit to Tandemn-Labs/vllm that referenced this pull request Oct 11, 2025
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com>
Co-authored-by: Chris Bamford <chrisbam4d@gmail.com>
lywa1998 pushed a commit to lywa1998/vllm that referenced this pull request Oct 20, 2025
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com>
Co-authored-by: Chris Bamford <chrisbam4d@gmail.com>
@lianjiezh

@MatthewBonanni
Hi, could you tell me which commit introduced the issue?

xuebwang-amd pushed a commit to xuebwang-amd/vllm that referenced this pull request Oct 24, 2025
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com>
Co-authored-by: Chris Bamford <chrisbam4d@gmail.com>
Signed-off-by: xuebwang-amd <xuebwang@amd.com>

Labels

ready, v1

Projects

None yet


9 participants