[BugFix] [DP/EP] Fix slow execution when BS <= DP #25407
Conversation
Code Review
This pull request rolls back the 'uniform decode with mixed batch in cudagraph capture' feature by adding an assertion, and it includes a critical fix to the token-distribution logic in _dummy_run for uniform-decode scenarios. The previous logic could raise an IndexError because it computed the request count incorrectly, and it could also assign the wrong number of tokens to the last request. The new implementation uses ceiling division to compute the number of requests and assigns the remaining tokens to the last request, resolving both issues. The changes are correct and improve the robustness of the code. I approve this pull request.
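For illustration, here is a minimal sketch of the corrected token distribution described above. The `cdiv` helper mirrors a ceiling-division utility; the standalone function and its names are illustrative, not the exact _dummy_run code.

```python
# Minimal sketch of the corrected uniform-decode token distribution.
# `cdiv` mirrors a ceiling-division helper; names are illustrative only.
def cdiv(a: int, b: int) -> int:
    return -(-a // b)

def distribute_tokens(num_tokens: int, max_query_len: int) -> list[int]:
    # Ceiling division: a trailing remainder still gets its own request slot.
    num_reqs = cdiv(num_tokens, max_query_len)
    num_scheduled_tokens = [max_query_len] * num_reqs
    # The last request receives only the leftover tokens.
    remainder = num_tokens % max_query_len
    if remainder:
        num_scheduled_tokens[-1] = remainder
    assert sum(num_scheduled_tokens) == num_tokens
    return num_scheduled_tokens

# 10 tokens with max_query_len=4 -> 3 requests scheduled as [4, 4, 2];
# the previous floor division would compute only 2 requests for the same inputs.
print(distribute_tokens(10, 4))
```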
Head branch was pushed to by a user without write access
-        num_reqs = num_tokens // max_query_len
+        assert not create_mixed_batch
+        num_reqs = cdiv(num_tokens, max_query_len)
+        assert num_reqs <= max_num_reqs, \
We need to consider the case of a single request with a long context greater than max_num_reqs (say, a request with 2048 tokens but max_num_reqs=1024). It will fail here: the DP-padded num_tokens will be 2048, and the other DP ranks will fail directly because they still go into the uniform_decode branch. @MatthewBonanni
In what case would we run into this situation in the dummy run, i.e. a single request with a long context but uniform_decode?
I just hit this assertion error with a DeepSeek DP/EP run.
(EngineCore_DP1 pid=1540) ERROR 09-25 22:16:40 [core.py:710] EngineCore encountered a fatal error.
Traceback (most recent call last):
  File "/packages/smart.inference_platform_sp.llm_predictor_gpu_vllm.persistent/service-inplace#link-tree/vllm/v1/engine/core.py", line 701, in run_engine_core
    engine_core.run_busy_loop()
  File "/packages/smart.inference_platform_sp.llm_predictor_gpu_vllm.persistent/service-inplace#link-tree/vllm/v1/engine/core.py", line 1056, in run_busy_loop
    self.execute_dummy_batch()
  File "/packages/smart.inference_platform_sp.llm_predictor_gpu_vllm.persistent/service-inplace#link-tree/vllm/v1/engine/core.py", line 387, in execute_dummy_batch
    self.model_executor.execute_dummy_batch()
  File "/packages/smart.inference_platform_sp.llm_predictor_gpu_vllm.persistent/service-inplace#link-tree/vllm/v1/executor/abstract.py", line 109, in execute_dummy_batch
    self.collective_rpc("execute_dummy_batch")
  File "/packages/smart.inference_platform_sp.llm_predictor_gpu_vllm.persistent/service-inplace#link-tree/vllm/executor/uniproc_executor.py", line 83, in collective_rpc
    return [run_method(self.driver_worker, method, args, kwargs)]
  File "/packages/smart.inference_platform_sp.llm_predictor_gpu_vllm.persistent/service-inplace#link-tree/vllm/utils/__init__.py", line 3010, in run_method
    return func(*args, **kwargs)
  File "/packages/smart.inference_platform_sp.llm_predictor_gpu_vllm.persistent/service-inplace#link-tree/vllm/v1/worker/gpu_worker.py", line 490, in execute_dummy_batch
    self.model_runner._dummy_run(1, uniform_decode=True)
  File "/packages/smart.inference_platform_sp.llm_predictor_gpu_vllm.persistent/service-inplace#link-tree/torch/utils/_contextlib.py", line 120, in decorate_context
    return func(*args, **kwargs)
  File "/packages/smart.inference_platform_sp.llm_predictor_gpu_vllm.persistent/service-inplace#link-tree/vllm/v1/worker/gpu_model_runner.py", line 2918, in _dummy_run
    assert num_reqs <= max_num_reqs, \
AssertionError: Do not capture num_reqs > max_num_reqs for uniform batch
As @luccafong said, unless we are running PD disaggregation, the padded token count can exceed max_num_reqs whenever another DP rank is running prefill.
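To make the failure mode concrete, here is the arithmetic for the hypothetical numbers from the comment above (a 2048-token prefill on one rank, max_num_reqs=1024, uniform decode on the idle rank); the variable names are illustrative.

```python
# Hypothetical scenario from the discussion: DP padding forces every rank to
# run 2048 tokens, while the idle rank's uniform-decode dummy run assumes one
# token per request (max_query_len = 1) and max_num_reqs = 1024.
num_tokens_padded = 2048
max_query_len = 1
max_num_reqs = 1024

num_reqs = -(-num_tokens_padded // max_query_len)  # cdiv(2048, 1) == 2048
print(num_reqs > max_num_reqs)  # True -> the assert in _dummy_run fires, as in the traceback
```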
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com>
Co-authored-by: Chris Bamford <chrisbam4d@gmail.com>
NOTE: This PR is a duplicate of #24963, authored by @Bam4d; it is high-priority and must be merged shortly.
When using DP and CUDA graphs but running a batch that does not fill all DP ranks, execute_dummy_batch is used to give the idle ranks work for the MoE layers.
Without CUDA graphs enabled for it, execute_dummy_batch is significantly slower than the normal batch execution, which slows down all the ranks because of the EP synchronization.
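As a rough illustration of the synchronization effect (not vLLM code; all timings below are invented), each EP step effectively completes only when the slowest DP rank finishes, so an eager dummy run on an idle rank drags every rank down:

```python
# Toy model: EP all-to-all acts like a barrier, so a step takes as long as the
# slowest DP rank. Per-step times are made up purely for illustration.
GRAPH_STEP_MS = 10.0  # hypothetical time when the forward hits a CUDA graph
EAGER_STEP_MS = 35.0  # hypothetical time for an eager dummy forward

def step_time(per_rank_ms: list[float]) -> float:
    # The step completes only once every rank has reached the collective.
    return max(per_rank_ms)

dp_size = 8
# BS1 DP8 before the fix: one rank has a real request, seven run eager dummies.
before = step_time([GRAPH_STEP_MS] + [EAGER_STEP_MS] * (dp_size - 1))
# After the fix: dummy runs also hit the CUDA graph path.
after = step_time([GRAPH_STEP_MS] * dp_size)
print(f"before: {before} ms/step, after: {after} ms/step")
```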
Before: benchmark screenshots for BS1 DP8 and BS32 DP8.
After: benchmark screenshots for BS1 DP8 and BS32 DP8.