[Feature] Support Pipeline Parallelism in torchrun SPMD offline inference for V1 #17827
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.
Overall LGTM. The main question: it's not clear why we need pipeline_parallel_broadcast_output and cannot just make it always true.
For efficient PP, we would want different PP stages from different batches to run concurrently. Would this broadcast make them serialized, i.e., only one batch and one stage running for the full engine?
vllm/v1/worker/gpu_model_runner.py
Outdated
if len(get_pp_group().ranks) > 0:
    model_output_broadcast_data = get_pp_group().broadcast_tensor_dict(
        model_output_broadcast_data, src=last_rank_in_group)
    assert model_output_broadcast_data is not None
Should we return earlier if it's not the first rank?
We still need it for a complete sync in the current solution.
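For readers skimming the thread, here is a minimal standalone sketch of the pattern under discussion (this is not vLLM's code; the group setup, payload, and script name are made up for illustration): the last pipeline stage broadcasts its output to every rank, and because the collective blocks until all ranks join, every rank stays in lockstep each step.

# broadcast_sketch.py -- illustrative only, not vLLM's implementation.
# Launch one process per "stage", e.g.: torchrun --nproc-per-node=2 broadcast_sketch.py
import torch
import torch.distributed as dist


def main():
    # gloo so the sketch also runs on CPU-only machines; torchrun sets the env vars.
    dist.init_process_group(backend="gloo")
    rank = dist.get_rank()
    world_size = dist.get_world_size()

    # Treat every rank as one PP stage; the last stage owns the sampled output.
    last_rank = world_size - 1

    # Only the last stage has real output; other stages pass a placeholder.
    if rank == last_rank:
        payload = [{"sampled_token_ids": torch.tensor([[1, 2, 3]])}]
    else:
        payload = [None]

    # The broadcast blocks until every rank participates -- this is the
    # "complete sync" behavior discussed in the thread above.
    dist.broadcast_object_list(payload, src=last_rank)
    model_output = payload[0]
    print(f"rank {rank} received {model_output}")

    dist.destroy_process_group()


if __name__ == "__main__":
    main()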
This pull request has merge conflicts that must be resolved before it can be merged.
Signed-off-by: Lucia Fang <fanglu@fb.com>
Thanks @youkaichao. Right now torchrun runs in a purely synced way to align with SPMD, so this is not the ideal solution for efficient PP: only one batch and one stage run at a time. As a follow-up, I will think about how to design and improve it so multiple batches can overlap while staying compatible with torchrun.
As discussed with @luccafong, landing this first to unblock some use cases.
Will enhance the perf in a follow-up PR.
Signed-off-by: Lucia Fang <fanglu@fb.com>
LG with a couple comments
last_rank_in_group = pp_group_ranks - 1
if self.parallel_config.distributed_executor_backend \
        == "external_launcher" and len(get_pp_group().ranks) > 0:
    model_output_broadcast_data = get_pp_group().broadcast_tensor_dict(
        model_output_broadcast_data, src=last_rank_in_group)
Could you explain why we need this broadcast?
Yes, added as comments. For now we enable this by syncing all ranks; will improve it to reduce PP bubbles in a follow-up PR.
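For context on the PP bubbles mentioned above, a rough, illustrative estimate (the standard GPipe-style formula, not something taken from this PR): with p stages and m microbatches in flight, the idle fraction of the pipeline is about (p - 1) / (m + p - 1), so a fully synced single batch (m = 1) leaves all but one stage idle at any given moment.

# Rough GPipe-style pipeline-bubble estimate -- illustrative only, not from this PR.
def bubble_fraction(num_stages: int, num_microbatches: int) -> float:
    """Idle fraction of the pipeline: (p - 1) / (m + p - 1)."""
    p, m = num_stages, num_microbatches
    return (p - 1) / (m + p - 1)


if __name__ == "__main__":
    # With no microbatch overlap (m = 1), 3 of 4 stages sit idle at any time.
    print(bubble_fraction(num_stages=4, num_microbatches=1))  # 0.75
    # Overlapping 8 microbatches shrinks the bubble considerably.
    print(bubble_fraction(num_stages=4, num_microbatches=8))  # ~0.27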
Signed-off-by: Lucia Fang <fanglu@fb.com>
This pull request has merge conflicts that must be resolved before it can be merged.
Signed-off-by: Lucia Fang <fanglu@fb.com>
The CI build failure is not related; also pulled in the latest changes. @houseroad
Actually, the v1 test failure may be related: E AssertionError: pipeline model parallel group is not initialized. cc: @luccafong
This is a trunk failure. Added a fix here:
…ce for V1 (vllm-project#17827) Signed-off-by: Lucia Fang <fanglu@fb.com> Signed-off-by: Yuqi Zhang <yuqizhang@google.com>
…ce for V1 (vllm-project#17827) Signed-off-by: Lucia Fang <fanglu@fb.com> Signed-off-by: minpeter <kali2005611@gmail.com>
This PR adds support for PP in torchrun offline inference. Note that it does not yet support overlapping microbatches, so it is not the most efficient way to unblock PP use cases; will improve in follow-ups.
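As a hypothetical usage sketch (the model name, parallel sizes, and script name are placeholders, and the exact arguments may differ from what this PR lands), offline PP inference with torchrun would follow the existing external_launcher pattern, with every rank running the same script:

# pp_offline_inference.py -- hypothetical sketch; launch with, e.g.:
#   torchrun --nproc-per-node=4 pp_offline_inference.py
from vllm import LLM, SamplingParams

prompts = ["Hello, my name is", "The capital of France is"]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

# SPMD-style offline inference: every rank builds the same LLM, and the
# external_launcher backend reuses the torchrun-provided process group.
llm = LLM(
    model="facebook/opt-125m",          # placeholder model
    tensor_parallel_size=2,
    pipeline_parallel_size=2,           # the path this PR enables
    distributed_executor_backend="external_launcher",
)

outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.prompt, output.outputs[0].text)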