-
-
Notifications
You must be signed in to change notification settings - Fork 5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Core][Distributed] add coordinator to reduce code duplication in tp and pp #5293
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In general LGTM! Left some style comments.
vllm/distributed/parallel_state.py
Outdated
_WORLD: Optional[GroupCoordinator] = None | ||
|
||
|
||
def get_world() -> GroupCoordinator: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about the following changes? Make the function name easier to understand:
get_world
->get_world_group
?get_tp
->get_tp_group
?get_pp
->get_pp_group
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@youkaichao Great! I love this refactoring PR 👍 |
…-project#5293) [Core][Distributed] add coordinator to reduce code duplication in tp and pp (vllm-project#5293)
…-project#5293) [Core][Distributed] add coordinator to reduce code duplication in tp and pp (vllm-project#5293)
…-project#5293) [Core][Distributed] add coordinator to reduce code duplication in tp and pp (vllm-project#5293)
…-project#5293) [Core][Distributed] add coordinator to reduce code duplication in tp and pp (vllm-project#5293)
…-project#5293) [Core][Distributed] add coordinator to reduce code duplication in tp and pp (vllm-project#5293)
…-project#5293) [Core][Distributed] add coordinator to reduce code duplication in tp and pp (vllm-project#5293)
This PR adds coordinator, the last piece in #3587 . The main benefit is that all distributed stuff (e.g. complicated selection logic in communication ops, prev/next rank logic, initialization&destruction logic) can be shared for all tp/pp code.
After this PR, we explicitly require the code writer specify which group it wants to operate on:
get_world()
get_tp()
get_pp()
After that, they can just call
all_reduce
,prev_rank
... without any further changes.