Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Core][Distributed] add coordinator to reduce code duplication in tp and pp #5293

Merged
merged 43 commits into from
Jun 13, 2024

Conversation

youkaichao
Copy link
Member

This PR adds coordinator, the last piece in #3587 . The main benefit is that all distributed stuff (e.g. complicated selection logic in communication ops, prev/next rank logic, initialization&destruction logic) can be shared for all tp/pp code.

After this PR, we explicitly require the code writer specify which group it wants to operate on:

  • world group, by get_world()
  • tp group, by get_tp()
  • pp group, by get_pp()

After that, they can just call all_reduce, prev_rank ... without any further changes.

@youkaichao youkaichao requested a review from zhuohan123 June 5, 2024 21:05
Copy link
Member

@zhuohan123 zhuohan123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general LGTM! Left some style comments.

_WORLD: Optional[GroupCoordinator] = None


def get_world() -> GroupCoordinator:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about the following changes? Make the function name easier to understand:

  • get_world -> get_world_group?
  • get_tp -> get_tp_group?
  • get_pp -> get_pp_group?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed in ef78f38ef78f38

vllm/distributed/parallel_state.py Outdated Show resolved Hide resolved
vllm/distributed/communication_op.py Show resolved Hide resolved
@youkaichao youkaichao merged commit ea3890a into vllm-project:main Jun 13, 2024
56 of 57 checks passed
@youkaichao youkaichao deleted the coordinator_impl branch June 13, 2024 00:27
@wooyeonlee0
Copy link
Contributor

@youkaichao Great! I love this refactoring PR 👍
My code in #5414 can be simpler now :)

robertgshaw2-neuralmagic pushed a commit to neuralmagic/nm-vllm that referenced this pull request Jun 16, 2024
…-project#5293)

[Core][Distributed] add coordinator to reduce code duplication in tp and pp (vllm-project#5293)
joerunde pushed a commit to joerunde/vllm that referenced this pull request Jun 17, 2024
…-project#5293)

[Core][Distributed] add coordinator to reduce code duplication in tp and pp (vllm-project#5293)
xjpang pushed a commit to xjpang/vllm that referenced this pull request Jun 27, 2024
…-project#5293)

[Core][Distributed] add coordinator to reduce code duplication in tp and pp (vllm-project#5293)
xjpang pushed a commit to xjpang/vllm that referenced this pull request Jul 8, 2024
…-project#5293)

[Core][Distributed] add coordinator to reduce code duplication in tp and pp (vllm-project#5293)
xjpang pushed a commit to xjpang/vllm that referenced this pull request Jul 24, 2024
…-project#5293)

[Core][Distributed] add coordinator to reduce code duplication in tp and pp (vllm-project#5293)
Temirulan pushed a commit to Temirulan/vllm-whisper that referenced this pull request Sep 6, 2024
…-project#5293)

[Core][Distributed] add coordinator to reduce code duplication in tp and pp (vllm-project#5293)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants