[Auto Parallel] fix hang caused by different process group initialization order #68847
PR Category
Auto Parallel
PR Types
Bug fixes
Description
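Fixes a hang caused by ranks creating process groups in different orders. The problem first showed up while debugging the allgather MoE dynamic-to-static test. The cause is as follows:
In dynamic-to-static mode, process groups are currently created inside the static semi-auto reshard modules, for example nd_mesh_reshard_func.py. After a process group is created, all processes synchronize once, which avoids other problems. This means every rank must create process groups in strictly the same order, otherwise the job may hang. For example, with 4 processes [0, 1, 2, 3], if some processes create the [0, 2] group and others do not, the synchronization never completes and the job hangs. The MoE scenario happens to trigger exactly this case.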
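The ordering constraint described above can be illustrated with a small, framework-free simulation. This is only a sketch and not the code changed in this PR: `would_hang` and `canonical_order` are hypothetical helper names, and the model assumes that every group creation is followed by a synchronization across all ranks, so any difference between the per-rank creation sequences leaves some rank waiting forever.

```python
from typing import Dict, List, Tuple

Group = Tuple[int, ...]  # a process group, written as a tuple of rank ids


def would_hang(order_per_rank: Dict[int, List[Group]]) -> bool:
    """Return True if the ranks do not all create the same groups in the same order.

    Simplified model (assumption): each creation is followed by a global
    sync, so any mismatch between sequences blocks at least one rank.
    """
    sequences = list(order_per_rank.values())
    return any(seq != sequences[0] for seq in sequences[1:])


def canonical_order(needed_per_rank: Dict[int, List[Group]]) -> List[Group]:
    """Build one deterministic creation order shared by every rank.

    Takes the union of all groups any rank needs, normalizes each group by
    sorting its members, and sorts the result.
    """
    union = {tuple(sorted(g)) for groups in needed_per_rank.values() for g in groups}
    return sorted(union)


# Each rank only creates the groups it thinks it needs: sequences differ.
needed = {
    0: [(0, 2), (0, 1)],
    1: [(0, 1)],
    2: [(0, 2), (2, 3)],
    3: [(2, 3)],
}
naive_hangs = would_hang(needed)

# Every rank walks the same canonical list, including groups it is not part of.
order = canonical_order(needed)
fixed = {rank: list(order) for rank in needed}
fixed_hangs = would_hang(fixed)
```

In this model the naive per-rank sequences differ, while the canonical list is identical on every rank, including ranks that are not members of a given group.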
Pcard-73145