Skip to content

Fix interleaved 1f1b race #1088

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 7 commits into from
May 1, 2024
Merged

Fix interleaved 1f1b race #1088

merged 7 commits into from
May 1, 2024

Conversation

H-Huang
Copy link
Member

@H-Huang H-Huang commented Apr 24, 2024

Current fixes:

  • We need to delay the send in the warmup phase right before 1f1b phase so that it is sent along with the recvs in the next step. This is to align the communication between ranks to prevent hang
  • Sorted_batch_isend_irecv() is not working for interleaved 1f1b since we need to ensure that recv / send across different peers are all part of the same coalesced op. So we remove its usage from interleaved 1f1b.

Test added:

When I add sleep for certain ranks then there is a failure on rank 0 [Rank 0] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.

Run with:
PIPPY_VERBOSITY=DEBUG pytest test/test_pipeline_schedule.py -vsk test_interleaved_1f1b_with_model_sleep

Stack from ghstack (oldest at bottom):

[ghstack-poisoned]
H-Huang added a commit that referenced this pull request Apr 24, 2024
ghstack-source-id: 88843af
Pull Request resolved: #1088
@H-Huang H-Huang marked this pull request as draft April 24, 2024 17:53
[ghstack-poisoned]
H-Huang added a commit that referenced this pull request Apr 26, 2024
ghstack-source-id: 7c14ee2
Pull Request resolved: #1088
@H-Huang H-Huang changed the title test with hang Fix pipeline schedule changes Apr 26, 2024
@H-Huang H-Huang changed the title Fix pipeline schedule changes Fix interleaved 1f1b race Apr 26, 2024
[ghstack-poisoned]
H-Huang added a commit that referenced this pull request Apr 29, 2024
ghstack-source-id: b43ddf1
Pull Request resolved: #1088
@kwen2501
Copy link
Contributor

Thanks for the fix! I haven't run through the entire change but one small comment:
maybe try not to reuse the name ops too much. It could lead to bugs if other devs did not use it carefully.

[ghstack-poisoned]
H-Huang added a commit that referenced this pull request Apr 29, 2024
ghstack-source-id: d5566e9
Pull Request resolved: #1088
@H-Huang H-Huang marked this pull request as ready for review April 29, 2024 17:09
@H-Huang H-Huang requested review from kwen2501 and wconstab April 29, 2024 17:09
Copy link
Contributor

@kwen2501 kwen2501 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work.

Comment on lines 608 to 609
Highest rank has a warmup (F only) count of [len(stages) - 1] * group_size
and each hop away from highest rank adds 2 warmup stages due to:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does group_size mean here? Number of PP ranks? If so, is [len(stages) - 1] * group_size a square of # PP ranks?
And what does "hop" mean?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep group_size is pp ranks. And the warmup steps is going to be a multiple of the pp ranks. Hop means for every rank less than the highest. The comment is a bit old so i will update it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh, so len(stages) means the number of local stages?

Comment on lines +613 to +616
TODO: Interleaved 1F1B does not support using sorted_batch_isend_irecv()
because it requires recvs and sends from different peers
to execute in the same coalesced operation. As a result, this schedule does
not support models with skip connections.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this comment. To actuate it, we can add a has_skip_conn filed in pipe_info, so that the schedule can check this field and error out here. (Rather than silently hang)

[ghstack-poisoned]
H-Huang added a commit that referenced this pull request May 1, 2024
ghstack-source-id: 9ccbaef
Pull Request resolved: #1088
[ghstack-poisoned]
H-Huang added a commit that referenced this pull request May 1, 2024
ghstack-source-id: 21fd725
Pull Request resolved: #1088
[ghstack-poisoned]
H-Huang added a commit that referenced this pull request May 1, 2024
ghstack-source-id: 897aefa
Pull Request resolved: #1088
@H-Huang H-Huang merged commit 090a0d5 into gh/H-Huang/4/base May 1, 2024
H-Huang added a commit that referenced this pull request May 1, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants