Fix interleaved 1f1b race #1088

Merged
merged 7 commits into gh/H-Huang/4/base on May 1, 2024

Conversation

@H-Huang (Member) commented Apr 24, 2024

Current fixes:

  • Delay the last send in the warmup phase (right before the 1F1B phase) so that it is issued together with the recvs of the next step. This keeps the communication aligned across ranks and prevents a hang (see the sketch after this list).
  • sorted_batch_isend_irecv() does not work for interleaved 1F1B, because recvs and sends to different peers must all be part of the same coalesced op. Its usage is therefore removed from interleaved 1F1B.
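
A minimal sketch (my own illustration, not code from this PR) of the idea behind the first fix, assuming the standard torch.distributed P2P API: the last warmup send is held back and batched with the recv of the following step, so every rank launches a matching coalesced group.

```python
import torch.distributed as dist

def fused_send_recv(deferred_send, recv_buf, next_rank, prev_rank):
    """Hypothetical helper: batch the deferred warmup send with this step's recv."""
    ops = []
    if deferred_send is not None:
        # The send that used to be issued on its own at the end of warmup.
        ops.append(dist.P2POp(dist.isend, deferred_send, next_rank))
    # The recv needed by the upcoming step of the 1F1B phase.
    ops.append(dist.P2POp(dist.irecv, recv_buf, prev_rank))
    # One coalesced launch; every rank must build the matching set of ops,
    # otherwise ranks block on mismatched NCCL groups (the hang this PR fixes).
    for work in dist.batch_isend_irecv(ops):
        work.wait()
```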

Test added:

When I add a sleep for certain ranks, rank 0 fails with: "[Rank 0] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data." (A repro sketch follows the run command below.)

Run with:
PIPPY_VERBOSITY=DEBUG pytest test/test_pipeline_schedule.py -vsk test_interleaved_1f1b_with_model_sleep
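
A hedged sketch of the kind of repro the test relies on (the exact test body is not shown on this page; SleepyStage and its arguments are hypothetical): sleeping on only some ranks skews the schedule enough that the race surfaces as the NCCL timeout quoted above.

```python
import time
import torch.nn as nn

class SleepyStage(nn.Module):
    """Wrap a stage module and sleep on selected ranks to desynchronize them."""

    def __init__(self, inner: nn.Module, rank: int, slow_ranks=(1,), delay_s=1.0):
        super().__init__()
        self.inner = inner
        self.rank = rank
        self.slow_ranks = slow_ranks
        self.delay_s = delay_s

    def forward(self, x):
        if self.rank in self.slow_ranks:
            time.sleep(self.delay_s)  # artificially skew this rank's compute
        return self.inner(x)
```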

Stack from ghstack (oldest at bottom):

H-Huang added a commit that referenced this pull request Apr 24, 2024
ghstack-source-id: 88843aff77a12b0e6c0098e74e4a22ac6da8a227
Pull Request resolved: #1088
@H-Huang marked this pull request as draft April 24, 2024 17:53
H-Huang added a commit that referenced this pull request Apr 26, 2024
ghstack-source-id: 7c14ee2bf47d8dde8722355b43d3b5c7bc24e945
Pull Request resolved: #1088
@H-Huang changed the title from "test with hang" to "Fix pipeline schedule changes" Apr 26, 2024
@H-Huang changed the title from "Fix pipeline schedule changes" to "Fix interleaved 1f1b race" Apr 26, 2024
H-Huang added a commit that referenced this pull request Apr 29, 2024
ghstack-source-id: b43ddf1d6f6a342a040b9e98b4ac927aff7472c3
Pull Request resolved: #1088
@kwen2501 (Contributor) commented

Thanks for the fix! I haven't run through the entire change, but one small comment:
maybe try not to reuse the name ops so much. It could lead to bugs if other devs don't use it carefully.

H-Huang added a commit that referenced this pull request Apr 29, 2024
ghstack-source-id: d5566e9b37f3c2797086a31fc1a3447a3ade09ab
Pull Request resolved: #1088
@H-Huang marked this pull request as ready for review April 29, 2024 17:09
@H-Huang requested review from kwen2501 and wconstab April 29, 2024 17:09
@kwen2501 (Contributor) left a comment
Nice work.

Comment on lines 608 to 609
Highest rank has a warmup (F only) count of [len(stages) - 1] * group_size
and each hop away from highest rank adds 2 warmup stages due to:
kwen2501 (Contributor):

What does group_size mean here? Number of PP ranks? If so, is [len(stages) - 1] * group_size a square of # PP ranks?
And what does "hop" mean?

H-Huang (Member, Author):

Yep, group_size is the number of PP ranks, and the number of warmup steps is a multiple of the number of PP ranks. "Hop" refers to each rank below the highest rank. The comment is a bit outdated, so I will update it.

kwen2501 (Contributor):

oh, so len(stages) means the number of local stages?
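
Assuming the reading the thread converges on (group_size = number of PP ranks, len(stages) = number of local stages per rank), here is a small worked example of the quoted warmup-count rule; warmup_steps is my own illustration, not a function from the PR.

```python
def warmup_steps(rank: int, group_size: int, n_local_stages: int) -> int:
    # Highest rank: (n_local_stages - 1) * group_size forward-only steps;
    # each hop below the highest rank adds 2 more warmup steps.
    highest_rank_warmups = (n_local_stages - 1) * group_size
    hops_from_highest = (group_size - 1) - rank
    return highest_rank_warmups + 2 * hops_from_highest

# e.g. 4 PP ranks with 2 local stages each:
# rank 3 -> 4, rank 2 -> 6, rank 1 -> 8, rank 0 -> 10 warmup steps
for r in range(4):
    print(r, warmup_steps(r, group_size=4, n_local_stages=2))
```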

Comment on lines +613 to +616
TODO: Interleaved 1F1B does not support using sorted_batch_isend_irecv()
because it requires recvs and sends from different peers
to execute in the same coalesced operation. As a result, this schedule does
not support models with skip connections.
kwen2501 (Contributor):
I like this comment. To act on it, we could add a has_skip_conn field in pipe_info so that the schedule can check that field and error out here (rather than silently hanging).
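
A sketch of that suggestion, assuming a hypothetical has_skip_conn attribute on pipe_info (neither name is confirmed as existing API here): the schedule would fail fast instead of hanging.

```python
def _check_no_skip_connections(pipe_info) -> None:
    # has_skip_conn is the hypothetical field proposed above.
    if getattr(pipe_info, "has_skip_conn", False):
        raise NotImplementedError(
            "Interleaved 1F1B cannot coalesce sends/recvs to different peers "
            "into a single batched op, so models with skip connections are "
            "not supported by this schedule."
        )
```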

H-Huang added a commit that referenced this pull request May 1, 2024
ghstack-source-id: 9ccbaefb21a98b49ac89ddb3f35e52c48f1993c1
Pull Request resolved: #1088
H-Huang added a commit that referenced this pull request May 1, 2024
ghstack-source-id: 21fd725ff4e7de966721e5ab74d29fe8bbd03fcf
Pull Request resolved: #1088
H-Huang added a commit that referenced this pull request May 1, 2024
ghstack-source-id: 897aefaedaf8b2ff2989839f42269f5536254962
Pull Request resolved: #1088
@H-Huang merged commit 090a0d5 into gh/H-Huang/4/base May 1, 2024
6 checks passed
H-Huang added a commit that referenced this pull request May 1, 2024