Fix interleaved 1f1b race #1088
Conversation
ghstack-source-id: 88843aff77a12b0e6c0098e74e4a22ac6da8a227 Pull Request resolved: #1088
ghstack-source-id: 7c14ee2bf47d8dde8722355b43d3b5c7bc24e945 Pull Request resolved: #1088
ghstack-source-id: b43ddf1d6f6a342a040b9e98b4ac927aff7472c3 Pull Request resolved: #1088
Thanks for the fix! I haven't run through the entire change but one small comment:
ghstack-source-id: d5566e9b37f3c2797086a31fc1a3447a3ade09ab Pull Request resolved: #1088
Nice work.
pippy/PipelineSchedule.py
Highest rank has a warmup (F only) count of [len(stages) - 1] * group_size
and each hop away from highest rank adds 2 warmup stages due to:
What does group_size mean here? Number of PP ranks? If so, is [len(stages) - 1] * group_size a square of # PP ranks? And what does "hop" mean?
Yep, group_size is the number of PP ranks, and the warmup step count is a multiple of the PP ranks. A "hop" is one rank step away from the highest rank. The comment is a bit old so I will update it.
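For illustration, a rough sketch of that warmup count (names like n_local_stages are illustrative, not the exact code in this PR):

```python
def warmup_steps(rank: int, group_size: int, n_local_stages: int) -> int:
    # Highest rank (group_size - 1) runs (n_local_stages - 1) * group_size
    # forward-only steps before entering steady-state 1F1B.
    base = (n_local_stages - 1) * group_size
    # Each "hop" away from the highest rank adds 2 more warmup forwards.
    hops = (group_size - 1) - rank
    return base + 2 * hops

# With 4 PP ranks and 2 local stages per rank:
# rank 3 -> 4, rank 2 -> 6, rank 1 -> 8, rank 0 -> 10
for r in range(4):
    print(r, warmup_steps(r, group_size=4, n_local_stages=2))
```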
Oh, so len(stages) means the number of local stages?
TODO: Interleaved 1F1B does not support using sorted_batch_isend_irecv()
because it requires recvs and sends from different peers
to execute in the same coalesced operation. As a result, this schedule does
not support models with skip connections.
I like this comment. To act on it, we could add a has_skip_conn field in pipe_info, so that the schedule can check this field and error out here (rather than silently hang).
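Something like this rough sketch (has_skip_conn and the constructor shape are assumed names for illustration, not existing PiPPy attributes):

```python
class ScheduleInterleaved1F1B:
    def __init__(self, stages, n_microbatches, pipe_info=None):
        # Fail fast instead of silently hanging: this schedule cannot coalesce
        # recvs and sends to different peers, which skip connections would need.
        if pipe_info is not None and getattr(pipe_info, "has_skip_conn", False):
            raise NotImplementedError(
                "Interleaved 1F1B does not support models with skip connections"
            )
        self.stages = stages
        self.n_microbatches = n_microbatches
```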
ghstack-source-id: 9ccbaefb21a98b49ac89ddb3f35e52c48f1993c1 Pull Request resolved: #1088
ghstack-source-id: 21fd725ff4e7de966721e5ab74d29fe8bbd03fcf Pull Request resolved: #1088
ghstack-source-id: 897aefaedaf8b2ff2989839f42269f5536254962 Pull Request resolved: #1088
Current fixes:
sorted_batch_isend_irecv() is not working for interleaved 1F1B, since we need to ensure that recvs / sends across different peers are all part of the same coalesced op. So we remove its usage from interleaved 1F1B.
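The resulting pattern is roughly the one below: each step's sends and recvs to different peers go into a single batch_isend_irecv call instead of being split per peer. This is a simplified sketch, not the exact code in the PR:

```python
import torch.distributed as dist

def exchange(send_tensor, recv_tensor, send_peer, recv_peer):
    # Issue the send and the recv to *different* peers as one op list so they
    # launch as a single coalesced batch; splitting them per peer (as the
    # sorted_batch_isend_irecv helper did) is what this PR removes here.
    ops = [
        dist.P2POp(dist.isend, send_tensor, send_peer),
        dist.P2POp(dist.irecv, recv_tensor, recv_peer),
    ]
    reqs = dist.batch_isend_irecv(ops)
    for req in reqs:
        req.wait()
```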
Test added: when I add a sleep for certain ranks, there is a failure on rank 0:
[Rank 0] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
Run with:
PIPPY_VERBOSITY=DEBUG pytest test/test_pipeline_schedule.py -vsk test_interleaved_1f1b_with_model_sleep
Stack from ghstack (oldest at bottom):