Fix interleaved 1f1b race #1088
Conversation
Thanks for the fix! I haven't run through the entire change, but one small comment:
Nice work.
pippy/PipelineSchedule.py (outdated)
Highest rank has a warmup (F only) count of [len(stages) - 1] * group_size
and each hop away from highest rank adds 2 warmup stages due to:
What does group_size mean here? Number of PP ranks? If so, is [len(stages) - 1] * group_size a square of the number of PP ranks? And what does "hop" mean?
Yep, group_size is the number of PP ranks, and the warmup step count is going to be a multiple of that. "Hop" means each rank below the highest rank. The comment is a bit old, so I will update it; the rule it describes is roughly the sketch below.
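A minimal sketch of that warmup-count rule, assuming a hypothetical helper name and that rank group_size - 1 is the highest rank:

```python
def warmup_steps(rank: int, group_size: int, n_local_stages: int) -> int:
    # Highest rank runs (len(stages) - 1) * group_size forward-only steps;
    # every hop away from the highest rank adds 2 more warmup steps.
    hops_from_highest = (group_size - 1) - rank
    return (n_local_stages - 1) * group_size + 2 * hops_from_highest
```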
Oh, so len(stages) means the number of local stages?
TODO: Interleaved 1F1B does not support using sorted_batch_isend_irecv()
because it requires recvs and sends from different peers
to execute in the same coalesced operation. As a result, this schedule does
not support models with skip connections.
I like this comment. To act on it, we can add a has_skip_conn field in pipe_info, so that the schedule can check this field and error out here (rather than silently hang). Something like the sketch below.
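A minimal sketch of that guard; has_skip_conn and the pipe_info attribute name are the reviewer's proposal, not an existing PiPPy API:

```python
def check_skip_connections(pipe_info) -> None:
    # Proposed guard: fail fast instead of hanging in the schedule.
    # `has_skip_conn` is the suggested (hypothetical) field on pipe_info.
    if getattr(pipe_info, "has_skip_conn", False):
        raise RuntimeError(
            "Interleaved 1F1B does not support models with skip connections: "
            "their recvs/sends from different peers cannot share one "
            "coalesced sorted_batch_isend_irecv() op."
        )
```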
Current fixes:
sorted_batch_isend_irecv() is not working for interleaved 1F1B, since it would require recvs/sends across different peers to all be part of the same coalesced op. So we remove its usage from interleaved 1F1B; a rough sketch of the replacement pattern is below.
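Not the PR's actual diff; a minimal sketch of issuing one step's p2p ops through the standard torch.distributed batch API instead of the sorted helper, with illustrative tensor and rank names:

```python
import torch
import torch.distributed as dist

def exchange(send_tensor: torch.Tensor, recv_buf: torch.Tensor,
             next_rank: int, prev_rank: int) -> None:
    # Issue this step's send/recv directly rather than sorting all ops
    # into one coalesced sorted_batch_isend_irecv() call.
    # Assumes an already-initialized process group.
    ops = [
        dist.P2POp(dist.isend, send_tensor, next_rank),
        dist.P2POp(dist.irecv, recv_buf, prev_rank),
    ]
    for work in dist.batch_isend_irecv(ops):
        work.wait()
```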
Test added:
When I add a sleep for certain ranks, there is a failure on rank 0 (a sketch of the sleep injection follows the repro command):
[Rank 0] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
Run with:
PIPPY_VERBOSITY=DEBUG pytest test/test_pipeline_schedule.py -vsk test_interleaved_1f1b_with_model_sleep
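A hypothetical sketch of the sleep injection described above: delaying a subset of ranks perturbs send/recv timing across peers, so the race surfaces deterministically as an NCCL timeout instead of the test passing by luck. The delayed ranks and duration here are illustrative, not the test's actual values.

```python
import os
import time

# Stall some ranks before their first send/recv to expose the race.
rank = int(os.environ.get("RANK", "0"))
if rank in (1, 3):  # illustrative choice of ranks to delay
    time.sleep(2.0)
```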
Stack from ghstack (oldest at bottom):