Fix interleaved 1f1b race #1088
Conversation
Thanks for the fix! I haven't run through the entire change, but one small comment:
> Highest rank has a warmup (F only) count of `[len(stages) - 1] * group_size`
> and each hop away from highest rank adds 2 warmup stages due to:
What does `group_size` mean here? The number of PP ranks? If so, is `[len(stages) - 1] * group_size` the square of the number of PP ranks?
And what does "hop" mean?
Yep, `group_size` is the number of PP ranks, and the warmup step count is going to be a multiple of the PP ranks. "Hop" means each rank below the highest. The comment is a bit old, so I will update it.
Oh, so `len(stages)` means the number of local stages?
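For concreteness, a minimal sketch of the warmup-count formula discussed above, assuming `group_size` is the number of PP ranks and `n_local_stages` is `len(stages)`, the number of stages hosted on each rank (names are illustrative, not the actual PiPPy code):

```python
def warmup_steps(rank: int, group_size: int, n_local_stages: int) -> int:
    # Highest rank (group_size - 1) runs (n_local_stages - 1) * group_size
    # forward-only steps before its first backward.
    base = (n_local_stages - 1) * group_size
    # Each hop below the highest rank adds 2 more warmup steps.
    hops = (group_size - 1) - rank
    return base + 2 * hops
```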
> TODO: Interleaved 1F1B does not support using `sorted_batch_isend_irecv()`
> because it requires recvs and sends from different peers
> to execute in the same coalesced operation. As a result, this schedule does
> not support models with skip connections.
I like this comment. To act on it, we can add a `has_skip_conn` field to `pipe_info`, so that the schedule can check this field and error out here (rather than silently hanging).
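A minimal sketch of that suggestion; `PipeInfo`, `has_skip_conn`, and the constructor check are hypothetical names from this thread, not existing PiPPy API:

```python
from dataclasses import dataclass

@dataclass
class PipeInfo:
    num_stages: int
    has_skip_conn: bool = False  # hypothetical field proposed above

class ScheduleInterleaved1F1B:
    def __init__(self, pipe_info: PipeInfo):
        if pipe_info.has_skip_conn:
            # Fail fast at construction time instead of hanging at runtime:
            # this schedule cannot coalesce recvs/sends to different peers,
            # which skip connections would require.
            raise ValueError(
                "Interleaved 1F1B does not support models with skip connections"
            )
        self.pipe_info = pipe_info
```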
Current fixes:
`sorted_batch_isend_irecv()` is not working for interleaved 1F1B, since we need to ensure that recvs/sends across different peers are all part of the same coalesced op. So we remove its usage from interleaved 1F1B.
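For context, a minimal sketch of the non-sorted path, assuming `ops` is the list of `torch.distributed.P2POp`s built for one schedule step (the helper name and wait strategy are illustrative):

```python
import torch.distributed as dist

def issue_step_p2p(ops: list[dist.P2POp]) -> None:
    # Issue every recv/send for this step in a single batch_isend_irecv
    # call, so ops targeting different peers stay in one coalesced NCCL
    # group; splitting them per peer is what caused the hang.
    works = dist.batch_isend_irecv(ops)
    for work in works:
        work.wait()
```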
Test added: a sleep is injected on certain ranks to perturb the P2P timing (a minimal sketch of the idea follows).
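A hypothetical sketch of that sleep injection (the rank selection and duration are illustrative; the real test is `test_interleaved_1f1b_with_model_sleep` in `test/test_pipeline_schedule.py`):

```python
import time
import torch.distributed as dist

def maybe_sleep(seconds: float = 1.0) -> None:
    # Delay odd ranks so sends and recvs involving different peers are
    # issued out of lockstep, surfacing the coalescing race as a hang.
    if dist.get_rank() % 2 == 1:
        time.sleep(seconds)
```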
With the sleep in place, there is a failure on rank 0:

```
[Rank 0] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
```

Run with:
```
PIPPY_VERBOSITY=DEBUG pytest test/test_pipeline_schedule.py -vsk test_interleaved_1f1b_with_model_sleep
```

Stack from ghstack (oldest at bottom):