Remove PP Grad Tail Check #2538
Conversation
Remove PP Grad Tail Check (until deepspeedai#2538 is merged to upstream)
Co-authored-by: Dashiell Stander <dash.stander@gmail.com>
Hi @Quentin-Anthony - is this an active PR you'd still like to see merged? Just going through some older PRs and trying to see if we should review/prioritize or close.
Yes, I'm still interested! I'll take another look sometime over the next few days and decide how best to close this out.
Thanks! Other than one current known failure, it's passing all tests; we just need to figure out who would be best to review.
Hi, I've run into the same problem as well. Could we move forward and merge this fix ASAP?
* Remove PP Grad Tail Check (deepspeedai#2538)

  * Only communicate grad tail if it exists

    Co-authored-by: Dashiell Stander <dash.stander@gmail.com>

  * Revert previous patch and just always send the grad tail

  * Formatting

  ---------

  Co-authored-by: Dashiell Stander <dash.stander@gmail.com>
  Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
  Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>

* Added __HIP_PLATFORM_AMD__=1 (deepspeedai#4570)

* fix multiple definition while building evoformer (deepspeedai#4556)

  The current builder for evoformer uses the same basename for `attention.cpp` and `attention.cu`, so both compile to the same intermediate filename `attention.o`:

  ```shell
  march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O2 -ffunction-sections -pipe -isystem /home/zejianxie/.conda/envs/dll/include -DNDEBUG -D_FORTIFY_SOURCE=2 -O2 -isystem /home/zejianxie/.conda/envs/dll/include build/temp.linux-x86_64-cpython-310/csrc/deepspeed4science/evoformer_attn/attention.o build/temp.linux-x86_64-cpython-310/csrc/deepspeed4science/evoformer_attn/attention.o build/temp.linux-x86_64-cpython-310/csrc/deepspeed4science/evoformer_attn/attention_back.o
  ```

  and

  ```
  `attention_impl(at::Tensor&, at::Tensor&, at::Tensor&, at::Tensor&, at::Tensor&, at::Tensor&, at::Tensor&)':
  tmpxft_0012bef1_00000000-6_attention.compute_86.cudafe1.cpp:(.text+0x330): multiple definition of `attention_impl(at::Tensor&, at::Tensor&, at::Tensor&, at::Tensor&, at::Tensor&, at::Tensor&, at::Tensor&)'; build/temp.linux-x86_64-cpython-310/csrc/deepspeed4science/evoformer_attn/attention.o:tmpxft_0012bef1_00000000-6_attention.compute_86.cudafe1.cpp:(.text+0x330): first defined here
  /home/zejianxie/.conda/envs/dll/bin/../lib/gcc/x86_64-conda-linux-gnu/11.4.0/../../../../x86_64-conda-linux-gnu/bin/ld: build/temp.linux-x86_64-cpython-310/csrc/deepspeed4science/evoformer_attn/attention.o:(.bss+0x0): multiple definition of `torch::autograd::(anonymous namespace)::graph_task_id'; build/temp.linux-x86_64-cpython-310/csrc/deepspeed4science/evoformer_attn/attention.o:(.bss+0x0): first defined here
  ```

  I used the following to reproduce and confirm my fix works:

  ```shell
  git clone https://github.com/NVIDIA/cutlass --depth 1
  CUTLASS_PATH=$PWD/cutlass DS_BUILD_EVOFORMER_ATTN=1 pip install ./DeepSpeed --global-option="build_ext"
  ```

  Co-authored-by: Conglong Li <conglong.li@gmail.com>

* Update ccl.py

---------

Co-authored-by: Quentin Anthony <qganthony@yahoo.com>
Co-authored-by: Dashiell Stander <dash.stander@gmail.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Ramya Ramineni <62723901+rraminen@users.noreply.github.com>
Co-authored-by: Xie Zejian <xiezej@gmail.com>
Co-authored-by: Conglong Li <conglong.li@gmail.com>
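For clarity, here is a minimal sketch of how that basename collision arises. The paths come from the log above, but the list-comprehension "builder" and the renaming remedy in the comments are illustrative assumptions, not DeepSpeed's actual op_builder code:

```python
# Illustrative sketch of the object-file collision; not DeepSpeed's builder.
import os

sources = [
    "csrc/deepspeed4science/evoformer_attn/attention.cpp",
    "csrc/deepspeed4science/evoformer_attn/attention.cu",
    "csrc/deepspeed4science/evoformer_attn/attention_back.cu",
]

# A setuptools-style build derives each object file from the source basename,
# so attention.cpp and attention.cu both map to attention.o: the second
# compile clobbers the first, and the link line then lists the surviving
# object twice, producing the "multiple definition" errors shown above.
objects = [os.path.splitext(src)[0] + ".o" for src in sources]
assert objects[0] == objects[1]  # the collision

# The conventional remedy is to give every translation unit a unique
# basename (e.g. rename attention.cpp to something like attention_cpp.cpp)
# so that each source compiles to a distinct object file.
```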
* Only communicate grad tail if it exists

  Co-authored-by: Dashiell Stander <dash.stander@gmail.com>

* Revert previous patch and just always send the grad tail

* Formatting

---------

Co-authored-by: Dashiell Stander <dash.stander@gmail.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
After the patch in #1400 for BigScience, the final element of the inputs tuple is conditional on whether its grad is `None` (https://github.com/microsoft/DeepSpeed/blob/v0.7.5/deepspeed/runtime/pipe/engine.py#L995). This will always fail if `elt.grad is None` in https://github.com/microsoft/DeepSpeed/blob/v0.7.5/deepspeed/runtime/pipe/engine.py#L1026 with an `IndexError: tuple index out of range`, because the `grad_tail` index in https://github.com/microsoft/DeepSpeed/blob/v0.7.5/deepspeed/runtime/pipe/engine.py#L1006 doesn't exist. We're hitting this issue intermittently whenever the grad tail is full of `nan`. I propose we just always include the grad tail. @dashstander @jeffra @ShadenSmith
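To make the failure mode concrete, here is a minimal runnable sketch of the sender/receiver mismatch. The function names and message layout are illustrative assumptions, not DeepSpeed's actual pipeline engine code:

```python
# Minimal sketch of the grad-tail mismatch; send_grads/recv_grads and the
# message layout are illustrative, not DeepSpeed's pipeline engine code.
import torch

def send_grads(inputs, always_send_tail):
    """Sender stage: pack input grads to communicate backward."""
    msg = [t.grad for t in inputs[:-1]]
    tail = inputs[-1]
    # Buggy path: the tail grad is appended only when it exists, so the
    # message length silently varies between microbatches.
    if always_send_tail or tail.grad is not None:
        msg.append(tail.grad)
    return tuple(msg)

def recv_grads(msg, num_inputs):
    """Receiver stage: always expects one grad slot per input, tail included."""
    return [msg[i] for i in range(num_inputs)]

inputs = [torch.ones(2, requires_grad=True) for _ in range(3)]
inputs[0].grad = torch.zeros(2)
inputs[1].grad = torch.zeros(2)
# inputs[2].grad stays None, e.g. because its loss contribution was all-nan.

try:
    recv_grads(send_grads(inputs, always_send_tail=False), len(inputs))
except IndexError as err:
    print(f"conditional send fails: {err}")  # tuple index out of range

# Proposed fix: always include the tail slot so both stages agree.
recv_grads(send_grads(inputs, always_send_tail=True), len(inputs))
```

Sending the tail unconditionally keeps the tuple layout fixed regardless of whether the grad materialized, which is why the later commit reverts the conditional check in favor of always sending the grad tail.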