
Conversation

@Quentin-Anthony
Contributor

After the patch in #1400 for BigScience, the final element of the inputs tuple is only communicated conditionally, based on whether its grad is None (https://github.com/microsoft/DeepSpeed/blob/v0.7.5/deepspeed/runtime/pipe/engine.py#L995).

If `elt.grad` is None, the receive side (https://github.com/microsoft/DeepSpeed/blob/v0.7.5/deepspeed/runtime/pipe/engine.py#L1026) will always fail with an `IndexError: tuple index out of range`, because the grad tail indexed at (https://github.com/microsoft/DeepSpeed/blob/v0.7.5/deepspeed/runtime/pipe/engine.py#L1006) doesn't exist.

We're hitting this issue intermittently whenever the grad tail is full of NaNs. I propose we just always include the grad tail.
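To make the mismatch concrete, here's a simplified sketch of the send/receive logic; it's an illustration of the pattern linked above, not the literal `pipe/engine.py` code, and the `zeros_like` fallback in the proposed variant is just one possible way to keep the tuple lengths consistent:

```python
import torch

def send_grads_sketch(inputs):
    # Simplified sketch of the v0.7.5 sender behavior (illustrative,
    # not the literal engine code): after the #1400 patch, the tail
    # tensor's grad is dropped whenever it is None.
    if inputs[-1].grad is None:
        return tuple(t.grad for t in inputs[:-1])  # tail silently omitted
    return tuple(t.grad for t in inputs)

def recv_grads_sketch(inputs, grads):
    # The receive side always indexes the tail position, so whenever
    # the sender omitted it this raises
    # IndexError: tuple index out of range.
    return grads[len(inputs) - 1]

def send_grads_proposed(inputs):
    # Proposed behavior: always communicate the grad tail so both
    # ranks agree on how many tensors are exchanged. (Substituting
    # zeros for a missing grad is my illustration here, not
    # necessarily the exact patch.)
    return tuple(
        t.grad if t.grad is not None else torch.zeros_like(t)
        for t in inputs
    )
```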

@dashstander @jeffra @ShadenSmith

Quentin-Anthony added a commit to EleutherAI/DeeperSpeed that referenced this pull request Mar 9, 2023
Remove PP Grad Tail Check (until deepspeedai#2538 is merged to upstream)
loadams self-assigned this Aug 22, 2023
@loadams
Collaborator

loadams commented Sep 1, 2023

Hi @Quentin-Anthony - is this an active PR you'd still like to see merged? Just going through some older PRs and trying to see if we should review/prioritize or close.

@Quentin-Anthony
Contributor Author

> Hi @Quentin-Anthony - is this an active PR you'd still like to see merged? Just going through some older PRs and trying to see if we should review/prioritize or close.

Yes, I'm still interested! I'll take another look sometime over the next few days and decide how best to close this out.

@loadams
Collaborator

loadams commented Sep 1, 2023

> Hi @Quentin-Anthony - is this an active PR you'd still like to see merged? Just going through some older PRs and trying to see if we should review/prioritize or close.
>
> Yes, I'm still interested! I'll take another look sometime over the next few days and decide how best to close this out.

Thanks! Aside from one currently known failure, it's passing all tests, but we'll want to figure out who would be best to review.

@RUAN-ZX
Contributor

RUAN-ZX commented Oct 26, 2023

Hi, I've run into the same problem. I wonder if we can move forward and merge this solution ASAP?

loadams enabled auto-merge Oct 26, 2023 16:06
loadams added this pull request to the merge queue Oct 26, 2023
Merged via the queue into deepspeedai:master with commit 3589cad Oct 26, 2023
delock pushed a commit to delock/DeepSpeedSYCLSupport that referenced this pull request Oct 30, 2023
* Remove PP Grad Tail Check (deepspeedai#2538)

* Only communicate grad tail if it exists

Co-authored-by: Dashiell Stander <dash.stander@gmail.com>

* Revert previous patch and just always send the grad tail

* Formatting

---------

Co-authored-by: Dashiell Stander <dash.stander@gmail.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>

* Added __HIP_PLATFORM_AMD__=1 (deepspeedai#4570)

* fix multiple definition while building evoformer (deepspeedai#4556)

The current evoformer builder uses the same name for `attention.cpp` and
`attention.cu`, leading to the same intermediate filename `attention.o`:
```shell
march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O2 -ffunction-sections -pipe -
isystem /home/zejianxie/.conda/envs/dll/include -DNDEBUG -D_FORTIFY_SOURCE=2 -O2 -isystem 
/home/zejianxie/.conda/envs/dll/include build/temp.linux-x86_64-cpython-
310/csrc/deepspeed4science/evoformer_attn/attention.o build/temp.linux-x86_64-cpython-
310/csrc/deepspeed4science/evoformer_attn/attention.o build/temp.linux-x86_64-cpython-
310/csrc/deepspeed4science/evoformer_attn/attention_back.o
```
and
```
`attention_impl(at::Tensor&, at::Tensor&, at::Tensor&, at::Tensor&, at::Tensor&, at::Tensor&, at::Tensor&)':
      tmpxft_0012bef1_00000000-6_attention.compute_86.cudafe1.cpp:(.text+0x330): multiple definition of `attention_impl(at::Tensor&, at::Tensor&, at::Tensor&, at::Tensor&, at::Tensor&, at::Tensor&, at::Tensor&)'; build/temp.linux-x86_64-cpython-310/csrc/deepspeed4science/evoformer_attn/attention.o:tmpxft_0012bef1_00000000-6_attention.compute_86.cudafe1.cpp:(.text+0x330): first defined here
      /home/zejianxie/.conda/envs/dll/bin/../lib/gcc/x86_64-conda-linux-gnu/11.4.0/../../../../x86_64-conda-linux-gnu/bin/ld: build/temp.linux-x86_64-cpython-310/csrc/deepspeed4science/evoformer_attn/attention.o:(.bss+0x0): multiple definition of `torch::autograd::(anonymous namespace)::graph_task_id'; build/temp.linux-x86_64-cpython-310/csrc/deepspeed4science/evoformer_attn/attention.o:(.bss+0x0): first defined here
```

I used the following to reproduce the issue and confirm my fix works:
```
git clone https://github.com/NVIDIA/cutlass --depth 1
CUTLASS_PATH=$PWD/cutlass DS_BUILD_EVOFORMER_ATTN=1 pip install ./DeepSpeed --global-option="build_ext"
```

![image](https://github.com/microsoft/DeepSpeed/assets/41792945/9e406b37-330c-431c-8bf9-6be378dee4ff)

Co-authored-by: Conglong Li <conglong.li@gmail.com>

* Update ccl.py

---------

Co-authored-by: Quentin Anthony <qganthony@yahoo.com>
Co-authored-by: Dashiell Stander <dash.stander@gmail.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Ramya Ramineni <62723901+rraminen@users.noreply.github.com>
Co-authored-by: Xie Zejian <xiezej@gmail.com>
Co-authored-by: Conglong Li <conglong.li@gmail.com>
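
For readers hitting the evoformer collision described above: the root cause is two translation units compiling to the same `attention.o`. Below is a minimal sketch of the kind of builder change that avoids it; the class name and renamed file are illustrative assumptions, not the actual DeepSpeed `op_builder` code.

```python
# Hypothetical sketch (not the actual DeepSpeed op_builder code):
# giving each translation unit a distinct basename yields distinct
# intermediate object files, so the link step no longer sees two
# attention.o files defining the same symbols.
class EvoformerAttnBuilderSketch:
    def sources(self):
        return [
            "csrc/deepspeed4science/evoformer_attn/attention.cpp",
            # Renamed from attention.cu so its object file no longer
            # collides with the one produced from attention.cpp.
            "csrc/deepspeed4science/evoformer_attn/attention_cuda.cu",
            "csrc/deepspeed4science/evoformer_attn/attention_back.cu",
        ]
```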
baodii pushed a commit to baodii/DeepSpeed that referenced this pull request Nov 7, 2023
mauryaavinash95 pushed a commit to mauryaavinash95/DeepSpeed that referenced this pull request Feb 17, 2024
