Commit 4ed3b60 (1 parent: f40a484)

Authored by: Liangliang-Ma, Quentin-Anthony, dashstander, tjruwase, loadams

update ccl.py for error type (#24)

* Remove PP Grad Tail Check (deepspeedai#2538)

  * Only communicate grad tail if it exists

    Co-authored-by: Dashiell Stander <dash.stander@gmail.com>

  * Revert previous patch and just always send the grad tail

  * Formatting

  ---------

  Co-authored-by: Dashiell Stander <dash.stander@gmail.com>
  Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
  Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>

* Added `__HIP_PLATFORM_AMD__=1` (deepspeedai#4570)

* Fix multiple definition while building evoformer (deepspeedai#4556)

  The current evoformer builder uses the same base name for `attention.cpp` and `attention.cu`, so both compile to the same intermediate file `attention.o`:

  ```shell
  march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O2 -ffunction-sections -pipe -isystem /home/zejianxie/.conda/envs/dll/include -DNDEBUG -D_FORTIFY_SOURCE=2 -O2 -isystem /home/zejianxie/.conda/envs/dll/include build/temp.linux-x86_64-cpython-310/csrc/deepspeed4science/evoformer_attn/attention.o build/temp.linux-x86_64-cpython-310/csrc/deepspeed4science/evoformer_attn/attention.o build/temp.linux-x86_64-cpython-310/csrc/deepspeed4science/evoformer_attn/attention_back.o
  ```

  and

  ```
  `attention_impl(at::Tensor&, at::Tensor&, at::Tensor&, at::Tensor&, at::Tensor&, at::Tensor&, at::Tensor&)':
  tmpxft_0012bef1_00000000-6_attention.compute_86.cudafe1.cpp:(.text+0x330): multiple definition of `attention_impl(at::Tensor&, at::Tensor&, at::Tensor&, at::Tensor&, at::Tensor&, at::Tensor&, at::Tensor&)';
  build/temp.linux-x86_64-cpython-310/csrc/deepspeed4science/evoformer_attn/attention.o:tmpxft_0012bef1_00000000-6_attention.compute_86.cudafe1.cpp:(.text+0x330): first defined here
  /home/zejianxie/.conda/envs/dll/bin/../lib/gcc/x86_64-conda-linux-gnu/11.4.0/../../../../x86_64-conda-linux-gnu/bin/ld: build/temp.linux-x86_64-cpython-310/csrc/deepspeed4science/evoformer_attn/attention.o:(.bss+0x0): multiple definition of `torch::autograd::(anonymous namespace)::graph_task_id';
  build/temp.linux-x86_64-cpython-310/csrc/deepspeed4science/evoformer_attn/attention.o:(.bss+0x0): first defined here
  ```

  The following reproduces the issue and confirms the fix:

  ```shell
  git clone https://github.com/NVIDIA/cutlass --depth 1
  CUTLASS_PATH=$PWD/cutlass DS_BUILD_EVOFORMER_ATTN=1 pip install ./DeepSpeed --global-option="build_ext"
  ```

  ![image](https://github.com/microsoft/DeepSpeed/assets/41792945/9e406b37-330c-431c-8bf9-6be378dee4ff)

  Co-authored-by: Conglong Li <conglong.li@gmail.com>

* Update ccl.py

---------

Co-authored-by: Quentin Anthony <qganthony@yahoo.com>
Co-authored-by: Dashiell Stander <dash.stander@gmail.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Ramya Ramineni <62723901+rraminen@users.noreply.github.com>
Co-authored-by: Xie Zejian <xiezej@gmail.com>
Co-authored-by: Conglong Li <conglong.li@gmail.com>

File tree: 5 files changed (+6, −3 lines)

deepspeed/comm/ccl.py (1 addition & 1 deletion)

```diff
@@ -170,7 +170,7 @@ def get_all_ranks_from_group(self, group):
             while True:
                 results.append(super(CCLBackend, self).get_global_rank(group, rank))
                 rank += 1
-        except ValueError:
+        except (ValueError, RuntimeError):
             pass
         if tuple(results) not in self.groups:
             self._new_group(results, group)
```
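
The widened `except` matters because this method discovers a group's membership by probing ranks until the lookup fails; per the commit title, the lookup can presumably raise `RuntimeError` as well as `ValueError`, and an uncaught type would escape the loop. A minimal sketch of the pattern, with a hypothetical `backend` object standing in for `CCLBackend`:

```python
# Minimal sketch of the rank-probing loop patched above; `backend` is a
# hypothetical stand-in for CCLBackend. Ranks are enumerated until the
# global-rank lookup fails, so every exception type that can signal
# "rank out of range" must be caught here.
def get_all_ranks_from_group(backend, group):
    rank, results = 0, []
    try:
        while True:
            results.append(backend.get_global_rank(group, rank))
            rank += 1
    except (ValueError, RuntimeError):  # either type may mark the end
        pass
    return results
```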

deepspeed/runtime/pipe/engine.py (1 addition & 1 deletion)

```diff
@@ -988,7 +988,7 @@ def _exec_send_grads(self, buffer_id):
         if isinstance(inputs, tuple):
             first_input = inputs[0]
             assert all([torch.is_tensor(elt) for elt in inputs[1:]])
-            inputs_grad_tail = [elt.grad for elt in inputs[1:] if elt.grad is not None]
+            inputs_grad_tail = [elt.grad for elt in inputs[1:]]
         elif torch.is_tensor(inputs):
             first_input = inputs
             inputs_grad_tail = []
```
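
Per the "always send the grad tail" revert, the `is not None` filter is removed so the tail keeps one entry per trailing input, presumably so sender and receiver agree on how many tensors make up the tail even when some inputs received no gradient. A toy illustration of the difference between the two comprehensions:

```python
import torch

# One input participates in the backward pass, one does not, so only the
# first ends up with a populated .grad attribute.
a = torch.ones(2, requires_grad=True)
b = torch.ones(2)
(a.sum() + b.sum()).backward()

inputs = (torch.ones(2), a, b)
old_tail = [elt.grad for elt in inputs[1:] if elt.grad is not None]  # length varies
new_tail = [elt.grad for elt in inputs[1:]]  # fixed length, may contain None
print(len(old_tail), len(new_tail))  # 1 2
```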

op_builder/builder.py (3 additions & 0 deletions)

```diff
@@ -486,6 +486,9 @@ def jit_load(self, verbose=True):
             cxx_args.append("-DBF16_AVAILABLE")
             nvcc_args.append("-DBF16_AVAILABLE")
 
+        if self.is_rocm_pytorch():
+            cxx_args.append("-D__HIP_PLATFORM_AMD__=1")
+
         op_module = load(name=self.name,
                          sources=self.strip_empty_entries(sources),
                          extra_include_paths=self.strip_empty_entries(extra_include_paths),
```
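`is_rocm_pytorch()` is an existing `OpBuilder` helper; a sketch of roughly the kind of check it performs, assuming detection via `torch.version.hip` (set on ROCm wheels, `None` on CUDA builds):

```python
import torch

# Sketch (an assumption about OpBuilder.is_rocm_pytorch, not its exact code):
# ROCm builds of PyTorch populate torch.version.hip.
def is_rocm_pytorch() -> bool:
    return getattr(torch.version, "hip", None) is not None

cxx_args = []
if is_rocm_pytorch():
    # HIP headers use __HIP_PLATFORM_AMD__ to select the AMD code paths,
    # so host (C++) compilation needs the macro defined explicitly.
    cxx_args.append("-D__HIP_PLATFORM_AMD__=1")
```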

op_builder/evoformer_attn.py (1 addition & 1 deletion)

```diff
@@ -27,7 +27,7 @@ def extra_ldflags(self):
 
     def sources(self):
         src_dir = 'csrc/deepspeed4science/evoformer_attn'
-        return [f'{src_dir}/attention.cpp', f'{src_dir}/attention_back.cu', f'{src_dir}/attention.cu']
+        return [f'{src_dir}/attention.cpp', f'{src_dir}/attention_back.cu', f'{src_dir}/attention_cu.cu']
 
     def nvcc_args(self):
         args = super().nvcc_args()
```
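
The rename works because, as the link command in the commit message shows, object file names are derived from the source stems: `attention.cpp` and `attention.cu` both produced `attention.o` (it appears twice in that command), so the linker saw every symbol defined twice. A quick sanity check of the new source list:

```python
from pathlib import Path

# With attention.cu renamed to attention_cu.cu, every source stem (and hence
# every derived object file name) is unique, so the duplicate-definition
# link errors quoted in the commit message cannot recur.
sources = ['attention.cpp', 'attention_back.cu', 'attention_cu.cu']
objects = [Path(src).stem + '.o' for src in sources]
assert len(set(objects)) == len(objects), 'object file names collide'
print(objects)  # ['attention.o', 'attention_back.o', 'attention_cu.o']
```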
