
Fix: Sparse tensors not updating #1914

Merged: 9 commits merged into microsoft:master on May 23, 2022
Conversation

@Dipet (Contributor) commented Apr 26, 2022

param.grad.data returns a new object, so the updated sparse gradients are never assigned back to the target tensor, and the optimizer ends up working with its own gradient tensor on each device.

@tjruwase (Contributor) commented

@Dipet, can you please share a bit more about this issue? In particular, what do you mean by .data returning a new object? In my experience, .data is the way to get the underlying storage object so that multiple tensors can reference the same buffer. Is this different from your experience? Thanks!

@Dipet (Contributor, Author) commented Apr 26, 2022

Yes, I am sorry, I forgot that in this PR you cannot see the second call to tensor.data; it is here: https://github.com/microsoft/DeepSpeed/blob/master/deepspeed/runtime/engine.py#L2183

When we call .data twice we get two different Python objects.
Here is a small reproducible example of the problem:

import torch

tensor = torch.tensor([0.0, 1., 2.], requires_grad=True)
tensor.sum().backward()

g = tensor.grad
print("Original gradient")
print(g)  # tensor([1., 1., 1.])

d = g.data  # .data returns a NEW Python tensor object that shares g's storage
d.data = torch.tensor([3.0, 4., 5.])  # rebinds only d's data attribute; g is untouched
print("Data is not assigned to original tensor gradient")
print(tensor.grad)  # tensor([1., 1., 1.])

g.data = torch.tensor([3.0, 4., 5.])
print("Data successfully assigned")
print(tensor.grad)  # tensor([3., 4., 5.])

@Dipet changed the title from "Fix do not updated sparse grads" to "Fix: Sparse tensors not updating" on Apr 26, 2022
@Dipet (Contributor, Author) commented Apr 26, 2022

For dense tensors we use the in-place copy_ operation to update gradients after the allreduce, so we do not hit this problem there.
I'm not sure we can use the same logic for sparse tensors. Also, the copy_ method is not the fastest approach when we could simply swap the two objects.
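
For illustration, here is a minimal sketch (plain PyTorch, not DeepSpeed code) of why the in-place copy_ path never exposes the problem: copy_ writes into the storage that param.grad already references, whereas rebinding .data on the object returned by a fresh .data access only rebinds that temporary object.

import torch

param = torch.tensor([0.0, 1., 2.], requires_grad=True)
param.sum().backward()

# in-place write into the shared storage: the update is visible through param.grad
param.grad.data.copy_(torch.tensor([3.0, 4., 5.]))
print(param.grad)  # tensor([3., 4., 5.])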

@Dipet (Contributor, Author) commented Apr 26, 2022

OK, I checked the in-place copy_ with a sparse tensor, but this approach also requires removing the .data call, because I get this error:

    tensor.orig_dense_tensor.copy_(tensor.to_coo_tensor())
RuntimeError: resize_ is not allowed on a Tensor created from .data or .detach().
If your intent is to change the metadata of a Tensor (such as sizes / strides / storage / storage_offset)
without autograd tracking the change, remove the .data / .detach() call and wrap the change in a `with torch.no_grad():` block.
For example, change:
    x.data.set_(y)
to:
    with torch.no_grad():
        x.set_(y)
python-BaseException
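
As an aside, the workaround that this error message points to looks roughly like the sketch below (generic tensors made up for illustration, not the PR code): drop the .data access and make the change under torch.no_grad() instead.

import torch

x = torch.tensor([0.0, 1., 2.], requires_grad=True)
y = torch.tensor([3.0, 4., 5.])

# instead of x.data.set_(y), apply the change to x itself without autograd tracking
with torch.no_grad():
    x.set_(y)

print(x)  # x now holds y's values and shares y's storage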

@tjruwase (Contributor) commented

@Dipet, thanks for sharing this context. I have not worked much with sparse tensors, which is probably why I have not run into this issue. My suggestion would be to implement this logic only for the sparse tensor code path, so that the dense tensor code paths remain unchanged for backwards compatibility. I am happy to help brainstorm on this.

FYI @jeffra, who has worked more on sparse tensors.

@Dipet (Contributor, Author) commented Apr 27, 2022

@jeffra The tests failed, but it looks like a hardware problem, since the failures occur on the nvcc call.

@Dipet (Contributor, Author) commented May 16, 2022

@jeffra Can you look at the PR?

@Dipet (Contributor, Author) commented May 19, 2022

Guys, could you take a look at this PR?

@tjruwase (Contributor) commented

@Dipet, apologies for the delay. I had suggested restricting this change only to sparse tensors so that dense tensor code paths remain backwards compatible. Is there a problem with doing that?

@Dipet (Contributor, Author) commented May 23, 2022

@tjruwase I changed the logic for sparse tensors only; dense tensors should work the same as before.
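
For reference, a hypothetical sketch of what restricting the change to the sparse path could look like (illustrative names, not the actual PR code): dense gradients keep the in-place copy_, while a sparse all-reduced gradient is assigned back to param.grad directly rather than through a .data reference.

import torch

def apply_allreduced_grad(param, reduced_grad):
    # hypothetical helper, for illustration only
    if reduced_grad.is_sparse:
        # rebinding param.grad itself makes the averaged sparse gradient
        # visible to the optimizer on every device
        param.grad = reduced_grad
    else:
        # dense path unchanged: write in place into the existing gradient storage
        param.grad.data.copy_(reduced_grad)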

@tjruwase (Contributor) left a comment

Thanks!

@tjruwase merged commit b8ff482 into microsoft:master on May 23, 2022