
Fix: Sparse tensors not updating #1914

Merged: 9 commits merged into microsoft:master on May 23, 2022
Conversation

@Dipet (Contributor) commented Apr 26, 2022

param.grad.data returns a new object, so the updated sparse gradients are never assigned back to the target tensor, and the optimizer ends up working with its own gradient tensor on each device.

@tjruwase (Contributor) commented

@Dipet, can you please share a bit more about this issue? In particular, what do you mean by .data returning a new object? In my experience, .data is the way to get the underlying storage object so that multiple tensors can reference the same buffer. Is this different from your experience? Thanks!

@Dipet (Contributor, Author) commented Apr 26, 2022

Yes, I am sorry, I forgot that in this PR you cannot see the second call to tensor.data; it is here: https://github.com/microsoft/DeepSpeed/blob/master/deepspeed/runtime/engine.py#L2183

When we call .data twice we get two different Python objects.
Here is a small reproducible example of the problem:

import torch

tensor = torch.tensor([0.0, 1., 2.], requires_grad=True)
tensor.sum().backward()

g = tensor.grad
print("Original gradient")
print(g)  # tensor([1., 1., 1.])

d = g.data  # .data returns a NEW Python tensor object that shares g's storage
d.data = torch.tensor([3.0, 4., 5.])  # rebinds only d's data attribute; g is untouched
print("Data is not assigned to original tensor gradient")
print(tensor.grad)  # tensor([1., 1., 1.])

g.data = torch.tensor([3.0, 4., 5.])
print("Data successfully assigned")
print(tensor.grad)  # tensor([3., 4., 5.])

@Dipet changed the title from "Fix do not updated sparse grads" to "Fix: Sparse tensors not updating" on Apr 26, 2022
@Dipet (Contributor, Author) commented Apr 26, 2022

For dense tensors we use the in-place copy_ operation to update gradients after the allreduce, so we do not hit this problem there.
I'm not sure we can use the same logic for sparse tensors. Also, the copy_ method is not the fastest approach when we could simply swap the two objects.
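
For illustration, here is a minimal sketch (plain PyTorch, not DeepSpeed code) of why the in-place copy_ path never exposes the problem: copy_ writes into the storage that param.grad already references, whereas rebinding .data on the object returned by a fresh .data access only rebinds that temporary object.

import torch

param = torch.tensor([0.0, 1., 2.], requires_grad=True)
param.sum().backward()

# in-place write into the shared storage: the update is visible through param.grad
param.grad.data.copy_(torch.tensor([3.0, 4., 5.]))
print(param.grad)  # tensor([3., 4., 5.])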

@Dipet (Contributor, Author) commented Apr 26, 2022

OK, I checked the in-place copy_ with a sparse tensor, but this approach also requires removing the .data call, because I get this error:

    tensor.orig_dense_tensor.copy_(tensor.to_coo_tensor())
RuntimeError: resize_ is not allowed on a Tensor created from .data or .detach().
If your intent is to change the metadata of a Tensor (such as sizes / strides / storage / storage_offset)
without autograd tracking the change, remove the .data / .detach() call and wrap the change in a `with torch.no_grad():` block.
For example, change:
    x.data.set_(y)
to:
    with torch.no_grad():
        x.set_(y)
python-BaseException
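
As an aside, the workaround that this error message points to looks roughly like the sketch below (generic tensors made up for illustration, not the PR code): drop the .data access and make the change under torch.no_grad() instead.

import torch

x = torch.tensor([0.0, 1., 2.], requires_grad=True)
y = torch.tensor([3.0, 4., 5.])

# instead of x.data.set_(y), apply the change to x itself without autograd tracking
with torch.no_grad():
    x.set_(y)

print(x)  # x now holds y's values and shares y's storage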

@tjruwase (Contributor) commented

@Dipet, thanks for sharing this context. I have not worked much with sparse tensors, which is probably why I have not run into this issue. My suggestion would be to implement this logic only for the sparse tensor code path, so that the dense tensor code paths remain unchanged for backwards compatibility. I am happy to help brainstorm on this.

FYI @jeffra, who has worked more on sparse tensors.

@Dipet (Contributor, Author) commented Apr 27, 2022

@jeffra The tests failed, but it looks like a hardware problem, since the failures occur on the nvcc call.

@Dipet (Contributor, Author) commented May 16, 2022

@jeffra Can you look at the PR?

@Dipet (Contributor, Author) commented May 19, 2022

Guys, could you take a look at this PR?

@tjruwase (Contributor) commented

@Dipet, apologies for the delay. I had suggested restricting this change only to sparse tensors so that dense tensor code paths remain backwards compatible. Is there a problem with doing that?

@Dipet (Contributor, Author) commented May 23, 2022

@tjruwase I changed the logic for sparse tensors only; dense tensors should work the same as before.
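
For reference, a hypothetical sketch of what restricting the change to the sparse path could look like (illustrative names, not the actual PR code): dense gradients keep the in-place copy_, while a sparse all-reduced gradient is assigned back to param.grad directly rather than through a .data reference.

import torch

def apply_allreduced_grad(param, reduced_grad):
    # hypothetical helper, for illustration only
    if reduced_grad.is_sparse:
        # rebinding param.grad itself makes the averaged sparse gradient
        # visible to the optimizer on every device
        param.grad = reduced_grad
    else:
        # dense path unchanged: write in place into the existing gradient storage
        param.grad.data.copy_(reduced_grad)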

@tjruwase (Contributor) left a comment

Thanks!

@tjruwase merged commit b8ff482 into microsoft:master on May 23, 2022