Add codegen for embedding backward meta functions #2347
Conversation
This pull request was exported from Phabricator. Differential Revision: D53674518
✅ Deploy Preview for pytorch-fbgemm-docs ready!
@@ -270,8 +272,13 @@ def execute_backward_adagrad_( # noqa C901
         output_dtype=output_dtype,
     )

+    # TODO: make it compile for CPU and unweighted
+    if compile and not use_cpu and weighted:
+        cc = torch.compile(cc)
Consider adding fullgraph=True here, i.e. cc = torch.compile(cc, fullgraph=True).
#x " must have the same number of elements as " #y " They had ", \ | ||
(x).sym_numel(), \ | ||
" and ", \ | ||
(y).sym_numel()) |
A more robust version of this, which works better with unbacked symints, is:

TORCH_SYM_CHECK(x.sym_numel().sym_eq(y.sym_numel()))

For more explanation of what this is doing, see https://docs.google.com/document/d/1HSuTTVvYH1pTew89Rtpeu84Ht3nQEFTYhAX3Ypa_xJs/edit#heading=h.jqnrfurlygn5
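Applied to the macro above, the suggestion could look roughly like this (a sketch; TENSORS_HAVE_SAME_NUMEL is a hypothetical name, since the hunk only shows the macro's tail):

// Routing the comparison through TORCH_SYM_CHECK defers it via
// SymBool::expect_true, so an unbacked symint does not force a guard here.
#define TENSORS_HAVE_SAME_NUMEL(x, y)                     \
  TORCH_SYM_CHECK(                                        \
      (x).sym_numel().sym_eq((y).sym_numel()),            \
      #x " must have the same number of elements as " #y  \
      " They had ", (x).sym_numel(), " and ", (y).sym_numel())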
Force-pushed from fafe6a3 to 511192c.
This pull request was exported from Phabricator. Differential Revision: D53674518
Force-pushed from 511192c to 54c57fa.
}
if (reinterpret_cast<uint64_t>(grad_output.data_ptr()) % 16 != 0) {
  aligned_grad_output = at::empty_like(grad_output).copy_(grad_output);
}
I guess the code got moved here. So you're only running this inside of the CUDA kernel now?
One potential hazard to be aware of when doing a transform like this is if the operator you moved this logic into is differentiable. The backward in that case may have been relying on the input being guaranteed to be aligned, including the saved copy for backwards.
@@ -253,7 +265,7 @@ Tensor {{ ddesc }}_embedding_codegen_grad_indice_weights{{ vdesc }}_cuda(
   const auto total_B = offsets.size(0) - 1;
   TORCH_CHECK_GE(total_B, 0);
   TORCH_CHECK_LE(max_D, {{ max_embedding_dim }});
-  auto grad_indice_weights = empty_like(indices, indices.options().dtype(at::toAccumulateType(grad_output.scalar_type(), true)));
+  auto grad_indice_weights = empty_like(indices, indices.options().dtype(at::toAccumulateType(aligned_grad_output.scalar_type(), true)));
I'm not going to carefully audit that you updated all the downstream use sites. To make it obvious you didn't do it wrong, change the input name, and then, once you align grad output, assign it to grad_output; no diff afterwards.
grad_output is const here, so I need to create a new variable.
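A sketch of how the suggested rename can coexist with the const parameter (the function name and signature are abbreviated, and the trailing-underscore parameter name is hypothetical):

// Take the input under a different name, then bind the (possibly
// re-allocated) aligned tensor to a local named grad_output; everything
// downstream keeps referring to grad_output with no textual diff.
Tensor grad_indice_weights_cuda(
    const Tensor& grad_output_ /* , ...remaining args elided... */) {
  Tensor grad_output = grad_output_;
  if (reinterpret_cast<uint64_t>(grad_output.data_ptr()) % 16 != 0) {
    grad_output = at::empty_like(grad_output).copy_(grad_output);
  }
  // ... rest of the function unchanged ...
}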
) {

  const auto T = D_offsets.sym_size(0) - 1;
  TORCH_CHECK_GT(T, 0);
We can also potentially make these more unbacked symint friendly, but happy to leave this for later too.
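For instance, the TORCH_CHECK_GT above could hypothetically become a symbolic check (a sketch, assuming c10's TORCH_SYM_CHECK macro and SymInt::sym_gt):

// sym_gt yields a SymBool rather than forcing T to a concrete value, so an
// unbacked symint does not trigger a guard at this check.
const auto T = D_offsets.sym_size(0) - 1;
TORCH_SYM_CHECK(T.sym_gt(0), "T must be > 0, but got ", T);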
  auto grad_indice_weights = empty_like(indices, indices.options().dtype(at::toAccumulateType(grad_output.scalar_type(), true)));

  return grad_indice_weights;
I'm trusting you that this accurately reflects the original logic ;)
}
if (reinterpret_cast<uint64_t>(grad_output.data_ptr()) % 16 != 0) {
  aligned_grad_output = at::empty_like(grad_output).copy_(grad_output);
}
duped! Maybe factor this out to a helper?
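A minimal sketch of such a helper (the name align_to_16_bytes is hypothetical):

// Returns the input unchanged when its data pointer is 16-byte aligned,
// and otherwise a fresh copy, which the caching allocator will typically
// hand back suitably aligned.
static at::Tensor align_to_16_bytes(const at::Tensor& t) {
  if (reinterpret_cast<uint64_t>(t.data_ptr()) % 16 != 0) {
    return at::empty_like(t).copy_(t);
  }
  return t;
}

Each call site would then collapse to something like auto aligned_grad_output = align_to_16_bytes(grad_output);.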
{%- endif %}

// short-circuit if there are zero indices.
if (indices.sym_numel() == 0) {
Make this one size oblivious, per https://docs.google.com/document/d/1HSuTTVvYH1pTew89Rtpeu84Ht3nQEFTYhAX3Ypa_xJs/edit#heading=h.11jnmcqhq5yy
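Per the linked doc, one hypothetical way to do that is c10's TORCH_GUARD_SIZE_OBLIVIOUS macro (a sketch):

// For unbacked sizes this assumes numel() != 0 instead of guarding on it,
// so the compiled graph does not specialize on the empty case.
if (TORCH_GUARD_SIZE_OBLIVIOUS(indices.sym_numel().sym_eq(0))) {
  // ... existing early-return path ...
}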
Thanks, nice work!
Force-pushed from 54c57fa to e4399d0.
Force-pushed from e4399d0 to afc4e95.
Force-pushed from afc4e95 to a1f0136.
Force-pushed from a1f0136 to 5d4714a.
Force-pushed from 5d4714a to 4aa3883.
This pull request has been merged in 5f48fbd.
Summary:
Adding embedding backward meta codegen functions.
Moved the memory alignment handling that was outside of the CUDA kernel into the custom operator, since we couldn't write a symbolic version of the memory alignment checks on the pointers.
Tests are changed to allow compilation only on adagrad. The other tests are run to ensure they continue to work properly.
Fixes to allow compilation for unweighted kernels and for CPU are still missing; those cases are excluded from the tests.
Differential Revision: D53674518