int8 output for seq embeddings #2316
This pull request was exported from Phabricator. Differential Revision: D53449813
Summary:
* int8 output dtype is a gap for recent FBGEMM use cases. Set up a reasonable reference implementation first, memcpy-based.
* For sequence embeddings, we first unblock dispatch via a simple memcpy. It is a pure bandwidth op (no dequantization), so memcpy should be reasonably OK. Further optimizations, such as ILP via unrolling, AVX non-temporal instructions, and rep instructions, are left for future iterations.

Differential Revision: D53449813
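To illustrate the memcpy-based approach described in the summary, here is a minimal sketch of a sequence lookup that copies quantized rows to the output without dequantizing them. This is not the code from this PR: the function name `gather_rows_int8`, the flat `table` buffer, and the fixed `row_bytes` row layout are all assumptions made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical helper, not FBGEMM's API.
// Assumed layout: `table` stores rows back to back, each `row_bytes` long.
// For a sequence lookup there is no pooling, so each index yields one
// output row, and int8 output means the row bytes are copied unchanged.
std::vector<std::uint8_t> gather_rows_int8(
    const std::vector<std::uint8_t>& table,
    std::size_t row_bytes,
    const std::vector<std::int64_t>& indices) {
  assert(row_bytes > 0 && table.size() % row_bytes == 0);
  const std::size_t num_rows = table.size() / row_bytes;

  std::vector<std::uint8_t> out(indices.size() * row_bytes);
  for (std::size_t i = 0; i < indices.size(); ++i) {
    const std::int64_t idx = indices[i];
    assert(idx >= 0 && static_cast<std::size_t>(idx) < num_rows);
    // Pure byte copy: no dequantization, so the cost is memory bandwidth.
    std::memcpy(
        out.data() + i * row_bytes,
        table.data() + static_cast<std::size_t>(idx) * row_bytes,
        row_bytes);
  }
  return out;
}
```

Calling `gather_rows_int8(table, 2, {2, 0, 2})` on a three-row table returns rows 2, 0, and 2 concatenated in that order. The future optimizations mentioned in the summary (unrolling, non-temporal stores, rep instructions) would replace the inner `std::memcpy` call while keeping the same per-index structure.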
2d86f0d to 5b5027b
5b5027b to 89445b1
89445b1 to ea763e9
ea763e9 to 816faa1
This pull request has been merged in af41af1.