Add NEON implementation of Fused8BitRowwiseQuantizedSBFloatToFloatOrHalf #3707
Conversation
This pull request was exported from Phabricator. Differential Revision: D69573878
✅ Deploy Preview for pytorch-fbgemm-docs ready!
This pull request has been merged in 3de6774.
Add NEON implementation of Fused8BitRowwiseQuantizedSBFloatToFloatOrHalf (pytorch#789)
Summary:
Pull Request resolved: facebookresearch/FBGEMM#789
X-link: pytorch#3707
QuantUtilsNeon.cc has been introduced, with Fused8BitRowwiseQuantizedSBFloatToFloatOrHalfNeon implemented as its first function.
We have observed a ~12x performance improvement for the downcasting case; the case where a float32_t is returned maintains the same speed.
Full results: before: P1732996851, after: P1732996401
Reviewed By: q10
Differential Revision: D69573878
fbshipit-source-id: b3d5045cf718a13356c068846a7805ff60f61d87
Summary:
QuantUtilsNeon.cc has been introduced, with Fused8BitRowwiseQuantizedSBFloatToFloatOrHalfNeon implemented as its first function.
We have observed a ~12x performance improvement for the downcasting case; the case where a float32_t is returned maintains the same speed.
Full results:
before:
P1732996851
after:
P1732996401
Differential Revision: D69573878
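
For context on what the kernel does: the fused 8-bit rowwise format stores each row's quantized uint8 values together with a float32 scale and bias, and dequantization recovers each element as `value * scale + bias`. Below is a minimal NEON sketch of that inner loop for the float32 output path. It is illustrative rather than the FBGEMM implementation: the function name, the assumed row layout (scale and bias appended after the quantized bytes), and the simple scalar tail are assumptions made for this sketch.

```cpp
#include <arm_neon.h>
#include <cstddef>
#include <cstdint>

// Sketch: dequantize one row of fused 8-bit rowwise data into float32.
// Assumed row layout: 'count' uint8 values, then a float scale and a float bias.
void dequantize_row_neon(const std::uint8_t* row, std::size_t count, float* out) {
  const float* scale_bias = reinterpret_cast<const float*>(row + count);
  const float scale = scale_bias[0];
  const float bias = scale_bias[1];

  const float32x4_t vscale = vdupq_n_f32(scale);
  const float32x4_t vbias = vdupq_n_f32(bias);

  std::size_t i = 0;
  for (; i + 8 <= count; i += 8) {
    // Load 8 quantized bytes and widen u8 -> u16 -> u32 -> f32.
    const uint8x8_t q8 = vld1_u8(row + i);
    const uint16x8_t q16 = vmovl_u8(q8);
    const float32x4_t lo = vcvtq_f32_u32(vmovl_u16(vget_low_u16(q16)));
    const float32x4_t hi = vcvtq_f32_u32(vmovl_u16(vget_high_u16(q16)));
    // out = q * scale + bias, four lanes at a time.
    vst1q_f32(out + i, vfmaq_f32(vbias, lo, vscale));
    vst1q_f32(out + i + 4, vfmaq_f32(vbias, hi, vscale));
  }
  for (; i < count; ++i) {
    // Scalar tail for row lengths that are not a multiple of 8.
    out[i] = static_cast<float>(row[i]) * scale + bias;
  }
}
```

The float16 ("downcasting") output path would additionally narrow each float32x4_t result with vcvt_f16_f32 before storing, instead of writing float32 directly.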