Refactor int4 and int8 weight only quantization to use quantize (#301)
* Replace implementation for int8 dynamic quantization with call to `quantize`
Summary:
Previously we added `quantize` as a general API (#256) for the affine quantized tensor subclass, and for tensor-subclass-based dtype conversion in general.
The plan is to use it to replace the existing quant APIs, including int4 weight only, int8 weight only, int8 dynamic quant,
and 8da4w (for executorch).
In this PR we start by replacing the implementation of the int8 dynamic quant API with a call to `quantize` using the affine quantized tensor
subclass. We'll make sure performance does not regress for the ViT model.
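For context, here is a minimal sketch of the mechanism, assuming (per #256) that `quantize` walks the module tree and swaps each `nn.Linear` weight for a tensor produced by a user-supplied apply function. `apply_int8_dynamic` and `quantize_sketch` below are illustrative stand-ins, not torchao's actual implementation; in particular, the real subclass also quantizes activations at runtime, which this sketch omits:

```python
import torch
import torch.nn as nn

def apply_int8_dynamic(weight: torch.Tensor) -> torch.Tensor:
    # Stand-in for the real apply fn: per-row symmetric int8 quantization,
    # returned here as a dequantized float tensor instead of a tensor subclass.
    scale = weight.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 127.0
    q = torch.clamp((weight / scale).round(), -127, 127).to(torch.int8)
    return (q.float() * scale).to(weight.dtype)

def quantize_sketch(model: nn.Module, apply_fn) -> nn.Module:
    # Walk the module tree and swap each Linear weight in place.
    for module in model.modules():
        if isinstance(module, nn.Linear):
            module.weight = nn.Parameter(apply_fn(module.weight.detach()),
                                         requires_grad=False)
    return model

m = quantize_sketch(nn.Sequential(nn.Linear(16, 16)), apply_int8_dynamic)
print(m(torch.randn(2, 16)).shape)  # torch.Size([2, 16])
```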
Test Plan:
TORCH_LOGS='output_code' python tutorials/quantize_vit/run_vit_b_quant.py
reference: elapsed_time: 1.4821058654785155 milliseconds
after refactor: elapsed_time: 1.4804757690429688 milliseconds
generated code diff: https://gist.github.com/jerryzh168/90c71107a5aaaa5d8dd2170c573e076d
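The elapsed_time figures above come from timing the compiled forward pass. A hedged sketch of the CUDA-event timing pattern such benchmarks typically use (the tutorial script's actual harness may differ):

```python
import torch

def benchmark_ms(fn, *args, warmup=5, iters=100):
    # Warm up so compilation / autotuning is excluded from the measurement.
    for _ in range(warmup):
        fn(*args)
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn(*args)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters  # milliseconds per iteration
```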
Reviewers:
Subscribers:
Tasks:
Tags:
* Refactor int8 weight only quant to use `quantize`
Summary:
Similar to #294, we replace the implementation of int8 weight-only quant with the newly added `quantize` function, as part of
the unification effort for affine quantization.
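The key difference from dynamic quant is that only the weight is converted, while activations stay in floating point. A hedged illustration of the weight-only compute pattern (int8 storage with a per-channel scale, dequantized at matmul time; all names are illustrative):

```python
import torch

w = torch.randn(64, 32)
scale = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 127.0  # per-output-channel scale
w_int8 = torch.clamp((w / scale).round(), -127, 127).to(torch.int8)

x = torch.randn(8, 32)                   # activation stays in fp32
y = x @ (w_int8.float() * scale).t()     # dequant-then-mm stand-in
print((y - x @ w.t()).abs().max())       # small quantization error
```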
Test Plan:
1. unit perf test:
python test/quantization/test_quant_api.py -k test_quantized_tensor_subclass_int8_wo_quant_perf
elapsed time: 0.23909856796264647, ref elapsed time: 0.25150911331176756
elapsed time: 0.24894208908081056, ref elapsed time: 0.2570047950744629
elapsed time: 0.21607391357421876, ref elapsed time: 0.22809568405151368
2. integration test:
TORCH_LOGS='output_code' python tutorials/quantize_vit/run_vit_b_quant.py
Reference: elapsed_time: 1.355208740234375 milliseconds
After refactor: elapsed_time: 1.32778857421875 milliseconds
code diff (gist): https://gist.github.com/jerryzh168/921a722cf20d476c8fc5888482e722dc
code diff (meta-only paste): https://www.internalfb.com/phabricator/paste/view/P1387333845
Reviewers:
Subscribers:
Tasks:
Tags:
* Refactor int4 weight only quantization to use `quantize`
Summary:
This is similar to #294, but applied to int4 weight-only quantization.
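For reference, a hedged sketch of the groupwise asymmetric scheme the int4 weight-only path uses: weights are quantized in small groups along the input dimension with a per-group scale and zero point. The group size of 32 and the helper names are illustrative assumptions; the real kernels additionally pack two 4-bit values per byte:

```python
import torch

def quantize_int4_groupwise(w: torch.Tensor, group_size: int = 32):
    out_f, in_f = w.shape
    g = w.reshape(out_f, in_f // group_size, group_size)
    w_min = g.amin(dim=-1, keepdim=True)
    w_max = g.amax(dim=-1, keepdim=True)
    scale = (w_max - w_min).clamp(min=1e-8) / 15.0        # 4-bit range: 0..15
    zero_point = (-w_min / scale).round()
    q = torch.clamp((g / scale + zero_point).round(), 0, 15).to(torch.uint8)
    return q, scale, zero_point

def dequantize_int4_groupwise(q, scale, zero_point, shape):
    return ((q.float() - zero_point) * scale).reshape(shape)

w = torch.randn(64, 128)
q, s, zp = quantize_int4_groupwise(w)
w_hat = dequantize_int4_groupwise(q, s, zp, w.shape)
print((w - w_hat).abs().max())  # error bounded by ~scale/2 per group
```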
Test Plan:
unit perf test:
python test/quantization/test_quant_api.py -k test_quantized_tensor_subclass_int4_wo_quant_perf
elapsed time: 0.2166275215148926, ref elapsed time: 0.2191881561279297
elapsed time: 0.2376406478881836, ref elapsed time: 0.22721023559570314
elapsed time: 0.21919679641723633, ref elapsed time: 0.2154969596862793
integration perf test:
TORCH_LOGS='output_code' python tutorials/quantize_vit/run_vit_b_quant.py
reference: elapsed_time: 2.5900126953125 milliseconds
after refactor: elapsed_time: 2.56680078125 milliseconds
generated code diff: no diff
Reviewers:
Subscribers:
Tasks:
Tags:
---------
Co-authored-by: Mark Saroufim <marksaroufim@meta.com>
Excerpt of the perf-test decorators touched by this commit:

```python
@unittest.skipIf(not TORCH_VERSION_AFTER_2_4, "Test only enabled for 2.4+")
@unittest.skipIf(not torch.cuda.is_available(), "Need CUDA available")
@unittest.skip("This perf test is supposed to be run locally for sanity check performance when there is a change of int8 dynamic quant implementation")
```