# TorchAO 0.1.0: First Release

## Highlights
We’re excited to announce the release of TorchAO v0.1.0! TorchAO is a repository that hosts architecture optimization techniques such as quantization and sparsity, along with performance kernels for backends such as CUDA and CPU. This release adds support for several quantization techniques, including int4 weight-only GPTQ quantization, an `nf4` dtype for QLoRA, sparsity features such as `WandaSparsifier`, and an autotuner that can tune Triton integer matrix multiplication kernels on CUDA.
Note: TorchAO is currently in a pre-release state and under extensive development, so the public APIs should not be considered stable. We nonetheless welcome you to try out our APIs and offerings and share feedback on your experience.
torchao 0.1.0 is compatible with PyTorch 2.2.2 and 2.3.0, ExecuTorch 0.2.0, and TorchTune 0.1.0.
## New Features

### Quantization
- Added tensor subclass based quantization APIs: `change_linear_weights_to_int8_dqtensors`, `change_linear_weights_to_int8_woqtensors`, and `change_linear_weights_to_int4_woqtensors` (#1)
- Added module based quantization APIs for int8 dynamic and weight-only quantization: `apply_weight_only_int8_quant` and `apply_dynamic_quant` (#1)
- Added a module swap version of int4 weight-only quantization: `Int4WeightOnlyQuantizer` and `Int4WeightOnlyGPTQQuantizer`, used in TorchTune (#119, #116)
- Added int8 dynamic activation and int4 weight quantization: `Int8DynActInt4WeightQuantizer` and `Int8DynActInt4WeightGPTQQuantizer`, used in ExecuTorch (#74) (available with torch 2.3.0 and later)
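For readers new to these techniques, here is a minimal pure-Python sketch of the idea behind int8 weight-only quantization (symmetric per-row quantization with a float scale, dequantized on the fly). It is illustrative only and is not torchao's implementation:

```python
def quantize_int8(row):
    """Symmetric per-row int8 quantization: returns (int8 values, scale).

    Illustrative sketch only -- not torchao's implementation.
    """
    # Scale maps the largest-magnitude weight to the int8 extreme 127.
    scale = max(abs(v) for v in row) / 127 or 1.0
    qrow = [max(-128, min(127, round(v / scale))) for v in row]
    return qrow, scale

def dequantize_int8(qrow, scale):
    """Recover approximate float weights from int8 values and the scale."""
    return [q * scale for q in qrow]

weights = [0.5, -1.0, 0.25, 0.75]
q, s = quantize_int8(weights)
recon = dequantize_int8(q, s)
```

Weight-only schemes like this keep activations in floating point and only compress the weights, so the per-element reconstruction error is bounded by half the scale.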
### Sparsity
- Added `WandaSparsifier` that prunes both weights and activations (#22)
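The Wanda criterion scores each weight by the product of its magnitude and the norm of its input activation, then zeroes the lowest-scoring weights. A minimal pure-Python sketch of that scoring rule (illustrative only, not the `WandaSparsifier` implementation; the function name here is hypothetical):

```python
def wanda_prune_row(weights, act_norms, sparsity):
    """Zero the fraction `sparsity` of weights with the lowest
    |weight| * activation-norm score (the Wanda criterion).

    Hypothetical helper for illustration, not torchao's code.
    """
    # Score each weight by its magnitude times its input activation norm.
    scores = [abs(w) * n for w, n in zip(weights, act_norms)]
    k = int(len(weights) * sparsity)  # how many weights to zero
    # Indices of the k lowest-scoring weights.
    prune_idx = set(sorted(range(len(weights)), key=lambda i: scores[i])[:k])
    return [0.0 if i in prune_idx else w for i, w in enumerate(weights)]
```

Note how a large weight with a near-zero input activation can still be pruned, which is what distinguishes Wanda from plain magnitude pruning.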
### Kernels
- Added `autotuner` for int mm Triton kernels (#41)
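Conceptually, an autotuner benchmarks a kernel under each candidate configuration and keeps the fastest one. A toy pure-Python sketch of that loop (the function and its signature are hypothetical; the real autotuner targets Triton integer matmul kernels):

```python
import time

def autotune(kernel, configs, args, warmup=1, reps=3):
    """Return the config with the lowest measured runtime for
    kernel(*args, config). Hypothetical sketch for illustration.
    """
    best_cfg, best_t = None, float("inf")
    for cfg in configs:
        # Warm up so one-time costs don't skew the measurement.
        for _ in range(warmup):
            kernel(*args, cfg)
        t0 = time.perf_counter()
        for _ in range(reps):
            kernel(*args, cfg)
        elapsed = (time.perf_counter() - t0) / reps
        if elapsed < best_t:
            best_cfg, best_t = cfg, elapsed
    return best_cfg
```

In practice the chosen config is cached per problem shape so the benchmarking cost is paid only once.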
### dtypes
- Added `nf4` dtype, used in QLoRA
## Improvements
- Set up a GitHub workflow for regression testing (#50)
- Set up a GitHub workflow for `torchao-nightly` releases (#54)
## Documentation
- Added a tutorial for quantizing a vision transformer model (#60)
- Added a tutorial on how to add an op for `nf4` tensors (#54)
## Notes
- We are still debugging an accuracy problem with `Int8DynActInt4WeightGPTQQuantizer`
- Save and load does not yet work well with the tensor subclass based APIs
- We will consolidate the tensor subclass and module swap based quantization APIs later
- The `uint4` tensor subclass will be merged into PyTorch core in the future
- Quantization ops in `quant_primitives.py` will be deduplicated with similar quantize/dequantize ops in PyTorch later