diff --git a/contents/optimizations/optimizations.qmd b/contents/optimizations/optimizations.qmd
index d331fdfe..f9c865bc 100644
--- a/contents/optimizations/optimizations.qmd
+++ b/contents/optimizations/optimizations.qmd
@@ -535,7 +535,7 @@ In the quantized version shown in @fig-quantized-sine-wave, the continuous sine
 
 ![Quantized Sine Wave.](images/png/efficientnumerics_quantizedsine.png){#fig-quantized-sine-wave}
 
-Returning to the context of Machine Learning (ML), quantization refers to the process of constraining the possible values that numerical parameters (such as weights and biases) can take to a discrete set, thereby reducing the precision of the parameters and consequently, the model's memory footprint. When properly implemented, quantization can reduce model size by up to 4x and improve inference latency and throughput by up to 2-3x. @fig-quantized-models-size illustrates the impact that quantization has on different models' sizes: for example, an Image Classification model like ResNet-v2 can be compressed from 180MB down to 45MB with 8-bit quantization. There is typically less than 1% loss in model accuracy from well tuned quantization. Accuracy can often be recovered by re-training the quantized model with quantization aware training techniques. Therefore, this technique has emerged to be very important in deploying ML models to resource-constrained environments, such as mobile devices, IoT devices, and edge computing platforms, where computational resources (memory and processing power) are limited.
+Returning to the context of Machine Learning (ML), quantization refers to the process of constraining the possible values that numerical parameters (such as weights and biases) can take to a discrete set, thereby reducing the precision of the parameters and, consequently, the model's memory footprint. When properly implemented, quantization can reduce model size by up to 4x and improve inference latency and throughput by up to 2-3x. @fig-quantized-models-size illustrates the impact that quantization has on different models' sizes: for example, an Image Classification model like ResNet-v2 can be compressed from 180MB down to 45MB with 8-bit quantization. There is typically less than 1% loss in model accuracy from well-tuned quantization. Accuracy can often be recovered by re-training the quantized model with quantization-aware training techniques. Therefore, this technique has become very important for deploying ML models to resource-constrained environments, such as mobile devices, IoT devices, and edge computing platforms, where computational resources (memory and processing power) are limited.
 
 ![Effect of quantization on model sizes. Credit: HarvardX.](images/png/efficientnumerics_reducedmodelsize.png){#fig-quantized-models-size}
 
@@ -579,7 +579,7 @@ Furthermore, learnable quantizers can be jointly trained with model parameters,
 
 #### Stochastic Quantization
 
-Unlike the two previous approaches which generate deterministic mappings, there is some work exploring the idea of stochastic quantization for quantization aware training and reduced precision training. This approach maps floating numbers up or down with a probability associated to the magnitude of the weight update. The hope generated by high level intuition is that such a probabilistic approach may allow a neural network to explore more, as compared to deterministic quantization. Supposedly, enabling a stochastic rounding may allow neural networks to escape local optimums, thereby updating its parameters. Below are two example stochastic mapping functions:
+Unlike the two previous approaches, which generate deterministic mappings, there is some work exploring the idea of stochastic quantization for quantization-aware training and reduced-precision training. This approach maps floating-point numbers up or down with a probability associated with the magnitude of the weight update. The high-level intuition is that such a probabilistic approach may allow a neural network to explore more than deterministic quantization does. Supposedly, enabling stochastic rounding may allow neural networks to escape local optima and continue updating their parameters. Below are two example stochastic mapping functions:
 
 ![](images/png/efficientnumerics_nonuniform.png)
 
@@ -641,7 +641,7 @@ Between the two, calculating the range dynamically usually is very costly, so mo
 
 ### Techniques
 
-The two prevailing techniques for quantizing models are Post Training Quantization and Quantization Aware Training.
+The two prevailing techniques for quantizing models are Post Training Quantization and Quantization-Aware Training.
 
 **Post Training Quantization** - Post-training quantization (PTQ) is a quantization technique where the model is quantized after it has been trained. The model is trained in floating point and then weights and activations are quantized as a post-processing step. This is the simplest approach and does not require access to the training data. Unlike Quantization-Aware Training (QAT), PTQ sets weight and activation quantization parameters directly, making it low-overhead and suitable for limited or unlabeled data situations. However, not readjusting the weights after quantizing, especially in low-precision quantization can lead to very different behavior and thus lower accuracy. To tackle this, techniques like bias correction, equalizing weight ranges, and adaptive rounding methods have been developed. PTQ can also be applied in zero-shot scenarios, where no training or testing data are available. This method has been made even more efficient to benefit compute- and memory- intensive large language models. Recently, SmoothQuant, a training-free, accuracy-preserving, and general-purpose PTQ solution which enables 8-bit weight, 8-bit activation quantization for LLMs, has been developed, demonstrating up to 1.56x speedup and 2x memory reduction for LLMs with negligible loss in accuracy [(@xiao2022smoothquant)](https://arxiv.org/abs/2211.10438).
 
@@ -650,7 +650,7 @@ In PTQ, a pretrained model undergoes a calibration process, as shown in @fig-PTQ
 
 ![Post-Training Quantization and calibration. Credit: @gholami2021survey.](images/png/efficientnumerics_PTQ.png){#fig-PTQ-diagram}
 
-**Quantization Aware Training** - Quantization-aware training (QAT) is a fine-tuning of the PTQ model. The model is trained aware of quantization, allowing it to adjust for quantization effects. This produces better accuracy with quantized inference. Quantizing a trained neural network model with methods such as PTQ introduces perturbations that can deviate the model from its original convergence point. For instance, Krishnamoorthi showed that even with per-channel quantization, networks like MobileNet do not reach baseline accuracy with int8 Post Training Quantization (PTQ) and require Quantization Aware Training (QAT) [(@krishnamoorthi2018quantizing)](https://arxiv.org/abs/1806.08342).To address this, QAT retrains the model with quantized parameters, employing forward and backward passes in floating point but quantizing parameters after each gradient update. Handling the non-differentiable quantization operator is crucial; a widely used method is the Straight Through Estimator (STE), approximating the rounding operation as an identity function. While other methods and variations exist, STE remains the most commonly used due to its practical effectiveness.
+**Quantization-Aware Training** - Quantization-aware training (QAT) fine-tunes the PTQ model. The model is trained with quantization simulated, allowing it to adjust for quantization effects. This produces better accuracy with quantized inference. Quantizing a trained neural network model with methods such as PTQ introduces perturbations that can push the model away from its original convergence point. For instance, Krishnamoorthi showed that even with per-channel quantization, networks like MobileNet do not reach baseline accuracy with int8 Post Training Quantization (PTQ) and require Quantization-Aware Training (QAT) [(@krishnamoorthi2018quantizing)](https://arxiv.org/abs/1806.08342). To address this, QAT retrains the model with quantized parameters, employing forward and backward passes in floating point but quantizing parameters after each gradient update. Handling the non-differentiable quantization operator is crucial; a widely used method is the Straight Through Estimator (STE), approximating the rounding operation as an identity function. While other methods and variations exist, STE remains the most commonly used due to its practical effectiveness.
 
 In QAT, a pretrained model is quantized and then finetuned using training data to adjust parameters and recover accuracy degradation, as shown in @fig-QAT-diagram. The calibration process is often conducted in parallel with the finetuning process for QAT.
 ![Quantization-Aware Training. Credit: @gholami2021survey.](images/png/efficientnumerics_QAT.png){#fig-QAT-diagram}
@@ -663,7 +663,7 @@ Quantization-Aware Training serves as a natural extension of Post-Training Quant
 
 ![Relative accuracies of PTQ and QAT. Credit: @wu2020integer.](images/png/efficientnumerics_PTQQATsummary.png){#fig-quantization-methods-summary}
 
-| **Feature/Technique** | **Post Training Quantization** | **Quantization Aware Training** | **Dynamic Quantization** |
+| **Feature/Technique** | **Post Training Quantization** | **Quantization-Aware Training** | **Dynamic Quantization** |
 |------------------------------|------------------------------|------------------------------|------------------------------|
 | **Pros** | | | |
 | Simplicity | ✓ | ✗ | ✗ |
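To make the quantization and rounding ideas in the patched section concrete, here is a minimal NumPy sketch of affine int8 quantization (the kind of mapping PTQ applies after calibration) and of stochastic rounding. It is not taken from the chapter or from any particular library; the function names (`affine_quantize`, `dequantize`, `stochastic_round`), the signed int8 range, and the use of the tensor's own min/max as the calibration range are assumptions made purely for illustration.

```python
# Illustrative sketch only: affine int8 quantization and stochastic rounding.
# Names and the calibration strategy are assumptions, not the book's or any library's API.
import numpy as np

def affine_quantize(x, num_bits=8):
    """Map float values to signed integers using a scale and zero point
    derived here from the tensor's own min/max (a stand-in for calibration)."""
    qmin, qmax = -(2 ** (num_bits - 1)), 2 ** (num_bits - 1) - 1
    x_min, x_max = float(x.min()), float(x.max())
    scale = (x_max - x_min) / (qmax - qmin)
    zero_point = int(round(qmin - x_min / scale))
    q = np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.int8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    """Recover approximate float values from the integer representation."""
    return (q.astype(np.float32) - zero_point) * scale

def stochastic_round(x):
    """Round up with probability equal to the fractional part, down otherwise,
    instead of always rounding to the nearest integer."""
    floor = np.floor(x)
    return floor + (np.random.rand(*x.shape) < (x - floor))

weights = np.random.randn(4, 4).astype(np.float32)
q, scale, zp = affine_quantize(weights)
print("max reconstruction error:", np.abs(weights - dequantize(q, scale, zp)).max())
print("stochastic rounding of 2.3, five draws:",
      [float(stochastic_round(np.array([2.3]))[0]) for _ in range(5)])
```

In a real PTQ flow, the min/max range would come from a calibration dataset rather than from the tensor itself, and QAT would additionally simulate this quantize-dequantize step during training, backpropagating through the rounding with the straight-through estimator described above.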