> **TEQ:** A trainable equivalent transformation that preserves the FP32 precision in weight-only quantization. It is inspired by AWQ while providing a new solution to search for the optimal per-channel scaling factor between activations and weights.
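The equivalence that TEQ (and AWQ) builds on can be sketched in a few lines of plain PyTorch. This is an illustrative sketch of the transformation's form, not the library's implementation: scaling the weight's input channels by a per-channel factor `s` while dividing the activation by the same `s` leaves the linear output unchanged, so `s` can be trained (TEQ) or searched (AWQ) to make the scaled weight easier to quantize.

```python
import torch

# Illustrative sketch of the equivalent transformation behind TEQ/AWQ
x = torch.randn(2, 8)         # activation, shape (batch, in_features)
w = torch.randn(4, 8)         # linear weight, shape (out_features, in_features)
s = torch.rand(8) + 0.5       # per-input-channel scale (trainable in TEQ)

y_ref = x @ w.t()
y_eq = (x / s) @ (w * s).t()  # transformed activation and weight

assert torch.allclose(y_ref, y_eq, atol=1e-5)  # output is mathematically unchanged
```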
## Examples
### **Quantization Capability**
| Config | Capability |
| :---: | :---:|
| dtype |['int', 'nf4', 'fp4']|
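For orientation, the sketch below shows how these fields typically map onto a weight-only `PostTrainingQuantConfig`. It assumes the 2.x `op_type_dict` layout; exact key names and defaults may differ across Neural Compressor versions, so treat it as a sketch rather than the canonical example.

```python
from neural_compressor import PostTrainingQuantConfig

# assumed weight-only config layout; the keys mirror the capability table above
conf = PostTrainingQuantConfig(
    approach="weight_only",
    op_type_dict={
        ".*": {  # apply to all matched ops (e.g. Linear)
            "weight": {
                "dtype": "int",     # 'int', 'nf4' or 'fp4'
                "bits": 4,
                "group_size": 32,   # -1 means per output channel
                "scheme": "sym",
                "algorithm": "RTN",
            },
        },
    },
)
```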
Notes:

- *group_size = -1* refers to **per output channel quantization**. Taking a linear layer (input channel = $C_{in}$, output channel = $C_{out}$) as an example, when *group_size = -1*, quantization calculates $C_{out}$ quantization parameters in total. Otherwise, when *group_size = gs*, quantization parameters are calculated for every $gs$ elements along the input channel, leading to $C_{out} \times (C_{in} / gs)$ quantization parameters in total. For instance, with $C_{in} = 1024$, $C_{out} = 1024$ and $gs = 128$, each output channel is split into $1024 / 128 = 8$ groups, giving $1024 \times 8 = 8192$ sets of quantization parameters. A minimal code sketch of this grouping appears below these notes.
- 4-bit NormalFloat (NF4) is proposed in QLoRA [5]. 'fp4' includes [fp4_e2m1](../../neural_compressor/adaptor/torch_utils/weight_only.py#L37) and [fp4_e2m1_bnb](https://github.com/TimDettmers/bitsandbytes/blob/18e827d666fa2b70a12d539ccedc17aa51b2c97c/bitsandbytes/functional.py#L735). By default, fp4 refers to fp4_e2m1_bnb.
| actorder | False | Whether to sort Hessian's diagonal values to rearrange channel-wise quantization order|
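The grouping convention above can be illustrated with a minimal, RTN-style sketch of group-wise symmetric int4 quantization (an assumption for illustration; `quantize_per_group` is a hypothetical helper, not a Neural Compressor API):

```python
import torch


def quantize_per_group(weight: torch.Tensor, group_size: int = 32, bits: int = 4):
    # one scale per group of `group_size` input-channel elements,
    # i.e. C_out * (C_in / group_size) scales in total
    c_out, c_in = weight.shape
    maxq = 2 ** (bits - 1) - 1                           # symmetric int4 range [-8, 7]
    grouped = weight.reshape(c_out, c_in // group_size, group_size)
    scale = grouped.abs().amax(dim=-1, keepdim=True) / maxq
    qweight = torch.clamp(torch.round(grouped / scale), -maxq - 1, maxq)
    fake_quant = (qweight * scale).reshape(c_out, c_in)  # fake-quantized fp32 weight
    return qweight, scale, fake_quant


w = torch.randn(64, 128)
qweight, scale, w_fq = quantize_per_group(w, group_size=32)
print(scale.shape)  # torch.Size([64, 4, 1]) -> 64 * (128 / 32) = 256 quantization scales
```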
### **Export Compressed Model**
To support low-memory inference, Neural Compressor implements WeightOnlyLinear, a torch.nn.Module, to compress the fake-quantized fp32 model. Since torch does not provide flexible storage for low-bit data types, WeightOnlyLinear packs the low-bit data (weights and zero points) into a wider data type, such as torch.int8 or torch.int32. When WeightOnlyLinear is used for inference, it restores the compressed data to float32 and runs the torch linear function.
The saved_results folder contains two files: `best_model.pt` and `qconfig.json`, and the generated q_model is a fake-quantized model.
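As a rough illustration of this packing scheme (a simplified sketch, not the actual WeightOnlyLinear implementation), several low-bit integers can be stored in one wider integer and unpacked again before dequantization:

```python
import torch

# Simplified sketch: two 4-bit values per uint8 here, whereas WeightOnlyLinear
# packs into types such as torch.int8 or torch.int32.
bits = 4
mask = (1 << bits) - 1

# hypothetical weights already quantized to the unsigned 4-bit range [0, 15]
qweight = torch.randint(0, 2**bits, (8,), dtype=torch.uint8)

low, high = qweight[0::2], qweight[1::2]
packed = low | (high << bits)  # two 4-bit values stored in each uint8

# unpacking recovers the 4-bit integers; they are then dequantized to float32
# with the stored scale/zero point before running the torch linear function
restored = torch.stack((packed & mask, packed >> bits), dim=1).flatten()
assert torch.equal(restored, qweight)
```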
## Layer Wise Quantization
Large language models (LLMs) have shown exceptional performance across various tasks, while their substantial parameter size poses significant challenges for deployment. Layer-wise quantization (LWQ) can greatly reduce the memory footprint required to quantize LLMs, usually by 80-90%, which means that users can quantize LLMs even on a single node with a GPU or CPU. The model can thus be quantized on memory-constrained devices, making quantization of huge LLMs possible.
<img src="./imgs/lwq.png">
*Figure 1: The process of layer-wise quantization. Grey represents empty parameters and blue represents parameters that need to be quantized. Every rectangle inside the model represents one layer.*
### Supported Matrix
| Algorithms/Framework | PyTorch |
|:--------------:|:----------:|
| RTN |✔|
| AWQ |✕|
| GPTQ |✕|
| TEQ |✕|
### Example
```python
from neural_compressor import PostTrainingQuantConfig, quantization

# load_shell builds the model structure with empty parameters so that each layer
# can be loaded and quantized one at a time, keeping peak memory low
from neural_compressor.adaptor.torch_utils.layer_wise_quant import load_shell
```
[1]. Xiao, Guangxuan, et al. "Smoothquant: Accurate and efficient post-training quantization for large language models." arXiv preprint arXiv:2211.10438 (2022).