
Commit 1163603

add lwq doc in quantization_weight_only (#1311)
* add lwq doc in quantization_weight_only

Signed-off-by: Guo, Heng <heng.guo@intel.com>
1 parent 6e08cca commit 1163603

File tree

2 files changed: +52 -6 lines changed

docs/source/imgs/lwq.png (58.4 KB)

docs/source/quantization_weight_only.md (52 additions & 6 deletions)
@@ -7,6 +7,8 @@ Weight Only Quantization (WOQ)

3. [Examples](#examples)

4. [Layer Wise Quantization](#layer-wise-quantization)

## Introduction

@@ -43,7 +45,7 @@ There are many excellent works for weight only quantization to improve its accuracy.
> **TEQ:** A trainable equivalent transformation that preserves the FP32 precision in weight-only quantization. It is inspired by AWQ while providing a new solution to search for the optimal per-channel scaling factor between activations and weights.

## Examples
### **Quantization Capability**
| Config | Capability |
| :---: | :---: |
| dtype | ['int', 'nf4', 'fp4'] |
@@ -56,22 +58,22 @@ Notes:
- *group_size = -1* refers to **per output channel quantization**. Taking a linear layer (input channel = $C_{in}$, output channel = $C_{out}$) for instance, when *group_size = -1*, quantization calculates $C_{out}$ quantization parameters in total. Otherwise, when *group_size = gs*, quantization parameters are calculated for every $gs$ elements along the input channel, leading to $C_{out} \times (C_{in} / gs)$ quantization parameters in total (see the short check below).
- 4-bit NormalFloat (NF4) is proposed in QLoRA[5]. 'fp4' includes [fp4_e2m1](../../neural_compressor/adaptor/torch_utils/weight_only.py#L37) and [fp4_e2m1_bnb](https://github.com/TimDettmers/bitsandbytes/blob/18e827d666fa2b70a12d539ccedc17aa51b2c97c/bitsandbytes/functional.py#L735). By default, fp4 refers to fp4_e2m1_bnb.
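
As a quick check of the counting above, take a hypothetical 4096 x 4096 linear layer with group_size = 128 (sizes chosen purely for illustration):

```python
# Hypothetical layer sizes, for illustration only.
C_in, C_out, group_size = 4096, 4096, 128

params_per_output_channel = C_out                 # group_size = -1: one scale/zero-point set per output channel
params_grouped = C_out * (C_in // group_size)     # group_size = 128: one set per group of 128 input elements
print(params_per_output_channel, params_grouped)  # 4096 131072
```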

**RTN arguments**
| rtn_args | default value | comments |
|:----------:|:-------------:|:-------------------------------------------------------------------:|
| enable_full_range | False | Whether to use -2**(bits-1) in sym scheme |
| enable_mse_search | False | Whether to search for the best clip range from range [0.805, 1.0, 0.005] |
| return_int | False | Whether to return compressed model with torch.int32 data type |
| group_dim | 1 | 0 means splitting output channel, 1 means splitting input channel |
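
These rtn_args are passed through the `recipes` field of `PostTrainingQuantConfig`, the same pattern used in the layer-wise example later in this document. A minimal sketch, assuming the usual weight-only `op_type_dict` layout; the model and the bits/group_size values are illustrative only:

```python
import torch

from neural_compressor import PostTrainingQuantConfig, quantization

# Toy model for illustration; substitute your own torch.nn.Module.
fp32_model = torch.nn.Sequential(torch.nn.Linear(64, 64), torch.nn.ReLU(), torch.nn.Linear(64, 16))

conf = PostTrainingQuantConfig(
    approach="weight_only",
    op_type_dict={
        ".*": {  # apply to all matched op types
            "weight": {
                "bits": 4,
                "group_size": 32,
                "scheme": "sym",
                "algorithm": "RTN",
            },
        },
    },
    recipes={"rtn_args": {"enable_full_range": True, "enable_mse_search": True}},
)
q_model = quantization.fit(fp32_model, conf)  # RTN needs no calibration data
```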

**AWQ arguments**
| awq_args | default value | comments |
|:----------:|:-------------:|:-------------------------------------------------------------------:|
| enable_auto_scale | True | Whether to search for best scales based on activation distribution |
| enable_mse_search | True | Whether to search for the best clip range from range [0.91, 1.0, 0.01] |
| folding | False | False allows inserting a mul op before a linear layer when the scale cannot be absorbed by the previous layer; True does not |

**GPTQ arguments**
| gptq_args | default value | comments |
|:----------:|:-------------:|:-------------------------------------------------------------------:|
| actorder | False | Whether to sort Hessian's diagonal values to rearrange channel-wise quantization order |
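
The AWQ and GPTQ knobs follow the same pattern: select the algorithm in `op_type_dict` and pass the entries above through `recipes` (assuming `awq_args`/`gptq_args` recipe keys that mirror `rtn_args`, as the column headers suggest). A sketch for AWQ, with `fp32_model` and `calib_dataloader` supplied by the user:

```python
from neural_compressor import PostTrainingQuantConfig, quantization

# Illustrative sketch; swap "AWQ" for "GPTQ" and "awq_args" for "gptq_args" to tune GPTQ instead.
conf = PostTrainingQuantConfig(
    approach="weight_only",
    op_type_dict={
        ".*": {
            "weight": {"bits": 4, "group_size": 128, "scheme": "asym", "algorithm": "AWQ"},
        },
    },
    recipes={"awq_args": {"enable_auto_scale": True, "enable_mse_search": True, "folding": False}},
)
# AWQ and GPTQ calibrate on data, so a calibration dataloader is required.
q_model = quantization.fit(fp32_model, conf, calib_dataloader=calib_dataloader)
```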
@@ -86,7 +88,7 @@ Notes:
### **Export Compressed Model**
To support low-memory inference, Neural Compressor implemented WeightOnlyLinear, a torch.nn.Module, to compress the fake-quantized fp32 model. Since torch does not provide flexible low-bit data storage, WeightOnlyLinear packs the low-bit data (weights and zero points) into a larger data type, such as torch.int8 or torch.int32. When WeightOnlyLinear is used for inference, it restores the compressed data to float32 and runs the torch linear function.
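
As a plain-Python illustration of the packing idea (not the actual WeightOnlyLinear code), eight 4-bit values can share one 32-bit slot:

```python
# Illustration only: pack eight 4-bit values into one 32-bit integer and unpack them again.
values = [1, 7, 0, 15, 3, 9, 12, 5]  # eight quantized 4-bit weights, each in [0, 15]

packed = 0
for i, v in enumerate(values):
    packed |= v << (4 * i)  # each value occupies its own 4-bit slot

unpacked = [(packed >> (4 * i)) & 0xF for i in range(8)]
assert unpacked == values
```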

**Export arguments**
| export args | default value | comments |
|:----------:|:-------------:|:-------------------------------------------------------------------:|
| qweight_config_path | None | To export from an fp32 model together with a json file, set the path of qconfig.json |
@@ -95,7 +97,7 @@ To support low memory inference, Neural Compressor implemented WeightOnlyLinear,
| compression_dim | 1 | 0 means output channel while 1 means input channel |
| scale_dtype | torch.float32 | Data type for scale and bias |
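
A sketch of the export step, assuming the compressed model saved below is produced by `q_model.export_compressed_model` and that its keyword arguments match the table above:

```python
import torch

# Sketch only; argument names follow the export args table above.
compressed_model = q_model.export_compressed_model(
    qweight_config_path=None,   # e.g. "saved_results/qconfig.json" when starting from an fp32 model
    compression_dim=1,          # pack along the input channel
    scale_dtype=torch.float32,
)
torch.save(compressed_model.state_dict(), "compressed_model.pt")
```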

### **User Code Example**
```python
conf = PostTrainingQuantConfig(
    approach="weight_only",
@@ -127,6 +129,50 @@ torch.save(compressed_model.state_dict(), "compressed_model.pt")

The saved_results folder contains two files: `best_model.pt` and `qconfig.json`, and the generated q_model is a fake quantized model.

## Layer Wise Quantization

Large language models (LLMs) have shown exceptional performance across various tasks; meanwhile, their substantial parameter size poses significant challenges for deployment. Layer-wise quantization (LWQ) can greatly reduce the memory footprint of LLMs, usually by 80-90%, which means that users can quantize LLMs even on a single node using a GPU or CPU. It allows quantizing the model on memory-constrained devices, therefore making huge-sized LLM quantization possible.

<img src="./imgs/lwq.png">

*Figure 1: The process of layer-wise quantization. The color grey means empty parameters and the color blue represents parameters that need to be quantized. Every rectangle inside the model represents one layer.*

### Supported Matrix

| Algorithms/Framework | PyTorch |
|:--------------:|:----------:|
| RTN | &#10004; |
| AWQ | &#10005; |
| GPTQ | &#10005; |
| TEQ | &#10005; |

### Example
```python
from transformers import AutoModelForCausalLM

from neural_compressor import PostTrainingQuantConfig, quantization
from neural_compressor.adaptor.torch_utils.layer_wise_quant import load_shell

model_name_or_path = "facebook/opt-125m"  # same checkpoint as model_path below
# Load an empty model shell; weights are loaded layer by layer during quantization.
fp32_model = load_shell(model_name_or_path, AutoModelForCausalLM, torchscript=True)

conf = PostTrainingQuantConfig(
    approach="weight_only",
    recipes={
        "layer_wise_quant": True,
        "layer_wise_quant_args": {
            "model_path": "facebook/opt-125m",
        },
        "rtn_args": {"enable_full_range": True},
    },
)

q_model = quantization.fit(
    fp32_model,
    conf,
    calib_dataloader=eval_dataloader,  # user-provided calibration dataloader
    eval_func=lambda x: 0.1,  # dummy eval function
)
output_dir = "./saved_model"
q_model.save(output_dir)
```
## Reference

[1]. Xiao, Guangxuan, et al. "Smoothquant: Accurate and efficient post-training quantization for large language models." arXiv preprint arXiv:2211.10438 (2022).
