Weight Only Quantization (WOQ)

3. [Examples](#examples)

4. [Layer Wise Quantization](#layer-wise-quantization)


## Introduction

There are many excellent works for weight only quantization to improve its accuracy, for example:
> **TEQ:** A trainable equivalent transformation that preserves the FP32 precision in weight-only quantization. It is inspired by AWQ while providing a new solution to search for the optimal per-channel scaling factor between activations and weights.

## Examples
### **Quantization Capability**
| Config | Capability |
| :---: | :---:|
| dtype | ['int', 'nf4', 'fp4'] |
Notes:
- *group_size = -1* refers to **per output channel quantization**. Taking a linear layer (input channel = $C_{in}$, output channel = $C_{out}$) for instance, when *group_size = -1*, quantization calculates a total of $C_{out}$ quantization parameters. Otherwise, when *group_size = gs*, quantization parameters are calculated for every $gs$ elements along the input channel, leading to a total of $C_{out} \times (C_{in} / gs)$ quantization parameters (see the short sketch after these notes).
- 4-bit NormalFloat (NF4) is proposed in QLoRA[5]. 'fp4' includes [fp4_e2m1](../../neural_compressor/adaptor/torch_utils/weight_only.py#L37) and [fp4_e2m1_bnb](https://github.com/TimDettmers/bitsandbytes/blob/18e827d666fa2b70a12d539ccedc17aa51b2c97c/bitsandbytes/functional.py#L735). By default, fp4 refers to fp4_e2m1_bnb.
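
To make this counting concrete, here is a minimal sketch (the layer sizes are illustrative assumptions, not values from this document) that computes how many groups of quantization parameters a linear layer's weight needs for a given *group_size*:

```python
def num_quant_param_groups(c_in: int, c_out: int, group_size: int) -> int:
    """Number of (scale, zero point) groups for a linear layer's weight."""
    if group_size == -1:  # per output channel quantization
        return c_out
    return c_out * (c_in // group_size)


# Example: a 4096 x 4096 linear layer
print(num_quant_param_groups(4096, 4096, -1))   # 4096 (one group per output channel)
print(num_quant_param_groups(4096, 4096, 128))  # 131072 (group-wise)
```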

**RTN arguments**
| rtn_args | default value | comments |
|:----------:|:-------------:|:-------------------------------------------------------------------:|
| enable_full_range | False | Whether to use -2**(bits-1) in sym scheme |
| enable_mse_search | False | Whether to search for the best clip range from range [0.805, 1.0, 0.005] |
| return_int | False | Whether to return compressed model with torch.int32 data type |
| group_dim | 1 | 0 means splitting output channel, 1 means splitting input channel |

**AWQ arguments**
| awq_args | default value | comments |
|:----------:|:-------------:|:-------------------------------------------------------------------:|
| enable_auto_scale | True | Whether to search for best scales based on activation distribution |
| enable_mse_search | True | Whether to search for the best clip range from range [0.91, 1.0, 0.01] |
| folding | False | When False, a mul op is inserted before a linear op if the scale cannot be absorbed by the last layer; when True, no mul is inserted |

**GPTQ arguments**
| gptq_args | default value | comments |
|:----------:|:-------------:|:-------------------------------------------------------------------:|
| actorder | False | Whether to sort Hessian's diagonal values to rearrange channel-wise quantization order|
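These per-algorithm dictionaries are passed through the `recipes` field of `PostTrainingQuantConfig`, in the same way the layer-wise example later on this page passes `rtn_args`. A minimal sketch assuming the RTN algorithm (the other algorithms' recipe keys, e.g. `awq_args`/`gptq_args`, are assumed to mirror the table names above):

```python
from neural_compressor import PostTrainingQuantConfig

conf = PostTrainingQuantConfig(
    approach="weight_only",
    recipes={
        # dictionary keys mirror the argument tables above
        "rtn_args": {"enable_full_range": True, "enable_mse_search": False},
    },
)
```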
### **Export Compressed Model**
To support low-memory inference, Neural Compressor implements WeightOnlyLinear, a torch.nn.Module, to compress the fake quantized fp32 model. Since torch does not provide flexible low-bit data type storage, WeightOnlyLinear packs the low-bit data into a larger data type, such as torch.int8 or torch.int32. The low-bit data includes weights and zero points. When WeightOnlyLinear is used for inference, it restores the compressed data to float32 and runs the torch linear function, as sketched below.
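
The bit-packing idea can be illustrated with a small self-contained sketch; this is not WeightOnlyLinear's internal code, and the function names and the 4-bit layout are assumptions made for illustration:

```python
import torch


def pack_int4_to_int32(qweight: torch.Tensor) -> torch.Tensor:
    """Pack unsigned 4-bit values (0..15) along the last dim, eight per int32 word."""
    assert qweight.shape[-1] % 8 == 0
    q = qweight.to(torch.int32).reshape(*qweight.shape[:-1], -1, 8)
    packed = torch.zeros(q.shape[:-1], dtype=torch.int32)
    for i in range(8):
        packed |= q[..., i] << (4 * i)
    return packed


def unpack_int32_to_int4(packed: torch.Tensor) -> torch.Tensor:
    """Restore the 4-bit values, e.g. before dequantizing them back to float32."""
    nibbles = [(packed >> (4 * i)) & 0xF for i in range(8)]
    return torch.stack(nibbles, dim=-1).reshape(*packed.shape[:-1], -1)


qweight = torch.randint(0, 16, (4, 64))  # toy 4-bit quantized weight (values 0..15)
assert bool((unpack_int32_to_int4(pack_int4_to_int32(qweight)) == qweight).all())
```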

**Export arguments**
| export args | default value | comments |
|:----------:|:-------------:|:-------------------------------------------------------------------:|
| qweight_config_path | None | To export a model from an fp32 model plus a JSON config file, set this to the path of qconfig.json |
| compression_dim | 1 | 0 means output channel while 1 means input channel |
| scale_dtype | torch.float32 | Data type for scale and bias |

### **User Code Example**
```python
conf = PostTrainingQuantConfig(
    approach="weight_only",
    # op_type_dict with bits / group_size / scheme / algorithm settings (collapsed in this diff view)
)
q_model = quantization.fit(model, conf, eval_func=eval_func)  # `model` and `eval_func` are user-provided
q_model.save("saved_results")
compressed_model = q_model.export_compressed_model()  # see the export arguments above
torch.save(compressed_model.state_dict(), "compressed_model.pt")
```

The saved_results folder contains two files: `best_model.pt` and `qconfig.json`, and the generated q_model is a fake quantized model.

## Layer Wise Quantization

Large language models (LLMs) have shown exceptional performance across various tasks, but their substantial parameter size poses significant challenges for deployment. Layer-wise quantization (LWQ) greatly reduces the memory footprint of quantizing LLMs, usually by 80-90%, so users can quantize LLMs even on a single node with a GPU or CPU. Because the model is quantized layer by layer, the process fits on memory-constrained devices, making quantization of huge LLMs possible.

<img src="./imgs/lwq.png">

*Figure 1: The process of layer-wise quantization. Grey indicates empty parameters and blue indicates parameters that need to be quantized. Each rectangle inside the model represents one layer.*
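
The flow in Figure 1 can be mimicked with a small, self-contained sketch. The toy two-layer model, the checkpoint file name, and the plain RTN rounding helper below are illustrative assumptions rather than Neural Compressor internals: an empty ("meta") model is created first, then each layer's weight is loaded, quantized, and released one at a time.

```python
import torch


def rtn_quantize_weight(w: torch.Tensor, bits: int = 4) -> torch.Tensor:
    """Per-output-channel round-to-nearest (RTN) fake quantization of a 2-D weight."""
    qmax = 2 ** (bits - 1) - 1
    scale = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-9) / qmax
    return (w / scale).round().clamp(-qmax - 1, qmax) * scale


# Toy checkpoint on disk and an empty (meta-device) model with the same structure.
torch.save(torch.nn.Sequential(torch.nn.Linear(16, 16), torch.nn.Linear(16, 16)).state_dict(), "toy_ckpt.pt")
empty_model = torch.nn.Sequential(
    torch.nn.Linear(16, 16, device="meta"), torch.nn.Linear(16, 16, device="meta")
)

state = torch.load("toy_ckpt.pt")  # a real LWQ flow would read each tensor lazily from disk
for name, module in empty_model.named_modules():
    if isinstance(module, torch.nn.Linear):
        w = state.pop(f"{name}.weight")                              # load only this layer's weight
        module.weight = torch.nn.Parameter(rtn_quantize_weight(w))   # quantize and attach it
        del w                                                        # release the FP32 copy
```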

### Supported Matrix

| Algorithms/Framework | PyTorch |
|:--------------:|:----------:|
| RTN | &#10004; |
| AWQ | &#10005; |
| GPTQ | &#10005; |
| TEQ | &#10005; |

### Example
```python
from transformers import AutoModelForCausalLM

from neural_compressor import PostTrainingQuantConfig, quantization
from neural_compressor.adaptor.torch_utils.layer_wise_quant import load_shell

# Load an empty model shell; weights are loaded layer by layer during quantization.
model_name_or_path = "facebook/opt-125m"  # same checkpoint as `model_path` below
fp32_model = load_shell(model_name_or_path, AutoModelForCausalLM, torchscript=True)
conf = PostTrainingQuantConfig(
approach="weight_only",
recipes={
"layer_wise_quant": True,
"layer_wise_quant_args": {
"model_path": "facebook/opt-125m",
},
"rtn_args": {"enable_full_range": True},
},
)

q_model = quantization.fit(
fp32_model,
conf,
    calib_dataloader=eval_dataloader,  # user-provided calibration dataloader
    eval_func=lambda x: 0.1,  # dummy evaluation function
)
output_dir = "./saved_model"
q_model.save(output_dir)
```

## Reference

[1]. Xiao, Guangxuan, et al. "Smoothquant: Accurate and efficient post-training quantization for large language models." arXiv preprint arXiv:2211.10438 (2022).