docs/source/en/main_classes/quantization.md
132 additions & 1 deletion
@@ -16,6 +16,137 @@ rendered properly in your Markdown viewer.

# Quantize 🤗 Transformers models

## `AutoGPTQ` Integration

🤗 Transformers has integrated the `optimum` API to perform GPTQ quantization on language models. You can load and quantize your model in 8, 6, 4, or even 2 bits without a big drop in performance and with faster inference speed! This is supported by most GPU hardware.

To learn more about the quantization method, check out:
- the [GPTQ](https://arxiv.org/pdf/2210.17323.pdf) paper
<!-- - the `optimum` [guide]() on GPTQ quantization -->
- the [`AutoGPTQ`](https://github.com/PanQiWei/AutoGPTQ) library used as the backend
### Requirements

To run the code below, you need to have the `AutoGPTQ` library installed, together with `optimum`, `transformers`, and `accelerate`.
For now, the GPTQ integration supports only text models, and you may encounter unexpected behaviour with vision, speech, or multi-modal models.
### Load and quantize a model

GPTQ is a quantization method that requires weight calibration before using the quantized model. If you want to quantize a transformers model from scratch, it might take some time to produce the quantized model (~10 min on a Google Colab for the `facebook/opt-350m` model).

Hence, there are two different scenarios in which you may want to use GPTQ-quantized models. The first is to load a model that has already been quantized by other users and is available on the Hub; the second is to quantize your model from scratch and save it or push it to the Hub so that other users can also use it.
#### GPTQ Configuration

In order to load and quantize a model, you need to create a [`GPTQConfig`]. You need to pass the number of `bits`, a `dataset` to calibrate the quantization, and the `tokenizer` of the model to prepare the dataset.
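
Here is a minimal sketch of creating such a config; the checkpoint, bit width, and calibration dataset are illustrative:

```python
from transformers import AutoTokenizer, GPTQConfig

model_id = "facebook/opt-125m"  # illustrative checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
# 4-bit quantization, calibrated on the "c4" dataset with the model's own tokenizer
gptq_config = GPTQConfig(bits=4, dataset="c4", tokenizer=tokenizer)
```
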
You can quantize a model by using `from_pretrained` and setting the `quantization_config`.
```python
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=gptq_config)
```

Note that you will need a GPU to quantize a model. We will put the model on the CPU and move the modules back and forth between CPU and GPU in order to quantize them.

If you want to maximize your GPU usage while using CPU offload, you can set `device_map = "auto"`.
```python
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", quantization_config=gptq_config)
```

Note that disk offload is not supported. Furthermore, if you run out of memory because of the dataset, you may have to pass `max_memory` in `from_pretrained`. Check out this [guide](https://huggingface.co/docs/accelerate/usage_guides/big_modeling#designing-a-device-map) to learn more about `device_map` and `max_memory`.
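
For example, a sketch with hypothetical memory budgets (adjust the values to your own hardware):

```python
from transformers import AutoModelForCausalLM

# Hypothetical budgets: cap GPU 0 at 30GiB and the CPU at 60GiB during quantization.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    max_memory={0: "30GiB", "cpu": "60GiB"},
    quantization_config=gptq_config,
)
```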

<Tip warning={true}>
GPTQ quantization only works for text models for now. Furthermore, the quantization process can take a lot of time depending on your hardware (175B model = 4 GPU hours using an NVIDIA A100). Please check on the Hub whether a GPTQ-quantized version of the model already exists. If not, you can submit a request on GitHub.
</Tip>
### Push quantized model to 🤗 Hub

You can push the quantized model to the Hub like any 🤗 model with `push_to_hub`. The quantization config will be saved and pushed along with the model.
```python
quantized_model.push_to_hub("opt-125m-gptq")
tokenizer.push_to_hub("opt-125m-gptq")
```
If you want to save your quantized model on your local machine, you can also do it with `save_pretrained`:
```python
quantized_model.save_pretrained("opt-125m-gptq")
tokenizer.save_pretrained("opt-125m-gptq")
```

Note that if you have quantized your model with a `device_map`, make sure to move the entire model to one of your GPUs or to the `cpu` before saving it.
```python
quantized_model.to("cpu")
quantized_model.save_pretrained("opt-125m-gptq")
```
### Load a quantized model from the 🤗 Hub
You can load a quantized model from the Hub by using `from_pretrained`.
Make sure that the pushed weights are quantized by checking that the attribute `quantization_config` is present in the model configuration object.
```python
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("{your_username}/opt-125m-gptq")
```
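
To double-check that the checkpoint you loaded is indeed quantized, you can inspect its configuration; a minimal sketch:

```python
# A GPTQ-quantized checkpoint stores a `quantization_config` entry (bits, dataset, ...)
# in its model config; it is absent from non-quantized checkpoints.
print(model.config.quantization_config)
```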

If you want to load a model faster and without allocating more memory than needed, the `device_map` argument also works with quantized models. Make sure that you have the `accelerate` library installed.
```python
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("{your_username}/opt-125m-gptq", device_map="auto")
```
### Exllama kernels for faster inference

For 4-bit models, you can use the exllama kernels for faster inference speed. They are activated by default. You can change that behavior by passing `disable_exllama` in [`GPTQConfig`]. This will overwrite the quantization config stored in the model config. Note that you will only be able to overwrite the attributes related to the kernels. Furthermore, you need to have the entire model on GPUs if you want to use the exllama kernels.

```python
from transformers import AutoModelForCausalLM, GPTQConfig

# Only kernel-related attributes can be overridden when reloading a quantized checkpoint.
gptq_config = GPTQConfig(bits=4, disable_exllama=False)
model = AutoModelForCausalLM.from_pretrained("{your_username}/opt-125m-gptq", device_map="auto", quantization_config=gptq_config)
```

Note that only 4-bit models are supported for now. Furthermore, it is recommended to deactivate the exllama kernels if you are fine-tuning a quantized model with `peft`.
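
For instance, a sketch of reloading the quantized checkpoint with the kernels turned off before fine-tuning (the checkpoint name is illustrative):

```python
from transformers import AutoModelForCausalLM, GPTQConfig

# Deactivate the exllama kernels so the quantized model can be fine-tuned with adapters.
gptq_config = GPTQConfig(bits=4, disable_exllama=True)
model = AutoModelForCausalLM.from_pretrained(
    "{your_username}/opt-125m-gptq", device_map="auto", quantization_config=gptq_config
)
```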
#### Fine-tune a quantized model
With the official support of adapters in the Hugging Face ecosystem, you can fine-tune models that have been quantized with GPTQ.
Please have a look at the [`peft`](https://github.com/huggingface/peft) library for more details.
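
As a minimal LoRA sketch with `peft` (the rank, hyperparameters, and target module names below are illustrative and assume an OPT-style model):

```python
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=8,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # assumption: OPT-style attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```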
### Example demo

Check out the Google Colab [notebook](https://colab.research.google.com/drive/1_TIrmuKOFhuRRiTWN94iLKUFu6ZX4ceb?usp=sharing) to learn how to quantize your model with GPTQ and how to fine-tune the quantized model with `peft`.
### GPTQConfig
[[autodoc]] GPTQConfig
## `bitsandbytes` Integration

🤗 Transformers is closely integrated with the most used modules of `bitsandbytes`. You can load your model in 8-bit precision with a few lines of code.
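
For example, a minimal sketch (the checkpoint is illustrative; `load_in_8bit=True` requires `bitsandbytes` and a CUDA-capable GPU):

```python
from transformers import AutoModelForCausalLM

# Weights are converted to 8-bit on the fly while the checkpoint is loaded.
model = AutoModelForCausalLM.from_pretrained("facebook/opt-350m", load_in_8bit=True, device_map="auto")
```
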
@@ -215,7 +346,7 @@ This section is intended to advanced users, that want to explore what it is poss
One of the advanced use cases of this is being able to load a model and dispatch the weights between `CPU` and `GPU`. Note that the weights that will be dispatched on the CPU **will not** be converted to 8-bit, and are thus kept in `float32`. This feature is intended for users who want to fit a very large model and need to dispatch it between GPU and CPU.
First, load a [`BitsAndBytesConfig`] from `transformers` and set the attribute `llm_int8_enable_fp32_cpu_offload` to `True`:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig