diff --git a/docs/source/en/_toctree.yml b/docs/source/en/_toctree.yml index e7fd94eac43b5d..48ad331395d976 100644 --- a/docs/source/en/_toctree.yml +++ b/docs/source/en/_toctree.yml @@ -133,6 +133,8 @@ - sections: - local: performance title: Overview + - local: quantization + title: Quantization - sections: - local: perf_train_gpu_one title: Methods and tools for efficient training on a single GPU diff --git a/docs/source/en/main_classes/quantization.md b/docs/source/en/main_classes/quantization.md index 7200039e3f5058..271cf17412fbe6 100644 --- a/docs/source/en/main_classes/quantization.md +++ b/docs/source/en/main_classes/quantization.md @@ -14,535 +14,24 @@ rendered properly in your Markdown viewer. --> -# Quantize 🤗 Transformers models +# Quantization -## AWQ integration +Quantization techniques reduces memory and computational costs by representing weights and activations with lower-precision data types like 8-bit integers (int8). This enables loading larger models you normally wouldn't be able to fit into memory, and speeding up inference. Transformers supports the AWQ and GPTQ quantization algorithms and it supports 8-bit and 4-bit quantization with bitsandbytes. -AWQ method has been introduced in the [*AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration* paper](https://arxiv.org/abs/2306.00978). With AWQ you can run models in 4-bit precision, while preserving its original quality (i.e. no performance degradation) with a superior throughput that other quantization methods presented below - reaching similar throughput as pure `float16` inference. + -We now support inference with any AWQ model, meaning anyone can load and use AWQ weights that are pushed on the Hub or saved locally. Note that using AWQ requires to have access to a NVIDIA GPU. CPU inference is not supported yet. +Learn how to quantize models in the [Quantization](../quantization) guide. -### Quantizing a model - -We advise users to look at different existing tools in the ecosystem to quantize their models with AWQ algorithm, such as: - -- [`llm-awq`](https://github.com/mit-han-lab/llm-awq) from MIT Han Lab -- [`autoawq`](https://github.com/casper-hansen/AutoAWQ) from [`casper-hansen`](https://github.com/casper-hansen) -- Intel neural compressor from Intel - through [`optimum-intel`](https://huggingface.co/docs/optimum/main/en/intel/optimization_inc) - -Many other tools might exist in the ecosystem, please feel free to open a PR to add them to the list. -Currently the integration with 🤗 Transformers is only available for models that have been quantized using `autoawq` library and `llm-awq`. Most of the models quantized with `auto-awq` can be found under [`TheBloke`](https://huggingface.co/TheBloke) namespace of 🤗 Hub, and to quantize models with `llm-awq` please refer to the [`convert_to_hf.py`](https://github.com/mit-han-lab/llm-awq/blob/main/examples/convert_to_hf.py) script in the examples folder of [`llm-awq`](https://github.com/mit-han-lab/llm-awq/). - -### Load a quantized model - -You can load a quantized model from the Hub using the `from_pretrained` method. Make sure that the pushed weights are quantized, by checking that the attribute `quantization_config` is present in the model's configuration file (`configuration.json`). You can confirm that the model is quantized in the AWQ format by checking the field `quantization_config.quant_method` which should be set to `"awq"`. Note that loading the model will set other weights in `float16` by default for performance reasons. 
If you want to change that behavior, you can pass `torch_dtype` argument to `torch.float32` or `torch.bfloat16`. You can find in the sections below some example snippets and notebook. - -## Example usage - -First, you need to install [`autoawq`](https://github.com/casper-hansen/AutoAWQ) library - -```bash -pip install autoawq -``` - -```python -from transformers import AutoModelForCausalLM, AutoTokenizer - -model_id = "TheBloke/zephyr-7B-alpha-AWQ" -model = AutoModelForCausalLM.from_pretrained(model_id, device_map="cuda:0") -``` - -In case you first load your model on CPU, make sure to move it to your GPU device before using - -```python -from transformers import AutoModelForCausalLM, AutoTokenizer - -model_id = "TheBloke/zephyr-7B-alpha-AWQ" -model = AutoModelForCausalLM.from_pretrained(model_id).to("cuda:0") -``` - -### Combining AWQ and Flash Attention - -You can combine AWQ quantization with Flash Attention to get a model that is both quantized and faster. Simply load the model using `from_pretrained` and pass `use_flash_attention_2=True` argument. - -```python -from transformers import AutoModelForCausalLM, AutoTokenizer - -model = AutoModelForCausalLM.from_pretrained("TheBloke/zephyr-7B-alpha-AWQ", use_flash_attention_2=True, device_map="cuda:0") -``` - -### Benchmarks - -We performed some speed, throughput and latency benchmarks using [`optimum-benchmark`](https://github.com/huggingface/optimum-benchmark) library. - -Note at that time of writing this documentation section, the available quantization methods were: `awq`, `gptq` and `bitsandbytes`. - -The benchmark was run on a NVIDIA-A100 instance and the model used was [`TheBloke/Mistral-7B-v0.1-AWQ`](https://huggingface.co/TheBloke/Mistral-7B-v0.1-AWQ) for the AWQ model, [`TheBloke/Mistral-7B-v0.1-GPTQ`](https://huggingface.co/TheBloke/Mistral-7B-v0.1-GPTQ) for the GPTQ model. We also benchmarked it against `bitsandbytes` quantization methods and native `float16` model. Some results are shown below: - -
- -You can find the full results together with packages versions in [this link](https://github.com/huggingface/optimum-benchmark/tree/main/examples/running-mistrals). - -From the results it appears that AWQ quantization method is the fastest quantization method for inference, text generation and among the lowest peak memory for text generation. However, AWQ seems to have the largest forward latency per batch size. - -### Google colab demo - -Check out how to use this integration throughout this [Google Colab demo](https://colab.research.google.com/drive/1HzZH89yAXJaZgwJDhQj9LqSBux932BvY)! - -### AwqConfig - -[[autodoc]] AwqConfig - -## `AutoGPTQ` Integration - -🤗 Transformers has integrated `optimum` API to perform GPTQ quantization on language models. You can load and quantize your model in 8, 4, 3 or even 2 bits without a big drop of performance and faster inference speed! This is supported by most GPU hardwares. - -To learn more about the quantization model, check out: -- the [GPTQ](https://arxiv.org/pdf/2210.17323.pdf) paper -- the `optimum` [guide](https://huggingface.co/docs/optimum/llm_quantization/usage_guides/quantization) on GPTQ quantization -- the [`AutoGPTQ`](https://github.com/PanQiWei/AutoGPTQ) library used as the backend - -### Requirements - -You need to have the following requirements installed to run the code below: - -- Install latest `AutoGPTQ` library -`pip install auto-gptq` - -- Install latest `optimum` from source -`pip install git+https://github.com/huggingface/optimum.git` - -- Install latest `transformers` from source -`pip install git+https://github.com/huggingface/transformers.git` - -- Install latest `accelerate` library -`pip install --upgrade accelerate` - -Note that GPTQ integration supports for now only text models and you may encounter unexpected behaviour for vision, speech or multi-modal models. - -### Load and quantize a model - -GPTQ is a quantization method that requires weights calibration before using the quantized models. If you want to quantize transformers model from scratch, it might take some time before producing the quantized model (~5 min on a Google colab for `facebook/opt-350m` model). - -Hence, there are two different scenarios where you want to use GPTQ-quantized models. The first use case would be to load models that has been already quantized by other users that are available on the Hub, the second use case would be to quantize your model from scratch and save it or push it on the Hub so that other users can also use it. - -#### GPTQ Configuration - -In order to load and quantize a model, you need to create a [`GPTQConfig`]. You need to pass the number of `bits`, a `dataset` in order to calibrate the quantization and the `tokenizer` of the model in order prepare the dataset. - -```python -model_id = "facebook/opt-125m" -tokenizer = AutoTokenizer.from_pretrained(model_id) -gptq_config = GPTQConfig(bits=4, dataset = "c4", tokenizer=tokenizer) -``` - -Note that you can pass your own dataset as a list of string. However, it is highly recommended to use the dataset from the GPTQ paper. - -```python -dataset = ["auto-gptq is an easy-to-use model quantization library with user-friendly apis, based on GPTQ algorithm."] -quantization = GPTQConfig(bits=4, dataset = dataset, tokenizer=tokenizer) -``` - -#### Quantization - -You can quantize a model by using `from_pretrained` and setting the `quantization_config`. 
- -```python -from transformers import AutoModelForCausalLM -model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=gptq_config) - -``` -Note that you will need a GPU to quantize a model. We will put the model in the cpu and move the modules back and forth to the gpu in order to quantize them. - -If you want to maximize your gpus usage while using cpu offload, you can set `device_map = "auto"`. - -```python -from transformers import AutoModelForCausalLM -model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", quantization_config=gptq_config) -``` - -Note that disk offload is not supported. Furthermore, if you are out of memory because of the dataset, you may have to pass `max_memory` in `from_pretained`. Checkout this [guide](https://huggingface.co/docs/accelerate/usage_guides/big_modeling#designing-a-device-map) to learn more about `device_map` and `max_memory`. - - -GPTQ quantization only works for text model for now. Futhermore, the quantization process can a lot of time depending on one's hardware (175B model = 4 gpu hours using NVIDIA A100). Please check on the hub if there is not a GPTQ quantized version of the model. If not, you can submit a demand on github. -### Push quantized model to 🤗 Hub - -You can push the quantized model like any 🤗 model to Hub with `push_to_hub`. The quantization config will be saved and pushed along the model. - -```python -quantized_model.push_to_hub("opt-125m-gptq") -tokenizer.push_to_hub("opt-125m-gptq") -``` - -If you want to save your quantized model on your local machine, you can also do it with `save_pretrained`: - -```python -quantized_model.save_pretrained("opt-125m-gptq") -tokenizer.save_pretrained("opt-125m-gptq") -``` - -Note that if you have quantized your model with a `device_map`, make sure to move the entire model to one of your gpus or the `cpu` before saving it. - -```python -quantized_model.to("cpu") -quantized_model.save_pretrained("opt-125m-gptq") -``` - -### Load a quantized model from the 🤗 Hub - -You can load a quantized model from the Hub by using `from_pretrained`. -Make sure that the pushed weights are quantized, by checking that the attribute `quantization_config` is present in the model configuration object. - -```python -from transformers import AutoModelForCausalLM -model = AutoModelForCausalLM.from_pretrained("{your_username}/opt-125m-gptq") -``` - -If you want to load a model faster and without allocating more memory than needed, the `device_map` argument also works with quantized model. Make sure that you have `accelerate` library installed. - -```python -from transformers import AutoModelForCausalLM -model = AutoModelForCausalLM.from_pretrained("{your_username}/opt-125m-gptq", device_map="auto") -``` - -### Exllama kernels for faster inference - -For 4-bit model, you can use the exllama kernels in order to a faster inference speed. It is activated by default. You can change that behavior by passing `use_exllama` in [`GPTQConfig`]. This will overwrite the quantization config stored in the config. Note that you will only be able to overwrite the attributes related to the kernels. Furthermore, you need to have the entire model on gpus if you want to use exllama kernels. Also, you can perform CPU inference using Auto-GPTQ for Auto-GPTQ version > 0.4.2 by passing `device_map` = "cpu". 
For CPU inference, you have to pass `use_exllama = False` in the `GPTQConfig.` +## AwqConfig -```py -import torch -gptq_config = GPTQConfig(bits=4) -model = AutoModelForCausalLM.from_pretrained("{your_username}/opt-125m-gptq", device_map="auto", quantization_config=gptq_config) -``` - -With the release of the exllamav2 kernels, you can get faster inference speed compared to the exllama kernels. You just need to pass `exllama_config={"version": 2}` in [`GPTQConfig`]: - -```py -import torch -gptq_config = GPTQConfig(bits=4, exllama_config={"version":2}) -model = AutoModelForCausalLM.from_pretrained("{your_username}/opt-125m-gptq", device_map="auto", quantization_config = gptq_config) -``` - -Note that only 4-bit models are supported for now. Furthermore, it is recommended to deactivate the exllama kernels if you are finetuning a quantized model with peft. - -You can find the benchmark of these kernels [here](https://github.com/huggingface/optimum/tree/main/tests/benchmark#gptq-benchmark) -#### Fine-tune a quantized model - -With the official support of adapters in the Hugging Face ecosystem, you can fine-tune models that have been quantized with GPTQ. -Please have a look at [`peft`](https://github.com/huggingface/peft) library for more details. - -### Example demo - -Check out the Google Colab [notebook](https://colab.research.google.com/drive/1_TIrmuKOFhuRRiTWN94iLKUFu6ZX4ceb?usp=sharing) to learn how to quantize your model with GPTQ and how finetune the quantized model with peft. +[[autodoc]] AwqConfig -### GPTQConfig +## GPTQConfig [[autodoc]] GPTQConfig - -## `bitsandbytes` Integration - -🤗 Transformers is closely integrated with most used modules on `bitsandbytes`. You can load your model in 8-bit precision with few lines of code. -This is supported by most of the GPU hardwares since the `0.37.0` release of `bitsandbytes`. - -Learn more about the quantization method in the [LLM.int8()](https://arxiv.org/abs/2208.07339) paper, or the [blogpost](https://huggingface.co/blog/hf-bitsandbytes-integration) about the collaboration. - -Since its `0.39.0` release, you can load any model that supports `device_map` using 4-bit quantization, leveraging FP4 data type. - -If you want to quantize your own pytorch model, check out this [documentation](https://huggingface.co/docs/accelerate/main/en/usage_guides/quantization) from 🤗 Accelerate library. - -Here are the things you can do using `bitsandbytes` integration - -### General usage - -You can quantize a model by using the `load_in_8bit` or `load_in_4bit` argument when calling the [`~PreTrainedModel.from_pretrained`] method as long as your model supports loading with 🤗 Accelerate and contains `torch.nn.Linear` layers. This should work for any modality as well. - -```python -from transformers import AutoModelForCausalLM - -model_8bit = AutoModelForCausalLM.from_pretrained("facebook/opt-350m", load_in_8bit=True) -model_4bit = AutoModelForCausalLM.from_pretrained("facebook/opt-350m", load_in_4bit=True) -``` - -By default all other modules (e.g. 
`torch.nn.LayerNorm`) will be converted in `torch.float16`, but if you want to change their `dtype` you can overwrite the `torch_dtype` argument: - -```python ->>> import torch ->>> from transformers import AutoModelForCausalLM - ->>> model_8bit = AutoModelForCausalLM.from_pretrained("facebook/opt-350m", load_in_8bit=True, torch_dtype=torch.float32) ->>> model_8bit.model.decoder.layers[-1].final_layer_norm.weight.dtype -torch.float32 -``` - - -### FP4 quantization - -#### Requirements - -Make sure that you have installed the requirements below before running any of the code snippets below. - -- Latest `bitsandbytes` library -`pip install bitsandbytes>=0.39.0` - -- Install latest `accelerate` -`pip install --upgrade accelerate` - -- Install latest `transformers` -`pip install --upgrade transformers` - -#### Tips and best practices - -- **Advanced usage:** Refer to [this Google Colab notebook](https://colab.research.google.com/drive/1ge2F1QSK8Q7h0hn3YKuBCOAS0bK8E0wf) for advanced usage of 4-bit quantization with all the possible options. - -- **Faster inference with `batch_size=1` :** Since the `0.40.0` release of bitsandbytes, for `batch_size=1` you can benefit from fast inference. Check out [these release notes](https://github.com/TimDettmers/bitsandbytes/releases/tag/0.40.0) and make sure to have a version that is greater than `0.40.0` to benefit from this feature out of the box. - -- **Training:** According to [QLoRA paper](https://arxiv.org/abs/2305.14314), for training 4-bit base models (e.g. using LoRA adapters) one should use `bnb_4bit_quant_type='nf4'`. - -- **Inference:** For inference, `bnb_4bit_quant_type` does not have a huge impact on the performance. However for consistency with the model's weights, make sure you use the same `bnb_4bit_compute_dtype` and `torch_dtype` arguments. - -#### Load a large model in 4bit - -By using `load_in_4bit=True` when calling the `.from_pretrained` method, you can divide your memory use by 4 (roughly). - -```python -# pip install transformers accelerate bitsandbytes -from transformers import AutoModelForCausalLM, AutoTokenizer - -model_id = "bigscience/bloom-1b7" - -tokenizer = AutoTokenizer.from_pretrained(model_id) -model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", load_in_4bit=True) -``` - - - -Note that once a model has been loaded in 4-bit it is currently not possible to push the quantized weights on the Hub. Note also that you cannot train 4-bit weights as this is not supported yet. However you can use 4-bit models to train extra parameters, this will be covered in the next section. - - - -### Load a large model in 8bit - -You can load a model by roughly halving the memory requirements by using `load_in_8bit=True` argument when calling `.from_pretrained` method - - -```python -# pip install transformers accelerate bitsandbytes -from transformers import AutoModelForCausalLM, AutoTokenizer - -model_id = "bigscience/bloom-1b7" - -tokenizer = AutoTokenizer.from_pretrained(model_id) -model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", load_in_8bit=True) -``` - -Then, use your model as you would usually use a [`PreTrainedModel`]. - -You can check the memory footprint of your model with `get_memory_footprint` method. - -```python -print(model.get_memory_footprint()) -``` - -With this integration we were able to load large models on smaller devices and run them without any issue. 
- - - -Note that once a model has been loaded in 8-bit it is currently not possible to push the quantized weights on the Hub except if you use the latest `transformers` and `bitsandbytes`. Note also that you cannot train 8-bit weights as this is not supported yet. However you can use 8-bit models to train extra parameters, this will be covered in the next section. -Note also that `device_map` is optional but setting `device_map = 'auto'` is prefered for inference as it will dispatch efficiently the model on the available ressources. - - - -#### Advanced use cases - -Here we will cover some advanced use cases you can perform with FP4 quantization - -##### Change the compute dtype - -The compute dtype is used to change the dtype that will be used during computation. For example, hidden states could be in `float32` but computation can be set to bf16 for speedups. By default, the compute dtype is set to `float32`. - -```python -import torch -from transformers import BitsAndBytesConfig - -quantization_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16) -``` - -##### Using NF4 (Normal Float 4) data type - -You can also use the NF4 data type, which is a new 4bit datatype adapted for weights that have been initialized using a normal distribution. For that run: - -```python -from transformers import BitsAndBytesConfig - -nf4_config = BitsAndBytesConfig( - load_in_4bit=True, - bnb_4bit_quant_type="nf4", -) - -model_nf4 = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=nf4_config) -``` - -##### Use nested quantization for more memory efficient inference - -We also advise users to use the nested quantization technique. This saves more memory at no additional performance - from our empirical observations, this enables fine-tuning llama-13b model on an NVIDIA-T4 16GB with a sequence length of 1024, batch size of 1 and gradient accumulation steps of 4. - -```python -from transformers import BitsAndBytesConfig - -double_quant_config = BitsAndBytesConfig( - load_in_4bit=True, - bnb_4bit_use_double_quant=True, -) - -model_double_quant = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=double_quant_config) -``` - - -### Push quantized models on the 🤗 Hub - -You can push a quantized model on the Hub by naively using `push_to_hub` method. This will first push the quantization configuration file, then push the quantized model weights. -Make sure to use `bitsandbytes>0.37.2` (at this time of writing, we tested it on `bitsandbytes==0.38.0.post1`) to be able to use this feature. - -```python -from transformers import AutoModelForCausalLM, AutoTokenizer - -model = AutoModelForCausalLM.from_pretrained("bigscience/bloom-560m", device_map="auto", load_in_8bit=True) -tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom-560m") - -model.push_to_hub("bloom-560m-8bit") -``` - - - -Pushing 8bit models on the Hub is strongely encouraged for large models. This will allow the community to benefit from the memory footprint reduction and loading for example large models on a Google Colab. - - - -### Load a quantized model from the 🤗 Hub - -You can load a quantized model from the Hub by using `from_pretrained` method. Make sure that the pushed weights are quantized, by checking that the attribute `quantization_config` is present in the model configuration object. 
- -```python -from transformers import AutoModelForCausalLM, AutoTokenizer - -model = AutoModelForCausalLM.from_pretrained("{your_username}/bloom-560m-8bit", device_map="auto") -``` - -Note that in this case, you don't need to specify the arguments `load_in_8bit=True`, but you need to make sure that `bitsandbytes` and `accelerate` are installed. -Note also that `device_map` is optional but setting `device_map = 'auto'` is prefered for inference as it will dispatch efficiently the model on the available ressources. - -### Advanced use cases - -This section is intended to advanced users, that want to explore what it is possible to do beyond loading and running 8-bit models. - -#### Offload between `cpu` and `gpu` - -One of the advanced use case of this is being able to load a model and dispatch the weights between `CPU` and `GPU`. Note that the weights that will be dispatched on CPU **will not** be converted in 8-bit, thus kept in `float32`. This feature is intended for users that want to fit a very large model and dispatch the model between GPU and CPU. - -First, load a [`BitsAndBytesConfig`] from `transformers` and set the attribute `llm_int8_enable_fp32_cpu_offload` to `True`: - -```python -from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig - -quantization_config = BitsAndBytesConfig(llm_int8_enable_fp32_cpu_offload=True) -``` - -Let's say you want to load `bigscience/bloom-1b7` model, and you have just enough GPU RAM to fit the entire model except the `lm_head`. Therefore write a custom device_map as follows: - -```python -device_map = { - "transformer.word_embeddings": 0, - "transformer.word_embeddings_layernorm": 0, - "lm_head": "cpu", - "transformer.h": 0, - "transformer.ln_f": 0, -} -``` - -And load your model as follows: -```python -model_8bit = AutoModelForCausalLM.from_pretrained( - "bigscience/bloom-1b7", - device_map=device_map, - quantization_config=quantization_config, -) -``` - -And that's it! Enjoy your model! - -#### Play with `llm_int8_threshold` - -You can play with the `llm_int8_threshold` argument to change the threshold of the outliers. An "outlier" is a hidden state value that is greater than a certain threshold. -This corresponds to the outlier threshold for outlier detection as described in `LLM.int8()` paper. Any hidden states value that is above this threshold will be considered an outlier and the operation on those values will be done in fp16. Values are usually normally distributed, that is, most values are in the range [-3.5, 3.5], but there are some exceptional systematic outliers that are very differently distributed for large models. These outliers are often in the interval [-60, -6] or [6, 60]. Int8 quantization works well for values of magnitude ~5, but beyond that, there is a significant performance penalty. A good default threshold is 6, but a lower threshold might be needed for more unstable models (small models, fine-tuning). -This argument can impact the inference speed of the model. We suggest to play with this parameter to find which one is the best for your use case. 
- -```python -from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig - -model_id = "bigscience/bloom-1b7" - -quantization_config = BitsAndBytesConfig( - llm_int8_threshold=10, -) - -model_8bit = AutoModelForCausalLM.from_pretrained( - model_id, - device_map=device_map, - quantization_config=quantization_config, -) -tokenizer = AutoTokenizer.from_pretrained(model_id) -``` - -#### Skip the conversion of some modules - -Some models has several modules that needs to be not converted in 8-bit to ensure stability. For example Jukebox model has several `lm_head` modules that should be skipped. Play with `llm_int8_skip_modules` - -```python -from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig - -model_id = "bigscience/bloom-1b7" - -quantization_config = BitsAndBytesConfig( - llm_int8_skip_modules=["lm_head"], -) - -model_8bit = AutoModelForCausalLM.from_pretrained( - model_id, - device_map=device_map, - quantization_config=quantization_config, -) -tokenizer = AutoTokenizer.from_pretrained(model_id) -``` - -#### Fine-tune a model that has been loaded in 8-bit - -With the official support of adapters in the Hugging Face ecosystem, you can fine-tune models that have been loaded in 8-bit. -This enables fine-tuning large models such as `flan-t5-large` or `facebook/opt-6.7b` in a single google Colab. Please have a look at [`peft`](https://github.com/huggingface/peft) library for more details. - -Note that you don't need to pass `device_map` when loading the model for training. It will automatically load your model on your GPU. You can also set the device map to a specific device if needed (e.g. `cuda:0`, `0`, `torch.device('cuda:0')`). Please note that `device_map=auto` should be used for inference only. - -### BitsAndBytesConfig +## BitsAndBytesConfig [[autodoc]] BitsAndBytesConfig - - -## Quantization with 🤗 `optimum` - -Please have a look at [Optimum documentation](https://huggingface.co/docs/optimum/index) to learn more about quantization methods that are supported by `optimum` and see if these are applicable for your use case. diff --git a/docs/source/en/quantization.md b/docs/source/en/quantization.md new file mode 100644 index 00000000000000..60903e36ad5968 --- /dev/null +++ b/docs/source/en/quantization.md @@ -0,0 +1,471 @@ + + +# Quantization + +Quantization techniques focus on representing data with less information while also trying to not lose too much accuracy. This often means converting a data type to represent the same information with fewer bits. For example, if your model weights are stored as 32-bit floating points and they're quantized to 16-bit floating points, this halves the model size which makes it easier to store and reduces memory-usage. Lower precision can also speedup inference because it takes less time to perform calculations with fewer bits. + +Transformers supports several quantization schemes to help you run inference with large language models (LLMs) and finetune adapters on quantized models. This guide will show you how to use Activation-aware Weight Quantization (AWQ), AutoGPTQ, and bitsandbytes. + +## AWQ + + + +Try AWQ quantization with this [notebook](https://colab.research.google.com/drive/1HzZH89yAXJaZgwJDhQj9LqSBux932BvY)! + + + +[Activation-aware Weight Quantization (AWQ)](https://hf.co/papers/2306.00978) doesn't quantize all the weights in a model, and instead, it preserves a small percentage of weights that are important for LLM performance. 
This significantly reduces quantization loss such that you can run models in 4-bit precision without experiencing any performance degradation.
+
+There are several libraries for quantizing models with the AWQ algorithm, such as [llm-awq](https://github.com/mit-han-lab/llm-awq), [autoawq](https://github.com/casper-hansen/AutoAWQ) or [optimum-intel](https://huggingface.co/docs/optimum/main/en/intel/optimization_inc). Transformers supports loading models quantized with the llm-awq and autoawq libraries. This guide will show you how to load models quantized with autoawq, but the process is similar for llm-awq quantized models.
+
+Make sure you have autoawq installed:
+
+```bash
+pip install autoawq
+```
+
+AWQ-quantized models can be identified by checking the `quantization_config` attribute in the model's [config.json](https://huggingface.co/TheBloke/zephyr-7B-alpha-AWQ/blob/main/config.json) file:
+
+```json
+{
+  "_name_or_path": "/workspace/process/huggingfaceh4_zephyr-7b-alpha/source",
+  "architectures": [
+    "MistralForCausalLM"
+  ],
+  ...
+  ...
+  ...
+  "quantization_config": {
+    "quant_method": "awq",
+    "zero_point": true,
+    "group_size": 128,
+    "bits": 4,
+    "version": "gemm"
+  }
+}
+```
+
+A quantized model is loaded with the [`~PreTrainedModel.from_pretrained`] method. If you loaded your model on the CPU, make sure to move it to a GPU device first. Use the `device_map` parameter to specify where to place the model:
+
+```py
+from transformers import AutoModelForCausalLM, AutoTokenizer
+
+model_id = "TheBloke/zephyr-7B-alpha-AWQ"
+model = AutoModelForCausalLM.from_pretrained(model_id, device_map="cuda:0")
+```
+
+Loading an AWQ-quantized model automatically sets other weights to fp16 by default for performance reasons. If you want to load these other weights in a different format, use the `torch_dtype` parameter:
+
+```py
+import torch
+from transformers import AutoModelForCausalLM, AutoTokenizer
+
+model_id = "TheBloke/zephyr-7B-alpha-AWQ"
+model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float32)
+```
+
+AWQ quantization can also be combined with [FlashAttention-2](perf_infer_gpu_one#flashattention-2) to further accelerate inference:
+
+```py
+from transformers import AutoModelForCausalLM, AutoTokenizer
+
+model = AutoModelForCausalLM.from_pretrained("TheBloke/zephyr-7B-alpha-AWQ", use_flash_attention_2=True, device_map="cuda:0")
+```
+
+## AutoGPTQ
+
+<Tip>
+
+Try GPTQ quantization with PEFT in this [notebook](https://colab.research.google.com/drive/1_TIrmuKOFhuRRiTWN94iLKUFu6ZX4ceb?usp=sharing) and learn more about its details in this [blog post](https://huggingface.co/blog/gptq-integration)!
+
+</Tip>
+
+The [AutoGPTQ](https://github.com/PanQiWei/AutoGPTQ) library implements the GPTQ algorithm, a post-training quantization technique where each row of the weight matrix is quantized independently to find a version of the weights that minimizes the error. These weights are quantized to int4, but they're restored to fp16 on the fly during inference. This can reduce memory usage by 4x because the int4 weights are dequantized in a fused kernel rather than in a GPU's global memory, and you can also expect a speedup in inference because using a lower bitwidth takes less time to communicate.
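+
+The following is a minimal, hypothetical sketch (plain per-row rounding, not the error-minimizing GPTQ procedure) that only illustrates the storage idea: weights are kept as low-bit integers plus a per-row scale, and restored to fp16 when they are needed.
+
+```py
+import torch
+
+# Toy fp16 "weight matrix" standing in for a real model layer.
+weight = torch.randn(4, 8, dtype=torch.float16)
+
+# Quantize each row independently to 4-bit integers in [-8, 7]
+# (real GPTQ packs two 4-bit values per byte and minimizes the rounding error).
+scale = weight.abs().amax(dim=1, keepdim=True).float() / 7
+q_weight = torch.clamp(torch.round(weight.float() / scale), -8, 7).to(torch.int8)
+
+# At inference time the stored integers are dequantized back to fp16 on the fly.
+dequantized = (q_weight.float() * scale).to(torch.float16)
+print((weight - dequantized).abs().max())
+```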
+
+Before you begin, make sure the following libraries are installed:
+
+```bash
+pip install auto-gptq
+pip install git+https://github.com/huggingface/optimum.git
+pip install git+https://github.com/huggingface/transformers.git
+pip install --upgrade accelerate
+```
+
+To quantize a model (currently only supported for text models), you need to create a [`GPTQConfig`] and set the number of bits to quantize to, a dataset to calibrate the weights for quantization, and a tokenizer to prepare the dataset.
+
+```py
+from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig
+
+model_id = "facebook/opt-125m"
+tokenizer = AutoTokenizer.from_pretrained(model_id)
+gptq_config = GPTQConfig(bits=4, dataset="c4", tokenizer=tokenizer)
+```
+
+You could also pass your own dataset as a list of strings, but it is highly recommended to use the same dataset from the GPTQ paper.
+
+```py
+dataset = ["auto-gptq is an easy-to-use model quantization library with user-friendly apis, based on GPTQ algorithm."]
+gptq_config = GPTQConfig(bits=4, dataset=dataset, tokenizer=tokenizer)
+```
+
+Load a model to quantize and pass the `gptq_config` to the [`~AutoModelForCausalLM.from_pretrained`] method. Set `device_map="auto"` to automatically offload the model to a CPU to help fit the model in memory, and allow the model modules to be moved between the CPU and GPU for quantization.
+
+```py
+quantized_model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", quantization_config=gptq_config)
+```
+
+Disk offloading is not supported, so if you're running out of memory because the dataset is too large, try passing the `max_memory` parameter to allocate the amount of memory to use on your devices (GPU and CPU):
+
+```py
+quantized_model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", max_memory={0: "30GiB", 1: "46GiB", "cpu": "30GiB"}, quantization_config=gptq_config)
+```
+
+<Tip>
+
+Depending on your hardware, it can take some time to quantize a model from scratch. It can take ~5 minutes to quantize the [facebook/opt-350m](https://huggingface.co/facebook/opt-350m) model on a free-tier Google Colab GPU, but it'll take ~4 hours to quantize a 175B parameter model on a NVIDIA A100. Before you quantize a model, it is a good idea to check the Hub to see if a GPTQ-quantized version of the model already exists.
+
+</Tip>
+
+Once your model is quantized, you can push the model and tokenizer to the Hub where they can be easily shared and accessed. Use the [`~PreTrainedModel.push_to_hub`] method to save the [`GPTQConfig`]:
+
+```py
+quantized_model.push_to_hub("opt-125m-gptq")
+tokenizer.push_to_hub("opt-125m-gptq")
+```
+
+You could also save your quantized model locally with the [`~PreTrainedModel.save_pretrained`] method. If the model was quantized with the `device_map` parameter, make sure to move the entire model to a GPU or CPU before saving it. For example, to save the model on a CPU:
+
+```py
+quantized_model.save_pretrained("opt-125m-gptq")
+tokenizer.save_pretrained("opt-125m-gptq")
+
+# if quantized with device_map set
+quantized_model.to("cpu")
+quantized_model.save_pretrained("opt-125m-gptq")
+```
+
+Reload a quantized model with the [`~PreTrainedModel.from_pretrained`] method, and set `device_map="auto"` to automatically distribute the model on all available GPUs to load the model faster without using more memory than needed.
+ +```py +from transformers import AutoModelForCausalLM + +model = AutoModelForCausalLM.from_pretrained("{your_username}/opt-125m-gptq", device_map="auto") +``` + +### ExLlama + +[ExLlama](https://github.com/turboderp/exllama) is a Python/C++/CUDA implementation of the [Llama](model_doc/llama) model that is designed for faster inference with 4-bit GPTQ weights (check out these [benchmarks](https://github.com/huggingface/optimum/tree/main/tests/benchmark#gptq-benchmark)). The ExLlama kernel is activated by default when you create a [`GPTQConfig`] object. To boost inference speed even further, use the [ExLlamaV2](https://github.com/turboderp/exllamav2) kernels by configuring the `exllama_config` parameter: + +```py +import torch +from transformers import AutoModelForCausalLM, GPTQConfig + +gptq_config = GPTQConfig(bits=4, exllama_config={"version":2}) +model = AutoModelForCausalLM.from_pretrained("{your_username}/opt-125m-gptq", device_map="auto", quantization_config=gptq_config) +``` + + + +Only 4-bit models are supported, and we recommend deactivating the ExLlama kernels if you're finetuning a quantized model with PEFT. + + + +The ExLlama kernels are only supported when the entire model is on the GPU. If you're doing inference on a CPU with AutoGPTQ (version > 0.4.2), then you'll need to disable the ExLlama kernel. This overwrites the attributes related to the ExLlama kernels in the quantization config of the config.json file. + +```py +import torch +from transformers import AutoModelForCausalLM, GPTQConfig +gptq_config = GPTQConfig(bits=4, use_exllama=False) +model = AutoModelForCausalLM.from_pretrained("{your_username}/opt-125m-gptq", device_map="cpu", quantization_config=gptq_config) +``` + +## bitsandbytes + +[bitsandbytes](https://github.com/TimDettmers/bitsandbytes) is the easiest option for quantizing a model to 8 and 4-bit. 8-bit quantization multiplies outliers in fp16 with non-outliers in int8, converts the non-outlier values back to fp16, and then adds them together to return the weights in fp16. This reduces the degradative effect outlier values have on a model's performance. 4-bit quantization compresses a model even further, and it is commonly used with [QLoRA](https://hf.co/papers/2305.14314) to finetune quantized LLMs. + +To use bitsandbytes, make sure you have the following libraries installed: + + + + +```bash +pip install transformers accelerate bitsandbytes>0.37.0 +``` + + + + +```bash +pip install bitsandbytes>=0.39.0 +pip install --upgrade accelerate +pip install --upgrade transformers +``` + + + + +Now you can quantize a model with the `load_in_8bit` or `load_in_4bit` parameters in the [`~PreTrainedModel.from_pretrained`] method. This works for any model in any modality, as long as it supports loading with Accelerate and contains `torch.nn.Linear` layers. + + + + +Quantizing a model in 8-bit halves the memory-usage, and for large models, set `device_map="auto"` to efficiently use the GPUs available: + +```py +from transformers import AutoModelForCausalLM + +model_8bit = AutoModelForCausalLM.from_pretrained("bigscience/bloom-1b7", device_map="auto", load_in_8bit=True) +``` + +By default, all the other modules such as `torch.nn.LayerNorm` are converted to `torch.float16`. 
You can change the data type of these modules with the `torch_dtype` parameter if you want: + +```py +import torch +from transformers import AutoModelForCausalLM + +model_8bit = AutoModelForCausalLM.from_pretrained("facebook/opt-350m", load_in_8bit=True, torch_dtype=torch.float32) +model_8bit.model.decoder.layers[-1].final_layer_norm.weight.dtype +``` + +Once a model is quantized to 8-bit, you can't push the quantized weights to the Hub unless you're using the latest version of Transformers and bitsandbytes. If you have the latest versions, then you can push the 8-bit model to the Hub with the [`~PreTrainedModel.push_to_hub`] method. The quantization config.json file is pushed first, followed by the quantized model weights. + +```py +from transformers import AutoModelForCausalLM, AutoTokenizer + +model = AutoModelForCausalLM.from_pretrained("bigscience/bloom-560m", device_map="auto", load_in_8bit=True) +tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom-560m") + +model.push_to_hub("bloom-560m-8bit") +``` + + + + +Quantizing a model in 4-bit reduces your memory-usage by 4x, and for large models, set `device_map="auto"` to efficiently use the GPUs available: + +```py +from transformers import AutoModelForCausalLM + +model_4bit = AutoModelForCausalLM.from_pretrained("bigscience/bloom-1b7", device_map="auto", load_in_4bit=True) +``` + +By default, all the other modules such as `torch.nn.LayerNorm` are converted to `torch.float16`. You can change the data type of these modules with the `torch_dtype` parameter if you want: + +```py +import torch +from transformers import AutoModelForCausalLM + +model_4bit = AutoModelForCausalLM.from_pretrained("facebook/opt-350m", load_in_4bit=True, torch_dtype=torch.float32) +model_4bit.model.decoder.layers[-1].final_layer_norm.weight.dtype +``` + +Once a model is quantized to 4-bit, you can't push the quantized weights to the Hub. + + + + + + +Training with 8-bit and 4-bit weights are only supported for training *extra* parameters. + + + +You can check your memory footprint with the `get_memory_footprint` method: + +```py +print(model.get_memory_footprint()) +``` + +Quantized models can be loaded from the [`~PreTrainedModel.from_pretrained`] method without needing to specify the `load_in_8bit` or `load_in_4bit` parameters: + +```py +from transformers import AutoModelForCausalLM, AutoTokenizer + +model = AutoModelForCausalLM.from_pretrained("{your_username}/bloom-560m-8bit", device_map="auto") +``` + +### 8-bit + + + +Learn more about the details of 8-bit quantization in this [blog post](https://huggingface.co/blog/hf-bitsandbytes-integration)! + + + +This section explores some of the specific features of 8-bit models, such as offloading, outlier thresholds, skipping module conversion, and finetuning. + +#### Offloading + +8-bit models can offload weights between the CPU and GPU to support fitting very large models into memory. The weights dispatched to the CPU are actually stored in **float32**, and aren't converted to 8-bit. 
For example, to enable offloading for the [bigscience/bloom-1b7](https://huggingface.co/bigscience/bloom-1b7) model, start by creating a [`BitsAndBytesConfig`]:
+
+```py
+from transformers import AutoModelForCausalLM, BitsAndBytesConfig
+
+quantization_config = BitsAndBytesConfig(llm_int8_enable_fp32_cpu_offload=True)
+```
+
+Design a custom device map to fit everything on your GPU except for the `lm_head`, which you'll dispatch to the CPU:
+
+```py
+device_map = {
+    "transformer.word_embeddings": 0,
+    "transformer.word_embeddings_layernorm": 0,
+    "lm_head": "cpu",
+    "transformer.h": 0,
+    "transformer.ln_f": 0,
+}
+```
+
+Now load your model with the custom `device_map` and `quantization_config`:
+
+```py
+model_8bit = AutoModelForCausalLM.from_pretrained(
+    "bigscience/bloom-1b7",
+    device_map=device_map,
+    quantization_config=quantization_config,
+)
+```
+
+#### Outlier threshold
+
+An "outlier" is a hidden state value greater than a certain threshold, and these values are computed in fp16. While the values are usually normally distributed ([-3.5, 3.5]), this distribution can be very different for large models ([-60, -6] or [6, 60]). 8-bit quantization works well for values of magnitude ~5, but beyond that, there is a significant performance penalty. A good default threshold value is 6, but a lower threshold may be needed for more unstable models (small models or finetuning).
+
+To find the best threshold for your model, we recommend experimenting with the `llm_int8_threshold` parameter in [`BitsAndBytesConfig`]:
+
+```py
+from transformers import AutoModelForCausalLM, BitsAndBytesConfig
+
+model_id = "bigscience/bloom-1b7"
+
+quantization_config = BitsAndBytesConfig(
+    llm_int8_threshold=10,
+)
+
+model_8bit = AutoModelForCausalLM.from_pretrained(
+    model_id,
+    device_map=device_map,
+    quantization_config=quantization_config,
+)
+```
+
+#### Skip module conversion
+
+For some models, like [Jukebox](model_doc/jukebox), you don't need to quantize every module to 8-bit because, for certain modules, this can actually cause instability. With Jukebox, there are several `lm_head` modules that should be skipped using the `llm_int8_skip_modules` parameter in [`BitsAndBytesConfig`]:
+
+```py
+from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
+
+model_id = "bigscience/bloom-1b7"
+
+quantization_config = BitsAndBytesConfig(
+    llm_int8_skip_modules=["lm_head"],
+)
+
+model_8bit = AutoModelForCausalLM.from_pretrained(
+    model_id,
+    device_map="auto",
+    quantization_config=quantization_config,
+)
+```
+
+#### Finetuning
+
+With the [PEFT](https://github.com/huggingface/peft) library, you can finetune large models like [flan-t5-large](https://huggingface.co/google/flan-t5-large) and [facebook/opt-6.7b](https://huggingface.co/facebook/opt-6.7b) with 8-bit quantization. You don't need to pass the `device_map` parameter for training because it'll automatically load your model on a GPU. However, you can still customize the device map with the `device_map` parameter if you want to (`device_map="auto"` should only be used for inference).
+
+### 4-bit
+
+<Tip>
+
+Try 4-bit quantization in this [notebook](https://colab.research.google.com/drive/1ge2F1QSK8Q7h0hn3YKuBCOAS0bK8E0wf) and learn more about its details in this [blog post](https://huggingface.co/blog/4bit-transformers-bitsandbytes).
+
+</Tip>
+
+This section explores some of the specific features of 4-bit models, such as changing the compute data type, using the Normal Float 4 (NF4) data type, and using nested quantization.
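+
+As a preview before each option is described below, here is a hypothetical example of how these 4-bit options can be combined in a single [`BitsAndBytesConfig`] (the model id is only a placeholder):
+
+```py
+import torch
+from transformers import AutoModelForCausalLM, BitsAndBytesConfig
+
+# One possible 4-bit setup: NF4 weights, bf16 compute, and nested (double) quantization.
+nf4_double_quant_config = BitsAndBytesConfig(
+    load_in_4bit=True,
+    bnb_4bit_quant_type="nf4",
+    bnb_4bit_compute_dtype=torch.bfloat16,
+    bnb_4bit_use_double_quant=True,
+)
+
+model_4bit = AutoModelForCausalLM.from_pretrained(
+    "facebook/opt-350m",  # placeholder model id
+    quantization_config=nf4_double_quant_config,
+)
+```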
+
+#### Compute data type
+
+To speed up computation, you can change the data type from float32 (the default value) to bf16 using the `bnb_4bit_compute_dtype` parameter in [`BitsAndBytesConfig`]:
+
+```py
+import torch
+from transformers import BitsAndBytesConfig
+
+quantization_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)
+```
+
+#### Normal Float 4 (NF4)
+
+NF4 is a 4-bit data type from the [QLoRA](https://hf.co/papers/2305.14314) paper, adapted for weights initialized from a normal distribution. You should use NF4 for training 4-bit base models. This can be configured with the `bnb_4bit_quant_type` parameter in the [`BitsAndBytesConfig`]:
+
+```py
+from transformers import AutoModelForCausalLM, BitsAndBytesConfig
+
+nf4_config = BitsAndBytesConfig(
+    load_in_4bit=True,
+    bnb_4bit_quant_type="nf4",
+)
+
+model_nf4 = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=nf4_config)
+```
+
+For inference, the `bnb_4bit_quant_type` does not have a huge impact on performance. However, to remain consistent with the model weights, you should use the same `bnb_4bit_compute_dtype` and `torch_dtype` values.
+
+#### Nested quantization
+
+Nested quantization is a technique that can save additional memory at no additional performance cost. This feature performs a second quantization of the already quantized weights to save an additional 0.4 bits/parameter. For example, with nested quantization, you can finetune a [Llama-13b](https://huggingface.co/meta-llama/Llama-2-13b) model on a 16GB NVIDIA T4 GPU with a sequence length of 1024, a batch size of 1, and 4 gradient accumulation steps.
+
+```py
+from transformers import AutoModelForCausalLM, BitsAndBytesConfig
+
+double_quant_config = BitsAndBytesConfig(
+    load_in_4bit=True,
+    bnb_4bit_use_double_quant=True,
+)
+
+model_double_quant = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-13b", quantization_config=double_quant_config)
+```
+
+## Optimum
+
+The [Optimum](https://huggingface.co/docs/optimum/index) library supports quantization for Intel, Furiosa, ONNX Runtime, GPTQ, and lower-level PyTorch quantization functions. Consider using Optimum for quantization if you're using specific and optimized hardware like Intel CPUs, Furiosa NPUs or a model accelerator like ONNX Runtime.
+
+## Benchmarks
+
+To compare the speed, throughput, and latency of each quantization scheme, check the following benchmarks obtained from the [optimum-benchmark](https://github.com/huggingface/optimum-benchmark) library. The benchmark was run on an NVIDIA A100 for the [TheBloke/Mistral-7B-v0.1-AWQ](https://huggingface.co/TheBloke/Mistral-7B-v0.1-AWQ) and [TheBloke/Mistral-7B-v0.1-GPTQ](https://huggingface.co/TheBloke/Mistral-7B-v0.1-GPTQ) models. These were also tested against the bitsandbytes quantization methods as well as a native fp16 model.
+
+<!-- benchmark plots: forward peak memory/batch size, generate peak memory/batch size, generate throughput/batch size, forward latency/batch size -->
+
+The benchmarks indicate that AWQ quantization is the fastest for inference and text generation, and it has the lowest peak memory usage for text generation. However, AWQ has the largest forward latency per batch size. For a more detailed discussion about the pros and cons of each quantization method, read the [Overview of natively supported quantization schemes in 🤗 Transformers](https://huggingface.co/blog/overview-quantization-transformers) blog post.