Support transformers-like api for woq quantization #1987
Merged
Changes from all commits
Commits (88)
48b0d23 pipeline pass (Kaihui-intel)
a496483 update import path (Kaihui-intel)
23e7428 add examples (Kaihui-intel)
40df805 add ut (Kaihui-intel)
8cc0ba5 [pre-commit.ci] auto fixes from pre-commit.com hooks (pre-commit-ci[bot])
4224610 add gpu example (Kaihui-intel)
4e70c9f Merge branch 'kaihui/transformers_api' of https://github.com/intel/ne… (Kaihui-intel)
0cc3c5d [pre-commit.ci] auto fixes from pre-commit.com hooks (pre-commit-ci[bot])
005ab85 update utility (Kaihui-intel)
2429a8a [pre-commit.ci] auto fixes from pre-commit.com hooks (pre-commit-ci[bot])
2d5b152 update torch fp8 mapping (Kaihui-intel)
1b4bae3 Merge branch 'kaihui/transformers_api' of https://github.com/intel/ne… (Kaihui-intel)
30959fd update float8_e4m3fnuz (Kaihui-intel)
d53ebc8 use_ipex=False (Kaihui-intel)
1f04ee2 [pre-commit.ci] auto fixes from pre-commit.com hooks (pre-commit-ci[bot])
913b953 update float8_e4m3fnuz (Kaihui-intel)
b81b9fc [pre-commit.ci] auto fixes from pre-commit.com hooks (pre-commit-ci[bot])
f7d7003 Merge branch 'kaihui/transformers_api' of https://github.com/intel/ne… (Kaihui-intel)
af3698e reset fp8 mapping (Kaihui-intel)
ae2d42a add evaluation (Kaihui-intel)
036d438 update evaluation (Kaihui-intel)
7c6b646 remove use_neural_speed (Kaihui-intel)
94c177f [pre-commit.ci] auto fixes from pre-commit.com hooks (pre-commit-ci[bot])
50aa98c update evaluation (Kaihui-intel)
fd63d0f [pre-commit.ci] auto fixes from pre-commit.com hooks (pre-commit-ci[bot])
ed862d6 Merge branch 'kaihui/transformers_api' of https://github.com/intel/ne… (Kaihui-intel)
6ed6656 [pre-commit.ci] auto fixes from pre-commit.com hooks (pre-commit-ci[bot])
cbef232 remove use_neural_speed from eval (Kaihui-intel)
3186f25 update copyright (Kaihui-intel)
6b1412e follow master (Kaihui-intel)
6f6da99 rebase master (Kaihui-intel)
5c592f9 [pre-commit.ci] auto fixes from pre-commit.com hooks (pre-commit-ci[bot])
3cdff01 update models (Kaihui-intel)
ac13d1a [pre-commit.ci] auto fixes from pre-commit.com hooks (pre-commit-ci[bot])
4dee955 update models (Kaihui-intel)
a9d42c2 Merge branch 'kaihui/transformers_api' of https://github.com/intel/ne… (Kaihui-intel)
53658bf remove inc sq (Kaihui-intel)
22ec60e [pre-commit.ci] auto fixes from pre-commit.com hooks (pre-commit-ci[bot])
5b9b1cc remove qat/static/dynamic (Kaihui-intel)
7c5e6cc [pre-commit.ci] auto fixes from pre-commit.com hooks (pre-commit-ci[bot])
fcc315e update ut (Kaihui-intel)
37659c2 support layer wise (Kaihui-intel)
5933e98 update model_path (Kaihui-intel)
5aeedb5 [pre-commit.ci] auto fixes from pre-commit.com hooks (pre-commit-ci[bot])
62b18a3 Merge branch 'kaihui/transformers_api' of https://github.com/intel/ne… (Kaihui-intel)
6cbb8ab [pre-commit.ci] auto fixes from pre-commit.com hooks (pre-commit-ci[bot])
d902909 rebase (Kaihui-intel)
32fff22 [pre-commit.ci] auto fixes from pre-commit.com hooks (pre-commit-ci[bot])
c89b0c0 fix xpu (Kaihui-intel)
4ef08f7 remove use_ipex (Kaihui-intel)
e81ce43 add quant_lm_head (Kaihui-intel)
7bdd4f9 remove unused code (Kaihui-intel)
2b16b7a clean utility.py (Kaihui-intel)
984f1eb [pre-commit.ci] auto fixes from pre-commit.com hooks (pre-commit-ci[bot])
19d1e41 remove neural_speed from lm eval (Kaihui-intel)
80b51e8 rm import inc version (Kaihui-intel)
1d5a735 [pre-commit.ci] auto fixes from pre-commit.com hooks (pre-commit-ci[bot])
6850d6e fix import (Kaihui-intel)
4c9ef3c fix weiight dtype (Kaihui-intel)
fc0e930 [pre-commit.ci] auto fixes from pre-commit.com hooks (pre-commit-ci[bot])
ca170d0 Merge branch 'kaihui/transformers_api' of https://github.com/intel/ne… (Kaihui-intel)
f6f9a1c [pre-commit.ci] auto fixes from pre-commit.com hooks (pre-commit-ci[bot])
aa6798e Merge branch 'kaihui/transformers_api' of https://github.com/intel/ne… (Kaihui-intel)
8fbdfaa [pre-commit.ci] auto fixes from pre-commit.com hooks (pre-commit-ci[bot])
3766706 update import set_module (Kaihui-intel)
240d3c6 [pre-commit.ci] auto fixes from pre-commit.com hooks (pre-commit-ci[bot])
ee5b931 update requirements (Kaihui-intel)
6f16f8d remove itrex (Kaihui-intel)
c254db1 Merge branch 'kaihui/transformers_api' of https://github.com/intel/ne… (Kaihui-intel)
ab72f1c update optimized xpu model list (Kaihui-intel)
21173dc update README (Kaihui-intel)
943dca7 remove config & add save/load ut (Kaihui-intel)
38017f4 [pre-commit.ci] auto fixes from pre-commit.com hooks (pre-commit-ci[bot])
eb3112c improve basic ut coverage (Kaihui-intel)
fec937c Merge branch 'kaihui/transformers_api' of https://github.com/intel/ne… (Kaihui-intel)
a1d3eda skip lm eval json pre-commit (Kaihui-intel)
b1467a6 fix redefine torch (Kaihui-intel)
9a79844 [pre-commit.ci] auto fixes from pre-commit.com hooks (pre-commit-ci[bot])
14732cc clean code (Kaihui-intel)
10c8c2b revert is_xpu_available (Kaihui-intel)
758a364 remove absorb_to_dict (Kaihui-intel)
89eb5af [pre-commit.ci] auto fixes from pre-commit.com hooks (pre-commit-ci[bot])
32cccb6 Merge branch 'kaihui/transformers_api' of https://github.com/intel/ne… (Kaihui-intel)
45627dd add absorb & update xpu transformers==4.38.1 (Kaihui-intel)
6ae0389 update transformers version to README (Kaihui-intel)
9fa6262 remove modelhub (Kaihui-intel)
8f2d381 Update neural_compressor/transformers/models/modeling_auto.py (changwangss)
610c558 bits int = 4 (Kaihui-intel)
168 changes: 168 additions & 0 deletions
...nguage-modeling/quantization/transformers/weight_only/text-generation/README.md
@@ -0,0 +1,168 @@
# Step-by-Step
We provide a Transformers-like API for model compression using `WeightOnlyQuant` with the `Rtn/Awq/Teq/GPTQ/AutoRound` algorithms. In addition, Intel Extension for PyTorch (IPEX) can be used to accelerate the quantized model.
We also provide the inference benchmarking script `run_generation.py` for large language models; the default search algorithm is beam search with `num_beams = 4`. [Here](./llm_quantization_recipes.md) are some validated models with well-optimized accuracy and performance; more models are in progress.

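The command-line scripts in this example (`run_generate_cpu_woq.py`, `run_generation_gpu_woq.py`) drive this API for you. As a quick orientation, here is a minimal sketch of what the Transformers-like WOQ API looks like from Python; the exact class names and arguments (e.g. `RtnConfig(bits=4)` imported from `neural_compressor.transformers`) are assumptions based on this PR, so please check the shipped module for the final interface.

```python
# Minimal sketch of the Transformers-like weight-only quantization API.
# NOTE: the import path and config names are assumptions based on this PR.
from transformers import AutoTokenizer
from neural_compressor.transformers import AutoModelForCausalLM, RtnConfig

model_name = "EleutherAI/gpt-j-6b"
woq_config = RtnConfig(bits=4)  # RTN with 4-bit weights (assumed default)

tokenizer = AutoTokenizer.from_pretrained(model_name)
# Quantization happens inside from_pretrained when quantization_config is given.
model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=woq_config)

inputs = tokenizer("Once upon a time,", return_tensors="pt")
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=32)[0]))

# The quantized model can be saved and later reloaded with from_pretrained.
model.save_pretrained("./saved_results")
tokenizer.save_pretrained("./saved_results")
```
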
# Quantization for CPU device

## Prerequisite
### Create Environment
Python 3.9 or higher is required due to the [text evaluation library](https://github.com/EleutherAI/lm-evaluation-harness/tree/master) limitation. The dependent packages are listed in the requirements file; we recommend creating the environment as follows.

```bash
pip install -r requirements_cpu_woq.txt
```
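
After installation, a quick sanity check of the environment can help catch version issues early; a sketch, assuming `torch` and `neural_compressor` are pulled in by the requirements file above:

```python
# Optional sanity check for the CPU environment.
import sys
import torch
import neural_compressor

print("python:", sys.version.split()[0])                 # should be >= 3.9
print("torch:", torch.__version__)
print("neural_compressor:", neural_compressor.__version__)
```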

### Run
#### Performance
```shell
# fp32
OMP_NUM_THREADS=<physical cores num> numactl -m <node N> -C <cpu list> python run_generate_cpu_woq.py \
    --model <MODEL_NAME_OR_PATH> \
    --batch_size 1 \
    --benchmark

# quantize, then benchmark.
OMP_NUM_THREADS=<physical cores num> numactl -m <node N> -C <cpu list> python run_generate_cpu_woq.py \
    --model <MODEL_NAME_OR_PATH> \
    --woq \
    --woq_algo <ALGORITHM_NAME> \ # "Rtn" (default), "Awq", "Teq", "GPTQ", and "AutoRound" are provided.
    --output_dir <WOQ_MODEL_SAVE_PATH> \ # Default is "./saved_results"
    --batch_size 1 \
    --benchmark

# load the saved WOQ quantized model and benchmark.
OMP_NUM_THREADS=<physical cores num> numactl -m <node N> -C <cpu list> python run_generate_cpu_woq.py \
    --model <WOQ_MODEL_SAVE_PATH> \
    --benchmark

# load a WOQ model from the Hugging Face Hub and benchmark.
OMP_NUM_THREADS=<physical cores num> numactl -m <node N> -C <cpu list> python run_generate_cpu_woq.py \
    --model <MODEL_NAME_OR_PATH> \
    --benchmark

```
#### Accuracy
The accuracy validation is based on [lm_evaluation_harness](https://github.com/EleutherAI/lm-evaluation-harness/blob/v0.4.3/lm_eval/__main__.py).
```shell
# fp32
python run_generate_cpu_woq.py \
    --model <MODEL_NAME_OR_PATH> \
    --accuracy \
    --tasks lambada_openai,piqa,hellaswag \ # note: no spaces between task names.
    --device cpu \
    --batch_size 56

# quantize, then evaluate accuracy.
python run_generate_cpu_woq.py \
    --model <MODEL_NAME_OR_PATH> \
    --woq \
    --woq_algo <ALGORITHM_NAME> \ # "Rtn" (default), "Awq", "Teq", "GPTQ", and "AutoRound" are provided.
    --output_dir <WOQ_MODEL_SAVE_PATH> \
    --accuracy \
    --tasks lambada_openai,piqa,hellaswag \ # note: no spaces between task names.
    --batch_size 56

# load a WOQ model quantized by itrex and evaluate accuracy.
python run_generate_cpu_woq.py \
    --model <WOQ_MODEL_SAVE_PATH> \
    --accuracy \
    --tasks lambada_openai,piqa,hellaswag \ # note: no spaces between task names.
    --batch_size 56

# load a WOQ model quantized by itrex and evaluate accuracy with neuralspeed.
# only models quantized with the "Awq", "GPTQ", or "AutoRound" algorithms are supported.
python run_generate_cpu_woq.py \
    --model <WOQ_MODEL_SAVE_PATH> \
    --accuracy \
    --tasks lambada_openai,piqa,hellaswag \ # note: no spaces between task names.
    --device cpu \
    --batch_size 56


# load a WOQ model from the Hugging Face Hub and evaluate accuracy.
python run_generate_cpu_woq.py \
    --model <MODEL_NAME_OR_PATH> \
    --accuracy \
    --tasks lambada_openai,piqa,hellaswag \ # note: no spaces between task names.
    --device cpu \
    --batch_size 56

# load a WOQ model from the Hugging Face Hub and evaluate accuracy with neuralspeed.
python run_generate_cpu_woq.py \
    --model <MODEL_NAME_OR_PATH> \
    --accuracy \
    --tasks lambada_openai,piqa,hellaswag \ # note: no spaces between task names.
    --device cpu \
    --batch_size 56

```
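
The `--accuracy` path wraps lm-evaluation-harness around the quantized model. Roughly, it is equivalent to the following sketch; the way the WOQ model is loaded (`neural_compressor.transformers`) is an assumption based on this PR, and the script's own evaluation code remains authoritative.

```python
# Sketch: evaluate a saved WOQ model with lm-evaluation-harness (v0.4.x API).
# NOTE: loading via neural_compressor.transformers is an assumption based on this PR.
import lm_eval
from lm_eval.models.huggingface import HFLM
from transformers import AutoTokenizer
from neural_compressor.transformers import AutoModelForCausalLM

saved_dir = "./saved_results"  # <WOQ_MODEL_SAVE_PATH>
model = AutoModelForCausalLM.from_pretrained(saved_dir)
tokenizer = AutoTokenizer.from_pretrained(saved_dir)

# Wrap the already-loaded model so the harness does not reload it itself.
lm = HFLM(pretrained=model, tokenizer=tokenizer, batch_size=56)
results = lm_eval.simple_evaluate(model=lm, tasks=["lambada_openai", "piqa", "hellaswag"])
print(results["results"])
```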

# Quantization for GPU device
>**Note**:
> 1. The default search algorithm is beam search with `num_beams = 1`.
> 2. [ipex.optimize_transformers](https://github.com/intel/intel-extension-for-pytorch/blob/v2.1.10%2Bxpu/docs/tutorials/llm/llm_optimize_transformers.md) supports optimized inference of the model types "gptj," "mistral," "qwen," and "llama" to achieve high performance and accuracy. Accurate inference is still ensured for other model types as well.
> 3. We provide the compression technology `WeightOnlyQuant` with the `Rtn/GPTQ/AutoRound` algorithms; `load_in_4bit` and `load_in_8bit` also work on the Intel GPU device.

## Prerequisite
### Dependencies
The Intel Extension for PyTorch dependencies are included in the oneAPI package, so oneAPI must be installed before Intel Extension for PyTorch. Please refer to the [Installation Guide](https://intel.github.io/intel-extension-for-pytorch/index.html#installation?platform=gpu&version=v2.1.10%2Bxpu) to install oneAPI into the "/opt/intel" folder.

### Create Environment
PyTorch and Intel Extension for PyTorch versions greater than 2.1 are required for Intel GPU. Python 3.9 or higher is required due to the [text evaluation library](https://github.com/EleutherAI/lm-evaluation-harness/tree/master) limitation. The dependent packages are listed in requirements_GPU.txt; we recommend creating the environment as follows. Intel Extension for PyTorch currently has to be installed from source; weight-only quantization will be added in its next release.

>**Note**: please install transformers==4.40.2.

```bash
pip install -r requirements_GPU.txt
pip install transformers==4.38.1 # llama uses 4.38.1
source /opt/intel/oneapi/setvars.sh
git clone https://github.com/intel/intel-extension-for-pytorch.git ipex-gpu
cd ipex-gpu
git submodule update --init --recursive
export USE_AOT_DEVLIST='pvc,ats-m150'
export BUILD_WITH_CPU=OFF

python setup.py install
```
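
Once the build finishes, you can verify that the XPU backend is visible to PyTorch; a minimal check, assuming the source build above succeeded:

```python
# Verify the Intel GPU (XPU) backend after installing IPEX from source.
import torch
import intel_extension_for_pytorch as ipex  # registers the XPU backend

print("torch:", torch.__version__)
print("ipex:", ipex.__version__)
print("XPU available:", torch.xpu.is_available())
print("XPU device:", torch.xpu.get_device_name(0) if torch.xpu.is_available() else "none")
```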

## Run
The following commands show how to use it.

### 1. Performance
```bash
# fp16
python run_generation_gpu_woq.py \
    --model EleutherAI/gpt-j-6b \
    --benchmark

# weight-only quantization
python run_generation_gpu_woq.py \
    --model EleutherAI/gpt-j-6b \
    --woq \
    --woq_algo <ALGORITHM_NAME> \ # "Rtn" (default), "GPTQ", and "AutoRound" are provided.
    --benchmark
```
> Note: If your device memory is not enough, please quantize and save the model first, then rerun the example loading the saved model as shown below. If your device memory is sufficient, skip the instructions below and simply quantize and run inference in one step.
```bash
# First step: quantize and save the model
python run_generation_gpu_woq.py \
    --model EleutherAI/gpt-j-6b \
    --woq \ # the default quantization algorithm is Rtn
    --woq_algo <ALGORITHM_NAME> \ # "Rtn" (default), "GPTQ", and "AutoRound" are provided.
    --output_dir "saved_dir"

# Second step: load the model and run inference
python run_generation_gpu_woq.py \
    --model "saved_dir" \
    --benchmark
```
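
The second step can also be reproduced directly from Python. Below is a sketch of loading the saved WOQ model onto the XPU device; the import path and the `.to("xpu")` placement are assumptions based on this PR rather than a confirmed recipe.

```python
# Sketch: load the WOQ model saved in "saved_dir" and run it on the Intel GPU (XPU).
# NOTE: import path and device handling are assumptions based on this PR.
import torch
import intel_extension_for_pytorch as ipex  # registers the XPU backend
from transformers import AutoTokenizer
from neural_compressor.transformers import AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("saved_dir")
model = AutoModelForCausalLM.from_pretrained("saved_dir").to("xpu")

inputs = tokenizer("Once upon a time,", return_tensors="pt").to("xpu")
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output[0]))
```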

### 2. Accuracy
```bash
# quantized model by following the steps above
python run_generation_gpu_woq.py \
    --model "saved_dir" \
    --accuracy \
    --tasks "lambada_openai"
```