
Support directly loading gptq models from huggingface #9391

Merged
7 commits merged into intel-analytics:main on Nov 14, 2023

Conversation

@yangw1234 (Contributor) commented Nov 9, 2023

Description

Support directly loading GPTQ models from Hugging Face.

Many models on Hugging Face are published in GPTQ format. It would be nice to load them directly with bigdl-llm.

Install:

BUILD_CUDA_EXT=0 pip install git+https://github.com/PanQiWei/AutoGPTQ.git@1de9ab6
pip install optimum==0.14.0

Usage:

import torch

from bigdl.llm.transformers import AutoModelForCausalLM
from transformers import GPTQConfig

quantization_config = GPTQConfig(
    bits=4,
    use_exllama=False,
)

# Load the model in 4 bit,
# which converts the relevant layers in the model into INT4 format
model = AutoModelForCausalLM.from_pretrained(model_path,  # repo id or local path of the GPTQ checkpoint
                                             load_in_4bit=True,  # loads into asym_int4; if using load_in_low_bit instead, it must be "asym_int4"
                                             torch_dtype=torch.float,
                                             trust_remote_code=True,
                                             quantization_config=quantization_config)
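
For completeness, a minimal generation sketch on top of the loaded model (the tokenizer comes from the same GPTQ repo; the prompt and token count are arbitrary):

```python
from transformers import AutoTokenizer

# tokenizer shipped with the GPTQ checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

prompt = "What is AI?"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

# greedy generation on the INT4-converted model
output = model.generate(input_ids, max_new_tokens=32)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```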

Limitations:

- Only works with 4-bit quantization and desc_act=False (act order disabled).
- The GPU version is really slow; investigation needed.
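
To see whether a given checkpoint fits these constraints before loading, its quantization settings can be inspected via transformers (a sketch; `model_path` is the GPTQ repo id or local path):

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained(model_path, trust_remote_code=True)

# the quantization settings are stored in the checkpoint's config.json;
# depending on the transformers version this is a plain dict or a config object
q_config = config.quantization_config
print(q_config)  # expect bits=4 and desc_act=False for this PR to handle it
```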

Perf: https://github.com/analytics-zoo/nano/issues/738

invalidInputError(q_config["bits"] == 4,
                  "Only 4-bit gptq is supported in bigdl-llm.")
invalidInputError(q_config["desc_act"] is False,
                  "Only desc_act=False is supported in bigdl-llm.")
Contributor:

Also check group_size; it should be a multiple of 64.
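
A minimal sketch of the suggested check, written against the `invalidInputError` / `q_config` names from the diff excerpt above (the 64 comes straight from this comment; the merged code derives the block size via `get_ggml_qk_size` instead):

```python
# suggested alongside the existing bits/desc_act validation;
# q_config is the GPTQ quantization config dict read from the checkpoint
invalidInputError(q_config["group_size"] % 64 == 0,
                  "Only group_size that is a multiple of 64 is supported in bigdl-llm.")
```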

Contributor Author:

fixed

@@ -89,6 +89,18 @@ def from_pretrained(cls,
optimize_model = kwargs.pop("optimize_model", True)

if load_in_4bit or load_in_low_bit:

if config_dict.get("quantization_config", None) is not None:
Contributor:

do we need to add it to Python Doc?

Contributor Author:

Added, and passing quantization_config is no longer required.
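
In other words, after this change the usage from the description can presumably shrink to something like the sketch below (no explicit `GPTQConfig`; a default `GPTQConfig(bits=4, use_exllama=False)` is created internally, as a later diff excerpt shows):

```python
import torch

from bigdl.llm.transformers import AutoModelForCausalLM

# model_path is the repo id or local path of a GPTQ checkpoint;
# the GPTQ settings are detected from its config, so no quantization_config is passed
model = AutoModelForCausalLM.from_pretrained(model_path,
                                             load_in_4bit=True,
                                             torch_dtype=torch.float,
                                             trust_remote_code=True)
```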

@@ -0,0 +1,73 @@
# Llama2
Contributor:

  1. move to bigdl/python/llm/example/CPU/HF-Transformers-AutoModels/Advanced-Quantizations/GPTQ
  2. # GPTQ
  3. This example shows how to directly run 4-bit GPTQ models using BigDL-LLM on Intel CPU

Contributor Author:

fixed

mp_group=mp_group,
)

device_type = module.qweight.data.device.type
Contributor:

is it used?

Contributor Author:

fixed

invalidInputError(False,
                  (f"group_size must be divisible by "
                   f"{get_ggml_qk_size(load_in_low_bit)}."))
if user_quantization_config is not None:
Contributor:

do we want the user to pass user_quantization_config?

Contributor Author:

I think not letting the user pass user_quantization_config might be a better choice.

else:
    from transformers import GPTQConfig
    user_quantization_config = GPTQConfig(bits=4, use_exllama=False)
kwargs["quantization_config"] = user_quantization_config
Contributor:

Does save/load low bit work? Do we need to remove quantization_config in save_low_bit?

Contributor Author:

Save/load low bit works. It seems our load_low_bit will ignore quantization_config, but I removed it anyway.
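
For context, a rough sketch of the save/load low-bit round trip being discussed, assuming bigdl-llm's existing `save_low_bit` / `load_low_bit` API (the save directory is a placeholder):

```python
from bigdl.llm.transformers import AutoModelForCausalLM

# persist the already-converted INT4 weights; per the discussion above,
# the GPTQ quantization_config is stripped from the saved config
model.save_low_bit("./llama2-gptq-int4")

# reload later without needing the original GPTQ checkpoint
model = AutoModelForCausalLM.load_low_bit("./llama2-gptq-int4")
```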

@jason-dai (Contributor) left a comment:

LGTM

E.g. on Linux,
```bash
# set BigDL-Nano env variables
source bigdl-nano-init
```
Contributor:

why nano variables? we have bigdl-llm-init now.

Contributor Author:

copied from existing examples. will change that.

Contributor Author:

It seems most of our examples still use bigdl-nano-init. How about we leave it here and change them together in the future?

Contributor:

OK - please open an issue


@cyita mentioned this pull request Nov 14, 2023
@yangw1234 (Contributor Author):

The failed test is irrelevant, and the unit tests on Arc lack the resources to run. I'll merge this PR first to unblock further development.

@yangw1234 merged commit 282b0df into intel-analytics:main on Nov 14, 2023
34 of 36 checks passed
pip install bigdl-llm[all] # install bigdl-llm with 'all' option
pip install transformers==4.34.0
BUILD_CUDA_EXT=0 pip install git+https://github.com/PanQiWei/AutoGPTQ.git@1de9ab6
pip install optimum==0.14.0
Contributor:

0.14.0 → 1.14.0

liu-shaojun pushed a commit that referenced this pull request Mar 25, 2024
* Support directly loading GPTQ models from huggingface

* fix style

* fix tests

* change example structure

* address comments

* fix style

* address comments