IPEX-LLM fails to run the quantized Yuan2.0-M32 model on Intel ARC #12082

Open
jianweimama opened this issue Sep 14, 2024 · 1 comment

Comments

@jianweimama

The Yuan2.0-M32 development team analyzed the current mainstream quantization schemes, weighed model compression against accuracy loss, and ultimately chose the GPTQ quantization method, using AutoGPTQ as the quantization framework.
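
For reference, quantizing a model with AutoGPTQ generally follows the pattern sketched below. This is an illustrative sketch of the AutoGPTQ API only, not the Yuan team's actual quantization script; the model paths and calibration text are placeholders.

# Illustrative AutoGPTQ quantization sketch (not the Yuan team's actual script);
# the paths and calibration text are placeholders.
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

model_path = "/path/to/Yuan2-M32-HF"        # placeholder: full-precision checkpoint
out_dir = "/path/to/Yuan2-M32-GPTQ-int4"    # placeholder: quantized output directory

tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# 4-bit GPTQ with group size 128 (matching the gptq_model-4bit-128g file naming)
quantize_config = BaseQuantizeConfig(bits=4, group_size=128, desc_act=False)

model = AutoGPTQForCausalLM.from_pretrained(model_path, quantize_config,
                                            trust_remote_code=True)

# A real run would use a proper calibration dataset; a single sample is shown here.
examples = [tokenizer("calibration text goes here", return_tensors="pt")]
model.quantize(examples)
model.save_quantized(out_dir, use_safetensors=True)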


Model: Yuan2-M32-HF-INT4 (https://blog.csdn.net/2401_82700030/article/details/141469514)
Container: intelanalytics/ipex-llm-serving-xpu-vllm-0.5.4-experimental:2.2.0b1

Test steps:
Log into the container:

docker exec -ti arc_vllm-new-2 bash

cd /benchmark/all-in-one/

vim config.yaml

config.yaml configuration:
[screenshot: config.yaml settings (not captured)]
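
For reference, a config.yaml for the ipex-llm all-in-one benchmark typically looks roughly like the sketch below. The exact settings from the lost screenshot are unknown; the repo id, model hub path, and values here are illustrative assumptions only.

# Illustrative sketch only; the actual values from the screenshot are unknown.
repo_id:
  - 'Yuan2-M32-HF-INT4'           # assumption: the quantized model under test
local_model_hub: '/llm/models'    # assumption: path to the local model directory
warm_up: 1
num_trials: 3
num_beams: 1
low_bit: 'sym_int4'
batch_size: 1
in_out_pairs:
  - '32-32'
  - '1024-128'
test_api:
  - 'transformer_int4_gpu'        # run with ipex-llm low-bit on the Intel GPU
cpu_embedding: False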

Run run-arc.sh:

Running it fails with an error; the resulting log is shown below.
Results log:
[screenshots: error log output (not captured)]

@hzjane
Contributor

hzjane commented Sep 14, 2024

I tried to reproduce it and hit the same issue. Here is what I found:

  1. The official vLLM does not support the Yuan model yet.
  2. This model's quantization method may not yet be supported for loading by ipex-llm. The reference loading code from the upstream GPTQ docs is below.
# Reference: https://github.com/IEIT-Yuan/Yuan2.0-M32/blob/b403a2beb2746c0c923b4eb936fe1e2560c83b19/docs/README_GPTQ_CN.md#3-gptq%E9%87%8F%E5%8C%96%E6%A8%A1%E5%9E%8B%E7%9A%84%E6%8E%A8%E7%90%86
from transformers import LlamaTokenizer
from auto_gptq import AutoGPTQForCausalLM

# Quantized weights: gptq_model-4bit-128g.safetensors (files 0-2)
quantized_model_dir = "/mnt/beegfs2/Yuan2-M32-GPTQ-int4"
tokenizer = LlamaTokenizer.from_pretrained(quantized_model_dir, add_eos_token=False, add_bos_token=False, eos_token='<eod>')
model = AutoGPTQForCausalLM.from_quantized(quantized_model_dir, device="cuda:0", trust_remote_code=True)
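
For comparison, the usual ipex-llm loading path on Intel ARC (XPU) looks roughly like the sketch below. This is illustrative only; as noted above, it is not confirmed that this GPTQ checkpoint can be loaded this way, which is exactly the open question in this issue.

# Illustrative sketch of the standard ipex-llm XPU loading path; not confirmed
# to work for this GPTQ checkpoint (see the notes above).
from transformers import LlamaTokenizer
from ipex_llm.transformers import AutoModelForCausalLM

model_path = "/mnt/beegfs2/Yuan2-M32-GPTQ-int4"  # same directory as above

tokenizer = LlamaTokenizer.from_pretrained(model_path, add_eos_token=False,
                                           add_bos_token=False, eos_token='<eod>')

# ipex-llm converts the weights to its own low-bit format at load time.
model = AutoModelForCausalLM.from_pretrained(model_path,
                                             load_in_4bit=True,
                                             trust_remote_code=True)
model = model.to('xpu')  # place the model on the Intel ARC GPU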
