IPEX-LLM fails to run the quantized Yuan2.0-M32 model on Intel ARC #12082

Open
jianweimama opened this issue Sep 14, 2024 · 1 comment

Comments

@jianweimama

The Yuan2.0-M32 development team analyzed the current mainstream quantization schemes, weighed model compression against accuracy loss, and ultimately chose the GPTQ quantization method, using AutoGPTQ as the quantization framework.
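
For reference, quantizing a model with AutoGPTQ generally follows the pattern sketched below. This is an illustrative sketch of the AutoGPTQ API only, not the Yuan team's actual quantization script; the model paths and calibration text are placeholders.

# Illustrative AutoGPTQ quantization sketch (not the Yuan team's actual script);
# the paths and calibration text are placeholders.
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

model_path = "/path/to/Yuan2-M32-HF"        # placeholder: full-precision checkpoint
out_dir = "/path/to/Yuan2-M32-GPTQ-int4"    # placeholder: quantized output directory

tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# 4-bit GPTQ with group size 128 (matching the gptq_model-4bit-128g file naming)
quantize_config = BaseQuantizeConfig(bits=4, group_size=128, desc_act=False)

model = AutoGPTQForCausalLM.from_pretrained(model_path, quantize_config,
                                            trust_remote_code=True)

# A real run would use a proper calibration dataset; a single sample is shown here.
examples = [tokenizer("calibration text goes here", return_tensors="pt")]
model.quantize(examples)
model.save_quantized(out_dir, use_safetensors=True)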


Model: Yuan2-M32-HF-INT4 (https://blog.csdn.net/2401_82700030/article/details/141469514)
Container: intelanalytics/ipex-llm-serving-xpu-vllm-0.5.4-experimental:2.2.0b1

Test steps:
Log into the container:

docker exec -ti arc_vllm-new-2 bash

cd /benchmark/all-in-one/

vim config.yaml

config.yaml configuration:
[screenshot: config.yaml settings (not captured)]
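
For reference, a config.yaml for the ipex-llm all-in-one benchmark typically looks roughly like the sketch below. The exact settings from the lost screenshot are unknown; the repo id, model hub path, and values here are illustrative assumptions only.

# Illustrative sketch only; the actual values from the screenshot are unknown.
repo_id:
  - 'Yuan2-M32-HF-INT4'           # assumption: the quantized model under test
local_model_hub: '/llm/models'    # assumption: path to the local model directory
warm_up: 1
num_trials: 3
num_beams: 1
low_bit: 'sym_int4'
batch_size: 1
in_out_pairs:
  - '32-32'
  - '1024-128'
test_api:
  - 'transformer_int4_gpu'        # run with ipex-llm low-bit on the Intel GPU
cpu_embedding: False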

Run run-arc.sh:

Running it fails with an error; the resulting log is shown below.
Results log:
[screenshots: error log output (not captured)]

@hzjane
Contributor

hzjane commented Sep 14, 2024

I tried to reproduce it and hit the same issue. Here is what I found:

  1. The official vLLM does not support the Yuan model yet.
  2. This model's quantization method may not yet be supported for loading by ipex-llm. The reference loading code from the upstream GPTQ docs is below.
# Reference: https://github.com/IEIT-Yuan/Yuan2.0-M32/blob/b403a2beb2746c0c923b4eb936fe1e2560c83b19/docs/README_GPTQ_CN.md#3-gptq%E9%87%8F%E5%8C%96%E6%A8%A1%E5%9E%8B%E7%9A%84%E6%8E%A8%E7%90%86
from transformers import LlamaTokenizer
from auto_gptq import AutoGPTQForCausalLM

# Quantized weights: gptq_model-4bit-128g.safetensors (files 0-2)
quantized_model_dir = "/mnt/beegfs2/Yuan2-M32-GPTQ-int4"
tokenizer = LlamaTokenizer.from_pretrained(quantized_model_dir, add_eos_token=False, add_bos_token=False, eos_token='<eod>')
model = AutoGPTQForCausalLM.from_quantized(quantized_model_dir, device="cuda:0", trust_remote_code=True)
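
For comparison, the usual ipex-llm loading path on Intel ARC (XPU) looks roughly like the sketch below. This is illustrative only; as noted above, it is not confirmed that this GPTQ checkpoint can be loaded this way, which is exactly the open question in this issue.

# Illustrative sketch of the standard ipex-llm XPU loading path; not confirmed
# to work for this GPTQ checkpoint (see the notes above).
from transformers import LlamaTokenizer
from ipex_llm.transformers import AutoModelForCausalLM

model_path = "/mnt/beegfs2/Yuan2-M32-GPTQ-int4"  # same directory as above

tokenizer = LlamaTokenizer.from_pretrained(model_path, add_eos_token=False,
                                           add_bos_token=False, eos_token='<eod>')

# ipex-llm converts the weights to its own low-bit format at load time.
model = AutoModelForCausalLM.from_pretrained(model_path,
                                             load_in_4bit=True,
                                             trust_remote_code=True)
model = model.to('xpu')  # place the model on the Intel ARC GPU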
