
Please add a version that is able to run with 2/4/8 tensor parallel #231

Closed
aabbccddwasd opened this issue Sep 19, 2024 · 29 comments

@aabbccddwasd

When using the officially released quantized models, I found that tensor parallelism does not work.
The cause is that intermediate_size is 29568; divided by the group size (128) this gives 231 groups, and 231 is not divisible by 2, 4, or 8. vLLM raises an error because of this, so tensor parallelism is impossible.

Please either quantize with a different group size so that intermediate_size / groupsize is divisible by 2, 4, and 8, or modify the model slightly so that intermediate_size becomes 29696 as in Qwen2.5; tensor parallelism would then work normally with group-size-128 quantization.

If neither of the above is feasible, please explain how to run these quantized models with tensor parallelism.
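
For reference, the constraint described above can be written as a tiny check (numbers taken from this issue; the helper name is only illustrative, this is not vLLM's actual code):

# Group-count divisibility check behind the error described above (illustrative only).
def tp_compatible(intermediate_size: int, group_size: int, tp: int) -> bool:
    groups = intermediate_size // group_size          # number of GPTQ quantization groups
    return intermediate_size % group_size == 0 and groups % tp == 0

print([tp_compatible(29568, 128, tp) for tp in (2, 4, 8)])  # [False, False, False] -> 231 groups
print([tp_compatible(29696, 128, tp) for tp in (2, 4, 8)])  # [True, True, True]    -> 232 groups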

@qingwu11

Running into the same problem; hoping for a fix.

osoctz commented Sep 21, 2024

Change intermediate_size in config.json to 29184; I don't yet know what the side effects are. @aabbccddwasd
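
For anyone trying this workaround, a minimal sketch of the edit (the checkpoint path is a placeholder; back up config.json before changing it):

# Sketch of the config.json workaround described above; the path is a placeholder.
import json

cfg_path = "Qwen2-VL-72B-Instruct-GPTQ-Int4/config.json"
with open(cfg_path) as f:
    cfg = json.load(f)

cfg["intermediate_size"] = 29184   # 29184 / 128 = 228 groups, divisible by 2 and 4
with open(cfg_path, "w") as f:
    json.dump(cfg, f, indent=2)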

bash99 commented Sep 21, 2024

Running into the same problem; hoping for a fix.

bash99 commented Sep 21, 2024

> Change intermediate_size in config.json to 29184; I don't yet know what the side effects are. @aabbccddwasd

The official recommendation is padding: https://qwen.readthedocs.io/en/latest/quantization/gptq.html
Doesn't your change amount to simply dropping a small number of parameters?

I'm on 2x V100-32G, which is enough for me, so I only reduced by 128, to 29440; that gives 230 groups, a multiple of 2.
This approach runs, and the results look normal.

@aabbccddwasd (Author)

> The official recommendation is padding: https://qwen.readthedocs.io/en/latest/quantization/gptq.html
> Doesn't your change amount to simply dropping a small number of parameters?

Doesn't padding also ignore parameters? Or can you choose to add meaningless empty parameters here?

@aabbccddwasd (Author)

> I'm on 2x V100-32G, which is enough for me, so I only reduced by 128, to 29440; that gives 230 groups, a multiple of 2.
> This approach runs, and the results look normal.

I looked at the docs, and I think padding means adding parameters. Did you perhaps misstate it?

bash99 commented Sep 22, 2024

> I looked at the docs, and I think padding means adding parameters. Did you perhaps misstate it?

The original value is 29568; the docs zero-pad it to 29696, whereas your change reduces it to 29184.
So the parameter count really does shrink, i.e. a very small number of parameters is ignored?

@aabbccddwasd (Author)

> I'm on 2x V100-32G, which is enough for me, so I only reduced by 128, to 29440; that gives 230 groups, a multiple of 2.

Shouldn't that sentence read
"I'm on 2x V100-32G, which is enough for me, so I only added 128 = 29568, which gives 232, a multiple of 2."?

@QwertyJack

intermediate_size=29440 runs, but intermediate_size=29568 fails to start:

RuntimeError: start (14848) + length (14848) exceeds dimension size (29568).

@QwertyJack

> intermediate_size=29440 runs, but intermediate_size=29568 fails to start:
>
> RuntimeError: start (14848) + length (14848) exceeds dimension size (29568).

To add:

  1. Shrink or pad? Shrinking runs directly, whereas padding requires re-quantization; in practice I could not see any obvious quality drop from shrinking.
  2. Shrink/pad by how much? It depends on tp: intermediate_size must be an integer multiple of 128 × tp. With two GPUs (tp=2), shrinking or padding by 128 is enough; with eight GPUs (tp=8), shrink by 896 or pad by 128.

In short, I still hope the official team releases a version padded to 29696, since the calibration dataset was never published.
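
The arithmetic in point 2 can be spelled out as a small helper (illustrative only; the function name is made up here):

# Nearest intermediate_size values aligned to group_size * tp,
# mirroring the shrink/pad amounts discussed in this thread (illustrative sketch).
def nearest_compatible(size: int = 29568, group_size: int = 128, tp: int = 2):
    block = group_size * tp
    lower = (size // block) * block                    # shrink to the previous multiple
    upper = lower + block if size % block else size    # pad up to the next multiple
    return lower, upper

for tp in (2, 4, 8):
    print(tp, nearest_compatible(tp=tp))
# 2 (29440, 29696)   4 (29184, 29696)   8 (28672, 29696)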

@whitesay

Same problem here. I'm deploying on 8x 3090, and by this logic it seems I can only shrink it to 28672.

@whitesay

With 8 GPUs, shrinking to 28672 works, but increasing to 29696 gives a similar "exceeds dimension size" error.

bash99 commented Sep 23, 2024

> Shouldn't that sentence read "I'm on 2x V100-32G, which is enough for me, so I only added 128 = 29568, which gives 232, a multiple of 2."?

QwertyJack explained it clearly above: without padding plus re-quantization you cannot add; shrinking = ignoring a small number of parameters.

@QwertyJack

What a coincidence: this issue is also #231.

@Cherryjingyao

> The official recommendation is padding: https://qwen.readthedocs.io/en/latest/quantization/gptq.html
> Doesn't your change amount to simply dropping a small number of parameters?
>
> I'm on 2x V100-32G, which is enough for me, so I only reduced by 128, to 29440; that gives 230 groups, a multiple of 2. This approach runs, and the results look normal.

What command are you running? After I modified config.json, I still run out of GPU memory with VLLM_WORKER_MULTIPROC_METHOD=spawn cuda_visible_devices=0,2 python -m vllm.entrypoints.openai.api_server --served-model-name Qwen2-VL-7B-Instruct --model /data/LLM_model/Qwen2-VL-2B-Instruct/Qwen2-VL-72B-Instruct-GPTQ-Int4 --tensor-parallel-size 2. Each card is 40 GB.

osoctz commented Sep 23, 2024

> I'm on 2x V100-32G, which is enough for me, so I only reduced by 128, to 29440; that gives 230 groups, a multiple of 2. This approach runs, and the results look normal.

Yes; I'm on 4 GPUs, with --tensor-parallel-size 4.

@liuyanyi

Tensor parallel has to wait for the official team to update the weights. If you need something now, try pipeline parallel first: the PR I submitted today has already been merged, so the vLLM main branch works.

bash99 commented Sep 23, 2024

> What command are you running? After I modified config.json, I still run out of GPU memory; each card is 40 GB.

VLLM_WORKER_MULTIPROC_METHOD=spawn CUDA_VISIBLE_DEVICES=0,1 python -m vllm.entrypoints.openai.api_server --served-model-name Qwen2-VL-72B-Instruct-GPTQ-Int4 --model Qwen2-VL-72B-Instruct-GPTQ-Int4 --port 7865 --dtype half --trust-remote-code --kv-cache-dtype fp8 -q gptq --disable-log-requests --gpu-memory-utilization 0.998 --max-model-len 8192 --enforce_eager -tp 2

V100-32G * 2

niaoyu commented Sep 23, 2024

> Tensor parallel has to wait for the official team to update the weights. If you need something now, try pipeline parallel first: the PR I submitted today has already been merged, so the vLLM main branch works.

Thanks so much!
By the way, if I want to use 4x A10 to run Qwen2-VL-72B-Instruct-GPTQ-Int4, the command would be:

 python3 -m vllm.entrypoints.openai.api_server \
      --model Qwen/Qwen2.5-72B-Instruct-GPTQ-Int4 \
      --pipeline-parallel-size 4 \
      --gpu-memory-utilization 0.95 \
      --max-num-seqs 16 \
      --max-model-len 4096 \
      --tokenizer-mode auto \
      --disable-log-requests

Is this correct?

Also, could you give us the proper version of transformers? There seems to be a bug in the latest transformers, mentioned in vllm-project/vllm#7905 (comment):

> We should not use the latest transformer library
> pip install git+https://github.com/huggingface/transformers@21fac7abba2a37fae86106f87fcf9974fd1e3830

But I see you mentioned "qwen2 vl need the latest transformer library" in vllm-project/vllm#8696. So the bug is fixed, right?

@aabbccddwasd (Author)

So when will the tensor-parallel-capable version be uploaded?

@liuyanyi

> But I see you mentioned "qwen2 vl need the latest transformer library" in vllm-project/vllm#8696. So the bug is fixed, right?

It's not clear in the PR; you should use the specific version, not the latest.

kq-chen (Collaborator) commented Sep 24, 2024

Based on the suggestion from @aabbccddwasd, we have adjusted the intermediate size to 29696 and re-quantized the model. The updated 72B AWQ/GPTQ-Int4/GPTQ-Int8 checkpoints have been uploaded to Hugging Face. To utilize the new checkpoints, please download them again from Hugging Face.

You can use the following command to perform inference on the quantized 72B model with VLLM tensor-parallel:

Server:

VLLM_WORKER_MULTIPROC_METHOD=spawn python -m vllm.entrypoints.openai.api_server \
  --served-model-name qwen2vl \
  --model Qwen/Qwen2-VL-72B-Instruct-AWQ \
  --tensor-parallel-size 4 \
  --max_num_seqs 16

Client:

curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
    "model": "qwen2vl",
    "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": [
        {"type": "image_url", "image_url": {"url": "https://modelscope.oss-cn-beijing.aliyuncs.com/resource/qwen.png"}},
        {"type": "text", "text": "What is the text in the illustration?"}
    ]}
    ]
    }'
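
For reference, the same request can also be sent from Python with the openai client instead of curl (a sketch; it assumes the vLLM OpenAI-compatible server started above and openai >= 1.0):

from openai import OpenAI

# Point the client at the local vLLM OpenAI-compatible server started above.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
resp = client.chat.completions.create(
    model="qwen2vl",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": [
            {"type": "image_url", "image_url": {"url": "https://modelscope.oss-cn-beijing.aliyuncs.com/resource/qwen.png"}},
            {"type": "text", "text": "What is the text in the illustration?"},
        ]},
    ],
)
print(resp.choices[0].message.content)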

kq-chen closed this as completed Sep 24, 2024
@NaiveYan

> Based on the suggestion from @aabbccddwasd, we have adjusted the intermediate size to 29696 and re-quantized the model. The updated 72B AWQ/GPTQ-Int4/GPTQ-Int8 checkpoints have been uploaded to Hugging Face.

Any plan for modelscope?

@aabbccddwasd (Author)

thanks
qwen best FOREVER!

kq-chen (Collaborator) commented Sep 25, 2024

@NaiveYan the 72B AWQ/GPTQ checkpoints have been updated on ModelScope.

@YChengxin

> Based on the suggestion from @aabbccddwasd, we have adjusted the intermediate size to 29696 and re-quantized the model. The updated 72B AWQ/GPTQ-Int4/GPTQ-Int8 checkpoints have been uploaded to Hugging Face.

Would you consider updating the unquantized versions, e.g. Qwen2.5-72b-instruct, to 29696 as well? That would make it easy for everyone to fine-tune and then quantize on their own for faster deployment. ღ( ´・ᴗ・` )

kq-chen (Collaborator) commented Oct 1, 2024

@YChengxin you can pad the checkpoint with the following code snippet:

import os

import torch
from torch.nn import functional as F
from transformers import Qwen2VLForConditionalGeneration

def fix_dim(
    model_path: str,
    output_path: str,
    src_dim: int = 29568,
    tar_dim: int = 29696,
):
    """Zero-pad the MLP intermediate dimension from src_dim to tar_dim and save the result."""
    pad_size = tar_dim - src_dim
    model = Qwen2VLForConditionalGeneration.from_pretrained(model_path, torch_dtype='auto', device_map='auto')
    sd = model.state_dict()
    for i, k in enumerate(sd):
        v = sd[k]
        if ('mlp.up_proj.weight' in k) or ('mlp.gate_proj.weight' in k):
            # Interleave one zero row after each of the first pad_size output rows,
            # growing the first dimension from src_dim to tar_dim.
            prev_v = F.pad(v.unsqueeze(1), (0, 0, 0, 1, 0, 0)).reshape(src_dim*2, -1)[:pad_size*2]
            new_v = torch.cat([prev_v, v[pad_size:]], dim=0)
            sd[k] = new_v
        elif 'mlp.down_proj.weight' in k:
            # Insert the matching zero columns so down_proj stays aligned
            # with the padded up/gate projections.
            prev_v = F.pad(v.unsqueeze(2), (0, 1)).reshape(v.shape[0], src_dim*2)[:, :pad_size*2]
            new_v = torch.cat([prev_v, v[:, pad_size:]], dim=1)
            sd[k] = new_v
    os.makedirs(output_path, exist_ok=True)
    torch.save(sd, f"{output_path}/pytorch_model.bin")
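
A possible invocation (paths are placeholders; note that intermediate_size in the padded checkpoint's config.json must also be set to 29696 to match the new weight shapes before re-quantizing, which the snippet above does not do for you):

# Hypothetical paths; adjust to your local setup.
fix_dim(
    model_path="Qwen/Qwen2-VL-72B-Instruct",   # original, unquantized checkpoint
    output_path="./qwen2-vl-72b-padded",       # receives the padded pytorch_model.bin
)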

luosting commented Nov 29, 2024

When I run
curl http://localhost:8000/v1/chat/completions
-H "Content-Type: application/json"
-d '{
"model": "qwen2vl",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": [
{"type": "image_url", "image_url": {"url": "https://modelscope.oss-cn-beijing.aliyuncs.com/resource/qwen.png"}},
{"type": "text", "text": "What is the text in the illustration?"}
]}
]
}'

it always returns: {"object":"error","message":"[{'type': 'json_invalid', 'loc': ('body', 358), 'msg': 'JSON decode error', 'input': {}, 'ctx': {'error': 'Expecting property name enclosed in double quotes'}}]","type":"BadRequestError","param":null,"code":400}

Is this a problem on the server side? @kq-chen
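
One common cause of this particular error is the shell or copy/paste mangling the multi-line JSON (for example curly quotes); sending the same payload from Python bypasses shell quoting entirely (a sketch using requests, reusing the endpoint and payload from the commands above):

# Send the same request without shell quoting, to rule out copy/paste issues with curl.
import requests

payload = {
    "model": "qwen2vl",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": [
            {"type": "image_url", "image_url": {"url": "https://modelscope.oss-cn-beijing.aliyuncs.com/resource/qwen.png"}},
            {"type": "text", "text": "What is the text in the illustration?"},
        ]},
    ],
}
r = requests.post("http://localhost:8000/v1/chat/completions", json=payload)
print(r.status_code, r.json())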

whitesay commented Nov 29, 2024 via email
