Please add a version that can run with 2/4/8-GPU tensor parallelism #231
Comments
I'm running into the same problem and hope it gets fixed.
Change intermediate_size in config.json to 29184; I'm not yet sure what side effects this has. @aabbccddwasd
Same problem here as well, hoping for a fix.
The official recommendation is padding: https://qwen.readthedocs.io/en/latest/quantization/gptq.html. I'm on V100-32G and two cards are enough for me, so I only reduced it by 128 = 29440, which gives 230, a multiple of 2.
Doesn't padding also end up ignoring parameters? Could we instead choose to add meaningless empty parameters here?
I read the docs, and I think padding actually means adding parameters. Did you perhaps mis-state it?
The original value is 29568, right? The docs zero-pad it to 29696, whereas your change reduces it to 29184.
Quoting "I'm on V100-32G and two cards are enough for me, so I only reduced it by 128 = 29440, which gives 230, a multiple of 2": shouldn't that sentence be reworded?

Addendum: in any case, I still hope the team releases an official version padded to 29696, since the calibration dataset has not been made public.
Same problem here. I'm deploying on eight 3090s, and by this logic it seems I can only reduce it to 28672.
Reducing to 28672 does work with 8 GPUs, but increasing to 29696 produces an error along the lines of "exceeds dimension size".
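For reference, here is a small Python sketch (my own, not from this thread) that reproduces the group-count arithmetic being discussed. It only checks whether the number of group-size-128 quantization groups is divisible by the tensor-parallel degree, so the actual vLLM loader may impose additional requirements.

```python
# Check which candidate intermediate_size values keep the number of
# group-size-128 quantization groups divisible by the tensor-parallel degree.
GROUP_SIZE = 128
CANDIDATES = (29568, 29184, 29440, 28672, 29696)  # values mentioned in this thread

for intermediate_size in CANDIDATES:
    groups = intermediate_size // GROUP_SIZE
    tp_ok = [tp for tp in (2, 4, 8) if groups % tp == 0]
    print(f"intermediate_size={intermediate_size}: {groups} groups, tensor-parallel sizes {tp_ok or 'none'}")
```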
QwertyJack explained it clearly above: without padding plus re-quantization you cannot increase the size, and reducing it just means ignoring a small number of parameters.
What a coincidence, that issue is also #231.
What command are you running? After modifying config.json, I ran

```bash
VLLM_WORKER_MULTIPROC_METHOD=spawn cuda_visible_devices=0,2 python -m vllm.entrypoints.openai.api_server \
    --served-model-name Qwen2-VL-7B-Instruct \
    --model /data/LLM_model/Qwen2-VL-2B-Instruct/Qwen2-VL-72B-Instruct-GPTQ-Int4 \
    --tensor-parallel-size 2
```

and still ran out of GPU memory. Each card has 40G.
Yeah, I'm on 4 GPUs, with --tensor-parallel-size 4.
Tensor parallelism has to wait for the officially updated weights. If you need something now, try pipeline parallelism first; the PR submitted today has already been merged, so the vLLM main branch works.
```bash
VLLM_WORKER_MULTIPROC_METHOD=spawn CUDA_VISIBLE_DEVICES=0,1 python -m vllm.entrypoints.openai.api_server \
    --served-model-name Qwen2-VL-72B-Instruct-GPTQ-Int4 \
    --model Qwen2-VL-72B-Instruct-GPTQ-Int4 \
    --port 7865 \
    --dtype half \
    --trust-remote-code \
    --kv-cache-dtype fp8 \
    -q gptq \
    --disable-log-requests \
    --gpu-memory-utilization 0.998 \
    --max-model-len 8192 \
    --enforce_eager \
    -tp 2
```

Running on 2× V100-32G.
Thanks so much!

```bash
python3 -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen2.5-72B-Instruct-GPTQ-Int4 \
    --pipeline-parallel-size 4 \
    --gpu-memory-utilization 0.95 \
    --max-num-seqs 16 \
    --max-model-len 4096 \
    --tokenizer-mode auto \
    --disable-log-requests
```

Is this correct? Also, could you give us the proper version of transformers? There seems to be a bug in the latest version of transformers, mentioned in vllm-project/vllm#7905 (comment), but I see you mentioned "qwen2 vl need the latest transformer library" in vllm-project/vllm#8696. So the bug is fixed, right?
So when will the tensor-parallel-capable version be uploaded?
It's not stated clearly in the PR; you should use the specific version, not the latest.
Based on the suggestion from @aabbccddwasd, we have adjusted the intermediate size to 29696 and re-quantized the model. The updated 72B AWQ/GPTQ-Int4/GPTQ-Int8 checkpoints have been uploaded to Hugging Face. To use the new checkpoints, please download them again from Hugging Face. You can use the following commands to perform inference on the quantized 72B model with vLLM tensor parallelism:

Server:

```bash
VLLM_WORKER_MULTIPROC_METHOD=spawn python -m vllm.entrypoints.openai.api_server \
    --served-model-name qwen2vl \
    --model Qwen/Qwen2-VL-72B-Instruct-AWQ \
    --tensor-parallel-size 4 \
    --max_num_seqs 16
```

Client:

```bash
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "qwen2vl",
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": [
                {"type": "image_url", "image_url": {"url": "https://modelscope.oss-cn-beijing.aliyuncs.com/resource/qwen.png"}},
                {"type": "text", "text": "What is the text in the illustration?"}
            ]}
        ]
    }'
```
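If you prefer Python over curl, a roughly equivalent client using the openai package should also work against the OpenAI-compatible server started above. This is a minimal sketch, not from the original comment; it assumes the server's default port 8000 and the served model name qwen2vl used above.

```python
# Minimal OpenAI-compatible client sketch for the vLLM server above
# (assumptions: server reachable at localhost:8000, served model name "qwen2vl").
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="qwen2vl",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": [
            {"type": "image_url", "image_url": {"url": "https://modelscope.oss-cn-beijing.aliyuncs.com/resource/qwen.png"}},
            {"type": "text", "text": "What is the text in the illustration?"},
        ]},
    ],
)
print(response.choices[0].message.content)
```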
Any plan for ModelScope?
Thanks.
@NaiveYan the 72B AWQ/GPTQ checkpoints have been updated on ModelScope.
Would you also consider updating the unquantized checkpoints (Qwen2.5-72B-Instruct and similar) to 29696? That would make it easy for everyone to fine-tune and then quantize the models themselves for faster deployment. ღ( ´・ᴗ・` )
@YChengxin you can pad the checkpoint with the following code snippet:

```python
import os

import torch
from torch.nn import functional as F
from transformers import Qwen2VLForConditionalGeneration


def fix_dim(
    model_path: str,
    output_path: str,
    src_dim: int = 29568,
    tar_dim: int = 29696,
):
    pad_size = tar_dim - src_dim
    model = Qwen2VLForConditionalGeneration.from_pretrained(
        model_path, torch_dtype='auto', device_map='auto'
    )
    sd = model.state_dict()
    for k in sd:
        v = sd[k]
        if ('mlp.up_proj.weight' in k) or ('mlp.gate_proj.weight' in k):
            # Rows are the intermediate dimension: insert a zero row after each of the
            # first pad_size rows, keep the rest unchanged, growing src_dim -> tar_dim.
            prev_v = F.pad(v.unsqueeze(1), (0, 0, 0, 1, 0, 0)).reshape(src_dim * 2, -1)[:pad_size * 2]
            new_v = torch.cat([prev_v, v[pad_size:]], dim=0)
            sd[k] = new_v
        elif 'mlp.down_proj.weight' in k:
            # Columns are the intermediate dimension: same interleaved zero padding
            # along dim 1, growing src_dim -> tar_dim.
            prev_v = F.pad(v.unsqueeze(2), (0, 1)).reshape(v.shape[0], src_dim * 2)[:, :pad_size * 2]
            new_v = torch.cat([prev_v, v[:, pad_size:]], dim=1)
            sd[k] = new_v
    os.makedirs(output_path, exist_ok=True)
    torch.save(sd, f"{output_path}/pytorch_model.bin")
```
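A possible usage sketch for the helper above (my assumption, not part of the original comment; the paths are placeholders). Note that the snippet only writes pytorch_model.bin, so the config.json and tokenizer files would presumably still need to be copied into the output directory, with intermediate_size in that config.json set to 29696, before re-quantizing or serving the padded model.

```python
# Hypothetical invocation of fix_dim; both paths are placeholders.
# The remaining model files (config.json, tokenizer, etc.) still need to be
# copied next to the saved pytorch_model.bin, with intermediate_size in the
# copied config.json changed to 29696.
fix_dim(
    model_path="Qwen/Qwen2-VL-72B-Instruct",
    output_path="./Qwen2-VL-72B-Instruct-padded",
)
```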
When I use it, I always keep getting: {"object":"error","message":"[{'type': 'json_invalid', 'loc': ('body', 358), 'msg': 'JSON decode error', 'input': {}, 'ctx': {'error': 'Expecting property name enclosed in double quotes'}}]","type":"BadRequestError","param":null,"code":400}. Is this a problem on the server side? @kq-chen
The officially released quantized models cannot be run with tensor parallelism.

The reason is that intermediate_size is 29568; dividing it by the group size (128) gives 231, which is not divisible by 2, 4, or 8, and this triggers an error in vLLM that prevents tensor parallelism.

Please either quantize with a different group size so that intermediate_size / group_size is divisible by 2, 4, and 8, or slightly modify the model so that intermediate_size becomes 29696 as in Qwen2.5, which would allow tensor parallelism to work normally with group-size-128 quantization.

If neither of the above is feasible, please explain how to run tensor parallelism with these quantized models.