System Info
- transformers version: 4.52.3
- bitsandbytes version: 0.45.5
- vLLM version: 0.8.5.post1
- Platform: Linux-5.15.0-1045-azure-x86_64-with-glibc2.35
- Python version: 3.10.16
- Huggingface_hub version: 0.32.1
- Safetensors version: 0.5.3
- Accelerate version: 1.7.0
- Accelerate config: not found
- DeepSpeed version: not installed
- PyTorch version (GPU?): 2.6.0+cu124 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using distributed or parallel set-up in script?: no
- Using GPU in script?: yes
- GPU type: NVIDIA A100 80GB PCIe
Who can help?
I have been trying to get fast inference on Qwen2.5-VL-7B-Instruct through vLLM and have been successful. To get more throughput, I tried saving a bitsandbytes 4-bit quantized model locally and loading it into vLLM, but it fails with a key mismatch. To root-cause this, I tried the same thing with a plain transformers model: load it, save it locally, then load it into vLLM. It failed again. Both times the error message was:
ERROR 05-27 12:50:03 [core.py:396] File "/home/aiscuser/.conda/envs/slm/lib/python3.10/site-packages/vllm/model_executor/models/qwen2.py", line 405, in load_weights
ERROR 05-27 12:50:03 [core.py:396] param = params_dict[name]
ERROR 05-27 12:50:03 [core.py:396] KeyError: 'language_model.embed_tokens.weight'
Investigating further: when I download the model with huggingface-cli and inspect the cache, its model.safetensors.index.json has different keys than the model.safetensors.index.json of the locally saved model (even unquantized).
Some lines from the model.safetensors.index.json of the model in the cache:
"lm_head.weight": "model-00005-of-00005.safetensors",
"model.embed_tokens.weight": "model-00001-of-00005.safetensors",
"model.layers.0.input_layernorm.weight": "model-00001-of-00005.safetensors",
"model.layers.0.mlp.down_proj.weight": "model-00001-of-00005.safetensors",
"model.layers.0.mlp.gate_proj.weight": "model-00001-of-00005.safetensors",
"model.layers.0.mlp.up_proj.weight": "model-00001-of-00005.safetensors",
"model.layers.0.post_attention_layernorm.weight": "model-00001-of-00005.safetensors",
"visual.blocks.0.attn.proj.bias": "model-00001-of-00005.safetensors",
"visual.blocks.0.attn.proj.weight": "model-00001-of-00005.safetensors",
"visual.blocks.0.attn.qkv.bias": "model-00001-of-00005.safetensors",
"visual.blocks.0.attn.qkv.weight": "model-00001-of-00005.safetensors",
"visual.blocks.0.mlp.down_proj.bias": "model-00001-of-00005.safetensors",
"visual.blocks.0.mlp.down_proj.weight": "model-00001-of-00005.safetensors",
Corresponding lines from the model.safetensors.index.json of the locally saved model:
"lm_head.weight": "model-00004-of-00004.safetensors",
"model.language_model.embed_tokens.weight": "model-00001-of-00004.safetensors",
"model.language_model.layers.0.input_layernorm.weight": "model-00001-of-00004.safetensors",
"model.language_model.layers.0.mlp.down_proj.weight": "model-00001-of-00004.safetensors",
"model.language_model.layers.0.mlp.gate_proj.weight": "model-00001-of-00004.safetensors",
"model.language_model.layers.0.mlp.up_proj.weight": "model-00001-of-00004.safetensors",
"model.language_model.layers.0.post_attention_layernorm.weight": "model-00001-of-00004.safetensors",
"model.visual.blocks.0.attn.proj.bias": "model-00001-of-00004.safetensors",
"model.visual.blocks.0.attn.proj.weight": "model-00001-of-00004.safetensors",
"model.visual.blocks.0.attn.qkv.bias": "model-00001-of-00004.safetensors",
"model.visual.blocks.0.attn.qkv.weight": "model-00001-of-00004.safetensors",
"model.visual.blocks.0.mlp.down_proj.bias": "model-00001-of-00004.safetensors",
"model.visual.blocks.0.mlp.down_proj.weight": "model-00001-of-00004.safetensors",
This is probably also why vLLM can load the model when I give it the HF model name directly, but cannot load it from the locally saved copy. My question is: why is this happening? Why does the locally saved model end up with different keys?
Code used to locally save the unquantized model:
import torch
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor

model_path = "Qwen/Qwen2.5-VL-7B-Instruct"
DEVICE = "cuda:0"

processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True, min_pixels=1024*28*28, use_fast=True)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_path,
    device_map=DEVICE,  # device_map already places the model, so no extra .to(DEVICE) is needed
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)
processor.tokenizer.padding_side = 'left'
model.save_pretrained("./qwen7b")
processor.save_pretrained("./qwen7b")
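As a possible workaround (a sketch, not a verified fix), one could rename the state-dict keys back to the hub layout before saving, assuming the only difference really is the added model.language_model. / model.visual. prefixes; the ./qwen7b_remapped directory name is just for illustration.

# Sketch: strip the nested prefixes so the saved checkpoint matches the hub layout
# (model.language_model.* -> model.*, model.visual.* -> visual.*).
remapped = {}
for key, tensor in model.state_dict().items():
    if key.startswith("model.language_model."):
        key = "model." + key[len("model.language_model."):]
    elif key.startswith("model.visual."):
        key = key[len("model."):]
    remapped[key] = tensor

model.save_pretrained("./qwen7b_remapped", state_dict=remapped)
processor.save_pretrained("./qwen7b_remapped")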
Code used to save the quantized model:
import torch
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor, BitsAndBytesConfig

model_path = "Qwen/Qwen2.5-VL-7B-Instruct"
DEVICE = "cuda:0"

processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True, min_pixels=1024*28*28, use_fast=True)
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_path,
    device_map=DEVICE,  # no extra .to(DEVICE): .to() is not supported for 4-bit bitsandbytes models
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    quantization_config=quantization_config,
)
processor.tokenizer.padding_side = 'left'
model.save_pretrained("./qwen7b_bnb4")
processor.save_pretrained("./qwen7b_bnb4")
Code used to load the model into vLLM:
import torch
from transformers import AutoProcessor
from vllm import LLM, SamplingParams
model_path = "/home/aiscuser/qwen7b"
# model_path = "/home/aiscuser/qwen7b_bnb4"
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True, min_pixels=1024*28*28, use_fast=True)
model = LLM(
model=model_path,
tokenizer_mode="auto",
limit_mm_per_prompt={"image": 1, "video": 0},
trust_remote_code=True,
tensor_parallel_size=1,
# quantization="bitsandbytes" # tried with and without this. Error is on non-quantization weights
)
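Since loading by hub id works, the alternative I would fall back on is letting vLLM pull the checkpoint straight from the Hub and quantize it in flight with bitsandbytes. This is only a sketch, assuming in-flight bitsandbytes quantization is supported for this model in vLLM 0.8.5.

from vllm import LLM

# Sketch, not verified: load the original Hub checkpoint (whose keys vLLM expects)
# and quantize in flight instead of loading the re-saved local copy.
model = LLM(
    model="Qwen/Qwen2.5-VL-7B-Instruct",
    quantization="bitsandbytes",  # in-flight 4-bit quantization, if supported for this model
    tokenizer_mode="auto",
    limit_mm_per_prompt={"image": 1, "video": 0},
    trust_remote_code=True,
    tensor_parallel_size=1,
)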
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
Steps to reproduce:
- Run the script that saves the model locally (the unquantized one is enough).
- Run the script that loads the saved model into vLLM.
Expected behavior
The model keys should be the same between the downloaded and the locally saved model, and vLLM should then load the locally saved model correctly.