System Info
- transformers version: 4.52.3
- bitsandbytes version: 0.45.5
- vLLM version: 0.8.5.post1
- Platform: Linux-5.15.0-1045-azure-x86_64-with-glibc2.35
- Python version: 3.10.16
- Huggingface_hub version: 0.32.1
- Safetensors version: 0.5.3
- Accelerate version: 1.7.0
- Accelerate config: not found
- DeepSpeed version: not installed
- PyTorch version (GPU?): 2.6.0+cu124 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using distributed or parallel set-up in script?: no
- Using GPU in script?: yes
- GPU type: NVIDIA A100 80GB PCIe
Who can help?
I have been trying to get fast inference on Qwen2.5-VL-7B-Instruct through vLLM and have been successful. To get more throughput, I tried saving a bitsandbytes 4-bit quantized model locally and loading it into vLLM, but it fails with a key mismatch. To root-cause this, I tried the same thing with a plain transformers model: load it, save it locally, then load it into vLLM. It failed again. Both times the error message was:
ERROR 05-27 12:50:03 [core.py:396] File "/home/aiscuser/.conda/envs/slm/lib/python3.10/site-packages/vllm/model_executor/models/qwen2.py", line 405, in load_weights
ERROR 05-27 12:50:03 [core.py:396] param = params_dict[name]
ERROR 05-27 12:50:03 [core.py:396] KeyError: 'language_model.embed_tokens.weight'
Investigating further: when I download the model with huggingface-cli and inspect the cache, its model.safetensors.index.json has different keys than the model.safetensors.index.json of the locally saved model (even unquantized).
Some lines from the model.safetensors.index.json of the model in the cache:
"lm_head.weight": "model-00005-of-00005.safetensors",
"model.embed_tokens.weight": "model-00001-of-00005.safetensors",
"model.layers.0.input_layernorm.weight": "model-00001-of-00005.safetensors",
"model.layers.0.mlp.down_proj.weight": "model-00001-of-00005.safetensors",
"model.layers.0.mlp.gate_proj.weight": "model-00001-of-00005.safetensors",
"model.layers.0.mlp.up_proj.weight": "model-00001-of-00005.safetensors",
"model.layers.0.post_attention_layernorm.weight": "model-00001-of-00005.safetensors",
"visual.blocks.0.attn.proj.bias": "model-00001-of-00005.safetensors",
"visual.blocks.0.attn.proj.weight": "model-00001-of-00005.safetensors",
"visual.blocks.0.attn.qkv.bias": "model-00001-of-00005.safetensors",
"visual.blocks.0.attn.qkv.weight": "model-00001-of-00005.safetensors",
"visual.blocks.0.mlp.down_proj.bias": "model-00001-of-00005.safetensors",
"visual.blocks.0.mlp.down_proj.weight": "model-00001-of-00005.safetensors",
Corresponding lines from the model.safetensors.index.json of the locally saved model:
"lm_head.weight": "model-00004-of-00004.safetensors",
"model.language_model.embed_tokens.weight": "model-00001-of-00004.safetensors",
"model.language_model.layers.0.input_layernorm.weight": "model-00001-of-00004.safetensors",
"model.language_model.layers.0.mlp.down_proj.weight": "model-00001-of-00004.safetensors",
"model.language_model.layers.0.mlp.gate_proj.weight": "model-00001-of-00004.safetensors",
"model.language_model.layers.0.mlp.up_proj.weight": "model-00001-of-00004.safetensors",
"model.language_model.layers.0.post_attention_layernorm.weight": "model-00001-of-00004.safetensors",
"model.visual.blocks.0.attn.proj.bias": "model-00001-of-00004.safetensors",
"model.visual.blocks.0.attn.proj.weight": "model-00001-of-00004.safetensors",
"model.visual.blocks.0.attn.qkv.bias": "model-00001-of-00004.safetensors",
"model.visual.blocks.0.attn.qkv.weight": "model-00001-of-00004.safetensors",
"model.visual.blocks.0.mlp.down_proj.bias": "model-00001-of-00004.safetensors",
"model.visual.blocks.0.mlp.down_proj.weight": "model-00001-of-00004.safetensors",
This is probably also why vLLM can load the model when I give it the HF model name directly, but cannot load it from the locally saved copy. My question is: why is this happening? Why does the locally saved model end up with different keys?
Code used to locally save the unquantized model:
import torch
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor

model_path = "Qwen/Qwen2.5-VL-7B-Instruct"
DEVICE = "cuda:0"

processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True, min_pixels=1024*28*28, use_fast=True)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_path,
    device_map=DEVICE,  # device_map already places the model, so no extra .to(DEVICE) is needed
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)
processor.tokenizer.padding_side = 'left'
model.save_pretrained("./qwen7b")
processor.save_pretrained("./qwen7b")
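As a possible workaround (a sketch, not a verified fix), one could rename the state-dict keys back to the hub layout before saving, assuming the only difference really is the added model.language_model. / model.visual. prefixes; the ./qwen7b_remapped directory name is just for illustration.

# Sketch: strip the nested prefixes so the saved checkpoint matches the hub layout
# (model.language_model.* -> model.*, model.visual.* -> visual.*).
remapped = {}
for key, tensor in model.state_dict().items():
    if key.startswith("model.language_model."):
        key = "model." + key[len("model.language_model."):]
    elif key.startswith("model.visual."):
        key = key[len("model."):]
    remapped[key] = tensor

model.save_pretrained("./qwen7b_remapped", state_dict=remapped)
processor.save_pretrained("./qwen7b_remapped")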
Code used to save the quantized model:
import torch
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor, BitsAndBytesConfig

model_path = "Qwen/Qwen2.5-VL-7B-Instruct"
DEVICE = "cuda:0"

processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True, min_pixels=1024*28*28, use_fast=True)
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_path,
    device_map=DEVICE,  # no extra .to(DEVICE): .to() is not supported for 4-bit bitsandbytes models
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    quantization_config=quantization_config,
)
processor.tokenizer.padding_side = 'left'
model.save_pretrained("./qwen7b_bnb4")
processor.save_pretrained("./qwen7b_bnb4")
Code used to load the model into vLLM:
import torch
from transformers import AutoProcessor
from vllm import LLM, SamplingParams
model_path = "/home/aiscuser/qwen7b"
# model_path = "/home/aiscuser/qwen7b_bnb4"
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True, min_pixels=1024*28*28, use_fast=True)
model = LLM(
model=model_path,
tokenizer_mode="auto",
limit_mm_per_prompt={"image": 1, "video": 0},
trust_remote_code=True,
tensor_parallel_size=1,
# quantization="bitsandbytes" # tried with and without this. Error is on non-quantization weights
)
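Since loading by hub id works, the alternative I would fall back on is letting vLLM pull the checkpoint straight from the Hub and quantize it in flight with bitsandbytes. This is only a sketch, assuming in-flight bitsandbytes quantization is supported for this model in vLLM 0.8.5.

from vllm import LLM

# Sketch, not verified: load the original Hub checkpoint (whose keys vLLM expects)
# and quantize in flight instead of loading the re-saved local copy.
model = LLM(
    model="Qwen/Qwen2.5-VL-7B-Instruct",
    quantization="bitsandbytes",  # in-flight 4-bit quantization, if supported for this model
    tokenizer_mode="auto",
    limit_mm_per_prompt={"image": 1, "video": 0},
    trust_remote_code=True,
    tensor_parallel_size=1,
)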
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
Steps to reproduce:
- Run the script that saves the model locally (the unquantized one is enough).
- Run the script that loads the saved model into vLLM.
Expected behavior
The model keys should be the same between the downloaded and the locally saved model, and vLLM should then load the locally saved model correctly.