[Bug]: Error when loading model (gemma-3-4b) merged after DeepSpeed training into vLLM #19139

@taegyunjjang

Description

Your current environment

library versions

  • transformers==4.52.3
  • vllm==0.9.0.1

🐛 Describe the bug

I encountered an error when trying to load into vLLM a model that was merged after training with DeepSpeed.

The merging process was done using the following code, which produced pytorch_model.bin, config.json, etc. (I also saved the model using safetensors.)

import os, torch
from deepspeed.utils.zero_to_fp32 import get_fp32_state_dict_from_zero_checkpoint

state_dict = get_fp32_state_dict_from_zero_checkpoint(checkpoint_path)
torch.save(state_dict, os.path.join(output_dir, "pytorch_model.bin"))
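As a quick sanity check on the merge output (an illustrative addition, not part of my original script; it reuses the state_dict variable from the snippet above), the top-level key prefixes of the merged state dict can be printed, since the error later in this report is about an unexpected 'model' prefix:

from collections import Counter

# Count the first dotted component of each weight name in the merged
# state dict (e.g. "model", "language_model", "vision_tower").
prefixes = Counter(key.split(".")[0] for key in state_dict)
print(prefixes)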

For merging, I used the checkpoint with the lowest validation loss as the best model.
(screenshot attached)

There are no issues when using the original (non-fine-tuned) gemma-3-4b model:
python generate_answer_vllm.py --model_name gemma-3-4b --shot 0

However, running the same script with the fine-tuned model fails:
python generate_answer_vllm.py --model_name gemma-3-4b --shot 0 --use_finetuned_model

The model is loaded with the following function:
import torch
from transformers import AutoTokenizer
from vllm import LLM

def load_model_and_tokenizer(model_name, use_finetuned_model, fintuned_path):
    model_path = fintuned_path if use_finetuned_model else MODEL_MAPPING[model_name]

    # Load the vLLM model
    llm = LLM(
        model=model_path,
        tensor_parallel_size=2,
        trust_remote_code=True,
        dtype=torch.bfloat16,
        gpu_memory_utilization=0.9,
    )

    tokenizer = AutoTokenizer.from_pretrained(model_path)
    if tokenizer.pad_token_id is None:
        tokenizer.pad_token_id = tokenizer.eos_token_id
    return llm, tokenizer
It fails with:

ValueError: There is no module or parameter named 'model' in Gemma3ForConditionalGeneration
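For context, the loader above is invoked roughly as follows (an illustrative sketch rather than the exact script; FINETUNED_PATH, the prompt, and the sampling settings are placeholders), using vLLM's standard SamplingParams and generate API:

from vllm import SamplingParams

# Hypothetical invocation mirroring generate_answer_vllm.py; FINETUNED_PATH is a placeholder.
llm, tokenizer = load_model_and_tokenizer("gemma-3-4b", use_finetuned_model=True, fintuned_path=FINETUNED_PATH)

sampling_params = SamplingParams(temperature=0.0, max_tokens=256)
outputs = llm.generate(["What vaccinations does a puppy need?"], sampling_params)
print(outputs[0].outputs[0].text)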

I have checked the following resources but was unable to resolve the issue.
#15031

Let me know if additional information is needed. Thank you!

Full error log:

(vllm) (base) work@main1[w1ecL7JL-session]:~/factchecking/PetQA$ python generate_answer_vllm.py --model_name gemma-3-4b --shot 0 --use_finetuned_model
INFO 06-04 19:33:00 [__init__.py:243] Automatically detected platform cuda.
2025/06/04 19:33:02 - INFO - MODEL NAME: gemma-3-4b
2025/06/04 19:33:02 - INFO - SHOT: 0
2025/06/04 19:33:02 - INFO - USE RAW FORMAT: False
2025/06/04 19:33:02 - INFO - USE FINETUNED MODEL: True
INFO 06-04 19:33:02 [__init__.py:31] Available plugins for group vllm.general_plugins:
INFO 06-04 19:33:02 [__init__.py:33] - lora_filesystem_resolver -> vllm.plugins.lora_resolvers.filesystem_resolver:register_filesystem_resolver
INFO 06-04 19:33:02 [__init__.py:36] All plugins in this group will be loaded. Set `VLLM_PLUGINS` to control which plugins to load.
INFO 06-04 19:33:11 [config.py:793] This model supports multiple tasks: {'classify', 'generate', 'score', 'embed', 'reward'}. Defaulting to 'generate'.
WARNING 06-04 19:33:11 [arg_utils.py:1431] The model has a long context length (131072). This may cause OOM during the initial memory profiling phase, or result in low performance due to small KV cache size. Consider setting --max-model-len to a smaller value.
INFO 06-04 19:33:11 [config.py:1875] Defaulting to use mp for distributed inference
INFO 06-04 19:33:11 [llm_engine.py:230] Initializing a V0 LLM engine (v0.9.0) with config: model='data/outputs/gemma-3-4b_petqa_preprocessed/best_model', speculative_config=None, tokenizer='data/outputs/gemma-3-4b_petqa_preprocessed/best_model', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config={}, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=131072, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=2, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(backend='xgrammar', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=None, served_model_name=data/outputs/gemma-3-4b_petqa_preprocessed/best_model, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=None, chunked_prefill_enabled=False, use_async_output_proc=True, pooler_config=None, compilation_config={"compile_sizes": [], "inductor_compile_config": {"enable_auto_functionalized_v2": false}, "cudagraph_capture_sizes": [256, 248, 240, 232, 224, 216, 208, 200, 192, 184, 176, 168, 160, 152, 144, 136, 128, 120, 112, 104, 96, 88, 80, 72, 64, 56, 48, 40, 32, 24, 16, 8, 4, 2, 1], "max_capture_size": 256}, use_cached_outputs=False, 
(VllmWorkerProcess pid=2133317) INFO 06-04 19:33:12 [multiproc_worker_utils.py:225] Worker ready; awaiting tasks
INFO 06-04 19:33:14 [cuda.py:292] Using Flash Attention backend.
(VllmWorkerProcess pid=2133317) INFO 06-04 19:33:14 [cuda.py:292] Using Flash Attention backend.
INFO 06-04 19:33:15 [utils.py:1077] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=2133317) INFO 06-04 19:33:15 [utils.py:1077] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=2133317) INFO 06-04 19:33:15 [pynccl.py:69] vLLM is using nccl==2.26.2
INFO 06-04 19:33:15 [pynccl.py:69] vLLM is using nccl==2.26.2
(VllmWorkerProcess pid=2133317) INFO 06-04 19:33:16 [custom_all_reduce_utils.py:245] reading GPU P2P access cache from /home/work/.cache/vllm/gpu_p2p_access_cache_for_0,1.json
INFO 06-04 19:33:16 [custom_all_reduce_utils.py:245] reading GPU P2P access cache from /home/work/.cache/vllm/gpu_p2p_access_cache_for_0,1.json
INFO 06-04 19:33:16 [shm_broadcast.py:250] vLLM message queue communication handle: Handle(local_reader_ranks=[1], buffer_handle=(1, 4194304, 6, 'psm_cd4abdf8'), local_subscribe_addr='ipc:///tmp/eb21c03c-6a5c-479a-9948-56093dedfc3b', remote_subscribe_addr=None, remote_addr_ipv6=False)
(VllmWorkerProcess pid=2133317) INFO 06-04 19:33:16 [parallel_state.py:1064] rank 1 in world size 2 is assigned as DP rank 0, PP rank 0, TP rank 1, EP rank 1
INFO 06-04 19:33:16 [parallel_state.py:1064] rank 0 in world size 2 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
INFO 06-04 19:33:16 [model_runner.py:1170] Starting to load model data/outputs/gemma-3-4b_petqa_preprocessed/best_model...
(VllmWorkerProcess pid=2133317) INFO 06-04 19:33:16 [model_runner.py:1170] Starting to load model data/outputs/gemma-3-4b_petqa_preprocessed/best_model...
INFO 06-04 19:33:16 [cuda.py:266] Cannot use FlashAttention-2 backend for head size 72.
INFO 06-04 19:33:16 [cuda.py:289] Using XFormers backend.
(VllmWorkerProcess pid=2133317) INFO 06-04 19:33:16 [cuda.py:266] Cannot use FlashAttention-2 backend for head size 72.
(VllmWorkerProcess pid=2133317) INFO 06-04 19:33:16 [cuda.py:289] Using XFormers backend.
Loading safetensors checkpoint shards:   0% Completed | 0/4 [00:00<?, ?it/s]
(VllmWorkerProcess pid=2133317) ERROR 06-04 19:33:18 [multiproc_worker_utils.py:238] Exception in worker VllmWorkerProcess while processing method load_model.
(VllmWorkerProcess pid=2133317) ERROR 06-04 19:33:18 [multiproc_worker_utils.py:238] Traceback (most recent call last):
(VllmWorkerProcess pid=2133317) ERROR 06-04 19:33:18 [multiproc_worker_utils.py:238]   File "/home/work/factchecking/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/executor/multiproc_worker_utils.py", line 232, in _run_worker_process
(VllmWorkerProcess pid=2133317) ERROR 06-04 19:33:18 [multiproc_worker_utils.py:238]     output = run_method(worker, method, args, kwargs)
(VllmWorkerProcess pid=2133317) ERROR 06-04 19:33:18 [multiproc_worker_utils.py:238]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=2133317) ERROR 06-04 19:33:18 [multiproc_worker_utils.py:238]   File "/home/work/factchecking/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/utils.py", line 2605, in run_method
(VllmWorkerProcess pid=2133317) ERROR 06-04 19:33:18 [multiproc_worker_utils.py:238]     return func(*args, **kwargs)
(VllmWorkerProcess pid=2133317) ERROR 06-04 19:33:18 [multiproc_worker_utils.py:238]            ^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=2133317) ERROR 06-04 19:33:18 [multiproc_worker_utils.py:238]   File "/home/work/factchecking/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/worker/worker.py", line 207, in load_model
(VllmWorkerProcess pid=2133317) ERROR 06-04 19:33:18 [multiproc_worker_utils.py:238]     self.model_runner.load_model()
(VllmWorkerProcess pid=2133317) ERROR 06-04 19:33:18 [multiproc_worker_utils.py:238]   File "/home/work/factchecking/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/worker/model_runner.py", line 1173, in load_model
(VllmWorkerProcess pid=2133317) ERROR 06-04 19:33:18 [multiproc_worker_utils.py:238]     self.model = get_model(vllm_config=self.vllm_config)
(VllmWorkerProcess pid=2133317) ERROR 06-04 19:33:18 [multiproc_worker_utils.py:238]                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=2133317) ERROR 06-04 19:33:18 [multiproc_worker_utils.py:238]   File "/home/work/factchecking/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/model_executor/model_loader/__init__.py", line 58, in get_model
(VllmWorkerProcess pid=2133317) ERROR 06-04 19:33:18 [multiproc_worker_utils.py:238]     return loader.load_model(vllm_config=vllm_config,
(VllmWorkerProcess pid=2133317) ERROR 06-04 19:33:18 [multiproc_worker_utils.py:238]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=2133317) ERROR 06-04 19:33:18 [multiproc_worker_utils.py:238]   File "/home/work/factchecking/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/model_executor/model_loader/default_loader.py", line 277, in load_model
(VllmWorkerProcess pid=2133317) ERROR 06-04 19:33:18 [multiproc_worker_utils.py:238]     loaded_weights = model.load_weights(
(VllmWorkerProcess pid=2133317) ERROR 06-04 19:33:18 [multiproc_worker_utils.py:238]                      ^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=2133317) ERROR 06-04 19:33:18 [multiproc_worker_utils.py:238]   File "/home/work/factchecking/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/model_executor/models/gemma3_mm.py", line 699, in load_weights
(VllmWorkerProcess pid=2133317) ERROR 06-04 19:33:18 [multiproc_worker_utils.py:238]     return loader.load_weights(weights)
(VllmWorkerProcess pid=2133317) ERROR 06-04 19:33:18 [multiproc_worker_utils.py:238]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=2133317) ERROR 06-04 19:33:18 [multiproc_worker_utils.py:238]   File "/home/work/factchecking/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/model_executor/models/utils.py", line 277, in load_weights
(VllmWorkerProcess pid=2133317) ERROR 06-04 19:33:18 [multiproc_worker_utils.py:238]     autoloaded_weights = set(self._load_module("", self.module, weights))
(VllmWorkerProcess pid=2133317) ERROR 06-04 19:33:18 [multiproc_worker_utils.py:238]                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=2133317) ERROR 06-04 19:33:18 [multiproc_worker_utils.py:238]   File "/home/work/factchecking/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/model_executor/models/utils.py", line 263, in _load_module
(VllmWorkerProcess pid=2133317) ERROR 06-04 19:33:18 [multiproc_worker_utils.py:238]     raise ValueError(msg)
(VllmWorkerProcess pid=2133317) ERROR 06-04 19:33:18 [multiproc_worker_utils.py:238] ValueError: There is no module or parameter named 'model' in Gemma3ForConditionalGeneration
[rank0]: Traceback (most recent call last):
[rank0]:   File "/home/work/factchecking/PetQA/generate_answer_vllm.py", line 203, in <module>
[rank0]:     main(args.model_name, args.shot, args.use_raw_format, args.use_finetuned_model)
[rank0]:   File "/home/work/factchecking/PetQA/generate_answer_vllm.py", line 185, in main
[rank0]:     llm, tokenizer = load_model_and_tokenizer(model_name, use_finetuned_model, env["fintuned_path"])
[rank0]:                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/work/factchecking/PetQA/generate_answer_vllm.py", line 78, in load_model_and_tokenizer
[rank0]:     llm = LLM(
[rank0]:           ^^^^
[rank0]:   File "/home/work/factchecking/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/utils.py", line 1183, in inner
[rank0]:     return fn(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/work/factchecking/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/entrypoints/llm.py", line 253, in __init__
[rank0]:     self.llm_engine = LLMEngine.from_engine_args(
[rank0]:                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/work/factchecking/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/engine/llm_engine.py", line 501, in from_engine_args
[rank0]:     return engine_cls.from_vllm_config(
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/work/factchecking/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/engine/llm_engine.py", line 477, in from_vllm_config
[rank0]:     return cls(
[rank0]:            ^^^^
[rank0]:   File "/home/work/factchecking/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/engine/llm_engine.py", line 265, in __init__
[rank0]:     self.model_executor = executor_class(vllm_config=vllm_config)
[rank0]:                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/work/factchecking/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/executor/executor_base.py", line 286, in __init__
[rank0]:     super().__init__(*args, **kwargs)
[rank0]:   File "/home/work/factchecking/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/executor/executor_base.py", line 52, in __init__
[rank0]:     self._init_executor()
[rank0]:   File "/home/work/factchecking/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/executor/mp_distributed_executor.py", line 125, in _init_executor
[rank0]:     self._run_workers("load_model",
[rank0]:   File "/home/work/factchecking/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/executor/mp_distributed_executor.py", line 185, in _run_workers
[rank0]:     driver_worker_output = run_method(self.driver_worker, sent_method,
[rank0]:                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/work/factchecking/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/utils.py", line 2605, in run_method
[rank0]:     return func(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/work/factchecking/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/worker/worker.py", line 207, in load_model
[rank0]:     self.model_runner.load_model()
[rank0]:   File "/home/work/factchecking/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/worker/model_runner.py", line 1173, in load_model
[rank0]:     self.model = get_model(vllm_config=self.vllm_config)
[rank0]:                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/work/factchecking/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/model_executor/model_loader/__init__.py", line 58, in get_model
[rank0]:     return loader.load_model(vllm_config=vllm_config,
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/work/factchecking/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/model_executor/model_loader/default_loader.py", line 277, in load_model
[rank0]:     loaded_weights = model.load_weights(
[rank0]:                      ^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/work/factchecking/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/model_executor/models/gemma3_mm.py", line 699, in load_weights
[rank0]:     return loader.load_weights(weights)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/work/factchecking/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/model_executor/models/utils.py", line 277, in load_weights
[rank0]:     autoloaded_weights = set(self._load_module("", self.module, weights))
[rank0]:                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/work/factchecking/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/model_executor/models/utils.py", line 263, in _load_module
[rank0]:     raise ValueError(msg)
[rank0]: ValueError: There is no module or parameter named 'model' in Gemma3ForConditionalGeneration
Loading safetensors checkpoint shards:   0% Completed | 0/4 [00:00<?, ?it/s]

[rank0]:[W604 19:33:18.464078973 ProcessGroupNCCL.cpp:1476] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
[main1:2132950:0:2132950] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x7fa0013bc9e0)
==== backtrace (tid:2132950) ====
 0  /opt/hpcx/ucx/lib/libucs.so.0(ucs_handle_error+0x2e4) [0x7f9fee270614]
 1  /opt/hpcx/ucx/lib/libucs.so.0(+0x3680c) [0x7f9fee27080c]
 2  /opt/hpcx/ucx/lib/libucs.so.0(+0x36a48) [0x7f9fee270a48]
 3  /usr/lib/x86_64-linux-gnu/libc.so.6(+0x45320) [0x7fa001633320]
 4  /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.570.86.15(+0x2d4c1) [0x7f9ffc02d4c1]
 5  /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.570.86.15(nvmlShutdown+0xe) [0x7f9ffc01c46e]
 6  /opt/kernel/libcudahook.ubuntu18.04.x86_64.so(+0xc3caf) [0x7fa0018c3caf]
 7  /opt/kernel/libcudahook.ubuntu18.04.x86_64.so(+0xc3d48) [0x7fa0018c3d48]
 8  /opt/kernel/libcudahook.ubuntu18.04.x86_64.so(+0xc3a2d) [0x7fa0018c3a2d]
 9  /lib64/ld-linux-x86-64.so.2(+0x10f2) [0x7fa0024280f2]
10  /lib64/ld-linux-x86-64.so.2(+0x5578) [0x7fa00242c578]
11  /usr/lib/x86_64-linux-gnu/libc.so.6(+0x47a66) [0x7fa001635a66]
12  /usr/lib/x86_64-linux-gnu/libc.so.6(+0x47bae) [0x7fa001635bae]
13  /usr/lib/x86_64-linux-gnu/libc.so.6(+0x2a1d1) [0x7fa0016181d1]
14  /usr/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x8b) [0x7fa00161828b]
15  python() [0x5d01f9]
=================================
/home/work/factchecking/miniconda3/envs/vllm/lib/python3.12/multiprocessing/resource_tracker.py:255: UserWarning: resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
Segmentation fault (core dumped)

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
