Your current environment
Library versions:
- transformers==4.52.3
- vllm==0.9.0.1
🐛 Describe the bug
I encountered an error when loading into vLLM a model that I merged from a DeepSpeed checkpoint after fine-tuning.
The merge was done with the following code, which produced pytorch_model.bin, config.json, etc. (I also saved the model in safetensors format.)
import os, torch
from deepspeed.utils.zero_to_fp32 import get_fp32_state_dict_from_zero_checkpoint
state_dict = get_fp32_state_dict_from_zero_checkpoint(checkpoint_path)
torch.save(state_dict, os.path.join(output_dir, "pytorch_model.bin"))
For merging, I used the checkpoint with the lowest validation loss as the best model.
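For reference, here is a minimal sketch of how the merged weights could instead be re-saved through transformers so the checkpoint layout matches the stock Hugging Face one (the base model ID below is a placeholder for the base checkpoint I fine-tuned; checkpoint_path and output_dir are the same as above):
from deepspeed.utils.zero_to_fp32 import get_fp32_state_dict_from_zero_checkpoint
from transformers import Gemma3ForConditionalGeneration

# Rebuild the base architecture, copy in the merged fp32 weights, and
# re-save in the standard Hugging Face layout as safetensors shards.
model = Gemma3ForConditionalGeneration.from_pretrained("google/gemma-3-4b-it")
state_dict = get_fp32_state_dict_from_zero_checkpoint(checkpoint_path)
model.load_state_dict(state_dict)  # strict=True, so key-name mismatches would surface here
model.save_pretrained(output_dir, safe_serialization=True)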
There are no issues when using the original (non-fine-tuned) gemma-3-4b model:
python generate_answer_vllm.py --model_name gemma-3-4b --shot 0
However, the fine-tuned model fails to load:
python generate_answer_vllm.py --model_name gemma-3-4b --shot 0 --use_finetuned_model
The model and tokenizer are loaded with:
import torch
from transformers import AutoTokenizer
from vllm import LLM

def load_model_and_tokenizer(model_name, use_finetuned_model, fintuned_path):
    model_path = fintuned_path if use_finetuned_model else MODEL_MAPPING[model_name]
    # Load the vLLM model
    llm = LLM(
        model=model_path,
        tensor_parallel_size=2,
        trust_remote_code=True,
        dtype=torch.bfloat16,
        gpu_memory_utilization=0.9,
    )
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    if tokenizer.pad_token_id is None:
        tokenizer.pad_token_id = tokenizer.eos_token_id
    return llm, tokenizer
ValueError: There is no module or parameter named 'model' in Gemma3ForConditionalGeneration
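To help diagnose this, here is one way to list the tensor names stored in the merged checkpoint (the shard file name is a placeholder for an actual file in best_model); a stray prefix such as model. on every key would match the error above:
from safetensors import safe_open

# Print a sample of tensor names from one checkpoint shard
# (replace the file name with a real shard from the merged model).
with safe_open("model-00001-of-00004.safetensors", framework="pt") as f:
    for name in sorted(f.keys())[:10]:
        print(name)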
I have checked the following resources but was unable to resolve the issue.
#15031
Let me know if additional information is needed. Thank you!
Full error:
(vllm) (base) work@main1[w1ecL7JL-session]:~/factchecking/PetQA$ python generate_answer_vllm.py --model_name gemma-3-4b --shot 0 --use_finetuned_model
INFO 06-04 19:33:00 [__init__.py:243] Automatically detected platform cuda.
2025/06/04 19:33:02 - INFO - MODEL NAME: gemma-3-4b
2025/06/04 19:33:02 - INFO - SHOT: 0
2025/06/04 19:33:02 - INFO - USE RAW FORMAT: False
2025/06/04 19:33:02 - INFO - USE FINETUNED MODEL: True
INFO 06-04 19:33:02 [__init__.py:31] Available plugins for group vllm.general_plugins:
INFO 06-04 19:33:02 [__init__.py:33] - lora_filesystem_resolver -> vllm.plugins.lora_resolvers.filesystem_resolver:register_filesystem_resolver
INFO 06-04 19:33:02 [__init__.py:36] All plugins in this group will be loaded. Set `VLLM_PLUGINS` to control which plugins to load.
INFO 06-04 19:33:11 [config.py:793] This model supports multiple tasks: {'classify', 'generate', 'score', 'embed', 'reward'}. Defaulting to 'generate'.
WARNING 06-04 19:33:11 [arg_utils.py:1431] The model has a long context length (131072). This may cause OOM during the initial memory profiling phase, or result in low performance due to small KV cache size. Consider setting --max-model-len to a smaller value.
INFO 06-04 19:33:11 [config.py:1875] Defaulting to use mp for distributed inference
INFO 06-04 19:33:11 [llm_engine.py:230] Initializing a V0 LLM engine (v0.9.0) with config: model='data/outputs/gemma-3-4b_petqa_preprocessed/best_model', speculative_config=None, tokenizer='data/outputs/gemma-3-4b_petqa_preprocessed/best_model', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config={}, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=131072, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=2, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(backend='xgrammar', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=None, served_model_name=data/outputs/gemma-3-4b_petqa_preprocessed/best_model, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=None, chunked_prefill_enabled=False, use_async_output_proc=True, pooler_config=None, compilation_config={"compile_sizes": [], "inductor_compile_config": {"enable_auto_functionalized_v2": false}, "cudagraph_capture_sizes": [256, 248, 240, 232, 224, 216, 208, 200, 192, 184, 176, 168, 160, 152, 144, 136, 128, 120, 112, 104, 96, 88, 80, 72, 64, 56, 48, 40, 32, 24, 16, 8, 4, 2, 1], "max_capture_size": 256}, use_cached_outputs=False,
(VllmWorkerProcess pid=2133317) INFO 06-04 19:33:12 [multiproc_worker_utils.py:225] Worker ready; awaiting tasks
INFO 06-04 19:33:14 [cuda.py:292] Using Flash Attention backend.
(VllmWorkerProcess pid=2133317) INFO 06-04 19:33:14 [cuda.py:292] Using Flash Attention backend.
INFO 06-04 19:33:15 [utils.py:1077] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=2133317) INFO 06-04 19:33:15 [utils.py:1077] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=2133317) INFO 06-04 19:33:15 [pynccl.py:69] vLLM is using nccl==2.26.2
INFO 06-04 19:33:15 [pynccl.py:69] vLLM is using nccl==2.26.2
(VllmWorkerProcess pid=2133317) INFO 06-04 19:33:16 [custom_all_reduce_utils.py:245] reading GPU P2P access cache from /home/work/.cache/vllm/gpu_p2p_access_cache_for_0,1.json
INFO 06-04 19:33:16 [custom_all_reduce_utils.py:245] reading GPU P2P access cache from /home/work/.cache/vllm/gpu_p2p_access_cache_for_0,1.json
INFO 06-04 19:33:16 [shm_broadcast.py:250] vLLM message queue communication handle: Handle(local_reader_ranks=[1], buffer_handle=(1, 4194304, 6, 'psm_cd4abdf8'), local_subscribe_addr='ipc:///tmp/eb21c03c-6a5c-479a-9948-56093dedfc3b', remote_subscribe_addr=None, remote_addr_ipv6=False)
(VllmWorkerProcess pid=2133317) INFO 06-04 19:33:16 [parallel_state.py:1064] rank 1 in world size 2 is assigned as DP rank 0, PP rank 0, TP rank 1, EP rank 1
INFO 06-04 19:33:16 [parallel_state.py:1064] rank 0 in world size 2 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
INFO 06-04 19:33:16 [model_runner.py:1170] Starting to load model data/outputs/gemma-3-4b_petqa_preprocessed/best_model...
(VllmWorkerProcess pid=2133317) INFO 06-04 19:33:16 [model_runner.py:1170] Starting to load model data/outputs/gemma-3-4b_petqa_preprocessed/best_model...
INFO 06-04 19:33:16 [cuda.py:266] Cannot use FlashAttention-2 backend for head size 72.
INFO 06-04 19:33:16 [cuda.py:289] Using XFormers backend.
(VllmWorkerProcess pid=2133317) INFO 06-04 19:33:16 [cuda.py:266] Cannot use FlashAttention-2 backend for head size 72.
(VllmWorkerProcess pid=2133317) INFO 06-04 19:33:16 [cuda.py:289] Using XFormers backend.
Loading safetensors checkpoint shards: 0% Completed | 0/4 [00:00<?, ?it/s]
(VllmWorkerProcess pid=2133317) ERROR 06-04 19:33:18 [multiproc_worker_utils.py:238] Exception in worker VllmWorkerProcess while processing method load_model.
(VllmWorkerProcess pid=2133317) ERROR 06-04 19:33:18 [multiproc_worker_utils.py:238] Traceback (most recent call last):
(VllmWorkerProcess pid=2133317) ERROR 06-04 19:33:18 [multiproc_worker_utils.py:238] File "/home/work/factchecking/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/executor/multiproc_worker_utils.py", line 232, in _run_worker_process
(VllmWorkerProcess pid=2133317) ERROR 06-04 19:33:18 [multiproc_worker_utils.py:238] output = run_method(worker, method, args, kwargs)
(VllmWorkerProcess pid=2133317) ERROR 06-04 19:33:18 [multiproc_worker_utils.py:238] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=2133317) ERROR 06-04 19:33:18 [multiproc_worker_utils.py:238] File "/home/work/factchecking/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/utils.py", line 2605, in run_method
(VllmWorkerProcess pid=2133317) ERROR 06-04 19:33:18 [multiproc_worker_utils.py:238] return func(*args, **kwargs)
(VllmWorkerProcess pid=2133317) ERROR 06-04 19:33:18 [multiproc_worker_utils.py:238] ^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=2133317) ERROR 06-04 19:33:18 [multiproc_worker_utils.py:238] File "/home/work/factchecking/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/worker/worker.py", line 207, in load_model
(VllmWorkerProcess pid=2133317) ERROR 06-04 19:33:18 [multiproc_worker_utils.py:238] self.model_runner.load_model()
(VllmWorkerProcess pid=2133317) ERROR 06-04 19:33:18 [multiproc_worker_utils.py:238] File "/home/work/factchecking/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/worker/model_runner.py", line 1173, in load_model
(VllmWorkerProcess pid=2133317) ERROR 06-04 19:33:18 [multiproc_worker_utils.py:238] self.model = get_model(vllm_config=self.vllm_config)
(VllmWorkerProcess pid=2133317) ERROR 06-04 19:33:18 [multiproc_worker_utils.py:238] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=2133317) ERROR 06-04 19:33:18 [multiproc_worker_utils.py:238] File "/home/work/factchecking/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/model_executor/model_loader/__init__.py", line 58, in get_model
(VllmWorkerProcess pid=2133317) ERROR 06-04 19:33:18 [multiproc_worker_utils.py:238] return loader.load_model(vllm_config=vllm_config,
(VllmWorkerProcess pid=2133317) ERROR 06-04 19:33:18 [multiproc_worker_utils.py:238] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=2133317) ERROR 06-04 19:33:18 [multiproc_worker_utils.py:238] File "/home/work/factchecking/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/model_executor/model_loader/default_loader.py", line 277, in load_model
(VllmWorkerProcess pid=2133317) ERROR 06-04 19:33:18 [multiproc_worker_utils.py:238] loaded_weights = model.load_weights(
(VllmWorkerProcess pid=2133317) ERROR 06-04 19:33:18 [multiproc_worker_utils.py:238] ^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=2133317) ERROR 06-04 19:33:18 [multiproc_worker_utils.py:238] File "/home/work/factchecking/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/model_executor/models/gemma3_mm.py", line 699, in load_weights
(VllmWorkerProcess pid=2133317) ERROR 06-04 19:33:18 [multiproc_worker_utils.py:238] return loader.load_weights(weights)
(VllmWorkerProcess pid=2133317) ERROR 06-04 19:33:18 [multiproc_worker_utils.py:238] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=2133317) ERROR 06-04 19:33:18 [multiproc_worker_utils.py:238] File "/home/work/factchecking/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/model_executor/models/utils.py", line 277, in load_weights
(VllmWorkerProcess pid=2133317) ERROR 06-04 19:33:18 [multiproc_worker_utils.py:238] autoloaded_weights = set(self._load_module("", self.module, weights))
(VllmWorkerProcess pid=2133317) ERROR 06-04 19:33:18 [multiproc_worker_utils.py:238] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=2133317) ERROR 06-04 19:33:18 [multiproc_worker_utils.py:238] File "/home/work/factchecking/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/model_executor/models/utils.py", line 263, in _load_module
(VllmWorkerProcess pid=2133317) ERROR 06-04 19:33:18 [multiproc_worker_utils.py:238] raise ValueError(msg)
(VllmWorkerProcess pid=2133317) ERROR 06-04 19:33:18 [multiproc_worker_utils.py:238] ValueError: There is no module or parameter named 'model' in Gemma3ForConditionalGeneration
[rank0]: Traceback (most recent call last):
[rank0]: File "/home/work/factchecking/PetQA/generate_answer_vllm.py", line 203, in <module>
[rank0]: main(args.model_name, args.shot, args.use_raw_format, args.use_finetuned_model)
[rank0]: File "/home/work/factchecking/PetQA/generate_answer_vllm.py", line 185, in main
[rank0]: llm, tokenizer = load_model_and_tokenizer(model_name, use_finetuned_model, env["fintuned_path"])
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/work/factchecking/PetQA/generate_answer_vllm.py", line 78, in load_model_and_tokenizer
[rank0]: llm = LLM(
[rank0]: ^^^^
[rank0]: File "/home/work/factchecking/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/utils.py", line 1183, in inner
[rank0]: return fn(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/work/factchecking/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/entrypoints/llm.py", line 253, in __init__
[rank0]: self.llm_engine = LLMEngine.from_engine_args(
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/work/factchecking/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/engine/llm_engine.py", line 501, in from_engine_args
[rank0]: return engine_cls.from_vllm_config(
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/work/factchecking/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/engine/llm_engine.py", line 477, in from_vllm_config
[rank0]: return cls(
[rank0]: ^^^^
[rank0]: File "/home/work/factchecking/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/engine/llm_engine.py", line 265, in __init__
[rank0]: self.model_executor = executor_class(vllm_config=vllm_config)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/work/factchecking/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/executor/executor_base.py", line 286, in __init__
[rank0]: super().__init__(*args, **kwargs)
[rank0]: File "/home/work/factchecking/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/executor/executor_base.py", line 52, in __init__
[rank0]: self._init_executor()
[rank0]: File "/home/work/factchecking/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/executor/mp_distributed_executor.py", line 125, in _init_executor
[rank0]: self._run_workers("load_model",
[rank0]: File "/home/work/factchecking/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/executor/mp_distributed_executor.py", line 185, in _run_workers
[rank0]: driver_worker_output = run_method(self.driver_worker, sent_method,
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/work/factchecking/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/utils.py", line 2605, in run_method
[rank0]: return func(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/work/factchecking/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/worker/worker.py", line 207, in load_model
[rank0]: self.model_runner.load_model()
[rank0]: File "/home/work/factchecking/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/worker/model_runner.py", line 1173, in load_model
[rank0]: self.model = get_model(vllm_config=self.vllm_config)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/work/factchecking/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/model_executor/model_loader/__init__.py", line 58, in get_model
[rank0]: return loader.load_model(vllm_config=vllm_config,
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/work/factchecking/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/model_executor/model_loader/default_loader.py", line 277, in load_model
[rank0]: loaded_weights = model.load_weights(
[rank0]: ^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/work/factchecking/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/model_executor/models/gemma3_mm.py", line 699, in load_weights
[rank0]: return loader.load_weights(weights)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/work/factchecking/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/model_executor/models/utils.py", line 277, in load_weights
[rank0]: autoloaded_weights = set(self._load_module("", self.module, weights))
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/work/factchecking/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/model_executor/models/utils.py", line 263, in _load_module
[rank0]: raise ValueError(msg)
[rank0]: ValueError: There is no module or parameter named 'model' in Gemma3ForConditionalGeneration
Loading safetensors checkpoint shards: 0% Completed | 0/4 [00:00<?, ?it/s]
[rank0]:[W604 19:33:18.464078973 ProcessGroupNCCL.cpp:1476] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
[main1:2132950:0:2132950] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x7fa0013bc9e0)
==== backtrace (tid:2132950) ====
0 /opt/hpcx/ucx/lib/libucs.so.0(ucs_handle_error+0x2e4) [0x7f9fee270614]
1 /opt/hpcx/ucx/lib/libucs.so.0(+0x3680c) [0x7f9fee27080c]
2 /opt/hpcx/ucx/lib/libucs.so.0(+0x36a48) [0x7f9fee270a48]
3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x45320) [0x7fa001633320]
4 /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.570.86.15(+0x2d4c1) [0x7f9ffc02d4c1]
5 /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.570.86.15(nvmlShutdown+0xe) [0x7f9ffc01c46e]
6 /opt/kernel/libcudahook.ubuntu18.04.x86_64.so(+0xc3caf) [0x7fa0018c3caf]
7 /opt/kernel/libcudahook.ubuntu18.04.x86_64.so(+0xc3d48) [0x7fa0018c3d48]
8 /opt/kernel/libcudahook.ubuntu18.04.x86_64.so(+0xc3a2d) [0x7fa0018c3a2d]
9 /lib64/ld-linux-x86-64.so.2(+0x10f2) [0x7fa0024280f2]
10 /lib64/ld-linux-x86-64.so.2(+0x5578) [0x7fa00242c578]
11 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x47a66) [0x7fa001635a66]
12 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x47bae) [0x7fa001635bae]
13 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x2a1d1) [0x7fa0016181d1]
14 /usr/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x8b) [0x7fa00161828b]
15 python() [0x5d01f9]
=================================
/home/work/factchecking/miniconda3/envs/vllm/lib/python3.12/multiprocessing/resource_tracker.py:255: UserWarning: resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
Segmentation fault (core dumped)