
[Bug]: deepseek-vl2 RuntimeError: Input type (float) and bias type (c10::BFloat16) should be the same #19219

Closed
@NickLucche

Description

Your current environment

0.9.1.dev70+g8f8900cee.precompiled

🐛 Describe the bug

Just a quick guess, but perhaps this is related to the dtype resolution change? @DarkLight1337

vllm serve deepseek-ai/deepseek-vl2-small --trust_remote_code --hf_overrides '{"architectures": ["DeepseekVLV2ForCausalLM"]}'
INFO 06-05 16:06:04 [__init__.py:244] Automatically detected platform cuda.
INFO 06-05 16:06:08 [api_server.py:1289] vLLM API server version 0.9.1.dev70+g8f8900cee
INFO 06-05 16:06:09 [cli_args.py:309] non-default args: {'model': 'deepseek-ai/deepseek-vl2-small', 'trust_remote_code': True, 'hf_overrides': {'architectures': ['DeepseekVLV2ForCausalLM']}}
INFO 06-05 16:06:09 [config.py:532] Overriding HF config with {'architectures': ['DeepseekVLV2ForCausalLM']}
INFO 06-05 16:06:15 [config.py:822] This model supports multiple tasks: {'score', 'classify', 'reward', 'embed', 'generate'}. Defaulting to 'generate'.
INFO 06-05 16:06:15 [config.py:2176] Chunked prefill is enabled with max_num_batched_tokens=8192.
INFO 06-05 16:06:15 [cuda.py:154] Forcing kv cache block size to 64 for FlashMLA backend.
WARNING 06-05 16:06:17 [env_override.py:17] NCCL_CUMEM_ENABLE is set to 0, skipping override. This may increase memory overhead with cudagraph+allreduce: https://github.com/NVIDIA/nccl/issues/1234
INFO 06-05 16:06:20 [__init__.py:244] Automatically detected platform cuda.
INFO 06-05 16:06:22 [core.py:455] Waiting for init message from front-end.
INFO 06-05 16:06:22 [core.py:70] Initializing a V1 LLM engine (v0.9.1.dev70+g8f8900cee) with config: model='deepseek-ai/deepseek-vl2-small', speculative_config=None, tokenizer='deepseek-ai/deepseek-vl2-small', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config={}, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=4096, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=deepseek-ai/deepseek-vl2-small, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=True, pooler_config=None, compilation_config={"level":3,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":["none"],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output"],"use_inductor":true,"compile_sizes":[],"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[512,504,496,488,480,472,464,456,448,440,432,424,416,408,400,392,384,376,368,360,352,344,336,328,320,312,304,296,288,280,272,264,256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"max_capture_size":512,"local_cache_dir":null}
WARNING 06-05 16:06:22 [utils.py:2722] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes,initialize_cache not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x792c6c5e15e0>
INFO 06-05 16:06:23 [parallel_state.py:1065] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
INFO 06-05 16:06:25 [topk_topp_sampler.py:49] Using FlashInfer for top-p & top-k sampling.
INFO 06-05 16:06:25 [gpu_model_runner.py:1586] Starting to load model deepseek-ai/deepseek-vl2-small...
INFO 06-05 16:06:25 [gpu_model_runner.py:1591] Loading model from scratch...
WARNING 06-05 16:06:26 [rocm.py:28] Failed to import from amdsmi with ModuleNotFoundError("No module named 'amdsmi'")
WARNING 06-05 16:06:26 [rocm.py:39] Failed to import from vllm._rocm_C with ModuleNotFoundError("No module named 'vllm._rocm_C'")
INFO 06-05 16:06:26 [cuda.py:216] Using FlashMLA backend on V1 engine.
INFO 06-05 16:06:26 [weight_utils.py:292] Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards:   0% Completed | 0/4 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  25% Completed | 1/4 [00:01<00:05,  1.90s/it]
Loading safetensors checkpoint shards:  50% Completed | 2/4 [00:04<00:04,  2.24s/it]
Loading safetensors checkpoint shards:  75% Completed | 3/4 [00:06<00:02,  2.39s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:09<00:00,  2.45s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:09<00:00,  2.37s/it]

INFO 06-05 16:06:36 [default_loader.py:272] Loading weights took 9.89 seconds
INFO 06-05 16:06:37 [gpu_model_runner.py:1615] Model loading took 30.1190 GiB and 10.867510 seconds
INFO 06-05 16:06:37 [gpu_model_runner.py:1940] Encoder cache will be initialized with a budget of 8192 tokens, and profiled with 4 image items of the maximum feature size.
ERROR 06-05 16:06:37 [core.py:515] EngineCore failed to start.
ERROR 06-05 16:06:37 [core.py:515] Traceback (most recent call last):
ERROR 06-05 16:06:37 [core.py:515]   File "/home/nicolo/vllmd/vllm/vllm/v1/engine/core.py", line 506, in run_engine_core
ERROR 06-05 16:06:37 [core.py:515]     engine_core = EngineCoreProc(*args, **kwargs)
ERROR 06-05 16:06:37 [core.py:515]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-05 16:06:37 [core.py:515]   File "/home/nicolo/vllmd/vllm/vllm/v1/engine/core.py", line 390, in __init__
ERROR 06-05 16:06:37 [core.py:515]     super().__init__(vllm_config, executor_class, log_stats,
ERROR 06-05 16:06:37 [core.py:515]   File "/home/nicolo/vllmd/vllm/vllm/v1/engine/core.py", line 83, in __init__
ERROR 06-05 16:06:37 [core.py:515]     self._initialize_kv_caches(vllm_config)
ERROR 06-05 16:06:37 [core.py:515]   File "/home/nicolo/vllmd/vllm/vllm/v1/engine/core.py", line 141, in _initialize_kv_caches
ERROR 06-05 16:06:37 [core.py:515]     available_gpu_memory = self.model_executor.determine_available_memory()
ERROR 06-05 16:06:37 [core.py:515]                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-05 16:06:37 [core.py:515]   File "/home/nicolo/vllmd/vllm/vllm/v1/executor/abstract.py", line 76, in determine_available_memory
ERROR 06-05 16:06:37 [core.py:515]     output = self.collective_rpc("determine_available_memory")
ERROR 06-05 16:06:37 [core.py:515]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-05 16:06:37 [core.py:515]   File "/home/nicolo/vllmd/vllm/vllm/executor/uniproc_executor.py", line 57, in collective_rpc
ERROR 06-05 16:06:37 [core.py:515]     answer = run_method(self.driver_worker, method, args, kwargs)
ERROR 06-05 16:06:37 [core.py:515]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-05 16:06:37 [core.py:515]   File "/home/nicolo/vllmd/vllm/vllm/utils.py", line 2656, in run_method
ERROR 06-05 16:06:37 [core.py:515]     return func(*args, **kwargs)
ERROR 06-05 16:06:37 [core.py:515]            ^^^^^^^^^^^^^^^^^^^^^
ERROR 06-05 16:06:37 [core.py:515]   File "/home/nicolo/vllmd/.venv/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
ERROR 06-05 16:06:37 [core.py:515]     return func(*args, **kwargs)
ERROR 06-05 16:06:37 [core.py:515]            ^^^^^^^^^^^^^^^^^^^^^
ERROR 06-05 16:06:37 [core.py:515]   File "/home/nicolo/vllmd/vllm/vllm/v1/worker/gpu_worker.py", line 186, in determine_available_memory
ERROR 06-05 16:06:37 [core.py:515]     self.model_runner.profile_run()
ERROR 06-05 16:06:37 [core.py:515]   File "/home/nicolo/vllmd/vllm/vllm/v1/worker/gpu_model_runner.py", line 1962, in profile_run
ERROR 06-05 16:06:37 [core.py:515]     dummy_encoder_outputs = self.model.get_multimodal_embeddings(
ERROR 06-05 16:06:37 [core.py:515]                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-05 16:06:37 [core.py:515]   File "/home/nicolo/vllmd/vllm/vllm/model_executor/models/deepseek_vl2.py", line 594, in get_multimodal_embeddings
ERROR 06-05 16:06:37 [core.py:515]     vision_embeddings = self._process_image_input(image_input)
ERROR 06-05 16:06:37 [core.py:515]                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-05 16:06:37 [core.py:515]   File "/home/nicolo/vllmd/vllm/vllm/model_executor/models/deepseek_vl2.py", line 583, in _process_image_input
ERROR 06-05 16:06:37 [core.py:515]     return self._pixel_values_to_embedding(
ERROR 06-05 16:06:37 [core.py:515]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-05 16:06:37 [core.py:515]   File "/home/nicolo/vllmd/vllm/vllm/model_executor/models/deepseek_vl2.py", line 483, in _pixel_values_to_embedding
ERROR 06-05 16:06:37 [core.py:515]     images_feature = self.vision.forward_features(total_tiles)
ERROR 06-05 16:06:37 [core.py:515]                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-05 16:06:37 [core.py:515]   File "/home/nicolo/vllmd/.venv/lib/python3.12/site-packages/timm/models/vision_transformer.py", line 827, in forward_features
ERROR 06-05 16:06:37 [core.py:515]     x = self.patch_embed(x)
ERROR 06-05 16:06:37 [core.py:515]         ^^^^^^^^^^^^^^^^^^^
ERROR 06-05 16:06:37 [core.py:515]   File "/home/nicolo/vllmd/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
ERROR 06-05 16:06:37 [core.py:515]     return self._call_impl(*args, **kwargs)
ERROR 06-05 16:06:37 [core.py:515]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-05 16:06:37 [core.py:515]   File "/home/nicolo/vllmd/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
ERROR 06-05 16:06:37 [core.py:515]     return forward_call(*args, **kwargs)
ERROR 06-05 16:06:37 [core.py:515]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-05 16:06:37 [core.py:515]   File "/home/nicolo/vllmd/.venv/lib/python3.12/site-packages/timm/layers/patch_embed.py", line 131, in forward
ERROR 06-05 16:06:37 [core.py:515]     x = self.proj(x)
ERROR 06-05 16:06:37 [core.py:515]         ^^^^^^^^^^^^
ERROR 06-05 16:06:37 [core.py:515]   File "/home/nicolo/vllmd/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
ERROR 06-05 16:06:37 [core.py:515]     return self._call_impl(*args, **kwargs)
ERROR 06-05 16:06:37 [core.py:515]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-05 16:06:37 [core.py:515]   File "/home/nicolo/vllmd/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
ERROR 06-05 16:06:37 [core.py:515]     return forward_call(*args, **kwargs)
ERROR 06-05 16:06:37 [core.py:515]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-05 16:06:37 [core.py:515]   File "/home/nicolo/vllmd/.venv/lib/python3.12/site-packages/torch/nn/modules/conv.py", line 554, in forward
ERROR 06-05 16:06:37 [core.py:515]     return self._conv_forward(input, self.weight, self.bias)
ERROR 06-05 16:06:37 [core.py:515]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-05 16:06:37 [core.py:515]   File "/home/nicolo/vllmd/.venv/lib/python3.12/site-packages/torch/nn/modules/conv.py", line 549, in _conv_forward
ERROR 06-05 16:06:37 [core.py:515]     return F.conv2d(
ERROR 06-05 16:06:37 [core.py:515]            ^^^^^^^^^
ERROR 06-05 16:06:37 [core.py:515] RuntimeError: Input type (float) and bias type (c10::BFloat16) should be the same
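For reference, the failure mode can be reproduced outside vLLM with a minimal sketch (the shapes below are hypothetical, not taken from the deepseek-vl2 config): a timm-style patch-embed conv whose weights were cast to bfloat16 at load time, fed float32 pixel values. Casting the dummy pixel values to the vision tower's dtype before calling self.vision.forward_features(total_tiles) would presumably avoid the crash, assuming the rest of the vision path runs in bfloat16.

import torch
import torch.nn as nn

# Hypothetical patch-embed conv; weights/bias in bfloat16, as after model load.
proj = nn.Conv2d(3, 1024, kernel_size=14, stride=14).to(torch.bfloat16)
pixels = torch.rand(1, 3, 384, 384)  # float32, like the profiling dummy input presumably is

try:
    proj(pixels)
except RuntimeError as e:
    print(e)  # dtype-mismatch RuntimeError analogous to the one in the traceback above

# Casting the input to the module's dtype avoids the mismatch; a cast along these
# lines before the forward_features call is one possible direction for a fix.
out = proj(pixels.to(proj.weight.dtype))
print(out.dtype)  # torch.bfloat16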
