
[Usage]: Failure to Init Qwen2.5-VL-7B-Instruct with inflight bnb quantization #12900

Closed
@MotorBottle

Description


Your current environment

Docker image vllm-openai:v0.7.2 with the latest transformers installed.

How would you like to use vllm

Hi, I'm trying to launch Qwen2.5-VL-7B-Instruct with bnb in-flight quantization, but the launch fails with the following error:

AssertionError: param_data.shape == loaded_weight.shape
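
For context, the failing check (per the traceback below) lives in vllm/model_executor/layers/linear.py and is just a shape guard on the loaded tensor. A simplified sketch of what it does (illustrative only, not the actual vLLM code):

# Simplified sketch of the shape guard that trips; names are illustrative.
import torch

def weight_loader(param: torch.nn.Parameter, loaded_weight: torch.Tensor,
                  shard_id: str) -> None:
    # In the real code, param.data is first narrowed to the shard selected by
    # shard_id; this sketch skips that step.
    param_data = param.data
    # The pre-allocated parameter and the checkpoint tensor must agree in
    # shape; with bnb in-flight quantization they apparently don't here.
    assert param_data.shape == loaded_weight.shape
    param_data.copy_(loaded_weight)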

I was able to run this model at full precision with Docker. Below is how I launch the full-precision version:

# Note: I installed the new transformers and committed it into a new image (qwen-vl-fixed).
sudo docker run --runtime nvidia --gpus '"device=0,1"' --ipc=host -p 18434:8000 \
   -v hf_cache:/root/.cache/huggingface -d \
   --name qwen2.5-vl-7b \
   --entrypoint "python3" qwen-vl-fixed \
   -m vllm.entrypoints.openai.api_server --model Qwen/Qwen2.5-VL-7B-Instruct \
   --tensor-parallel-size 2 --trust-remote-code --max-model-len 18000 --dtype half

When I add --quantization bitsandbytes --load-format bitsandbytes to the docker command, the launch in bnb 4-bit in-flight quantization fails. #12604 says this model is supported, so I wonder whether the dtype causes the error (my 2080 Ti Turing GPUs only support float16, not bfloat16). A reconstruction of the failing command is shown below.
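
For reference, the failing launch reconstructed from the args in the log below (which show tensor_parallel_size=1 and max_model_len=12000; the container name here is hypothetical):

sudo docker run --runtime nvidia --gpus '"device=0,1"' --ipc=host -p 18434:8000 \
   -v hf_cache:/root/.cache/huggingface -d \
   --name qwen2.5-vl-7b-bnb \
   --entrypoint "python3" qwen-vl-fixed \
   -m vllm.entrypoints.openai.api_server --model Qwen/Qwen2.5-VL-7B-Instruct \
   --tensor-parallel-size 1 --trust-remote-code --max-model-len 12000 --dtype half \
   --quantization bitsandbytes --load-format bitsandbytes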

Below is the full error log:

INFO 02-07 05:08:11 __init__.py:190] Automatically detected platform cuda.

INFO 02-07 05:08:13 api_server.py:840] vLLM API server version 0.7.2

INFO 02-07 05:08:13 api_server.py:841] args: Namespace(host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, chat_template_content_format='auto', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_request_id_headers=False, enable_auto_tool_choice=False, enable_reasoning=False, reasoning_parser=None, tool_call_parser=None, tool_parser_plugin='', model='Qwen/Qwen2.5-VL-7B-Instruct', task='auto', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=True, allowed_local_media_path=None, download_dir=None, load_format='bitsandbytes', config_format=<ConfigFormat.AUTO: 'auto'>, dtype='float16', kv_cache_dtype='auto', max_model_len=12000, guided_decoding_backend='xgrammar', logits_processor_pattern=None, model_impl='auto', distributed_executor_backend=None, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=None, enable_prefix_caching=None, disable_sliding_window=False, use_v2_block_manager=True, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=None, max_logprobs=20, disable_log_stats=False, quantization='bitsandbytes', rope_scaling=None, rope_theta=None, hf_overrides=None, enforce_eager=False, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, disable_mm_preprocessor_cache=False, enable_lora=False, enable_lora_bias=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, multi_step_stream_outputs=True, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_disable_mqa_scorer=False, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, scheduling_policy='fcfs', override_neuron_config=None, override_pooler_config=None, compilation_config=None, kv_transfer_config=None, worker_cls='auto', generation_config=None, override_generation_config=None, enable_sleep_mode=False, calculate_kv_scales=False, disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False, enable_prompt_tokens_details=False)

INFO 02-07 05:08:13 api_server.py:206] Started engine process with PID 77

WARNING 02-07 05:08:17 config.py:2386] Casting torch.bfloat16 to torch.float16.

INFO 02-07 05:08:18 __init__.py:190] Automatically detected platform cuda.

WARNING 02-07 05:08:23 config.py:2386] Casting torch.bfloat16 to torch.float16.

INFO 02-07 05:08:24 config.py:542] This model supports multiple tasks: {'generate', 'classify', 'embed', 'score', 'reward'}. Defaulting to 'generate'.

WARNING 02-07 05:08:24 config.py:621] bitsandbytes quantization is not fully optimized yet. The speed can be slower than non-quantized models.

INFO 02-07 05:08:31 config.py:542] This model supports multiple tasks: {'score', 'classify', 'generate', 'reward', 'embed'}. Defaulting to 'generate'.

WARNING 02-07 05:08:31 config.py:621] bitsandbytes quantization is not fully optimized yet. The speed can be slower than non-quantized models.

INFO 02-07 05:08:33 llm_engine.py:234] Initializing a V0 LLM engine (v0.7.2) with config: model='Qwen/Qwen2.5-VL-7B-Instruct', speculative_config=None, tokenizer='Qwen/Qwen2.5-VL-7B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=12000, download_dir=None, load_format=LoadFormat.BITSANDBYTES, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=bitsandbytes, enforce_eager=False, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=Qwen/Qwen2.5-VL-7B-Instruct, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=False, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":256}, use_cached_outputs=True, 

INFO 02-07 05:08:34 cuda.py:179] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.

INFO 02-07 05:08:34 cuda.py:227] Using XFormers backend.

INFO 02-07 05:08:34 model_runner.py:1110] Starting to load model Qwen/Qwen2.5-VL-7B-Instruct...

INFO 02-07 05:08:35 config.py:2992] cudagraph sizes specified by model runner [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256] is overridden by config [256, 128, 2, 1, 4, 136, 8, 144, 16, 152, 24, 160, 32, 168, 40, 176, 48, 184, 56, 192, 64, 200, 72, 208, 80, 216, 88, 120, 224, 96, 232, 104, 240, 112, 248]

INFO 02-07 05:08:35 loader.py:1102] Loading weights with BitsAndBytes quantization.  May take a while ...

INFO 02-07 05:08:36 weight_utils.py:252] Using model weights format ['*.safetensors']


Loading safetensors checkpoint shards:   0% Completed | 0/5 [00:00<?, ?it/s]

ERROR 02-07 05:08:37 engine.py:389] 

Traceback (most recent call last):

  File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 380, in run_mp_engine

    engine = MQLLMEngine.from_engine_args(engine_args=engine_args,

             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

  File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 123, in from_engine_args

    return cls(ipc_path=ipc_path,

           ^^^^^^^^^^^^^^^^^^^^^^

  File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 75, in __init__

    self.engine = LLMEngine(*args, **kwargs)

                  ^^^^^^^^^^^^^^^^^^^^^^^^^^

  File "/usr/local/lib/python3.12/dist-packages/vllm/engine/llm_engine.py", line 273, in __init__

    self.model_executor = executor_class(vllm_config=vllm_config, )

                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

  File "/usr/local/lib/python3.12/dist-packages/vllm/executor/executor_base.py", line 51, in __init__

    self._init_executor()

  File "/usr/local/lib/python3.12/dist-packages/vllm/executor/uniproc_executor.py", line 42, in _init_executor

    self.collective_rpc("load_model")

  File "/usr/local/lib/python3.12/dist-packages/vllm/executor/uniproc_executor.py", line 51, in collective_rpc

    answer = run_method(self.driver_worker, method, args, kwargs)

             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

  File "/usr/local/lib/python3.12/dist-packages/vllm/utils.py", line 2220, in run_method

    return func(*args, **kwargs)

           ^^^^^^^^^^^^^^^^^^^^^

  File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker.py", line 183, in load_model

    self.model_runner.load_model()

  File "/usr/local/lib/python3.12/dist-packages/vllm/worker/model_runner.py", line 1112, in load_model

    self.model = get_model(vllm_config=self.vllm_config)

Process SpawnProcess-1:

                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

  File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/__init__.py", line 14, in get_model

    return loader.load_model(vllm_config=vllm_config)

           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

  File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/loader.py", line 1225, in load_model

    self._load_weights(model_config, model)

  File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/loader.py", line 1135, in _load_weights

    loaded_weights = model.load_weights(qweight_iterator)

                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

  File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen2_5_vl.py", line 1124, in load_weights

    return loader.load_weights(weights, mapper=self.hf_to_vllm_mapper)

           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

  File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/utils.py", line 235, in load_weights

    autoloaded_weights = set(self._load_module("", self.module, weights))

                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

  File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/utils.py", line 196, in _load_module

    yield from self._load_module(prefix,

  File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/utils.py", line 173, in _load_module

    loaded_params = module_load_weights(weights)

                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^

  File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen2.py", line 515, in load_weights

    return loader.load_weights(weights)

           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^

  File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/utils.py", line 235, in load_weights

    autoloaded_weights = set(self._load_module("", self.module, weights))

                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

  File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/utils.py", line 196, in _load_module

    yield from self._load_module(prefix,

  File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/utils.py", line 173, in _load_module

    loaded_params = module_load_weights(weights)

                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^

  File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen2.py", line 400, in load_weights

    weight_loader(param, loaded_weight, shard_id)

  File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/linear.py", line 589, in weight_loader

    assert param_data.shape == loaded_weight.shape

           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

AssertionError

Traceback (most recent call last):

  File "/usr/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap

    self.run()

  File "/usr/lib/python3.12/multiprocessing/process.py", line 108, in run

    self._target(*self._args, **self._kwargs)

  File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 391, in run_mp_engine

    raise e

  File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 380, in run_mp_engine

    engine = MQLLMEngine.from_engine_args(engine_args=engine_args,

             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

  File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 123, in from_engine_args

    return cls(ipc_path=ipc_path,

           ^^^^^^^^^^^^^^^^^^^^^^

  File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 75, in __init__

    self.engine = LLMEngine(*args, **kwargs)

                  ^^^^^^^^^^^^^^^^^^^^^^^^^^

  File "/usr/local/lib/python3.12/dist-packages/vllm/engine/llm_engine.py", line 273, in __init__

    self.model_executor = executor_class(vllm_config=vllm_config, )

                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

  File "/usr/local/lib/python3.12/dist-packages/vllm/executor/executor_base.py", line 51, in __init__

    self._init_executor()

  File "/usr/local/lib/python3.12/dist-packages/vllm/executor/uniproc_executor.py", line 42, in _init_executor

    self.collective_rpc("load_model")

  File "/usr/local/lib/python3.12/dist-packages/vllm/executor/uniproc_executor.py", line 51, in collective_rpc

    answer = run_method(self.driver_worker, method, args, kwargs)

             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

  File "/usr/local/lib/python3.12/dist-packages/vllm/utils.py", line 2220, in run_method

    return func(*args, **kwargs)

           ^^^^^^^^^^^^^^^^^^^^^

  File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker.py", line 183, in load_model

    self.model_runner.load_model()

  File "/usr/local/lib/python3.12/dist-packages/vllm/worker/model_runner.py", line 1112, in load_model

    self.model = get_model(vllm_config=self.vllm_config)

                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

  File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/__init__.py", line 14, in get_model

    return loader.load_model(vllm_config=vllm_config)

           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

  File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/loader.py", line 1225, in load_model

    self._load_weights(model_config, model)

  File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/loader.py", line 1135, in _load_weights

    loaded_weights = model.load_weights(qweight_iterator)

                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

  File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen2_5_vl.py", line 1124, in load_weights

    return loader.load_weights(weights, mapper=self.hf_to_vllm_mapper)

           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

  File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/utils.py", line 235, in load_weights

    autoloaded_weights = set(self._load_module("", self.module, weights))

                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

  File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/utils.py", line 196, in _load_module

    yield from self._load_module(prefix,

  File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/utils.py", line 173, in _load_module

    loaded_params = module_load_weights(weights)

                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^

  File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen2.py", line 515, in load_weights

    return loader.load_weights(weights)

           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^

  File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/utils.py", line 235, in load_weights

    autoloaded_weights = set(self._load_module("", self.module, weights))

                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

  File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/utils.py", line 196, in _load_module

    yield from self._load_module(prefix,

  File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/utils.py", line 173, in _load_module

    loaded_params = module_load_weights(weights)

                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^

  File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen2.py", line 400, in load_weights

    weight_loader(param, loaded_weight, shard_id)

  File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/linear.py", line 589, in weight_loader

    assert param_data.shape == loaded_weight.shape

           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

AssertionError


Loading safetensors checkpoint shards:   0% Completed | 0/5 [00:00<?, ?it/s]

Traceback (most recent call last):

  File "<frozen runpy>", line 198, in _run_module_as_main

  File "<frozen runpy>", line 88, in _run_code

  File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 911, in <module>

    uvloop.run(run_server(args))

  File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 109, in run

    return __asyncio.run(

           ^^^^^^^^^^^^^^

  File "/usr/lib/python3.12/asyncio/runners.py", line 195, in run

    return runner.run(main)

           ^^^^^^^^^^^^^^^^

  File "/usr/lib/python3.12/asyncio/runners.py", line 118, in run

    return self._loop.run_until_complete(task)

           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

  File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete

  File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 61, in wrapper

    return await main

           ^^^^^^^^^^

  File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 875, in run_server

    async with build_async_engine_client(args) as engine_client:

               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

  File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__

    return await anext(self.gen)

           ^^^^^^^^^^^^^^^^^^^^^

  File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 136, in build_async_engine_client

    async with build_async_engine_client_from_engine_args(

               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

  File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__

    return await anext(self.gen)

           ^^^^^^^^^^^^^^^^^^^^^

  File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 230, in build_async_engine_client_from_engine_args

    raise RuntimeError(

RuntimeError: Engine process failed to start. See stack trace for the root cause.
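
The same failure should be reproducible without the OpenAI server via the offline API; a minimal sketch (untested, assuming vLLM 0.7.x LLM kwargs and the same settings as the failing launch above):

# Minimal repro sketch using vLLM's offline API (v0.7.x).
import torch
from vllm import LLM

# Turing GPUs (sm_75) lack native bfloat16, hence float16 everywhere here.
print("bf16 supported:", torch.cuda.is_bf16_supported())

llm = LLM(
    model="Qwen/Qwen2.5-VL-7B-Instruct",
    quantization="bitsandbytes",
    load_format="bitsandbytes",
    dtype="float16",
    max_model_len=12000,
    trust_remote_code=True,
)  # expected to raise the same AssertionError during weight loading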

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.

Metadata

Labels: usage (How to use vllm)