
[Bug]: New bug in last few days for phi-3-vision. The model's max seq len (131072) is larger than the maximum number of tokens that can be stored in KV cache (50944) #5976

Closed
@pseudotensor

Description


Your current environment

The output of `python collect_env.py`

🐛 Describe the bug

Same launch setup as in #5969.

The only difference is commit 2cd402e (latest main as of earlier today).

The GPU is completely free, so this is a new bug introduced in vLLM between commits e9de9dd and 2cd402e.

```
INFO 06-28 23:40:03 api_server.py:206] vLLM API server version 0.5.0.post1
INFO 06-28 23:40:03 api_server.py:207] args: Namespace(host='0.0.0.0', port=5063, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, chat_template=None, respon>
INFO 06-28 23:40:03 llm_engine.py:164] Initializing an LLM engine (v0.5.0.post1) with config: model='microsoft/Phi-3-vision-128k-instruct', speculative_config=None, tokenizer='microsoft/Phi-3-vision-128k-instruct', skip_tokenizer_init=False, tokenizer_mode=auto>
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO 06-28 23:40:04 selector.py:171] Cannot use FlashAttention-2 backend due to sliding window.
INFO 06-28 23:40:04 selector.py:53] Using XFormers backend.
INFO 06-28 23:40:04 selector.py:171] Cannot use FlashAttention-2 backend due to sliding window.
INFO 06-28 23:40:04 selector.py:53] Using XFormers backend.
INFO 06-28 23:40:05 weight_utils.py:218] Using model weights format ['*.safetensors']
INFO 06-28 23:40:06 model_runner.py:220] Loading model weights took 7.7732 GB
/home/ubuntu/miniconda3/envs/vllm/lib/python3.10/site-packages/transformers/models/auto/image_processing_auto.py:510: FutureWarning: The image_processor_class argument is deprecated and will be removed in v4.42. Please use `slow_image_processor_class`, or `fast>
  warnings.warn(
INFO 06-28 23:40:14 gpu_executor.py:83] # GPU blocks: 3184, # CPU blocks: 682
```
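(With vLLM's default block size of 16 tokens per block, 3184 GPU blocks hold 3184 × 16 = 50944 tokens of KV cache, which matches the limit reported in the error below.)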
```
[rank0]: Traceback (most recent call last):
[rank0]:   File "/home/ubuntu/miniconda3/envs/vllm/lib/python3.10/runpy.py", line 196, in _run_module_as_main
[rank0]:     return _run_code(code, main_globals, None,
[rank0]:   File "/home/ubuntu/miniconda3/envs/vllm/lib/python3.10/runpy.py", line 86, in _run_code
[rank0]:     exec(code, run_globals)
[rank0]:   File "/home/ubuntu/vllm/vllm/entrypoints/openai/api_server.py", line 225, in <module>
[rank0]:     engine = AsyncLLMEngine.from_engine_args(
[rank0]:   File "/home/ubuntu/vllm/vllm/engine/async_llm_engine.py", line 425, in from_engine_args
[rank0]:     engine = cls(
[rank0]:   File "/home/ubuntu/vllm/vllm/engine/async_llm_engine.py", line 359, in __init__
[rank0]:     self.engine = self._init_engine(*args, **kwargs)
[rank0]:   File "/home/ubuntu/vllm/vllm/engine/async_llm_engine.py", line 500, in _init_engine
[rank0]:     return engine_class(*args, **kwargs)
[rank0]:   File "/home/ubuntu/vllm/vllm/engine/llm_engine.py", line 246, in __init__
[rank0]:     self._initialize_kv_caches()
[rank0]:   File "/home/ubuntu/vllm/vllm/engine/llm_engine.py", line 342, in _initialize_kv_caches
[rank0]:     self.model_executor.initialize_cache(num_gpu_blocks, num_cpu_blocks)
[rank0]:   File "/home/ubuntu/vllm/vllm/executor/gpu_executor.py", line 86, in initialize_cache
[rank0]:     self.driver_worker.initialize_cache(num_gpu_blocks, num_cpu_blocks)
[rank0]:   File "/home/ubuntu/vllm/vllm/worker/worker.py", line 207, in initialize_cache
[rank0]:     raise_if_cache_size_invalid(num_gpu_blocks,
[rank0]:   File "/home/ubuntu/vllm/vllm/worker/worker.py", line 344, in raise_if_cache_size_invalid
[rank0]:     raise ValueError(
[rank0]: ValueError: The model's max seq len (131072) is larger than the maximum number of tokens that can be stored in KV cache (50944). Try increasing `gpu_memory_utilization` or decreasing `max_model_len` when initializing the engine.
```
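A possible workaround until the regression is tracked down is to set the limits the error message suggests explicitly. Below is a minimal sketch using the 50944-token capacity reported above; the OpenAI server exposes the same knobs as `--max-model-len` and `--gpu-memory-utilization`:

```python
from vllm import LLM

# Workaround sketch, not a fix for the regression itself: cap
# max_model_len at the KV-cache capacity from the error message, or
# raise gpu_memory_utilization so more KV-cache blocks fit.
llm = LLM(
    model="microsoft/Phi-3-vision-128k-instruct",
    trust_remote_code=True,       # Phi-3-vision ships custom code on the Hub
    max_model_len=50944,          # <= tokens the KV cache can hold
    gpu_memory_utilization=0.95,  # illustrative; the default is 0.90
)
```

Raising `gpu_memory_utilization` alone likely cannot close a 131072 vs. 50944 gap on this GPU, so capping `max_model_len` is the practical option; the open question remains why the same launch worked at e9de9dd.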
