Your current environment
The output of `python collect_env.py`
🐛 Describe the bug
Same launch command as in #5969.
The only difference is the commit hash: 2cd402e (latest main as of earlier today).
The GPU is completely free, so this is a new bug introduced in vLLM somewhere between commits e9de9dd and 2cd402e.
INFO 06-28 23:40:03 api_server.py:206] vLLM API server version 0.5.0.post1
INFO 06-28 23:40:03 api_server.py:207] args: Namespace(host='0.0.0.0', port=5063, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, chat_template=None, respon>
INFO 06-28 23:40:03 llm_engine.py:164] Initializing an LLM engine (v0.5.0.post1) with config: model='microsoft/Phi-3-vision-128k-instruct', speculative_config=None, tokenizer='microsoft/Phi-3-vision-128k-instruct', skip_tokenizer_init=False, tokenizer_mode=auto>
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO 06-28 23:40:04 selector.py:171] Cannot use FlashAttention-2 backend due to sliding window.
INFO 06-28 23:40:04 selector.py:53] Using XFormers backend.
INFO 06-28 23:40:04 selector.py:171] Cannot use FlashAttention-2 backend due to sliding window.
INFO 06-28 23:40:04 selector.py:53] Using XFormers backend.
INFO 06-28 23:40:05 weight_utils.py:218] Using model weights format ['*.safetensors']
INFO 06-28 23:40:06 model_runner.py:220] Loading model weights took 7.7732 GB
/home/ubuntu/miniconda3/envs/vllm/lib/python3.10/site-packages/transformers/models/auto/image_processing_auto.py:510: FutureWarning: The image_processor_class argument is deprecated and will be removed in v4.42. Please use `slow_image_processor_class`, or `fast>
warnings.warn(
INFO 06-28 23:40:14 gpu_executor.py:83] # GPU blocks: 3184, # CPU blocks: 682
[rank0]: Traceback (most recent call last):
[rank0]: File "/home/ubuntu/miniconda3/envs/vllm/lib/python3.10/runpy.py", line 196, in _run_module_as_main
[rank0]: return _run_code(code, main_globals, None,
[rank0]: File "/home/ubuntu/miniconda3/envs/vllm/lib/python3.10/runpy.py", line 86, in _run_code
[rank0]: exec(code, run_globals)
[rank0]: File "/home/ubuntu/vllm/vllm/entrypoints/openai/api_server.py", line 225, in <module>
[rank0]: engine = AsyncLLMEngine.from_engine_args(
[rank0]: File "/home/ubuntu/vllm/vllm/engine/async_llm_engine.py", line 425, in from_engine_args
[rank0]: engine = cls(
[rank0]: File "/home/ubuntu/vllm/vllm/engine/async_llm_engine.py", line 359, in __init__
[rank0]: self.engine = self._init_engine(*args, **kwargs)
[rank0]: File "/home/ubuntu/vllm/vllm/engine/async_llm_engine.py", line 500, in _init_engine
[rank0]: return engine_class(*args, **kwargs)
[rank0]: File "/home/ubuntu/vllm/vllm/engine/llm_engine.py", line 246, in __init__
[rank0]: self._initialize_kv_caches()
[rank0]: File "/home/ubuntu/vllm/vllm/engine/llm_engine.py", line 342, in _initialize_kv_caches
[rank0]: self.model_executor.initialize_cache(num_gpu_blocks, num_cpu_blocks)
[rank0]: File "/home/ubuntu/vllm/vllm/executor/gpu_executor.py", line 86, in initialize_cache
[rank0]: self.driver_worker.initialize_cache(num_gpu_blocks, num_cpu_blocks)
[rank0]: File "/home/ubuntu/vllm/vllm/worker/worker.py", line 207, in initialize_cache
[rank0]: raise_if_cache_size_invalid(num_gpu_blocks,
[rank0]: File "/home/ubuntu/vllm/vllm/worker/worker.py", line 344, in raise_if_cache_size_invalid
[rank0]: raise ValueError(
[rank0]: ValueError: The model's max seq len (131072) is larger than the maximum number of tokens that can be stored in KV cache (50944). Try increasing `gpu_memory_utilization` or decreasing `max_model_len` when initializing the engine.
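For reference, the reported capacity matches the block count above: 3184 GPU blocks × 16 tokens per block = 50,944 tokens of KV cache, which is less than the model's 131,072-token max sequence length. As a workaround (not a fix for the regression itself), the engine can be capped below that limit via `max_model_len`, as the error suggests. A minimal sketch, assuming offline use of the `LLM` class rather than the API server:

```python
from vllm import LLM

# Workaround sketch only: keep max_model_len at or below the KV cache capacity
# reported in the log (3184 blocks * 16 tokens/block = 50944 tokens).
llm = LLM(
    model="microsoft/Phi-3-vision-128k-instruct",
    trust_remote_code=True,       # Phi-3-vision requires remote code
    max_model_len=50944,          # must fit within the available KV cache
    gpu_memory_utilization=0.95,  # alternatively, raise this to get more KV cache blocks
)
```

The same two knobs are exposed as `--max-model-len` and `--gpu-memory-utilization` on the OpenAI API server command line, but this only sidesteps the error; the underlying question is why the KV cache shrank between the two commits on an otherwise free GPU.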