### Your current environment
branch: qwen25vl
### How would you like to use vllm
VLLM_ARGS="--limit-mm-per-prompt image=2 \
--tensor-parallel-size 1 \
--max-model-len 16384 \
--served-model-name Qwen2.5-VL-7B-Instruct/ \
--mm-processor-kwargs {\"max_pixels\":1000000} \
--gpu-memory-utilization 0.9 \
--model Qwen/Qwen2.5-VL-7B-Instruct/"
CUDA_VISIBLE_DEVICES=0 python3 -m vllm.entrypoints.openai.api_server ${VLLM_ARGS} --port 8000
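As an aside, the same arguments can be kept in a bash array instead of a whitespace-joined string. This is only a sketch of an equivalent launch, not a fix for the problem below, but it keeps the JSON value for `--mm-processor-kwargs` as a single, correctly quoted argument without backslash escaping:

```bash
# Sketch: hold the arguments in an array; "${VLLM_ARGS[@]}" expands to one
# argument per element, so the single-quoted JSON stays intact.
VLLM_ARGS=(
  --limit-mm-per-prompt image=2
  --tensor-parallel-size 1
  --max-model-len 16384
  --served-model-name Qwen2.5-VL-7B-Instruct/
  --mm-processor-kwargs '{"max_pixels": 1000000}'
  --gpu-memory-utilization 0.9
  --model Qwen/Qwen2.5-VL-7B-Instruct/
)
CUDA_VISIBLE_DEVICES=0 python3 -m vllm.entrypoints.openai.api_server "${VLLM_ARGS[@]}" --port 8000
```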
With these arguments, the configured `max_pixels` is ignored, as the following log shows:
INFO 02-12 03:01:35 cuda.py:230] Using Flash Attention backend.
[W212 03:01:36.265895118 CUDAAllocatorConfig.h:28] Warning: expandable_segments not supported on this platform (function operator())
INFO 02-12 03:01:36 model_runner.py:1110] Starting to load model Qwen/Qwen2.5-VL-7B-autoglm-android-wechat-test-250211/...
INFO 02-12 03:01:36 config.py:2930] cudagraph sizes specified by model runner [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256] is overridden by config [256, 128, 2, 1, 4, 136, 8, 144, 16, 152, 24, 160, 32, 168, 40, 176, 48, 184, 56, 192, 64, 200, 72, 208, 80, 216, 88, 120, 224, 96, 232, 104, 240, 112, 248]
Loading safetensors checkpoint shards: 0% Completed | 0/4 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 25% Completed | 1/4 [00:01<00:03, 1.15s/it]
Loading safetensors checkpoint shards: 50% Completed | 2/4 [00:02<00:01, 1.00it/s]
Loading safetensors checkpoint shards: 75% Completed | 3/4 [00:02<00:00, 1.08it/s]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:03<00:00, 1.42it/s]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:03<00:00, 1.23it/s]
INFO 02-12 03:01:39 model_runner.py:1115] Loading model weights took 16.7361 GB
WARNING 02-12 03:01:41 model_runner.py:1288] Computed max_num_seqs (min(256, 10384 // 11025)) to be less than 1. Setting it to the minimum value of 1.
Keyword argument `max_pixels` is not a valid argument for this processor and will be ignored.
It looks like you are trying to rescale already rescaled images. If the input images have pixel values between 0 and 1, set `do_rescale=False` to avoid rescaling them again.
WARNING 02-12 03:01:43 profiling.py:184] The context length (10384) of the model is too short to hold the multi-modal embeddings in the worst case (11025 tokens in total, out of which {'image': 2450, 'video': 8575} are reserved for multi-modal embeddings). This may cause certain multi-modal inputs to fail during inference, even when the input text is short. To avoid this, you should increase `max_model_len`, reduce `max_num_seqs`, and/or reduce `mm_counts`.
INFO 02-12 03:01:45 worker.py:266] Memory profiling takes 5.11 seconds
INFO 02-12 03:01:45 worker.py:266] the current vLLM instance can use total_gpu_memory (23.65GiB) x gpu_memory_utilization (0.90) = 21.28GiB
INFO 02-12 03:01:45 worker.py:266] model weights take 16.74GiB; non_torch_memory takes 0.08GiB; PyTorch activation peak memory takes 1.38GiB; the rest of the memory reserved for KV Cache is 3.09GiB.
INFO 02-12 03:01:45 executor_base.py:108] # CUDA blocks: 3613, # CPU blocks: 4681
INFO 02-12 03:01:45 executor_base.py:113] Maximum concurrency for 10384 tokens per request: 5.57x
INFO 02-12 03:01:47 model_runner.py:1430] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error occurs during cudagraph capture, consider decreasing `gpu_memory_utilization` or switching to eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
Capturing CUDA graph shapes: 100%|█████████████████████████████████████████████████████████████████████████████████████████| 35/35 [00:14<00:00, 2.44it/s]
INFO 02-12 03:02:02 model_runner.py:1558] Graph capturing finished in 14 secs, took 1.89 GiB
INFO 02-12 03:02:02 llm_engine.py:429] init engine (profile, create kv cache, warmup model) took 22.17 seconds
INFO 02-12 03:02:02 api_server.py:754] Using supplied chat template:
INFO 02-12 03:02:02 api_server.py:754] None
INFO 02-12 03:02:02 launcher.py:19] Available routes are:
INFO 02-12 03:02:02 launcher.py:27] Route: /openapi.json, Methods: HEAD, GET
INFO 02-12 03:02:02 launcher.py:27] Route: /docs, Methods: HEAD, GET
INFO 02-12 03:02:02 launcher.py:27] Route: /docs/oauth2-redirect, Methods: HEAD, GET
INFO 02-12 03:02:02 launcher.py:27] Route: /redoc, Methods: HEAD, GET
INFO 02-12 03:02:02 launcher.py:27] Route: /health, Methods: GET
INFO 02-12 03:02:02 launcher.py:27] Route: /ping, Methods: POST, GET
INFO 02-12 03:02:02 launcher.py:27] Route: /tokenize, Methods: POST
INFO 02-12 03:02:02 launcher.py:27] Route: /detokenize, Methods: POST
INFO 02-12 03:02:02 launcher.py:27] Route: /v1/models, Methods: GET
INFO 02-12 03:02:02 launcher.py:27] Route: /version, Methods: GET
INFO 02-12 03:02:02 launcher.py:27] Route: /v1/chat/completions, Methods: POST
INFO 02-12 03:02:02 launcher.py:27] Route: /v1/completions, Methods: POST
INFO 02-12 03:02:02 launcher.py:27] Route: /v1/embeddings, Methods: POST
INFO 02-12 03:02:02 launcher.py:27] Route: /pooling, Methods: POST
INFO 02-12 03:02:02 launcher.py:27] Route: /score, Methods: POST
INFO 02-12 03:02:02 launcher.py:27] Route: /v1/score, Methods: POST
INFO 02-12 03:02:02 launcher.py:27] Route: /rerank, Methods: POST
INFO 02-12 03:02:02 launcher.py:27] Route: /v1/rerank, Methods: POST
INFO 02-12 03:02:02 launcher.py:27] Route: /v2/rerank, Methods: POST
INFO 02-12 03:02:02 launcher.py:27] Route: /invocations, Methods: POST
INFO: Started server process [6960]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
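Two warnings in the log above are worth separating. First, the processor reports that `max_pixels` is not a recognized keyword argument and drops it, which is the question raised here. Second, the profiling step warns that the worst-case multi-modal budget (2450 image tokens plus 8575 video tokens, 11025 in total) does not fit in the reported context length of 10384, and suggests increasing `max_model_len`, reducing `max_num_seqs`, or reducing the multi-modal counts. Since this prompt only uses images, one option for the second warning is to stop reserving video slots; the sketch below assumes the installed vLLM build accepts a zero per-modality limit:

```bash
# Sketch only: keep image inputs but reserve no video slots during profiling.
# `video=0` assumes this vLLM version allows a zero limit for a modality.
CUDA_VISIBLE_DEVICES=0 python3 -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen2.5-VL-7B-Instruct/ \
  --served-model-name Qwen2.5-VL-7B-Instruct/ \
  --tensor-parallel-size 1 \
  --max-model-len 16384 \
  --gpu-memory-utilization 0.9 \
  --limit-mm-per-prompt image=2,video=0 \
  --mm-processor-kwargs '{"max_pixels": 1000000}' \
  --port 8000
```

This sketch does not address the ignored `max_pixels` itself.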
### Before submitting a new issue...
- [x] Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the [documentation page](https://docs.vllm.ai/en/latest/), which can answer lots of frequently asked questions.