WARNING 06-27 13:31:35 [sampling_params.py:344] temperature 1e-06 is less than 0.01, which may cause numerical errors nan or inf in tensors. We have maxed it out to 0.01.
(VllmWorker rank=1 pid=1162710) ERROR 06-27 13:31:36 [multiproc_executor.py:527] WorkerProc hit an exception.
(VllmWorker rank=1 pid=1162710) ERROR 06-27 13:31:36 [multiproc_executor.py:527] Traceback (most recent call last):
(VllmWorker rank=1 pid=1162710) ERROR 06-27 13:31:36 [multiproc_executor.py:527] File "/usr/local/lib/python3.10/dist-packages/vllm/v1/executor/multiproc_executor.py", line 522, in worker_busy_loop
(VllmWorker rank=1 pid=1162710) ERROR 06-27 13:31:36 [multiproc_executor.py:527] output = func(*args, **kwargs)
(VllmWorker rank=1 pid=1162710) ERROR 06-27 13:31:36 [multiproc_executor.py:527] File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(VllmWorker rank=1 pid=1162710) ERROR 06-27 13:31:36 [multiproc_executor.py:527] return func(*args, **kwargs)
(VllmWorker rank=1 pid=1162710) ERROR 06-27 13:31:36 [multiproc_executor.py:527] File "/usr/local/lib/python3.10/dist-packages/vllm/v1/worker/gpu_worker.py", line 293, in execute_model
(VllmWorker rank=1 pid=1162710) ERROR 06-27 13:31:36 [multiproc_executor.py:527] output = self.model_runner.execute_model(scheduler_output,
(VllmWorker rank=1 pid=1162710) ERROR 06-27 13:31:36 [multiproc_executor.py:527] File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(VllmWorker rank=1 pid=1162710) ERROR 06-27 13:31:36 [multiproc_executor.py:527] return func(*args, **kwargs)
(VllmWorker rank=1 pid=1162710) ERROR 06-27 13:31:36 [multiproc_executor.py:527] File "/usr/local/lib/python3.10/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 1260, in execute_model
(VllmWorker rank=1 pid=1162710) ERROR 06-27 13:31:36 [multiproc_executor.py:527] model_output = self.model(
(VllmWorker rank=1 pid=1162710) ERROR 06-27 13:31:36 [multiproc_executor.py:527] File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
(VllmWorker rank=1 pid=1162710) ERROR 06-27 13:31:36 [multiproc_executor.py:527] return self._call_impl(*args, **kwargs)
(VllmWorker rank=1 pid=1162710) ERROR 06-27 13:31:36 [multiproc_executor.py:527] File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1762, in _call_impl
(VllmWorker rank=1 pid=1162710) ERROR 06-27 13:31:36 [multiproc_executor.py:527] return forward_call(*args, **kwargs)
(VllmWorker rank=1 pid=1162710) ERROR 06-27 13:31:36 [multiproc_executor.py:527] File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/qwen2_5_vl.py", line 1136, in forward
(VllmWorker rank=1 pid=1162710) ERROR 06-27 13:31:36 [multiproc_executor.py:527] hidden_states = self.language_model.model(
(VllmWorker rank=1 pid=1162710) ERROR 06-27 13:31:36 [multiproc_executor.py:527] File "/usr/local/lib/python3.10/dist-packages/vllm/compilation/decorators.py", line 246, in __call__
(VllmWorker rank=1 pid=1162710) ERROR 06-27 13:31:36 [multiproc_executor.py:527] model_output = self.forward(*args, **kwargs)
(VllmWorker rank=1 pid=1162710) ERROR 06-27 13:31:36 [multiproc_executor.py:527] File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/qwen2.py", line 336, in forward
(VllmWorker rank=1 pid=1162710) ERROR 06-27 13:31:36 [multiproc_executor.py:527] def forward(
(VllmWorker rank=1 pid=1162710) ERROR 06-27 13:31:36 [multiproc_executor.py:527] File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
(VllmWorker rank=1 pid=1162710) ERROR 06-27 13:31:36 [multiproc_executor.py:527] return self._call_impl(*args, **kwargs)
(VllmWorker rank=1 pid=1162710) ERROR 06-27 13:31:36 [multiproc_executor.py:527] File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1762, in _call_impl
(VllmWorker rank=1 pid=1162710) ERROR 06-27 13:31:36 [multiproc_executor.py:527] return forward_call(*args, **kwargs)
(VllmWorker rank=1 pid=1162710) ERROR 06-27 13:31:36 [multiproc_executor.py:527] File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/eval_frame.py", line 838, in _fn
(VllmWorker rank=1 pid=1162710) ERROR 06-27 13:31:36 [multiproc_executor.py:527] return fn(*args, **kwargs)
(VllmWorker rank=1 pid=1162710) ERROR 06-27 13:31:36 [multiproc_executor.py:527] File "/usr/local/lib/python3.10/dist-packages/torch/fx/graph_module.py", line 830, in call_wrapped
(VllmWorker rank=1 pid=1162710) ERROR 06-27 13:31:36 [multiproc_executor.py:527] return self._wrapped_call(self, *args, **kwargs)
(VllmWorker rank=1 pid=1162710) ERROR 06-27 13:31:36 [multiproc_executor.py:527] File "/usr/local/lib/python3.10/dist-packages/torch/fx/graph_module.py", line 406, in __call__
(VllmWorker rank=1 pid=1162710) ERROR 06-27 13:31:36 [multiproc_executor.py:527] raise e
(VllmWorker rank=1 pid=1162710) ERROR 06-27 13:31:36 [multiproc_executor.py:527] File "/usr/local/lib/python3.10/dist-packages/torch/fx/graph_module.py", line 393, in __call__
(VllmWorker rank=1 pid=1162710) ERROR 06-27 13:31:36 [multiproc_executor.py:527] return super(self.cls, obj).__call__(*args, **kwargs) # type: ignore[misc]
(VllmWorker rank=1 pid=1162710) ERROR 06-27 13:31:36 [multiproc_executor.py:527] File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
(VllmWorker rank=1 pid=1162710) ERROR 06-27 13:31:36 [multiproc_executor.py:527] return self._call_impl(*args, **kwargs)
(VllmWorker rank=1 pid=1162710) ERROR 06-27 13:31:36 [multiproc_executor.py:527] File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1762, in _call_impl
(VllmWorker rank=1 pid=1162710) ERROR 06-27 13:31:36 [multiproc_executor.py:527] return forward_call(*args, **kwargs)
(VllmWorker rank=1 pid=1162710) ERROR 06-27 13:31:36 [multiproc_executor.py:527] File "<eval_with_key>.66", line 380, in forward
(VllmWorker rank=1 pid=1162710) ERROR 06-27 13:31:36 [multiproc_executor.py:527] submod_4 = self.submod_4(getitem_12, s0, l_self_modules_layers_modules_33_modules_self_attn_modules_o_proj_parameters_weight_, l_self_modules_layers_modules_33_modules_self_attn_modules_o_proj_parameters_weight_scale_, getitem_13, l_self_modules_layers_modules_33_modules_post_attention_layernorm_parameters_weight_, l_self_modules_layers_modules_33_modules_mlp_modules_gate_up_proj_parameters_weight_, l_self_modules_layers_modules_33_modules_mlp_modules_gate_up_proj_parameters_weight_scale_, l_self_modules_layers_modules_33_modules_mlp_modules_down_proj_parameters_weight_, l_self_modules_layers_modules_33_modules_mlp_modules_down_proj_parameters_weight_scale_, l_self_modules_layers_modules_34_modules_input_layernorm_parameters_weight_, l_self_modules_layers_modules_34_modules_self_attn_modules_qkv_proj_parameters_weight_, l_self_modules_layers_modules_34_modules_self_attn_modules_qkv_proj_parameters_weight_scale_, l_self_modules_layers_modules_34_modules_self_attn_modules_qkv_proj_parameters_bias_, l_self_modules_layers_modules_32_modules_self_attn_modules_rotary_emb_buffers_cos_sin_cache_, l_positions_, s3, getitem_5, getitem_6, getitem_7, getitem_8); getitem_12 = l_self_modules_layers_modules_33_modules_self_attn_modules_o_proj_parameters_weight_ = l_self_modules_layers_modules_33_modules_self_attn_modules_o_proj_parameters_weight_scale_ = getitem_13 = l_self_modules_layers_modules_33_modules_post_attention_layernorm_parameters_weight_ = l_self_modules_layers_modules_33_modules_mlp_modules_gate_up_proj_parameters_weight_ = l_self_modules_layers_modules_33_modules_mlp_modules_gate_up_proj_parameters_weight_scale_ = l_self_modules_layers_modules_33_modules_mlp_modules_down_proj_parameters_weight_ = l_self_modules_layers_modules_33_modules_mlp_modules_down_proj_parameters_weight_scale_ = l_self_modules_layers_modules_34_modules_input_layernorm_parameters_weight_ = l_self_modules_layers_modules_34_modules_self_attn_modules_qkv_proj_parameters_weight_ = l_self_modules_layers_modules_34_modules_self_attn_modules_qkv_proj_parameters_weight_scale_ = l_self_modules_layers_modules_34_modules_self_attn_modules_qkv_proj_parameters_bias_ = None
(VllmWorker rank=1 pid=1162710) ERROR 06-27 13:31:36 [multiproc_executor.py:527] File "/usr/local/lib/python3.10/dist-packages/vllm/compilation/cuda_piecewise_backend.py", line 116, in __call__
(VllmWorker rank=1 pid=1162710) ERROR 06-27 13:31:36 [multiproc_executor.py:527] return self.compiled_graph_for_general_shape(*args)
(VllmWorker rank=1 pid=1162710) ERROR 06-27 13:31:36 [multiproc_executor.py:527] File "/usr/local/lib/python3.10/dist-packages/vllm/compilation/compiler_interface.py", line 490, in compiled_graph
(VllmWorker rank=1 pid=1162710) ERROR 06-27 13:31:36 [multiproc_executor.py:527] graph_output = inductor_compiled_graph(list_args)
(VllmWorker rank=1 pid=1162710) ERROR 06-27 13:31:36 [multiproc_executor.py:527] File "/usr/local/lib/python3.10/dist-packages/torch/_inductor/output_code.py", line 460, in __call__
(VllmWorker rank=1 pid=1162710) ERROR 06-27 13:31:36 [multiproc_executor.py:527] return self.current_callable(inputs)
(VllmWorker rank=1 pid=1162710) ERROR 06-27 13:31:36 [multiproc_executor.py:527] File "/root/.cache/vllm/torch_compile_cache/ec35e2abcd/rank_1_0/inductor_cache/7e/c7ehmh5l45kfn5w4qzezhvnjvfg3w2at4gxqj2b4ouaqamgv6wtd.py", line 688, in call
(VllmWorker rank=1 pid=1162710) ERROR 06-27 13:31:36 [multiproc_executor.py:527] buf20 = empty_strided_cuda((s0, 27648), (27648, 1), torch.bfloat16)
(VllmWorker rank=1 pid=1162710) ERROR 06-27 13:31:36 [multiproc_executor.py:527] torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 108.00 MiB. GPU 1 has a total capacity of 44.53 GiB of which 91.94 MiB is free. Process 525525 has 44.43 GiB memory in use. Of the allocated memory 42.24 GiB is allocated by PyTorch, with 158.00 MiB allocated in private pools (e.g., CUDA Graphs), and 1.14 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
ERROR 06-27 13:31:36 [core.py:517] EngineCore encountered a fatal error.
ERROR 06-27 13:31:36 [core.py:517] Traceback (most recent call last):
ERROR 06-27 13:31:36 [core.py:517] File "/usr/local/lib/python3.10/dist-packages/vllm/v1/engine/core.py", line 508, in run_engine_core
ERROR 06-27 13:31:36 [core.py:517] engine_core.run_busy_loop()
ERROR 06-27 13:31:36 [core.py:517] File "/usr/local/lib/python3.10/dist-packages/vllm/v1/engine/core.py", line 535, in run_busy_loop
ERROR 06-27 13:31:36 [core.py:517] self._process_engine_step()
ERROR 06-27 13:31:36 [core.py:517] File "/usr/local/lib/python3.10/dist-packages/vllm/v1/engine/core.py", line 560, in _process_engine_step
ERROR 06-27 13:31:36 [core.py:517] outputs, model_executed = self.step_fn()
ERROR 06-27 13:31:36 [core.py:517] File "/usr/local/lib/python3.10/dist-packages/vllm/v1/engine/core.py", line 279, in step_with_batch_queue
ERROR 06-27 13:31:36 [core.py:517] model_output = future.result()
ERROR 06-27 13:31:36 [core.py:517] File "/usr/lib/python3.10/concurrent/futures/_base.py", line 458, in result
ERROR 06-27 13:31:36 [core.py:517] return self.__get_result()
ERROR 06-27 13:31:36 [core.py:517] File "/usr/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
ERROR 06-27 13:31:36 [core.py:517] raise self._exception
ERROR 06-27 13:31:36 [core.py:517] File "/usr/lib/python3.10/concurrent/futures/thread.py", line 58, in run
ERROR 06-27 13:31:36 [core.py:517] result = self.fn(*self.args, **self.kwargs)
ERROR 06-27 13:31:36 [core.py:517] File "/usr/local/lib/python3.10/dist-packages/vllm/v1/executor/multiproc_executor.py", line 207, in get_response
ERROR 06-27 13:31:36 [core.py:517] raise RuntimeError(
ERROR 06-27 13:31:36 [core.py:517] RuntimeError: Worker failed with error 'CUDA out of memory. Tried to allocate 108.00 MiB. GPU 1 has a total capacity of 44.53 GiB of which 91.94 MiB is free. Process 525525 has 44.43 GiB memory in use. Of the allocated memory 42.24 GiB is allocated by PyTorch, with 158.00 MiB allocated in private pools (e.g., CUDA Graphs), and 1.14 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)', please check the stack trace above for the root cause
ERROR 06-27 13:31:36 [async_llm.py:420] AsyncLLM output_handler failed.
ERROR 06-27 13:31:36 [async_llm.py:420] Traceback (most recent call last):
ERROR 06-27 13:31:36 [async_llm.py:420] File "/usr/local/lib/python3.10/dist-packages/vllm/v1/engine/async_llm.py", line 379, in output_handler
ERROR 06-27 13:31:36 [async_llm.py:420] outputs = await engine_core.get_output_async()
ERROR 06-27 13:31:36 [async_llm.py:420] File "/usr/local/lib/python3.10/dist-packages/vllm/v1/engine/core_client.py", line 790, in get_output_async
ERROR 06-27 13:31:36 [async_llm.py:420] raise self._format_exception(outputs) from None
ERROR 06-27 13:31:36 [async_llm.py:420] vllm.v1.engine.exceptions.EngineDeadError: EngineCore encountered an issue. See stack trace (above) for the root cause.
INFO: fdbd:dc61:7:215:aa7b:dffb:f600:e0:59224 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
INFO: fdbd:dc61:7:215:aa7b:dffb:f600:e0:37656 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
INFO: Shutting down
INFO: Waiting for application shutdown.
INFO: Application shutdown complete.
INFO: Finished server process [1160091]
The error occurs frequently when there are only a few requests, but rarely when QPS is high.
🐛 Describe the bug
I am using vLLM == 0.9.1 with the GPU memory fraction set to 0.9, which I don't think is a high value.
The model is Qwen2.5-VL-32B running on two L20 GPUs with pipeline parallelism (pp=2) and FP8 quantization.
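
For reference, below is a minimal sketch of the engine configuration described above, expressed via vLLM's EngineArgs. The model id and exact flag spelling are assumptions reconstructed from this report; the actual deployment runs the OpenAI-compatible server with equivalent settings.

```python
# Sketch only: reconstructs the reported setup (vLLM 0.9.1, Qwen2.5-VL-32B,
# pp=2 on two L20 GPUs, FP8 quantization, GPU memory fraction 0.9).
from vllm.engine.arg_utils import EngineArgs

engine_args = EngineArgs(
    model="Qwen/Qwen2.5-VL-32B-Instruct",  # assumed HF model id
    quantization="fp8",                    # FP8 quantization as stated above
    pipeline_parallel_size=2,              # pp=2 across the two L20s
    gpu_memory_utilization=0.9,            # the "memory fraction" mentioned above
)
print(engine_args)
```

The OOM message in the log itself suggests setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to reduce allocator fragmentation. A hedged sketch of applying that hint (it must be set before torch/CUDA initializes, i.e. before the server starts, and it is not a confirmed fix for this issue):

```python
import os

# Must be in the environment before torch allocates any CUDA memory.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")
```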