Your current environment
The output of `python collect_env.py`
GCC version: (GCC) 9.2.1 20200522
Clang version: Could not collect
CMake version: version 2.8.12.2
Libc version: glibc-2.30
Python version: 3.10.16 (main, Dec 11 2024, 16:24:50) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-5.10.134-008.15.1.kangaroo.al8.x86_64-x86_64-with-glibc2.30
Is CUDA available: True
CUDA runtime version: 12.1.105
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: NVIDIA L20
GPU 1: NVIDIA L20
Nvidia driver version: 535.161.08
cuDNN version: Probably one of the following:
/usr/lib64/libcudnn.so.8.9.0
/usr/lib64/libcudnn_adv_infer.so.8.9.0
/usr/lib64/libcudnn_adv_train.so.8.9.0
/usr/lib64/libcudnn_cnn_infer.so.8.9.0
/usr/lib64/libcudnn_cnn_train.so.8.9.0
/usr/lib64/libcudnn_ops_infer.so.8.9.0
/usr/lib64/libcudnn_ops_train.so.8.9.0
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 32
On-line CPU(s) list: 0-31
Thread(s) per core: 1
Core(s) per socket: 32
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 143
Model name: Intel(R) Xeon(R) Processor
Stepping: 8
CPU MHz: 3300.000
BogoMIPS: 6600.00
Hypervisor vendor: KVM
Virtualization type: full
L1d cache: 48K
L1i cache: 32K
L2 cache: 2048K
L3 cache: 61440K
NUMA node0 CPU(s): 0-31
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves avx_vnni avx512_bf16 wbnoinvd avx512vbmi umip pku waitpkg avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq rdpid cldemote movdiri movdir64b fsrm md_clear serialize tsxldtrk avx512_fp16 arch_capabilities
Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] nvidia-cublas-cu12==12.4.5.8
[pip3] nvidia-cuda-cupti-cu12==12.4.127
[pip3] nvidia-cuda-nvrtc-cu12==12.4.127
[pip3] nvidia-cuda-runtime-cu12==12.4.127
[pip3] nvidia-cudnn-cu12==9.1.0.70
[pip3] nvidia-cufft-cu12==11.2.1.3
[pip3] nvidia-curand-cu12==10.3.5.147
[pip3] nvidia-cusolver-cu12==11.6.1.9
[pip3] nvidia-cusparse-cu12==12.3.1.170
[pip3] nvidia-cusparselt-cu12==0.6.2
[pip3] nvidia-ml-py==12.570.86
[pip3] nvidia-nccl-cu12==2.21.5
[pip3] nvidia-nvjitlink-cu12==12.4.127
[pip3] nvidia-nvtx-cu12==12.4.127
[pip3] pynvml==12.0.0
[pip3] pyzmq==26.4.0
[pip3] torch==2.6.0
[pip3] torchaudio==2.6.0
[pip3] torchvision==0.21.0
[pip3] transformers==4.51.3
[pip3] triton==3.2.0
[conda] numpy 1.26.4 pypi_0 pypi
[conda] nvidia-cublas-cu12 12.4.5.8 pypi_0 pypi
[conda] nvidia-cuda-cupti-cu12 12.4.127 pypi_0 pypi
[conda] nvidia-cuda-nvrtc-cu12 12.4.127 pypi_0 pypi
[conda] nvidia-cuda-runtime-cu12 12.4.127 pypi_0 pypi
[conda] nvidia-cudnn-cu12 9.1.0.70 pypi_0 pypi
[conda] nvidia-cufft-cu12 11.2.1.3 pypi_0 pypi
[conda] nvidia-curand-cu12 10.3.5.147 pypi_0 pypi
[conda] nvidia-cusolver-cu12 11.6.1.9 pypi_0 pypi
[conda] nvidia-cusparse-cu12 12.3.1.170 pypi_0 pypi
[conda] nvidia-cusparselt-cu12 0.6.2 pypi_0 pypi
[conda] nvidia-ml-py 12.570.86 pypi_0 pypi
[conda] nvidia-nccl-cu12 2.21.5 pypi_0 pypi
[conda] nvidia-nvjitlink-cu12 12.4.127 pypi_0 pypi
[conda] nvidia-nvtx-cu12 12.4.127 pypi_0 pypi
[conda] pynvml 12.0.0 pypi_0 pypi
[conda] pyzmq 26.4.0 pypi_0 pypi
[conda] torch 2.6.0 pypi_0 pypi
[conda] torchaudio 2.6.0 pypi_0 pypi
[conda] torchvision 0.21.0 pypi_0 pypi
[conda] transformers 4.51.3 pypi_0 pypi
[conda] triton 3.2.0 pypi_0 pypi
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.8.5
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
GPU0 GPU1 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X PHB 0-31 0 N/A
GPU1 PHB X 0-31 0 N/A
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
NVIDIA_REQUIRE_CUDA=cuda>=12.1 brand=tesla,driver>=470,driver<471 brand=unknown,driver>=470,driver<471 brand=nvidia,driver>=470,driver<471 brand=nvidiartx,driver>=470,driver<471 brand=geforce,driver>=470,driver<471 brand=geforcertx,driver>=470,driver<471 brand=quadro,driver>=470,driver<471 brand=quadrortx,driver>=470,driver<471 brand=titan,driver>=470,driver<471 brand=titanrtx,driver>=470,driver<471 brand=tesla,driver>=525,driver<526 brand=unknown,driver>=525,driver<526 brand=nvidia,driver>=525,driver<526 brand=nvidiartx,driver>=525,driver<526 brand=geforce,driver>=525,driver<526 brand=geforcertx,driver>=525,driver<526 brand=quadro,driver>=525,driver<526 brand=quadrortx,driver>=525,driver<526 brand=titan,driver>=525,driver<526 brand=titanrtx,driver>=525,driver<526
NCCL_SOCKET_IFNAME=eth0
NVIDIA_VISIBLE_DEVICES=all
LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64
NVIDIA_DRIVER_CAPABILITIES=compute,utility
CUDA_VERSION=12.1.1
NVIDIA_PRODUCT_NAME=CUDA
NCCL_VERSION=2.17.1
NCCL_DEBUG=TRACE
NCCL_CUMEM_ENABLE=0
PYTORCH_NVML_BASED_CUDA_CHECK=1
TORCHINDUCTOR_COMPILE_THREADS=1
CUDA_MODULE_LOADING=LAZY
🐛 Describe the bug
vllm serve Qwen3-32B --tensor-parallel-size 2
Compilation then takes a long time and eventually fails with a CUDA out-of-memory error during CUDA graph capture.
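From the trace below, the OOM happens inside `self.model_runner.capture_model()` while the workers capture CUDA graphs, after the KV cache has already been allocated. As a rough sketch (flag values are illustrative, assuming the standard vLLM CLI options), these launch variants could help confirm that the failure is specific to graph capture rather than weight/KV-cache allocation:

# Original reproduction
vllm serve Qwen3-32B --tensor-parallel-size 2

# Skip CUDA graph capture entirely (eager mode) to check whether the OOM is capture-related
vllm serve Qwen3-32B --tensor-parallel-size 2 --enforce-eager

# Reserve less memory for weights + KV cache, leaving more headroom for graph capture
vllm serve Qwen3-32B --tensor-parallel-size 2 --gpu-memory-utilization 0.85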
ray-test-create-ai-node2-rqy511-head-7qd62:12171:12984 [1] NCCL INFO ncclCommInitRank comm 0x2aa6cd00 rank 1 nranks 2 cudaDev 1 nvmlDev 1 busId 20 commId 0xb0f2a75f6fb6730b - Init COMPLETE
INFO 04-30 16:20:37 [kv_cache_utils.py:634] GPU KV cache size: 67,856 tokens
INFO 04-30 16:20:37 [kv_cache_utils.py:637] Maximum concurrency for 16,384 tokens per request: 4.14x
INFO 04-30 16:20:37 [kv_cache_utils.py:634] GPU KV cache size: 67,856 tokens
INFO 04-30 16:20:37 [kv_cache_utils.py:637] Maximum concurrency for 16,384 tokens per request: 4.14x
(VllmWorker rank=1 pid=12171) ERROR 04-30 16:21:27 [multiproc_executor.py:380] WorkerProc hit an exception: %s
(VllmWorker rank=1 pid=12171) ERROR 04-30 16:21:27 [multiproc_executor.py:380] Traceback (most recent call last):
(VllmWorker rank=1 pid=12171) ERROR 04-30 16:21:27 [multiproc_executor.py:380] File "/home/ray/anaconda3/lib/python3.10/site-packages/vllm/v1/executor/multiproc_executor.py", line 375, in worker_busy_loop
(VllmWorker rank=1 pid=12171) ERROR 04-30 16:21:27 [multiproc_executor.py:380] output = func(*args, **kwargs)
(VllmWorker rank=1 pid=12171) ERROR 04-30 16:21:27 [multiproc_executor.py:380] File "/home/ray/anaconda3/lib/python3.10/site-packages/vllm/v1/worker/gpu_worker.py", line 216, in compile_or_warm_up_model
(VllmWorker rank=1 pid=12171) ERROR 04-30 16:21:27 [multiproc_executor.py:380] self.model_runner.capture_model()
(VllmWorker rank=1 pid=12171) ERROR 04-30 16:21:27 [multiproc_executor.py:380] File "/home/ray/anaconda3/lib/python3.10/site-packages/vllm/v1/worker/gpu_model_runner.py", line 1619, in capture_model
(VllmWorker rank=1 pid=12171) ERROR 04-30 16:21:27 [multiproc_executor.py:380] self._dummy_run(num_tokens)
(VllmWorker rank=1 pid=12171) ERROR 04-30 16:21:27 [multiproc_executor.py:380] File "/home/ray/anaconda3/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(VllmWorker rank=1 pid=12171) ERROR 04-30 16:21:27 [multiproc_executor.py:380] return func(*args, **kwargs)
(VllmWorker rank=1 pid=12171) ERROR 04-30 16:21:27 [multiproc_executor.py:380] File "/home/ray/anaconda3/lib/python3.10/site-packages/vllm/v1/worker/gpu_model_runner.py", line 1441, in _dummy_run
(VllmWorker rank=1 pid=12171) ERROR 04-30 16:21:27 [multiproc_executor.py:380] hidden_states = model(
(VllmWorker rank=1 pid=12171) ERROR 04-30 16:21:27 [multiproc_executor.py:380] File "/home/ray/anaconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
(VllmWorker rank=1 pid=12171) ERROR 04-30 16:21:27 [multiproc_executor.py:380] return self._call_impl(*args, **kwargs)
(VllmWorker rank=1 pid=12171) ERROR 04-30 16:21:27 [multiproc_executor.py:380] File "/home/ray/anaconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
(VllmWorker rank=1 pid=12171) ERROR 04-30 16:21:27 [multiproc_executor.py:380] return forward_call(*args, **kwargs)
(VllmWorker rank=1 pid=12171) ERROR 04-30 16:21:27 [multiproc_executor.py:380] File "/home/ray/anaconda3/lib/python3.10/site-packages/vllm/model_executor/models/qwen3.py", line 301, in forward
(VllmWorker rank=1 pid=12171) ERROR 04-30 16:21:27 [multiproc_executor.py:380] hidden_states = self.model(input_ids, positions, intermediate_tensors,
(VllmWorker rank=1 pid=12171) ERROR 04-30 16:21:27 [multiproc_executor.py:380] File "/home/ray/anaconda3/lib/python3.10/site-packages/vllm/compilation/decorators.py", line 245, in __call__
(VllmWorker rank=1 pid=12171) ERROR 04-30 16:21:27 [multiproc_executor.py:380] model_output = self.forward(*args, **kwargs)
(VllmWorker rank=1 pid=12171) ERROR 04-30 16:21:27 [multiproc_executor.py:380] File "/home/ray/anaconda3/lib/python3.10/site-packages/vllm/model_executor/models/qwen2.py", line 326, in forward
(VllmWorker rank=1 pid=12171) ERROR 04-30 16:21:27 [multiproc_executor.py:380] def forward(
(VllmWorker rank=1 pid=12171) ERROR 04-30 16:21:27 [multiproc_executor.py:380] File "/home/ray/anaconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
(VllmWorker rank=1 pid=12171) ERROR 04-30 16:21:27 [multiproc_executor.py:380] return self._call_impl(*args, **kwargs)
(VllmWorker rank=1 pid=12171) ERROR 04-30 16:21:27 [multiproc_executor.py:380] File "/home/ray/anaconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
(VllmWorker rank=1 pid=12171) ERROR 04-30 16:21:27 [multiproc_executor.py:380] return forward_call(*args, **kwargs)
(VllmWorker rank=1 pid=12171) ERROR 04-30 16:21:27 [multiproc_executor.py:380] File "/home/ray/anaconda3/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 745, in _fn
(VllmWorker rank=1 pid=12171) ERROR 04-30 16:21:27 [multiproc_executor.py:380] return fn(*args, **kwargs)
(VllmWorker rank=1 pid=12171) ERROR 04-30 16:21:27 [multiproc_executor.py:380] File "/home/ray/anaconda3/lib/python3.10/site-packages/torch/fx/graph_module.py", line 822, in call_wrapped
(VllmWorker rank=1 pid=12171) ERROR 04-30 16:21:27 [multiproc_executor.py:380] return self._wrapped_call(self, *args, **kwargs)
(VllmWorker rank=1 pid=12171) ERROR 04-30 16:21:27 [multiproc_executor.py:380] File "/home/ray/anaconda3/lib/python3.10/site-packages/torch/fx/graph_module.py", line 400, in __call__
(VllmWorker rank=1 pid=12171) ERROR 04-30 16:21:27 [multiproc_executor.py:380] raise e
(VllmWorker rank=1 pid=12171) ERROR 04-30 16:21:27 [multiproc_executor.py:380] File "/home/ray/anaconda3/lib/python3.10/site-packages/torch/fx/graph_module.py", line 387, in __call__
(VllmWorker rank=1 pid=12171) ERROR 04-30 16:21:27 [multiproc_executor.py:380] return super(self.cls, obj).__call__(*args, **kwargs) # type: ignore[misc]
(VllmWorker rank=1 pid=12171) ERROR 04-30 16:21:27 [multiproc_executor.py:380] File "/home/ray/anaconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
(VllmWorker rank=1 pid=12171) ERROR 04-30 16:21:27 [multiproc_executor.py:380] return self._call_impl(*args, **kwargs)
(VllmWorker rank=1 pid=12171) ERROR 04-30 16:21:27 [multiproc_executor.py:380] File "/home/ray/anaconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
(VllmWorker rank=1 pid=12171) ERROR 04-30 16:21:27 [multiproc_executor.py:380] return forward_call(*args, **kwargs)
(VllmWorker rank=1 pid=12171) ERROR 04-30 16:21:27 [multiproc_executor.py:380] File "<eval_with_key>.130", line 662, in forward
(VllmWorker rank=1 pid=12171) ERROR 04-30 16:21:27 [multiproc_executor.py:380] submod_40 = self.submod_40(getitem_98, s0, l_self_modules_layers_modules_19_modules_self_attn_modules_o_proj_parameters_weight_, getitem_99, l_self_modules_layers_modules_19_modules_post_attention_layernorm_parameters_weight_, l_self_modules_layers_modules_19_modules_mlp_modules_gate_up_proj_parameters_weight_, l_self_modules_layers_modules_19_modules_mlp_modules_down_proj_parameters_weight_, l_self_modules_layers_modules_20_modules_input_layernorm_parameters_weight_, l_self_modules_layers_modules_20_modules_self_attn_modules_qkv_proj_parameters_weight_, l_self_modules_layers_modules_20_modules_self_attn_modules_q_norm_parameters_weight_, l_self_modules_layers_modules_20_modules_self_attn_modules_k_norm_parameters_weight_, l_positions_, l_self_modules_layers_modules_0_modules_self_attn_modules_rotary_emb_buffers_cos_sin_cache_); getitem_98 = l_self_modules_layers_modules_19_modules_self_attn_modules_o_proj_parameters_weight_ = getitem_99 = l_self_modules_layers_modules_19_modules_post_attention_layernorm_parameters_weight_ = l_self_modules_layers_modules_19_modules_mlp_modules_gate_up_proj_parameters_weight_ = l_self_modules_layers_modules_19_modules_mlp_modules_down_proj_parameters_weight_ = l_self_modules_layers_modules_20_modules_input_layernorm_parameters_weight_ = l_self_modules_layers_modules_20_modules_self_attn_modules_qkv_proj_parameters_weight_ = l_self_modules_layers_modules_20_modules_self_attn_modules_q_norm_parameters_weight_ = l_self_modules_layers_modules_20_modules_self_attn_modules_k_norm_parameters_weight_ = None
(VllmWorker rank=1 pid=12171) ERROR 04-30 16:21:27 [multiproc_executor.py:380] File "/home/ray/anaconda3/lib/python3.10/site-packages/vllm/compilation/backends.py", line 677, in __call__
(VllmWorker rank=1 pid=12171) ERROR 04-30 16:21:27 [multiproc_executor.py:380] with torch.cuda.graph(cudagraph, pool=self.graph_pool):
(VllmWorker rank=1 pid=12171) ERROR 04-30 16:21:27 [multiproc_executor.py:380] File "/home/ray/anaconda3/lib/python3.10/site-packages/torch/cuda/graphs.py", line 186, in __exit__
(VllmWorker rank=1 pid=12171) ERROR 04-30 16:21:27 [multiproc_executor.py:380] self.cuda_graph.capture_end()
(VllmWorker rank=1 pid=12171) ERROR 04-30 16:21:27 [multiproc_executor.py:380] File "/home/ray/anaconda3/lib/python3.10/site-packages/torch/cuda/graphs.py", line 84, in capture_end
(VllmWorker rank=1 pid=12171) ERROR 04-30 16:21:27 [multiproc_executor.py:380] super().capture_end()
(VllmWorker rank=1 pid=12171) ERROR 04-30 16:21:27 [multiproc_executor.py:380] RuntimeError: CUDA error: out of memory
(VllmWorker rank=1 pid=12171) ERROR 04-30 16:21:27 [multiproc_executor.py:380] CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
(VllmWorker rank=1 pid=12171) ERROR 04-30 16:21:27 [multiproc_executor.py:380] For debugging consider passing CUDA_LAUNCH_BLOCKING=1
(VllmWorker rank=1 pid=12171) ERROR 04-30 16:21:27 [multiproc_executor.py:380] Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
(VllmWorker rank=1 pid=12171) ERROR 04-30 16:21:27 [multiproc_executor.py:380]
(VllmWorker rank=0 pid=12144) ERROR 04-30 16:21:27 [multiproc_executor.py:380] WorkerProc hit an exception: %s
(VllmWorker rank=0 pid=12144) ERROR 04-30 16:21:27 [multiproc_executor.py:380] Traceback (most recent call last):
(VllmWorker rank=0 pid=12144) ERROR 04-30 16:21:27 [multiproc_executor.py:380] File "/home/ray/anaconda3/lib/python3.10/site-packages/vllm/v1/executor/multiproc_executor.py", line 375, in worker_busy_loop
(VllmWorker rank=0 pid=12144) ERROR 04-30 16:21:27 [multiproc_executor.py:380] output = func(*args, **kwargs)
(VllmWorker rank=0 pid=12144) ERROR 04-30 16:21:27 [multiproc_executor.py:380] File "/home/ray/anaconda3/lib/python3.10/site-packages/vllm/v1/worker/gpu_worker.py", line 216, in compile_or_warm_up_model
(VllmWorker rank=0 pid=12144) ERROR 04-30 16:21:27 [multiproc_executor.py:380] self.model_runner.capture_model()
(VllmWorker rank=0 pid=12144) ERROR 04-30 16:21:27 [multiproc_executor.py:380] File "/home/ray/anaconda3/lib/python3.10/site-packages/vllm/v1/worker/gpu_model_runner.py", line 1619, in capture_model
(VllmWorker rank=0 pid=12144) ERROR 04-30 16:21:27 [multiproc_executor.py:380] self._dummy_run(num_tokens)
(VllmWorker rank=0 pid=12144) ERROR 04-30 16:21:27 [multiproc_executor.py:380] File "/home/ray/anaconda3/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(VllmWorker rank=0 pid=12144) ERROR 04-30 16:21:27 [multiproc_executor.py:380] return func(*args, **kwargs)
(VllmWorker rank=0 pid=12144) ERROR 04-30 16:21:27 [multiproc_executor.py:380] File "/home/ray/anaconda3/lib/python3.10/site-packages/vllm/v1/worker/gpu_model_runner.py", line 1441, in _dummy_run
(VllmWorker rank=0 pid=12144) ERROR 04-30 16:21:27 [multiproc_executor.py:380] hidden_states = model(
(VllmWorker rank=0 pid=12144) ERROR 04-30 16:21:27 [multiproc_executor.py:380] File "/home/ray/anaconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
(VllmWorker rank=0 pid=12144) ERROR 04-30 16:21:27 [multiproc_executor.py:380] return self._call_impl(*args, **kwargs)
(VllmWorker rank=0 pid=12144) ERROR 04-30 16:21:27 [multiproc_executor.py:380] File "/home/ray/anaconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
(VllmWorker rank=0 pid=12144) ERROR 04-30 16:21:27 [multiproc_executor.py:380] return forward_call(*args, **kwargs)
(VllmWorker rank=0 pid=12144) ERROR 04-30 16:21:27 [multiproc_executor.py:380] File "/home/ray/anaconda3/lib/python3.10/site-packages/vllm/model_executor/models/qwen3.py", line 301, in forward
(VllmWorker rank=0 pid=12144) ERROR 04-30 16:21:27 [multiproc_executor.py:380] hidden_states = self.model(input_ids, positions, intermediate_tensors,
(VllmWorker rank=0 pid=12144) ERROR 04-30 16:21:27 [multiproc_executor.py:380] File "/home/ray/anaconda3/lib/python3.10/site-packages/vllm/compilation/decorators.py", line 245, in __call__
(VllmWorker rank=0 pid=12144) ERROR 04-30 16:21:27 [multiproc_executor.py:380] model_output = self.forward(*args, **kwargs)
(VllmWorker rank=0 pid=12144) ERROR 04-30 16:21:27 [multiproc_executor.py:380] File "/home/ray/anaconda3/lib/python3.10/site-packages/vllm/model_executor/models/qwen2.py", line 326, in forward
(VllmWorker rank=0 pid=12144) ERROR 04-30 16:21:27 [multiproc_executor.py:380] def forward(
(VllmWorker rank=0 pid=12144) ERROR 04-30 16:21:27 [multiproc_executor.py:380] File "/home/ray/anaconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
(VllmWorker rank=0 pid=12144) ERROR 04-30 16:21:27 [multiproc_executor.py:380] return self._call_impl(*args, **kwargs)
(VllmWorker rank=0 pid=12144) ERROR 04-30 16:21:27 [multiproc_executor.py:380] File "/home/ray/anaconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
(VllmWorker rank=0 pid=12144) ERROR 04-30 16:21:27 [multiproc_executor.py:380] return forward_call(*args, **kwargs)
(VllmWorker rank=0 pid=12144) ERROR 04-30 16:21:27 [multiproc_executor.py:380] File "/home/ray/anaconda3/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 745, in _fn
(VllmWorker rank=0 pid=12144) ERROR 04-30 16:21:27 [multiproc_executor.py:380] return fn(*args, **kwargs)
(VllmWorker rank=0 pid=12144) ERROR 04-30 16:21:27 [multiproc_executor.py:380] File "/home/ray/anaconda3/lib/python3.10/site-packages/torch/fx/graph_module.py", line 822, in call_wrapped
(VllmWorker rank=0 pid=12144) ERROR 04-30 16:21:27 [multiproc_executor.py:380] return self._wrapped_call(self, *args, **kwargs)
(VllmWorker rank=0 pid=12144) ERROR 04-30 16:21:27 [multiproc_executor.py:380] File "/home/ray/anaconda3/lib/python3.10/site-packages/torch/fx/graph_module.py", line 400, in __call__
(VllmWorker rank=0 pid=12144) ERROR 04-30 16:21:27 [multiproc_executor.py:380] raise e
(VllmWorker rank=0 pid=12144) ERROR 04-30 16:21:27 [multiproc_executor.py:380] File "/home/ray/anaconda3/lib/python3.10/site-packages/torch/fx/graph_module.py", line 387, in __call__
(VllmWorker rank=0 pid=12144) ERROR 04-30 16:21:27 [multiproc_executor.py:380] return super(self.cls, obj).__call__(*args, **kwargs) # type: ignore[misc]
(VllmWorker rank=0 pid=12144) ERROR 04-30 16:21:27 [multiproc_executor.py:380] File "/home/ray/anaconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
(VllmWorker rank=0 pid=12144) ERROR 04-30 16:21:27 [multiproc_executor.py:380] return self._call_impl(*args, **kwargs)
(VllmWorker rank=0 pid=12144) ERROR 04-30 16:21:27 [multiproc_executor.py:380] File "/home/ray/anaconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
(VllmWorker rank=0 pid=12144) ERROR 04-30 16:21:27 [multiproc_executor.py:380] return forward_call(*args, **kwargs)
(VllmWorker rank=0 pid=12144) ERROR 04-30 16:21:27 [multiproc_executor.py:380] File "<eval_with_key>.130", line 662, in forward
(VllmWorker rank=0 pid=12144) ERROR 04-30 16:21:27 [multiproc_executor.py:380] submod_40 = self.submod_40(getitem_98, s0, l_self_modules_layers_modules_19_modules_self_attn_modules_o_proj_parameters_weight_, getitem_99, l_self_modules_layers_modules_19_modules_post_attention_layernorm_parameters_weight_, l_self_modules_layers_modules_19_modules_mlp_modules_gate_up_proj_parameters_weight_, l_self_modules_layers_modules_19_modules_mlp_modules_down_proj_parameters_weight_, l_self_modules_layers_modules_20_modules_input_layernorm_parameters_weight_, l_self_modules_layers_modules_20_modules_self_attn_modules_qkv_proj_parameters_weight_, l_self_modules_layers_modules_20_modules_self_attn_modules_q_norm_parameters_weight_, l_self_modules_layers_modules_20_modules_self_attn_modules_k_norm_parameters_weight_, l_positions_, l_self_modules_layers_modules_0_modules_self_attn_modules_rotary_emb_buffers_cos_sin_cache_); getitem_98 = l_self_modules_layers_modules_19_modules_self_attn_modules_o_proj_parameters_weight_ = getitem_99 = l_self_modules_layers_modules_19_modules_post_attention_layernorm_parameters_weight_ = l_self_modules_layers_modules_19_modules_mlp_modules_gate_up_proj_parameters_weight_ = l_self_modules_layers_modules_19_modules_mlp_modules_down_proj_parameters_weight_ = l_self_modules_layers_modules_20_modules_input_layernorm_parameters_weight_ = l_self_modules_layers_modules_20_modules_self_attn_modules_qkv_proj_parameters_weight_ = l_self_modules_layers_modules_20_modules_self_attn_modules_q_norm_parameters_weight_ = l_self_modules_layers_modules_20_modules_self_attn_modules_k_norm_parameters_weight_ = None
(VllmWorker rank=0 pid=12144) ERROR 04-30 16:21:27 [multiproc_executor.py:380] File "/home/ray/anaconda3/lib/python3.10/site-packages/vllm/compilation/backends.py", line 677, in __call__
(VllmWorker rank=0 pid=12144) ERROR 04-30 16:21:27 [multiproc_executor.py:380] with torch.cuda.graph(cudagraph, pool=self.graph_pool):
(VllmWorker rank=0 pid=12144) ERROR 04-30 16:21:27 [multiproc_executor.py:380] File "/home/ray/anaconda3/lib/python3.10/site-packages/torch/cuda/graphs.py", line 186, in __exit__
(VllmWorker rank=0 pid=12144) ERROR 04-30 16:21:27 [multiproc_executor.py:380] self.cuda_graph.capture_end()
(VllmWorker rank=0 pid=12144) ERROR 04-30 16:21:27 [multiproc_executor.py:380] File "/home/ray/anaconda3/lib/python3.10/site-packages/torch/cuda/graphs.py", line 84, in capture_end
(VllmWorker rank=0 pid=12144) ERROR 04-30 16:21:27 [multiproc_executor.py:380] super().capture_end()
(VllmWorker rank=0 pid=12144) ERROR 04-30 16:21:27 [multiproc_executor.py:380] RuntimeError: CUDA error: out of memory
(VllmWorker rank=0 pid=12144) ERROR 04-30 16:21:27 [multiproc_executor.py:380] CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
(VllmWorker rank=0 pid=12144) ERROR 04-30 16:21:27 [multiproc_executor.py:380] For debugging consider passing CUDA_LAUNCH_BLOCKING=1
(VllmWorker rank=0 pid=12144) ERROR 04-30 16:21:27 [multiproc_executor.py:380] Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
(VllmWorker rank=0 pid=12144) ERROR 04-30 16:21:27 [multiproc_executor.py:380]
ERROR 04-30 16:21:27 [core.py:387] EngineCore hit an exception: Traceback (most recent call last):
ERROR 04-30 16:21:27 [core.py:387] File "/home/ray/anaconda3/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 378, in run_engine_core
ERROR 04-30 16:21:27 [core.py:387] engine_core = EngineCoreProc(*args, **kwargs)
ERROR 04-30 16:21:27 [core.py:387] File "/home/ray/anaconda3/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 320, in __init__
ERROR 04-30 16:21:27 [core.py:387] super().__init__(vllm_config, executor_class, log_stats)
ERROR 04-30 16:21:27 [core.py:387] File "/home/ray/anaconda3/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 71, in __init__
ERROR 04-30 16:21:27 [core.py:387] self._initialize_kv_caches(vllm_config)
ERROR 04-30 16:21:27 [core.py:387] File "/home/ray/anaconda3/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 160, in _initialize_kv_caches
ERROR 04-30 16:21:27 [core.py:387] self.model_executor.initialize_from_config(kv_cache_configs)
ERROR 04-30 16:21:27 [core.py:387] File "/home/ray/anaconda3/lib/python3.10/site-packages/vllm/v1/executor/abstract.py", line 63, in initialize_from_config
ERROR 04-30 16:21:27 [core.py:387] self.collective_rpc("compile_or_warm_up_model")
ERROR 04-30 16:21:27 [core.py:387] File "/home/ray/anaconda3/lib/python3.10/site-packages/vllm/v1/executor/multiproc_executor.py", line 133, in collective_rpc
ERROR 04-30 16:21:27 [core.py:387] raise e
ERROR 04-30 16:21:27 [core.py:387] File "/home/ray/anaconda3/lib/python3.10/site-packages/vllm/v1/executor/multiproc_executor.py", line 122, in collective_rpc
ERROR 04-30 16:21:27 [core.py:387] raise RuntimeError(
ERROR 04-30 16:21:27 [core.py:387] RuntimeError: ('Worker failed with error %s, please check the stack trace above for the root cause', 'CUDA error: out of memory\nCUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.\nFor debugging consider passing CUDA_LAUNCH_BLOCKING=1\nCompile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.\n')
ERROR 04-30 16:21:27 [core.py:387]
Before submitting a new issue...
- Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.