Description
Your current environment
PyTorch version: 2.3.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A
OS: Ubuntu 20.04.2 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
Clang version: Could not collect
CMake version: version 3.29.3
Libc version: glibc-2.31
Python version: 3.8.19 (default, Mar 20 2024, 19:58:24) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-5.8.0-43-generic-x86_64-with-glibc2.17
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: NVIDIA A800 80GB PCIe
GPU 1: NVIDIA A800 80GB PCIe
Versions of relevant libraries:
[pip3] numpy==1.24.4
[pip3] nvidia-nccl-cu12==2.20.5
[pip3] torch==2.3.0
[pip3] triton==2.3.0
[pip3] vllm_nccl_cu12==2.18.1.0.4.0
[conda] numpy 1.24.4 pypi_0 pypi
[conda] nvidia-nccl-cu12 2.20.5 pypi_0 pypi
[conda] torch 2.3.0 pypi_0 pypi
[conda] triton 2.3.0 pypi_0 pypi
[conda] vllm-nccl-cu12 2.18.1.0.4.0 pypi_0 pypi
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.4.2
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
🐛 Describe the bug
Serving with vLLM 0.4.2 distributed across the two A800 GPUs (Ray workers), the non-driver worker fails: the Gloo broadcast in start_worker_execution_loop times out after 1800000 ms (30 minutes) waiting to receive metadata from the driver (rank 0). The worker log shows:

Error executing method start_worker_execution_loop. This might cause deadlock in distributed execution.
(RayWorkerWrapper pid=3957362) ERROR 05-28 16:05:41 worker_base.py:146] Traceback (most recent call last):
(RayWorkerWrapper pid=3957362) ERROR 05-28 16:05:41 worker_base.py:146] File "vllms/vllm/worker/worker_base.py", line 138, in execute_method
(RayWorkerWrapper pid=3957362) ERROR 05-28 16:05:41 worker_base.py:146] return executor(*args, **kwargs)
(RayWorkerWrapper pid=3957362) ERROR 05-28 16:05:41 worker_base.py:146] File "miniconda3/envs/vllmss/lib/python3.8/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
(RayWorkerWrapper pid=3957362) ERROR 05-28 16:05:41 worker_base.py:146] return func(*args, **kwargs)
(RayWorkerWrapper pid=3957362) ERROR 05-28 16:05:41 worker_base.py:146] File "vllm/worker/worker.py", line 286, in start_worker_execution_loop
(RayWorkerWrapper pid=3957362) ERROR 05-28 16:05:41 worker_base.py:146] while self._execute_model_non_driver():
(RayWorkerWrapper pid=3957362) ERROR 05-28 16:05:41 worker_base.py:146] File "vllm/worker/worker.py", line 295, in _execute_model_non_driver
(RayWorkerWrapper pid=3957362) ERROR 05-28 16:05:41 worker_base.py:146] data = broadcast_tensor_dict(src=0)
(RayWorkerWrapper pid=3957362) ERROR 05-28 16:05:41 worker_base.py:146] File "vllm/distributed/communication_op.py", line 284, in broadcast_tensor_dict
(RayWorkerWrapper pid=3957362) ERROR 05-28 16:05:41 worker_base.py:146] torch.distributed.broadcast_object_list(recv_metadata_list,
(RayWorkerWrapper pid=3957362) ERROR 05-28 16:05:41 worker_base.py:146] File "miniconda3/envs/vllmss/lib/python3.8/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
(RayWorkerWrapper pid=3957362) ERROR 05-28 16:05:41 worker_base.py:146] return func(*args, **kwargs)
(RayWorkerWrapper pid=3957362) ERROR 05-28 16:05:41 worker_base.py:146] File "miniconda3/envs/vllmss/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 2649, in broadcast_object_list
(RayWorkerWrapper pid=3957362) ERROR 05-28 16:05:41 worker_base.py:146] broadcast(object_sizes_tensor, src=src, group=group)
(RayWorkerWrapper pid=3957362) ERROR 05-28 16:05:41 worker_base.py:146] File "miniconda3/envs/vllmss/lib/python3.8/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
(RayWorkerWrapper pid=3957362) ERROR 05-28 16:05:41 worker_base.py:146] return func(*args, **kwargs)
(RayWorkerWrapper pid=3957362) ERROR 05-28 16:05:41 worker_base.py:146] File "miniconda3/envs/vllmss/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 2144, in broadcast
(RayWorkerWrapper pid=3957362) ERROR 05-28 16:05:41 worker_base.py:146] work.wait()
(RayWorkerWrapper pid=3957362) ERROR 05-28 16:05:41 worker_base.py:146] RuntimeError: [../third_party/gloo/gloo/transport/tcp/unbound_buffer.cc:81] Timed out waiting 1800000ms for recv operation to complete
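
For reference, below is a minimal sketch of the kind of launch that exercises this code path. The model, prompt, and sampling settings are placeholders of mine (the original launch command was not captured); the relevant part is tensor_parallel_size=2 on the two A800s, which with the Ray backend creates the RayWorkerWrapper processes whose execution loop is seen timing out above.

```python
# Hypothetical reproduction sketch -- model, prompt, and sampling settings are
# placeholders, not taken from the original run.
from vllm import LLM, SamplingParams

# Two A800 80GB GPUs -> tensor_parallel_size=2. The non-driver Ray worker then
# sits in start_worker_execution_loop, receiving broadcast metadata from
# rank 0 each step; that broadcast is what times out in the traceback above.
llm = LLM(
    model="facebook/opt-125m",  # placeholder model
    tensor_parallel_size=2,
)

sampling_params = SamplingParams(temperature=0.8, max_tokens=64)
outputs = llm.generate(["Hello, my name is"], sampling_params)
for output in outputs:
    print(output.outputs[0].text)
```

The 1800000 ms in the Gloo error matches the default 30-minute torch.distributed timeout, so the non-driver worker appears to be blocked in broadcast_object_list waiting for a broadcast from rank 0 that never arrives.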