
[Bug]: vLLM disconnects after running for some time #5084

Closed
@zxcdsa45687

Description


Your current environment

PyTorch version: 2.3.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.2 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
Clang version: Could not collect
CMake version: version 3.29.3
Libc version: glibc-2.31

Python version: 3.8.19 (default, Mar 20 2024, 19:58:24) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-5.8.0-43-generic-x86_64-with-glibc2.17
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: NVIDIA A800 80GB PCIe
GPU 1: NVIDIA A800 80GB PCIe
Versions of relevant libraries:
[pip3] numpy==1.24.4
[pip3] nvidia-nccl-cu12==2.20.5
[pip3] torch==2.3.0
[pip3] triton==2.3.0
[pip3] vllm_nccl_cu12==2.18.1.0.4.0
[conda] numpy 1.24.4 pypi_0 pypi
[conda] nvidia-nccl-cu12 2.20.5 pypi_0 pypi
[conda] torch 2.3.0 pypi_0 pypi
[conda] triton 2.3.0 pypi_0 pypi
[conda] vllm-nccl-cu12 2.18.1.0.4.0 pypi_0 pypi
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.4.2
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled

🐛 Describe the bug

Error executing method start_worker_execution_loop. This might cause deadlock in distributed execution.
(RayWorkerWrapper pid=3957362) ERROR 05-28 16:05:41 worker_base.py:146] Traceback (most recent call last):
(RayWorkerWrapper pid=3957362) ERROR 05-28 16:05:41 worker_base.py:146]   File "vllms/vllm/worker/worker_base.py", line 138, in execute_method
(RayWorkerWrapper pid=3957362) ERROR 05-28 16:05:41 worker_base.py:146]     return executor(*args, **kwargs)
(RayWorkerWrapper pid=3957362) ERROR 05-28 16:05:41 worker_base.py:146]   File "miniconda3/envs/vllmss/lib/python3.8/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
(RayWorkerWrapper pid=3957362) ERROR 05-28 16:05:41 worker_base.py:146]     return func(*args, **kwargs)
(RayWorkerWrapper pid=3957362) ERROR 05-28 16:05:41 worker_base.py:146]   File "vllm/worker/worker.py", line 286, in start_worker_execution_loop
(RayWorkerWrapper pid=3957362) ERROR 05-28 16:05:41 worker_base.py:146]     while self._execute_model_non_driver():
(RayWorkerWrapper pid=3957362) ERROR 05-28 16:05:41 worker_base.py:146]   File "vllm/worker/worker.py", line 295, in _execute_model_non_driver
(RayWorkerWrapper pid=3957362) ERROR 05-28 16:05:41 worker_base.py:146]     data = broadcast_tensor_dict(src=0)
(RayWorkerWrapper pid=3957362) ERROR 05-28 16:05:41 worker_base.py:146]   File "vllm/distributed/communication_op.py", line 284, in broadcast_tensor_dict
(RayWorkerWrapper pid=3957362) ERROR 05-28 16:05:41 worker_base.py:146]     torch.distributed.broadcast_object_list(recv_metadata_list,
(RayWorkerWrapper pid=3957362) ERROR 05-28 16:05:41 worker_base.py:146]   File "miniconda3/envs/vllmss/lib/python3.8/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
(RayWorkerWrapper pid=3957362) ERROR 05-28 16:05:41 worker_base.py:146]     return func(*args, **kwargs)
(RayWorkerWrapper pid=3957362) ERROR 05-28 16:05:41 worker_base.py:146]   File "miniconda3/envs/vllmss/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 2649, in broadcast_object_list
(RayWorkerWrapper pid=3957362) ERROR 05-28 16:05:41 worker_base.py:146]     broadcast(object_sizes_tensor, src=src, group=group)
(RayWorkerWrapper pid=3957362) ERROR 05-28 16:05:41 worker_base.py:146]   File "miniconda3/envs/vllmss/lib/python3.8/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
(RayWorkerWrapper pid=3957362) ERROR 05-28 16:05:41 worker_base.py:146]     return func(*args, **kwargs)
(RayWorkerWrapper pid=3957362) ERROR 05-28 16:05:41 worker_base.py:146]   File "miniconda3/envs/vllmss/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 2144, in broadcast
(RayWorkerWrapper pid=3957362) ERROR 05-28 16:05:41 worker_base.py:146]     work.wait()
(RayWorkerWrapper pid=3957362) ERROR 05-28 16:05:41 worker_base.py:146] RuntimeError: [../third_party/gloo/gloo/transport/tcp/unbound_buffer.cc:81] Timed out waiting 1800000ms for recv operation to complete
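For context on what is timing out: the non-driver Ray worker sits in a blocking `broadcast_object_list` waiting for the driver (rank 0) to send the next batch's metadata, and gloo gives up after the process-group timeout (PyTorch's default is 1800 s, i.e. the 1800000 ms in the log). The sketch below is not vLLM's code; it is a minimal, hypothetical two-process reproduction of the same failure mode, with the timeout shortened to 10 s so the error appears quickly when rank 0 never issues the broadcast.

```python
# Hypothetical reproduction of the gloo recv timeout (not vLLM's actual worker loop).
# Rank 1 blocks in broadcast_object_list() like _execute_model_non_driver();
# rank 0 deliberately never broadcasts, so rank 1 hits the process-group timeout.
import datetime
import time

import torch.distributed as dist
import torch.multiprocessing as mp


def worker(rank: int, world_size: int) -> None:
    dist.init_process_group(
        backend="gloo",
        init_method="tcp://127.0.0.1:29511",  # arbitrary local rendezvous address
        rank=rank,
        world_size=world_size,
        timeout=datetime.timedelta(seconds=10),  # default is 1800 s, matching the 1800000 ms above
    )
    if rank != 0:
        recv = [None]
        # Blocks waiting for rank 0; raises
        # RuntimeError: ... Timed out waiting ... for recv operation to complete
        dist.broadcast_object_list(recv, src=0)
    else:
        time.sleep(30)  # driver "hangs" and never issues the matching broadcast
    dist.destroy_process_group()


if __name__ == "__main__":
    mp.spawn(worker, args=(2,), nprocs=2)
```

In other words, the traceback by itself only says that the worker waited 30 minutes for data from the driver and gave up; the interesting question is why the driver stopped broadcasting (crash, hang, or simply no requests reaching the engine) during that window.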
