[Bug]: The vllm is disconnected after running for some time #5084

Closed
zxcdsa45687 opened this issue May 28, 2024 · 27 comments
Labels
bug Something isn't working

Comments

@zxcdsa45687

Your current environment

PyTorch version: 2.3.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.2 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
Clang version: Could not collect
CMake version: version 3.29.3
Libc version: glibc-2.31

Python version: 3.8.19 (default, Mar 20 2024, 19:58:24) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-5.8.0-43-generic-x86_64-with-glibc2.17
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: NVIDIA A800 80GB PCIe
GPU 1: NVIDIA A800 80GB PCIe
Versions of relevant libraries:
[pip3] numpy==1.24.4
[pip3] nvidia-nccl-cu12==2.20.5
[pip3] torch==2.3.0
[pip3] triton==2.3.0
[pip3] vllm_nccl_cu12==2.18.1.0.4.0
[conda] numpy 1.24.4 pypi_0 pypi
[conda] nvidia-nccl-cu12 2.20.5 pypi_0 pypi
[conda] torch 2.3.0 pypi_0 pypi
[conda] triton 2.3.0 pypi_0 pypi
[conda] vllm-nccl-cu12 2.18.1.0.4.0 pypi_0 pypi
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.4.2
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled

🐛 Describe the bug

Error executing method start_worker_execution_loop. This might cause deadlock in distributed execution.
(RayWorkerWrapper pid=3957362) ERROR 05-28 16:05:41 worker_base.py:146] Traceback (most recent call last):
(RayWorkerWrapper pid=3957362) ERROR 05-28 16:05:41 worker_base.py:146] File "vllms/vllm/worker/worker_base.py", line 138, in execute_method
(RayWorkerWrapper pid=3957362) ERROR 05-28 16:05:41 worker_base.py:146] return executor(*args, **kwargs)
(RayWorkerWrapper pid=3957362) ERROR 05-28 16:05:41 worker_base.py:146] File "miniconda3/envs/vllmss/lib/python3.8/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
(RayWorkerWrapper pid=3957362) ERROR 05-28 16:05:41 worker_base.py:146] return func(*args, **kwargs)
(RayWorkerWrapper pid=3957362) ERROR 05-28 16:05:41 worker_base.py:146] File "vllm/worker/worker.py", line 286, in start_worker_execution_loop
(RayWorkerWrapper pid=3957362) ERROR 05-28 16:05:41 worker_base.py:146] while self._execute_model_non_driver():
(RayWorkerWrapper pid=3957362) ERROR 05-28 16:05:41 worker_base.py:146] File "vllm/worker/worker.py", line 295, in _execute_model_non_driver
(RayWorkerWrapper pid=3957362) ERROR 05-28 16:05:41 worker_base.py:146] data = broadcast_tensor_dict(src=0)
(RayWorkerWrapper pid=3957362) ERROR 05-28 16:05:41 worker_base.py:146] File "vllm/distributed/communication_op.py", line 284, in broadcast_tensor_dict
(RayWorkerWrapper pid=3957362) ERROR 05-28 16:05:41 worker_base.py:146] torch.distributed.broadcast_object_list(recv_metadata_list,
(RayWorkerWrapper pid=3957362) ERROR 05-28 16:05:41 worker_base.py:146] File "miniconda3/envs/vllmss/lib/python3.8/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
(RayWorkerWrapper pid=3957362) ERROR 05-28 16:05:41 worker_base.py:146] return func(*args, **kwargs)
(RayWorkerWrapper pid=3957362) ERROR 05-28 16:05:41 worker_base.py:146] File "miniconda3/envs/vllmss/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 2649, in broadcast_object_list
(RayWorkerWrapper pid=3957362) ERROR 05-28 16:05:41 worker_base.py:146] broadcast(object_sizes_tensor, src=src, group=group)
(RayWorkerWrapper pid=3957362) ERROR 05-28 16:05:41 worker_base.py:146] File "miniconda3/envs/vllmss/lib/python3.8/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
(RayWorkerWrapper pid=3957362) ERROR 05-28 16:05:41 worker_base.py:146] return func(*args, **kwargs)
(RayWorkerWrapper pid=3957362) ERROR 05-28 16:05:41 worker_base.py:146] File "miniconda3/envs/vllmss/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 2144, in broadcast
(RayWorkerWrapper pid=3957362) ERROR 05-28 16:05:41 worker_base.py:146] work.wait()
(RayWorkerWrapper pid=3957362) ERROR 05-28 16:05:41 worker_base.py:146] RuntimeError: [../third_party/gloo/gloo/transport/tcp/unbound_buffer.cc:81] Timed out waiting 1800000ms for recv operation to complete

@zxcdsa45687 added the bug label May 28, 2024
@thirteenflt

I run into the same issue with vllm==0.4.3, with only one NCCL version installed (nvidia-nccl-cu12==2.20.5).

@754807266

I hit the same problem: Timed out waiting 1800000ms.

(RayWorkerWrapper pid=37007) ERROR 06-12 15:44:15 worker_base.py:148] Error executing method start_worker_execution_loop. This might cause deadlock in distributed execution.
(RayWorkerWrapper pid=37007) ERROR 06-12 15:44:15 worker_base.py:148] Traceback (most recent call last):
(RayWorkerWrapper pid=37007) ERROR 06-12 15:44:15 worker_base.py:148] File "/home/admin/anaconda3/envs/conda-zx-qwen/lib/python3.8/site-packages/vllm/worker/worker_base.py", line 140, in execute_method
(RayWorkerWrapper pid=37007) ERROR 06-12 15:44:15 worker_base.py:148] return executor(*args, **kwargs)
(RayWorkerWrapper pid=37007) ERROR 06-12 15:44:15 worker_base.py:148] File "/home/admin/anaconda3/envs/conda-zx-qwen/lib/python3.8/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
(RayWorkerWrapper pid=37007) ERROR 06-12 15:44:15 worker_base.py:148] return func(*args, **kwargs)
(RayWorkerWrapper pid=37007) ERROR 06-12 15:44:15 worker_base.py:148] File "/home/admin/anaconda3/envs/conda-zx-qwen/lib/python3.8/site-packages/vllm/worker/worker.py", line 286, in start_worker_execution_loop
(RayWorkerWrapper pid=37007) ERROR 06-12 15:44:15 worker_base.py:148] while self._execute_model_non_driver():
(RayWorkerWrapper pid=37007) ERROR 06-12 15:44:15 worker_base.py:148] File "/home/admin/anaconda3/envs/conda-zx-qwen/lib/python3.8/site-packages/vllm/worker/worker.py", line 295, in _execute_model_non_driver
(RayWorkerWrapper pid=37007) ERROR 06-12 15:44:15 worker_base.py:148] data = broadcast_tensor_dict(src=0)
(RayWorkerWrapper pid=37007) ERROR 06-12 15:44:15 worker_base.py:148] File "/home/admin/anaconda3/envs/conda-zx-qwen/lib/python3.8/site-packages/vllm/distributed/communication_op.py", line 284, in broadcast_tensor_dict
(RayWorkerWrapper pid=37007) ERROR 06-12 15:44:15 worker_base.py:148] torch.distributed.broadcast_object_list(recv_metadata_list,
(RayWorkerWrapper pid=37007) ERROR 06-12 15:44:15 worker_base.py:148] File "/home/admin/anaconda3/envs/conda-zx-qwen/lib/python3.8/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
(RayWorkerWrapper pid=37007) ERROR 06-12 15:44:15 worker_base.py:148] return func(*args, **kwargs)
(RayWorkerWrapper pid=37007) ERROR 06-12 15:44:15 worker_base.py:148] File "/home/admin/anaconda3/envs/conda-zx-qwen/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 2649, in broadcast_object_list
(RayWorkerWrapper pid=37007) ERROR 06-12 15:44:15 worker_base.py:148] broadcast(object_sizes_tensor, src=src, group=group)
(RayWorkerWrapper pid=37007) ERROR 06-12 15:44:15 worker_base.py:148] File "/home/admin/anaconda3/envs/conda-zx-qwen/lib/python3.8/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
(RayWorkerWrapper pid=37007) ERROR 06-12 15:44:15 worker_base.py:148] return func(*args, **kwargs)
(RayWorkerWrapper pid=37007) ERROR 06-12 15:44:15 worker_base.py:148] File "/home/admin/anaconda3/envs/conda-zx-qwen/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 2144, in broadcast
(RayWorkerWrapper pid=37007) ERROR 06-12 15:44:15 worker_base.py:148] work.wait()
(RayWorkerWrapper pid=37007) ERROR 06-12 15:44:15 worker_base.py:148] RuntimeError: [../third_party/gloo/gloo/transport/tcp/unbound_buffer.cc:81] Timed out waiting 1800000ms for recv operation to complete
(RayWorkerWrapper pid=37200) INFO 06-12 14:42:33 model_runner.py:924] Graph capturing finished in 11 secs. [repeated 2x across cluster]

@JaheimLee

JaheimLee commented Jun 17, 2024

same problem with vllm==0.5.0.post1

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO 06-17 14:51:19 gptq_marlin.py:134] The model is convertible to gptq_marlin during runtime. Using gptq_marlin kernel.
INFO 06-17 14:51:19 config.py:628] Defaulting to use mp for distributed inference
INFO 06-17 14:51:19 llm_engine.py:161] Initializing an LLM engine (v0.5.0.post1) with config: model='/data/pretrained_models/Qwen2-72B-Instruct-GPTQ-Int4', speculative_config=None, tokenizer='/data/pretrained_models/Qwen2-72B-Instruct-GPTQ-Int4', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=5000, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=4, disable_custom_all_reduce=False, quantization=gptq_marlin, enforce_eager=True, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0, served_model_name=/data/pretrained_models/Qwen2-72B-Instruct-GPTQ-Int4)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
(VllmWorkerProcess pid=1078909) INFO 06-17 14:51:22 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
(VllmWorkerProcess pid=1078911) INFO 06-17 14:51:22 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
(VllmWorkerProcess pid=1078910) INFO 06-17 14:51:22 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
(VllmWorkerProcess pid=1078911) INFO 06-17 14:51:22 utils.py:645] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=1078909) INFO 06-17 14:51:22 utils.py:645] Found nccl from library libnccl.so.2
INFO 06-17 14:51:22 utils.py:645] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=1078910) INFO 06-17 14:51:22 utils.py:645] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=1078911) INFO 06-17 14:51:22 pynccl.py:63] vLLM is using nccl==2.20.5
(VllmWorkerProcess pid=1078909) INFO 06-17 14:51:22 pynccl.py:63] vLLM is using nccl==2.20.5
(VllmWorkerProcess pid=1078910) INFO 06-17 14:51:22 pynccl.py:63] vLLM is using nccl==2.20.5
INFO 06-17 14:51:22 pynccl.py:63] vLLM is using nccl==2.20.5
WARNING 06-17 14:51:22 custom_all_reduce.py:166] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
(VllmWorkerProcess pid=1078910) WARNING 06-17 14:51:22 custom_all_reduce.py:166] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
(VllmWorkerProcess pid=1078911) WARNING 06-17 14:51:22 custom_all_reduce.py:166] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
(VllmWorkerProcess pid=1078909) WARNING 06-17 14:51:22 custom_all_reduce.py:166] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
(VllmWorkerProcess pid=1078910) INFO 06-17 14:51:26 model_runner.py:160] Loading model weights took 9.7410 GB
(VllmWorkerProcess pid=1078909) INFO 06-17 14:51:26 model_runner.py:160] Loading model weights took 9.7410 GB
(VllmWorkerProcess pid=1078911) INFO 06-17 14:51:26 model_runner.py:160] Loading model weights took 9.7410 GB
INFO 06-17 14:51:26 model_runner.py:160] Loading model weights took 9.7410 GB
INFO 06-17 14:51:33 distributed_gpu_executor.py:56] # GPU blocks: 9267, # CPU blocks: 3276
Server start!
(VllmWorkerProcess pid=1078909) ERROR 06-17 15:30:29 multiproc_worker_utils.py:226] Exception in worker VllmWorkerProcess while processing method start_worker_execution_loop: [../third_party/gloo/gloo/transport/tcp/unbound_buffer.cc:81] Timed out waiting 1800000ms for recv operation to complete, Traceback (most recent call last):
(VllmWorkerProcess pid=1078909) ERROR 06-17 15:30:29 multiproc_worker_utils.py:226]   File "/data/lijinghui/vllm/vllm/executor/multiproc_worker_utils.py", line 223, in _run_worker_process
(VllmWorkerProcess pid=1078909) ERROR 06-17 15:30:29 multiproc_worker_utils.py:226]     output = executor(*args, **kwargs)
(VllmWorkerProcess pid=1078909) ERROR 06-17 15:30:29 multiproc_worker_utils.py:226]              ^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=1078909) ERROR 06-17 15:30:29 multiproc_worker_utils.py:226]   File "/data/miniconda3/envs/ljh_py311/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
(VllmWorkerProcess pid=1078909) ERROR 06-17 15:30:29 multiproc_worker_utils.py:226]     return func(*args, **kwargs)
(VllmWorkerProcess pid=1078909) ERROR 06-17 15:30:29 multiproc_worker_utils.py:226]            ^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=1078909) ERROR 06-17 15:30:29 multiproc_worker_utils.py:226]   File "/data/lijinghui/vllm/vllm/worker/worker.py", line 294, in start_worker_execution_loop
(VllmWorkerProcess pid=1078909) ERROR 06-17 15:30:29 multiproc_worker_utils.py:226]     while self._execute_model_non_driver():
(VllmWorkerProcess pid=1078909) ERROR 06-17 15:30:29 multiproc_worker_utils.py:226]           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=1078909) ERROR 06-17 15:30:29 multiproc_worker_utils.py:226]   File "/data/lijinghui/vllm/vllm/worker/worker.py", line 303, in _execute_model_non_driver
(VllmWorkerProcess pid=1078909) ERROR 06-17 15:30:29 multiproc_worker_utils.py:226]     data = broadcast_tensor_dict(src=0)
(VllmWorkerProcess pid=1078909) ERROR 06-17 15:30:29 multiproc_worker_utils.py:226]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=1078909) ERROR 06-17 15:30:29 multiproc_worker_utils.py:226]   File "/data/lijinghui/vllm/vllm/distributed/communication_op.py", line 32, in broadcast_tensor_dict
(VllmWorkerProcess pid=1078909) ERROR 06-17 15:30:29 multiproc_worker_utils.py:226]     return get_tp_group().broadcast_tensor_dict(tensor_dict, src)
(VllmWorkerProcess pid=1078909) ERROR 06-17 15:30:29 multiproc_worker_utils.py:226]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=1078909) ERROR 06-17 15:30:29 multiproc_worker_utils.py:226]   File "/data/lijinghui/vllm/vllm/distributed/parallel_state.py", line 400, in broadcast_tensor_dict
(VllmWorkerProcess pid=1078909) ERROR 06-17 15:30:29 multiproc_worker_utils.py:226]     torch.distributed.broadcast_object_list(recv_metadata_list,
(VllmWorkerProcess pid=1078909) ERROR 06-17 15:30:29 multiproc_worker_utils.py:226]   File "/data/miniconda3/envs/ljh_py311/lib/python3.11/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
(VllmWorkerProcess pid=1078909) ERROR 06-17 15:30:29 multiproc_worker_utils.py:226]     return func(*args, **kwargs)
(VllmWorkerProcess pid=1078909) ERROR 06-17 15:30:29 multiproc_worker_utils.py:226]            ^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=1078909) ERROR 06-17 15:30:29 multiproc_worker_utils.py:226]   File "/data/miniconda3/envs/ljh_py311/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 2649, in broadcast_object_list
(VllmWorkerProcess pid=1078909) ERROR 06-17 15:30:29 multiproc_worker_utils.py:226]     broadcast(object_sizes_tensor, src=src, group=group)
(VllmWorkerProcess pid=1078909) ERROR 06-17 15:30:29 multiproc_worker_utils.py:226]   File "/data/miniconda3/envs/ljh_py311/lib/python3.11/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
(VllmWorkerProcess pid=1078909) ERROR 06-17 15:30:29 multiproc_worker_utils.py:226]     return func(*args, **kwargs)
(VllmWorkerProcess pid=1078909) ERROR 06-17 15:30:29 multiproc_worker_utils.py:226]            ^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=1078909) ERROR 06-17 15:30:29 multiproc_worker_utils.py:226]   File "/data/miniconda3/envs/ljh_py311/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 2144, in broadcast
(VllmWorkerProcess pid=1078909) ERROR 06-17 15:30:29 multiproc_worker_utils.py:226]     work.wait()
(VllmWorkerProcess pid=1078909) ERROR 06-17 15:30:29 multiproc_worker_utils.py:226] RuntimeError: [../third_party/gloo/gloo/transport/tcp/unbound_buffer.cc:81] Timed out waiting 1800000ms for recv operation to complete
(VllmWorkerProcess pid=1078909) ERROR 06-17 15:30:29 multiproc_worker_utils.py:226]
(VllmWorkerProcess pid=1078911) ERROR 06-17 15:30:30 multiproc_worker_utils.py:226] Exception in worker VllmWorkerProcess while processing method start_worker_execution_loop: [../third_party/gloo/gloo/transport/tcp/unbound_buffer.cc:81] Timed out waiting 1800000ms for recv operation to complete, Traceback (most recent call last): ...
(VllmWorkerProcess pid=1078910) ERROR 06-17 15:30:30 multiproc_worker_utils.py:226] Exception in worker VllmWorkerProcess while processing method start_worker_execution_loop: [../third_party/gloo/gloo/transport/tcp/unbound_buffer.cc:81] Timed out waiting 1800000ms for recv operation to complete, Traceback (most recent call last): ...
(The tracebacks for pid=1078911 and pid=1078910 are identical to the one above for pid=1078909.)
[2024-06-17 15:38:08 +0800] [1078766] [ERROR] Exception in ASGI application
Traceback (most recent call last):
  File "/data/miniconda3/envs/ljh_py311/lib/python3.11/site-packages/uvicorn/protocols/http/httptools_impl.py", line 411, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/miniconda3/envs/ljh_py311/lib/python3.11/site-packages/uvicorn/middleware/proxy_headers.py", line 69, in __call__
    return await self.app(scope, receive, send)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/miniconda3/envs/ljh_py311/lib/python3.11/site-packages/fastapi/applications.py", line 1054, in __call__
    await super().__call__(scope, receive, send)
  File "/data/miniconda3/envs/ljh_py311/lib/python3.11/site-packages/starlette/applications.py", line 123, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/data/miniconda3/envs/ljh_py311/lib/python3.11/site-packages/starlette/middleware/errors.py", line 186, in __call__
    raise exc
  File "/data/miniconda3/envs/ljh_py311/lib/python3.11/site-packages/starlette/middleware/errors.py", line 164, in __call__
    await self.app(scope, receive, _send)
  File "/data/miniconda3/envs/ljh_py311/lib/python3.11/site-packages/starlette/middleware/exceptions.py", line 65, in __call__
    await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
  File "/data/miniconda3/envs/ljh_py311/lib/python3.11/site-packages/starlette/_exception_handler.py", line 64, in wrapped_app
    raise exc
  File "/data/miniconda3/envs/ljh_py311/lib/python3.11/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    await app(scope, receive, sender)
  File "/data/miniconda3/envs/ljh_py311/lib/python3.11/site-packages/starlette/routing.py", line 756, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/data/miniconda3/envs/ljh_py311/lib/python3.11/site-packages/starlette/routing.py", line 776, in app
    await route.handle(scope, receive, send)
  File "/data/miniconda3/envs/ljh_py311/lib/python3.11/site-packages/starlette/routing.py", line 297, in handle
    await self.app(scope, receive, send)
  File "/data/miniconda3/envs/ljh_py311/lib/python3.11/site-packages/starlette/routing.py", line 77, in app
    await wrap_app_handling_exceptions(app, request)(scope, receive, send)
  File "/data/miniconda3/envs/ljh_py311/lib/python3.11/site-packages/starlette/_exception_handler.py", line 64, in wrapped_app
    raise exc
  File "/data/miniconda3/envs/ljh_py311/lib/python3.11/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    await app(scope, receive, sender)
  File "/data/miniconda3/envs/ljh_py311/lib/python3.11/site-packages/starlette/routing.py", line 72, in app
    response = await func(request)
               ^^^^^^^^^^^^^^^^^^^
  File "/data/miniconda3/envs/ljh_py311/lib/python3.11/site-packages/fastapi/routing.py", line 278, in app
    raw_response = await run_endpoint_function(
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/miniconda3/envs/ljh_py311/lib/python3.11/site-packages/fastapi/routing.py", line 191, in run_endpoint_function
    return await dependant.call(**values)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/lijinghui/online_chat/chat_fastapi_vllm.py", line 171, in predict
    else:
        ^^
  File "/data/lijinghui/online_chat/chat_fastapi_vllm.py", line 97, in inference
    def inference(question, res):
                    ^^^^^^^^^^^^^
  File "/data/lijinghui/online_chat/chat_fastapi_vllm.py", line 88, in run
    query = build_query(kwargs["model_path"], tokenizer, question, history)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/lijinghui/vllm/vllm/utils.py", line 699, in inner
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/data/lijinghui/vllm/vllm/entrypoints/llm.py", line 304, in generate
    outputs = self._run_engine(use_tqdm=use_tqdm)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/lijinghui/vllm/vllm/entrypoints/llm.py", line 556, in _run_engine
    step_outputs = self.llm_engine.step()
                   ^^^^^^^^^^^^^^^^^^^^^^
  File "/data/lijinghui/vllm/vllm/engine/llm_engine.py", line 776, in step
    output = self.model_executor.execute_model(
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/lijinghui/vllm/vllm/executor/distributed_gpu_executor.py", line 76, in execute_model
    return self._driver_execute_model(execute_model_req)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/lijinghui/vllm/vllm/executor/multiproc_gpu_executor.py", line 84, in _driver_execute_model
    return self.driver_worker.execute_model(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/miniconda3/envs/ljh_py311/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/data/lijinghui/vllm/vllm/worker/worker.py", line 272, in execute_model
    broadcast_tensor_dict(data, src=0)
  File "/data/lijinghui/vllm/vllm/distributed/communication_op.py", line 32, in broadcast_tensor_dict
    return get_tp_group().broadcast_tensor_dict(tensor_dict, src)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/lijinghui/vllm/vllm/distributed/parallel_state.py", line 374, in broadcast_tensor_dict
    torch.distributed.broadcast_object_list([metadata_list],
  File "/data/miniconda3/envs/ljh_py311/lib/python3.11/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/data/miniconda3/envs/ljh_py311/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 2649, in broadcast_object_list
    broadcast(object_sizes_tensor, src=src, group=group)
  File "/data/miniconda3/envs/ljh_py311/lib/python3.11/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/data/miniconda3/envs/ljh_py311/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 2144, in broadcast
    work.wait()
RuntimeError: [../third_party/gloo/gloo/transport/tcp/pair.cc:534] Connection closed by peer [192.168.12.71]:51841

@youkaichao
Member

When does the timeout happen? Is it because the vLLM instance is idle for too long?

@JaheimLee

JaheimLee commented Jun 19, 2024

When does the timeout happen? Is it because the vLLM instance is idle for too long?

Yes, I need to make a request every 1800 s in the background to keep it alive, but I had never encountered this problem before.
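
For reference, a minimal sketch of such a keep-alive loop, assuming the model sits behind an HTTP endpoint; the URL, payload and interval below are placeholders, not my actual service:

import threading
import time

import requests

KEEPALIVE_URL = "http://127.0.0.1:8000/predict"  # placeholder endpoint
KEEPALIVE_INTERVAL_S = 1500  # comfortably below the 1800 s timeout

def keep_alive():
    # Periodically send a tiny request so the tensor-parallel workers never
    # sit in the broadcast wait longer than the 30-minute timeout.
    while True:
        try:
            requests.post(KEEPALIVE_URL, json={"question": "ping"}, timeout=60)
        except requests.RequestException as exc:
            print(f"keep-alive request failed: {exc}")
        time.sleep(KEEPALIVE_INTERVAL_S)

threading.Thread(target=keep_alive, daemon=True).start()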

@754807266

When does the timeout happen? Is it because the vLLM instance is idle for too long?

I run an LLM server in a test environment, and there are no requests overnight. The next morning I find that vLLM and the local model have timed out, which causes a service exception.

@youkaichao
Member

cc @njhill

@njhill
Member

njhill commented Jun 22, 2024

I haven't seen this before but will try to reproduce. Does it happen every time if you start a server and then don't send any requests for > 30 minutes?

@njhill
Member

njhill commented Jun 23, 2024

@JaheimLee I've been unable to reproduce this so far, both by waiting > 30 min after the server starts and by waiting after making a request. Have you made any custom changes? Are you using the out-of-the-box OpenAI server?

@enkiid

enkiid commented Jun 24, 2024

I have the same issue with vllm 0.5.0.post1 when using the LLM class:

from vllm import LLM
model = LLM(model=model_name,  # e.g. mistralai/Mixtral-8x7B-Instruct-v0.1
            tokenizer=model_name,  # HF tokenizer mistralai/Mixtral-8x7B-Instruct-v0.1
            trust_remote_code=False,
            dtype=dtype,  # torch.bfloat16
            tensor_parallel_size=tensor_parallel_size,  # 4
            gpu_memory_utilization=gpu_memory_utilization,  # 0.9
            distributed_executor_backend="ray",
            max_model_len=context_window)

After 30 minutes of no requests to my API, I get the timeout error:

...
(RayWorkerWrapper pid=3171083) ERROR 06-20 14:03:38 worker_base.py:148] RuntimeError: [../third_party/gloo/gloo/transport/tcp/unbound_buffer.cc:81] Timed out waiting 1800000ms for recv operation to complete 
...
ERROR 06-24 07:07:08 worker_base.py:148] RuntimeError: [../third_party/gloo/gloo/transport/tcp/pair.cc:534] Connection closed by peer [127.0.0.2]:27025

@JaheimLee

@JaheimLee I've been unable to reproduce this so far, both by waiting > 30 min after the server starts and by waiting after making a request. Have you made any custom changes? Are you using the out-of-the-box OpenAI server?

Same as @enkiid. And with the latest source code I also got a warning: WARNING 06-24 20:44:26 shm_broadcast.py:214] No available block found in 60 second.

@youkaichao
Member

@JaheimLee when you use the latest source code, does it still time out after 30 mins? The warning you see is expected; it just tells you there have been no requests for 60 s. I assume #5399 should fix the 30-minute timeout issue.

@JaheimLee

@JaheimLee when you use the latest source code, does it still time out after 30 mins? The warning you see is expected; it just tells you there have been no requests for 60 s. I assume #5399 should fix the 30-minute timeout issue.

There seems to be no timeout error now.

@youkaichao
Member

Closing, as #5399 should fix this.

@njhill njhill reopened this Jun 29, 2024
@njhill
Member

njhill commented Jun 29, 2024

The problem is that the workers should never be waiting on the broadcast operation indefinitely, so I don't think #5399 will fix it. They'll just be waiting in a shm broadcast operation rather than in torch.distributed.

I'm hoping that #5987 will address it though.

@nightflight-dk

nightflight-dk commented Jul 15, 2024

This blocks our production work (reproducible with 0.5.1); please consider prioritizing. cc @WoosukKwon

@njhill
Member

njhill commented Jul 17, 2024

@nightflight-dk could you provide more details about how you are running vLLM, how the problem is manifesting in your case, and confirm that you are on 0.5.1 (or later)? This problem should be fixed by #5987, which is in 0.5.1.
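
A quick way to check whether a given install already contains that fix, as a sketch assuming a pip-managed environment (the packaging dependency is only used for the version comparison):

import vllm
from packaging.version import Version

# #5987 shipped in 0.5.1, so anything older still has the idle-timeout behaviour.
if Version(vllm.__version__) < Version("0.5.1"):
    print(f"vllm {vllm.__version__} predates the fix; upgrade with: pip install -U vllm")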

@nightflight-dk

Hi @njhill, sure: a tensor-parallel deployment on DGX clusters with 8x A100 GPUs (Azure ML managed endpoints), on vllm 0.5.0.post1. The deployment failed to process an incoming request with the above symptoms after a period of inactivity. About to retry with 0.5.2, thank you for your efforts.

@njhill
Member

njhill commented Jul 18, 2024

Thanks @nightflight-dk, yes as mentioned this issue would be expected on 0.5.0.post1. And I assume you're using LLM or LLMEngine rather than AsyncLLMEngine or the OpenAI server?

@nightflight-dk

nightflight-dk commented Jul 19, 2024

Correct, @njhill: LLMEngine (we'd like to switch to AsyncLLMEngine from Triton within a month).

Unfortunately, the switch to 0.5.2 (again with punica and tensor parallelism enabled) hangs while loading a 7B model; an import warning points to https://stackoverflow.com/questions/65120136/lib64-libc-so-6-version-glibc-2-32-not-found

libc6=2.31 cuda12.2 nvidia-driver=535.183 nccl=2.20.5 (ubuntu 20)

Meanwhile, the switch to 0.5.1 appears blocked by the following:

[rank7]:[E ProcessGroupNCCL.cpp:1414] [PG 2 Rank 7] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f699a57a897 in /azureml-envs/azureml_1c92738a7bdba66b39ae344257bb429d/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f699a52ab25 in /azureml-envs/azureml_1c92738a7bdba66b39ae344257bb429d/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f699a9a7718 in /azureml-envs/azureml_1c92738a7bdba66b39ae344257bb429d/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x7f694e4738e6 in /azureml-envs/azureml_1c92738a7bdba66b39ae344257bb429d/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x58 (0x7f694e4779e8 in /azureml-envs/azureml_1c92738a7bdba66b39ae344257bb429d/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x77c (0x7f694e47d05c in /azureml-envs/azureml_1c92738a7bdba66b39ae344257bb429d/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f694e47ddcc in /azureml-envs/azureml_1c92738a7bdba66b39ae344257bb429d/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #7: <unknown function> + 0xdc253 (0x7f699b0b0253 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #8: <unknown function> + 0x94ac3 (0x7f699e9b2ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #9: <unknown function> + 0x126850 (0x7f699ea44850 in /usr/lib/x86_64-linux-gnu/libc.so.6)

 nranks 8 cudaDev 3 nvmlDev 3 busId 400000 commId 0xadd71877c1f309c6 - Init COMPLETE

@ehuaa

ehuaa commented Jul 19, 2024

My production environment is the same as @nightflight-dk's, and my code is on v0.5.2.
During the prefill stage I hit the error below, which is not quite the same as Damian's above, but the result is that my vLLM instance hangs.
The prompt length is about 128K and my model is Qwen2-72B; maybe the prompt is too long, which causes the timeout of the NCCL collective operation?

(VllmWorkerProcess pid=143) WARNING 07-19 17:03:17 shm_broadcast.py:394] No available block found in 60 second. 
(VllmWorkerProcess pid=147) WARNING 07-19 17:03:17 shm_broadcast.py:394] No available block found in 60 second. 
[rank7]:[E ProcessGroupNCCL.cpp:563] [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=7402, OpType=GATHER, NumelIn=57024, NumelOut=0, Timeout(ms)=600000) ran for 600008 milliseconds before timing out.
[rank7]:[E ProcessGroupNCCL.cpp:1537] [PG 2 Rank 7] Timeout at NCCL work: 7402, last enqueued NCCL work: 7402, last completed NCCL work: 7401.
[rank7]:[E ProcessGroupNCCL.cpp:577] [Rank 7] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank7]:[E ProcessGroupNCCL.cpp:583] [Rank 7] To avoid data inconsistency, we are taking the entire process down.
[rank7]:[E ProcessGroupNCCL.cpp:1414] [PG 2 Rank 7] Process group watchdog thread terminated with exception: [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=7402, OpType=GATHER, NumelIn=57024, NumelOut=0, Timeout(ms)=600000) ran for 600008 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f4c7bacf897 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f4c2f84bc62 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f4c2f850a80 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f4c2f851dcc in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xdc253 (0x7f4c7b2b0253 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #5: <unknown function> + 0x94ac3 (0x7f4c7c7b6ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #6: <unknown function> + 0x126850 (0x7f4c7c848850 in /usr/lib/x86_64-linux-gnu/libc.so.6)

Can you help me with this problem? Thanks. @njhill @youkaichao
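
For context on the Timeout(ms)=600000 in that log: it is the timeout of the underlying torch.distributed process group, which the NCCL watchdog enforces (10 minutes here; the 1800000 ms in the original report is the 30-minute gloo default). vLLM creates its process groups internally, so the snippet below is only an illustration of where that knob lives, not a drop-in fix:

import datetime

import torch.distributed as dist

# Single-process demo group; in a real multi-GPU launch the rank/world_size
# come from the launcher, and vLLM sets this up for you.
dist.init_process_group(
    backend="gloo",
    init_method="tcp://127.0.0.1:29500",
    world_size=1,
    rank=0,
    timeout=datetime.timedelta(minutes=60),  # the per-group timeout shown in the watchdog log
)
dist.destroy_process_group()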

@nightflight-dk

nightflight-dk commented Jul 19, 2024

Ubuntu 24.04 OK

libc6==2.39-0ubuntu8.2, 
CUDA 12.4
NCCL==2.20.5,
GPUs=2x4090,
lora_enabled = True, 
tensor_parallel_size=2
vllm==0.5.2

Ubuntu 22.04 OK

libc6==2.35-0ubuntu3.8,
CUDA 12.4   
NCCL==2.20.5,
GPUs=2x4090,
lora_enabled = True, 
tensor_parallel_size=2
vllm==0.5.2

The issues point to Ubuntu 20.04 with an old libc6. A working Ubuntu 22.04 image deployed on a docker host running 20.04 (unfortunately the current default with some cloud providers) seems affected as well.

@nightflight-dk

nightflight-dk commented Jul 23, 2024

@ehuaa it turns out that by leveraging tensor parallelism inside docker we expose ourselves to two issues: glibc conflicts and (in your case, I believe) the container's shared-memory limit. If you control the docker host, you're luckier than me: bumping the shared memory when running the container is just a parameter to docker run (see the sketch below). That might help, or you might hit glibc next like me. Can you confirm the OS and libc6 version for the host and the container in your setup?
@njhill could you kindly confirm the minimum version of glibc required to run vllm 0.5.2? I am hoping it's <= 2.31. We can continue this in a separate issue, thanks.
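
A minimal sketch of the docker run invocation I mean; the image tag, shm size, model and tensor-parallel degree are placeholders for your own setup:

docker run --gpus all --shm-size=16g -p 8000:8000 \
    vllm/vllm-openai:v0.5.2 \
    --model Qwen/Qwen2-72B-Instruct --tensor-parallel-size 8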

@njhill
Member

njhill commented Jul 24, 2024

@nightflight-dk I'm actually not sure, it would be best to open a dedicated issue related to that (or use the other one you linked from if that's similar).

@ehuaa your issue also looks like something different from the original one. I'd suggest opening an issue specifically for that, with more details including more of the log output. There is detailed method tracing that can be enabled to help debug these kinds of CUDA/NCCL issues.

I'm going to close this one since I'm fairly certain now the original reported problem with the distributed timeout is fixed (in 0.5.1 onward).

@njhill njhill closed this as completed Jul 24, 2024
@ktrapeznikov

Seeing this issue again when sending multiple asynchronous requests:
Error executing method start_worker_execution_loop. This might cause deadlock in distributed execution.

I am running the OpenAI server using vllm/vllm-openai:v0.5.3.post1 with the following arguments:
--model llava-hf/llava-v1.6-34b-hf --tensor-parallel-size 2 --distributed-executor-backend=ray --gpu-memory-utilization=.9


@wkf96

wkf96 commented Oct 12, 2024

Hello, have you found a solution for this issue ([Bug]: The vllm is disconnected after running for some time)? I'm still facing this problem and haven't resolved it yet. @zxcdsa45687

(RayWorkerWrapper pid=37007) ERROR 06-12 15:44:15 worker_base.py:148] Error executing method start_worker_execution_loop. This might cause deadlock in distributed execution.
...
(RayWorkerWrapper pid=37007) ERROR 06-12 15:44:15 worker_base.py:148] RuntimeError: [../third_party/gloo/gloo/transport/tcp/unbound_buffer.cc:81] Timed out waiting 1800000ms for recv operation to complete
