[Bug]: The vllm is disconnected after running for some time #5084
Comments
I run into the same issue with vllm==0.4.3 and only one NCCL version, nvidia-nccl-cu12==2.20.5.
I meet the same problem: Timed out waiting 1800000ms. (RayWorkerWrapper pid=37007) ERROR 06-12 15:44:15 worker_base.py:148] Error executing method start_worker_execution_loop. This might cause deadlock in distributed execution.
Same problem with vllm==0.5.0.post1.
When does the timeout happen? Is it because the vLLM instance is idle for too long?
Yes, I need to make a request every 1800 s in the background to keep it alive. I had never encountered this problem before.
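A minimal sketch of that keep-alive workaround, for anyone who needs a stopgap until the fix lands (assumptions: the OpenAI-compatible server is on localhost:8000, the model name is a placeholder, and a 1-token request every 25 minutes is enough to keep the workers from idling past the 30-minute window):

# keep_alive.py - workaround sketch, not an official fix.
# Periodically sends a tiny completion request so the distributed workers
# never sit idle long enough to hit the 30-minute collective timeout.
import time
import requests

VLLM_URL = "http://localhost:8000/v1/completions"  # placeholder endpoint
MODEL = "mistralai/Mixtral-8x7B-Instruct-v0.1"     # placeholder model name
INTERVAL_S = 25 * 60                               # stay under the 1800 s window

def ping_once() -> None:
    # A 1-token request is enough to wake the worker execution loop.
    payload = {"model": MODEL, "prompt": "ping", "max_tokens": 1}
    try:
        requests.post(VLLM_URL, json=payload, timeout=60).raise_for_status()
    except requests.RequestException as exc:
        print(f"keep-alive ping failed: {exc}")

if __name__ == "__main__":
    while True:
        ping_once()
        time.sleep(INTERVAL_S)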
I run an LLM server in a test environment with no requests overnight; the next morning I find that vLLM and the local model have timed out.
cc @njhill
I haven't seen this before but will try to reproduce. Does it happen every time if you start a server and then don't send any requests for > 30 minutes?
@JaheimLee I've been unable to reproduce this so far, both waiting > 30 min after the server starts and waiting again after making a request. Have you made any custom changes? Are you using the out-of-the-box OpenAI server?
I have the same issue when using LLM directly:

from vllm import LLM

model = LLM(
    model=model_name,                              # e.g. mistralai/Mixtral-8x7B-Instruct-v0.1
    tokenizer=model_name,                          # HF tokenizer, same repo as the model
    trust_remote_code=False,
    dtype=dtype,                                   # torch.bfloat16
    tensor_parallel_size=tensor_parallel_size,     # 4
    gpu_memory_utilization=gpu_memory_utilization, # 0.9
    distributed_executor_backend="ray",
    max_model_len=context_window,
)

After 30 min of no requests to my API I get the timed-out error:
Same as @enkiid. And I also got a warning using the latest source code.
@JaheimLee when you use the latest source code, does it still time out after 30 min? The warning you see is expected; it literally just tells you there have been no requests in 60 s. I assume #5399 should fix the 30-minute timeout issue.
Seems there is no timeout error now.
Closing, as #5399 should fix this.
This blocks our prod work (reproducible with 0.5.1); please consider prioritizing. cc @WoosukKwon
@nightflight-dk could you provide more details about how you are running vLLM and how the problem is manifesting in your case, and confirm that you are on 0.5.1 (or later)? This problem should be fixed by #5987, which is in 0.5.1.
Hi @njhill, sure. Tensor-parallel deployment on DGX clusters with 8x A100 GPUs (Azure ML managed endpoints), vLLM 0.5.0.post1. The deployment failed to process an incoming request with the above symptoms after a period of inactivity. About to retry with 0.5.2; thank you for your efforts.
Thanks @nightflight-dk, yes as mentioned this issue would be expected on 0.5.0.post1. And I assume you're using LLMEngine?
Correct @njhill, LLMEngine (would like to switch to AsyncLLMEngine from Triton within a month). Unfortunately, the switch to 0.5.2 (again with Punica and tensor parallelism enabled) hangs while loading a 7B model; an import warning points to https://stackoverflow.com/questions/65120136/lib64-libc-so-6-version-glibc-2-32-not-found. Environment: libc6=2.31, cuda12.2, nvidia-driver=535.183, nccl=2.20.5 (Ubuntu 20). Meanwhile, the switch to 0.5.1 appears blocked by the below:
My production environment is the same as @nightflight-dk's, and my vLLM version is v0.5.2.
Can you help me with this problem? Thanks. @njhill @youkaichao
Ubuntu 24.04: OK
Ubuntu 22.04: OK
Issues point to Ubuntu 20.04 with its old libc6. A working Ubuntu 22.04 image deployed on a Docker host running 20.04 (unfortunately the current default with some cloud providers) seems affected as well (see the check below).
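For reference, a quick way to confirm which glibc a host or a container image ships (standard commands; ldd is part of glibc):

ldd --version             # prints the glibc version, e.g. 2.31 on Ubuntu 20.04
getconf GNU_LIBC_VERSION  # e.g. "glibc 2.31"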
@ehuaa it turns out that by leveraging tensor parallelism inside Docker we're exposing ourselves to two issues: glibc conflicts and (in your case, I believe) shared-memory limits. If you control the Docker host, you're luckier than me. Bumping the shared memory when running the container is just a matter of passing a parameter to docker run (see the sketch below). It might help, or you might hit glibc next like me. Can you confirm the OS and libc6 version for the host and the container in your setup?
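A minimal sketch of that docker run parameter (image tag, port, model, and sizes are placeholders for your own deployment; the usual options are a larger --shm-size or, if acceptable, --ipc=host so the container shares the host's /dev/shm):

docker run --gpus all --shm-size=16g -p 8000:8000 \
    vllm/vllm-openai:v0.5.2 \
    --model mistralai/Mixtral-8x7B-Instruct-v0.1 \
    --tensor-parallel-size 4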
@nightflight-dk I'm actually not sure; it would be best to open a dedicated issue for that (or use the other one you linked from, if that's similar). @ehuaa your issue also looks different from the original one. I'd suggest opening an issue specifically for that, with more details including more of the log output; there is detailed method tracing that can be enabled to help debug these kinds of CUDA/NCCL issues. I'm going to close this one, since I'm fairly certain now that the originally reported problem with the distributed timeout is fixed (in 0.5.1 onward).
Seeing this issue again when sending multiple asynchronous requests. I am running the OpenAI server using vllm/vllm-openai:v0.5.3.post1 with the following arguments:
Hello, have you found a solution for the issue "[Bug]: The vllm is disconnected after running for some time"? I'm still facing this problem and haven't resolved it yet. @zxcdsa45687 (RayWorkerWrapper pid=37007) ERROR 06-12 15:44:15 worker_base.py:148] Error executing method start_worker_execution_loop. This might cause deadlock in distributed execution.
Your current environment
PyTorch version: 2.3.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A
OS: Ubuntu 20.04.2 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
Clang version: Could not collect
CMake version: version 3.29.3
Libc version: glibc-2.31
Python version: 3.8.19 (default, Mar 20 2024, 19:58:24) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-5.8.0-43-generic-x86_64-with-glibc2.17
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: NVIDIA A800 80GB PCIe
GPU 1: NVIDIA A800 80GB PCIe
Versions of relevant libraries:
[pip3] numpy==1.24.4
[pip3] nvidia-nccl-cu12==2.20.5
[pip3] torch==2.3.0
[pip3] triton==2.3.0
[pip3] vllm_nccl_cu12==2.18.1.0.4.0
[conda] numpy 1.24.4 pypi_0 pypi
[conda] nvidia-nccl-cu12 2.20.5 pypi_0 pypi
[conda] torch 2.3.0 pypi_0 pypi
[conda] triton 2.3.0 pypi_0 pypi
[conda] vllm-nccl-cu12 2.18.1.0.4.0 pypi_0 pypi
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.4.2
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
🐛 Describe the bug
Error executing method start_worker_execution_loop. This might cause deadlock in distributed execution.
(RayWorkerWrapper pid=3957362) ERROR 05-28 16:05:41 worker_base.py:146] Traceback (most recent call last):
(RayWorkerWrapper pid=3957362) ERROR 05-28 16:05:41 worker_base.py:146]   File "vllms/vllm/worker/worker_base.py", line 138, in execute_method
(RayWorkerWrapper pid=3957362) ERROR 05-28 16:05:41 worker_base.py:146]     return executor(*args, **kwargs)
(RayWorkerWrapper pid=3957362) ERROR 05-28 16:05:41 worker_base.py:146]   File "miniconda3/envs/vllmss/lib/python3.8/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
(RayWorkerWrapper pid=3957362) ERROR 05-28 16:05:41 worker_base.py:146]     return func(*args, **kwargs)
(RayWorkerWrapper pid=3957362) ERROR 05-28 16:05:41 worker_base.py:146]   File "vllm/worker/worker.py", line 286, in start_worker_execution_loop
(RayWorkerWrapper pid=3957362) ERROR 05-28 16:05:41 worker_base.py:146]     while self._execute_model_non_driver():
(RayWorkerWrapper pid=3957362) ERROR 05-28 16:05:41 worker_base.py:146]   File "vllm/worker/worker.py", line 295, in _execute_model_non_driver
(RayWorkerWrapper pid=3957362) ERROR 05-28 16:05:41 worker_base.py:146]     data = broadcast_tensor_dict(src=0)
(RayWorkerWrapper pid=3957362) ERROR 05-28 16:05:41 worker_base.py:146]   File "vllm/distributed/communication_op.py", line 284, in broadcast_tensor_dict
(RayWorkerWrapper pid=3957362) ERROR 05-28 16:05:41 worker_base.py:146]     torch.distributed.broadcast_object_list(recv_metadata_list,
(RayWorkerWrapper pid=3957362) ERROR 05-28 16:05:41 worker_base.py:146]   File "miniconda3/envs/vllmss/lib/python3.8/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
(RayWorkerWrapper pid=3957362) ERROR 05-28 16:05:41 worker_base.py:146]     return func(*args, **kwargs)
(RayWorkerWrapper pid=3957362) ERROR 05-28 16:05:41 worker_base.py:146]   File "miniconda3/envs/vllmss/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 2649, in broadcast_object_list
(RayWorkerWrapper pid=3957362) ERROR 05-28 16:05:41 worker_base.py:146]     broadcast(object_sizes_tensor, src=src, group=group)
(RayWorkerWrapper pid=3957362) ERROR 05-28 16:05:41 worker_base.py:146]   File "miniconda3/envs/vllmss/lib/python3.8/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
(RayWorkerWrapper pid=3957362) ERROR 05-28 16:05:41 worker_base.py:146]     return func(*args, **kwargs)
(RayWorkerWrapper pid=3957362) ERROR 05-28 16:05:41 worker_base.py:146]   File "miniconda3/envs/vllmss/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 2144, in broadcast
(RayWorkerWrapper pid=3957362) ERROR 05-28 16:05:41 worker_base.py:146]     work.wait()
(RayWorkerWrapper pid=3957362) ERROR 05-28 16:05:41 worker_base.py:146] RuntimeError: [../third_party/gloo/gloo/transport/tcp/unbound_buffer.cc:81] Timed out waiting 1800000ms for recv operation to complete
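For context (my reading of the log, not something stated above): 1800000 ms is exactly 30 minutes, which matches torch.distributed's default process-group timeout, so a non-driver worker blocked in broadcast_tensor_dict with no incoming work for half an hour hits this gloo recv timeout. A quick check, assuming a stock PyTorch install with the default timeout unchanged:

# 1800000 ms from the error above equals torch.distributed's default
# process-group timeout of 30 minutes (assumes the default was not overridden).
from datetime import timedelta
from torch.distributed.constants import default_pg_timeout

assert timedelta(milliseconds=1_800_000) == timedelta(minutes=30)
print(default_pg_timeout)  # 0:30:00 on stock PyTorch 2.3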