Description
Your current environment
The output of `python collect_env.py`
root@vllm-0-4-3-predictor-00001-deployment-54797fd955-p7b8g:/vllm-workspace# python3 collect_env.py
Collecting environment information...
PyTorch version: 2.3.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A
OS: Ubuntu 22.04.4 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: version 3.29.5
Libc version: glibc-2.35
Python version: 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0] (64-bit runtime)
Python platform: Linux-5.4.0-65-generic-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: NVIDIA A100-SXM4-80GB
GPU 1: NVIDIA A100-SXM4-80GB
Nvidia driver version: 535.86.10
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 45 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 128
On-line CPU(s) list: 0-127
Vendor ID: GenuineIntel
Model name: Intel(R) Xeon(R) Platinum 8468
CPU family: 6
Model: 143
Thread(s) per core: 1
Core(s) per socket: 64
Socket(s): 2
Stepping: 8
BogoMIPS: 4199.99
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology tsc_reliable nonstop_tsc cpuid pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves avx512_bf16 wbnoinvd arat avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq rdpid cldemote movdiri movdir64b md_clear flush_l1d arch_capabilities
Hypervisor vendor: VMware
Virtualization type: full
L1d cache: 6 MiB (128 instances)
L1i cache: 4 MiB (128 instances)
L2 cache: 256 MiB (128 instances)
L3 cache: 210 MiB (2 instances)
NUMA node(s): 2
NUMA node0 CPU(s): 0-63
NUMA node1 CPU(s): 64-127
Vulnerability Itlb multihit: KVM: Vulnerable
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Enhanced IBRS, IBPB conditional, RSB filling
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected
Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] nvidia-nccl-cu12==2.20.5
[pip3] torch==2.3.0
[pip3] transformers==4.41.2
[pip3] triton==2.3.0
[conda] Could not collect
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.4.3
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
GPU0 GPU1 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X NV12 0-127 0-1 N/A
GPU1 NV12 X 0-127 0-1 N/A
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
🐛 Describe the bug
In vLLM v0.4.3 and later, calling list_loras() with tensor parallelism enabled causes the engine to hang.
Working from vLLM v0.4.3, I modified the code to check which multi-LoRA adapters are currently loaded on the CPU/GPU.
As shown below, I simply added a call to self.list_loras() inside the do_log_stats() method of vllm/engine/llm_engine.py.
def do_log_stats(
        self,
        scheduler_outputs: Optional[SchedulerOutputs] = None,
        model_output: Optional[List[SamplerOutput]] = None) -> None:
    """Forced log when no requests active."""
    if self.log_stats:
        logger.info(f"self.list_loras(): {self.list_loras()}")
        self.stat_logger.log(
            self._get_stats(scheduler_outputs, model_output))
I run the framework through the OpenAI entrypoint, and do_log_stats() works fine as long as no LLM inference is in progress.
However, as soon as I call the /v1/completions API, execution gets stuck inside list_loras() and the /v1/completions request never returns; the request I send is roughly the sketch below.
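(A minimal sketch of that request; the host, port, model name, and prompt are placeholders for my actual deployment.)

import requests

# In the buggy case this call never returns; the engine is stuck in list_loras().
resp = requests.post(
    "http://localhost:8000/v1/completions",
    json={
        "model": "my-lora-adapter",  # placeholder: a served LoRA module name
        "prompt": "Hello, world!",
        "max_tokens": 16,
    },
)
print(resp.status_code, resp.text)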
After 30 minutes in this state, the following error message is returned.
(VllmWorkerProcess pid=73) ERROR 06-13 01:30:49 multiproc_worker_utils.py:225] Exception in worker VllmWorkerProcess while processing method start_worker_execution_loop: [../third_party/gloo/gloo/transport/tcp/unbound_buffer.cc:81] Timed out waiting 1800000ms for recv operation to complete, Traceback (most recent call last):
(VllmWorkerProcess pid=73) ERROR 06-13 01:30:49 multiproc_worker_utils.py:225] File "/usr/local/lib/python3.10/dist-packages/vllm/executor/multiproc_worker_utils.py", line 222, in _run_worker_process
(VllmWorkerProcess pid=73) ERROR 06-13 01:30:49 multiproc_worker_utils.py:225] output = executor(*args, **kwargs)
(VllmWorkerProcess pid=73) ERROR 06-13 01:30:49 multiproc_worker_utils.py:225] File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
(VllmWorkerProcess pid=73) ERROR 06-13 01:30:49 multiproc_worker_utils.py:225] return func(*args, **kwargs)
(VllmWorkerProcess pid=73) ERROR 06-13 01:30:49 multiproc_worker_utils.py:225] File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 286, in start_worker_execution_loop
(VllmWorkerProcess pid=73) ERROR 06-13 01:30:49 multiproc_worker_utils.py:225] while self._execute_model_non_driver():
(VllmWorkerProcess pid=73) ERROR 06-13 01:30:49 multiproc_worker_utils.py:225] File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 295, in _execute_model_non_driver
(VllmWorkerProcess pid=73) ERROR 06-13 01:30:49 multiproc_worker_utils.py:225] data = broadcast_tensor_dict(src=0)
(VllmWorkerProcess pid=73) ERROR 06-13 01:30:49 multiproc_worker_utils.py:225] File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/communication_op.py", line 284, in broadcast_tensor_dict
(VllmWorkerProcess pid=73) ERROR 06-13 01:30:49 multiproc_worker_utils.py:225] torch.distributed.broadcast_object_list(recv_metadata_list,
(VllmWorkerProcess pid=73) ERROR 06-13 01:30:49 multiproc_worker_utils.py:225] File "/usr/local/lib/python3.10/dist-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
(VllmWorkerProcess pid=73) ERROR 06-13 01:30:49 multiproc_worker_utils.py:225] return func(*args, **kwargs)
(VllmWorkerProcess pid=73) ERROR 06-13 01:30:49 multiproc_worker_utils.py:225] File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 2649, in broadcast_object_list
(VllmWorkerProcess pid=73) ERROR 06-13 01:30:49 multiproc_worker_utils.py:225] broadcast(object_sizes_tensor, src=src, group=group)
(VllmWorkerProcess pid=73) ERROR 06-13 01:30:49 multiproc_worker_utils.py:225] File "/usr/local/lib/python3.10/dist-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
(VllmWorkerProcess pid=73) ERROR 06-13 01:30:49 multiproc_worker_utils.py:225] return func(*args, **kwargs)
(VllmWorkerProcess pid=73) ERROR 06-13 01:30:49 multiproc_worker_utils.py:225] File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 2144, in broadcast
(VllmWorkerProcess pid=73) ERROR 06-13 01:30:49 multiproc_worker_utils.py:225] work.wait()
(VllmWorkerProcess pid=73) ERROR 06-13 01:30:49 multiproc_worker_utils.py:225] RuntimeError: [../third_party/gloo/gloo/transport/tcp/unbound_buffer.cc:81] Timed out waiting 1800000ms for recv operation to complete
(VllmWorkerProcess pid=73) ERROR 06-13 01:30:49 multiproc_worker_utils.py:225]
ERROR 06-13 01:30:49 async_llm_engine.py:524] Engine iteration timed out. This should never happen!
ERROR 06-13 01:30:49 async_llm_engine.py:45] Engine background task failed
ERROR 06-13 01:30:49 async_llm_engine.py:45] Traceback (most recent call last):
ERROR 06-13 01:30:49 async_llm_engine.py:45] File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 495, in engine_step
ERROR 06-13 01:30:49 async_llm_engine.py:45] request_outputs = await self.engine.step_async()
ERROR 06-13 01:30:49 async_llm_engine.py:45] File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 226, in step_async
ERROR 06-13 01:30:49 async_llm_engine.py:45] output = await self.model_executor.execute_model_async(
ERROR 06-13 01:30:49 async_llm_engine.py:45] File "/usr/local/lib/python3.10/dist-packages/vllm/executor/distributed_gpu_executor.py", line 169, in execute_model_async
ERROR 06-13 01:30:49 async_llm_engine.py:45] return await self._driver_execute_model_async(execute_model_req)
ERROR 06-13 01:30:49 async_llm_engine.py:45] File "/usr/local/lib/python3.10/dist-packages/vllm/executor/multiproc_gpu_executor.py", line 148, in _driver_execute_model_async
ERROR 06-13 01:30:49 async_llm_engine.py:45] return await self.driver_exec_model(execute_model_req)
ERROR 06-13 01:30:49 async_llm_engine.py:45] asyncio.exceptions.CancelledError
ERROR 06-13 01:30:49 async_llm_engine.py:45]
ERROR 06-13 01:30:49 async_llm_engine.py:45] During handling of the above exception, another exception occurred:
ERROR 06-13 01:30:49 async_llm_engine.py:45]
ERROR 06-13 01:30:49 async_llm_engine.py:45] Traceback (most recent call last):
ERROR 06-13 01:30:49 async_llm_engine.py:45] File "/usr/lib/python3.10/asyncio/tasks.py", line 456, in wait_for
ERROR 06-13 01:30:49 async_llm_engine.py:45] return fut.result()
ERROR 06-13 01:30:49 async_llm_engine.py:45] asyncio.exceptions.CancelledError
ERROR 06-13 01:30:49 async_llm_engine.py:45]
ERROR 06-13 01:30:49 async_llm_engine.py:45] The above exception was the direct cause of the following exception:
ERROR 06-13 01:30:49 async_llm_engine.py:45]
ERROR 06-13 01:30:49 async_llm_engine.py:45] Traceback (most recent call last):
ERROR 06-13 01:30:49 async_llm_engine.py:45] File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 40, in _raise_exception_on_finish
ERROR 06-13 01:30:49 async_llm_engine.py:45] task.result()
ERROR 06-13 01:30:49 async_llm_engine.py:45] File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 521, in run_engine_loop
ERROR 06-13 01:30:49 async_llm_engine.py:45] has_requests_in_progress = await asyncio.wait_for(
ERROR 06-13 01:30:49 async_llm_engine.py:45] File "/usr/lib/python3.10/asyncio/tasks.py", line 458, in wait_for
ERROR 06-13 01:30:49 async_llm_engine.py:45] raise exceptions.TimeoutError() from exc
ERROR 06-13 01:30:49 async_llm_engine.py:45] asyncio.exceptions.TimeoutError
ERROR:asyncio:Exception in callback functools.partial(<function _raise_exception_on_finish at 0x7f351ca30ee0>, error_callback=<bound method AsyncLLMEngine._error_callback of <vllm.engine.russianblue_async_llm_engine.RussianBlueAsyncLLMEngine object at 0x7f3510194100>>)
handle: <Handle functools.partial(<function _raise_exception_on_finish at 0x7f351ca30ee0>, error_callback=<bound method AsyncLLMEngine._error_callback of <vllm.engine.russianblue_async_llm_engine.RussianBlueAsyncLLMEngine object at 0x7f3510194100>>)>
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 495, in engine_step
request_outputs = await self.engine.step_async()
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 226, in step_async
output = await self.model_executor.execute_model_async(
File "/usr/local/lib/python3.10/dist-packages/vllm/executor/distributed_gpu_executor.py", line 169, in execute_model_async
return await self._driver_execute_model_async(execute_model_req)
File "/usr/local/lib/python3.10/dist-packages/vllm/executor/multiproc_gpu_executor.py", line 148, in _driver_execute_model_async
return await self.driver_exec_model(execute_model_req)
asyncio.exceptions.CancelledError
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/lib/python3.10/asyncio/tasks.py", line 456, in wait_for
return fut.result()
asyncio.exceptions.CancelledError
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 40, in _raise_exception_on_finish
task.result()
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 521, in run_engine_loop
has_requests_in_progress = await asyncio.wait_for(
File "/usr/lib/python3.10/asyncio/tasks.py", line 458, in wait_for
raise exceptions.TimeoutError() from exc
asyncio.exceptions.TimeoutError
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "uvloop/cbhandles.pyx", line 63, in uvloop.loop.Handle._run
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 47, in _raise_exception_on_finish
raise AsyncEngineDeadError(
vllm.engine.async_llm_engine.AsyncEngineDeadError: Task finished unexpectedly. This should never happen! Please open an issue on Github. See stack trace above for the actual cause.
INFO: 172.21.232.226:37556 - "GET /metrics HTTP/1.1" 200 OK
ERROR: Exception in ASGI application
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 495, in engine_step
request_outputs = await self.engine.step_async()
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 226, in step_async
output = await self.model_executor.execute_model_async(
File "/usr/local/lib/python3.10/dist-packages/vllm/executor/distributed_gpu_executor.py", line 169, in execute_model_async
return await self._driver_execute_model_async(execute_model_req)
File "/usr/local/lib/python3.10/dist-packages/vllm/executor/multiproc_gpu_executor.py", line 148, in _driver_execute_model_async
return await self.driver_exec_model(execute_model_req)
asyncio.exceptions.CancelledError
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/lib/python3.10/asyncio/tasks.py", line 456, in wait_for
return fut.result()
asyncio.exceptions.CancelledError
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/uvicorn/protocols/http/httptools_impl.py", line 399, in run_asgi
result = await app( # type: ignore[func-returns-value]
File "/usr/local/lib/python3.10/dist-packages/uvicorn/middleware/proxy_headers.py", line 70, in __call__
return await self.app(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/fastapi/applications.py", line 1054, in __call__
await super().__call__(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/applications.py", line 123, in __call__
await self.middleware_stack(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 186, in __call__
raise exc
File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 164, in __call__
await self.app(scope, receive, _send)
File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/cors.py", line 85, in __call__
await self.app(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/exceptions.py", line 65, in __call__
await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 64, in wrapped_app
raise exc
File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
await app(scope, receive, sender)
File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 756, in __call__
await self.middleware_stack(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 776, in app
await route.handle(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 297, in handle
await self.app(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 77, in app
await wrap_app_handling_exceptions(app, request)(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 64, in wrapped_app
raise exc
File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
await app(scope, receive, sender)
File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 72, in app
response = await func(request)
File "/usr/local/lib/python3.10/dist-packages/fastapi/routing.py", line 278, in app
raw_response = await run_endpoint_function(
File "/usr/local/lib/python3.10/dist-packages/fastapi/routing.py", line 191, in run_endpoint_function
return await dependant.call(**values)
File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 139, in create_completion
generator = await openai_serving_completion.create_completion(
File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/serving_completion.py", line 166, in create_completion
async for i, res in result_generator:
File "/usr/local/lib/python3.10/dist-packages/vllm/utils.py", line 244, in consumer
raise e
File "/usr/local/lib/python3.10/dist-packages/vllm/utils.py", line 235, in consumer
raise item
File "/usr/local/lib/python3.10/dist-packages/vllm/utils.py", line 219, in producer
async for item in iterator:
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 662, in generate
async for output in self._process_request(
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 769, in _process_request
raise e
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 765, in _process_request
async for request_output in stream:
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 80, in __anext__
raise result
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 40, in _raise_exception_on_finish
task.result()
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 521, in run_engine_loop
has_requests_in_progress = await asyncio.wait_for(
File "/usr/lib/python3.10/asyncio/tasks.py", line 458, in wait_for
raise exceptions.TimeoutError() from exc
asyncio.exceptions.TimeoutError
If I add --disable-log-stats to the launch arguments, the logging branch in do_log_stats() (and with it my list_loras() call) is skipped, and the /v1/completions API responds normally; see the launch sketch below.
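For completeness, this is roughly how I launch the server; --disable-log-stats is the only change needed for the workaround (the base model, adapter name/path, and tensor-parallel size are placeholders for my deployment):

import subprocess

# Rough sketch of the server launch (placeholders for my actual model/adapter).
# With --disable-log-stats the hang does not occur.
subprocess.run([
    "python", "-m", "vllm.entrypoints.openai.api_server",
    "--model", "meta-llama/Llama-2-7b-hf",                   # placeholder base model
    "--enable-lora",
    "--lora-modules", "my-lora-adapter=/path/to/adapter",    # placeholder adapter
    "--tensor-parallel-size", "2",
    "--disable-log-stats",  # workaround: skips the stats/list_loras() path
])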
In v0.4.2, list_loras() worked correctly in the same setup, but v0.4.3 introduced the following scheduling improvement, which seems to be the source of the problem.
- Eliminate parallel worker per-step task scheduling overhead ([Core] Eliminate parallel worker per-step task scheduling overhead #4894)
I'm also curious why the above PR causes list_loras() to hang.
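For what it's worth, my rough reading of the traceback above (a guess, not a confirmed root cause): after that change the non-driver workers sit inside start_worker_execution_loop, blocked in broadcast_tensor_dict(src=0) waiting for the next step's metadata, so a list_loras() call that the driver fans out to the workers is queued behind that loop and never serviced, while the driver in turn blocks waiting for the result. The toy script below (my own code, not vLLM's) reproduces just that deadlock shape with plain multiprocessing queues:

import multiprocessing as mp
import queue


def worker(task_q, result_q, broadcast_q):
    # Simplified stand-in for a vLLM worker process serving RPCs from a queue.
    while True:
        method = task_q.get()
        if method == "start_worker_execution_loop":
            # Stand-in for broadcast_tensor_dict(src=0): block until the driver
            # sends the next step's metadata. While blocked here, the worker
            # cannot pick up any other task from task_q.
            broadcast_q.get()
        elif method == "list_loras":
            result_q.put(set())


if __name__ == "__main__":
    task_q, result_q, broadcast_q = mp.Queue(), mp.Queue(), mp.Queue()
    mp.Process(target=worker, args=(task_q, result_q, broadcast_q),
               daemon=True).start()

    task_q.put("start_worker_execution_loop")  # worker enters the blocking loop
    task_q.put("list_loras")                   # queued behind it, never executed

    try:
        print(result_q.get(timeout=5))         # driver waiting on the RPC result
    except queue.Empty:
        print("list_loras never returned -> deadlock (until the gloo timeout)")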