Description
Your current environment
The output of `python collect_env.py`
root@vllm-0-4-3-predictor-00001-deployment-54797fd955-p7b8g:/vllm-workspace# python3 collect_env.py
Collecting environment information...
PyTorch version: 2.3.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A
OS: Ubuntu 22.04.4 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: version 3.29.5
Libc version: glibc-2.35
Python version: 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0] (64-bit runtime)
Python platform: Linux-5.4.0-65-generic-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: NVIDIA A100-SXM4-80GB
GPU 1: NVIDIA A100-SXM4-80GB
Nvidia driver version: 535.86.10
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 45 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 128
On-line CPU(s) list: 0-127
Vendor ID: GenuineIntel
Model name: Intel(R) Xeon(R) Platinum 8468
CPU family: 6
Model: 143
Thread(s) per core: 1
Core(s) per socket: 64
Socket(s): 2
Stepping: 8
BogoMIPS: 4199.99
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology tsc_reliable nonstop_tsc cpuid pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves avx512_bf16 wbnoinvd arat avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq rdpid cldemote movdiri movdir64b md_clear flush_l1d arch_capabilities
Hypervisor vendor: VMware
Virtualization type: full
L1d cache: 6 MiB (128 instances)
L1i cache: 4 MiB (128 instances)
L2 cache: 256 MiB (128 instances)
L3 cache: 210 MiB (2 instances)
NUMA node(s): 2
NUMA node0 CPU(s): 0-63
NUMA node1 CPU(s): 64-127
Vulnerability Itlb multihit: KVM: Vulnerable
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Enhanced IBRS, IBPB conditional, RSB filling
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected
Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] nvidia-nccl-cu12==2.20.5
[pip3] torch==2.3.0
[pip3] transformers==4.41.2
[pip3] triton==2.3.0
[conda] Could not collect
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.4.3
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
GPU0 GPU1 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X NV12 0-127 0-1 N/A
GPU1 NV12 X 0-127 0-1 N/A
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
🐛 Describe the bug
In vLLM v0.4.3 and later, calling list_loras() with tensor parallelism enabled causes the engine to hang.
Working from vLLM v0.4.3, I modified the code to check which multi-LoRA adapters are currently loaded on the CPU/GPU.
As shown below, I simply added a call to self.list_loras() inside the do_log_stats() method of vllm/engine/llm_engine.py.
def do_log_stats(
        self,
        scheduler_outputs: Optional[SchedulerOutputs] = None,
        model_output: Optional[List[SamplerOutput]] = None) -> None:
    """Forced log when no requests active."""
    if self.log_stats:
        logger.info(f"self.list_loras(): {self.list_loras()}")
        self.stat_logger.log(
            self._get_stats(scheduler_outputs, model_output))
I run the framework through the OpenAI entrypoint, and do_log_stats() works fine as long as no LLM inference is in progress.
However, as soon as I call the /v1/completions API, execution gets stuck inside list_loras() and the /v1/completions request never returns; the request I send is roughly the sketch below.
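(A minimal sketch of that request; the host, port, model name, and prompt are placeholders for my actual deployment.)

import requests

# In the buggy case this call never returns; the engine is stuck in list_loras().
resp = requests.post(
    "http://localhost:8000/v1/completions",
    json={
        "model": "my-lora-adapter",  # placeholder: a served LoRA module name
        "prompt": "Hello, world!",
        "max_tokens": 16,
    },
)
print(resp.status_code, resp.text)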
After 30 minutes in this state, the following error message is returned.
(VllmWorkerProcess pid=73) ERROR 06-13 01:30:49 multiproc_worker_utils.py:225] Exception in worker VllmWorkerProcess while processing method start_worker_execution_loop: [../third_party/gloo/gloo/transport/tcp/unbound_buffer.cc:81] Timed out waiting 1800000ms for recv operation to complete, Traceback (most recent call last):
(VllmWorkerProcess pid=73) ERROR 06-13 01:30:49 multiproc_worker_utils.py:225] File "/usr/local/lib/python3.10/dist-packages/vllm/executor/multiproc_worker_utils.py", line 222, in _run_worker_process
(VllmWorkerProcess pid=73) ERROR 06-13 01:30:49 multiproc_worker_utils.py:225] output = executor(*args, **kwargs)
(VllmWorkerProcess pid=73) ERROR 06-13 01:30:49 multiproc_worker_utils.py:225] File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
(VllmWorkerProcess pid=73) ERROR 06-13 01:30:49 multiproc_worker_utils.py:225] return func(*args, **kwargs)
(VllmWorkerProcess pid=73) ERROR 06-13 01:30:49 multiproc_worker_utils.py:225] File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 286, in start_worker_execution_loop
(VllmWorkerProcess pid=73) ERROR 06-13 01:30:49 multiproc_worker_utils.py:225] while self._execute_model_non_driver():
(VllmWorkerProcess pid=73) ERROR 06-13 01:30:49 multiproc_worker_utils.py:225] File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 295, in _execute_model_non_driver
(VllmWorkerProcess pid=73) ERROR 06-13 01:30:49 multiproc_worker_utils.py:225] data = broadcast_tensor_dict(src=0)
(VllmWorkerProcess pid=73) ERROR 06-13 01:30:49 multiproc_worker_utils.py:225] File "/usr/local/lib/python3.10/dist-packages/vllm/distributed/communication_op.py", line 284, in broadcast_tensor_dict
(VllmWorkerProcess pid=73) ERROR 06-13 01:30:49 multiproc_worker_utils.py:225] torch.distributed.broadcast_object_list(recv_metadata_list,
(VllmWorkerProcess pid=73) ERROR 06-13 01:30:49 multiproc_worker_utils.py:225] File "/usr/local/lib/python3.10/dist-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
(VllmWorkerProcess pid=73) ERROR 06-13 01:30:49 multiproc_worker_utils.py:225] return func(*args, **kwargs)
(VllmWorkerProcess pid=73) ERROR 06-13 01:30:49 multiproc_worker_utils.py:225] File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 2649, in broadcast_object_list
(VllmWorkerProcess pid=73) ERROR 06-13 01:30:49 multiproc_worker_utils.py:225] broadcast(object_sizes_tensor, src=src, group=group)
(VllmWorkerProcess pid=73) ERROR 06-13 01:30:49 multiproc_worker_utils.py:225] File "/usr/local/lib/python3.10/dist-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
(VllmWorkerProcess pid=73) ERROR 06-13 01:30:49 multiproc_worker_utils.py:225] return func(*args, **kwargs)
(VllmWorkerProcess pid=73) ERROR 06-13 01:30:49 multiproc_worker_utils.py:225] File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 2144, in broadcast
(VllmWorkerProcess pid=73) ERROR 06-13 01:30:49 multiproc_worker_utils.py:225] work.wait()
(VllmWorkerProcess pid=73) ERROR 06-13 01:30:49 multiproc_worker_utils.py:225] RuntimeError: [../third_party/gloo/gloo/transport/tcp/unbound_buffer.cc:81] Timed out waiting 1800000ms for recv operation to complete
(VllmWorkerProcess pid=73) ERROR 06-13 01:30:49 multiproc_worker_utils.py:225]
ERROR 06-13 01:30:49 async_llm_engine.py:524] Engine iteration timed out. This should never happen!
ERROR 06-13 01:30:49 async_llm_engine.py:45] Engine background task failed
ERROR 06-13 01:30:49 async_llm_engine.py:45] Traceback (most recent call last):
ERROR 06-13 01:30:49 async_llm_engine.py:45] File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 495, in engine_step
ERROR 06-13 01:30:49 async_llm_engine.py:45] request_outputs = await self.engine.step_async()
ERROR 06-13 01:30:49 async_llm_engine.py:45] File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 226, in step_async
ERROR 06-13 01:30:49 async_llm_engine.py:45] output = await self.model_executor.execute_model_async(
ERROR 06-13 01:30:49 async_llm_engine.py:45] File "/usr/local/lib/python3.10/dist-packages/vllm/executor/distributed_gpu_executor.py", line 169, in execute_model_async
ERROR 06-13 01:30:49 async_llm_engine.py:45] return await self._driver_execute_model_async(execute_model_req)
ERROR 06-13 01:30:49 async_llm_engine.py:45] File "/usr/local/lib/python3.10/dist-packages/vllm/executor/multiproc_gpu_executor.py", line 148, in _driver_execute_model_async
ERROR 06-13 01:30:49 async_llm_engine.py:45] return await self.driver_exec_model(execute_model_req)
ERROR 06-13 01:30:49 async_llm_engine.py:45] asyncio.exceptions.CancelledError
ERROR 06-13 01:30:49 async_llm_engine.py:45]
ERROR 06-13 01:30:49 async_llm_engine.py:45] During handling of the above exception, another exception occurred:
ERROR 06-13 01:30:49 async_llm_engine.py:45]
ERROR 06-13 01:30:49 async_llm_engine.py:45] Traceback (most recent call last):
ERROR 06-13 01:30:49 async_llm_engine.py:45] File "/usr/lib/python3.10/asyncio/tasks.py", line 456, in wait_for
ERROR 06-13 01:30:49 async_llm_engine.py:45] return fut.result()
ERROR 06-13 01:30:49 async_llm_engine.py:45] asyncio.exceptions.CancelledError
ERROR 06-13 01:30:49 async_llm_engine.py:45]
ERROR 06-13 01:30:49 async_llm_engine.py:45] The above exception was the direct cause of the following exception:
ERROR 06-13 01:30:49 async_llm_engine.py:45]
ERROR 06-13 01:30:49 async_llm_engine.py:45] Traceback (most recent call last):
ERROR 06-13 01:30:49 async_llm_engine.py:45] File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 40, in _raise_exception_on_finish
ERROR 06-13 01:30:49 async_llm_engine.py:45] task.result()
ERROR 06-13 01:30:49 async_llm_engine.py:45] File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 521, in run_engine_loop
ERROR 06-13 01:30:49 async_llm_engine.py:45] has_requests_in_progress = await asyncio.wait_for(
ERROR 06-13 01:30:49 async_llm_engine.py:45] File "/usr/lib/python3.10/asyncio/tasks.py", line 458, in wait_for
ERROR 06-13 01:30:49 async_llm_engine.py:45] raise exceptions.TimeoutError() from exc
ERROR 06-13 01:30:49 async_llm_engine.py:45] asyncio.exceptions.TimeoutError
ERROR:asyncio:Exception in callback functools.partial(<function _raise_exception_on_finish at 0x7f351ca30ee0>, error_callback=<bound method AsyncLLMEngine._error_callback of <vllm.engine.russianblue_async_llm_engine.RussianBlueAsyncLLMEngine object at 0x7f3510194100>>)
handle: <Handle functools.partial(<function _raise_exception_on_finish at 0x7f351ca30ee0>, error_callback=<bound method AsyncLLMEngine._error_callback of <vllm.engine.russianblue_async_llm_engine.RussianBlueAsyncLLMEngine object at 0x7f3510194100>>)>
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 495, in engine_step
request_outputs = await self.engine.step_async()
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 226, in step_async
output = await self.model_executor.execute_model_async(
File "/usr/local/lib/python3.10/dist-packages/vllm/executor/distributed_gpu_executor.py", line 169, in execute_model_async
return await self._driver_execute_model_async(execute_model_req)
File "/usr/local/lib/python3.10/dist-packages/vllm/executor/multiproc_gpu_executor.py", line 148, in _driver_execute_model_async
return await self.driver_exec_model(execute_model_req)
asyncio.exceptions.CancelledError
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/lib/python3.10/asyncio/tasks.py", line 456, in wait_for
return fut.result()
asyncio.exceptions.CancelledError
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 40, in _raise_exception_on_finish
task.result()
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 521, in run_engine_loop
has_requests_in_progress = await asyncio.wait_for(
File "/usr/lib/python3.10/asyncio/tasks.py", line 458, in wait_for
raise exceptions.TimeoutError() from exc
asyncio.exceptions.TimeoutError
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "uvloop/cbhandles.pyx", line 63, in uvloop.loop.Handle._run
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 47, in _raise_exception_on_finish
raise AsyncEngineDeadError(
vllm.engine.async_llm_engine.AsyncEngineDeadError: Task finished unexpectedly. This should never happen! Please open an issue on Github. See stack trace above for the actual cause.
INFO: 172.21.232.226:37556 - "GET /metrics HTTP/1.1" 200 OK
ERROR: Exception in ASGI application
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 495, in engine_step
request_outputs = await self.engine.step_async()
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 226, in step_async
output = await self.model_executor.execute_model_async(
File "/usr/local/lib/python3.10/dist-packages/vllm/executor/distributed_gpu_executor.py", line 169, in execute_model_async
return await self._driver_execute_model_async(execute_model_req)
File "/usr/local/lib/python3.10/dist-packages/vllm/executor/multiproc_gpu_executor.py", line 148, in _driver_execute_model_async
return await self.driver_exec_model(execute_model_req)
asyncio.exceptions.CancelledError
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/lib/python3.10/asyncio/tasks.py", line 456, in wait_for
return fut.result()
asyncio.exceptions.CancelledError
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/uvicorn/protocols/http/httptools_impl.py", line 399, in run_asgi
result = await app( # type: ignore[func-returns-value]
File "/usr/local/lib/python3.10/dist-packages/uvicorn/middleware/proxy_headers.py", line 70, in __call__
return await self.app(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/fastapi/applications.py", line 1054, in __call__
await super().__call__(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/applications.py", line 123, in __call__
await self.middleware_stack(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 186, in __call__
raise exc
File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/errors.py", line 164, in __call__
await self.app(scope, receive, _send)
File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/cors.py", line 85, in __call__
await self.app(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/middleware/exceptions.py", line 65, in __call__
await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 64, in wrapped_app
raise exc
File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
await app(scope, receive, sender)
File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 756, in __call__
await self.middleware_stack(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 776, in app
await route.handle(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 297, in handle
await self.app(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 77, in app
await wrap_app_handling_exceptions(app, request)(scope, receive, send)
File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 64, in wrapped_app
raise exc
File "/usr/local/lib/python3.10/dist-packages/starlette/_exception_handler.py", line 53, in wrapped_app
await app(scope, receive, sender)
File "/usr/local/lib/python3.10/dist-packages/starlette/routing.py", line 72, in app
response = await func(request)
File "/usr/local/lib/python3.10/dist-packages/fastapi/routing.py", line 278, in app
raw_response = await run_endpoint_function(
File "/usr/local/lib/python3.10/dist-packages/fastapi/routing.py", line 191, in run_endpoint_function
return await dependant.call(**values)
File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 139, in create_completion
generator = await openai_serving_completion.create_completion(
File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/serving_completion.py", line 166, in create_completion
async for i, res in result_generator:
File "/usr/local/lib/python3.10/dist-packages/vllm/utils.py", line 244, in consumer
raise e
File "/usr/local/lib/python3.10/dist-packages/vllm/utils.py", line 235, in consumer
raise item
File "/usr/local/lib/python3.10/dist-packages/vllm/utils.py", line 219, in producer
async for item in iterator:
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 662, in generate
async for output in self._process_request(
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 769, in _process_request
raise e
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 765, in _process_request
async for request_output in stream:
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 80, in __anext__
raise result
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 40, in _raise_exception_on_finish
task.result()
File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 521, in run_engine_loop
has_requests_in_progress = await asyncio.wait_for(
File "/usr/lib/python3.10/asyncio/tasks.py", line 458, in wait_for
raise exceptions.TimeoutError() from exc
asyncio.exceptions.TimeoutError
If I add --disable-log-stats to the launch arguments, the logging branch in do_log_stats() (and with it my list_loras() call) is skipped, and the /v1/completions API responds normally; see the launch sketch below.
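For completeness, this is roughly how I launch the server; --disable-log-stats is the only change needed for the workaround (the base model, adapter name/path, and tensor-parallel size are placeholders for my deployment):

import subprocess

# Rough sketch of the server launch (placeholders for my actual model/adapter).
# With --disable-log-stats the hang does not occur.
subprocess.run([
    "python", "-m", "vllm.entrypoints.openai.api_server",
    "--model", "meta-llama/Llama-2-7b-hf",                   # placeholder base model
    "--enable-lora",
    "--lora-modules", "my-lora-adapter=/path/to/adapter",    # placeholder adapter
    "--tensor-parallel-size", "2",
    "--disable-log-stats",  # workaround: skips the stats/list_loras() path
])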
In v0.4.2, list_loras() worked correctly in the same setup, but v0.4.3 introduced the following scheduling improvement, which seems to be the source of the problem.
- Eliminate parallel worker per-step task scheduling overhead ([Core] Eliminate parallel worker per-step task scheduling overhead #4894)
I'm also curious why the above PR causes list_loras() to hang.
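For what it's worth, my rough reading of the traceback above (a guess, not a confirmed root cause): after that change the non-driver workers sit inside start_worker_execution_loop, blocked in broadcast_tensor_dict(src=0) waiting for the next step's metadata, so a list_loras() call that the driver fans out to the workers is queued behind that loop and never serviced, while the driver in turn blocks waiting for the result. The toy script below (my own code, not vLLM's) reproduces just that deadlock shape with plain multiprocessing queues:

import multiprocessing as mp
import queue


def worker(task_q, result_q, broadcast_q):
    # Simplified stand-in for a vLLM worker process serving RPCs from a queue.
    while True:
        method = task_q.get()
        if method == "start_worker_execution_loop":
            # Stand-in for broadcast_tensor_dict(src=0): block until the driver
            # sends the next step's metadata. While blocked here, the worker
            # cannot pick up any other task from task_q.
            broadcast_q.get()
        elif method == "list_loras":
            result_q.put(set())


if __name__ == "__main__":
    task_q, result_q, broadcast_q = mp.Queue(), mp.Queue(), mp.Queue()
    mp.Process(target=worker, args=(task_q, result_q, broadcast_q),
               daemon=True).start()

    task_q.put("start_worker_execution_loop")  # worker enters the blocking loop
    task_q.put("list_loras")                   # queued behind it, never executed

    try:
        print(result_q.get(timeout=5))         # driver waiting on the RPC result
    except queue.Empty:
        print("list_loras never returned -> deadlock (until the gloo timeout)")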