Description
What happened + What you expected to happen
When deploying platforms built on the Ray framework, such as Ray Serve and Ray LLM, together with vLLM's OpenAI server, the errors "No CUDA GPUs are available" and "Ray does not allocate any GPUs on the driver node" have become recurring issues.
In this issue, I provide a detailed analysis of these problems, along with a brief solution and experimental records. I sincerely invite developers from the Ray and vLLM communities to join the discussion, point out any shortcomings, and share your suggestions!
Quick Troubleshooting
For older versions of vLLM, I have also provided a hack to temporarily resolve this issue. Please refer to: Ray Issue #51154.
For Ray LLM and Ray Serve documentation:
- Ray LLM: Ray LLM Documentation
- Ray Serve: Ray Serve vLLM Example
A proper configuration for TP=1 involves modifying the `build_app` function in the example code from the Ray Serve documentation by applying the following diff.
```diff
 pg_resources = []
-pg_resources.append({"CPU": 1})  # for the deployment replica
 for i in range(tp):
     pg_resources.append({"CPU": 1, accelerator: 1})  # for the vLLM actors

 # We use the "STRICT_PACK" strategy below to ensure all vLLM actors are placed on
 # the same Ray node.
 return VLLMDeployment.options(
+    ray_actor_options={"num_gpus": 1, "num_cpus": 1},
     placement_group_bundles=pg_resources, placement_group_strategy="STRICT_PACK"
 ).bind(
```
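To make the effect of this change concrete, below is a minimal, runnable probe deployment (independent of vLLM) that mirrors the TP=1 layout above and reports what the replica actually sees. The `GpuProbe` class is purely illustrative, and the sketch assumes a single-GPU node and a Ray version in which `placement_group_bundles` is a supported deployment option.

```python
import os

from ray import serve
from starlette.requests import Request


@serve.deployment(
    # The replica actor itself claims the GPU, as in the diff above.
    ray_actor_options={"num_gpus": 1, "num_cpus": 1},
    placement_group_bundles=[{"CPU": 1, "GPU": 1}],  # one bundle per vLLM actor, TP=1
    placement_group_strategy="STRICT_PACK",
)
class GpuProbe:
    async def __call__(self, request: Request) -> str:
        # Non-empty when the GPU is assigned to the replica; an empty string
        # here is exactly the condition that breaks the TP=1 (UniprocExecutor) path.
        return f"CUDA_VISIBLE_DEVICES={os.environ.get('CUDA_VISIBLE_DEVICES', '')!r}"


app = GpuProbe.bind()
# serve.run(app)  # then: curl http://localhost:8000/
```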
Introduction
The issue can be summarized simply: the framework design of vLLM does not fully accommodate `LLMEngine` running within a placement group. The process that creates `RayDistributedExecutor`, which serves as the entry point, must have access to a GPU while not occupying GPU resources within Ray. This conflicts with the typical configuration of Ray Serve. Additionally, since vLLM always requests a whole number of GPUs when `world_size > 1`, it is not possible to work around this limitation by allocating fractional GPUs.
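For illustration, the obvious workaround would be a bundle layout like the hypothetical sketch below, which vLLM's whole-GPU requirement defeats:

```python
# Hypothetical bundle layout: give the entrypoint replica a fractional GPU so
# it can see the device without consuming a whole one.
tp = 2
pg_bundles = [{"CPU": 1, "GPU": 0.1}] + [{"GPU": 1}] * tp

# This does not help: with world_size > 1, each vLLM Ray worker still requests
# exactly one whole GPU, so the fractional bundle can never satisfy a worker
# and the entrypoint still conflicts with the workers over GPU accounting.
```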
Regardless of whether using `LLM` (offline inference) or `OpenAIServingCompletion` (online deployment), both are considered entry points. The class responsible for managing the specific processes during initialization is called an `Executor`. The `Executor` itself creates a local actor to use the GPU and also spawns a dummy actor to reserve resources in the placement group.
However, when integrating this framework into Ray, several issues arise:

- In Ray, the `Executor` itself also runs within an Actor and uses the first bundle of the placement group.
- If no GPU resources are assigned to it, `CUDA_VISIBLE_DEVICES` will be an empty string, leading to the "No CUDA GPUs are available" error when trying to call `set_device`.
- On the other hand, if we do allocate a GPU to it, vLLM will still use a `dummy_driver_worker` that occupies a GPU, which causes the total number of requested workers to exceed the placement group capacity.
- Since vLLM does not allocate resources based on bundles but instead forces each worker to use exactly one GPU when `world_size > 1`, we cannot work around this limitation by assigning fractional GPUs.
A Deadlock!
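As a rough sanity check on the accounting behind this deadlock, here is an illustrative sketch assuming TP = 2 and one GPU bundle per worker (the numbers are hypothetical, not taken from vLLM's code):

```python
# Illustrative accounting: a GPU for the dummy driver worker plus one whole GPU
# per worker cannot fit into a placement group sized for TP alone.
tp = 2
pg_gpu_bundles = tp          # bundles reserved for the vLLM worker actors
dummy_driver_gpus = 1        # vLLM's dummy driver worker also holds a GPU
gpus_requested = tp + dummy_driver_gpus

print(gpus_requested > pg_gpu_bundles)  # True -> one actor can never be scheduled
```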
Experiments
Due to how the code is structured, there are actually two scenarios that can run successfully. I will first present the experimental table and then analyze each case one by one.
| vLLM Version | Placement Group Configuration | TP | Status | Notes |
| --- | --- | --- | --- | --- |
| vLLM 0.7.3 | `[{'CPU':1} + {'GPU':1} * TP]` | >1 | ✅ Works | Replica actor has no GPU but gains access via `update_environment_variables` |
| vLLM 0.7.3 | `[{'GPU':1} * TP]` | >1 | ❌ Fails | Extra worker creation causes deadlock due to loop in `ray_distributed_executor.py#L187` |
| vLLM 0.7.3 | `[{'CPU':1} + {'GPU':1} * TP]` | 1 | ❌ Fails | Replica actor has no GPU, and the Executor can no longer "borrow" `CUDA_VISIBLE_DEVICES` |
| vLLM 0.7.3 | `[{'GPU':1} * TP]` | 1 | ✅ Works | Replica actor has no GPU, but `uniproc_executor` avoids dummy worker creation |
Analysis
In the existing code, there are actually two scenarios where execution is possible:

- TP > 1 without explicitly assigning GPUs (the default setting in Ray Serve). This explains why the issue has not become a critical blocker: under the current configuration, execution is still possible.
- TP = 1 with a GPU assigned to the replica (using the configuration shown earlier in combination with Ray Serve to resolve the issue).
Case 1: Default Configuration (TP > 1 & No GPU Assigned)
Even if Ray does not allocate any GPUs to the Replica Actor (i.e., the `RayDistributedExecutor` within the Serve framework), `CUDA_VISIBLE_DEVICES` will still not be empty.

This happens because of this line of code, which calls `self.driver_worker` and modifies the environment variables of the current process.

As a result, in the default configuration, the code functions correctly, allowing a process to access GPUs without directly occupying them.
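A minimal standalone sketch of this "borrowing" mechanism outside of vLLM is shown below. The `ZeroGpuActor` is hypothetical and assumes a GPU node; in vLLM the overwrite is performed by the driver worker via `update_environment_variables` rather than by the actor itself.

```python
import os

import ray
import torch


@ray.remote(num_cpus=1, num_gpus=0)
class ZeroGpuActor:
    def probe(self) -> str:
        # Ray sets CUDA_VISIBLE_DEVICES to "" for an actor with num_gpus=0.
        before = os.environ.get("CUDA_VISIBLE_DEVICES", "")
        # Overwriting it in-process before the first CUDA call restores access,
        # which is what the driver worker's environment update achieves in vLLM.
        os.environ["CUDA_VISIBLE_DEVICES"] = "0"
        return f"before={before!r}, cuda_available={torch.cuda.is_available()}"


ray.init()
actor = ZeroGpuActor.remote()
print(ray.get(actor.probe.remote()))  # e.g. before='', cuda_available=True
```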
Case 2: TP = 1 Changes the Behavior
When TP = 1, vLLM switches to using `UniprocExecutor`, as seen in this line of code.

In this case, if `CUDA_VISIBLE_DEVICES` is empty, it will cause an error, as `UniprocExecutor` does not inherit the same environment variable handling as the multi-process setup.
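For contrast, here is a minimal standalone repro of the failure mode (again hypothetical, not vLLM code): in a Ray actor that received no GPU, `CUDA_VISIBLE_DEVICES` stays empty and the first `set_device` call raises the error from the title.

```python
import os

import ray
import torch


@ray.remote(num_cpus=1, num_gpus=0)
class DriverLikeActor:
    def init_device(self) -> str:
        visible = os.environ.get("CUDA_VISIBLE_DEVICES")
        try:
            torch.cuda.set_device(0)  # roughly what worker initialization does
            return f"ok, CUDA_VISIBLE_DEVICES={visible!r}"
        except RuntimeError as exc:
            # Expected here: "No CUDA GPUs are available"
            return f"failed with CUDA_VISIBLE_DEVICES={visible!r}: {exc}"


ray.init()
actor = DriverLikeActor.remote()
print(ray.get(actor.init_device.remote()))
```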
Supplementary Notes on Ray Serve and Ray LLM
After an initial review of the source code and some simple experiments, I believe the new and old APIs of Ray Serve are fundamentally the same, apart from the addition of a router and deeper integration with vLLM.
The core interaction between Ray and vLLM still revolves around the placement group (PG) allocation during deployment.
Therefore, these two approaches are essentially equivalent (see the sketch after this list):

- Manually integrating `vllm.entrypoints.openai.serving_completion` into Ray Serve.
- Using the `ray[llm]` library for deployment.
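For reference, the `ray[llm]` route looks roughly like the quickstart in the Ray LLM documentation linked above. The model name and config values below are illustrative, and the exact `LLMConfig` fields may differ between Ray versions:

```python
from ray import serve
from ray.serve.llm import LLMConfig, build_openai_app

llm_config = LLMConfig(
    model_loading_config=dict(
        model_id="qwen-0.5b",                       # name exposed via the OpenAI API
        model_source="Qwen/Qwen2.5-0.5B-Instruct",  # model weights to load
    ),
    deployment_config=dict(
        autoscaling_config=dict(min_replicas=1, max_replicas=1),
    ),
    # Engine arguments are forwarded to vLLM, so the same TP/placement-group
    # interaction described in this issue applies here as well.
    engine_kwargs=dict(tensor_parallel_size=1),
)

app = build_openai_app({"llm_configs": [llm_config]})
serve.run(app)
```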
Related Issues
Based on my preliminary review, the following issues are all related to the analysis presented here:
- vLLM Issue #12983
- vLLM Issue #13521
- vLLM Issue #14415
- vLLM Issue #14456
- Ray Issue #51154
- Ray Issue #51193
- Ray Issue #50275
Versions / Dependencies
```
vllm>=0.7.2
ray[serve,llm,default]  # installed with pip install -U
```
Reproduction script
Demo code can be found in the following documentation:
- Ray LLM: Ray LLM Documentation
- Ray Serve: Ray Serve vLLM Example
Issue Severity
High: It blocks me from completing my task.