Description
What happened + What you expected to happen
When deploying platforms built on the Ray framework, such as Ray Serve and Ray LLM, together with vLLM's OpenAI server, the errors "No CUDA GPUs are available" and "Ray does not allocate any GPUs on the driver node" have become recurring issues.
In this issue, I provide a detailed analysis of these problems, along with a brief solution and experimental records. I sincerely invite developers from the Ray and vLLM communities to join the discussion, point out any shortcomings, and share your suggestions!
Quick Troubleshooting
For older versions of vLLM, I have also provided a hack to temporarily resolve this issue. Please refer to: Ray Issue #51154.
For Ray LLM and Ray Serve documentation:
- Ray LLM: Ray LLM Documentation
- Ray Serve: Ray Serve vLLM Example
A proper configuration for TP=1 involves modifying the `build_app` function in the example code from the Ray Serve documentation by applying the following diff.
```diff
 pg_resources = []
-pg_resources.append({"CPU": 1})  # for the deployment replica
 for i in range(tp):
     pg_resources.append({"CPU": 1, accelerator: 1})  # for the vLLM actors

 # We use the "STRICT_PACK" strategy below to ensure all vLLM actors are placed on
 # the same Ray node.
 return VLLMDeployment.options(
+    ray_actor_options={"num_gpus": 1, "num_cpus": 1},
     placement_group_bundles=pg_resources, placement_group_strategy="STRICT_PACK"
 ).bind(
```
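To make the effect of this change concrete, below is a minimal, runnable probe deployment (independent of vLLM) that mirrors the TP=1 layout above and reports what the replica actually sees. The `GpuProbe` class is purely illustrative, and the sketch assumes a single-GPU node and a Ray version in which `placement_group_bundles` is a supported deployment option.

```python
import os

from ray import serve
from starlette.requests import Request


@serve.deployment(
    # The replica actor itself claims the GPU, as in the diff above.
    ray_actor_options={"num_gpus": 1, "num_cpus": 1},
    placement_group_bundles=[{"CPU": 1, "GPU": 1}],  # one bundle per vLLM actor, TP=1
    placement_group_strategy="STRICT_PACK",
)
class GpuProbe:
    async def __call__(self, request: Request) -> str:
        # Non-empty when the GPU is assigned to the replica; an empty string
        # here is exactly the condition that breaks the TP=1 (UniprocExecutor) path.
        return f"CUDA_VISIBLE_DEVICES={os.environ.get('CUDA_VISIBLE_DEVICES', '')!r}"


app = GpuProbe.bind()
# serve.run(app)  # then: curl http://localhost:8000/
```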
Introduction
The issue can be summarized simply: the framework design of vLLM does not fully accommodate `LLMEngine` running within a placement group. The process that creates `RayDistributedExecutor`, which serves as the entry point, must have access to a GPU while not occupying GPU resources within Ray. This conflicts with the typical configuration of Ray Serve. Additionally, since vLLM always requests a whole number of GPUs when `world_size > 1`, it is not possible to work around this limitation by allocating fractional GPUs.
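For illustration, the obvious workaround would be a bundle layout like the hypothetical sketch below, which vLLM's whole-GPU requirement defeats:

```python
# Hypothetical bundle layout: give the entrypoint replica a fractional GPU so
# it can see the device without consuming a whole one.
tp = 2
pg_bundles = [{"CPU": 1, "GPU": 0.1}] + [{"GPU": 1}] * tp

# This does not help: with world_size > 1, each vLLM Ray worker still requests
# exactly one whole GPU, so the fractional bundle can never satisfy a worker
# and the entrypoint still conflicts with the workers over GPU accounting.
```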
Regardless of whether using `LLM` (offline inference) or `OpenAIServingCompletion` (online deployment), both are considered entry points. The class responsible for managing the specific processes during initialization is called an `Executor`. The `Executor` itself creates a local actor to use the GPU and also spawns a dummy actor to reserve resources in the placement group.
However, when integrating this framework into Ray, several issues arise:

- In Ray, the `Executor` itself also runs within an Actor and uses the first bundle of the placement group.
- If no GPU resources are assigned to it, `CUDA_VISIBLE_DEVICES` will be an empty string, leading to the "No CUDA GPUs are available" error when trying to call `set_device`.
- On the other hand, if we do allocate a GPU to it, vLLM will still use a `dummy_driver_worker` that occupies a GPU, which causes the total number of requested workers to exceed the placement group capacity.
- Since vLLM does not allocate resources based on bundles but instead forces each worker to use exactly one GPU when `world_size > 1`, we cannot work around this limitation by assigning fractional GPUs.
A Deadlock!
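As a rough sanity check on the accounting behind this deadlock, here is an illustrative sketch assuming TP = 2 and one GPU bundle per worker (the numbers are hypothetical, not taken from vLLM's code):

```python
# Illustrative accounting: a GPU for the dummy driver worker plus one whole GPU
# per worker cannot fit into a placement group sized for TP alone.
tp = 2
pg_gpu_bundles = tp          # bundles reserved for the vLLM worker actors
dummy_driver_gpus = 1        # vLLM's dummy driver worker also holds a GPU
gpus_requested = tp + dummy_driver_gpus

print(gpus_requested > pg_gpu_bundles)  # True -> one actor can never be scheduled
```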
Experiments
Due to how the code is structured, there are actually two scenarios that can run successfully. I will first present the experimental table and then analyze each case one by one.
| vLLM Version | Placement Group Configuration | TP | Status | Notes |
| --- | --- | --- | --- | --- |
| vLLM 0.7.3 | `[{'CPU':1} + {'GPU':1} * TP]` | >1 | ✅ Works | Replica actor has no GPU but gains access via `update_environment_variables` |
| vLLM 0.7.3 | `[{'GPU':1} * TP]` | >1 | ❌ Fails | Extra worker creation causes deadlock due to loop in `ray_distributed_executor.py#L187` |
| vLLM 0.7.3 | `[{'CPU':1} + {'GPU':1} * TP]` | 1 | ❌ Fails | Replica actor has no GPU, and the Executor can no longer "borrow" `CUDA_VISIBLE_DEVICES` |
| vLLM 0.7.3 | `[{'GPU':1} * TP]` | 1 | ✅ Works | Replica actor has no GPU, but `uniproc_executor` avoids dummy worker creation |
Analysis
In the existing code, there are actually two scenarios where execution is possible:

- TP > 1 without explicitly assigning GPUs (the default setting in Ray Serve). This explains why the issue has not become a critical blocker: under the current configuration, execution is still possible.
- TP = 1 with a GPU assigned to the replica (using the configuration shown earlier in combination with Ray Serve to resolve the issue).
Case 1: Default Configuration (TP > 1 & No GPU Assigned)
Even if Ray does not allocate any GPUs to the Replica Actor (i.e., the `RayDistributedExecutor` within the Serve framework), `CUDA_VISIBLE_DEVICES` will still not be empty.

This happens because of this line of code, which calls `self.driver_worker` and modifies the environment variables of the current process.

As a result, in the default configuration, the code functions correctly, allowing a process to access GPUs without directly occupying them.
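A minimal standalone sketch of this "borrowing" mechanism outside of vLLM is shown below. The `ZeroGpuActor` is hypothetical and assumes a GPU node; in vLLM the overwrite is performed by the driver worker via `update_environment_variables` rather than by the actor itself.

```python
import os

import ray
import torch


@ray.remote(num_cpus=1, num_gpus=0)
class ZeroGpuActor:
    def probe(self) -> str:
        # Ray sets CUDA_VISIBLE_DEVICES to "" for an actor with num_gpus=0.
        before = os.environ.get("CUDA_VISIBLE_DEVICES", "")
        # Overwriting it in-process before the first CUDA call restores access,
        # which is what the driver worker's environment update achieves in vLLM.
        os.environ["CUDA_VISIBLE_DEVICES"] = "0"
        return f"before={before!r}, cuda_available={torch.cuda.is_available()}"


ray.init()
actor = ZeroGpuActor.remote()
print(ray.get(actor.probe.remote()))  # e.g. before='', cuda_available=True
```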
Case 2: TP = 1 Changes the Behavior
When TP = 1, vLLM switches to using `UniprocExecutor`, as seen in this line of code.

In this case, if `CUDA_VISIBLE_DEVICES` is empty, it will cause an error, as `UniprocExecutor` does not inherit the same environment variable handling as the multi-process setup.
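For contrast, here is a minimal standalone repro of the failure mode (again hypothetical, not vLLM code): in a Ray actor that received no GPU, `CUDA_VISIBLE_DEVICES` stays empty and the first `set_device` call raises the error from the title.

```python
import os

import ray
import torch


@ray.remote(num_cpus=1, num_gpus=0)
class DriverLikeActor:
    def init_device(self) -> str:
        visible = os.environ.get("CUDA_VISIBLE_DEVICES")
        try:
            torch.cuda.set_device(0)  # roughly what worker initialization does
            return f"ok, CUDA_VISIBLE_DEVICES={visible!r}"
        except RuntimeError as exc:
            # Expected here: "No CUDA GPUs are available"
            return f"failed with CUDA_VISIBLE_DEVICES={visible!r}: {exc}"


ray.init()
actor = DriverLikeActor.remote()
print(ray.get(actor.init_device.remote()))
```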
Supplementary Notes on Ray Serve and Ray LLM
After an initial review of the source code and some simple experiments, I believe the new and old APIs of Ray Serve are fundamentally the same, apart from the addition of a router and deeper integration with vLLM.
The core interaction between Ray and vLLM still revolves around the placement group (PG) allocation during deployment.
Therefore, these two approaches are essentially equivalent (see the sketch after this list):

- Manually integrating `vllm.entrypoints.openai.serving_completion` into Ray Serve.
- Using the `ray[llm]` library for deployment.
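For reference, the `ray[llm]` route looks roughly like the quickstart in the Ray LLM documentation linked above. The model name and config values below are illustrative, and the exact `LLMConfig` fields may differ between Ray versions:

```python
from ray import serve
from ray.serve.llm import LLMConfig, build_openai_app

llm_config = LLMConfig(
    model_loading_config=dict(
        model_id="qwen-0.5b",                       # name exposed via the OpenAI API
        model_source="Qwen/Qwen2.5-0.5B-Instruct",  # model weights to load
    ),
    deployment_config=dict(
        autoscaling_config=dict(min_replicas=1, max_replicas=1),
    ),
    # Engine arguments are forwarded to vLLM, so the same TP/placement-group
    # interaction described in this issue applies here as well.
    engine_kwargs=dict(tensor_parallel_size=1),
)

app = build_openai_app({"llm_configs": [llm_config]})
serve.run(app)
```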
Related Issues
Based on my preliminary review, the following issues are all related to the analysis presented here:
- vLLM Issue #12983
- vLLM Issue #13521
- vLLM Issue #14415
- vLLM Issue #14456
- Ray Issue #51154
- Ray Issue #51193
- Ray Issue #50275
Versions / Dependencies
```
vllm>=0.7.2
ray[serve,llm,default]  # installed with pip install -U
```
Reproduction script
Demo code can be found in the following documentation:
- Ray LLM: Ray LLM Documentation
- Ray Serve: Ray Serve vLLM Example
Issue Severity
High: It blocks me from completing my task.