Issue Description:
When I tried to deploy the llama-hf-65B model on an 8-GPU machine, I followed the example in Distributed Inference and Serving (link) and wrote the following code:
```python
from vllm import LLM

llm = LLM("/mnt/llm_dataset/evaluation_pretrain/models/sota/llama-hf-65b/", trust_remote_code=True, tensor_parallel_size=4)
```
However, Ray raised an OOM exception, as shown in the attached image.
Note that setting `tensor_parallel_size=8` results in the same exception.
Even when I replaced the model_dir with the llama-13B model, setting `tensor_parallel_size=8` still triggered a Ray OOM exception.
When I set the model directory to llama-13B and `tensor_parallel_size=4`, the model sometimes loads and runs inference successfully. However, initializing the Ray environment and the paged-attention memory takes a considerable amount of time, and it is hard to tell whether the program is stuck or still making progress.
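In case this is a memory-headroom problem rather than a bug, the next thing I plan to try is passing the memory-related constructor arguments explicitly. This is only a sketch, not a confirmed fix; the values below are guesses, and I am assuming `gpu_memory_utilization`, `swap_space`, and `enforce_eager` behave as documented in vllm 0.3.3:

```python
from vllm import LLM

# Sketch of the next attempt: leave more GPU headroom, shrink CPU swap space,
# and skip CUDA graph capture. All values below are guesses, not a confirmed fix.
llm = LLM(
    "/mnt/llm_dataset/evaluation_pretrain/models/sota/llama-hf-65b/",
    trust_remote_code=True,
    tensor_parallel_size=8,
    gpu_memory_utilization=0.85,  # default is 0.90; leave some headroom per GPU
    swap_space=2,                 # GiB of CPU swap per GPU (default 4); lowers host RAM pressure
    enforce_eager=True,           # skip CUDA graph capture to cut startup time and memory
)
```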
Here is information about my local environment:
- Ubuntu 22.04
- Driver Version: 470.182.03, CUDA Version: 12.3
- 8x A800 (80 GB each) on a local machine
- Python 3.8.18
- transformers: 4.38.2
- vllm: 0.3.3
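Finally, since the exception comes from Ray's memory monitor, I am also considering initializing Ray myself before constructing the LLM so I can bound the object store and, for debugging only, disable the OOM killer. This is a sketch under the assumption that vllm reuses an already-initialized Ray instance; the object-store size is a placeholder, and `NCCL_DEBUG` is only there to help tell a slow start from a hang:

```python
import os

# Must be set before ray.init(): 0 disables Ray's memory monitor / OOM killer.
# Debugging aid only -- it stops Ray from killing workers, it does not free memory.
os.environ["RAY_memory_monitor_refresh_ms"] = "0"
# Standard NCCL variable: verbose init logs help distinguish slow loading from a hang.
os.environ["NCCL_DEBUG"] = "INFO"

import ray
from vllm import LLM

# Start Ray explicitly with a bounded object store (placeholder: 10 GiB) so that
# vLLM -- which, as far as I understand, reuses an existing Ray instance -- does
# not let the object store grow unchecked on the host.
ray.init(num_gpus=8, object_store_memory=10 * 1024**3)

llm = LLM(
    "/mnt/llm_dataset/evaluation_pretrain/models/sota/llama-hf-65b/",
    trust_remote_code=True,
    tensor_parallel_size=8,
)
```

If anyone can point out which of these knobs actually matters here, or whether the long initialization time is expected for a 65B model with tensor parallelism, I would appreciate it.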