Issue Description:
When I tried to deploy the llama-hf-65B model on an 8-GPU machine, I followed the example in Distributed Inference and Serving (link) and wrote the following code:
```python
from vllm import LLM

llm = LLM("/mnt/llm_dataset/evaluation_pretrain/models/sota/llama-hf-65b/", trust_remote_code=True, tensor_parallel_size=4)
```
However, Ray raised an OOM exception, as shown in the attached image.
Note that setting `tensor_parallel_size=8` results in the same exception.
Even when I replaced the model_dir with the llama-13B model, setting `tensor_parallel_size=8` still triggered a Ray OOM exception.
When I set the model directory to llama-13B and `tensor_parallel_size=4`, the model sometimes loads and runs inference successfully. However, initializing the Ray environment and the paged-attention memory takes a considerable amount of time, and it is hard to tell whether the program is stuck or still making progress.
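In case this is a memory-headroom problem rather than a bug, the next thing I plan to try is passing the memory-related constructor arguments explicitly. This is only a sketch, not a confirmed fix; the values below are guesses, and I am assuming `gpu_memory_utilization`, `swap_space`, and `enforce_eager` behave as documented in vllm 0.3.3:

```python
from vllm import LLM

# Sketch of the next attempt: leave more GPU headroom, shrink CPU swap space,
# and skip CUDA graph capture. All values below are guesses, not a confirmed fix.
llm = LLM(
    "/mnt/llm_dataset/evaluation_pretrain/models/sota/llama-hf-65b/",
    trust_remote_code=True,
    tensor_parallel_size=8,
    gpu_memory_utilization=0.85,  # default is 0.90; leave some headroom per GPU
    swap_space=2,                 # GiB of CPU swap per GPU (default 4); lowers host RAM pressure
    enforce_eager=True,           # skip CUDA graph capture to cut startup time and memory
)
```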
Here is information about my local environment:
- Ubuntu 22.04
- Driver Version: 470.182.03, CUDA Version: 12.3
- 8x A800 (80 GB each) on a local machine
- Python 3.8.18
- transformers: 4.38.2
- vllm: 0.3.3
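Finally, since the exception comes from Ray's memory monitor, I am also considering initializing Ray myself before constructing the LLM so I can bound the object store and, for debugging only, disable the OOM killer. This is a sketch under the assumption that vllm reuses an already-initialized Ray instance; the object-store size is a placeholder, and `NCCL_DEBUG` is only there to help tell a slow start from a hang:

```python
import os

# Must be set before ray.init(): 0 disables Ray's memory monitor / OOM killer.
# Debugging aid only -- it stops Ray from killing workers, it does not free memory.
os.environ["RAY_memory_monitor_refresh_ms"] = "0"
# Standard NCCL variable: verbose init logs help distinguish slow loading from a hang.
os.environ["NCCL_DEBUG"] = "INFO"

import ray
from vllm import LLM

# Start Ray explicitly with a bounded object store (placeholder: 10 GiB) so that
# vLLM -- which, as far as I understand, reuses an existing Ray instance -- does
# not let the object store grow unchecked on the host.
ray.init(num_gpus=8, object_store_memory=10 * 1024**3)

llm = LLM(
    "/mnt/llm_dataset/evaluation_pretrain/models/sota/llama-hf-65b/",
    trust_remote_code=True,
    tensor_parallel_size=8,
)
```

If anyone can point out which of these knobs actually matters here, or whether the long initialization time is expected for a 65B model with tensor parallelism, I would appreciate it.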