Unable to run distributed inference on ray with llama-65B, tensor_parallel_size > 1 #3196

@hxer7963

Issue Description:

When I tried to deploy the llama-hf-65B model on an 8-GPU machine, I followed the example in Distributed Inference and Serving (link) and wrote the following code:

from vllm import LLM
llm = LLM("/mnt/llm_dataset/evaluation_pretrain/models/sota/llama-hf-65b/", trust_remote_code=True, tensor_parallel_size=4)

However, Ray raised an OOM exception, as shown in the attached image.
Note that setting tensor_parallel_size=8 results in the same exception.
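For context, here is a quick back-of-the-envelope check (my own arithmetic, not from vLLM) showing that the fp16 weights of llama-65B should fit per GPU at these tensor-parallel sizes, so the OOM is presumably coming from somewhere else (e.g. Ray's host-side object store or the KV-cache reservation), not the weight shards themselves:

```python
# Rough fp16 weight footprint per GPU under tensor parallelism.
# Weight memory only: vLLM additionally reserves KV-cache memory on each GPU,
# and Ray needs host RAM for its object store.
PARAMS = 65e9          # llama-65B parameter count (approximate)
BYTES_PER_PARAM = 2    # fp16

def weight_gib_per_gpu(tensor_parallel_size: int) -> float:
    """Approximate fp16 weight footprint per GPU, in GiB."""
    return PARAMS * BYTES_PER_PARAM / tensor_parallel_size / 2**30

for tp in (4, 8):
    print(f"tp={tp}: ~{weight_gib_per_gpu(tp):.1f} GiB of weights per GPU")
# tp=4 comes out around 30 GiB and tp=8 around 15 GiB, both well under
# the 80 GB of an A800.
```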

(screenshot: Ray OOM traceback)

Even when I replaced the model_dir with the llama-13B model, setting tensor_parallel_size=8 still triggers a Ray OOM exception.

When I set the model directory to llama-13B and tensor_parallel_size=4, the model sometimes loads and runs inference successfully. However, initializing the Ray environment and the paged-attention memory takes a considerable amount of time, and it is hard to tell whether the program is stuck or still making progress.

Here is information about my local environment:

  • Ubuntu 22.04
  • Driver Version: 470.182.03, CUDA Version: 12.3
  • 8× A800 (80 GB each) on the local machine
  • Python 3.8.18
  • transformers: 4.38.2
  • vllm: 0.3.3
