Distributed inference on multiple machines (error: Invalid peer device id) #2795

@bieenr

Description

I'm a newbie, and I'm running the example from https://docs.vllm.ai/en/latest/serving/distributed_serving.html locally on 2 machines, each with an RTX 3090 GPU. I changed tensor_parallel_size to 2 and the model to "vinai/PhoGPT-4B".
On the head node, I run:
NCCL_SOCKET_IFNAME=eth0 NCCL_DEBUG=INFO CUDA_VISIBLE_DEVICES=0 ray start --head
On the other node, I run:
NCCL_SOCKET_IFNAME=eth0 NCCL_DEBUG=INFO CUDA_VISIBLE_DEVICES=0 ray start --address='10.0.0.1'
Then, when I run the example code on the head node (python main.py), I get the following error:

Traceback (most recent call last):
  File "/data2/bientd/vllm/test.py", line 25, in <module>
    llm = LLM(model="facebook/opt-13b", tensor_parallel_size=2,download_dir='/data2/bientd/')#,pipeline_parallel_size=3 don't support
  File "/data2/bientd/anaconda3/envs/vllm/lib/python3.9/site-packages/vllm/entrypoints/llm.py", line 109, in __init__
    self.llm_engine = LLMEngine.from_engine_args(engine_args)
  File "/data2/bientd/anaconda3/envs/vllm/lib/python3.9/site-packages/vllm/engine/llm_engine.py", line 356, in from_engine_args
    engine = cls(*engine_configs,
  File "/data2/bientd/anaconda3/envs/vllm/lib/python3.9/site-packages/vllm/engine/llm_engine.py", line 109, in __init__
    self._init_workers_ray(placement_group)
  File "/data2/bientd/anaconda3/envs/vllm/lib/python3.9/site-packages/vllm/engine/llm_engine.py", line 271, in _init_workers_ray
    self._run_workers("init_model")
  File "/data2/bientd/anaconda3/envs/vllm/lib/python3.9/site-packages/vllm/engine/llm_engine.py", line 983, in _run_workers
    driver_worker_output = getattr(self.driver_worker,
  File "/data2/bientd/anaconda3/envs/vllm/lib/python3.9/site-packages/vllm/worker/worker.py", line 87, in init_model
    init_custom_ar()
  File "/data2/bientd/anaconda3/envs/vllm/lib/python3.9/site-packages/vllm/model_executor/parallel_utils/custom_all_reduce.py", line 44, in init_custom_ar
    if not _can_p2p(rank, world_size):
  File "/data2/bientd/anaconda3/envs/vllm/lib/python3.9/site-packages/vllm/model_executor/parallel_utils/custom_all_reduce.py", line 137, in _can_p2p
    if not torch.cuda.can_device_access_peer(rank, i):
  File "/data2/bientd/anaconda3/envs/vllm/lib/python3.9/site-packages/torch/cuda/__init__.py", line 464, in can_device_access_peer
    raise AssertionError("Invalid peer device id")
AssertionError: Invalid peer device id

Metadata


    Labels

    bug (Something isn't working)
