Distributed inference on multiple machines (error: Invalid peer device id) #2795

@bieenr

Description

I'm a newbie, and I'm running the example from https://docs.vllm.ai/en/latest/serving/distributed_serving.html locally on 2 machines, each with an RTX 3090 GPU. I changed tensor_parallel_size to 2 and the model to "vinai/PhoGPT-4B".
On the head node, I run:
NCCL_SOCKET_IFNAME=eth0 NCCL_DEBUG=INFO CUDA_VISIBLE_DEVICES=0 ray start --head
On the other node, I run:
NCCL_SOCKET_IFNAME=eth0 NCCL_DEBUG=INFO CUDA_VISIBLE_DEVICES=0 ray start --address='10.0.0.1'
Then, when I run the example code on the head node (python main.py), I get the following error:

Traceback (most recent call last):
  File "/data2/bientd/vllm/test.py", line 25, in <module>
    llm = LLM(model="facebook/opt-13b", tensor_parallel_size=2,download_dir='/data2/bientd/')#,pipeline_parallel_size=3 don't support
  File "/data2/bientd/anaconda3/envs/vllm/lib/python3.9/site-packages/vllm/entrypoints/llm.py", line 109, in __init__
    self.llm_engine = LLMEngine.from_engine_args(engine_args)
  File "/data2/bientd/anaconda3/envs/vllm/lib/python3.9/site-packages/vllm/engine/llm_engine.py", line 356, in from_engine_args
    engine = cls(*engine_configs,
  File "/data2/bientd/anaconda3/envs/vllm/lib/python3.9/site-packages/vllm/engine/llm_engine.py", line 109, in __init__
    self._init_workers_ray(placement_group)
  File "/data2/bientd/anaconda3/envs/vllm/lib/python3.9/site-packages/vllm/engine/llm_engine.py", line 271, in _init_workers_ray
    self._run_workers("init_model")
  File "/data2/bientd/anaconda3/envs/vllm/lib/python3.9/site-packages/vllm/engine/llm_engine.py", line 983, in _run_workers
    driver_worker_output = getattr(self.driver_worker,
  File "/data2/bientd/anaconda3/envs/vllm/lib/python3.9/site-packages/vllm/worker/worker.py", line 87, in init_model
    init_custom_ar()
  File "/data2/bientd/anaconda3/envs/vllm/lib/python3.9/site-packages/vllm/model_executor/parallel_utils/custom_all_reduce.py", line 44, in init_custom_ar
    if not _can_p2p(rank, world_size):
  File "/data2/bientd/anaconda3/envs/vllm/lib/python3.9/site-packages/vllm/model_executor/parallel_utils/custom_all_reduce.py", line 137, in _can_p2p
    if not torch.cuda.can_device_access_peer(rank, i):
  File "/data2/bientd/anaconda3/envs/vllm/lib/python3.9/site-packages/torch/cuda/__init__.py", line 464, in can_device_access_peer
    raise AssertionError("Invalid peer device id")
AssertionError: Invalid peer device id

Metadata


    Labels

    bug (Something isn't working)
