Closed
Labels
bug (Something isn't working)
Description
I'm a newbie, and I'm running the example from https://docs.vllm.ai/en/latest/serving/distributed_serving.html locally on 2 machines, each with a single RTX 3090 GPU. I set tensor_parallel_size to 2 and changed the model to "vinai/PhoGPT-4B".
On the head node, I run:
NCCL_SOCKET_IFNAME=eth0 NCCL_DEBUG=INFO CUDA_VISIBLE_DEVICES=0 ray start --head
On the other node, I run:
NCCL_SOCKET_IFNAME=eth0 NCCL_DEBUG=INFO CUDA_VISIBLE_DEVICES=0 ray start --address='10.0.0.1'
Then, when I run the example code on the head node with python main.py, I get the following error:
Traceback (most recent call last):
File "/data2/bientd/vllm/test.py", line 25, in <module>
llm = LLM(model="facebook/opt-13b", tensor_parallel_size=2,download_dir='/data2/bientd/')#,pipeline_parallel_size=3 don't support
File "/data2/bientd/anaconda3/envs/vllm/lib/python3.9/site-packages/vllm/entrypoints/llm.py", line 109, in __init__
self.llm_engine = LLMEngine.from_engine_args(engine_args)
File "/data2/bientd/anaconda3/envs/vllm/lib/python3.9/site-packages/vllm/engine/llm_engine.py", line 356, in from_engine_args
engine = cls(*engine_configs,
File "/data2/bientd/anaconda3/envs/vllm/lib/python3.9/site-packages/vllm/engine/llm_engine.py", line 109, in __init__
self._init_workers_ray(placement_group)
File "/data2/bientd/anaconda3/envs/vllm/lib/python3.9/site-packages/vllm/engine/llm_engine.py", line 271, in _init_workers_ray
self._run_workers("init_model")
File "/data2/bientd/anaconda3/envs/vllm/lib/python3.9/site-packages/vllm/engine/llm_engine.py", line 983, in _run_workers
driver_worker_output = getattr(self.driver_worker,
File "/data2/bientd/anaconda3/envs/vllm/lib/python3.9/site-packages/vllm/worker/worker.py", line 87, in init_model
init_custom_ar()
File "/data2/bientd/anaconda3/envs/vllm/lib/python3.9/site-packages/vllm/model_executor/parallel_utils/custom_all_reduce.py", line 44, in init_custom_ar
if not _can_p2p(rank, world_size):
File "/data2/bientd/anaconda3/envs/vllm/lib/python3.9/site-packages/vllm/model_executor/parallel_utils/custom_all_reduce.py", line 137, in _can_p2p
if not torch.cuda.can_device_access_peer(rank, i):
File "/data2/bientd/anaconda3/envs/vllm/lib/python3.9/site-packages/torch/cuda/__init__.py", line 464, in can_device_access_peer
raise AssertionError("Invalid peer device id")
AssertionError: Invalid peer device id
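If it helps to narrow this down: my understanding (an assumption, not confirmed) is that the failing check in custom_all_reduce.py iterates over all tensor-parallel ranks as if they were local CUDA device indices, while with CUDA_VISIBLE_DEVICES=0 each node only exposes device 0, so peer device id 1 doesn't exist locally. A minimal sketch of that logic, with torch.cuda replaced by stand-ins so it runs anywhere (function names here are illustrative, not vLLM's exact code):

```python
# Sketch of the peer-access check behind the traceback above.
# Assumption: each node exposes exactly 1 GPU (CUDA_VISIBLE_DEVICES=0).

def can_device_access_peer(device: int, peer: int, num_devices: int) -> bool:
    """Stand-in for torch.cuda.can_device_access_peer: it raises
    AssertionError when given a device id that doesn't exist locally."""
    if not (0 <= peer < num_devices):
        raise AssertionError("Invalid peer device id")
    return True

def can_p2p(rank: int, world_size: int, num_devices: int) -> bool:
    # Every other rank is treated as a *local* device index to probe.
    for i in range(world_size):
        if i == rank:
            continue
        if not can_device_access_peer(rank, i, num_devices):
            return False
    return True

# tensor_parallel_size=2 across 2 nodes: world_size=2, but only 1 local
# device, so probing peer device id 1 raises the same AssertionError.
try:
    can_p2p(rank=0, world_size=2, num_devices=1)
except AssertionError as e:
    print(e)  # Invalid peer device id
```

So the check seems to assume all tensor-parallel GPUs live on one node, which doesn't hold for a 2-node setup with one GPU each.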