Cuda failure 'peer access is not supported between these two devices' #406

Closed
@colorzhang

Description

Usage stats collection is enabled. To disable this, run the following command: ray disable-usage-stats before starting Ray. See https://docs.ray.io/en/master/cluster/usage-stats.html for more details.
2023-07-08 23:11:34,236 INFO worker.py:1610 -- Started a local Ray instance. View the dashboard at 127.0.0.1:8265
INFO 07-08 23:11:35 llm_engine.py:60] Initializing an LLM engine with config: model='openlm-research/open_llama_13b', tokenizer='openlm-research/open_llama_13b', tokenizer_mode=auto, dtype=torch.float16, use_dummy_weights=False, download_dir=None, use_np_weights=False, tensor_parallel_size=4, seed=0)
INFO 07-08 23:11:35 tokenizer.py:28] For some LLaMA-based models, initializing the fast tokenizer may take a long time. To eliminate the initialization time, consider using 'hf-internal-testing/llama-tokenizer' instead of the original tokenizer.
(Worker pid=4225) Exception raised in creation task: The actor died because of an error raised in its creation task, ray::Worker.__init__() (pid=4225, ip=172.31.68.176, actor_id=5dc662848f950df8d330eb8a01000000, repr=<vllm.worker.worker.Worker object at 0x7f4e9ea814e0>)
(Worker pid=4225) File "/opt/conda/lib/python3.10/site-packages/vllm/worker/worker.py", line 40, in __init__
(Worker pid=4225) _init_distributed_environment(parallel_config, rank,
(Worker pid=4225) File "/opt/conda/lib/python3.10/site-packages/vllm/worker/worker.py", line 307, in _init_distributed_environment
(Worker pid=4225) torch.distributed.all_reduce(torch.zeros(1).cuda())
(Worker pid=4225) File "/opt/conda/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1451, in wrapper
(Worker pid=4225) return func(*args, **kwargs)
(Worker pid=4225) File "/opt/conda/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1700, in all_reduce
(Worker pid=4225) work = default_pg.allreduce([tensor], opts)
(Worker pid=4225) torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1275, internal error, NCCL version 2.14.3
(Worker pid=4225) ncclInternalError: Internal check failed.
(Worker pid=4225) Last error:
(Worker pid=4225) Cuda failure 'peer access is not supported between these two devices'
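The error says NCCL tried to enable CUDA peer-to-peer (P2P) access between GPUs that don't support it. You can confirm what CUDA reports for each GPU pair with a small diagnostic sketch using PyTorch's public API (`torch.cuda.can_device_access_peer`); this is just a check, not part of the original report:

```python
# Diagnostic sketch: ask CUDA whether P2P access is supported between each GPU pair.
import torch

n = torch.cuda.device_count()
for i in range(n):
    for j in range(n):
        if i != j:
            ok = torch.cuda.can_device_access_peer(i, j)
            print(f"GPU {i} -> GPU {j}: peer access {'supported' if ok else 'NOT supported'}")
```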

Code:
llm = LLM(model="openlm-research/open_llama_13b", tensor_parallel_size=4)

Env:
A single EC2 g5.12xlarge instance with 4 NVIDIA A10G GPUs
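A commonly reported workaround for this class of failure (an assumption here, not confirmed in this thread) is to disable NCCL's P2P transport so inter-GPU communication falls back to going through host memory. NCCL exposes this via the standard `NCCL_P2P_DISABLE` environment variable, which must be set before the distributed process group is initialized:

```python
# Workaround sketch: disable NCCL peer-to-peer before vLLM initializes
# torch.distributed. NCCL_P2P_DISABLE is a standard NCCL environment variable;
# whether it resolves this specific failure on g5.12xlarge is an assumption.
import os
os.environ["NCCL_P2P_DISABLE"] = "1"

from vllm import LLM

llm = LLM(model="openlm-research/open_llama_13b", tensor_parallel_size=4)
```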
