Cuda failure 'peer access is not supported between these two devices' #406

Closed
@colorzhang

Description

Usage stats collection is enabled. To disable this, run the following command: ray disable-usage-stats before starting Ray. See https://docs.ray.io/en/master/cluster/usage-stats.html for more details.
2023-07-08 23:11:34,236 INFO worker.py:1610 -- Started a local Ray instance. View the dashboard at 127.0.0.1:8265
INFO 07-08 23:11:35 llm_engine.py:60] Initializing an LLM engine with config: model='openlm-research/open_llama_13b', tokenizer='openlm-research/open_llama_13b', tokenizer_mode=auto, dtype=torch.float16, use_dummy_weights=False, download_dir=None, use_np_weights=False, tensor_parallel_size=4, seed=0)
INFO 07-08 23:11:35 tokenizer.py:28] For some LLaMA-based models, initializing the fast tokenizer may take a long time. To eliminate the initialization time, consider using 'hf-internal-testing/llama-tokenizer' instead of the original tokenizer.
(Worker pid=4225) Exception raised in creation task: The actor died because of an error raised in its creation task, ray::Worker.__init__() (pid=4225, ip=172.31.68.176, actor_id=5dc662848f950df8d330eb8a01000000, repr=<vllm.worker.worker.Worker object at 0x7f4e9ea814e0>)
(Worker pid=4225) File "/opt/conda/lib/python3.10/site-packages/vllm/worker/worker.py", line 40, in __init__
(Worker pid=4225) _init_distributed_environment(parallel_config, rank,
(Worker pid=4225) File "/opt/conda/lib/python3.10/site-packages/vllm/worker/worker.py", line 307, in _init_distributed_environment
(Worker pid=4225) torch.distributed.all_reduce(torch.zeros(1).cuda())
(Worker pid=4225) File "/opt/conda/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1451, in wrapper
(Worker pid=4225) return func(*args, **kwargs)
(Worker pid=4225) File "/opt/conda/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1700, in all_reduce
(Worker pid=4225) work = default_pg.allreduce([tensor], opts)
(Worker pid=4225) torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1275, internal error, NCCL version 2.14.3
(Worker pid=4225) ncclInternalError: Internal check failed.
(Worker pid=4225) Last error:
(Worker pid=4225) Cuda failure 'peer access is not supported between these two devices'
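The error says NCCL tried to enable CUDA peer-to-peer (P2P) access between GPUs that don't support it. You can confirm what CUDA reports for each GPU pair with a small diagnostic sketch using PyTorch's public API (`torch.cuda.can_device_access_peer`); this is just a check, not part of the original report:

```python
# Diagnostic sketch: ask CUDA whether P2P access is supported between each GPU pair.
import torch

n = torch.cuda.device_count()
for i in range(n):
    for j in range(n):
        if i != j:
            ok = torch.cuda.can_device_access_peer(i, j)
            print(f"GPU {i} -> GPU {j}: peer access {'supported' if ok else 'NOT supported'}")
```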

Code:
llm = LLM(model="openlm-research/open_llama_13b", tensor_parallel_size=4)

Env:
A single EC2 g5.12xlarge instance with 4 NVIDIA A10G GPUs
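A commonly reported workaround for this class of failure (an assumption here, not confirmed in this thread) is to disable NCCL's P2P transport so inter-GPU communication falls back to going through host memory. NCCL exposes this via the standard `NCCL_P2P_DISABLE` environment variable, which must be set before the distributed process group is initialized:

```python
# Workaround sketch: disable NCCL peer-to-peer before vLLM initializes
# torch.distributed. NCCL_P2P_DISABLE is a standard NCCL environment variable;
# whether it resolves this specific failure on g5.12xlarge is an assumption.
import os
os.environ["NCCL_P2P_DISABLE"] = "1"

from vllm import LLM

llm = LLM(model="openlm-research/open_llama_13b", tensor_parallel_size=4)
```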
