Error during distributed training with train_rotate_FB15K237_dist.py in OpenKE 2.0 #410

Open
@pipiyapi

Description

Hello, I get the following error when running train_rotate_FB15K237_dist.py from OpenKE 2.0. Is there a way to fix it? Any help would be greatly appreciated.
Input Files Path : ./benchmarks/data-390/
The toolkit is importing datasets.
The total of relations is 28.
The total of entities is 700324.
Input Files Path : ./benchmarks/data-390/
The toolkit is importing datasets.
The total of relations is 28.
The total of entities is 700324.
The total of train triples is 2849846.
The total of train triples is 2849846.
Input Files Path : ./benchmarks/data-390/
Input Files Path : ./benchmarks/data-390/
The total of test triples is 258713.
The total of valid triples is 1293564.
The total of test triples is 258713.
The total of valid triples is 1293564.
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -11) local_rank: 0 (pid: 2646564) of binary: /home/jupyter-xingcheng/.conda/envs/openke/bin/python3.8
Traceback (most recent call last):
File "/home/jupyter-xingcheng/.conda/envs/openke/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/home/jupyter-xingcheng/.conda/envs/openke/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
return f(*args, **kwargs)
File "/home/jupyter-xingcheng/.conda/envs/openke/lib/python3.8/site-packages/torch/distributed/run.py", line 762, in main
run(args)
File "/home/jupyter-xingcheng/.conda/envs/openke/lib/python3.8/site-packages/torch/distributed/run.py", line 753, in run
elastic_launch(
File "/home/jupyter-xingcheng/.conda/envs/openke/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/jupyter-xingcheng/.conda/envs/openke/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

train_rotate_data_390_dist.py FAILED

Failures:
[1]:
time : 2024-06-17_13:53:46
host : dell
rank : 1 (local_rank: 1)
exitcode : -11 (pid: 2646565)
error_file: <N/A>
traceback : Signal 11 (SIGSEGV) received by PID 2646565

Root Cause (first observed failure):
[0]:
time : 2024-06-17_13:53:46
host : dell
rank : 0 (local_rank: 0)
exitcode : -11 (pid: 2646564)
error_file: <N/A>
traceback : Signal 11 (SIGSEGV) received by PID 2646564

The command I ran was: WORLD_SIZE=2 CUDA_VISIBLE_DEVICES=0,1 torchrun --nproc_per_node=2 --master_port 1234 train_rotate_data_390_dist.py
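
The report above only shows that both ranks died with signal 11 (SIGSEGV) and that error_file is <N/A>, so there is no Python-level stack to go on. As a possible way to narrow this down, here is a minimal sketch that uses the standard-library faulthandler module to dump each worker's Python stack when the signal arrives. The helper name enable_crash_traces and the log file names are hypothetical and not part of OpenKE. It assumes the script is started by torchrun, which sets LOCAL_RANK for each worker.

```python
import faulthandler
import os


def enable_crash_traces(log_dir="."):
    """Dump the Python stack of this process to a per-rank file on SIGSEGV.

    Hypothetical helper for debugging; not part of OpenKE or PyTorch.
    """
    # torchrun sets LOCAL_RANK for every worker; fall back to "0" otherwise.
    rank = os.environ.get("LOCAL_RANK", "0")
    path = os.path.join(log_dir, "segfault_rank{}.log".format(rank))
    # The file object must stay open for as long as faulthandler is enabled,
    # so it is returned to the caller instead of being closed here.
    log_file = open(path, "w")
    faulthandler.enable(file=log_file, all_threads=True)
    return log_file
```

Calling `crash_log = enable_crash_traces()` at the very top of the training script, before the dataset is loaded, should leave a segfault_rank0.log and segfault_rank1.log showing which Python call was active when the crash occurred. It does not show the native stack, so it only narrows down where to look.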
