Hello, I get the following error when running train_rotate_FB15K237_dist.py from OpenKE 2.0. Is there a way to fix it? Any help would be greatly appreciated.
Input Files Path : ./benchmarks/data-390/
The toolkit is importing datasets.
The total of relations is 28.
The total of entities is 700324.
Input Files Path : ./benchmarks/data-390/
The toolkit is importing datasets.
The total of relations is 28.
The total of entities is 700324.
The total of train triples is 2849846.
The total of train triples is 2849846.
Input Files Path : ./benchmarks/data-390/
Input Files Path : ./benchmarks/data-390/
The total of test triples is 258713.
The total of valid triples is 1293564.
The total of test triples is 258713.
The total of valid triples is 1293564.
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -11) local_rank: 0 (pid: 2646564) of binary: /home/jupyter-xingcheng/.conda/envs/openke/bin/python3.8
Traceback (most recent call last):
File "/home/jupyter-xingcheng/.conda/envs/openke/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/home/jupyter-xingcheng/.conda/envs/openke/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
return f(*args, **kwargs)
File "/home/jupyter-xingcheng/.conda/envs/openke/lib/python3.8/site-packages/torch/distributed/run.py", line 762, in main
run(args)
File "/home/jupyter-xingcheng/.conda/envs/openke/lib/python3.8/site-packages/torch/distributed/run.py", line 753, in run
elastic_launch(
File "/home/jupyter-xingcheng/.conda/envs/openke/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/jupyter-xingcheng/.conda/envs/openke/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
train_rotate_data_390_dist.py FAILED
Failures:
[1]:
time : 2024-06-17_13:53:46
host : dell
rank : 1 (local_rank: 1)
exitcode : -11 (pid: 2646565)
error_file: <N/A>
traceback : Signal 11 (SIGSEGV) received by PID 2646565
Root Cause (first observed failure):
[0]:
time : 2024-06-17_13:53:46
host : dell
rank : 0 (local_rank: 0)
exitcode : -11 (pid: 2646564)
error_file: <N/A>
traceback : Signal 11 (SIGSEGV) received by PID 2646564
The command I ran was: WORLD_SIZE=2 CUDA_VISIBLE_DEVICES=0,1 torchrun --nproc_per_node=2 --master_port 1234 train_rotate_data_390_dist.py
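Since torchrun only reports "exitcode: -11" with no error file, the crash site is not visible in the log above. The following is a minimal sketch of how it might be narrowed down, assuming Python's standard `faulthandler` module is enabled at the very top of the training script. The helper names `describe_exitcode` and `enable_crash_traceback` are hypothetical and are not part of OpenKE or PyTorch.

```python
import faulthandler
import signal
import sys


def describe_exitcode(exitcode: int) -> str:
    """Turn a subprocess-style exit code into a readable description."""
    if exitcode < 0:
        # A negative exit code means the process was killed by signal -exitcode.
        try:
            return signal.Signals(-exitcode).name
        except ValueError:
            return f"unknown signal {-exitcode}"
    return f"exit status {exitcode}"


def enable_crash_traceback() -> None:
    """Ask Python to dump the stack of all threads to stderr on a fatal signal."""
    # faulthandler installs handlers for SIGSEGV, SIGFPE, SIGABRT, SIGBUS and SIGILL.
    faulthandler.enable(file=sys.stderr, all_threads=True)


if __name__ == "__main__":
    # Assumption: this call is placed before any other import in the training script.
    enable_crash_traceback()
    print(describe_exitcode(-11))
```

With the handler enabled, each rank should print the Python line that was executing when the segmentation fault occurred, which would show whether the crash happens while loading data or at a later step. The same effect can be had without editing the script by setting the environment variable `PYTHONFAULTHANDLER=1` in front of the torchrun command.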