Labels: Stale (stale and scheduled for closing soon), bug (something isn't working)
Description
Hello, when I try to train on multiple GPUs using an image built from the Dockerfile, I get the error below. I am using Ubuntu 18.04 and Python 3.8.
<<<<<<<<<<<<<<<<>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
root@5a70a5f2d489:/usr/src/app# python -m torch.distributed.run --nproc_per_node 2 train.py --batch 64 --data data.yaml --weights yolov5s.pt --device 0,1
WARNING:__main__:*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
Traceback (most recent call last):
File "train.py", line 620, in <module>
main(opt)
File "train.py", line 497, in main
check_file(opt.data), check_yaml(opt.cfg), check_yaml(opt.hyp), str(opt.weights), str(opt.project) # checks
File "/usr/src/app/utils/general.py", line 326, in check_file
assert len(files), f'File not found: {file}' # assert file was found
AssertionError: File not found: data.yaml
wandb: (1) Create a W&B account
wandb: (2) Use an existing W&B account
wandb: (3) Don't visualize my results
wandb: Enter your choice: (30 second timeout) ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 1 (pid: 405) of binary: /opt/conda/bin/python
/opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py:367: UserWarning:
**********************************************************************
CHILD PROCESS FAILED WITH NO ERROR_FILE
**********************************************************************
CHILD PROCESS FAILED WITH NO ERROR_FILE
Child process 405 (local_rank 1) FAILED (exitcode 1)
Error msg: Process failed with exitcode 1
Without writing an error file to <N/A>.
While this DOES NOT affect the correctness of your application,
no trace information about the error will be available for inspection.
Consider decorating your top level entrypoint function with
torch.distributed.elastic.multiprocessing.errors.record. Example:
from torch.distributed.elastic.multiprocessing.errors import record
@record
def trainer_main(args):
    # do train
**********************************************************************
warnings.warn(_no_error_file_warning_msg(rank, failure))
Traceback (most recent call last):
File "/opt/conda/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/opt/conda/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/opt/conda/lib/python3.8/site-packages/torch/distributed/run.py", line 702, in <module>
main()
File "/opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 361, in wrapper
return f(*args, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/torch/distributed/run.py", line 698, in main
run(args)
File "/opt/conda/lib/python3.8/site-packages/torch/distributed/run.py", line 689, in run
elastic_launch(
File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 116, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 244, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
***************************************
train.py FAILED
=======================================
Root Cause:
[0]:
time: 2021-10-13_04:30:25
rank: 1 (local_rank: 1)
exitcode: 1 (pid: 405)
error_file: <N/A>
msg: "Process failed with exitcode 1"
=======================================
Other Failures:
<NO_OTHER_FAILURES>
***************************************
root@5a70a5f2d489:/usr/src/app#
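The first traceback above ends in `AssertionError: File not found: data.yaml`, so the dataset file passed via `--data` does not seem to be visible from `/usr/src/app` inside the container. A minimal sketch of a check that could be run before launching is shown below. The helper name `resolve_data_yaml` and its `search_root` parameter are hypothetical and not part of the repository; it only uses the Python standard library and is not the implementation of `check_file`.

```python
from pathlib import Path


def resolve_data_yaml(name, search_root="."):
    """Hypothetical helper: return the absolute path of `name`, or raise if missing.

    Tries the path as given first, then searches `search_root` recursively
    for a file with the same base name.
    """
    candidate = Path(name)
    if candidate.is_file():
        return candidate.resolve()

    # Fall back to a recursive search by file name under search_root.
    matches = sorted(p for p in Path(search_root).rglob(candidate.name) if p.is_file())
    if not matches:
        root = Path(search_root).resolve()
        raise FileNotFoundError(f"File not found: {name} (searched under {root})")
    return matches[0].resolve()


if __name__ == "__main__":
    # Assumption: run from the same working directory as train.py.
    print(resolve_data_yaml("data.yaml"))
```

If this raises `FileNotFoundError`, the file is probably not mounted or copied into the container, and passing the absolute path it prints (when it succeeds) to `--data` avoids depending on the working directory.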