Labels: argparse (removed) · bug · priority: 1 (medium) · strategy: ddp (DistributedDataParallel)
Description
🐛 Bug
Running DDP with Hydra multirun ends with a "Killed" error message when launching the second task:
Epoch 0 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0/939 0:00:00 • -:--:-- 0.00it/s
[W reducer.cpp:1158] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration, which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
[W reducer.cpp:1158] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration, which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
Epoch 0 ━━━━━━━━━━━━━━━━ 939/939 0:00:13 • 0:00:00 70.53it/s loss: 0.142 v_num:
[2022-01-03 15:21:38,513][src.train][INFO] - Starting testing!
[2022-01-03 15:21:38,514][pytorch_lightning.utilities.distributed][INFO] - Restoring states from the checkpoint path at /home/user/lightning-hydra-template/logs/multiruns/2022-01-03/15-21-17/0/checkpoints/epoch_000.ckpt
[2022-01-03 15:21:38,535][pytorch_lightning.accelerators.gpu][INFO] - LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1]
[2022-01-03 15:21:41,523][HYDRA] #1 : trainer.max_epochs=1 datamodule.batch_size=64 trainer.gpus=2 +trainer.strategy=ddp
Killed
I experience this ONLY when passing the dirpath parameter to the checkpoint callback:
ModelCheckpoint(dirpath="checkpoints/")
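To illustrate why a relative dirpath is fragile here, a minimal stdlib-only sketch (no Lightning or Hydra required; the directory names are made up for illustration). It simulates Hydra multirun, which changes into a fresh output directory for each job, and resolves the same relative "checkpoints/" path the way ModelCheckpoint effectively would:

```python
import os
import tempfile

def resolve_dirpath(dirpath: str) -> str:
    """Resolve a (possibly relative) checkpoint dirpath against the
    current working directory."""
    return os.path.abspath(dirpath)

# Simulate Hydra's per-job working directories: each multirun job
# chdirs into its own output directory before the task runs.
launch_dir = os.getcwd()
with tempfile.TemporaryDirectory() as root:
    resolved = []
    for job in ("0", "1"):
        job_dir = os.path.join(root, job)
        os.makedirs(job_dir)
        os.chdir(job_dir)
        resolved.append(resolve_dirpath("checkpoints/"))
    os.chdir(launch_dir)

# The same relative dirpath points at a different directory per job,
# so processes that hold a stale working directory (e.g. DDP workers
# spawned for job 0) disagree with the new job about where to write.
assert resolved[0] != resolved[1]
```

This is only a model of the suspected mechanism, not Lightning's actual code path.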
Tested with lightning v1.5.7. I believe this issue wasn't present in one of the previous releases.
This probably has something to do with the way Hydra changes the working directory for each new run: the directory for storing checkpoints also gets changed. If I remember correctly, a workaround was implemented in lightning that made DDP possible despite that.
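A possible workaround, assuming the goal is to anchor checkpoints to the launch directory rather than Hydra's per-job cwd (Hydra's hydra.utils.to_absolute_path serves the same purpose; the stdlib equivalent below avoids the dependency, and the helper name is hypothetical):

```python
import os

# Capture the launch directory once, at import time, before Hydra
# (or anything else) changes the working directory for a given job.
LAUNCH_DIR = os.getcwd()

def anchor_to_launch_dir(dirpath: str) -> str:
    """Return dirpath unchanged if it is already absolute; otherwise
    anchor it to the directory the process was launched from instead
    of whatever the current working directory happens to be."""
    if os.path.isabs(dirpath):
        return dirpath
    return os.path.join(LAUNCH_DIR, dirpath)
```

Then the callback could be constructed as ModelCheckpoint(dirpath=anchor_to_launch_dir("checkpoints/")), so every multirun job and every DDP worker resolves the same directory. This is a sketch of a mitigation, not a fix for the underlying "Killed" crash.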
cc @tchaton @rohitgr7 @justusschock @kaushikb11 @awaelchli @akihironitta