
DDP with Hydra multirun doesn't work when dirpath in checkpoint callback is specified #11300

@ashleve


πŸ› Bug

Running DDP with Hydra multirun ends up with a "Killed" error message when launching the second task:

Epoch 0    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0/939 0:00:00 • -:--:-- 0.00it/s [W reducer.cpp:1158] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration,  which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
[W reducer.cpp:1158] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration,  which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
Epoch 0    ━━━━━━━━━━━━━━━━ 939/939 0:00:13 • 0:00:00 70.53it/s loss: 0.142 v_num:
[2022-01-03 15:21:38,513][src.train][INFO] - Starting testing!
[2022-01-03 15:21:38,514][pytorch_lightning.utilities.distributed][INFO] - Restoring states from the checkpoint path at /home/user/lightning-hydra-template/logs/multiruns/2022-01-03/15-21-17/0/checkpoints/epoch_000.ckpt
[2022-01-03 15:21:38,535][pytorch_lightning.accelerators.gpu][INFO] - LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1]
[2022-01-03 15:21:41,523][HYDRA]        #1 : trainer.max_epochs=1 datamodule.batch_size=64 trainer.gpus=2 +trainer.strategy=ddp
Killed

I experience this ONLY when passing the dirpath parameter to the checkpoint callback:

from pytorch_lightning.callbacks import ModelCheckpoint

ModelCheckpoint(dirpath="checkpoints/")
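For context, the callback is presumably wired into the Trainer roughly along these lines; this is a sketch of the assumed setup (the config layout and the cfg.trainer fields are my guesses, not code from this issue):

import hydra
from omegaconf import DictConfig
from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import ModelCheckpoint


@hydra.main(config_path="configs", config_name="train")
def train(cfg: DictConfig) -> None:
    # The relative dirpath is resolved against the current working
    # directory, which Hydra has already changed to the per-run output dir.
    checkpoint = ModelCheckpoint(dirpath="checkpoints/")
    trainer = Trainer(
        gpus=cfg.trainer.gpus,
        max_epochs=cfg.trainer.max_epochs,
        strategy=cfg.trainer.get("strategy"),
        callbacks=[checkpoint],
    )
    # trainer.fit(model, datamodule=datamodule)  # model/datamodule omitted


if __name__ == "__main__":
    train()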

Tested with Lightning v1.5.7. I believe this issue wasn't present in one of the previous releases.

This probably has something to do with the way Hydra changes the working directory for each new run: the directory for storing checkpoints also gets changed. If I remember correctly, there was a workaround implemented in Lightning at some point which made DDP possible despite that. The sketch below illustrates the suspected mechanism.
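To make that concrete, here is a minimal standalone sketch (my illustration, not code from the original report; the script name repro.py is made up) of how Hydra's per-run working directory changes where a relative dirpath lands:

# repro.py: show Hydra's per-run working directory under multirun.
import os

import hydra
from omegaconf import DictConfig


@hydra.main(config_path=None)
def main(cfg: DictConfig) -> None:
    # Launched as `python repro.py --multirun +task=0,1`, Hydra chdirs into a
    # fresh output directory for each task (e.g. multirun/<date>/<time>/0,
    # then .../1), so the same relative "checkpoints/" resolves to a
    # different absolute path in every run.
    print("cwd:", os.getcwd())
    print("checkpoints/ resolves to:", os.path.abspath("checkpoints"))


if __name__ == "__main__":
    main()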

cc @tchaton @rohitgr7 @justusschock @kaushikb11 @awaelchli @akihironitta


Labels: argparse (removed), bug, priority: 1, strategy: ddp
