Bug description
Multi-node / multi-GPU training fails midway through a run because ddp_weakref is not set correctly during the backward pass. This looks similar to the issue reported in #20390. I was unable to reproduce it with a small model, and the exact point of failure (epoch/step) varies between training runs. Any ideas? 🙏
[rank13]: Traceback (most recent call last):
[rank13]: ^^^^^^
[rank13]: File "/lib/python3.11/site-packages/hydra/main.py", line 94, in decorated_main
[rank13]: _run_hydra(
[rank13]: File "/lib/python3.11/site-packages/hydra/_internal/utils.py", line 394, in _run_hydra
[rank13]: _run_app(
[rank13]: File "/lib/python3.11/site-packages/hydra/_internal/utils.py", line 457, in _run_app
[rank13]: run_and_report(
[rank13]: File "/lib/python3.11/site-packages/hydra/_internal/utils.py", line 223, in run_and_report
[rank13]: raise ex
[rank13]: File "/lib/python3.11/site-packages/hydra/_internal/utils.py", line 220, in run_and_report
[rank13]: File "/lib/python3.11/site-packages/hydra/_internal/utils.py", line 458, in <lambda>
[rank13]: lambda: hydra.run(
[rank13]: _ = ret.return_value
[rank13]: DOPTrainer(config).train()
[rank13]: File "/lib/python3.11/site-packages/pytorch_lightning/trainer/trainer.py", line 539, in fit
[rank13]: call._call_and_handle_interrupt(
[rank13]: return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
[rank13]: return function(*args, **kwargs)
[rank13]: self._run(model, ckpt_path=ckpt_path)
[rank13]: File "/lib/python3.11/site-packages/pytorch_lightning/trainer/trainer.py", line 982, in _run
[rank13]: File "/lib/python3.11/site-packages/pytorch_lightning/trainer/trainer.py", line 1026, in _run_stage
[rank13]: self.fit_loop.run()
[rank13]: File "/lib/python3.11/site-packages/pytorch_lightning/loops/fit_loop.py", line 216, in run
[rank13]: self.advance(data_fetcher)
[rank13]: self._optimizer_step(batch_idx, closure)
[rank13]: output = fn(*args, **kwargs)
[rank13]: ^^^^^^^^^^^^^^^^^^^
[rank13]: File "/lib/python3.11/site-packages/pytorch_lightning/core/optimizer.py", line 154, in step
[rank13]: step_output = self._strategy.optimizer_step(self._optimizer, closure, **kwargs)
[rank13]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank13]: File "/lib/python3.11/site-packages/pytorch_lightning/strategies/ddp.py", line 270, in optimizer_step
[rank13]: optimizer_output = super().optimizer_step(optimizer, closure, model, **kwargs)
[rank13]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank13]: return self.precision_plugin.optimizer_step(optimizer, model=model, closure=closure, **kwargs)
[rank13]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank13]: File "/lib/python3.11/site-packages/pytorch_lightning/loops/optimization/automatic.py", line 146, in __call__
[rank13]: self._result = self.closure(*args, **kwargs)
[rank13]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank13]: File "/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank13]: return func(*args, **kwargs)
[rank13]: ^^^^^^^^^^^^^^^^^^^^^
[rank13]: File "/lib/python3.11/site-packages/pytorch_lightning/loops/optimization/automatic.py", line 140, in closure
[rank13]: self._backward_fn(step_output.closure_loss)
[rank13]: File "/lib/python3.11/site-packages/pytorch_lightning/loops/optimization/automatic.py", line 241, in backward_fn
[rank13]: call._call_strategy_hook(self.trainer, "backward", loss, optimizer)
[rank13]: File "/lib/python3.11/site-packages/pytorch_lightning/trainer/call.py", line 323, in _call_strategy_hook
[rank13]: output = fn(*args, **kwargs)
[rank13]: ^^^^^^^^^^^^^^^^^^^
[rank13]: File "/lib/python3.11/site-packages/pytorch_lightning/strategies/strategy.py", line 213, in backward
[rank13]: self.precision_plugin.backward(closure_loss, self.lightning_module, optimizer, *args, **kwargs)
[rank13]: File "/lib/python3.11/site-packages/pytorch_lightning/plugins/precision/precision.py", line 73, in backward
[rank13]: model.backward(tensor, *args, **kwargs)
[rank13]: File "/lib/python3.11/site-packages/pytorch_lightning/core/module.py", line 1097, in backward
[rank13]: loss.backward(*args, **kwargs)
[rank13]: File "/lib/python3.11/site-packages/torch/_tensor.py", line 626, in backward
[rank13]: torch.autograd.backward(
[rank13]: File "/lib/python3.11/site-packages/torch/autograd/__init__.py", line 347, in backward
[rank13]: _engine_run_backward(
[rank13]: File "/lib/python3.11/site-packages/torch/autograd/graph.py", line 823, in _engine_run_backward
[rank13]: return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
[rank13]: File "/lib/python3.11/site-packages/torch/autograd/function.py", line 307, in apply
[rank13]: return user_fn(self, *args)
[rank13]: ^^^^^^^^^^^^^^^^^^^^
[rank13]: File "/lib/python3.11/site-packages/torch/nn/parallel/distributed.py", line 260, in backward
[rank13]: reducer = ddp_weakref.reducer
[rank13]: ^^^^^^^^^^^^^^^^^^^
[rank13]: AttributeError: 'NoneType' object has no attribute 'reducer'
[rank12]:[E410 06:07:50.769590035 ProcessGroupNCCL.cpp:629] [Rank 4] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=253173, OpType=ALLGATHER, NumelIn=91356, NumelOut=730848, Timeout(ms)=600000) ran for 600009 milliseconds before timing out.
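My (possibly wrong) reading of the final frame is that the weak reference to the DDP wrapper resolves to None by the time the backward hook runs, so the `.reducer` lookup fails. A standalone illustration of that failure mode (hypothetical code, not the actual PyTorch source) would be:

```python
# Hypothetical illustration of the suspected failure mode, NOT PyTorch's real code:
# if the weakref to the DDP wrapper is dead, dereferencing it yields None and the
# attribute access raises the same AttributeError as in the traceback above.
import weakref

class FakeDDP:
    def __init__(self):
        self.reducer = object()  # stands in for the C++ Reducer

ddp = FakeDDP()
ddp_weakref = weakref.ref(ddp)

del ddp                   # wrapper gets garbage collected
module = ddp_weakref()    # weakref now resolves to None
try:
    reducer = module.reducer
except AttributeError as err:
    print(err)            # 'NoneType' object has no attribute 'reducer'
```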
What version are you seeing the problem on?
v2.5
How to reproduce the bug
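I cannot share the full pipeline (Hydra plus our own DOPTrainer wrapper), but the launch configuration is roughly the toy version below. As noted above, a small model like this did not reproduce the crash for me; the model, data, and device/node counts here are placeholders.

```python
# Toy stand-in for our training entry point (placeholder model and data;
# the real job wraps this in a Hydra-configured DOPTrainer).
import torch
import pytorch_lightning as pl
from torch.utils.data import DataLoader, TensorDataset

class ToyModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return torch.nn.functional.mse_loss(self.layer(x), y)

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)

if __name__ == "__main__":
    data = DataLoader(
        TensorDataset(torch.randn(1024, 32), torch.randn(1024, 1)), batch_size=64
    )
    trainer = pl.Trainer(
        accelerator="gpu",
        devices=8,       # GPUs per node (example value)
        num_nodes=2,     # launched across multiple nodes, e.g. via SLURM
        strategy="ddp",
        max_epochs=10,
    )
    trainer.fit(ToyModel(), data)  # the real run fails partway through training
```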
Error messages and logs
See the traceback in the bug description above (AttributeError on `ddp_weakref.reducer`, followed by an NCCL collective timeout on another rank).
Environment
Current environment
#- PyTorch Lightning Version (e.g., 2.5.0):
#- PyTorch Version (e.g., 2.5):
#- Python version (e.g., 3.12):
#- OS (e.g., Linux):
#- CUDA/cuDNN version:
#- GPU models and configuration:
#- How you installed Lightning (`conda`, `pip`, source):
More info
No response