I encountered this warning when using DDP. How can I locate its source?
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
/home/ps/anaconda3/envs/py38/lib/python3.8/site-packages/torch/autograd/__init__.py:197: UserWarning: Grad strides do not match bucket view strides. This may indicate grad was not created according to the gradient layout contract, or that the param's strides changed since DDP was constructed. This is not an error, but may impair performance.
grad.sizes() = [32, 64, 1, 1, 1], strides() = [64, 1, 64, 64, 64]
bucket_view.sizes() = [32, 64, 1, 1, 1], strides() = [64, 1, 1, 1, 1] (Triggered internally at ../torch/csrc/distributed/c10d/reducer.cpp:325.)
Here are some steps to locate and address the source of this warning:
Check Tensor Creation:
Ensure that the tensors are created and manipulated in a way that respects their memory layout. Avoid operations that may inadvertently change the tensor strides.
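The strides printed in the warning already identify the layout involved: for a tensor of size [32, 64, 1, 1, 1], strides (64, 1, 64, 64, 64) are the channels_last_3d (NDHWC) layout, while the bucket view's (64, 1, 1, 1, 1) is the default contiguous (NCDHW) layout. A small sketch to reproduce and recognize the two:

```python
import torch

# Contiguous NCDHW layout -- what DDP's bucket view expects.
contig = torch.empty(32, 64, 1, 1, 1)

# channels_last_3d (NDHWC) layout -- what the warned grad actually has.
cl3d = contig.to(memory_format=torch.channels_last_3d)

print(contig.stride())  # (64, 1, 1, 1, 1)
print(cl3d.stride())    # (64, 1, 64, 64, 64)
```

So somewhere in the model, either an input or a parameter is in a channels-last 3-D memory format while DDP built its buckets assuming the default layout (or vice versa).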
Verify DDP Initialization:
Make sure that the DDP module is initialized after all model parameters have been correctly set up and that no operations change the parameter strides after DDP initialization.
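A minimal sketch of the ordering, using a hypothetical Conv3d layer whose weight happens to match the shape in the warning (substitute your own model; the DDP lines are commented out since they need an initialized process group):

```python
import torch
import torch.nn as nn
# from torch.nn.parallel import DistributedDataParallel as DDP

# Hypothetical layer: weight shape [32, 64, 1, 1, 1], as in the warning.
model = nn.Conv3d(in_channels=64, out_channels=32, kernel_size=1)

# Apply any device moves or memory-format changes BEFORE wrapping in DDP,
# so the reducer builds its bucket views from the final parameter strides.
model = model.to(memory_format=torch.channels_last_3d)
print(model.weight.stride())

# ddp_model = DDP(model.cuda(local_rank), device_ids=[local_rank])
```

If instead the format change happens after the DDP constructor has run, the bucket views keep the old strides and every backward pass hits the mismatch.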
Consistent Tensor Manipulation:
Ensure that all operations on the tensors are consistent and do not change the underlying memory layout. In particular, watch for layout-changing operations such as permute() or conversions to a channels-last memory format, which alter a tensor's strides without changing its shape.
Use Contiguous Tensors:
If you suspect that the strides have changed, you can make tensors contiguous before handing them to the DDP module, or make the gradients contiguous after the backward pass, by calling .contiguous() on them.
Sample code snippet to make the gradients contiguous (run after backward(), before the optimizer step; the None check skips parameters with no gradient):

for param in model.parameters():
    if param.grad is not None:
        param.grad = param.grad.contiguous()
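To pinpoint which parameter actually triggers the warning, one option (a diagnostic sketch, not part of the original snippet) is to register gradient hooks that record the name and strides of any non-contiguous grad as it arrives:

```python
import torch
import torch.nn as nn

# Hypothetical stand-in model; substitute your own (for a DDP-wrapped
# model, iterate over ddp_model.module.named_parameters()).
model = nn.Conv3d(64, 32, kernel_size=1)

flagged = []  # (parameter name, grad strides) for non-contiguous grads

def make_hook(name):
    def hook(grad):
        if not grad.is_contiguous():
            flagged.append((name, tuple(grad.stride())))
    return hook

for name, param in model.named_parameters():
    param.register_hook(make_hook(name))

out = model(torch.randn(2, 64, 1, 1, 1))
out.sum().backward()
print(flagged)  # any entries here name the layers to investigate
```

Run one training step with the hooks in place; the entries in flagged tell you which layer's gradient violates the layout contract, so you can trace back through that layer's inputs and parameters for the stride change.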