DDP + static graph can result in garbage data returned by all_gather
#18872
Labels
- 3rd party — Related to a 3rd-party
- bug — Something isn't working
- repro needed — The issue is missing a reproducible example
- ver: 2.0.x
Bug description
When I use self.all_gather in a LightningModule with strategies.DDPStrategy(static_graph=True) for multi-node inference,
the returned values are partially corrupted.
What version are you seeing the problem on?
v2.0
How to reproduce the bug
And it is called by
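The original snippet was not captured in this report. Below is a minimal sketch of the reported pattern, with a single-process gloo group standing in for Lightning's self.all_gather; validation_step here is a hypothetical stand-in, not the reporter's actual code:

```python
import os
import torch
import torch.distributed as dist

def validation_step(rank: int, world_size: int) -> torch.Tensor:
    # Mirrors the pattern from the report: each rank contributes its own
    # rank index, so all_gather should return [0, 1, ..., world_size-1].
    local = torch.tensor([rank])
    gathered = [torch.zeros_like(local) for _ in range(world_size)]
    dist.all_gather(gathered, local)
    return torch.cat(gathered)

if __name__ == "__main__":
    # Single-process gloo group just to make the sketch runnable; the
    # reported corruption only appears under multi-node DDP with
    # static_graph=True, which this cannot reproduce by itself.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("gloo", rank=0, world_size=1)
    print(validation_step(0, 1))  # prints tensor([0]) in this 1-process case
    dist.destroy_process_group()
```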
Error messages and logs
It should return results as:
torch.tensor([0, 1, 2, ..., world_size-1])
Most of the values are correct, but a few come back corrupted with very large numbers, for example:
torch.tensor([0, 1, 2, 3, 913478191043, 5, ..., world_size-1])
Environment
Additional Information
I do notice the warning that trainer.validate should not be called with DDPStrategy, which makes the LightningModule duplicate some datapoints in the last round of validation. That duplication is exactly why I use all_gather during validation - to implement a drop-last validation that discards the padded samples.
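The drop-last idea described above can be sketched as follows. This is a minimal illustration, not the reporter's code: drop_padded and its arguments are hypothetical names, and it assumes predictions and their dataset indices have already been gathered across ranks:

```python
import torch

def drop_padded(gathered_preds: torch.Tensor,
                gathered_indices: torch.Tensor,
                dataset_len: int) -> torch.Tensor:
    # DistributedSampler pads the last batch by repeating samples so every
    # rank gets equal work; after all_gather, keep each dataset index once
    # and ignore anything outside the real dataset range.
    seen = set()
    keep = []
    for pos, idx in enumerate(gathered_indices.tolist()):
        if idx not in seen and idx < dataset_len:
            seen.add(idx)
            keep.append(pos)
    return gathered_preds[torch.tensor(keep, dtype=torch.long)]
```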
It looks like it is caused by a failure to synchronize all processes during all_gather. I have tried to investigate why this happens myself, but I cannot find any clues.