DDP + static graph can result in garbage data returned by all_gather
#18872
Labels
- 3rd party — Related to a 3rd-party
- bug — Something isn't working
- repro needed — The issue is missing a reproducible example
- ver: 2.0.x
Bug description
When I use self.all_gather in a LightningModule with strategies.DDPStrategy(static_graph=True) for multi-node inference,
the returned values are partially corrupted.
What version are you seeing the problem on?
v2.0
How to reproduce the bug
And it is called by
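The original snippet was not captured in this report. Below is a minimal sketch of the reported pattern, with a single-process gloo group standing in for Lightning's self.all_gather; validation_step here is a hypothetical stand-in, not the reporter's actual code:

```python
import os
import torch
import torch.distributed as dist

def validation_step(rank: int, world_size: int) -> torch.Tensor:
    # Mirrors the pattern from the report: each rank contributes its own
    # rank index, so all_gather should return [0, 1, ..., world_size-1].
    local = torch.tensor([rank])
    gathered = [torch.zeros_like(local) for _ in range(world_size)]
    dist.all_gather(gathered, local)
    return torch.cat(gathered)

if __name__ == "__main__":
    # Single-process gloo group just to make the sketch runnable; the
    # reported corruption only appears under multi-node DDP with
    # static_graph=True, which this cannot reproduce by itself.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("gloo", rank=0, world_size=1)
    print(validation_step(0, 1))  # prints tensor([0]) in this 1-process case
    dist.destroy_process_group()
```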
Error messages and logs
It should return results as:
torch.tensor([0, 1, 2, ..., world_size-1])
Most of the values are correct, but a few come back corrupted with very large numbers, for example:
torch.tensor([0, 1, 2, 3, 913478191043, 5, ..., world_size-1])
Environment
Additional Information
I do notice the warning that trainer.validate should not be called with DDPStrategy, which makes the LightningModule duplicate some datapoints in the last round of validation. That duplication is exactly why I use all_gather during validation - to implement a drop-last validation that discards the padded samples.
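The drop-last idea described above can be sketched as follows. This is a minimal illustration, not the reporter's code: drop_padded and its arguments are hypothetical names, and it assumes predictions and their dataset indices have already been gathered across ranks:

```python
import torch

def drop_padded(gathered_preds: torch.Tensor,
                gathered_indices: torch.Tensor,
                dataset_len: int) -> torch.Tensor:
    # DistributedSampler pads the last batch by repeating samples so every
    # rank gets equal work; after all_gather, keep each dataset index once
    # and ignore anything outside the real dataset range.
    seen = set()
    keep = []
    for pos, idx in enumerate(gathered_indices.tolist()):
        if idx not in seen and idx < dataset_len:
            seen.add(idx)
            keep.append(pos)
    return gathered_preds[torch.tensor(keep, dtype=torch.long)]
```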
It looks like it is caused by a failure to synchronize all processes during all_gather. I have tried to investigate why this happens myself, but I cannot find any clues.