
DDP + static graph can result in garbage data returned by all_gather #18872

Open

mooninrain opened this issue Oct 26, 2023 · 2 comments

Labels: 3rd party (Related to a 3rd-party) · bug (Something isn't working) · repro needed (The issue is missing a reproducible example) · ver: 2.0.x

Comments


mooninrain commented Oct 26, 2023

Bug description

When I use self.all_gather in a LightningModule with strategies.DDPStrategy(static_graph=True) for multi-node inference,
the returned values are partially corrupted.

What version are you seeing the problem on?

v2.0

How to reproduce the bug

class ModelWrapper(pl.LightningModule):
    ...
    def on_validation_epoch_end(self) -> None:
        super().on_validation_epoch_end()

        # gather each rank's index; self.device is the GPU assigned to this rank
        test_tensor = torch.tensor([self.global_rank], device=self.device)
        self.trainer.strategy.barrier()
        test_all = self.all_gather(test_tensor)
        self.trainer.strategy.barrier()
        print(test_all)

And it is called by

    # set up a datamodule
    ...

    trainer = pl.Trainer(
        accelerator="gpu",
        devices=args.devices,
        num_nodes=args.num_nodes,
        strategy=strategies.DDPStrategy(static_graph=True),
    )
    trainer.validate(model, datamodule=datamodule)
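
For reference, here is a self-contained sketch of the setup above. It is not taken from the original report: the dummy linear layer, the random TensorDataset, and the single-node settings (devices=2, num_nodes=1) are assumptions, so adjust them to match the multi-node environment where the corruption appears.

import torch
import pytorch_lightning as pl
from pytorch_lightning import strategies
from torch.utils.data import DataLoader, TensorDataset


class ModelWrapper(pl.LightningModule):
    def __init__(self):
        super().__init__()
        # dummy parameter so DDP has something to wrap
        self.layer = torch.nn.Linear(1, 1)

    def validation_step(self, batch, batch_idx):
        return self.layer(batch[0]).sum()

    def on_validation_epoch_end(self) -> None:
        test_tensor = torch.tensor([self.global_rank], device=self.device)
        self.trainer.strategy.barrier()
        test_all = self.all_gather(test_tensor)
        self.trainer.strategy.barrier()
        print(test_all)


if __name__ == "__main__":
    loader = DataLoader(TensorDataset(torch.randn(64, 1)), batch_size=4)
    trainer = pl.Trainer(
        accelerator="gpu",
        devices=2,       # assumption: set to the GPUs available per node
        num_nodes=1,     # the report observes the corruption in multi-node runs
        strategy=strategies.DDPStrategy(static_graph=True),
    )
    trainer.validate(ModelWrapper(), dataloaders=loader)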

Error messages and logs

It should return results as:
torch.tensor([0, 1, 2, ..., world_size-1])

But while most of the values are right, a few come back corrupted, containing very large numbers, like
torch.tensor([0, 1, 2, 3, 913478191043, 5, ..., world_size - 1])

Environment

Current environment
- Lightning Component: LightningModule
- PyTorch Lightning Version: 2.0.3
- PyTorch Version: 1.12.1
- Python version: 3.9
- OS: Linux
- CUDA/cuDNN version: 11.6
- GPU models and configuration:
- How you installed Lightning: conda

Additional Information

I do notice the warning that trainer.validate should not be called with DDPStrategy, which makes the LightningModule duplicate some datapoints for the last round of validation. That is actually exactly why I use all_gather during validation: to implement a drop-last validation.
It looks like the corruption is caused by a failure to block all processes during all_gather. I have tried to investigate why this happens myself, but I cannot find any clues.
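
One way to narrow this down, sketched below and not part of the original report, is to compare Lightning's self.all_gather against a direct torch.distributed.all_gather issued from the same hook. If the raw collective returns correct values while the wrapper does not, the problem likely sits in how DDP's static-graph hooks interact with the strategy's all_gather. The helper name cross_check_all_gather and the `module` handle are hypothetical; the function would be called from on_validation_epoch_end.

import torch
import torch.distributed as dist

def cross_check_all_gather(module):
    # `module` is the LightningModule, called from inside on_validation_epoch_end
    test_tensor = torch.tensor([module.global_rank], device=module.device)

    # path 1: Lightning's wrapper (reported to return corrupted values here)
    via_lightning = module.all_gather(test_tensor)

    # path 2: plain torch.distributed, bypassing the strategy wrapper
    gathered = [torch.zeros_like(test_tensor) for _ in range(dist.get_world_size())]
    dist.all_gather(gathered, test_tensor)
    via_torch = torch.cat(gathered)

    print(f"rank {module.global_rank}: "
          f"lightning={via_lightning.flatten().tolist()}, torch={via_torch.tolist()}")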

@mooninrain added the bug and needs triage labels on Oct 26, 2023
@awaelchli changed the title from "All_gather gets incorrect data within on_validation_epoch_end" to "DDP + static graph can result in garbage data returned by all_gather" on Nov 22, 2023
@awaelchli added the 3rd party and repro needed labels and removed the needs triage label on Nov 22, 2023

adaruna3 commented Feb 7, 2024

Any updates on this? I'm observing the same problem today. all_gather works fine for 2 GPUs with the DDP strategy, but with 3 or 4 GPUs some of the returned values are suddenly corrupted.

mooninrain (Author) commented

> Any updates on this? I'm observing the same problem today. all_gather works fine for 2 GPUs with the DDP strategy, but with 3 or 4 GPUs some of the returned values are suddenly corrupted.

Sadly, nothing from my side. We still face the same problem with PyTorch 2.0 and Python 3.10.
