Calling trainer.fit twice with spawn strategies won't work as expected #18775

Open
carmocca opened this issue Oct 10, 2023 · 0 comments
Labels
bug (Something isn't working), priority: 1 (Medium priority task), strategy: ddp (DistributedDataParallel), strategy: xla, ver: 2.0.x
Milestone

Comments

@carmocca
Contributor

carmocca commented Oct 10, 2023

Bug description

Because state in the spawned processes is not shared with the main process, the spawn launcher saves a checkpoint of the weights before the spawned processes finish, and the main process then loads it:

https://github.com/Lightning-AI/lightning/blob/984f49f7195ddc67e961c7c498ee6e19fc0cecb5/src/lightning/pytorch/strategies/launchers/multiprocessing.py#L190-L195
https://github.com/Lightning-AI/lightning/blob/984f49f7195ddc67e961c7c498ee6e19fc0cecb5/src/lightning/pytorch/strategies/launchers/multiprocessing.py#L162-L168

As a result, only the weights reach the main process: the optimizer states are not loaded, and neither is any other state in the trainer.

This isn't a problem with calling test/validate/predict after fit.
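
To make the failure mode concrete, here is a toy, single-process simulation of the behavior described above. It is only a sketch: `ToyTrainerState` and `toy_fit_in_worker` are hypothetical names invented for illustration, not Lightning APIs, and a deep copy stands in for the process boundary.

```python
import copy


class ToyTrainerState:
    """Hypothetical stand-in for everything a trainer holds (not a Lightning class)."""

    def __init__(self):
        self.weights = {"w": 0.0}
        self.optimizer_state = {"momentum": 0.0}
        self.global_step = 0


def toy_fit_in_worker(state, steps):
    """Pretend training on a private copy, as a spawned process would do."""
    worker = copy.deepcopy(state)  # the worker never mutates main-process objects
    for _ in range(steps):
        worker.optimizer_state["momentum"] = 0.9 * worker.optimizer_state["momentum"] + 1.0
        worker.weights["w"] -= 0.1 * worker.optimizer_state["momentum"]
        worker.global_step += 1
    # Only the weights are handed back, mirroring a weights-only checkpoint.
    return {"weights": copy.deepcopy(worker.weights)}


main = ToyTrainerState()
result = toy_fit_in_worker(main, steps=3)
main.weights = result["weights"]  # the main process loads the weights

print(main.weights)          # updated by the worker
print(main.optimizer_state)  # unchanged: the worker's momentum was lost
print(main.global_step)      # unchanged: the step counter was lost
```

A second call to `toy_fit_in_worker(main, ...)` would start from the trained weights but with a fresh optimizer state and step counter, and nothing signals that anything is missing.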

Solution

Since this is a silent correctness issue, we should raise an error in the short term.

The launcher can check whether fit has already been called and, if it is called again, raise a NotImplementedError.
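
A minimal sketch of such a guard follows. `ToySpawnLauncher`, its `stage` argument, and the error message are assumptions made for illustration; the real launcher's signature and the way it learns which trainer entry point is running may differ.

```python
class ToySpawnLauncher:
    """Hypothetical launcher used only to illustrate the guard (not the Lightning class)."""

    def __init__(self):
        self._fit_already_launched = False

    def launch(self, function, *args, stage, **kwargs):
        # `stage` is an assumed way of telling the launcher which entry point is running.
        if stage == "fit":
            if self._fit_already_launched:
                raise NotImplementedError(
                    "Calling fit more than once with a spawn-based launcher is not supported: "
                    "only the weights are carried over between calls."
                )
            self._fit_already_launched = True
        # A real launcher would start worker processes here; the toy just calls the function.
        return function(*args, **kwargs)


launcher = ToySpawnLauncher()
launcher.launch(lambda: "trained", stage="fit")  # first fit is allowed
launcher.launch(lambda: "tested", stage="test")  # test/validate/predict stay allowed

try:
    launcher.launch(lambda: "trained again", stage="fit")
except NotImplementedError as err:
    print(f"second fit rejected: {err}")
```

Keeping the flag on the launcher means the check runs before any process is started, so the user gets the error immediately instead of after a full training run.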

In the longer term, we can save a full checkpoint that contains all the relevant data and then lift this restriction.
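
The longer-term idea could look roughly like the sketch below. The checkpoint keys and the helper names (`ToyState`, `dump_full_checkpoint`, `load_full_checkpoint`) are hypothetical; which pieces of trainer state a real full checkpoint would need is left open here.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class ToyState:
    """Hypothetical container for the state that must survive the spawned region."""

    weights: dict = field(default_factory=lambda: {"w": 0.0})
    optimizer_state: dict = field(default_factory=lambda: {"momentum": 0.0})
    global_step: int = 0


def dump_full_checkpoint(state):
    """Capture weights, optimizer state, and trainer counters, not just the weights."""
    return copy.deepcopy(
        {
            "weights": state.weights,
            "optimizer_state": state.optimizer_state,
            "global_step": state.global_step,
        }
    )


def load_full_checkpoint(state, checkpoint):
    """Restore every captured piece of state on the main process."""
    state.weights = copy.deepcopy(checkpoint["weights"])
    state.optimizer_state = copy.deepcopy(checkpoint["optimizer_state"])
    state.global_step = checkpoint["global_step"]


# State as it might look at the end of the spawned region (made-up values).
worker_state = ToyState(weights={"w": -0.5}, optimizer_state={"momentum": 2.0}, global_step=3)

main_state = ToyState()
load_full_checkpoint(main_state, dump_full_checkpoint(worker_state))

print(main_state == worker_state)  # True: a second fit could continue from step 3
```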

cc @tchaton @justusschock @awaelchli @carmocca @JackCaoG @Liyang90 @gkroiz

carmocca added this to the 2.2 milestone Oct 10, 2023
carmocca added the priority: 1 (Medium priority task) label Oct 10, 2023
awaelchli added the strategy: ddp (DistributedDataParallel) label and removed the strategy: ddp spawn label Nov 4, 2023
awaelchli modified the milestones: 2.2, 2.3 Feb 3, 2024
awaelchli modified the milestones: 2.3, future Jun 2, 2024

2 participants