Lightning creates two DeepSpeedEngine instances for the same model #17523
Description
Bug description
Hello Lightning team!
We have received several user reports (e.g. deepspeedai/DeepSpeed#3068) about errors when using Lightning with DeepSpeed. The issue is that Lightning creates two DeepSpeedEngine instances for the same model at https://github.com/Lightning-AI/lightning/blob/6ec9a6bd9e792f505ebc931742d4235f311eb289/src/lightning/pytorch/strategies/deepspeed.py#L447-L450
Neither DeepSpeedEngine is aware of the existence of the other. So when ZeRO stage 3 optimization is enabled, the two engines each manage and operate on the same set of parameters independently, which leads to the crash.
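To make the pattern concrete, here is a minimal sketch (not Lightning's actual code) of what effectively happens: two `deepspeed.initialize` calls over the same `nn.Module`, each producing its own engine. The toy model and config values are placeholders, and the script assumes a distributed launch (e.g. via the `deepspeed` launcher).

```python
# Sketch of the problematic pattern: two engines built from one module.
# Placeholder model/config; assumes a distributed launch.
import deepspeed
import torch

model = torch.nn.Linear(32, 32)

ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "zero_optimization": {"stage": 3},
    "optimizer": {"type": "Adam", "params": {"lr": 1e-3}},
}

# Engine 1: built for training (roughly the _initialize_deepspeed_train path).
train_engine, _, _, _ = deepspeed.initialize(
    model=model, model_parameters=model.parameters(), config=ds_config
)

# Engine 2: built again from the very same module for evaluation
# (roughly the _initialize_deepspeed_inference path). Neither engine knows
# about the other, so under ZeRO stage 3 both partition and gather the same
# parameters independently, which is what triggers the reported errors.
eval_engine, _, _, _ = deepspeed.initialize(
    model=model, model_parameters=model.parameters(), config=ds_config
)
```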
We tried to tackle this issue from our end by binding the parameter management to the model so that it can be shared among DeepSpeedEngine instances, but we realized that Lightning creates different wrapper instances for the model before passing it to DeepSpeed, so from DeepSpeed's side they look like different models.
DeepSpeed can run both training and validation on the same DeepSpeedEngine instance (see the sketch below). We therefore want to reach out to understand the intuition behind using multiple DeepSpeedEngines (or wrappers), and to check whether there is anything we can do on our end to make a single DeepSpeedEngine usable for both training and validation in your use case.
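For reference, a hedged sketch of the single-engine pattern; the model, data, and config are placeholders, and the script assumes a distributed launch:

```python
# Sketch: one DeepSpeedEngine used for both training and validation.
import deepspeed
import torch

model = torch.nn.Linear(32, 2)

ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "zero_optimization": {"stage": 3},
    "optimizer": {"type": "Adam", "params": {"lr": 1e-3}},
}

engine, _, _, _ = deepspeed.initialize(
    model=model, model_parameters=model.parameters(), config=ds_config
)

# Training step on the engine.
engine.train()
x = torch.randn(4, 32).to(engine.device)
y = torch.randn(4, 2).to(engine.device)
loss = torch.nn.functional.mse_loss(engine(x), y)
engine.backward(loss)
engine.step()

# Validation step on the same engine: no second engine is needed.
engine.eval()
with torch.no_grad():
    val_loss = torch.nn.functional.mse_loss(engine(x), y)
```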
What version are you seeing the problem on?
master
How to reproduce the bug
There is a pretty nice reproduction script from the user: https://github.com/microsoft/DeepSpeed/issues/3068#issuecomment-1486539136
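As a rough sketch only (not the user's exact script, which is linked above), the general shape of a setup that goes through both engine-creation paths is a `fit` followed by a `validate` on the same model with the ZeRO stage 3 strategy; the module, data, and Trainer arguments below are placeholders:

```python
# Rough sketch, NOT the user's exact reproduction: fit + validate with ZeRO stage 3.
import torch
from torch.utils.data import DataLoader, TensorDataset
import lightning.pytorch as pl


class ToyModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return torch.nn.functional.mse_loss(self.layer(x), y)

    def validation_step(self, batch, batch_idx):
        x, y = batch
        self.log("val_loss", torch.nn.functional.mse_loss(self.layer(x), y))

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)


if __name__ == "__main__":
    data = DataLoader(TensorDataset(torch.randn(64, 32), torch.randn(64, 2)), batch_size=4)
    model = ToyModel()
    trainer = pl.Trainer(
        accelerator="gpu", devices=2, strategy="deepspeed_stage_3", max_epochs=1
    )
    # fit builds the training DeepSpeedEngine (_initialize_deepspeed_train) ...
    trainer.fit(model, data, data)
    # ... and validate builds a second, inference engine
    # (_initialize_deepspeed_inference) around the same module.
    trainer.validate(model, data)
```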
Error messages and logs
No response
Environment
No response
More info
No response
cc @awaelchli