
Lightning creates two DeepSpeedEngine instances for the same model #17523

Closed
@HeyangQin


Bug description

Hello Lightning team!

We have received several user reports (e.g. deepspeedai/DeepSpeed#3068) about errors when using Lightning with DeepSpeed. The issue is that Lightning creates two DeepSpeedEngine instances for the same model at https://github.com/Lightning-AI/lightning/blob/6ec9a6bd9e792f505ebc931742d4235f311eb289/src/lightning/pytorch/strategies/deepspeed.py#L447-L450
Neither DeepSpeedEngine is aware of the other's existence, so under ZeRO stage 3 optimization the two engines each manage and operate on the same set of parameters independently, which leads to the crash.
We tried to tackle this from our end by binding the parameter management to the model so it could be shared among DeepSpeedEngine instances, but we realized that Lightning creates different wrapper instances for the model before passing it to DeepSpeed, so from DeepSpeed's perspective they look like different models.
DeepSpeed can run both training and validation on the same DeepSpeedEngine instance, so we want to reach out to understand the intuition behind using multiple DeepSpeedEngines (or wrappers), and to check whether there is anything we can do on our end to make a single DeepSpeedEngine usable for both training and validation in your use case.
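To make the pattern concrete, here is a minimal sketch of what we mean. This is not Lightning's actual code: the `Wrapper` class, the toy model, and the ZeRO config below are assumptions for illustration only, and the script is assumed to be launched with the `deepspeed` launcher on a single GPU.

```python
import deepspeed
import torch


# Illustrative stand-in for the separate module wrappers built around the
# same underlying model before it is handed to DeepSpeed (hypothetical class,
# not Lightning's).
class Wrapper(torch.nn.Module):
    def __init__(self, module: torch.nn.Module):
        super().__init__()
        self.module = module

    def forward(self, x):
        return self.module(x)


model = torch.nn.Linear(8, 8)  # toy model standing in for the user's module

ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "zero_optimization": {"stage": 3},
    "optimizer": {"type": "Adam", "params": {"lr": 1e-3}},
}

# First engine, e.g. for training.
train_wrapper = Wrapper(model)
train_engine, _, _, _ = deepspeed.initialize(
    model=train_wrapper,
    model_parameters=train_wrapper.parameters(),
    config=ds_config,
)

# Second engine, e.g. for validation, built from a different wrapper around
# the same underlying model. Neither engine is aware of the other, yet with
# ZeRO stage 3 both try to partition and manage the same parameters, which is
# the clash described above.
eval_wrapper = Wrapper(model)
eval_engine, _, _, _ = deepspeed.initialize(
    model=eval_wrapper,
    model_parameters=eval_wrapper.parameters(),
    config=ds_config,
)
```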

What version are you seeing the problem on?

master

How to reproduce the bug

There is a nice reproduction script from an affected user: https://github.com/microsoft/DeepSpeed/issues/3068#issuecomment-1486539136

Error messages and logs

No response

Environment

No response

More info

No response

cc @awaelchli
