Deepspeed vs DDP #19246
Comments
Hey @jpatel-bdai
There are not. The DeepSpeed integration in Lightning has proven challenging to maintain, and the DeepSpeed maintainers themselves don't write any tests for their software. It is unclear what the future of DeepSpeed is in Lightning. In any case, I think one thing to check in your experiment is whether the modules are initialized the same (same random weights). And setting
I verified that the modules are initialized with the same weights and set
In that case, what path do you suggest moving forward? We ported our codebase to Lightning because it makes using the DeepSpeed and FSDP strategies easier.
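For anyone who wants to repeat the initialization check discussed above, here is a minimal sketch, not taken from the original thread. It fingerprints a model's weights so two runs can be compared by a single string. The helper names `fingerprint` and `differing_params` are hypothetical, and the input is assumed to be a plain mapping from parameter name to a flat list of floats.

```python
import hashlib
import struct
from typing import Dict, List, Sequence


def fingerprint(weights: Dict[str, Sequence[float]]) -> str:
    """Return a SHA-256 hex digest over parameter names and values."""
    digest = hashlib.sha256()
    for name in sorted(weights):  # sort so insertion order does not matter
        digest.update(name.encode("utf-8") + b"\0")
        for value in weights[name]:
            # Pack every value as a little-endian float64 for a stable byte layout.
            digest.update(struct.pack("<d", float(value)))
    return digest.hexdigest()


def differing_params(
    a: Dict[str, Sequence[float]], b: Dict[str, Sequence[float]]
) -> List[str]:
    """List parameter names that are missing on one side or hold different values."""
    names = sorted(set(a) | set(b))
    return [
        n for n in names
        if n not in a or n not in b or list(a[n]) != list(b[n])
    ]
```

With PyTorch, such a mapping could be built with something like `{k: v.detach().flatten().tolist() for k, v in model.state_dict().items()}`, captured before training starts under each strategy. If the two fingerprints match, the initial weights are identical and the divergence must come from somewhere later in the training step.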
Bug description
It is expected that on a single GPU, the DDP and DeepSpeed strategies (i.e. deepspeed_stage_1, deepspeed_stage_2 and so on) should give exactly the same loss values (if the seed is fixed). I have a model that uses torch.nn.Parameter, and the forward pass and gradient updates with these two strategies give different loss values as training progresses. However, the model code is too big to share. I have this basic code where I switch the strategy between deepspeed_stage_1 and ddp with different precision values (32 and 16), and I get different results when changing the strategy. Are there tests carried out to ensure the DeepSpeed implementation matches DDP?
What version are you seeing the problem on?
v2.1
How to reproduce the bug
Error messages and logs here please
#- Lightning Component (e.g. Trainer, LightningModule):
#- PyTorch Lightning Version : 2.1.0
#- PyTorch Version: 2.1.0+cu121
#- Python version : Python 3.10.12
#- OS (e.g., Linux): Debian
#- CUDA/cuDNN version: NVIDIA-SMI 535.54.03 Driver Version: 535.54.03 CUDA Version: 12.2
#- GPU models and configuration: NVIDIA L4 (24GB GPU VRAM)
#- How you installed Lightning (conda, pip, source): pip install lightning
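
To pin down where the DDP and DeepSpeed runs described in this issue start to disagree, one option is to log the per-step loss from each run and look for the first step that differs. The sketch below is an illustration, not the reporter's reproduction code. The function name `first_divergence` and the tolerance defaults are assumptions, and the loss values are assumed to have been collected into plain Python lists already.

```python
import math
from typing import Optional, Sequence


def first_divergence(
    losses_a: Sequence[float],
    losses_b: Sequence[float],
    rel_tol: float = 1e-6,   # illustrative tolerance, tune per precision setting
    abs_tol: float = 1e-8,
) -> Optional[int]:
    """Return the first step index where the two loss curves differ
    beyond the given tolerance, or None if they agree on every shared step."""
    # zip stops at the shorter sequence, so only shared steps are compared.
    for step, (a, b) in enumerate(zip(losses_a, losses_b)):
        if not math.isclose(a, b, rel_tol=rel_tol, abs_tol=abs_tol):
            return step
    return None
```

A result of `0` means the very first forward pass already differs, which points at initialization or precision handling. A later index suggests the optimizer or gradient update path is where the two strategies part ways.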