zero3 checkpoint frozen params #3205
@stas00, FYI
I tried it out - and when the checkpoint is saved, I get almost all frozen weights saved with
I think they need to be gathered before saving. But we probably shouldn't do that on every process, as it would be quite slow if the model has 50% frozen weights. If the weights are the same on every process, saving them once should be enough (at least on a shared fs; it won't work on a non-shared fs). The following will do the gathering:
But the saved tensors still appear to be of size 0, so that fix doesn't seem to be it. Ah, I see - the original code will never succeed because frozen params aren't in
I'm also wondering whether this would even work for a huge model with a lot of frozen params: there might not be enough memory to gather them all. Perhaps we should save their fp16 shards instead? That would be much faster.
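To make the size-0 symptom and the two options above concrete, here is a minimal toy sketch in plain Python. It does not use DeepSpeed and does not reproduce its internals; `partition`, `gather`, `naive_state_dict` and `gathered_state_dict` are hypothetical names, and the assumption is only that ZeRO stage 3 leaves an empty placeholder in the module while each rank holds one flat, padded shard.

```python
import math

def partition(values, world_size):
    """Split a flat parameter into equal per-rank shards, zero-padding the tail."""
    shard_len = math.ceil(len(values) / world_size)
    padded = list(values) + [0.0] * (shard_len * world_size - len(values))
    return [padded[r * shard_len:(r + 1) * shard_len] for r in range(world_size)]

def gather(shards, numel):
    """Concatenate all shards and drop the padding (stand-in for an all-gather)."""
    flat = [x for shard in shards for x in shard]
    return flat[:numel]

def naive_state_dict(name):
    # Assumption: the module-level tensor is only an empty placeholder,
    # so saving it directly produces a tensor of size 0.
    return {name: []}

def gathered_state_dict(name, shards, numel, rank):
    # In a real run every rank would take part in the gather collective;
    # here only the write is restricted to rank 0 to avoid duplicate saves.
    if rank != 0:
        return None
    return {name: gather(shards, numel)}

# Toy frozen parameter with 5 elements, partitioned across 2 ranks.
frozen = [0.1, 0.2, 0.3, 0.4, 0.5]
shards = partition(frozen, world_size=2)

print(naive_state_dict("frozen.weight"))               # placeholder only
print(gathered_state_dict("frozen.weight", shards, len(frozen), rank=0))
```

The gathered path needs memory for the full parameter on the saving rank, whereas writing `shards[rank]` from each rank needs only one shard per process, which is the trade-off raised in the comment above.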
…into olruwase/issue_3090
Hi @stas00 and @tjruwase, thanks for your work on this. I'm just checking to see if this would fix an error I'm getting using DeepSpeed and LoRA. Let me know if this isn't the place to ask. I'm able to train "t5" using DeepSpeed Stage 3 and LoRA, however when I run the
Thanks again for all your help!
@shaankhosla, thanks for your interest. Please open a new ticket for this problem. It would be very helpful to provide more details for reproducing the problem in that ticket.
Here it is: #3291 :)
Thank you for the quick fix and merge, Tunji and the team!
Enable checkpoint load/save of frozen params in ZeRO stage 3.
Fix [BUG] save/load checkpoint in zero3 fails to preserve frozen weights #3090
Pending task: Update zero_to_fp32.py to recover frozen weights.
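As a rough illustration of what recovering frozen weights from per-rank shards could involve, the sketch below merges flat shards back into full tensors using recorded shapes. It is a toy model under stated assumptions, not the layout or code of zero_to_fp32.py: `merge_frozen_shards`, `reshape`, the parameter names, and the idea that each rank stores `{name: flat_shard}` alongside a `{name: shape}` table are all hypothetical.

```python
from math import prod

def reshape(flat, shape):
    """Rebuild a nested list of the given shape from a flat list."""
    if len(shape) <= 1:
        return list(flat)
    step = prod(shape[1:])
    return [reshape(flat[i * step:(i + 1) * step], shape[1:])
            for i in range(shape[0])]

def merge_frozen_shards(rank_shards, shapes):
    """Merge per-rank flat shards into full parameters.

    rank_shards: list indexed by rank of {name: flat shard (list of numbers)}
    shapes:      {name: full (unpartitioned) shape as a tuple}
    """
    merged = {}
    for name, shape in shapes.items():
        # Concatenate shards in rank order, then drop trailing padding.
        flat = [x for shards in rank_shards for x in shards[name]]
        numel = prod(shape)
        if len(flat) < numel:
            raise ValueError(f"shards for {name} hold fewer than {numel} elements")
        merged[name] = reshape(flat[:numel], shape)
    return merged

# Hypothetical checkpoint contents for a world size of 2.
rank_shards = [
    {"embed.weight": [1, 2, 3], "embed.bias": [7, 8]},   # rank 0
    {"embed.weight": [4, 5, 6], "embed.bias": [9, 0]},   # rank 1 (bias padded)
]
shapes = {"embed.weight": (2, 3), "embed.bias": (3,)}

recovered = merge_frozen_shards(rank_shards, shapes)
print(recovered)
```

Recording the full shape next to each shard is what lets the padding be stripped and the tensor be restored without ever gathering the parameters during training.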