Enable universal checkpoint for zero stage 1 #4516
Conversation
@stas00, FYI
Amazing! Thank you for starting to work on this super-essential feature, Tunji!
* Enable uni_ckpt for z1
* Remove logging fix to separate PR. Relocate conversion script to avoid logging circular import issue
* Formatting fix
* PR feedback
* Handle replicated params
* Detect bf16_optimizer
* Docs
* Fix docs
os.makedirs(param_base_path, exist_ok=True)
cnt += 1
counter = f"{dp_index:0>2d}"
Dear @tjruwase,
I'm currently examining a scenario where the maximum dp_index is 127. In lexicographic (string) order, "127" sorts before "13", which raises a question about the tensor sorting at line 144: could that ordering cause the tensors to be sorted incorrectly once dp_index exceeds two digits?
I appreciate your insight on this matter.
Best regards,
Junfeng
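To make the concern concrete, here is a minimal sketch in plain Python (illustrative only, not the DeepSpeed code itself): with the `{dp_index:0>2d}` format, indices of 100 and above become three-character strings, so a plain string sort places them before most two-digit indices.

```python
# Minimal sketch of the ordering concern; illustrative only, not DeepSpeed code.
indices = [13, 99, 100, 127]

# f"{i:0>2d}" pads to *at least* two characters, so 100 and 127
# come out as three-character strings.
as_strings = [f"{i:0>2d}" for i in indices]

print(sorted(as_strings))  # ['100', '127', '13', '99']  <- lexicographic order
print(sorted(indices))     # [13, 99, 100, 127]          <- numeric order
```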
A specific example is as follows:
./temp/model.transformer_encoder.layers.19.self_attn.in_proj_weight => fp32.100, torch.Size([2187604])
./temp/model.transformer_encoder.layers.19.self_attn.in_proj_weight => fp32.99, torch.Size([48144044])
@rgtjf, thanks for sharing this issue. Do you mind creating a new ticket for it? I can see that line 130 may not generalize to larger scales. It would be great if you could share more details in a new ticket. Thanks!
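One possible direction, sketched here only as an illustration (it assumes fragment paths end in an `fp32.<dp_index>` suffix and is not necessarily the fix adopted for this issue), is to sort on the numeric dp_index parsed from the path rather than on the raw string:

```python
# Hypothetical sort key, shown for illustration; assumes paths end in "fp32.<dp_index>".
def dp_index_key(path: str) -> int:
    return int(path.rsplit(".", 1)[-1])

paths = ["fp32.100", "fp32.99", "fp32.13"]
print(sorted(paths))                    # ['fp32.100', 'fp32.13', 'fp32.99']
print(sorted(paths, key=dp_index_key))  # ['fp32.13', 'fp32.99', 'fp32.100']
```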
Generalize universal checkpointing in DS:
Tested with Megatron-DS GPT using companion PR microsoft/Megatron-DeepSpeed#265
Fix #2921
- [ ] Tutorial: Defer