Enable universal checkpoint for zero stage 1 #4516
Conversation
@stas00, FYI
Amazing! Thank you for starting to work on this super-essential feature, Tunji!
* Enable uni_ckpt for z1
* Remove logging fix to separate PR. Relocate conversion script to avoid logging circular import issue
* Formatting fix
* PR feedback
* Handle replicated params
* Detect bf16_optimizer
* Docs
* Fix docs
os.makedirs(param_base_path, exist_ok=True)
cnt += 1
counter = f"{dp_index:0>2d}"
Dear @tjruwase,
I'm currently examining a scenario where the maximum dp_index is 127. In lexicographic (string) order, "127" sorts before "13", which raises a question about the tensor sorting at line 144: could that ordering cause the tensors to be sorted incorrectly once dp_index exceeds two digits?
I appreciate your insight on this matter.
Best regards,
Junfeng
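To make the concern concrete, here is a minimal sketch in plain Python (illustrative only, not the DeepSpeed code itself): with the `{dp_index:0>2d}` format, indices of 100 and above become three-character strings, so a plain string sort places them before most two-digit indices.

```python
# Minimal sketch of the ordering concern; illustrative only, not DeepSpeed code.
indices = [13, 99, 100, 127]

# f"{i:0>2d}" pads to *at least* two characters, so 100 and 127
# come out as three-character strings.
as_strings = [f"{i:0>2d}" for i in indices]

print(sorted(as_strings))  # ['100', '127', '13', '99']  <- lexicographic order
print(sorted(indices))     # [13, 99, 100, 127]          <- numeric order
```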
A specific example is as follows:
./temp/model.transformer_encoder.layers.19.self_attn.in_proj_weight => fp32.100, torch.Size([2187604])
./temp/model.transformer_encoder.layers.19.self_attn.in_proj_weight => fp32.99, torch.Size([48144044])
@rgtjf, thanks for sharing this issue. Do you mind creating a new ticket for it? I can see that line 130 may not generalize to larger scales. It would be great if you could share more details in a new ticket. Thanks!
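One possible direction, sketched here only as an illustration (it assumes fragment paths end in an `fp32.<dp_index>` suffix and is not necessarily the fix adopted for this issue), is to sort on the numeric dp_index parsed from the path rather than on the raw string:

```python
# Hypothetical sort key, shown for illustration; assumes paths end in "fp32.<dp_index>".
def dp_index_key(path: str) -> int:
    return int(path.rsplit(".", 1)[-1])

paths = ["fp32.100", "fp32.99", "fp32.13"]
print(sorted(paths))                    # ['fp32.100', 'fp32.13', 'fp32.99']
print(sorted(paths, key=dp_index_key))  # ['fp32.13', 'fp32.99', 'fp32.100']
```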
Generalize universal checkpointing in DS:
Tested with Megatron-DS GPT using companion PR microsoft/Megatron-DeepSpeed#265
Fix #2921
- [ ] Tutorial: Defer