
zero3 checkpoint frozen params #3205

Merged: 26 commits merged into master on Apr 20, 2023

Conversation

@tjruwase (Contributor, PR author)

@stas00, FYI

@stas00 (Collaborator) commented Apr 13, 2023

I tried it out, and when the checkpoint is saved, almost all of the frozen weights are saved with size [0]:

python tools/convert_checkpoint/inspect_checkpoint.py /hf/m4-master-3/save_dir/opt_step-10/accelerator_state/pytorch_model/zero_pp_rank_0_mp_rank_00_model_states.pt
loading checkpoint file: /hf/m4-master-3/save_dir/opt_step-10/accelerator_state/pytorch_model/zero_pp_rank_0_mp_rank_00_model_states.pt
[tensor] module.lm_head.weight = torch.Size([0])
[tensor] module.lm_head.additional_fc.weight = torch.Size([0])
[tensor] module.model.decoder.embed_tokens.weight = torch.Size([0])
[...]
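
For reference, a minimal stand-in for that inspection (a rough sketch, not the actual inspect_checkpoint.py; the path is a placeholder):

import torch

ckpt_path = "zero_pp_rank_0_mp_rank_00_model_states.pt"  # placeholder path
state = torch.load(ckpt_path, map_location="cpu")

def walk(obj, prefix=""):
    # Recursively descend nested dicts and report each tensor's shape.
    if torch.is_tensor(obj):
        print(f"[tensor] {prefix} = {list(obj.shape)}")
    elif isinstance(obj, dict):
        for key, value in obj.items():
            walk(value, f"{prefix}.{key}" if prefix else str(key))

walk(state)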

I think they need to be gathered before saving.

But we probably shouldn't do that on every process, as it'd be quite slow if the model has 50% frozen weights. Since the frozen weights are the same everywhere, saving them once should be enough (at least on a shared filesystem; it won't work on a non-shared one).

The following will do the gathering:

diff --git a/deepspeed/runtime/zero/stage3.py b/deepspeed/runtime/zero/stage3.py
index 8c31a9d6..8b91e242 100644
--- a/deepspeed/runtime/zero/stage3.py
+++ b/deepspeed/runtime/zero/stage3.py
@@ -357,7 +357,8 @@ class DeepSpeedZeroOptimizer_Stage3(ZeROOptimizer):
         param_groups = []
         for param_group in self.optimizer.param_groups:
             frozen_params = [p for p in param_group["params"] if not p.requires_grad]
-            param_groups.append(frozen_params)
+            with deepspeed.zero.GatheredParameters(frozen_params, modifier_rank=None):
+                param_groups.append(frozen_params)
         return param_groups

     def _setup_for_real_optimizer(self):

But the saved tensors still appear to be of size 0, so that fix doesn't seem to be it.

Ah, I see: the original code can never succeed because frozen params aren't in optimizer.param_groups.
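
So a fix along these lines would need to take the frozen params from the module itself rather than the optimizer. A rough sketch of that idea (here `engine` is a hypothetical DeepSpeed engine wrapping the model; this is not the change that was eventually merged):

import deepspeed

def gather_frozen_state(engine):
    # Collect frozen params from the module (they are not in optimizer.param_groups).
    frozen_params = [p for p in engine.module.parameters() if not p.requires_grad]
    # modifier_rank=None gathers the ZeRO-3 partitions on every rank for read-only
    # access; nothing is broadcast back when the context exits.
    with deepspeed.zero.GatheredParameters(frozen_params, modifier_rank=None):
        return {
            name: param.detach().cpu().clone()
            for name, param in engine.module.named_parameters()
            if not param.requires_grad
        }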

@stas00 (Collaborator) commented Apr 13, 2023

I'm also wondering whether this would even work for a huge model with a lot of frozen params. There might not be enough memory to gather them all. Perhaps we should save their fp16 shards instead? That would also be much faster.
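
For illustration, the shard-saving idea might look roughly like this (a sketch only; `ds_tensor` is the local partition that ZeRO-3 keeps on each partitioned parameter, and the engine and output path are hypothetical):

import torch
import torch.distributed as dist

def save_frozen_shards(engine, out_dir):
    # Each rank saves only its own ZeRO-3 shard of the frozen params, avoiding a
    # full gather that might not fit in memory.
    rank = dist.get_rank()
    shards = {
        name: param.ds_tensor.detach().cpu().clone()
        for name, param in engine.module.named_parameters()
        if not param.requires_grad and hasattr(param, "ds_tensor")
    }
    torch.save(shards, f"{out_dir}/frozen_param_shards_rank_{rank}.pt")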

@shaankhosla

Hi @stas00 and @tjruwase, thanks for your work on this. I'm just checking to see if this would fix an error I'm getting using DeepSpeed and LoRA. Let me know if this isn't the place to ask.

I'm able to train "t5" using DeepSpeed Stage 3 and LoRA, however when I run the load_state_dict_from_zero_checkpoint command I get an error KeyError: '_forward_module.model.base_model.model.encoder.embed_tokens.weight'
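
The call in question, roughly (a hedged sketch; the LoRA/T5 model construction is replaced by a hypothetical build_model() helper and the checkpoint directory is a placeholder):

from deepspeed.utils.zero_to_fp32 import load_state_dict_from_zero_checkpoint

model = build_model()  # hypothetical helper returning the T5 model wrapped with LoRA
# Loads the consolidated fp32 weights from a ZeRO checkpoint directory into the model.
model = load_state_dict_from_zero_checkpoint(model, "path/to/checkpoint_dir")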

Thanks again for all your help!

@tjruwase (Contributor, PR author)

> Hi @stas00 and @tjruwase, thanks for your work on this. I'm just checking to see if this would fix an error I'm getting using DeepSpeed and LoRA. Let me know if this isn't the place to ask.
>
> I'm able to train "t5" using DeepSpeed Stage 3 and LoRA, however when I run the load_state_dict_from_zero_checkpoint command I get an error KeyError: '_forward_module.model.base_model.model.encoder.embed_tokens.weight'
>
> Thanks again for all your help!

@shaankhosla, thanks for your interest. Please open a new ticket for this problem. It would be very helpful to provide more details for reproducing the problem in that ticket.

@shaankhosla

Here it is: #3291 :)

@tjruwase tjruwase requested a review from ShijieZZZZ April 18, 2023 20:29
@tjruwase tjruwase enabled auto-merge (squash) April 20, 2023 18:48
@tjruwase tjruwase disabled auto-merge April 20, 2023 18:48
@tjruwase tjruwase enabled auto-merge (squash) April 20, 2023 18:49
@tjruwase tjruwase merged commit dd8df20 into master Apr 20, 2023
@stas00 (Collaborator) commented Apr 20, 2023

Thank you for solving this so quickly and merging it, Tunji and the team!

@conglongli conglongli added deepspeed-chat Related to DeepSpeed-Chat and removed deepspeed-chat Related to DeepSpeed-Chat labels Apr 30, 2023
@mrwyattii mrwyattii deleted the olruwase/issue_3090 branch July 7, 2023 02:37

Successfully merging this pull request may close these issues.

[BUG] save/load checkpoint in zero3 fails to preserve frozen weights