Conversation

@jeffra (Collaborator) commented Aug 24, 2021

In fine-tuning scenarios, the user often wants to load only the model parameters from a checkpoint and may not have the ZeRO optimizer states on disk, since they are not needed. This fixes a bug where the weights are not properly restored from the checkpoint if the ZeRO optimizer states are not present on disk.
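For context, a minimal sketch of the fine-tuning load path this fix targets, assuming the usual deepspeed.initialize / load_checkpoint API (exact initialize arguments vary by DeepSpeed version); model, ds_config, and ckpt_dir are placeholders:

    import deepspeed

    # initialize the engine as usual (model and ds_config are placeholders)
    model_engine, optimizer, _, _ = deepspeed.initialize(model=model,
                                                         model_parameters=model.parameters(),
                                                         config=ds_config)

    # restore only the model weights; with this fix the fp16 params are loaded
    # even when the ZeRO optimizer state files are missing from ckpt_dir
    load_path, client_state = model_engine.load_checkpoint(ckpt_dir,
                                                           load_optimizer_states=False,
                                                           load_lr_scheduler_states=False)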

@stas00 (Collaborator) commented Aug 24, 2021

I confirm that it solved the problem! Thank you, Jeff!

Good to go to merge this into the big-science branch!

@stas00 (Collaborator) commented Aug 24, 2021

BTW, the checkpoint still requires mp_rank_*_model_states.pt and the program crashes without those.

    sd_loader = SDLoaderFactory.get_sd_loader(ckpt_list)
  File "/gpfsssd/worksf/projects/rech/six/ura81os/code/deepspeed-big-science/deepspeed/runtime/state_dict_factory.py", line 162, in check_ckpt_list
    assert len(self.ckpt_list) > 0
AssertionError

Do we really need those? Or is this another legacy check?

jeffra merged commit aa12129 into master Aug 25, 2021
jeffra deleted the jeffra/zero-ckpt-fix branch Aug 25, 2021 14:12
@jeffra (Collaborator, Author) commented Aug 25, 2021

> BTW, the checkpoint still requires mp_rank_*_model_states.pt and the program crashes without those.
>
>     sd_loader = SDLoaderFactory.get_sd_loader(ckpt_list)
>   File "/gpfsssd/worksf/projects/rech/six/ura81os/code/deepspeed-big-science/deepspeed/runtime/state_dict_factory.py", line 162, in check_ckpt_list
>     assert len(self.ckpt_list) > 0
> AssertionError
>
> Do we really need those? Or is this another legacy check?

I am pretty sure we still need these, since they're associated with the tensor-parallelism model weights. @ShadenSmith to confirm with respect to pipeline-parallel (PP) checkpoints, though?
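For illustration only, a sketch of what a checkpoint tag directory typically contains and the condition behind the failing assert, assuming the usual DeepSpeed file naming (exact file names vary by version and config):

    import glob, os

    # typical layout of a checkpoint tag directory (names assumed):
    #   mp_rank_00_model_states.pt                 <- per-TP-rank model weights, still required
    #   zero_pp_rank_0_mp_rank_00_optim_states.pt  <- ZeRO optimizer partitions, may be absent when fine-tuning
    def has_model_states(ckpt_tag_dir):
        # roughly the condition behind the assert in check_ckpt_list
        return len(glob.glob(os.path.join(ckpt_tag_dir, "mp_rank_*_model_states.pt"))) > 0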

@stas00 (Collaborator) commented Aug 25, 2021

I think it needs those at least for getting the saved args out.

@stas00 (Collaborator) commented Aug 25, 2021

So now this needs to be replayed to the big-science branch. Thank you!

jeffra added a commit that referenced this pull request Aug 25, 2021
* restore fp16 params if no zero ckpts available

* formatting
@jeffra (Collaborator, Author) commented Aug 25, 2021

> So now this needs to be replayed to the big-science branch. Thank you!

Pushed this commit to big-science now :)
