Description & Motivation
I hope this is pure PTL behavior and not a NeMo override, but I'm observing that:
checkpoint_callback_params.save_last: True
leads to saving 2 copies of the same checkpoint. On a slow NFS, I can watch it write the first copy and then the second.
When training a huge model this is inefficient, as it doubles the time that training is blocked while a checkpoint is being saved.
Is there any reason not to make foo-last.ckpt a symlink to the actual checkpoint? That would do exactly the same thing at no cost in time or space.
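To illustrate the symlink idea, here is a minimal sketch in plain Python. It is not how PTL currently implements save_last; the helper name link_last and the fixed link name last.ckpt are hypothetical, and it assumes a POSIX filesystem where symlinks are allowed.

```python
import os
from pathlib import Path


def link_last(checkpoint: Path, last_name: str = "last.ckpt") -> Path:
    """Point `<dir>/<last_name>` at `checkpoint` using a relative symlink.

    `link_last` and `last_name` are hypothetical names for this sketch.
    """
    last = checkpoint.parent / last_name
    tmp = checkpoint.parent / (last_name + ".tmp")

    # Clear a leftover temporary link from an earlier interrupted call.
    if tmp.is_symlink() or tmp.exists():
        tmp.unlink()

    # Relative target, so the link stays valid if the directory is moved.
    os.symlink(checkpoint.name, tmp)

    # Renaming over the old link is atomic on POSIX, so there is never a
    # moment with zero or two "last" entries.
    os.replace(tmp, last)
    return last
```

Calling link_last(Path("checkpoints/mp_rank_00/step=60.ckpt")) after each save would repoint the link without writing the checkpoint data a second time.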
FWIW, other frameworks like Megatron-LM and DeepSpeed implement this completely differently: there is just a file called last which contains the last checkpoint's id (or filename), so the resume operation always knows where to resume from and requires nothing from the actual checkpoint files.
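A rough sketch of that tracker-file approach, in plain Python. The function names record_last and resolve_last are hypothetical, and the tracker name simply follows the description above; this is not the actual Megatron-LM or DeepSpeed code.

```python
from pathlib import Path
from typing import Optional

TRACKER = "last"  # tracker file name as described above (assumption)


def record_last(ckpt_dir: Path, checkpoint_name: str) -> None:
    """Store the name of the most recent checkpoint in a tiny text file."""
    (ckpt_dir / TRACKER).write_text(checkpoint_name + "\n")


def resolve_last(ckpt_dir: Path) -> Optional[Path]:
    """Return the checkpoint named by the tracker, or None if unusable."""
    tracker = ckpt_dir / TRACKER
    if not tracker.exists():
        return None  # nothing recorded yet: start from scratch
    candidate = ckpt_dir / tracker.read_text().strip()
    # Guard against a tracker that names a checkpoint that was deleted.
    return candidate if candidate.exists() else None
```

Since the tracker holds only a filename, updating it costs a few bytes of I/O regardless of checkpoint size.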
The reason I mention this other approach to tracking which file to resume from is I've just gotten this:
ls -l checkpoints/mp_rank_00/
total 6.6G
-rw-rw-rw- 1 stas stas 842M Sep 29 01:04 step=10.ckpt
-rw-rw-rw- 1 stas stas 842M Sep 29 01:05 step=20.ckpt
-rw-rw-rw- 1 stas stas 842M Sep 29 01:05 step=30.ckpt
-rw-rw-rw- 1 stas stas 842M Sep 29 01:06 step=40.ckpt
-rw-rw-rw- 1 stas stas 842M Sep 29 01:07 step=50-last.ckpt
-rw-rw-rw- 1 stas stas 842M Sep 29 01:06 step=50.ckpt
-rw-rw-rw- 1 stas stas 842M Sep 29 01:07 step=60-last.ckpt
-rw-rw-rw- 1 stas stas 842M Sep 29 01:07 step=60.ckpt
I have no idea how this came to be, but clearly it is broken: which one is the last here? Having a single file named last would avoid this situation.
edit: actually I think it happened due to a race condition inherent in the current approach: I happened to kill the job before it was able to delete the previous last checkpoint.
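As a workaround for a directory left in this state, one could choose the -last file with the highest step number and treat the rest as stale. The sketch below assumes the step=N-last.ckpt naming seen in the listing above; split_last_checkpoints is a hypothetical helper, not part of PTL or NeMo.

```python
import re
from pathlib import Path
from typing import List, Optional, Tuple

# Matches names like "step=60-last.ckpt" (pattern taken from the listing).
LAST_PATTERN = re.compile(r"^step=(\d+)-last\.ckpt$")


def split_last_checkpoints(ckpt_dir: Path) -> Tuple[Optional[Path], List[Path]]:
    """Return (newest "-last" checkpoint, stale "-last" checkpoints)."""
    found = []
    for path in ckpt_dir.iterdir():
        match = LAST_PATTERN.match(path.name)
        if match:
            found.append((int(match.group(1)), path))
    if not found:
        return None, []
    # Order by step number rather than mtime, which can tie or be skewed.
    found.sort(key=lambda item: item[0])
    newest = found[-1][1]
    stale = [path for _, path in found[:-1]]
    return newest, stale
```

A caller could resume from the first returned path and remove each stale entry with path.unlink().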
Related: isn't save_last: True a requirement rather than an option? I find that if I set it to False, the trainer starts from scratch instead of resuming from the latest checkpoint. I guess it doesn't know which one is the latest, but either way this doesn't seem to be optional.
Related: this doc also appears to be broken. I searched for save_last on your docs site and got the first hit, linking to:
which has no mention of save_last, and I can't find any other documentation for this option.
Thank you.