save_last: True saves 2 checkpoints every time #18670

@stas00

Description & Motivation

I hope this is pure PTL and not a NeMo override, but I'm observing that:

checkpoint_callback_params.save_last: True

leads to saving 2 copies of the same checkpoint. Being on a slow NFS, I can watch it write the file once and then a second time.

For training a huge model this is not an efficient choice, as it doubles the time training is blocked from progressing during checkpoint saving.

Is there any reason not to use a symlink from the actual checkpoint to the one named foo-last.ckpt, which would do exactly the same thing but cost zero extra time and space?

FWIW, in other frameworks like Megatron-LM and DeepSpeed this is implemented completely differently: there is just a file called last, which contains the last checkpoint's id (or filename), so the resume operation always knows where to resume from and requires nothing from the actual checkpoint files.

The reason I mention this other approach to tracking which file to resume from is I've just gotten this:

ls -l checkpoints/mp_rank_00/
total 6.6G
-rw-rw-rw- 1 stas stas 842M Sep 29 01:04 step=10.ckpt
-rw-rw-rw- 1 stas stas 842M Sep 29 01:05 step=20.ckpt
-rw-rw-rw- 1 stas stas 842M Sep 29 01:05 step=30.ckpt
-rw-rw-rw- 1 stas stas 842M Sep 29 01:06 step=40.ckpt
-rw-rw-rw- 1 stas stas 842M Sep 29 01:07 step=50-last.ckpt
-rw-rw-rw- 1 stas stas 842M Sep 29 01:06 step=50.ckpt
-rw-rw-rw- 1 stas stas 842M Sep 29 01:07 step=60-last.ckpt
-rw-rw-rw- 1 stas stas 842M Sep 29 01:07 step=60.ckpt

I have no idea how it came to be, but clearly this is broken: which is the last checkpoint here? Having a single last file would avoid this situation.

edit: actually I think it happened due to a race condition inherent in the current approach: I happened to kill the process before it was able to delete the previous -last checkpoint.


Related: isn't save_last: True a requirement rather than an option? I find that if I set it to False, the trainer starts from scratch and doesn't resume from the latest checkpoint. I guess it doesn't know which checkpoint is the latest, but either way this doesn't seem to be truly optional.


Related: this doc is also broken. I searched for save_last on your docs site and the first hit links to:

https://lightning.ai/docs/pytorch/stable/extensions/callbacks.html#lightning.pytorch.callbacks.ModelCheckpoint.params.save_last

which makes no mention of save_last, and I can't find any other documentation for this option.

Thank you.

Pitch

No response

Alternatives

No response

Additional context

No response

cc @Borda @carmocca @awaelchli
