
Cannot save last checkpoint due to breaking change in new release #18931

@francescocarzaniga

Bug description

With the breaking change to the behaviour of the save_last flag in ModelCheckpoint (PR), it now seems to be impossible to do a very simple and obvious thing: continue training from the truly last epoch while still saving the top_k checkpoints (minimal sketch below).
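For reference, this is roughly the setup I mean; a minimal sketch only, with illustrative values (the monitored metric name, save_top_k value, and checkpoint paths are placeholders, and `model` / `dm` stand in for my LightningModule and DataModule):

```python
from lightning.pytorch import Trainer
from lightning.pytorch.callbacks import ModelCheckpoint

# Keep the k best checkpoints according to the monitored metric,
# and also keep a "last" checkpoint so training can be resumed
# from the most recent epoch rather than from the best-scoring one.
checkpoint_cb = ModelCheckpoint(
    monitor="val_loss",   # illustrative metric name
    save_top_k=3,         # illustrative value
    save_last=True,
)

trainer = Trainer(max_epochs=100, callbacks=[checkpoint_cb])

# First run:
# trainer.fit(model, datamodule=dm)

# Resume from the actual last epoch by pointing at the last checkpoint
# (path is illustrative):
# trainer.fit(model, datamodule=dm, ckpt_path="lightning_logs/version_0/checkpoints/last.ckpt")
```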

Am I missing an obvious flag, or did you really remove this functionality? I have already lost a few days of GPU time because of this.

I am filing this as a bug because I believe this is an unintended consequence of the above-mentioned change.

What version are you seeing the problem on?

v2.1

How to reproduce the bug

No response

Error messages and logs

# Error messages and logs here please

Environment

Current environment
#- Lightning Component (e.g. Trainer, LightningModule, LightningApp, LightningWork, LightningFlow):
#- PyTorch Lightning Version (e.g., 1.5.0):
#- Lightning App Version (e.g., 0.5.2):
#- PyTorch Version (e.g., 2.0):
#- Python version (e.g., 3.9):
#- OS (e.g., Linux):
#- CUDA/cuDNN version:
#- GPU models and configuration:
#- How you installed Lightning(`conda`, `pip`, source):
#- Running environment of LightningApp (e.g. local, cloud):

More info

No response


Labels

bug (Something isn't working), needs triage (Waiting to be triaged by maintainers), ver: 2.1.x
