Resume training from last.ckpt #19251

@Karesto

Bug description

I was wondering what the save_last parameter of the ModelCheckpoint callback is for.
I assume it provides a "last.ckpt" that you can always refer to. This file is a symlink that points to the most recently saved checkpoint.

However, I cannot load last.ckpt when resuming training:

Traceback (most recent call last):
  File "/home/****/****/main.py", line 110, in <module>
    trainer.fit(model(model_params), dataloader, ckpt_path = path)
  File "/home/****/miniconda3/envs/training/lib/python3.11/site-packages/lightning/pytorch/trainer/trainer.py", line 544, in fit
    call._call_and_handle_interrupt(
  File "/home/****/miniconda3/envs/training/lib/python3.11/site-packages/lightning/pytorch/trainer/call.py", line 44, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/****/miniconda3/envs/training/lib/python3.11/site-packages/lightning/pytorch/trainer/trainer.py", line 580, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/home/****/miniconda3/envs/training/lib/python3.11/site-packages/lightning/pytorch/trainer/trainer.py", line 955, in _run
    self._checkpoint_connector._restore_modules_and_callbacks(ckpt_path)
  File "/home/****/miniconda3/envs/training/lib/python3.11/site-packages/lightning/pytorch/trainer/connectors/checkpoint_connector.py", line 395, in _restore_modules_and_callbacks
    self.resume_start(checkpoint_path)
  File "/home/****/miniconda3/envs/training/lib/python3.11/site-packages/lightning/pytorch/trainer/connectors/checkpoint_connector.py", line 79, in resume_start
    loaded_checkpoint = self.trainer.strategy.load_checkpoint(checkpoint_path)
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/****/miniconda3/envs/training/lib/python3.11/site-packages/lightning/pytorch/strategies/strategy.py", line 359, in load_checkpoint
    return self.checkpoint_io.load_checkpoint(checkpoint_path)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/****/miniconda3/envs/training/lib/python3.11/site-packages/lightning/fabric/plugins/io/torch_io.py", line 77, in load_checkpoint
    raise FileNotFoundError(f"Checkpoint file not found: {path}")
FileNotFoundError: Checkpoint file not found: /home/****/****/lightning_logs/2.0.0/checkpoints/last.ckpt

However, the last.ckpt file does exist, and so does the checkpoint it points to.
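One way to double-check the claim above is to inspect how the symlink resolves from the working directory the training script runs in. The following is only a diagnostic sketch using the Python standard library; the helper name `describe_checkpoint_link` is hypothetical and not part of Lightning, and it makes no assumption about how Lightning itself checks for the file.

```python
import os
from pathlib import Path


def describe_checkpoint_link(path):
    """Report how a checkpoint path resolves on the local filesystem.

    Hypothetical helper for debugging; not a Lightning API.
    """
    p = Path(path)
    is_link = p.is_symlink()
    return {
        # True if the path itself is a symbolic link (even a dangling one)
        "is_symlink": is_link,
        # The target exactly as stored in the link (may be relative)
        "raw_target": os.readlink(p) if is_link else None,
        # Absolute path after following links from the current directory
        "resolved": str(p.resolve()),
        # Path.exists() follows symlinks, so a dangling link gives False
        "target_exists": p.exists(),
    }
```

Calling `describe_checkpoint_link("path/to/last.ckpt")` shows whether the link target is stored as a relative or an absolute path and whether it can be reached. If `is_symlink` is true but `target_exists` is false, the link is dangling from the point of view of the process that reads it.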

What version are you seeing the problem on?

v2.1

How to reproduce the bug

import lightning.pytorch as pl

path = "path/to/last.ckpt"
trainer = pl.Trainer(**Training_args)
# Resuming from the last.ckpt symlink raises FileNotFoundError
trainer.fit(model(model_params), dataloader, ckpt_path=path)

Error messages and logs

No response

Environment

Current environment
#- Lightning Component (e.g. Trainer, LightningModule, LightningApp, LightningWork, LightningFlow):
#- PyTorch Lightning Version (e.g., 1.5.0):
#- Lightning App Version (e.g., 0.5.2):
#- PyTorch Version (e.g., 2.0):
#- Python version (e.g., 3.9):
#- OS (e.g., Linux):
#- CUDA/cuDNN version:
#- GPU models and configuration:
#- How you installed Lightning(`conda`, `pip`, source):
#- Running environment of LightningApp (e.g. local, cloud):

More info

No response

cc @awaelchli
