Description
Bug description
What is the purpose of the save_last parameter of the ModelCheckpoint callback?
I assume it provides a "last.ckpt" file that you can always refer to. This file is a symlink to the most recently saved checkpoint.
Given that, I would expect to be able to resume from it, but I cannot load last.ckpt:
Traceback (most recent call last):
  File "/home/****/****/main.py", line 110, in <module>
    trainer.fit(model(model_params), dataloader, ckpt_path = path)
  File "/home/****/miniconda3/envs/training/lib/python3.11/site-packages/lightning/pytorch/trainer/trainer.py", line 544, in fit
    call._call_and_handle_interrupt(
  File "/home/****/miniconda3/envs/training/lib/python3.11/site-packages/lightning/pytorch/trainer/call.py", line 44, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/****/miniconda3/envs/training/lib/python3.11/site-packages/lightning/pytorch/trainer/trainer.py", line 580, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/home/****/miniconda3/envs/training/lib/python3.11/site-packages/lightning/pytorch/trainer/trainer.py", line 955, in _run
    self._checkpoint_connector._restore_modules_and_callbacks(ckpt_path)
  File "/home/****/miniconda3/envs/training/lib/python3.11/site-packages/lightning/pytorch/trainer/connectors/checkpoint_connector.py", line 395, in _restore_modules_and_callbacks
    self.resume_start(checkpoint_path)
  File "/home/****/miniconda3/envs/training/lib/python3.11/site-packages/lightning/pytorch/trainer/connectors/checkpoint_connector.py", line 79, in resume_start
    loaded_checkpoint = self.trainer.strategy.load_checkpoint(checkpoint_path)
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/****/miniconda3/envs/training/lib/python3.11/site-packages/lightning/pytorch/strategies/strategy.py", line 359, in load_checkpoint
    return self.checkpoint_io.load_checkpoint(checkpoint_path)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/****/miniconda3/envs/training/lib/python3.11/site-packages/lightning/fabric/plugins/io/torch_io.py", line 77, in load_checkpoint
    raise FileNotFoundError(f"Checkpoint file not found: {path}")
FileNotFoundError: Checkpoint file not found: /home/****/****/lightning_logs/2.0.0/checkpoints/last.ckpt
However, the last.ckpt file does exist, and so does the checkpoint it points to.
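For anyone trying to narrow this down, here is a minimal sketch of how the link could be inspected with the standard library alone. It does not use Lightning and makes no claim about the cause. `describe_checkpoint` is a hypothetical helper name, and the path in the usage line is the placeholder from the reproduction snippet. The idea is to compare whether the directory entry itself exists with whether the path still exists once the symlink is followed, since a symlink whose stored target does not resolve from the link's own directory shows up as "entry exists, target does not".

```python
import os
from pathlib import Path


def describe_checkpoint(path):
    """Report how a checkpoint path resolves on disk (hypothetical helper)."""
    p = Path(path)
    is_link = p.is_symlink()
    return {
        # True if the directory entry exists, even when it is a dangling symlink.
        "entry_exists": os.path.lexists(p),
        "is_symlink": is_link,
        # Raw target stored in the link; relative targets are resolved
        # against the directory containing the link, not the working directory.
        "link_target": os.readlink(p) if is_link else None,
        # Absolute path after following symlinks as far as possible.
        "resolved": str(p.resolve()),
        # Path.exists() follows symlinks, so this is False for a dangling link.
        "target_exists": p.exists(),
    }


if __name__ == "__main__":
    # Placeholder path taken from the reproduction snippet.
    for key, value in describe_checkpoint("path/to/last.ckpt").items():
        print(f"{key}: {value}")
```

If `entry_exists` is True while `target_exists` is False, the link is dangling from the point of view of the process that opens it, and passing the `resolved` path (or the underlying checkpoint file directly) as `ckpt_path` may be worth trying as a workaround.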
What version are you seeing the problem on?
v2.1
How to reproduce the bug
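import lightning.pytorch as pl  # assumed import; the traceback shows the lightning.pytorch package

# model, model_params, dataloader and Training_args are defined elsewhere in main.py
path = "path/to/last.ckpt"
trainer = pl.Trainer(**Training_args)
trainer.fit(model(model_params), dataloader, ckpt_path=path)
Error messages and logs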
No response
Environment
Current environment
#- Lightning Component (e.g. Trainer, LightningModule, LightningApp, LightningWork, LightningFlow):
#- PyTorch Lightning Version (e.g., 1.5.0):
#- Lightning App Version (e.g., 0.5.2):
#- PyTorch Version (e.g., 2.0):
#- Python version (e.g., 3.9):
#- OS (e.g., Linux):
#- CUDA/cuDNN version:
#- GPU models and configuration:
#- How you installed Lightning(`conda`, `pip`, source):
#- Running environment of LightningApp (e.g. local, cloud):
More info
No response
cc @awaelchli