Description
Bug description
When ModelCheckpoint is used with save_last=True, no logger, and an output directory that points to a remote filesystem, it tries to create a symlink with os.symlink. This fails with FileNotFoundError, because os cannot operate on remote paths.
https://github.com/Lightning-AI/lightning/blob/874825857ffc09923407ada36814e11adb66c352/src/lightning/pytorch/callbacks/model_checkpoint.py#L391
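A minimal sketch of why the symlink call fails, independent of Lightning. The URLs are placeholders and `symlink_error_for` is a hypothetical helper introduced only for this illustration. `os.symlink` treats the URL as a relative local path, so the parent directory of the link does not exist.

```python
import os
import tempfile
from typing import Optional


def symlink_error_for(src: str, dst: str) -> Optional[str]:
    """Try os.symlink(src, dst) inside an empty temp dir.

    Returns the exception class name on failure, or None on success.
    Hypothetical helper, not part of Lightning.
    """
    original_cwd = os.getcwd()
    with tempfile.TemporaryDirectory() as tmp:
        os.chdir(tmp)
        try:
            # `os` has no notion of remote protocols: "az://..." is parsed
            # as a relative path whose first component is the folder "az:".
            os.symlink(src, dst)
        except OSError as err:
            return type(err).__name__
        finally:
            os.chdir(original_cwd)
    return None


# Placeholder remote paths, mirroring the shape of the ones in the traceback.
error = symlink_error_for(
    "az://container/checkpoints/best.ckpt",
    "az://container/checkpoints/last.ckpt",
)
print(error)  # an OSError subclass name; FileNotFoundError is expected on POSIX
```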
This happens mainly because self._fs is resolved from dirpath, which is still empty (None) at initialization:
https://github.com/Lightning-AI/lightning/blob/874825857ffc09923407ada36814e11adb66c352/src/lightning/pytorch/callbacks/model_checkpoint.py#L452
Thus self._fs is treated as a local filesystem, even though the filepath that is actually used is remote:
https://github.com/Lightning-AI/lightning/blob/874825857ffc09923407ada36814e11adb66c352/src/lightning/pytorch/callbacks/model_checkpoint.py#L662
The dirpath is then set again in setup, but self._fs is not updated:
https://github.com/Lightning-AI/lightning/blob/874825857ffc09923407ada36814e11adb66c352/src/lightning/pytorch/callbacks/model_checkpoint.py#L265-L268
Updating self._fs at that point would fix the issue.
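The stale-filesystem problem and the proposed fix can be sketched without Lightning. Everything below is hypothetical: `CheckpointSketch`, `protocol_of`, and the `refresh_fs` flag are stand-ins invented for illustration, and the assumption that a non-local protocol leads to a regular save instead of a symlink is mine, not confirmed by this report.

```python
from typing import Optional
from urllib.parse import urlparse


def protocol_of(path: Optional[str]) -> str:
    """Return the URL scheme of `path`, or "file" for local or empty paths."""
    if not path:
        return "file"
    scheme = urlparse(str(path)).scheme
    # Single-letter schemes are Windows drive letters, i.e. local paths.
    return scheme if len(scheme) > 1 else "file"


class CheckpointSketch:
    """Hypothetical stand-in for the callback, not Lightning's real class."""

    def __init__(self, dirpath: Optional[str] = None) -> None:
        self.dirpath = dirpath
        # Resolved once, from a dirpath that is still None -> "file".
        self.fs_protocol = protocol_of(dirpath)

    def setup(self, resolved_dirpath: str, refresh_fs: bool) -> None:
        self.dirpath = resolved_dirpath
        if refresh_fs:
            # Proposed fix: re-resolve the filesystem from the final dirpath.
            self.fs_protocol = protocol_of(resolved_dirpath)

    def last_checkpoint_strategy(self) -> str:
        # Assumption: symlinks are only attempted on a local filesystem.
        return "symlink" if self.fs_protocol == "file" else "save"


remote_dir = "az://container/checkpoints"  # placeholder path

stale = CheckpointSketch()
stale.setup(remote_dir, refresh_fs=False)
print(stale.last_checkpoint_strategy())  # symlink (the buggy behaviour)

fixed = CheckpointSketch()
fixed.setup(remote_dir, refresh_fs=True)
print(fixed.last_checkpoint_strategy())  # save
```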
What version are you seeing the problem on?
v2.1, master
How to reproduce the bug
# main.py
"""
Execute:
>>> export FSSPEC_ABFS='{"anon": false}'
>>> pip install pytorch_lightning adlfs
>>> python main.py
"""
import pytorch_lightning as pl
from pytorch_lightning.demos import boring_classes
OUTPUT_DIR = "az://<container-name>@<name>.blob.core.windows.net/tmp/"
class TestModel(boring_classes.BoringModel):
    def training_step(self, batch, batch_idx):
        loss = self.step(batch)
        return {"loss": loss}
model = TestModel()
trainer = pl.Trainer(
    default_root_dir=OUTPUT_DIR,
    max_epochs=10,
    logger=False,
    callbacks=[
        pl.callbacks.ModelCheckpoint(
            filename="best",
            save_last=True,  # <-- bug here: works for `False`
            save_top_k=1,
        ),
    ],
)
trainer.fit(model)
Error messages and logs
File "/Users/ioangatop/Desktop/kaiko-eng/libs/ml_framework/.venv/lib/python3.10/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 388, in _link_checkpoint
os.symlink(filepath, linkpath)
FileNotFoundError: [Errno 2] No such file or directory: 'az://ml-outputs@kaiko.blob.core.windows.net/tmp1/checkpoints/best.ckpt' -> 'az://ml-outputs@kaiko.blob.core.windows.net/tmp1/checkpoints/last.ckpt'