
Callback ModelCheckpoint option save_last without logger fails on remote FS #18865

@ioangatop


Bug description

When ModelCheckpoint is used with save_last=True and without a logger, and the output directory points to a remote FS, it tries to create a symlink with os.symlink and fails with FileNotFoundError, because os cannot operate on remote paths.
https://github.com/Lightning-AI/lightning/blob/874825857ffc09923407ada36814e11adb66c352/src/lightning/pytorch/callbacks/model_checkpoint.py#L391

This happens mainly because self._fs is derived from the dirpath, which is still empty (None) at initialization:
https://github.com/Lightning-AI/lightning/blob/874825857ffc09923407ada36814e11adb66c352/src/lightning/pytorch/callbacks/model_checkpoint.py#L452

Thus, self._fs is treated as a local filesystem even though the filepath that is actually used is remote:
https://github.com/Lightning-AI/lightning/blob/874825857ffc09923407ada36814e11adb66c352/src/lightning/pytorch/callbacks/model_checkpoint.py#L662

The issue is that dirpath is set again in setup, but self._fs is not updated:
https://github.com/Lightning-AI/lightning/blob/874825857ffc09923407ada36814e11adb66c352/src/lightning/pytorch/callbacks/model_checkpoint.py#L265-L268

Updating self._fs at that point should fix the issue.
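The stale-filesystem pattern described above can be sketched without Lightning. This is a simplified stand-in, not the actual ModelCheckpoint code: `infer_protocol`, `StaleCheckpoint` and `RefreshedCheckpoint` are hypothetical names, the URL is a placeholder, and the protocol is guessed from the URL scheme with the standard library instead of a real filesystem lookup.

```python
from typing import Optional
from urllib.parse import urlparse


def infer_protocol(path: Optional[str]) -> str:
    """Hypothetical stand-in for a filesystem lookup.

    Returns the URL scheme of `path`, or "file" when the path is empty
    or has no scheme. (Simplification: Windows drive letters are not handled.)
    """
    return urlparse(path or "").scheme or "file"


class StaleCheckpoint:
    """Mimics the reported pattern: protocol resolved once, never refreshed."""

    def __init__(self, dirpath: Optional[str] = None) -> None:
        self.dirpath = dirpath
        # dirpath is None here, so this resolves to the local protocol
        self.protocol = infer_protocol(dirpath)

    def setup(self, default_root_dir: str) -> None:
        if self.dirpath is None:
            # dirpath is resolved late, but self.protocol is left untouched
            self.dirpath = default_root_dir.rstrip("/") + "/checkpoints"


class RefreshedCheckpoint(StaleCheckpoint):
    """Same as above, but re-resolves the protocol once dirpath is known."""

    def setup(self, default_root_dir: str) -> None:
        super().setup(default_root_dir)
        self.protocol = infer_protocol(self.dirpath)


# Placeholder remote root, not a real storage account
root = "az://container@account.blob.core.windows.net/tmp/"

stale = StaleCheckpoint()
stale.setup(root)

fixed = RefreshedCheckpoint()
fixed.setup(root)

print(stale.protocol, fixed.protocol)  # file az
```

In the stale variant the checkpoint directory is remote while the recorded protocol is still local, which is the mismatch that would send a remote path down the symlink code path.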

What version are you seeing the problem on?

v2.1, master

How to reproduce the bug

# main.py
"""
Execute:
>>> export FSSPEC_ABFS='{"anon": false}'
>>> pip install pytorch_lightning adlfs
>>> python main.py
"""
import pytorch_lightning as pl
from pytorch_lightning.demos import boring_classes

OUTPUT_DIR = "az://<container-name>@<name>.blob.core.windows.net/tmp/"

class TestModel(boring_classes.BoringModel):
    def training_step(self, batch, batch_idx):
        loss = self.step(batch)
        return {"loss": loss}

model = TestModel()
trainer = pl.Trainer(
    default_root_dir=OUTPUT_DIR,
    max_epochs=10,
    logger=False,
    callbacks=[
        pl.callbacks.ModelCheckpoint(
            filename="best",
            save_last=True,  # <-- bug here: works for `False`
            save_top_k=1,
        ),
    ]
)
trainer.fit(model)

Error messages and logs

  File "/Users/ioangatop/Desktop/kaiko-eng/libs/ml_framework/.venv/lib/python3.10/site-packages/pytorch_lightning/callbacks/model_checkpoint.py", line 388, in _link_checkpoint
    os.symlink(filepath, linkpath)
FileNotFoundError: [Errno 2] No such file or directory: 'az://ml-outputs@kaiko.blob.core.windows.net/tmp1/checkpoints/best.ckpt' -> 'az://ml-outputs@kaiko.blob.core.windows.net/tmp1/checkpoints/last.ckpt'
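
The traceback can be reproduced without any cloud storage, since os.symlink treats its arguments as plain local paths. The following is a minimal sketch assuming a POSIX system and a working directory that has no `az:` subdirectory; `try_symlink` is a hypothetical helper and the `az://` URLs are placeholders.

```python
import os
import tempfile


def try_symlink(target: str, link: str) -> str:
    """Attempt os.symlink and report the outcome as a short string."""
    try:
        os.symlink(target, link)
    except OSError as exc:
        return type(exc).__name__
    return "ok"


# Placeholder remote URLs, interpreted by os as relative local paths
remote_target = "az://container@account.blob.core.windows.net/tmp/checkpoints/best.ckpt"
remote_link = "az://container@account.blob.core.windows.net/tmp/checkpoints/last.ckpt"

with tempfile.TemporaryDirectory() as workdir:
    # Local case: the link's parent directory exists, so the call succeeds
    local_target = os.path.join(workdir, "best.ckpt")
    open(local_target, "w").close()
    local_result = try_symlink(local_target, os.path.join(workdir, "last.ckpt"))

    # Remote case: the link's "parent directory" is not a local directory
    remote_result = try_symlink(remote_target, remote_link)

print("local:", local_result)
print("remote:", remote_result)
```

Under those assumptions the local call should succeed and the remote one should report FileNotFoundError, matching the error above.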


cc @carmocca @awaelchli
