Skip to content

[BUG] The new symlink feature of ModelCheckpoint is broken when wrong link path #19294

@shenmishajing

Description

@shenmishajing

Bug description

Screenshot 2024-01-16 at 18 30 55

As demonstrated in the above figure, the last.ckpt is linked to a wrong path. The correct path is just epoch:299-val_loss:0.ckpt instead of the long path from work_dirs.

https://github.com/Lightning-AI/pytorch-lightning/blob/628ee0cb61637d01a44884e0d805e238ea6b49bb/src/lightning/pytorch/callbacks/model_checkpoint.py#L692C25-L692C25

If we want to use the soft link, the above code should modify the filepath to start from the dirname of self._last_checkpoint_saved, instead of using the unchanged filepath, which is started from the current working dir.

In addition, though I agree with the idea of using the file link of last.ckpt to reduce the occupation of checkpoints on disk, it will be better to use the hard link instead of the soft link if possible on the local file system since the hard link will be more robust.

And the hard link use the relative path from the working dir, so the fix to this issue will be simply modifying the code in

to

                os.link(filepath, linkpath)

What version are you seeing the problem on?

v2.1

How to reproduce the bug

No response

Error messages and logs

# Error messages and logs here please

Environment

Current environment
  • CUDA:
    • GPU:
      • NVIDIA GeForce RTX 4090
      • NVIDIA GeForce RTX 4090
      • NVIDIA GeForce RTX 4090
      • NVIDIA GeForce RTX 4090
      • NVIDIA GeForce RTX 4090
      • NVIDIA GeForce RTX 4090
      • NVIDIA GeForce RTX 4090
      • NVIDIA GeForce RTX 4090
    • available: True
    • version: 12.1
  • Lightning:
    • lightning: 2.1.3
    • lightning-template: 1.5.1
    • lightning-utilities: 0.10.0
    • pytorch-lightning: 2.1.3
    • torch: 2.1.2
    • torchmetrics: 1.2.1
    • torchvision: 0.16.2
  • Packages:
    • accessible-pygments: 0.0.4
    • aiohttp: 3.9.1
    • aiosignal: 1.3.1
    • alabaster: 0.7.16
    • annotated-types: 0.6.0
    • antlr4-python3-runtime: 4.9.3
    • anyascii: 0.3.2
    • anyio: 4.2.0
    • appdirs: 1.4.4
    • argcomplete: 3.1.6
    • astroid: 3.0.2
    • attrs: 23.2.0
    • babel: 2.14.0
    • beautifulsoup4: 4.12.2
    • build: 1.0.3
    • cachetools: 5.3.2
    • certifi: 2023.11.17
    • cfgv: 3.4.0
    • chardet: 5.2.0
    • charset-normalizer: 3.3.2
    • click: 8.1.7
    • colorama: 0.4.6
    • commitizen: 3.13.0
    • contourpy: 1.2.0
    • coverage: 7.4.0
    • cycler: 0.12.1
    • decli: 0.6.1
    • distlib: 0.3.8
    • distro: 1.9.0
    • docker-pycreds: 0.4.0
    • docstring-parser: 0.15
    • docutils: 0.20.1
    • filelock: 3.13.1
    • fonttools: 4.47.0
    • frozenlist: 1.4.1
    • fsspec: 2023.12.2
    • gensim: 4.3.2
    • gitdb: 4.0.11
    • gitpython: 3.1.40
    • googletrans: 2.4.0
    • h11: 0.14.0
    • httpcore: 1.0.2
    • httpx: 0.26.0
    • huggingface-hub: 0.20.2
    • identify: 2.5.33
    • idna: 3.6
    • imagesize: 1.4.1
    • importlib-metadata: 6.11.0
    • importlib-resources: 6.1.1
    • iniconfig: 2.0.0
    • jinja2: 3.1.2
    • joblib: 1.3.2
    • jsonargparse: 4.27.1
    • jsonnet: 0.20.0
    • jsonschema: 4.20.0
    • jsonschema-specifications: 2023.12.1
    • kiwisolver: 1.4.5
    • lightning: 2.1.3
    • lightning-template: 1.5.1
    • lightning-utilities: 0.10.0
    • livereload: 2.6.3
    • logmatic-python: 0.1.7
    • markdown-it-py: 3.0.0
    • markupsafe: 2.1.3
    • matplotlib: 3.8.2
    • mdit-py-plugins: 0.4.0
    • mdurl: 0.1.2
    • medclip: 0.0.3
    • mpmath: 1.3.0
    • multidict: 6.0.4
    • myst-parser: 2.0.0
    • networkx: 3.2.1
    • nltk: 3.8.1
    • nodeenv: 1.8.0
    • numpy: 1.26.3
    • nvidia-cublas-cu12: 12.1.3.1
    • nvidia-cuda-cupti-cu12: 12.1.105
    • nvidia-cuda-nvrtc-cu12: 12.1.105
    • nvidia-cuda-runtime-cu12: 12.1.105
    • nvidia-cudnn-cu12: 8.9.2.26
    • nvidia-cufft-cu12: 11.0.2.54
    • nvidia-curand-cu12: 10.3.2.106
    • nvidia-cusolver-cu12: 11.4.5.107
    • nvidia-cusparse-cu12: 12.1.0.106
    • nvidia-nccl-cu12: 2.18.1
    • nvidia-nvjitlink-cu12: 12.3.101
    • nvidia-nvtx-cu12: 12.1.105
    • omegaconf: 2.3.0
    • openai: 0.28.0
    • packaging: 23.2
    • pandas: 2.1.4
    • pillow: 10.2.0
    • pip: 23.3.1
    • platformdirs: 4.1.0
    • pluggy: 1.3.0
    • pre-commit: 3.6.0
    • project-template: 0.2.2.dev4+dirty
    • prompt-toolkit: 3.0.36
    • protobuf: 4.25.1
    • psutil: 5.9.7
    • pydantic: 2.5.3
    • pydantic-core: 2.14.6
    • pydata-sphinx-theme: 0.15.1
    • pygments: 2.17.2
    • pyparsing: 3.1.1
    • pyproject-api: 1.6.1
    • pyproject-hooks: 1.0.0
    • pytest: 7.4.4
    • pytest-cov: 4.1.0
    • python-dateutil: 2.8.2
    • python-json-logger: 2.0.7
    • pytorch-lightning: 2.1.3
    • pytz: 2023.3.post1
    • pyyaml: 6.0.1
    • questionary: 2.0.1
    • reconplogger: 4.12.0
    • referencing: 0.32.1
    • regex: 2023.12.25
    • requests: 2.31.0
    • rich: 13.7.0
    • rpds-py: 0.16.2
    • ruyaml: 0.91.0
    • safetensors: 0.4.1
    • scikit-learn: 1.3.2
    • scipy: 1.11.4
    • sentry-sdk: 1.39.1
    • setproctitle: 1.3.3
    • setuptools: 68.2.2
    • shell-command-launcher: 1.1.0
    • six: 1.16.0
    • smart-open: 6.4.0
    • smmap: 5.0.1
    • sniffio: 1.3.0
    • snowballstemmer: 2.2.0
    • socksio: 1.0.0
    • soupsieve: 2.5
    • speed-benchmark: 1.1.1
    • sphinx: 7.2.6
    • sphinx-autoapi: 3.0.0
    • sphinx-autobuild: 2021.3.14
    • sphinx-book-theme: 1.1.0
    • sphinx-design: 0.5.0
    • sphinxcontrib-applehelp: 1.0.7
    • sphinxcontrib-devhelp: 1.0.5
    • sphinxcontrib-htmlhelp: 2.0.4
    • sphinxcontrib-jsmath: 1.0.1
    • sphinxcontrib-qthelp: 1.0.6
    • sphinxcontrib-serializinghtml: 1.1.9
    • sympy: 1.12
    • termcolor: 2.4.0
    • textaugment: 2.0.0
    • textblob: 0.17.1
    • thop: 0.1.1.post2209072238
    • threadpoolctl: 3.2.0
    • timm: 0.9.12
    • tokenizers: 0.13.3
    • tomlkit: 0.12.3
    • torch: 2.1.2
    • torchmetrics: 1.2.1
    • torchvision: 0.16.2
    • tornado: 6.4
    • tox: 4.11.4
    • tqdm: 4.66.1
    • transformers: 4.24.0
    • triton: 2.1.0
    • typeshed-client: 2.4.0
    • typing-extensions: 4.9.0
    • tzdata: 2023.4
    • urllib3: 2.1.0
    • virtualenv: 20.25.0
    • wandb: 0.16.2
    • wcwidth: 0.2.13
    • wget: 3.2
    • wheel: 0.41.2
    • yarl: 1.9.4
    • zipp: 3.17.0
  • System:

More info

By the way, since we already implement the link of last.ckpt, how about adding a new feature on best.ckpt? Although we support saving the topk checkpoints, there is no consistent way to reference the top1 checkpoint. For example, in the above figure, I have to check the files under the folder work_dirs/medclip_resnet50_iu_xray_R2GenGPT_30x/2024.01.16_17.44.40.927262/checkpoints to determine the file name epoch:5-val_loss:3.264.ckpt to reference the best checkpoint if I have not created the best.ckpt checkpoint. It will be more convenient if we implement the hard link of best.ckpt also. It can be implemented with the similar code as the last.ckpt part.

cc @carmocca @awaelchli

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions