-
Couldn't load subscription status.
- Fork 3.6k
Description
Bug description
As demonstrated in the above figure, the last.ckpt is linked to a wrong path. The correct path is just epoch:299-val_loss:0.ckpt instead of the long path from work_dirs.
If we want to use the soft link, the above code should modify the filepath to start from the dirname of self._last_checkpoint_saved, instead of using the unchanged filepath, which is started from the current working dir.
In addition, though I agree with the idea of using the file link of last.ckpt to reduce the occupation of checkpoints on disk, it will be better to use the hard link instead of the soft link if possible on the local file system since the hard link will be more robust.
And the hard link use the relative path from the working dir, so the fix to this issue will be simply modifying the code in
| os.symlink(filepath, linkpath) |
to
os.link(filepath, linkpath)What version are you seeing the problem on?
v2.1
How to reproduce the bug
No response
Error messages and logs
# Error messages and logs here please
Environment
Current environment
- CUDA:
- GPU:
- NVIDIA GeForce RTX 4090
- NVIDIA GeForce RTX 4090
- NVIDIA GeForce RTX 4090
- NVIDIA GeForce RTX 4090
- NVIDIA GeForce RTX 4090
- NVIDIA GeForce RTX 4090
- NVIDIA GeForce RTX 4090
- NVIDIA GeForce RTX 4090
- available: True
- version: 12.1
- GPU:
- Lightning:
- lightning: 2.1.3
- lightning-template: 1.5.1
- lightning-utilities: 0.10.0
- pytorch-lightning: 2.1.3
- torch: 2.1.2
- torchmetrics: 1.2.1
- torchvision: 0.16.2
- Packages:
- accessible-pygments: 0.0.4
- aiohttp: 3.9.1
- aiosignal: 1.3.1
- alabaster: 0.7.16
- annotated-types: 0.6.0
- antlr4-python3-runtime: 4.9.3
- anyascii: 0.3.2
- anyio: 4.2.0
- appdirs: 1.4.4
- argcomplete: 3.1.6
- astroid: 3.0.2
- attrs: 23.2.0
- babel: 2.14.0
- beautifulsoup4: 4.12.2
- build: 1.0.3
- cachetools: 5.3.2
- certifi: 2023.11.17
- cfgv: 3.4.0
- chardet: 5.2.0
- charset-normalizer: 3.3.2
- click: 8.1.7
- colorama: 0.4.6
- commitizen: 3.13.0
- contourpy: 1.2.0
- coverage: 7.4.0
- cycler: 0.12.1
- decli: 0.6.1
- distlib: 0.3.8
- distro: 1.9.0
- docker-pycreds: 0.4.0
- docstring-parser: 0.15
- docutils: 0.20.1
- filelock: 3.13.1
- fonttools: 4.47.0
- frozenlist: 1.4.1
- fsspec: 2023.12.2
- gensim: 4.3.2
- gitdb: 4.0.11
- gitpython: 3.1.40
- googletrans: 2.4.0
- h11: 0.14.0
- httpcore: 1.0.2
- httpx: 0.26.0
- huggingface-hub: 0.20.2
- identify: 2.5.33
- idna: 3.6
- imagesize: 1.4.1
- importlib-metadata: 6.11.0
- importlib-resources: 6.1.1
- iniconfig: 2.0.0
- jinja2: 3.1.2
- joblib: 1.3.2
- jsonargparse: 4.27.1
- jsonnet: 0.20.0
- jsonschema: 4.20.0
- jsonschema-specifications: 2023.12.1
- kiwisolver: 1.4.5
- lightning: 2.1.3
- lightning-template: 1.5.1
- lightning-utilities: 0.10.0
- livereload: 2.6.3
- logmatic-python: 0.1.7
- markdown-it-py: 3.0.0
- markupsafe: 2.1.3
- matplotlib: 3.8.2
- mdit-py-plugins: 0.4.0
- mdurl: 0.1.2
- medclip: 0.0.3
- mpmath: 1.3.0
- multidict: 6.0.4
- myst-parser: 2.0.0
- networkx: 3.2.1
- nltk: 3.8.1
- nodeenv: 1.8.0
- numpy: 1.26.3
- nvidia-cublas-cu12: 12.1.3.1
- nvidia-cuda-cupti-cu12: 12.1.105
- nvidia-cuda-nvrtc-cu12: 12.1.105
- nvidia-cuda-runtime-cu12: 12.1.105
- nvidia-cudnn-cu12: 8.9.2.26
- nvidia-cufft-cu12: 11.0.2.54
- nvidia-curand-cu12: 10.3.2.106
- nvidia-cusolver-cu12: 11.4.5.107
- nvidia-cusparse-cu12: 12.1.0.106
- nvidia-nccl-cu12: 2.18.1
- nvidia-nvjitlink-cu12: 12.3.101
- nvidia-nvtx-cu12: 12.1.105
- omegaconf: 2.3.0
- openai: 0.28.0
- packaging: 23.2
- pandas: 2.1.4
- pillow: 10.2.0
- pip: 23.3.1
- platformdirs: 4.1.0
- pluggy: 1.3.0
- pre-commit: 3.6.0
- project-template: 0.2.2.dev4+dirty
- prompt-toolkit: 3.0.36
- protobuf: 4.25.1
- psutil: 5.9.7
- pydantic: 2.5.3
- pydantic-core: 2.14.6
- pydata-sphinx-theme: 0.15.1
- pygments: 2.17.2
- pyparsing: 3.1.1
- pyproject-api: 1.6.1
- pyproject-hooks: 1.0.0
- pytest: 7.4.4
- pytest-cov: 4.1.0
- python-dateutil: 2.8.2
- python-json-logger: 2.0.7
- pytorch-lightning: 2.1.3
- pytz: 2023.3.post1
- pyyaml: 6.0.1
- questionary: 2.0.1
- reconplogger: 4.12.0
- referencing: 0.32.1
- regex: 2023.12.25
- requests: 2.31.0
- rich: 13.7.0
- rpds-py: 0.16.2
- ruyaml: 0.91.0
- safetensors: 0.4.1
- scikit-learn: 1.3.2
- scipy: 1.11.4
- sentry-sdk: 1.39.1
- setproctitle: 1.3.3
- setuptools: 68.2.2
- shell-command-launcher: 1.1.0
- six: 1.16.0
- smart-open: 6.4.0
- smmap: 5.0.1
- sniffio: 1.3.0
- snowballstemmer: 2.2.0
- socksio: 1.0.0
- soupsieve: 2.5
- speed-benchmark: 1.1.1
- sphinx: 7.2.6
- sphinx-autoapi: 3.0.0
- sphinx-autobuild: 2021.3.14
- sphinx-book-theme: 1.1.0
- sphinx-design: 0.5.0
- sphinxcontrib-applehelp: 1.0.7
- sphinxcontrib-devhelp: 1.0.5
- sphinxcontrib-htmlhelp: 2.0.4
- sphinxcontrib-jsmath: 1.0.1
- sphinxcontrib-qthelp: 1.0.6
- sphinxcontrib-serializinghtml: 1.1.9
- sympy: 1.12
- termcolor: 2.4.0
- textaugment: 2.0.0
- textblob: 0.17.1
- thop: 0.1.1.post2209072238
- threadpoolctl: 3.2.0
- timm: 0.9.12
- tokenizers: 0.13.3
- tomlkit: 0.12.3
- torch: 2.1.2
- torchmetrics: 1.2.1
- torchvision: 0.16.2
- tornado: 6.4
- tox: 4.11.4
- tqdm: 4.66.1
- transformers: 4.24.0
- triton: 2.1.0
- typeshed-client: 2.4.0
- typing-extensions: 4.9.0
- tzdata: 2023.4
- urllib3: 2.1.0
- virtualenv: 20.25.0
- wandb: 0.16.2
- wcwidth: 0.2.13
- wget: 3.2
- wheel: 0.41.2
- yarl: 1.9.4
- zipp: 3.17.0
- System:
- OS: Linux
- architecture:
- 64bit
- ELF
- processor: x86_64
- python: 3.11.7
- release: 5.15.0-91-generic
- version: Schedulers step before optimizers #101-Ubuntu SMP Tue Nov 14 13:30:08 UTC 2023
More info
By the way, since we already implement the link of last.ckpt, how about adding a new feature on best.ckpt? Although we support saving the topk checkpoints, there is no consistent way to reference the top1 checkpoint. For example, in the above figure, I have to check the files under the folder work_dirs/medclip_resnet50_iu_xray_R2GenGPT_30x/2024.01.16_17.44.40.927262/checkpoints to determine the file name epoch:5-val_loss:3.264.ckpt to reference the best checkpoint if I have not created the best.ckpt checkpoint. It will be more convenient if we implement the hard link of best.ckpt also. It can be implemented with the similar code as the last.ckpt part.