Async checkpointing race conditions #14035

@awaelchli

Description

🐛 Bug

We've seen flakiness in the CI due to race conditions in the async checkpointing logic.

To Reproduce

The specific test in our code base: test_async_checkpoint_plugin
A CI run where this occurred: https://github.com/Lightning-AI/lightning/runs/7672029478?check_suite_focus=true
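
For context, here is a rough sketch of the kind of setup the test exercises. This is not the actual test code: the toy module, the paths, and the `AsyncCheckpointIO` import path are assumptions and may differ between Lightning versions.

```python
# Hedged sketch of a reproduction along the lines of test_async_checkpoint_plugin.
# MyModel and run() are hypothetical; the AsyncCheckpointIO import path may differ by version.
import torch
from torch.utils.data import DataLoader, TensorDataset

from pytorch_lightning import LightningModule, Trainer
from pytorch_lightning.callbacks import ModelCheckpoint
from pytorch_lightning.plugins.io import AsyncCheckpointIO


class MyModel(LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return torch.nn.functional.cross_entropy(self.layer(x), y)

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)


def run(tmp_path: str) -> None:
    train_data = DataLoader(
        TensorDataset(torch.randn(64, 32), torch.randint(0, 2, (64,))), batch_size=8
    )
    # save_top_k with frequent saves forces deletions of older checkpoints,
    # potentially while the async thread is still writing one of them
    checkpoint_cb = ModelCheckpoint(
        dirpath=tmp_path, monitor="step", mode="max", save_top_k=2, every_n_train_steps=1
    )
    trainer = Trainer(
        default_root_dir=tmp_path,
        plugins=[AsyncCheckpointIO()],
        callbacks=[checkpoint_cb],
        max_epochs=2,
        limit_train_batches=4,
        logger=False,
        enable_progress_bar=False,
    )
    trainer.fit(MyModel(), train_data)


if __name__ == "__main__":
    run("./async_ckpt_repro")
```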
Error:


>       trainer.fit(model)

D:\a\lightning\lightning\tests\tests_pytorch\plugins\test_checkpoint_io_plugin.py:127: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
d:\a\lightning\lightning\src\pytorch_lightning\trainer\trainer.py:698: in fit
    self._call_and_handle_interrupt(
d:\a\lightning\lightning\src\pytorch_lightning\trainer\trainer.py:650: in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
d:\a\lightning\lightning\src\pytorch_lightning\trainer\trainer.py:739: in _fit_impl
    results = self._run(model, ckpt_path=self.ckpt_path)
d:\a\lightning\lightning\src\pytorch_lightning\trainer\trainer.py:1179: in _run
    results = self._run_stage()
d:\a\lightning\lightning\src\pytorch_lightning\trainer\trainer.py:1265: in _run_stage
    return self._run_train()
d:\a\lightning\lightning\src\pytorch_lightning\trainer\trainer.py:1296: in _run_train
    self.fit_loop.run()
d:\a\lightning\lightning\src\pytorch_lightning\loops\loop.py:201: in run
    self.on_advance_end()
d:\a\lightning\lightning\src\pytorch_lightning\loops\fit_loop.py:298: in on_advance_end
    self.trainer._call_callback_hooks("on_train_epoch_end")
d:\a\lightning\lightning\src\pytorch_lightning\trainer\trainer.py:1610: in _call_callback_hooks
    fn(self, self.lightning_module, *args, **kwargs)
d:\a\lightning\lightning\src\pytorch_lightning\callbacks\model_checkpoint.py:311: in on_train_epoch_end
    self._save_topk_checkpoint(trainer, monitor_candidates)
d:\a\lightning\lightning\src\pytorch_lightning\callbacks\model_checkpoint.py:382: in _save_topk_checkpoint
    self._save_monitor_checkpoint(trainer, monitor_candidates)
d:\a\lightning\lightning\src\pytorch_lightning\callbacks\model_checkpoint.py:662: in _save_monitor_checkpoint
    self._update_best_and_save(current, trainer, monitor_candidates)
d:\a\lightning\lightning\src\pytorch_lightning\callbacks\model_checkpoint.py:716: in _update_best_and_save
    trainer.strategy.remove_checkpoint(del_filepath)
d:\a\lightning\lightning\src\pytorch_lightning\strategies\strategy.py:460: in remove_checkpoint
    self.checkpoint_io.remove_checkpoint(filepath)
C:\hostedtoolcache\windows\Python\3.9.13\x64\lib\unittest\mock.py:1092: in __call__
    return self._mock_call(*args, **kwargs)
C:\hostedtoolcache\windows\Python\3.9.13\x64\lib\unittest\mock.py:1096: in _mock_call
    return self._execute_mock_call(*args, **kwargs)
C:\hostedtoolcache\windows\Python\3.9.13\x64\lib\unittest\mock.py:1166: in _execute_mock_call
    return self._mock_wraps(*args, **kwargs)
d:\a\lightning\lightning\src\pytorch_lightning\plugins\io\wrapper.py:61: in remove_checkpoint
    self.checkpoint_io.remove_checkpoint(*args, **kwargs)
C:\hostedtoolcache\windows\Python\3.9.13\x64\lib\unittest\mock.py:1092: in __call__
    return self._mock_call(*args, **kwargs)
C:\hostedtoolcache\windows\Python\3.9.13\x64\lib\unittest\mock.py:1096: in _mock_call
    return self._execute_mock_call(*args, **kwargs)
C:\hostedtoolcache\windows\Python\3.9.13\x64\lib\unittest\mock.py:1166: in _execute_mock_call
    return self._mock_wraps(*args, **kwargs)
d:\a\lightning\lightning\src\pytorch_lightning\plugins\io\torch_plugin.py:95: in remove_checkpoint
    fs.rm(path, recursive=True)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <fsspec.implementations.local.LocalFileSystem object at 0x00000180502A8100>
path = ['C:\\Users\\runneradmin\\AppData\\Local\\Temp\\pytest-of-runneradmin\\pytest-0\\test_async_checkpoint_plugin0\\epoch=0-step=1.ckpt']
recursive = True, maxdepth = None

    def rm(self, path, recursive=False, maxdepth=None):
        if isinstance(path, str):
            path = [path]
    
        for p in path:
            p = self._strip_protocol(p).rstrip("/")
            if recursive and self.isdir(p):
    
                if osp.abspath(p) == os.getcwd():
                    raise ValueError("Cannot delete current working directory")
                shutil.rmtree(p)
            else:
>               os.remove(p)
E               PermissionError: [WinError 32] The process cannot access the file because it is being used by another process: 'C:/Users/runneradmin/AppData/Local/Temp/pytest-of-runneradmin/pytest-0/test_async_checkpoint_plugin0/epoch=0-step=1.ckpt'

C:\hostedtoolcache\windows\Python\3.9.13\x64\lib\site-packages\fsspec\implementations\local.py:153: PermissionError

Expected behavior

This is the issue I raised concerns about in #11561 (comment). A top-k checkpoint deletion can happen at any time; if the async save thread is still writing that same file, there is a race condition.
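
One possible mitigation, sketched below, is for the async wrapper to remember the in-flight save future for each path and have `remove_checkpoint` wait for any pending write of that exact file before deleting it. This is only an illustration of the idea, not the actual Lightning fix; `SafeAsyncCheckpointIO` is a hypothetical name, and a real implementation would subclass Lightning's `CheckpointIO` and also handle `load_checkpoint` and error propagation.

```python
# Hedged sketch of a race-free async wrapper: deletions block until any
# pending write of the same path has completed. Not the upstream fix.
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Any, Dict, Optional


class SafeAsyncCheckpointIO:
    def __init__(self, checkpoint_io: Any) -> None:
        # checkpoint_io is the wrapped synchronous IO (e.g. a TorchCheckpointIO-like object)
        self._checkpoint_io = checkpoint_io
        self._executor = ThreadPoolExecutor(max_workers=1)
        self._pending: Dict[str, Future] = {}

    def save_checkpoint(
        self, checkpoint: Dict[str, Any], path: str, storage_options: Optional[Any] = None
    ) -> None:
        # submit the write to the background thread and remember the future per path
        self._pending[path] = self._executor.submit(
            self._checkpoint_io.save_checkpoint, checkpoint, path, storage_options
        )

    def remove_checkpoint(self, path: str) -> None:
        # wait for any in-flight write of this exact file before removing it
        future = self._pending.pop(path, None)
        if future is not None:
            future.result()
        self._checkpoint_io.remove_checkpoint(path)

    def teardown(self) -> None:
        # drain all outstanding writes on shutdown
        self._executor.shutdown(wait=True)
```

Blocking only on the specific path keeps saving asynchronous in the common case while making top-k deletions deterministic.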

Environment

lightning 1.8.0dev
pytorch 1.10

Additional context

cc @awaelchli @ananthsub @ninginthecloud @rohitgr7 @otaj @Borda @akihironitta

Labels: bug (Something isn't working), checkpointing (Related to checkpointing), tests
