🐛 Bug
We've seen flaky CI failures caused by a race condition in the async checkpointing logic: a checkpoint file can be scheduled for deletion while a background thread is still writing it.
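To make the failure mode concrete, here is a minimal sketch of the race, assuming saves are handed off to a single-worker executor. `SketchAsyncCheckpointIO` and everything in it are hypothetical, simplified stand-ins for illustration, not Lightning's actual `AsyncCheckpointIO`:

```python
import os
import time
from concurrent.futures import ThreadPoolExecutor


class SketchAsyncCheckpointIO:
    """Hypothetical, simplified async checkpoint IO for illustration only."""

    def __init__(self) -> None:
        self._executor = ThreadPoolExecutor(max_workers=1)

    def save_checkpoint(self, data: bytes, path: str) -> None:
        # submit() returns immediately; the file is actually written
        # later, on the worker thread.
        def _save() -> None:
            with open(path, "wb") as f:
                time.sleep(0.1)  # stand-in for a slow torch.save
                f.write(data)

        self._executor.submit(_save)

    def remove_checkpoint(self, path: str) -> None:
        # Nothing here synchronizes with a pending save of `path`. If the
        # worker thread still holds the file open, this raises
        # PermissionError (WinError 32) on Windows.
        os.remove(path)
```

ModelCheckpoint's top-k bookkeeping calls `remove_checkpoint` from the main thread (see the traceback below), so whether the delete lands before, during, or after the write is purely a matter of timing, which is why the test is flaky rather than reliably failing.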
To Reproduce
The specific test in our codebase: test_async_checkpoint_plugin
A CI run where this occurred: https://github.com/Lightning-AI/lightning/runs/7672029478?check_suite_focus=true
Error:
> trainer.fit(model)
D:\a\lightning\lightning\tests\tests_pytorch\plugins\test_checkpoint_io_plugin.py:127:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
d:\a\lightning\lightning\src\pytorch_lightning\trainer\trainer.py:698: in fit
self._call_and_handle_interrupt(
d:\a\lightning\lightning\src\pytorch_lightning\trainer\trainer.py:650: in _call_and_handle_interrupt
return trainer_fn(*args, **kwargs)
d:\a\lightning\lightning\src\pytorch_lightning\trainer\trainer.py:739: in _fit_impl
results = self._run(model, ckpt_path=self.ckpt_path)
d:\a\lightning\lightning\src\pytorch_lightning\trainer\trainer.py:1179: in _run
results = self._run_stage()
d:\a\lightning\lightning\src\pytorch_lightning\trainer\trainer.py:1265: in _run_stage
return self._run_train()
d:\a\lightning\lightning\src\pytorch_lightning\trainer\trainer.py:1296: in _run_train
self.fit_loop.run()
d:\a\lightning\lightning\src\pytorch_lightning\loops\loop.py:201: in run
self.on_advance_end()
d:\a\lightning\lightning\src\pytorch_lightning\loops\fit_loop.py:298: in on_advance_end
self.trainer._call_callback_hooks("on_train_epoch_end")
d:\a\lightning\lightning\src\pytorch_lightning\trainer\trainer.py:1610: in _call_callback_hooks
fn(self, self.lightning_module, *args, **kwargs)
d:\a\lightning\lightning\src\pytorch_lightning\callbacks\model_checkpoint.py:311: in on_train_epoch_end
self._save_topk_checkpoint(trainer, monitor_candidates)
d:\a\lightning\lightning\src\pytorch_lightning\callbacks\model_checkpoint.py:382: in _save_topk_checkpoint
self._save_monitor_checkpoint(trainer, monitor_candidates)
d:\a\lightning\lightning\src\pytorch_lightning\callbacks\model_checkpoint.py:662: in _save_monitor_checkpoint
self._update_best_and_save(current, trainer, monitor_candidates)
d:\a\lightning\lightning\src\pytorch_lightning\callbacks\model_checkpoint.py:716: in _update_best_and_save
trainer.strategy.remove_checkpoint(del_filepath)
d:\a\lightning\lightning\src\pytorch_lightning\strategies\strategy.py:460: in remove_checkpoint
self.checkpoint_io.remove_checkpoint(filepath)
C:\hostedtoolcache\windows\Python\3.9.13\x64\lib\unittest\mock.py:1092: in __call__
return self._mock_call(*args, **kwargs)
C:\hostedtoolcache\windows\Python\3.9.13\x64\lib\unittest\mock.py:1096: in _mock_call
return self._execute_mock_call(*args, **kwargs)
C:\hostedtoolcache\windows\Python\3.9.13\x64\lib\unittest\mock.py:1166: in _execute_mock_call
return self._mock_wraps(*args, **kwargs)
d:\a\lightning\lightning\src\pytorch_lightning\plugins\io\wrapper.py:61: in remove_checkpoint
self.checkpoint_io.remove_checkpoint(*args, **kwargs)
C:\hostedtoolcache\windows\Python\3.9.13\x64\lib\unittest\mock.py:1092: in __call__
return self._mock_call(*args, **kwargs)
C:\hostedtoolcache\windows\Python\3.9.13\x64\lib\unittest\mock.py:1096: in _mock_call
return self._execute_mock_call(*args, **kwargs)
C:\hostedtoolcache\windows\Python\3.9.13\x64\lib\unittest\mock.py:1166: in _execute_mock_call
return self._mock_wraps(*args, **kwargs)
d:\a\lightning\lightning\src\pytorch_lightning\plugins\io\torch_plugin.py:95: in remove_checkpoint
fs.rm(path, recursive=True)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
self = <fsspec.implementations.local.LocalFileSystem object at 0x00000180502A8100>
path = ['C:\\Users\\runneradmin\\AppData\\Local\\Temp\\pytest-of-runneradmin\\pytest-0\\test_async_checkpoint_plugin0\\epoch=0-step=1.ckpt']
recursive = True, maxdepth = None
def rm(self, path, recursive=False, maxdepth=None):
if isinstance(path, str):
path = [path]
for p in path:
p = self._strip_protocol(p).rstrip("/")
if recursive and self.isdir(p):
if osp.abspath(p) == os.getcwd():
raise ValueError("Cannot delete current working directory")
shutil.rmtree(p)
else:
> os.remove(p)
E PermissionError: [WinError 32] The process cannot access the file because it is being used by another process: 'C:/Users/runneradmin/AppData/Local/Temp/pytest-of-runneradmin/pytest-0/test_async_checkpoint_plugin0/epoch=0-step=1.ckpt'
C:\hostedtoolcache\windows\Python\3.9.13\x64\lib\site-packages\fsspec\implementations\local.py:153: PermissionError
Expected behavior
This is the issue I raised a concern about in #11561 (comment): a top-k deletion can happen at any time, and if the background thread is still writing the same file, the two operations race. On Windows this surfaces as the PermissionError above.
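One possible direction for a fix, sketched below under the assumption that all saves go through a single executor (`SketchSafeAsyncCheckpointIO` and its `_pending` map are hypothetical names, not a proposed patch): record the in-flight `Future` per filepath and wait on it before deleting that file.

```python
from concurrent.futures import Future, ThreadPoolExecutor


class SketchSafeAsyncCheckpointIO:
    """Hypothetical wrapper: waits for an in-flight write before deleting."""

    def __init__(self, checkpoint_io) -> None:
        self.checkpoint_io = checkpoint_io  # the wrapped, synchronous IO
        self._executor = ThreadPoolExecutor(max_workers=1)
        self._pending: dict[str, Future] = {}

    def save_checkpoint(self, checkpoint, path: str) -> None:
        # Remember the future for this path so a later delete can wait on it.
        self._pending[path] = self._executor.submit(
            self.checkpoint_io.save_checkpoint, checkpoint, path
        )

    def remove_checkpoint(self, path: str) -> None:
        # Block until any pending write to this exact path has finished,
        # then delete through the wrapped IO.
        future = self._pending.pop(path, None)
        if future is not None:
            future.result()
        self.checkpoint_io.remove_checkpoint(path)
```

A real fix would also have to surface exceptions from `future.result()`, prune completed futures, and handle concurrent saves to the same path, but the core idea is to serialize delete-after-write per file.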
Environment
lightning 1.8.0dev
pytorch 1.10
Additional context
cc @awaelchli @ananthsub @ninginthecloud @rohitgr7 @otaj @Borda @akihironitta