Fix espaloma model download race conditions #398
Merged
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
This is an attempt to fix race conditions when multiple processes are trying to download espaloma models simultaneously (I believe this was the likely cause of test failures I saw in #397). Specifically, if a process creates
ESPALOMA_MODEL_CACHE_PATHin between another one checking that it doesn't exist and trying to create it, the second process would fail. Also, if a process sees that the PyTorch model file exists, it would try to read it even if another process is still in the middle of downloading it, so it will run into an early EOF.Here
makedirs(exist_ok=True)should fix the first problem, and downloading to a temporary location before doing what should be an atomic rename should fix the second. I suspect CI failures from these problems would be intermittent, but I can reliably reproduce them locally, and verify the fixes, by deleting~/.espalomaand running just the espaloma tests in parallel.