Skip to content

Conversation

@epretti
Copy link
Member

@epretti epretti commented Aug 5, 2025

This is an attempt to fix race conditions when multiple processes are trying to download espaloma models simultaneously (I believe this was the likely cause of test failures I saw in #397). Specifically, if a process creates ESPALOMA_MODEL_CACHE_PATH in between another one checking that it doesn't exist and trying to create it, the second process would fail. Also, if a process sees that the PyTorch model file exists, it would try to read it even if another process is still in the middle of downloading it, so it will run into an early EOF.

Here makedirs(exist_ok=True) should fix the first problem, and downloading to a temporary location before doing what should be an atomic rename should fix the second. I suspect CI failures from these problems would be intermittent, but I can reliably reproduce them locally, and verify the fixes, by deleting ~/.espaloma and running just the espaloma tests in parallel.

@mattwthompson
Copy link
Collaborator

We should consider using an off-the-shelf tool, @mikemhenry liked https://pypi.org/project/pooch/ in the past.

We (OpenFF) are facing issues with rate-limiting in CI when we hammer assets via GitHub Releases

@codecov-commenter
Copy link

codecov-commenter commented Aug 5, 2025

⚠️ Please install the 'codecov app svg image' to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

❌ Patch coverage is 63.63636% with 4 lines in your changes missing coverage. Please review.
✅ Project coverage is 81.09%. Comparing base (06edade) to head (7d3e995).

Files with missing lines Patch % Lines
...penmmforcefields/generators/template_generators.py 63.63% 4 Missing ⚠️
❗ Your organization needs to install the Codecov GitHub app to enable full functionality.
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #398      +/-   ##
==========================================
+ Coverage   81.02%   81.09%   +0.06%     
==========================================
  Files           5        5              
  Lines         822      825       +3     
==========================================
+ Hits          666      669       +3     
  Misses        156      156              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@epretti
Copy link
Member Author

epretti commented Aug 6, 2025

In an ideal world, this could be handled by espaloma itself. It looks like there's some capability for downloading models, but not handling automatic caching unless I missed it. Is this fix to the current solution OK for now, though?

@mikemhenry
Copy link
Collaborator

I agree that this should all be handled in espaloma, the capability to download models started in this package, then I added a really basic method into espaloma.

I like this fix since it will make this more robust and I can work upstream on espaloma native solution.

@mikemhenry mikemhenry merged commit e7c1f9e into openmm:main Aug 6, 2025
14 checks passed
@epretti epretti deleted the fix-espaloma-race branch August 6, 2025 16:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants