-
Notifications
You must be signed in to change notification settings - Fork 735
Optimize the calculation speed of InterRDF_s
#5073
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…erRDF_s to optimize redundant np.histogram calls. Additionally, divided by N in _conclude to make the computation consistent with InterRDF.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hello there first time contributor! Welcome to the MDAnalysis community! We ask that all contributors abide by our Code of Conduct and that first time contributors introduce themselves on GitHub Discussions so we can get to know you. You can learn more about participating here. Please also add yourself to package/AUTHORS as part of this PR.
orbeckst
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the contribution! With any code rewrite for performance, the first thing is to make sure that the existing tests still pass. Have a look at the test output (you can run the tests locally, too) and modify your code so that it produces the same results as before.
If you have any reason to belief that the new code produces correct output while the old code is incorrect, please start a discussion.
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## develop #5073 +/- ##
===========================================
+ Coverage 92.68% 92.69% +0.01%
===========================================
Files 180 180
Lines 22446 22452 +6
Branches 3187 3186 -1
===========================================
+ Hits 20804 20812 +8
+ Misses 1167 1166 -1
+ Partials 475 474 -1 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
orbeckst
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you have some minimal performance data (before/after)?
Can you add an entry to CHANGELOG where you summarize your changes? It looks to me that the current version does not change normalization and simply addresses performance? (That's fine and makes it easier to merge the PR... if results change then that's more tricky.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR optimizes the calculation speed of InterRDF_s by refactoring bin index calculation to avoid redundant np.histogram calls and ensures result consistency with InterRDF by adding normalization.
- Replaced multiple
np.histogramcalls with direct bin index calculation in_single_frame - Added normalization by dividing by
Nin_concludemethod to matchInterRDFbehavior
Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.
orbeckst
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See comments, in particular can you rewrite with np.digitize?
@orbeckst |
performance
test code>>> rdf1 = before(u, [[ag1, ag2]]) # alias InterRDF_s as before and after
>>> %timeit rdf1.run(start=0,stop=10)
27.1 s ± 55.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
>>> rdf2 = after(u, [[ag1, ag2]])
>>> %timeit rdf2.run(start=0,stop=10)
158 ms ± 513 µs per loop (mean ± std. dev. of 7 runs, 10 loops each) |
Added new contributors and optimized performance of RDF calculation.
…and update CHANGELOG.
orbeckst
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Can you please adjust variable names to make the code more readable?
- I think we can boost the speed further by replacing
np.isin().
See comments below. I think that's really it.
Can you do a quick benchmark of the final code version and post the result?
|
We just merged #4884 that enables parallelization. We do want to check that your optimized code also works in the parallel context. |
…to gitzhangch-develop
- corrected version changed dates for PR #4884 in analysis.rdf to 2.10.0 - moved entry in CHANGELOG to 2.10.0
|
@gitzhangch when continuing on this PR, do a |
orbeckst
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great. Thanks for addressing my last comments.
Have you done a quick before/after benchmark? I'd like to see the improvement before I'll merge.
before/after benchmark@orbeckst Because the atomic pair size set in the benchmark is relatively small, the performance is not as striking as before. The larger the Additionally, is it necessary to update the corresponding files in the benchmark folder? codeimport MDAnalysis
try:
from MDAnalysisTests.datafiles import TPR, XTC
except:
pass
try:
from MDAnalysis.analysis.rdf import InterRDF_s
except:
pass
class SimpleRdfBench(object):
"""Benchmarks for MDAnalysis.analysis.rdf"""
params = ([20, 75, 200], [[0, 5], [0, 15], [0, 20]], [10, 100], [1, 3, 9])
param_names = ["nbins", "range_val", "natoms", "npairs"]
def setup(self, nbins, range_val, natoms, npairs):
self.sel_str = "name OW"
self.u = MDAnalysis.Universe(TPR, XTC)
try:
self.sel = self.u.select_atoms(self.sel_str)[:natoms]
except AttributeError:
self.sel = self.u.selectAtoms(self.sel_str)[:natoms]
ags = [[self.sel, self.sel]] * npairs
self.rdfs = InterRDF_s(u=self.u, ags=ags, nbins=nbins, range=range_val)
def time_interrdf_s(self, nbins, range_val, natoms, npairs):
"""Benchmark a full trajectory parse
by MDAnalysis.analysis.rdf.InterRDF_s
"""
self.rdfs.run(stop=5)resultbefore after |
InterRDF_s to ensure that the results are consistent with InterRDFInterRDF_s
orbeckst
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you please also add the timing benchmark to benchmarks/analysis/rdf.py? That's a great addition!
Please check that you're also in AUTHORS – I didn't look.
Thank you!
|
I toned down the claim of "100-fold" speedup to "> 20-fold", simply because that's what you demonstrate here. It may well be much faster for larger N and that's great but let's be conservative. I really like that you created a new benchmark case for InterRDF_s and I really like you to add it to the code base, please. It's really good practice, especially for anything performance-related. I'll just wait with merging until you had a chance to look at the PR and I'll do it as soon as you're done here. Excellent work @gitzhangch ! |
|
@gitzhangch in order to get these changes into the next release, would you be able to respond by tomorrow, Tuesday? |
|
@orbeckst I update AUTHORS and rdf benchmark. |
|
Fantastic, will merge when CI is all green. |
|
Thank you for your contribution @gitzhangch , it will be in release 2.10.0! 🎉 |
Refactored bin index calculation in the _single_frame function of InterRDF_s to optimize redundant np.histogram calls.
Fixes #5067
Changes made in this Pull Request:
InterRDF_s._single_frameto avoid multiple redundantnp.histogramcalls, improving performance.PR Checklist
package/CHANGELOGfile updated?package/AUTHORS? (If it is not, add it!)Developers Certificate of Origin
I certify that I can submit this code contribution as described in the Developer Certificate of Origin, under the MDAnalysis LICENSE.
📚 Documentation preview 📚: https://mdanalysis--5073.org.readthedocs.build/en/5073/