Improve Fluxion match performance #1007
Conversation
LGTM
Edit: sorry, I didn't see that the Slack conversation was still ongoing, so this review is a bit premature.
Branch updated from 5b06eeb to 5177b81.
@grondo and @garlick: @trws, @tpatki, @JaeseungYeom, and Zeke discussed this. We decided that those improvements will take longer to implement and will come in further PRs. The contributions here speed up most policies and as such can be reviewed.
This LGTM and has already been tested without issue on hetchy.
FYI - I pulled these 2 patches into our latest RPM build for TOSS 4 so we'll get these into the next release.
Just noticed a formatting error in the 2nd commit message, but otherwise this should get merged, IMO.
Perhaps we should also reword the PR title before merging, since the title will be included in the release notes and this doesn't fully fix the perf degradation.
I'll rename it to "Improve Fluxion match performance" since the changes improve performance for every policy I tried. Of course, the …
Problem: `perf` profiling, together with a test replacement of `plus` with an in-place addition, reveals that the `plus` function is not inlined by the compiler.

Add the `inline` keyword to coerce the compiler to inline the function and the others in its namespace. The change improves match performance across a run of the issue reproducer by a factor of >100 for several policies.
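As a rough illustration of the commit above (a sketch only; the names and surrounding loop are invented, not the flux-sched source), marking a trivial namespace-scope helper `inline` lets the compiler fold the call into its hot-loop callers:

```cpp
// Hedged sketch, not the actual flux-sched code: a small namespace-scope
// accumulation helper.  Without the `inline` hint the compiler may emit a
// real call per vertex; with it, the call below collapses to `score += f`.
#include <cstdint>
#include <vector>

namespace sketch {

inline int64_t plus (int64_t accumulated, int64_t factor)
{
    return accumulated + factor;
}

}  // namespace sketch

int64_t score_vertices (const std::vector<int64_t> &factors)
{
    int64_t score = 0;
    for (int64_t f : factors)
        score = sketch::plus (score, f);  // candidate for inlining in the hot path
    return score;
}
```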
Problem: the current design of the multilevel ID match policy uses a vector of factors (`m_multilevel_factors`) to keep track of all the added factors and accumulates them to generate a prefix for the overall vertex score. This O(n) operation is repeated for each vertex. Furthermore, the vector is not cleared between match traversals, so the operation becomes increasingly expensive.

Replace the vector with a single `int64_t` which contains the sum of all vertex factors. Instead of popping the current vertex's factor from a vector, simply subtract its factor as stored in `score_factor_t`. The change accelerates exclusive matching by a factor of about 6.
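A minimal sketch of the replacement described above, with assumed class and member names (`multilevel_scorer_t`, `m_sum_of_factors`) that simplify the real policy implementation:

```cpp
// Illustrative sketch only; the actual flux-sched classes differ.
#include <cstdint>

struct score_factor_t {
    int64_t factor = 0;  // this level's contribution, remembered so it can be
                         // subtracted when the corresponding vertex is popped
};

class multilevel_scorer_t {
public:
    void add_factor (score_factor_t &f, int64_t factor)
    {
        f.factor = factor;
        m_sum_of_factors += factor;  // O(1): keep a running sum
    }

    void remove_factor (const score_factor_t &f)
    {
        m_sum_of_factors -= f.factor;  // subtract instead of popping a vector
    }

    int64_t prefix () const
    {
        return m_sum_of_factors;  // prefix for the overall vertex score
    }

private:
    int64_t m_sum_of_factors = 0;  // replaces the m_multilevel_factors vector
};
```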
Codecov Report
@@ Coverage Diff @@
## master #1007 +/- ##
========================================
- Coverage 74.4% 74.4% -0.1%
========================================
Files 86 86
Lines 9429 9434 +5
========================================
+ Hits 7019 7021 +2
- Misses 2410 2413 +3
Thanks! Feel free to set MWP when you are ready.
As reported in issue #1001, `flux mini submit` experiences pronounced slowdown for most of the Fluxion match policies except `first`. With the debugging help of @grondo and @trws we traced one source to an un-inlined accumulate operation. Adding the `inline` hint to the function allows the compiler to properly inline it, and performance improves by a factor of >100 for all policies except `lonodex` and `hinodex` (or, e.g., `lonode` with `--exclusive -N1`).
Another hotspot we encountered in debugging is the `accumulate` operation in `calc_multilevel_scores`:

flux-sched/resource/policies/dfu_match_multilevel_id_impl.hpp (line 71 in f7bdcb5)
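The embedded snippet is not reproduced on this page; as a simplified, assumed sketch of the pattern described below (not the verbatim code at that line), it amounts to summing a growing vector for every finishing vertex:

```cpp
// Simplified sketch only -- not the verbatim flux-sched code referenced above.
// A vector of per-level factors is summed for every finishing vertex, and the
// vector is never cleared between traversals, so each call costs O(n) with a
// growing n.
#include <cstdint>
#include <numeric>
#include <vector>

struct multilevel_scores_sketch_t {
    std::vector<int64_t> m_multilevel_factors;  // grows across traversals

    int64_t calc_multilevel_scores () const
    {
        return std::accumulate (m_multilevel_factors.begin (),
                                m_multilevel_factors.end (),
                                static_cast<int64_t> (0));
    }
};
```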
This O(n) `accumulate` operation is called for each `finish_vtx` and, even worse, currently scales linearly with the number of matches because the accumulated vector isn't cleared after each traversal. We can replace the `accumulate` operation with a single `int64_t` which stores the prefix value.

While these changes improve performance for all policies, `lonodex` and `hinodex` are still slow. Future PRs will further improve performance by optimizing graph filtration and lookups on edge iteration.