Optimize multi-dimensional array access #70271
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
src/coreclr/jit/morph.cpp
Outdated
I'm re-morphing the tree here, which seems like the most targeted thing to do. But I've introduced GT_ASG nodes, and the GTF_ASG flag needs to propagate to the root. As a result, I'm also (currently) re-morphing changed trees at the statement level, below. Should I just stop re-morphing here and let the statement-level re-morph do its thing? Or should I re-morph here, exactly what was changed, and then do something else to propagate flags up the tree?
I guess I'd just do it once, at the end, otherwise if there are multiple MD array accesses you are walking to the root multiple times.
Some perf results from the MDBenchI/MDBenchF suite. [Edit] Added a 2nd run to validate: the MDRomer regression evaporated; the MDMulMatrix regression was validated.
Run 1
Run 2
As seen above, MDRomer has a small regression that could be investigated. Almost all spmi asmdiffs are improvements. There are a few outlier regressions that should be investigated, all in
@AndyAyersMS @dotnet/jit-contrib PTAL
/azp run runtime-coreclr outerloop, runtime-coreclr jitstress, runtime-coreclr libraries-jitstress
Azure Pipelines successfully started running 3 pipeline(s).
/azp run runtime-coreclr gcstress0x3-gcstress0xc
Azure Pipelines successfully started running 1 pipeline(s).
src/coreclr/jit/morph.cpp
Outdated
Not an immediate concern with this change, but I wonder if you have some ideas on how to approach adding VN support for this early expansion.
The SZ case utilizes a parser utility that tries to reconstruct whatever morph left behind; the (significantly) more complex MD trees look less amenable to that.
That's an excellent question, and needs more thought.
Any idea why the
Yeah, I was looking at those too. Seems we have cases like
MDMulMatrix also has a (big?) regression
I need to investigate. Maybe related to the large size regressions in the cases I mentioned and Kunal pointed out. Over 7% on win-x64 is pretty extreme given that I doubt many tests even have MD arrays.
AndyAyersMS
left a comment
Looks good overall.
You might consider keeping a temp cache in fgMorphArrayOps and mark all temps as not in use after each statement.
We do this in other places (eg for struct arg passing, and importer box temps) to try and keep the total number of temps reasonable.
SSA will see these recycled temps as having many distinct lifetimes, so it should not inhibit opts.
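The temp-cache idea above can be sketched as follows. This is illustrative Python, not the JIT's actual data structures: temps grabbed while expanding one statement are marked not-in-use afterwards, so later statements recycle them and the total temp count stays bounded by the worst single statement.

```python
class TempCache:
    """Sketch of a per-statement temp cache (illustrative; not JIT code)."""

    def __init__(self):
        self.next_temp = 0   # next fresh temp number to hand out
        self.free = []       # temps not currently in use
        self.in_use = []     # temps used by the current statement

    def grab_temp(self):
        # Reuse a free temp if available; otherwise allocate a fresh one.
        temp = self.free.pop() if self.free else self._new_temp()
        self.in_use.append(temp)
        return temp

    def _new_temp(self):
        temp = self.next_temp
        self.next_temp += 1
        return temp

    def end_statement(self):
        # Mark all temps as not in use after each statement.
        self.free.extend(self.in_use)
        self.in_use.clear()

cache = TempCache()
stmt1 = [cache.grab_temp(), cache.grab_temp()]  # statement 1 needs 2 temps
cache.end_statement()
stmt2 = [cache.grab_temp(), cache.grab_temp()]  # statement 2 reuses them
print(cache.next_temp)  # 2 temps created in total, not 4
```

With recycling, two statements that each need two temps create only two locals overall, which is the point of the suggestion above.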
src/coreclr/jit/importer.cpp
Outdated
Seems dangerous to only add the cast under DEBUG.
Oops; I moved the "enabling" code from DEBUG to all-flavor last-minute but didn't update this.
src/coreclr/jit/morph.cpp
Outdated
If idx is side-effect free but nontrivial you will want to use a temp too, otherwise you might duplicate a lot of stuff and force CSE to clean up after you.
In the case without a temp, I just use the idx tree directly, so there is no copy. Reminds me that I should DEBUG_DESTROY_NODE the GT_ARR_ELEM node.
Ah, they are single use, it's the effective index that is multiple use.
AndyAyersMS
left a comment
Changes LGTM.
We should verify this temp cache fixes the TP issues. Not sure if you did this locally, but I can't tell this yet from CI as you have merge conflicts to resolve.
The temp cache improved the CQ of MDMulMatrix (and maybe others), but it's still slower than before. So, I still need to investigate MDMulMatrix CQ, as well as check the big asm-diff regressions for the
MDMulMatrix: This test has 1 doubly-nested and 6 triply-nested loops. If I split out all the loops into individual benchmarks, they are up to 25% better with this change (one has no perf change: the "lkj" loop). If I reduce the number of loop nests in the function, then when there are 3 or more, this change is slower, up to 1.5x the baseline. Seems like there must be issues with the quantity of IR/temps. [Edit] Various MDMulMatrix subset perf runs
Force-pushed aca4585 to 2c63c4b
Size regressions: With the temp cache implemented, all the
Also, the improvements far outweigh the regressions, e.g., for coreclr_tests:
MDMulMatrix is an outlier for how large the size regression is:
TP diffs still show a significant regression on the coreclr_tests spmi run, but that's probably just because that's where we actually have MD array accesses and the most asm code diffs. E.g., for win-x64:
One thing you can consider is hacking SPMI to produce a table of functions with the # of instructions executed for each context. That might help narrow down whether it's expected.
and diffMetrics.NumExecutedInstructions here: runtime/src/coreclr/tools/superpmi/superpmi/superpmi.cpp, lines 396 to 397 in 0d0fd75
and then post-process this into something.
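The suggested post-processing step might look like this sketch; the CSV layout (context index, base executed-instruction count, diff executed-instruction count) is an assumption for illustration, not SPMI's actual output format.

```python
import csv
import io

def top_tp_regressions(table_text, n=3):
    """Rank contexts by increase in executed instructions (diff - base)."""
    rows = []
    for ctx, base, diff in csv.reader(io.StringIO(table_text)):
        base, diff = int(base), int(diff)
        rows.append((ctx, diff - base, (diff - base) / base if base else 0.0))
    rows.sort(key=lambda r: r[1], reverse=True)
    return rows[:n]

# Hypothetical per-context data: context, base instructions, diff instructions.
sample = "100,1000,1500\n101,2000,1900\n102,500,900\n"
for ctx, delta, pct in top_tp_regressions(sample, 2):
    print(f"context {ctx}: +{delta} instructions ({pct:.0%})")
```

Sorting by absolute delta surfaces the contexts dominating the aggregate TP number, which helps decide whether the regression is concentrated where MD arrays actually appear.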
Thanks for the suggestion. I was going to figure out which method contexts are affected (at all) by my change, extract them into a separate mch, then use JitTimeLogCsv. A perfview diff of a spmi replay (of the full coreclr_tests collection) with/without my change shows a significant increase in fgMorphSmpOp and GenTreeVisitor::WalkTree, so I should look at total IR size before/after as well.
Didn't know about this; looks useful. Maybe I should add the precise instruction count data to this mechanism too.
Configuration variables:
1. `COMPlus_JitEarlyExpandMDArrays`: set to 0 to disable early MD expansion. Default is 1 (enabled).
2. If `COMPlus_JitEarlyExpandMDArrays=0`, use `COMPlus_JitEarlyExpandMDArraysFilter` to selectively enable early MD expansion for a method set (e.g., syntax like `JitDump`).
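Usage might look like the following sketch (a config fragment; the project name and the `Bench:Sum` method name are hypothetical, and the `COMPlus_` prefix matches the runtime era this PR targets):

```shell
# Disable early MD-array expansion entirely:
export COMPlus_JitEarlyExpandMDArrays=0
dotnet run --project ConsoleApp

# With it disabled globally, selectively re-enable it for one method
# (method-set syntax as for JitDump; the method name here is made up):
export COMPlus_JitEarlyExpandMDArrays=0
export COMPlus_JitEarlyExpandMDArraysFilter=Bench:Sum
dotnet run --project ConsoleApp
```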
Force-pushed b2203da to a0d0812
/azp run runtime-coreclr outerloop, runtime-coreclr jitstress, runtime-coreclr libraries-jitstress
Azure Pipelines successfully started running 3 pipeline(s).
Improvements on Linux-x64 dotnet/perf-autofiling-issues#6721
Currently, multi-dimensional (MD) array access operations are treated as opaque to most of the JIT; they pass through the optimization pipeline untouched. Lowering expands the `GT_ARR_ELEM` node (representing an `a[i,j]` operation, for example) to `GT_ARR_OFFSET` and `GT_ARR_INDEX` trees, to expose the register requirements of the operation. These are then directly used to generate code.
This change moves the expansion of `GT_ARR_ELEM` to a new pass that follows loop optimization but precedes Value Numbering, CSE, and the rest of the optimizer. This placement allows future improvement to loop cloning to support cloning loops with MD references, while allowing the optimizer to kick in on the new expansion. One nice feature of this change: there is no machine-dependent code required; all the nodes get lowered to machine-independent nodes before code generation.
The MDBenchI and MDBenchF micro-benchmarks (very targeted to this work) improve about 10% to 60%.
`GT_ARR_ELEM` nodes are morphed to appropriate trees. Note that an MD array `Get`, `Set`, or `Address` operation is imported as a call and, if all required conditions are satisfied, is treated as an intrinsic and replaced by IR nodes, especially `GT_ARR_ELEM` nodes, in `impArrayAccessIntrinsic()`.
For example, a simple 2-dimensional array access like `a[i,j]` looks like:
This is replaced by:
plus the appropriate `i` and `j` bounds checks. In IR, this is:
before being morphed by the usual morph transformations.
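The expansion's semantics can be modeled at the source level roughly as follows. This is a Python sketch of the address arithmetic described in this PR (effective indices, per-dimension bounds checks, row-major flattening), not the JIT's actual IR; the names and the example layout are illustrative only.

```python
def md_elem_offset(data_offset, elem_size, lower_bounds, lengths, i, j):
    """Byte offset of a[i, j] in a 2-D array that may have nonzero lower bounds."""
    eff_i = i - lower_bounds[0]        # effective index for dimension 0
    eff_j = j - lower_bounds[1]        # effective index for dimension 1
    # One bounds check per dimension, in order, preserving exception ordering.
    if not (0 <= eff_i < lengths[0]):
        raise IndexError("index out of range in dimension 0")
    if not (0 <= eff_j < lengths[1]):
        raise IndexError("index out of range in dimension 1")
    flat = eff_i * lengths[1] + eff_j  # row-major flattened element index
    return data_offset + flat * elem_size

# A 3x4 array of 4-byte elements with lower bounds (1, 1); element a[2, 3]:
print(md_elem_offset(16, 4, (1, 1), (3, 4), 2, 3))  # 16 + (1*4 + 2)*4 = 40
```

Once the access is written out this way as ordinary arithmetic plus explicit bounds checks, downstream phases like CSE and value numbering can see and optimize each piece, which is the point of the early expansion.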
Some things to consider:
- lower bound)
- `GT_MDARR_LOWER_BOUND(dim)` represents the lower-bound value for a particular array dimension. The "effective index" for a dimension is the index minus the lower bound.
- `GT_MDARR_LENGTH(dim)` represents the length value (number of elements in a dimension) for a particular array dimension.
- `TYP_INT`). The offset from the array object to the beginning of the array data is added.
- to preserve exception ordering.
- address
- byref. As usual, we need to be careful to not create an illegal byref by adding any partial index calculation.
- The `OMF_HAS_MDARRAYREF` flag is set if there are any MD array expressions to expand. Also, the block flag `BBF_HAS_MDARRAYREF` is set on blocks where these exist, so only those blocks are processed.
Remaining work:
- `optEarlyProp` support for MD arrays.
- Remove `GT_ARR_OFFSET` and `GT_ARR_INDEX` nodes and related code, as well as `GT_ARR_ELEM` code used after the new expansion.
The new early expansion is enabled by default. It can be disabled (even in Release, currently) by setting `COMPlus_JitEarlyExpandMDArrays=0`. If disabled, it can be selectively enabled using `COMPlus_JitEarlyExpandMDArraysFilter=<method_set>` (e.g., as specified for `JitDump`).
Fixes #60785.