-
Notifications
You must be signed in to change notification settings - Fork 5.1k
Try to fold basic arithmetic operations as part of import #97901
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch Issue DetailsThis resolves a user reported issue where despite inlining, the JIT may not see that a given value is constant during import due to these operations not getting folded: SixLabors/ImageSharp#2654 (comment) This resulted in massively pessimized codegen since hardware intrinsics have to fallback to a call and a very large jump table instead to ensure the right behavior is executed. The fix was simple and just involves calling It is not a "complete" fix and there may still be other places where something becomes a constant after importation. That fix, however, is much more involved and requires us transforming an intrinsic node back into a call to managed fairly late (rationalization or lowering).
|
// Fold result, if possible. | ||
op1 = gtFoldExpr(op1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This one is hit for MATH_OP2
, so cases like ADD
, SUB
, MUL
, DIV
, REM
, AND
, OR
, XOR
// Fold result, if possible. | ||
op1 = gtFoldExpr(op1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This one is hit for SH_OP2
, so cases like SHL
, SHR
, SHR_UN
// Fold result, if possible. | ||
op1 = gtFoldExpr(op1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This one is hit for NOT
// Fold result, if possible. | ||
op1 = gtFoldExpr(op1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This one is hit for NEG
If the TP impact of this is bad or unacceptable, an alternative would be to have the HWIntrinsic import logic call However, doing it here is going to catch more cases and enable more early optimization opportunities, so I thought it the better approach to try first. Notably |
Diff results for #97901Throughput diffsThroughput diffs for linux/arm64 ran on windows/x64Overall (-0.13% to +0.02%)
MinOpts (-0.98% to +0.05%)
FullOpts (-0.25% to +0.02%)
Throughput diffs for linux/x64 ran on windows/x64Overall (-0.12% to +0.02%)
MinOpts (-1.05% to +0.06%)
FullOpts (-0.24% to +0.02%)
Throughput diffs for osx/arm64 ran on windows/x64Overall (-0.06% to +0.03%)
MinOpts (-1.00% to +0.04%)
FullOpts (-0.04% to +0.03%)
Throughput diffs for windows/arm64 ran on windows/x64Overall (-0.06% to +0.03%)
MinOpts (-1.00% to +0.05%)
FullOpts (-0.04% to +0.03%)
Throughput diffs for windows/x64 ran on windows/x64Overall (-0.13% to +0.02%)
MinOpts (-1.08% to +0.06%)
FullOpts (-0.26% to +0.02%)
Details here |
Diff results for #97901Assembly diffsAssembly diffs for linux/arm64 ran on windows/x64Diffs are based on 2,506,606 contexts (1,007,092 MinOpts, 1,499,514 FullOpts). MISSED contexts: base: 258 (0.01%), diff: 712 (0.03%) Overall (-477,608 bytes)
MinOpts (-487,164 bytes)
FullOpts (+9,556 bytes)
Assembly diffs for linux/x64 ran on windows/x64Diffs are based on 2,517,106 contexts (991,070 MinOpts, 1,526,036 FullOpts). MISSED contexts: base: 389 (0.02%), diff: 803 (0.03%) Overall (-297,323 bytes)
MinOpts (-310,501 bytes)
FullOpts (+13,178 bytes)
Assembly diffs for osx/arm64 ran on windows/x64Diffs are based on 2,270,502 contexts (932,669 MinOpts, 1,337,833 FullOpts). MISSED contexts: base: 26 (0.00%), diff: 368 (0.02%) Overall (-474,772 bytes)
MinOpts (-486,704 bytes)
FullOpts (+11,932 bytes)
Assembly diffs for windows/arm64 ran on windows/x64Diffs are based on 2,340,391 contexts (938,449 MinOpts, 1,401,942 FullOpts). MISSED contexts: base: 365 (0.02%), diff: 726 (0.03%) Overall (-477,704 bytes)
MinOpts (-487,284 bytes)
FullOpts (+9,580 bytes)
Assembly diffs for windows/x64 ran on windows/x64Diffs are based on 2,511,270 contexts (997,391 MinOpts, 1,513,879 FullOpts). MISSED contexts: base: 478 (0.02%), diff: 942 (0.04%) Overall (-301,848 bytes)
MinOpts (-303,461 bytes)
FullOpts (+1,613 bytes)
Details here Assembly diffs for windows/x86 ran on windows/x86Diffs are based on 2,293,125 contexts (839,657 MinOpts, 1,453,468 FullOpts). MISSED contexts: base: 47 (0.00%), diff: 369 (0.02%) Overall (-134,970 bytes)
MinOpts (-149,310 bytes)
FullOpts (+14,340 bytes)
Details here Throughput diffsThroughput diffs for linux/arm ran on windows/x86Overall (-0.04% to +0.02%)
MinOpts (-0.92% to +0.03%)
FullOpts (-0.04% to +0.02%)
Throughput diffs for windows/x86 ran on windows/x86Overall (-0.07% to +0.02%)
MinOpts (-1.08% to +0.05%)
FullOpts (-0.06% to +0.02%)
Details here Throughput diffs for linux/arm64 ran on linux/x64Overall (-0.14% to +0.01%)
MinOpts (-1.02% to +0.03%)
FullOpts (-0.27% to +0.01%)
Throughput diffs for linux/x64 ran on linux/x64Overall (-0.14% to +0.01%)
MinOpts (-1.10% to +0.02%)
FullOpts (-0.26% to +0.01%)
Details here |
Diff results for #97901Assembly diffsAssembly diffs for windows/x86 ran on windows/x86Diffs are based on 2,293,125 contexts (839,657 MinOpts, 1,453,468 FullOpts). MISSED contexts: base: 47 (0.00%), diff: 369 (0.02%) Overall (-134,970 bytes)
MinOpts (-149,310 bytes)
FullOpts (+14,340 bytes)
Details here |
Diff results for #97901Throughput diffsThroughput diffs for linux/arm64 ran on linux/x64Overall (-0.13% to +0.02%)
MinOpts (-0.98% to +0.05%)
FullOpts (-0.25% to +0.02%)
Throughput diffs for linux/x64 ran on linux/x64Overall (-0.12% to +0.02%)
MinOpts (-1.05% to +0.06%)
FullOpts (-0.24% to +0.02%)
Throughput diffs for osx/arm64 ran on linux/x64Overall (-0.06% to +0.03%)
MinOpts (-1.00% to +0.04%)
FullOpts (-0.04% to +0.03%)
Throughput diffs for windows/arm64 ran on linux/x64Overall (-0.06% to +0.03%)
MinOpts (-1.00% to +0.05%)
FullOpts (-0.04% to +0.03%)
Throughput diffs for windows/x64 ran on linux/x64Overall (-0.13% to +0.02%)
MinOpts (-1.08% to +0.06%)
FullOpts (-0.26% to +0.02%)
Details here |
Diff results for #97901Assembly diffsAssembly diffs for linux/arm64 ran on windows/x64Diffs are based on 2,505,432 contexts (1,007,092 MinOpts, 1,498,340 FullOpts). MISSED contexts: base: 1,433 (0.06%), diff: 1,886 (0.08%) Overall (-477,640 bytes)
MinOpts (-487,164 bytes)
FullOpts (+9,524 bytes)
Assembly diffs for linux/x64 ran on windows/x64Diffs are based on 2,515,914 contexts (991,070 MinOpts, 1,524,844 FullOpts). MISSED contexts: base: 1,584 (0.06%), diff: 1,995 (0.08%) Overall (-297,309 bytes)
MinOpts (-310,501 bytes)
FullOpts (+13,192 bytes)
Assembly diffs for osx/arm64 ran on windows/x64Diffs are based on 2,269,757 contexts (932,669 MinOpts, 1,337,088 FullOpts). MISSED contexts: base: 772 (0.03%), diff: 1,113 (0.05%) Overall (-474,764 bytes)
MinOpts (-486,704 bytes)
FullOpts (+11,940 bytes)
Assembly diffs for windows/arm64 ran on windows/x64Diffs are based on 2,339,448 contexts (938,449 MinOpts, 1,400,999 FullOpts). MISSED contexts: base: 1,309 (0.06%), diff: 1,669 (0.07%) Overall (-477,696 bytes)
MinOpts (-487,284 bytes)
FullOpts (+9,588 bytes)
Assembly diffs for windows/x64 ran on windows/x64Diffs are based on 2,510,378 contexts (997,391 MinOpts, 1,512,987 FullOpts). MISSED contexts: base: 1,370 (0.05%), diff: 1,834 (0.07%) Overall (-301,854 bytes)
MinOpts (-303,461 bytes)
FullOpts (+1,607 bytes)
Details here Assembly diffs for windows/x86 ran on windows/x86Diffs are based on 2,291,724 contexts (839,657 MinOpts, 1,452,067 FullOpts). MISSED contexts: base: 1,452 (0.06%), diff: 1,770 (0.08%) Overall (-134,992 bytes)
MinOpts (-149,310 bytes)
FullOpts (+14,318 bytes)
Details here Throughput diffsThroughput diffs for linux/arm ran on windows/x86Overall (-0.04% to +0.02%)
MinOpts (-0.92% to +0.03%)
FullOpts (-0.04% to +0.02%)
Throughput diffs for windows/x86 ran on windows/x86Overall (-0.07% to +0.02%)
MinOpts (-1.08% to +0.05%)
FullOpts (-0.06% to +0.02%)
Details here Throughput diffs for linux/arm64 ran on linux/x64Overall (-0.14% to +0.01%)
MinOpts (-1.02% to +0.03%)
FullOpts (-0.27% to +0.01%)
Throughput diffs for linux/x64 ran on linux/x64Overall (-0.14% to +0.01%)
MinOpts (-1.10% to +0.02%)
FullOpts (-0.26% to +0.01%)
Details here |
Diff results for #97901Throughput diffsThroughput diffs for linux/arm64 ran on windows/x64Overall (-0.13% to +0.02%)
MinOpts (-0.98% to +0.05%)
FullOpts (-0.25% to +0.02%)
Throughput diffs for linux/x64 ran on windows/x64Overall (-0.12% to +0.02%)
MinOpts (-1.05% to +0.06%)
FullOpts (-0.24% to +0.02%)
Throughput diffs for osx/arm64 ran on windows/x64Overall (-0.06% to +0.03%)
MinOpts (-1.00% to +0.04%)
FullOpts (-0.04% to +0.03%)
Throughput diffs for windows/arm64 ran on windows/x64Overall (-0.06% to +0.03%)
MinOpts (-1.00% to +0.05%)
FullOpts (-0.04% to +0.03%)
Throughput diffs for windows/x64 ran on windows/x64Overall (-0.13% to +0.02%)
MinOpts (-1.08% to +0.06%)
FullOpts (-0.26% to +0.02%)
Details here Throughput diffs for linux/arm64 ran on linux/x64Overall (-0.14% to +0.01%)
MinOpts (-1.02% to +0.03%)
FullOpts (-0.27% to +0.01%)
Throughput diffs for linux/x64 ran on linux/x64Overall (-0.14% to +0.01%)
MinOpts (-1.10% to +0.02%)
FullOpts (-0.26% to +0.01%)
Details here Throughput diffs for linux/arm ran on windows/x86Overall (-0.04% to +0.02%)
MinOpts (-0.92% to +0.03%)
FullOpts (-0.04% to +0.02%)
Throughput diffs for windows/x86 ran on windows/x86Overall (-0.07% to +0.02%)
MinOpts (-1.08% to +0.05%)
FullOpts (-0.07% to +0.02%)
Details here |
Diff results for #97901Throughput diffsThroughput diffs for windows/x64 ran on linux/x64Overall (-0.13% to +0.02%)
MinOpts (-1.08% to +0.06%)
FullOpts (-0.26% to +0.02%)
Details here Throughput diffs for windows/x86 ran on linux/x86Overall (-0.07% to +0.02%)
MinOpts (-1.08% to +0.05%)
FullOpts (-0.07% to +0.02%)
Details here |
Diff results for #97901Assembly diffsAssembly diffs for linux/arm64 ran on windows/x64Diffs are based on 2,505,432 contexts (1,007,092 MinOpts, 1,498,340 FullOpts). MISSED contexts: base: 1,433 (0.06%), diff: 1,886 (0.08%) Overall (-477,640 bytes)
MinOpts (-487,164 bytes)
FullOpts (+9,524 bytes)
Assembly diffs for linux/x64 ran on windows/x64Diffs are based on 2,515,914 contexts (991,070 MinOpts, 1,524,844 FullOpts). MISSED contexts: base: 1,584 (0.06%), diff: 1,995 (0.08%) Overall (-297,309 bytes)
MinOpts (-310,501 bytes)
FullOpts (+13,192 bytes)
Assembly diffs for osx/arm64 ran on windows/x64Diffs are based on 2,269,757 contexts (932,669 MinOpts, 1,337,088 FullOpts). MISSED contexts: base: 772 (0.03%), diff: 1,113 (0.05%) Overall (-474,764 bytes)
MinOpts (-486,704 bytes)
FullOpts (+11,940 bytes)
Assembly diffs for windows/arm64 ran on windows/x64Diffs are based on 2,339,448 contexts (938,449 MinOpts, 1,400,999 FullOpts). MISSED contexts: base: 1,309 (0.06%), diff: 1,669 (0.07%) Overall (-477,696 bytes)
MinOpts (-487,284 bytes)
FullOpts (+9,588 bytes)
Assembly diffs for windows/x64 ran on windows/x64Diffs are based on 2,510,378 contexts (997,391 MinOpts, 1,512,987 FullOpts). MISSED contexts: base: 1,370 (0.05%), diff: 1,834 (0.07%) Overall (-301,854 bytes)
MinOpts (-303,461 bytes)
FullOpts (+1,607 bytes)
Details here Assembly diffs for windows/x86 ran on windows/x86Diffs are based on 2,291,726 contexts (839,658 MinOpts, 1,452,068 FullOpts). MISSED contexts: base: 1,452 (0.06%), diff: 1,770 (0.08%) Overall (-134,992 bytes)
MinOpts (-149,310 bytes)
FullOpts (+14,318 bytes)
Details here Throughput diffsThroughput diffs for linux/arm64 ran on windows/x64Overall (-0.13% to +0.02%)
MinOpts (-0.98% to +0.05%)
FullOpts (-0.25% to +0.02%)
Throughput diffs for linux/x64 ran on windows/x64Overall (-0.12% to +0.02%)
MinOpts (-1.05% to +0.06%)
FullOpts (-0.24% to +0.03%)
Throughput diffs for osx/arm64 ran on windows/x64Overall (-0.06% to +0.03%)
MinOpts (-1.00% to +0.04%)
FullOpts (-0.04% to +0.03%)
Throughput diffs for windows/arm64 ran on windows/x64Overall (-0.06% to +0.03%)
MinOpts (-1.00% to +0.05%)
FullOpts (-0.04% to +0.03%)
Details here Throughput diffs for linux/arm ran on windows/x86Overall (-0.04% to +0.02%)
MinOpts (-0.92% to +0.03%)
FullOpts (-0.04% to +0.02%)
Details here Throughput diffs for linux/arm64 ran on linux/x64Overall (-0.14% to +0.01%)
MinOpts (-1.02% to +0.03%)
FullOpts (-0.27% to +0.01%)
Throughput diffs for linux/x64 ran on linux/x64Overall (-0.14% to +0.01%)
MinOpts (-1.10% to +0.02%)
FullOpts (-0.26% to +0.01%)
Details here |
Looks like this particularly helps inlining eliminate dead branches and that in turn allows forward sub to do more. Without it, for a case like
The fact that we couldn't remove it earlier means we missed opportunities like the following: [######] hA-XG------ * STOREIND ref
[######] ---X------- +--* FIELD_ADDR byref System.Formats.Tar.TarHeader:_magic
[######] ----------- | \--* LCL_VAR ref V05 tmp2
- [######] ----------- \--* LCL_VAR ref V10 tmp7 (last use)
+ [######] ----------- \--* CNS_STR ref <string constant> and [######] hA-XG+----- * STOREIND ref
[######] -----+----- +--* ADD byref
[######] -----+----- | +--* LCL_VAR ref V05 tmp2
[######] -----+----- | \--* CNS_INT long 32 Fseq[_magic]
- [######] -----+----- \--* LCL_VAR ref V10 tmp7 (last use)
+ [######] H----+----- \--* CNS_INT(h) ref '' This also means, in this particular case, we're tracking 6 additional basic blocks all the way up through In both cases, we end up with CSE deciding to CSE Before this PR we get (noting it looks like handle constants have an extra newline putting disconnecting the - N002 ( 7, 13) [000152] DA--------- * STORE_LCL_VAR ref V12 tmp9 d:1 $VN.Void
- N001 ( 3, 10) CSE #02 (use)[000182] H---------- \--* CNS_INT(h) ref ''
+ N005 ( 8, 14) [000133] DA--------- * STORE_LCL_VAR ref V10 tmp7 d:1 $VN.Void
+ N004 ( 4, 11) [000207] -A--------- \--* COMMA ref $207
+ N002 ( 3, 10) [000205] DA--------- +--* STORE_LCL_VAR ref V17 cse1 d:1 $VN.Void
+ N001 ( 3, 10) [000178] H---------- | \--* CNS_INT(h) ref ''
+ $207
+ N003 ( 1, 1) [000206] ----------- \--* LCL_VAR ref V17 cse1 u:1 $207 After this PR we instead get: -N005 ( 8, 15) [000115] hA-XG------ * STOREIND ref $VN.Void
-N003 ( 2, 2) [000176] -------N--- +--* ADD byref $2c7
-N001 ( 1, 1) [000110] ----------- | +--* LCL_VAR ref V05 tmp2 u:1 $1c2
-N002 ( 1, 1) [000175] ----------- | \--* CNS_INT long 40 Fseq[_version] $289
-N004 ( 3, 10) CSE #02 (use)[000177] H---------- \--* CNS_INT(h) ref ''
+N008 ( 9, 16) [000109] hA-XG---R-- * STOREIND ref $VN.Void
+N007 ( 2, 2) [000173] -------N--- +--* ADD byref $2c6
+N005 ( 1, 1) [000104] ----------- | +--* LCL_VAR ref V05 tmp2 u:1 $1c2
+N006 ( 1, 1) [000172] ----------- | \--* CNS_INT long 32 Fseq[_magic] $288
+N004 ( 4, 11) [000199] -A--------- \--* COMMA ref $207
+N002 ( 3, 10) [000197] DA--------- +--* STORE_LCL_VAR ref V17 cse1 d:1 $VN.Void
+N001 ( 3, 10) [000174] H---------- | \--* CNS_INT(h) ref ''
+ $207
+N003 ( 1, 1) [000198] ----------- \--* LCL_VAR ref V17 cse1 u:1 $207 We then hit N005 ( 8, 7) [000109] hA-XG------ * STOREIND ref $VN.Void
N003 ( 2, 2) [000181] -------N--- +--* ADD byref $2c6
N001 ( 1, 1) [000104] ----------- | +--* LCL_VAR ref V05 tmp2 u:1 $1c2
N002 ( 1, 1) [000180] ----------- | \--* CNS_INT long 32 Fseq[_magic] $288
-N004 ( 3, 2) [000126] ----------- \--* LCL_VAR ref V10 tmp7 u:1 (last use) $207
+ [000209] H---------- \--* CNS_INT(h) long ''
+ $207 but where we don't get the same with the PR, presumably because of the This then causes us to emit additional |
src/coreclr/jit/assertionprop.cpp
Outdated
if (indir->OperIs(GT_STOREIND)) | ||
{ | ||
return false; | ||
// We like CSE to happen for handles, as the codegen for loading a 64-bit constant can be pretty heavy | ||
// and this is particularly true on platforms with a fixed-width instruction encoding. However, this | ||
// pessimizes stores as we can no longer optimize around some object handles that would allow us to | ||
// bypass the write barrier. | ||
// | ||
// In order to handle that, we'll propagate the IND_TGT_NOT_HEAP flag onto the store if the handle is | ||
// directly or if the underlying value number is an applicable object handle. | ||
|
||
const GenTree* data = indir->AsIndir()->Data(); | ||
if (data->IsIconHandle(GTF_ICON_OBJ_HDL) || vnStore->IsVNObjHandle(data->gtVNPair.GetConservative())) | ||
{ | ||
indir->gtFlags |= GTF_IND_TGT_NOT_HEAP; | ||
updated = true; | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This ended up being a simple workaround to the issue described above using the existing flag.
It even fixes some cases that we weren't handling today and ensures that we can continue using the CSE, without relying on LSRA making propagated constants reused itself.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(actually change is courtesy EgorBo, from a discussion on Discord)
a97584e
to
0236d0b
Compare
0236d0b
to
a1b4287
Compare
Diff results for #97901Assembly diffsAssembly diffs for linux/arm64 ran on linux/x64Diffs are based on 2,505,433 contexts (1,007,092 MinOpts, 1,498,341 FullOpts). MISSED contexts: base: 1,433 (0.06%), diff: 1,885 (0.08%) Overall (-941,616 bytes)
MinOpts (-487,164 bytes)
FullOpts (-454,452 bytes)
Assembly diffs for osx/arm64 ran on linux/x64Diffs are based on 2,269,758 contexts (932,669 MinOpts, 1,337,089 FullOpts). MISSED contexts: base: 772 (0.03%), diff: 1,112 (0.05%) Overall (-923,512 bytes)
MinOpts (-486,704 bytes)
FullOpts (-436,808 bytes)
Assembly diffs for windows/arm64 ran on linux/x64Diffs are based on 2,339,449 contexts (938,449 MinOpts, 1,401,000 FullOpts). MISSED contexts: base: 1,309 (0.06%), diff: 1,668 (0.07%) Overall (-949,664 bytes)
MinOpts (-487,284 bytes)
FullOpts (-462,380 bytes)
Assembly diffs for windows/x64 ran on linux/x64Diffs are based on 2,510,378 contexts (997,391 MinOpts, 1,512,987 FullOpts). MISSED contexts: base: 1,370 (0.05%), diff: 1,834 (0.07%) Overall (-741,486 bytes)
MinOpts (-303,461 bytes)
FullOpts (-438,025 bytes)
Details here Throughput diffsThroughput diffs for linux/arm64 ran on windows/x64Overall (-0.12% to +0.06%)
MinOpts (-0.98% to +0.05%)
FullOpts (-0.23% to +0.06%)
Throughput diffs for linux/x64 ran on windows/x64Overall (-0.11% to +0.06%)
MinOpts (-1.05% to +0.06%)
FullOpts (-0.23% to +0.06%)
Throughput diffs for osx/arm64 ran on windows/x64Overall (-0.01% to +0.08%)
MinOpts (-1.00% to +0.04%)
FullOpts (-0.00% to +0.08%)
Throughput diffs for windows/arm64 ran on windows/x64Overall (-0.00% to +0.08%)
MinOpts (-1.00% to +0.05%)
FullOpts (+0.00% to +0.08%)
Throughput diffs for windows/x64 ran on windows/x64Overall (-0.12% to +0.06%)
MinOpts (-1.08% to +0.06%)
FullOpts (-0.24% to +0.06%)
Details here Throughput diffs for windows/x86 ran on linux/x86Overall (-0.02% to +0.05%)
MinOpts (-1.08% to +0.05%)
FullOpts (-0.06% to +0.05%)
Details here |
Diff results for #97901Assembly diffsAssembly diffs for linux/x64 ran on windows/x64Diffs are based on 2,515,915 contexts (991,070 MinOpts, 1,524,845 FullOpts). MISSED contexts: base: 1,584 (0.06%), diff: 1,994 (0.08%) Overall (-706,269 bytes)
MinOpts (-310,501 bytes)
FullOpts (-395,768 bytes)
Details here Assembly diffs for windows/x86 ran on windows/x86Diffs are based on 2,291,726 contexts (839,658 MinOpts, 1,452,068 FullOpts). MISSED contexts: base: 1,452 (0.06%), diff: 1,770 (0.08%) Overall (-134,992 bytes)
MinOpts (-149,310 bytes)
FullOpts (+14,318 bytes)
Details here Throughput diffsThroughput diffs for linux/arm ran on windows/x86Overall (+0.00% to +0.07%)
MinOpts (-0.92% to +0.03%)
FullOpts (-0.01% to +0.07%)
Details here Throughput diffs for linux/arm64 ran on linux/x64Overall (-0.13% to +0.04%)
MinOpts (-1.02% to +0.03%)
FullOpts (-0.25% to +0.04%)
Throughput diffs for linux/x64 ran on linux/x64Overall (-0.13% to +0.04%)
MinOpts (-1.10% to +0.02%)
FullOpts (-0.25% to +0.04%)
Details here |
CC. @dotnet/jit-contrib This should be ready for review now. For For For the size, we between For the size on x86 we see around |
Can you characterize the C# code that leads to new folding in MinOpts code? We need to make sure we do not regress the ability to set breakpoints (probably not if this is just folding trees only, but it would be good to double check). |
// nullptr here, but the more conservative assert can help avoid JIT bugs | ||
|
||
assert(a->GetKind() == FieldSeq::FieldKind::SimpleStaticKnownAddress); | ||
assert(b->GetKind() == FieldSeq::FieldKind::SimpleStaticKnownAddress); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did you hit that assert?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, its what was discussed further up in #97901 (comment)
Basically, we have an IL test that explicitly adds two ldsflda
. It does so after a conv.i4
, but that is a nop
on 32-bit and a user can write such IL without the conv.i4
as well, so the assert was already somewhat incorrect just not hit due to us not constant folding this early.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok. Yeah I think we discussed it with SingleAccretion at some point
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM with a couple of nits and Jakob's request
My understanding that this should not affect debuggability because: |
This is my understanding as well. We explicitly avoid constant folding like this in true MinOpts (different from T0) or if generating DbgCode. To give some explicit examples, however: [MethodImpl(MethodImplOptions.AggressiveInlining)]
private static nint GetCharVector256SpanLength(nint offset, nint length)
=> (length - offset) & ~(Vector256<ushort>.Count - 1); Gets the following: - mov ecx, 16
- dec ecx
- not ecx
- movsxd rcx, ecx
- and rax, rcx
- ;; size=23 bbWeight=1 PerfScore 4.25
+ and rax, -16
+ ;; size=12 bbWeight=1 PerfScore 3.25 It happens because There is maybe an interesting case with https://github.com/dotnet/runtime/blob/main/src/tests/JIT/Regression/VS-ia64-JIT/M00/b85316/b85316_M00.il where the diff is as follows: -G_M4754_IG01: ; bbWeight=1, gcVars=0000000000000000 {}, gcrefRegs=0000 {}, byrefRegs=0000 {}, gcvars, byref, nogc <-- Prolog IG
+G_M4754_IG01: ; bbWeight=1, gcrefRegs=0000 {}, byrefRegs=0000 {}, byref, nogc <-- Prolog IG
push rbp
- vzeroupper
mov rbp, rsp
- ;; size=7 bbWeight=1 PerfScore 2.25
-G_M4754_IG02: ; bbWeight=1, gcrefRegs=0000 {}, byrefRegs=0000 {}, byref, isz
- vmovsd xmm0, qword ptr [reloc @RWD00]
- vxorps xmm0, xmm0, xmmword ptr [reloc @RWD16]
- vucomisd xmm0, qword ptr [reloc @RWD32]
- jp SHORT G_M4754_IG03
- je SHORT G_M4754_IG05
- ;; size=28 bbWeight=1 PerfScore 11.00
-G_M4754_IG03: ; bbWeight=1, gcrefRegs=0000 {}, byrefRegs=0000 {}, byref
- mov eax, 5
- ;; size=5 bbWeight=1 PerfScore 0.25
-G_M4754_IG04: ; bbWeight=1, epilog, nogc, extend
- pop rbp
- ret
- ;; size=2 bbWeight=1 PerfScore 1.50
-G_M4754_IG05: ; bbWeight=1, gcVars=0000000000000000 {}, gcrefRegs=0000 {}, byrefRegs=0000 {}, gcvars, byref
+ ;; size=4 bbWeight=1 PerfScore 1.25
+G_M4754_IG02: ; bbWeight=1, gcrefRegs=0000 {}, byrefRegs=0000 {}, byref
mov eax, 100
;; size=5 bbWeight=1 PerfScore 0.25
-G_M4754_IG06: ; bbWeight=1, epilog, nogc, extend
+G_M4754_IG03: ; bbWeight=1, epilog, nogc, extend
pop rbp
ret
;; size=2 bbWeight=1 PerfScore 1.50
-RWD00 dq 4058C66666666666h ; 99.1
-RWD08 dd 00000000h, 00000000h
-RWD16 dq 8000000000000000h, 8000000000000000h
-RWD32 dq C058C66666666666h ; -99.1
-
-; Total bytes of code 49, prolog size 7, PerfScore 16.75, instruction count 14, allocated bytes for code 49 (MethodHash=9460ed6d) for method _n:main():int (Tier0)
+; Total bytes of code 11, prolog size 4, PerfScore 3.00, instruction count 5, allocated bytes for code 11 (MethodHash=9460ed6d) for method _n:main():int (Tier0) In this scenario, the JIT is able to constant fold the comparison between the two doubles and treat it as constant true. This even allows it to eliminate a dead branch in T0, but that is pre-existing and the same would happen if the user had done something that didn't involve All the examples I'm finding are pretty similar, where its just trivial arithmetic involving constants that can now be folded. |
Thanks for checking. |
Diff results for #97901Assembly diffsAssembly diffs for linux/arm64 ran on windows/x64Diffs are based on 2,505,433 contexts (1,007,092 MinOpts, 1,498,341 FullOpts). MISSED contexts: base: 1,433 (0.06%), diff: 1,885 (0.08%) Overall (-941,616 bytes)
MinOpts (-487,164 bytes)
FullOpts (-454,452 bytes)
Assembly diffs for linux/x64 ran on windows/x64Diffs are based on 2,515,915 contexts (991,070 MinOpts, 1,524,845 FullOpts). MISSED contexts: base: 1,584 (0.06%), diff: 1,994 (0.08%) Overall (-706,269 bytes)
MinOpts (-310,501 bytes)
FullOpts (-395,768 bytes)
Assembly diffs for osx/arm64 ran on windows/x64Diffs are based on 2,269,758 contexts (932,669 MinOpts, 1,337,089 FullOpts). MISSED contexts: base: 772 (0.03%), diff: 1,112 (0.05%) Overall (-923,512 bytes)
MinOpts (-486,704 bytes)
FullOpts (-436,808 bytes)
Assembly diffs for windows/arm64 ran on windows/x64Diffs are based on 2,339,449 contexts (938,449 MinOpts, 1,401,000 FullOpts). MISSED contexts: base: 1,309 (0.06%), diff: 1,668 (0.07%) Overall (-949,664 bytes)
MinOpts (-487,284 bytes)
FullOpts (-462,380 bytes)
Assembly diffs for windows/x64 ran on windows/x64Diffs are based on 2,510,378 contexts (997,391 MinOpts, 1,512,987 FullOpts). MISSED contexts: base: 1,370 (0.05%), diff: 1,834 (0.07%) Overall (-741,486 bytes)
MinOpts (-303,461 bytes)
FullOpts (-438,025 bytes)
Details here Throughput diffsThroughput diffs for linux/arm64 ran on linux/x64Overall (-0.13% to +0.04%)
MinOpts (-1.02% to +0.03%)
FullOpts (-0.25% to +0.04%)
Throughput diffs for linux/x64 ran on linux/x64Overall (-0.13% to +0.04%)
MinOpts (-1.10% to +0.02%)
FullOpts (-0.25% to +0.04%)
Details here Throughput diffs for linux/arm64 ran on windows/x64Overall (-0.12% to +0.06%)
MinOpts (-0.98% to +0.05%)
FullOpts (-0.23% to +0.06%)
Throughput diffs for linux/x64 ran on windows/x64Overall (-0.11% to +0.06%)
MinOpts (-1.05% to +0.05%)
FullOpts (-0.23% to +0.06%)
Throughput diffs for osx/arm64 ran on windows/x64Overall (-0.01% to +0.08%)
MinOpts (-1.00% to +0.04%)
FullOpts (-0.00% to +0.08%)
Throughput diffs for windows/arm64 ran on windows/x64Overall (-0.00% to +0.08%)
MinOpts (-1.00% to +0.05%)
FullOpts (+0.00% to +0.08%)
Throughput diffs for windows/x64 ran on windows/x64Overall (-0.12% to +0.06%)
MinOpts (-1.08% to +0.06%)
FullOpts (-0.24% to +0.06%)
Details here Throughput diffs for linux/arm ran on windows/x86Overall (+0.00% to +0.07%)
MinOpts (-0.92% to +0.03%)
FullOpts (-0.01% to +0.07%)
Throughput diffs for windows/x86 ran on windows/x86Overall (-0.02% to +0.05%)
MinOpts (-1.08% to +0.05%)
FullOpts (-0.06% to +0.05%)
Details here |
Diff results for #97901Assembly diffsAssembly diffs for windows/x86 ran on windows/x86Diffs are based on 2,291,726 contexts (839,658 MinOpts, 1,452,068 FullOpts). MISSED contexts: base: 1,452 (0.06%), diff: 1,770 (0.08%) Overall (-134,992 bytes)
MinOpts (-149,310 bytes)
FullOpts (+14,318 bytes)
Details here |
This resolves a user reported issue where despite inlining, the JIT may not see that a given value is constant during import due to these operations not getting folded: SixLabors/ImageSharp#2654 (comment)
This resulted in massively pessimized codegen since hardware intrinsics have to fallback to a call and a very large jump table instead to ensure the right behavior is executed.
The fix was simple and just involves calling
gtFoldExpr
before pushing the resulting node on the stack, as we're already doing for several other scenarios such as comparisons and similar cases.It is not a "complete" fix and there may still be other places where something becomes a constant after importation. That fix, however, is much more involved and requires us transforming an intrinsic node back into a call to managed fairly late (rationalization or lowering).