Optimize 'x & cns == cns' pattern #103868
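For context, a hedged C# illustration of the source pattern this PR targets; `HasAllFlags` is a hypothetical name, not code from the PR:

```csharp
// True only when every bit of `mask` is also set in `value`.
// Today this compiles to an "and; cmp; branch" sequence; the PR explores
// rewriting it as "(~value & mask) == 0" so one flag-setting instruction
// (ARM64 bics, x64 andn) plus a branch can replace the pair.
static bool HasAllFlags(int value, int mask) => (value & mask) == mask;
```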
Conversation
@dotnet-policy-service agree
src/coreclr/jit/morph.cpp
Outdated
```cpp
ssize_t cnsVal = op2->AsIntCon()->IconValue();
// ...
if (andOp1->TypeIs(TYP_INT, TYP_LONG) && andOp2->IsIntegralConst() && andOp2->AsIntCon()->IconValue() == cnsVal)
```
IsIntegralConst should be replaced with IsIntCns or whatever it's called, I forgot. Otherwise AsIntCon() might fail for what is actually AsLngCon.
Also, you can just do GenTree::Compare(op2, andOp2)
src/coreclr/jit/morph.cpp
Outdated
```cpp
GenTree* tmpNode   = andOp2;
op1->AsOp()->gtOp2 = gtNewOperNode(GT_NOT, andOp1->TypeGet(), andOp1);
op1->AsOp()->gtOp1 = tmpNode;
op2->AsIntConCommon()->SetIconValue(0);
```
this should be op2->gtBashToCns
Could you point me to where this function is located? I can only find a gtBashToNOP
Found a gtBashToZeroConst function, so I'm using that instead.
src/coreclr/jit/morph.cpp
Outdated
```cpp
op2->AsIntConCommon()->SetIconValue(0);
#if defined(TARGET_ARM) || defined(TARGET_ARM64)
// If set for XARCH it will prevent ANDN instruction from being emitted
op1->gtFlags |= GTF_SET_FLAGS;
```
AFAIR, GTF_SET_FLAGS is a lowering concept and shouldn't be used in morph.
Any idea how I can get bic to become bics then?
This and the change in codegen are not quite the right way to accomplish reusing the flags. Instead you should take a look at Lowering::TryLowerConditionToFlagsNode, specifically the part that checks SupportsSettingZeroFlags, and try to update that logic to handle these cases as well.
Thank you Jakob! This was a great hint and I figured out a generic solution.
It looks like this is a regression overall: https://dev.azure.com/dnceng-public/public/_build/results?buildId=726440&view=ms.vss-build-web.run-extensions-tab Can you look into why that is and see if you can fix the regressions?
I have looked at the diffs and I am trying a fix with my new commit. May I ask how you identified there were problems? The job you referenced ran successfully, so I assumed nothing was wrong. Does the presence of artifacts suggest something is wrong? Thanks!
Yes, the superpmi-diffs job runs the compiler on a large number of methods and produces codegen diffs that can be used to evaluate changes in the JIT. The diffs are available under the "Extensions" page when you access the pipeline run on Azure Pipelines. Your new change looks better diffs-wise, but it still looks like it regresses some common cases, so you may want to take another look at some of the regressions.
Hi Jakob, thank you for all your help so far; it has been of great assistance in working on this issue. I have a few problems that I can't seem to fix, and I'm hoping you can provide some insight.

For x64: the expected codegen based on the raised issue replaces the `and` + `cmp` instruction sequence. One regression was this:

```diff
movzx rdx, byte ptr [rax+0x18]
- and edx, 7
- cmp edx, 7
+ mov r8d, 7
+ andn edx, edx, r8d
jne SHORT G_M31552_IG05
```

Below is the example in the original linked issue:

```asm
and edx, 0xC000000
cmp edx, 0xC000000
jne SHORT G_M37282_IG04
```

This required a mov for the constant hex value 0xC000000. Why was a mov necessary, unlike the above value 7, which was directly used? The value 0xC000000 is 28 bits long, and as far as I know, the

Then for x86: When

Then for ARM64:

```diff
; byrRegs +[x0]
ldr w0, [x0]
; byrRegs -[x0]
- and w0, w0, #0xD1FFAB1E
- cmp w0, #48, LSL #12
+ mov w1, #0xD1FFAB1E
+ bic w0, w1, w0
+ cmp w0, #0
cset x0, eq
```

I found this peculiar example. My code change makes sure that the constant used in the ... 0x48 logically shifted left 0x12 times is 0x1200000. Am I interpreting the instruction format wrong? Thank you very much in advance!
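One plausible resolution of the `cmp w0, #48, LSL #12` confusion, sketched below: in this disassembly those operands are decimal (hex immediates are printed with a 0x prefix, as in `#0xD1FFAB1E` above), and the ARM64 compare-immediate form takes a 12-bit value optionally shifted left by 12 bits. A quick check, as a hypothetical C# snippet not taken from the thread:

```csharp
using System;

// ARM64 "cmp w0, #48, LSL #12" encodes the 12-bit immediate 48 (decimal)
// shifted left by 12 bits:
Console.WriteLine((48 << 12).ToString("X"));      // "30000"   -> 0x30000
// Reading both operands as hex instead (the interpretation in the question):
Console.WriteLine((0x48 << 0x12).ToString("X"));  // "1200000" -> 0x1200000
```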
Does anyone know how this encoding works?

```diff
; byrRegs +[x0]
ldr w0, [x0]
; byrRegs -[x0]
- and w0, w0, #0xD1FFAB1E
- cmp w0, #48, LSL #12
+ mov w1, #0xD1FFAB1E
+ bic w0, w1, w0
+ cmp w0, #0
cset x0, eq
```

If I'm not mistaken, 0xD1FFAB1E cannot be represented as a bitmask immediate.
@quantumhu 0xD1FFAB1E ("diffable") is simply a placeholder string that the JIT substitutes for the real immediate when it is asked to generate diffable codegen. The real immediate is something else; you can invoke the JIT locally to see what it is. For some information on how to run asmdiffs locally see https://github.com/dotnet/runtime/blob/main/src/coreclr/scripts/superpmi.md. Note that

For example, in your case it will end up being something like:

with adjusted paths depending on Windows/Linux and the location on the file system. Also, the context index may be different when you download the newest collection. Note that to run cross diffs (say you are on x64 and want to run win-arm64 diffs) you pass something like
src/coreclr/jit/lower.cpp
Outdated
```cpp
// Exception to this transformation is if relopOp1 can set flags and we expect TryLowerConditionToFlagsNode
// will later transform the tree to reuse the zero flag; then we can avoid changing to compare + branch here.
// TryLowerConditionToFlagsNode will only transform if optimizations are enabled, so don't bother avoiding
// this change if optimizations are off.
if (!comp->opts.OptimizationEnabled() || !relopOp1->SupportsSettingZeroFlag() || !IsInvariantInRange(relopOp1, jtrue))
{
    newOper = GT_JCMP;
    cc      = GenCondition::FromRelop(cond);
}
```
Can you give an example of the cases this improves? Generally I would not expect this to be helpful -- IIUC it will just result in changing something like `and x, 123; cbz x, <target>` to `ands x, 123; beq <target>`.
Yes, it is generally only `and/bic, cbz` -> `ands/bics, beq`. I am expecting that doing a compare after a result that already sets the zero flag is extra work, so `ands/bics` + branch should be faster. Do I have a mistaken understanding?
The way to figure that out is to measure it on some hardware. It's hard to say what the impact of something like that may be. I would expect both patterns to be handled equally efficiently by modern ARM64 hardware.
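A minimal BenchmarkDotNet sketch of how one might measure this; the harness shape and names (e.g. `AndCmpBenchmark`) are assumptions, not from the thread. The idea is to run the same benchmark against a baseline JIT and the modified JIT and compare:

```csharp
using System;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

public class AndCmpBenchmark
{
    private readonly int[] _values = new int[4096];

    [GlobalSetup]
    public void Setup()
    {
        var rng = new Random(42);
        for (int i = 0; i < _values.Length; i++)
        {
            _values[i] = rng.Next();
        }
    }

    [Benchmark]
    public int AllMaskBitsSet()
    {
        // The loop body compiles to the and/cmp (or ands/bics) sequence
        // under discussion; compare results across both JIT builds.
        int count = 0;
        foreach (int v in _values)
        {
            if ((v & 7) == 7)
            {
                count++;
            }
        }
        return count;
    }
}

public class Program
{
    public static void Main() => BenchmarkRunner.Run<AndCmpBenchmark>();
}
```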
Good idea, I will do that, thank you.

I have a couple of other questions:

- On x64, I tried to turn `and + cmp` into `not + test` instead of `andn`, but I found that the data (already in a register) that will be `not`'d is copied to a scratch register first. I think this defeats the purpose of the `not + test` structure. If the register can be used, it would be better to keep `andn`.
- On arm64, it is similarly more efficient (in my mind) that if the constant value can be used in an `and` instruction, it would be better to keep `and + cmp` as opposed to `bics`, because `bics` requires register arguments, so getting the constant into a register requires a `mov`.

For both of these, would your recommendation also be to test on hardware? I don't have much variety when it comes to testing hardware, so I'm not sure I can extrapolate that data to all cases.
Force-pushed from d3fe58b to 50de620.
This is an optimization, not a correctness issue. We don't have time to work on this for .NET 9, so we will review it in .NET 10.
@quantumhu could you please rebase (merge main) into your PR?
@EgorBo done!
Sorry for the delayed response, we've been a bit busy lately. From a quick look at SPMI diffs it seems like it's a size regression across most collections; is this expected? A minimal example seems to be:

```csharp
static void Foo(int a)
{
    if ((a & 7) == 7)
        Console.WriteLine();
}
```

```diff
; Method Benchmarks:Foo(int) (FullOpts)
- and ecx, 7
- cmp ecx, 7
+ mov eax, 7
+ andn eax, ecx, eax
je SHORT G_M35312_IG04
ret
tail.jmp [System.Console:WriteLine()]
-; Total bytes of code: 15
+; Total bytes of code: 19
```

It looks like Lower (or maybe the emitter?) is a better place for this peephole than Morph.
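The 4-byte growth is consistent with standard x64 encodings (an estimate, not from the thread): `and r32, imm8` and `cmp r32, imm8` are 3 bytes each, while `mov r32, imm32` and the VEX-encoded `andn` are 5 bytes each, so the rewrite replaces 6 bytes of code with 10.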
Is it possible to know during the lowering/emit stage whether or not a constant is being loaded into a register?

On a side note, I am unable to use DOTNET_JitDisasm since rebasing. Are there any quick things I can try to fix it? My workflow looks like this:

I know this was working before because this was the setup I was using to test.
You may try to also declare
In lower you can rely on
Unfortunately the ReadyToRun setting didn't help. I assume a PR isn't the best place to get help; where can I get more help with the disasm problem? Thanks!
Also, better put
I already had the no-inlining attribute; here's what the code looks like:
Never mind, looks like it's working now! I think my previously failed attempt at installing the newest RC of .NET 9.0 was messing with the existing .NET 8.0 on my computer. Thanks for your help!
Hi @EgorBo, I tried taking a look at the Lowering code and identified this as a good spot to put the peephole: https://github.com/dotnet/runtime/blob/main/src/coreclr/jit/lower.cpp#L4217 However, further in, there's a note that this optimization cannot be applied if the AND is in a JTRUE node, or if the AND is involved in a conditional. Both of these exclusions are pertinent to the code example we are trying to apply the optimization to:

Any suggestions on how I can proceed?
@EgorBo, please follow up on this community PR.
I think something like this might work:

```diff
From aeb873b824db3dc4f2bce8c7209fc7fde70b4b88 Mon Sep 17 00:00:00 2001
From: EgorBo <egorbo@gmail.com>
Date: Sun, 1 Dec 2024 21:59:49 +0100
Subject: [PATCH] Test

---
 src/coreclr/jit/lower.cpp | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/src/coreclr/jit/lower.cpp b/src/coreclr/jit/lower.cpp
index ce829a0730d..37339b3e251 100644
--- a/src/coreclr/jit/lower.cpp
+++ b/src/coreclr/jit/lower.cpp
@@ -4288,6 +4288,15 @@ GenTree* Lowering::OptimizeConstCompare(GenTree* cmp)
         }
 #endif
     }
+    else if ((andOp2->IsIntegralConst()) && GenTree::Compare(andOp2, op2))
+    {
+        GenTree* notNode = comp->gtNewOperNode(GT_NOT, andOp1->TypeGet(), andOp1);
+        cmp->gtGetOp1()->AsOp()->gtOp1 = notNode;
+        BlockRange().InsertAfter(andOp1, notNode);
+        op2->BashToZeroConst(op2->TypeGet());
+        andOp1 = notNode;
+        op2Value = 0;
+    }
 }

 #ifdef TARGET_XARCH
--
2.45.2.windows.1
```

Looks like this transformation definitely shouldn't be done in morph. And the current ASM diffs don't look good in this PR.
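Reading the sketch above: it structurally compares the AND's constant operand against the relop's constant (`GenTree::Compare`), and when they match it inserts a `GT_NOT` over the non-constant operand and bashes the relop's constant to zero, rewriting `(x & cns) == cns` into `(~x & cns) == 0` so the rest of `OptimizeConstCompare` sees a compare against zero. Whether `BashToZeroConst` is the exact helper name in current sources is unverified here; it is taken from the patch as posted.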
@quantumhu, please follow up with the recommendation from EgorBo, and share a new diff result with us.
Hi @JulieLeeMSFT, sorry for the late reply. I have been tied up with life commitments as of late and won't be able to dedicate further time to this.
Thanks for trying @quantumhu! Feel free to reopen if you get some more time on your hands.
Resolves #101000
The emitted code looks like this:

ARM/ARM64

x64

x64 without BMI support
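To make the underlying rewrite concrete, here is a small, self-contained C# check of the identity the optimization relies on; this is an illustrative sketch, not code from the PR:

```csharp
using System;

class IdentityCheck
{
    static void Main()
    {
        var rng = new Random(123);
        for (int i = 0; i < 1_000_000; i++)
        {
            int x = rng.Next(int.MinValue, int.MaxValue);
            int c = rng.Next(int.MinValue, int.MaxValue);

            bool andCmp = (x & c) == c;   // original shape: and + cmp
            bool notAnd = (~x & c) == 0;  // rewritten shape: andn / bics + zero test

            if (andCmp != notAnd)
                throw new Exception($"Identity violated for x={x}, c={c}");
        }
        Console.WriteLine("Identity held for all sampled (x, c) pairs.");
    }
}
```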