Optimize 'x & cns == cns' pattern #103868


Closed · wanted to merge 9 commits

Conversation

@quantumhu quantumhu commented Jun 23, 2024

Resolves #101000

Code emitted looks like this:

ARM/ARM64

mov     w1, #0xC000000
bics    w0, w1, w0
bne     G_M12719_IG05

x64

mov      eax, 0xC000000
andn     edi, edi, eax
jne      SHORT G_M12719_IG05

x64 without BMI support

not      edi
test     edi, 0xC000000
jne      SHORT G_M12719_IG05

@ghost ghost added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Jun 23, 2024
@dotnet-policy-service dotnet-policy-service bot added the community-contribution Indicates that the PR has been added by a community member label Jun 23, 2024
@quantumhu (Author)

@dotnet-policy-service agree


ssize_t cnsVal = op2->AsIntCon()->IconValue();

if (andOp1->TypeIs(TYP_INT, TYP_LONG) && andOp2->IsIntegralConst() && andOp2->AsIntCon()->IconValue() == cnsVal)
Member

IsIntegralConst should be replaced with IsIntCns or whatever it's called, I forgot. Otherwise AsIntCon() might fail for what is actually AsLngCon

Member

Also, you can just do GenTree::Compare(op2, andOp2)

GenTree* tmpNode = andOp2;
op1->AsOp()->gtOp2 = gtNewOperNode(GT_NOT, andOp1->TypeGet(), andOp1);
op1->AsOp()->gtOp1 = tmpNode;
op2->AsIntConCommon()->SetIconValue(0);
Member

this should be op2->gtBashToCns

Author

Could you point me to where this function is located? I can only find a gtBashToNOP

Author

Found a gtBashToZeroConst function, so I'm using that instead

op2->AsIntConCommon()->SetIconValue(0);
#if defined(TARGET_ARM) || defined(TARGET_ARM64)
// If set for XARCH it will prevent ANDN instruction from being emitted
op1->gtFlags |= GTF_SET_FLAGS;
Member

Afair, GTF_SET_FLAGS is a lowering concept and shouldn't be used in morph

Author

Any idea how I can get bic to become bics then?

Member

This and the change in codegen are not quite the right way to accomplish reusing the flags. Instead you should take a look at Lowering::TryLowerConditionToFlagsNode, specifically the part that checks SupportsSettingZeroFlags, and try to update that logic to handle these cases as well.

Author

Thank you Jakob! This was a great hint and I figured out a generic solution.

@jakobbotsch (Member)

It looks like this is a regression overall: https://dev.azure.com/dnceng-public/public/_build/results?buildId=726440&view=ms.vss-build-web.run-extensions-tab
For example:
[image]

Can you look into why that is and see if you can fix them?

@quantumhu (Author)

quantumhu commented Jul 3, 2024

It looks like this is a regression overall: https://dev.azure.com/dnceng-public/public/_build/results?buildId=726440&view=ms.vss-build-web.run-extensions-tab For example: [image]

Can you look into why that is and see if you can fix them?

I have looked at the diffs and I am trying a fix with my new commit. May I ask how you identified there were problems? The job you referenced ran successfully, so I assumed nothing was wrong. Does the presence of artifacts suggest something is wrong? Thanks!

@jakobbotsch (Member)

May I ask how you identified there were problems? The job you referenced ran successfully, so I assumed nothing was wrong. Does the presence of artifacts suggest something is wrong? Thanks!

Yes, the superpmi-diffs job runs the compiler over a large number of methods and produces codegen diffs that can be used to evaluate changes in the JIT. The diffs are available under the "Extensions" page when you access the pipeline run on Azure Pipelines. Your new change looks better diffs-wise, but it still seems to regress some common cases, so you may want to take another look at some of the regressions.

@quantumhu (Author)

Hi Jakob, thank you for all your help so far; it has been of great assistance in working on this issue.

I have a few problems that I can't seem to fix, I'm hoping you can provide some insight:

For x64:

The expected codegen based on the raised issue was that the instruction sequence and, mov, cmp would be turned into not + test. I found that the same changes to produce not + test will instead produce mov + andn on machines that support the BMI1 instruction set.

One regression was this:

        movzx    rdx, byte  ptr [rax+0x18]
-       and      edx, 7
-       cmp      edx, 7
+       mov      r8d, 7
+       andn     edx, edx, r8d
        jne      SHORT G_M31552_IG05

Below is the example in the original linked issue.

       and      edx, 0xC000000
       cmp      edx, 0xC000000
       jne      SHORT G_M37282_IG04

This required a mov for the constant hex value 0xC000000. Why was a mov necessary, unlike the value 7 above, which was used directly? The value 0xC000000 is 28 bits long, and as far as I know, the and instruction on x64 can encode a 32-bit immediate.

Then for x86:

When and + cmp is converted to mov + andn, there was an additional push/pop of a register in the generated function, because and with a constant does not require an extra register, whereas andn requires register operands, and we happened to use a callee-saved register. Is there an existing heuristic I can check to prevent the conversion to andn when the cost is not worth it?

Then for ARM64:

             ; byrRegs +[x0]
             ldr     w0, [x0]
             ; byrRegs -[x0]
-            and     w0, w0, #0xD1FFAB1E
-            cmp     w0, #48, LSL #12
+            mov     w1, #0xD1FFAB1E
+            bic     w0, w1, w0
+            cmp     w0, #0
             cset    x0, eq

I found this peculiar example. My code change makes sure that the constant used in the and is the same as the constant used in cmp. However, if I understand this snippet correctly, the values are different.

48 logically shifted left by 12 is 0x30000,
which is totally different from 0xD1FFAB1E.

Am I interpreting the instruction format wrong?

Thank you very much in advance!

@quantumhu (Author)

Does anyone know how this encoding of and in ARM64 is possible?

             ; byrRegs +[x0]
             ldr     w0, [x0]
             ; byrRegs -[x0]
-            and     w0, w0, #0xD1FFAB1E
-            cmp     w0, #48, LSL #12
+            mov     w1, #0xD1FFAB1E
+            bic     w0, w1, w0
+            cmp     w0, #0
             cset    x0, eq

If I'm not mistaken, 0xD1FFAB1E cannot be represented as a bitmask immediate.

@jakobbotsch (Member)

jakobbotsch commented Jul 11, 2024

@quantumhu 0xD1FFAB1E ("diffable") is simply a placeholder string that the JIT substitutes for the real immediate when it is asked to generate diffable codegen. The real immediate is something else -- you can invoke the JIT locally to see what it is.

For some information on how to run asmdiffs locally see https://github.com/dotnet/runtime/blob/main/src/coreclr/scripts/superpmi.md. Note that superpmi.py asmdiffs will always set DOTNET_JitDisasmDiffable=1 -- to avoid it being set you have to invoke the superpmi executable yourself. You can use the -c argument to superpmi to specify a particular index of a method to run the JIT over, and set the environment variables (like DOTNET_JitDisasm=* or DOTNET_JitDump=*) before doing so.

For example, in your case it will end up being something like:

C:\dev\dotnet\runtime\artifacts\tests\coreclr\windows.x64.Checked\Tests\Core_Root\superpmi.exe C:\dev\dotnet\runtime\artifacts\tests\coreclr\windows.x64.Checked\Tests\Core_Root\clrjit.dll C:\dev\dotnet\spmi\mch\e6c7a441-8d3c-4135-abc2-bc4dd7c6a4d7.windows.x64\coreclr_tests.run.windows.x64.checked.mch -c 580219

with adjusted paths depending on Windows/Linux and location on the file system. Also the context index may be different when you download the newest coreclr_tests.run collection locally -- you will be able to get the right index after you run superpmi.py asmdiffs locally and find the diff in the summary markdown file it outputs.

Note that to run cross diffs (say you are on x64 and want to run win-arm64 diffs) you pass something like superpmi.py asmdiffs -target_arch arm64 -target_os win.

Comment on lines 4320 to 4327
// The exception to this transformation is if relopOp1 can set flags and we expect that TryLowerConditionToFlagsNode
// will later transform the tree to reuse the zero flag; in that case we can avoid changing to compare + branch here.
// TryLowerConditionToFlagsNode only transforms when optimizations are enabled, so don't bother avoiding this change when optimizations are off
if (!comp->opts.OptimizationEnabled() || !relopOp1->SupportsSettingZeroFlag() || !IsInvariantInRange(relopOp1, jtrue))
{
newOper = GT_JCMP;
cc = GenCondition::FromRelop(cond);
}
@jakobbotsch (Member) commented Jul 11, 2024

Can you give an example of the cases this improves? Generally I would not expect this to be helpful -- IIUC it will just result in changing something like and x, 123; cbz x, <target> to ands x, 123; beq <target>.

Author

yes, it is generally only and/bic, cbz -> ands/bics, beq

I am expecting that doing a compare after a result that already sets the zero flag is extra work, so ands/bics + branch should be faster. Do I have a mistaken understanding?

Member

The way to figure that out is to measure it on some hardware. It's hard to say what the impact of something like that may be. I would expect both patterns to be handled equally efficiently by modern ARM64 hardware.

Author

Good idea, I will do that, thank you.

I have a couple of other questions:

  1. on x64, I tried to turn and + cmp into not + test instead of andn, but I found that the data (already in a register) that will be not'd is first copied to a scratch register. I think this defeats the purpose of the not + test structure. If the register could be reused, it would be better to keep andn.

  2. on arm64, it similarly seems more efficient (in my mind) to keep and + cmp instead of bics when the constant value can be encoded directly in the and instruction, because bics requires register operands, so getting the constant into a register requires a mov.

For both of these, would your recommendation be to also test on hardware? I don't have much variety in testing hardware, so I'm not sure I can extrapolate that data to all cases.

@JulieLeeMSFT (Member)

This is an optimization, not a correctness issue. We don't have time to work on this for .NET 9, so we will review it in .NET 10.

@JulieLeeMSFT JulieLeeMSFT added this to the 10.0.0 milestone Aug 12, 2024
@EgorBo (Member)

EgorBo commented Sep 9, 2024

@quantumhu could you please rebase (merge main) into your PR?

@JulieLeeMSFT JulieLeeMSFT added the needs-author-action An issue or pull request that requires more info or actions from the author. label Sep 9, 2024
@dotnet-policy-service dotnet-policy-service bot removed the needs-author-action An issue or pull request that requires more info or actions from the author. label Sep 10, 2024
@quantumhu (Author)

@EgorBo done!

@EgorBo (Member)

EgorBo commented Oct 6, 2024

@EgorBo done!

Sorry for the delayed response, we've been a bit busy lately. From a quick look at the SPMI diffs it seems like this is a size regression across most collections; is that expected?

A minimal example seems to be:

static void Foo(int a)
{
    if ((a & 7) == 7)
        Console.WriteLine();
}
; Method Benchmarks:Foo(int) (FullOpts)
-      and      ecx, 7
-      cmp      ecx, 7
+      mov      eax, 7
+      andn     eax, ecx, eax
       je       SHORT G_M35312_IG04
       ret      
       tail.jmp [System.Console:WriteLine()]
-; Total bytes of code: 15
+; Total bytes of code: 19

It looks like Lower (or maybe the emitter?) is a better place for this peephole than Morph

@quantumhu (Author)

From a quick look at the SPMI diffs it seems like this is a size regression across most collections; is that expected?

Is it possible to know during the lowering / emit stage whether or not a constant is being loaded to a register?


On a side note, I am unable to use DOTNET_JitDisasm since rebasing. Any quick things I can try to do to fix it?

My workflow looks like this:

./build.sh --subset clr+libs --ninja --cmakeargs "-DCLR_CMAKE_APPLE_DSYM=TRUE" 

export CORE_LIBRARIES=/path/to/artifacts/bin
export DOTNET_JitDisasm=Foo
cd artifacts/bin/coreclr/osx.arm64.Debug
/path/to/artifacts/bin/coreclr/osx.arm64.Debug/corerun <absolute path to dll>

I know this was working before because this was the setup I was using to test.

@EgorBo (Member)

EgorBo commented Oct 7, 2024

On a side note, I am unable to use DOTNET_JitDisasm since rebasing. Any quick things I can try to do to fix it?

You may try also setting DOTNET_ReadyToRun=0, just in case. Also, make sure you don't use sudo for the actual command, because it won't pick up your exported JitDisasm.

Is it possible to know during the lowering / emit stage whether or not a constant is being loaded to a register?

In Lower you can rely on the IsContained property. In the emitter you typically have all the information, including which registers are used, etc.

@quantumhu (Author)

Unfortunately the ReadyToRun setting didn't help. I assume a PR isn't the best place to get help; where can I get more help with the disasm problem? Thanks!

@EgorBo (Member)

EgorBo commented Oct 7, 2024

Also, better to put the NoInlining attribute on your method.

@quantumhu (Author)

I already had the NoInlining attribute; here's what the code looks like.
The code definitely runs because I see the output.

using System.Runtime.CompilerServices;

namespace JitTesting;

class Program
{
    [MethodImpl(MethodImplOptions.NoInlining)] 
    static void Foo(int x)
    {
        if ((x & 0xC000000) == 0xC000000)
            Console.WriteLine("hit");
    }

    static void TestWrapper()
    {
        for (int i = 0x0; i <= 0xF000000; i += 0x1000000)
        {
            Console.WriteLine("Input: {0:X}", i);
            Foo(i);
        }
    }

    static void Main(string[] args)
    {
        Console.WriteLine("Hello, World!");
        TestWrapper();
    }
}

@quantumhu (Author)

Never mind, it looks like it's working now! I think my previously failed attempt at installing the newest RC of .NET 9.0 was interfering with the existing .NET 8.0 on my computer. Thanks for your help!

@quantumhu (Author)

Hi @EgorBo, I tried taking a look at the Lowering code, and I identified this as a good spot to put the peephole.

https://github.com/dotnet/runtime/blob/main/src/coreclr/jit/lower.cpp#L4217

However, further in, there's a note here:
https://github.com/dotnet/runtime/blob/main/src/coreclr/jit/lower.cpp#L4249

that this optimization cannot be applied if the AND is under a JTRUE node or the AND is involved in a conditional. Both of these exclusions are pertinent to the code example we are trying to apply the optimization to:

if ((x & 0xC000000) == 0xC000000)
    Console.WriteLine("hit");

Any suggestions on how I can proceed?

@JulieLeeMSFT (Member)

@EgorBo, please follow up on this community PR.

@EgorBo (Member)

EgorBo commented Dec 1, 2024

I think something like this might work:

From aeb873b824db3dc4f2bce8c7209fc7fde70b4b88 Mon Sep 17 00:00:00 2001
From: EgorBo <egorbo@gmail.com>
Date: Sun, 1 Dec 2024 21:59:49 +0100
Subject: [PATCH] Test

---
 src/coreclr/jit/lower.cpp | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/src/coreclr/jit/lower.cpp b/src/coreclr/jit/lower.cpp
index ce829a0730d..37339b3e251 100644
--- a/src/coreclr/jit/lower.cpp
+++ b/src/coreclr/jit/lower.cpp
@@ -4288,6 +4288,15 @@ GenTree* Lowering::OptimizeConstCompare(GenTree* cmp)
             }
 #endif
         }
+        else if ((andOp2->IsIntegralConst()) && GenTree::Compare(andOp2, op2))
+        {
+            GenTree* notNode = comp->gtNewOperNode(GT_NOT, andOp1->TypeGet(), andOp1);
+            cmp->gtGetOp1()->AsOp()->gtOp1 = notNode;
+            BlockRange().InsertAfter(andOp1, notNode);
+            op2->BashToZeroConst(op2->TypeGet());
+            andOp1   = notNode;
+            op2Value = 0;
+        }
     }
 
 #ifdef TARGET_XARCH
-- 
2.45.2.windows.1

It looks like this transformation definitely shouldn't be done in morph, and the current asm diffs in this PR don't look good.

@JulieLeeMSFT JulieLeeMSFT added the needs-author-action An issue or pull request that requires more info or actions from the author. label Jan 6, 2025
@JulieLeeMSFT (Member)

@quantumhu, please follow up with the recommendation from EgorBo, and share a new diff result with us.

@quantumhu (Author)

Hi @JulieLeeMSFT, sorry for the late reply. I have been tied up with life commitments as of late, and won't be able to dedicate further time to this.

@dotnet-policy-service dotnet-policy-service bot removed the needs-author-action An issue or pull request that requires more info or actions from the author. label Jan 6, 2025
@jakobbotsch (Member)

Thanks for trying @quantumhu! Feel free to reopen if you get some more time on your hands.

@github-actions github-actions bot locked and limited conversation to collaborators Feb 27, 2025
Successfully merging this pull request may close these issues.

JIT: Optimize "x & cns == cns" pattern