Optimize 'x & cns == cns' pattern #103868


Closed · wanted to merge 9 commits

Conversation

@quantumhu quantumhu commented Jun 23, 2024

Resolves #101000

Code emitted looks like this:

ARM/ARM64

mov     w1, #0xC000000
bics    w0, w1, w0
bne     G_M12719_IG05

x64

mov      eax, 0xC000000
andn     edi, edi, eax
jne      SHORT G_M12719_IG05

x64 without BMI support

not      edi
test     edi, 0xC000000
jne      SHORT G_M12719_IG05

@ghost ghost added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Jun 23, 2024
@dotnet-policy-service dotnet-policy-service bot added the community-contribution Indicates that the PR has been added by a community member label Jun 23, 2024
@quantumhu (Author)

@dotnet-policy-service agree


ssize_t cnsVal = op2->AsIntCon()->IconValue();

if (andOp1->TypeIs(TYP_INT, TYP_LONG) && andOp2->IsIntegralConst() && andOp2->AsIntCon()->IconValue() == cnsVal)
Member

IsIntegralConst should be replaced with IsIntCns or whatever it's called, I forgot. Otherwise AsIntCon() might fail for what is actually AsLngCon

Member

Also, you can just do GenTree::Compare(op2, andOp2)

GenTree* tmpNode = andOp2;
op1->AsOp()->gtOp2 = gtNewOperNode(GT_NOT, andOp1->TypeGet(), andOp1);
op1->AsOp()->gtOp1 = tmpNode;
op2->AsIntConCommon()->SetIconValue(0);
Member

this should be op2->gtBashToCns

Author

Could you point me to where this function is located? I can only find a gtBashToNOP

Author

Found a gtBashToZeroConst function, so I'm using that instead

op2->AsIntConCommon()->SetIconValue(0);
#if defined(TARGET_ARM) || defined(TARGET_ARM64)
// If set for XARCH it will prevent ANDN instruction from being emitted
op1->gtFlags |= GTF_SET_FLAGS;
Member

Afair, GTF_SET_FLAGS is a lowering concept and shouldn't be used in morph

Author

Any idea how I can get bic to become bics then?

Member

This and the change in codegen are not quite the right way to accomplish reusing the flags. Instead you should take a look at Lowering::TryLowerConditionToFlagsNode, specifically the part that checks SupportsSettingZeroFlags, and try to update that logic to handle these cases as well.

Author

Thank you Jakob! This was a great hint and I figured out a generic solution.

@jakobbotsch (Member)

It looks like this is a regression overall: https://dev.azure.com/dnceng-public/public/_build/results?buildId=726440&view=ms.vss-build-web.run-extensions-tab
For example:
[image]

Can you look into why that is and see if you can fix them?

@quantumhu (Author)

quantumhu commented Jul 3, 2024

It looks like this is a regression overall: https://dev.azure.com/dnceng-public/public/_build/results?buildId=726440&view=ms.vss-build-web.run-extensions-tab For example: [image]

Can you look into why that is and see if you can fix them?

I have looked at the diffs and I am trying a fix with my new commit. May I ask how you identified there were problems? The job you referenced ran successfully, so I assumed nothing was wrong. Does the presence of artifacts suggest something is wrong? Thanks!

@jakobbotsch (Member)

May I ask how you identified there were problems? The job you referenced ran successfully, so I assumed nothing was wrong. Does the presence of artifacts suggest something is wrong? Thanks!

Yes, the superpmi-diffs job runs the compiler over a large number of methods and produces codegen diffs that can be used to evaluate changes in the JIT. The diffs are available under the "Extensions" page when you access the pipeline run on Azure Pipelines. Your new change looks better diffs-wise, but it still seems to regress some common cases, so you may want to take another look at some of the regressions.

@quantumhu (Author)

Hi Jakob, thank you for all your help so far; it has been of great assistance in working on this issue.

I have a few problems that I can't seem to fix, I'm hoping you can provide some insight:

For x64:

The expected codegen based on the raised issue was that the instruction sequence and, mov, cmp would be turned into not + test. I found that the same changes to produce not + test will instead produce mov + andn on machines that support the BMI1 instruction set.

One regression was this:

        movzx    rdx, byte  ptr [rax+0x18]
-       and      edx, 7
-       cmp      edx, 7
+       mov      r8d, 7
+       andn     edx, edx, r8d
        jne      SHORT G_M31552_IG05

Below is the example in the original linked issue.

       and      edx, 0xC000000
       cmp      edx, 0xC000000
       jne      SHORT G_M37282_IG04

This required a mov for the constant hex value 0xC000000. Why was a mov necessary, unlike the value 7 above, which was used directly? The value 0xC000000 is 28 bits long, and as far as I know, the and instruction on x64 can encode a 32-bit immediate.

Then for x86:

When and + cmp is converted to mov + andn, there was an additional push/pop of a register in the generated function, because and with a constant does not require an extra register, whereas andn requires register operands, and we happened to use a callee-saved register. Is there an existing heuristic I can check to prevent the conversion to andn when the cost is not worth it?

Then for ARM64:

             ; byrRegs +[x0]
             ldr     w0, [x0]
             ; byrRegs -[x0]
-            and     w0, w0, #0xD1FFAB1E
-            cmp     w0, #48, LSL #12
+            mov     w1, #0xD1FFAB1E
+            bic     w0, w1, w0
+            cmp     w0, #0
             cset    x0, eq

I found this peculiar example. My code change makes sure that the constant used in the and is the same as the constant used in cmp. However, if I understand this snippet correctly, the values are different.

48 logically shifted left by 12 is 0x30000,
which is totally different from 0xD1FFAB1E.

Am I interpreting the instruction format wrong?

Thank you very much in advance!

@quantumhu (Author)

Does anyone know how this encoding of and in ARM64 is possible?

             ; byrRegs +[x0]
             ldr     w0, [x0]
             ; byrRegs -[x0]
-            and     w0, w0, #0xD1FFAB1E
-            cmp     w0, #48, LSL #12
+            mov     w1, #0xD1FFAB1E
+            bic     w0, w1, w0
+            cmp     w0, #0
             cset    x0, eq

If I'm not mistaken, 0xD1FFAB1E cannot be represented as a bitmask immediate.

@jakobbotsch (Member)

jakobbotsch commented Jul 11, 2024

@quantumhu 0xD1FFAB1E ("diffable") is simply a placeholder string that the JIT substitutes for the real immediate when it is asked to generate diffable codegen. The real immediate is something else -- you can invoke the JIT locally to see what it is.

For some information on how to run asmdiffs locally see https://github.com/dotnet/runtime/blob/main/src/coreclr/scripts/superpmi.md. Note that superpmi.py asmdiffs will always set DOTNET_JitDisasmDiffable=1 -- to avoid it being set you have to invoke the superpmi executable yourself. You can use the -c argument to superpmi to specify a particular index of a method to run the JIT over, and set the environment variables (like DOTNET_JitDisasm=* or DOTNET_JitDump=*) before doing so.

For example, in your case it will end up being something like:

C:\dev\dotnet\runtime\artifacts\tests\coreclr\windows.x64.Checked\Tests\Core_Root\superpmi.exe C:\dev\dotnet\runtime\artifacts\tests\coreclr\windows.x64.Checked\Tests\Core_Root\clrjit.dll C:\dev\dotnet\spmi\mch\e6c7a441-8d3c-4135-abc2-bc4dd7c6a4d7.windows.x64\coreclr_tests.run.windows.x64.checked.mch -c 580219

with adjusted paths depending on Windows/Linux and location on the file system. Also the context index may be different when you download the newest coreclr_tests.run collection locally -- you will be able to get the right index after you run superpmi.py asmdiffs locally and find the diff in the summary markdown file it outputs.

Note that to run cross diffs (say you are on x64 and want to run win-arm64 diffs) you pass something like superpmi.py asmdiffs -target_arch arm64 -target_os win.

Comment on lines 4320 to 4327
// The exception to this transformation is if relopOp1 can set flags and we expect that TryLowerConditionToFlagsNode
// will later transform the tree to reuse the zero flag; in that case we can avoid changing to compare + branch here.
// TryLowerConditionToFlagsNode only transforms when optimizations are enabled, so don't bother avoiding this change when optimizations are off
if (!comp->opts.OptimizationEnabled() || !relopOp1->SupportsSettingZeroFlag() || !IsInvariantInRange(relopOp1, jtrue))
{
newOper = GT_JCMP;
cc = GenCondition::FromRelop(cond);
}
@jakobbotsch (Member) commented Jul 11, 2024

Can you give an example of the cases this improves? Generally I would not expect this to be helpful -- IIUC it will just result in changing something like and x, 123; cbz x, <target> to ands x, 123; beq <target>.

Author

yes, it is generally only and/bic, cbz -> ands/bics, beq

I am expecting that doing a compare after a result that already sets the zero flag is extra work, so ands/bics + branch should be faster. Do I have a mistaken understanding?

Member

The way to figure that out is to measure it on some hardware. It's hard to say what the impact of something like that may be. I would expect both patterns to be handled equally efficiently by modern ARM64 hardware.

Author

Good idea, I will do that, thank you.

I have a couple of other questions:

  1. on x64, I tried to turn and + cmp into not + test instead of andn, but I found that the data (already in a register) that will be not'd is first copied to a scratch register. I think this defeats the purpose of the not + test structure. If the register could be reused, it would be better to keep andn.

  2. on arm64, it similarly seems more efficient (in my mind) to keep and + cmp instead of bics when the constant value can be encoded directly in the and instruction, because bics requires register operands, so getting the constant into a register requires a mov.

For both of these, would your recommendation be to also test on hardware? I don't have much variety in testing hardware, so I'm not sure I can extrapolate that data to all cases.

@JulieLeeMSFT (Member)

This is an optimization, not a correctness issue. We don't have time to work on this for .NET 9, so we will review it in .NET 10.

@JulieLeeMSFT JulieLeeMSFT added this to the 10.0.0 milestone Aug 12, 2024
@EgorBo (Member)

EgorBo commented Sep 9, 2024

@quantumhu could you please rebase (merge main) into your PR?

@JulieLeeMSFT JulieLeeMSFT added the needs-author-action An issue or pull request that requires more info or actions from the author. label Sep 9, 2024
@dotnet-policy-service dotnet-policy-service bot removed the needs-author-action An issue or pull request that requires more info or actions from the author. label Sep 10, 2024
@quantumhu (Author)

@EgorBo done!

@EgorBo (Member)

EgorBo commented Oct 6, 2024

@EgorBo done!

Sorry for the delayed response, we've been a bit busy lately. From a quick look at the SPMI diffs it seems like this is a size regression across most collections; is that expected?

A minimal example seems to be:

static void Foo(int a)
{
    if ((a & 7) == 7)
        Console.WriteLine();
}
; Method Benchmarks:Foo(int) (FullOpts)
-      and      ecx, 7
-      cmp      ecx, 7
+      mov      eax, 7
+      andn     eax, ecx, eax
       je       SHORT G_M35312_IG04
       ret      
       tail.jmp [System.Console:WriteLine()]
-; Total bytes of code: 15
+; Total bytes of code: 19

It looks like Lower (or maybe the emitter?) is a better place for this peephole than Morph

@quantumhu (Author)

From a quick look at the SPMI diffs it seems like this is a size regression across most collections; is that expected?

Is it possible to know during the lowering / emit stage whether or not a constant is being loaded to a register?


On a side note, I am unable to use DOTNET_JitDisasm since rebasing. Any quick things I can try to do to fix it?

My workflow looks like this:

./build.sh --subset clr+libs --ninja --cmakeargs "-DCLR_CMAKE_APPLE_DSYM=TRUE" 

export CORE_LIBRARIES=/path/to/artifacts/bin
export DOTNET_JitDisasm=Foo
cd artifacts/bin/coreclr/osx.arm64.Debug
/path/to/artifacts/bin/coreclr/osx.arm64.Debug/corerun <absolute path to dll>

I know this was working before because this was the setup I was using to test.

@EgorBo (Member)

EgorBo commented Oct 7, 2024

On a side note, I am unable to use DOTNET_JitDisasm since rebasing. Any quick things I can try to do to fix it?

You may try also setting DOTNET_ReadyToRun=0, just in case. Also, make sure you don't use sudo for the actual command, because it won't pick up your exported JitDisasm.

Is it possible to know during the lowering / emit stage whether or not a constant is being loaded to a register?

In Lower you can rely on the IsContained property. In the emitter you typically have all the information, including which registers are used, etc.

@quantumhu (Author)

Unfortunately the ReadyToRun setting didn't help. I assume a PR isn't the best place to get help; where can I get more help with the disasm problem? Thanks!

@EgorBo (Member)

EgorBo commented Oct 7, 2024

Also, better to put the NoInlining attribute on your method.

@quantumhu (Author)

I already had the NoInlining attribute; here's what the code looks like.
The code definitely runs because I see the output.

using System.Runtime.CompilerServices;

namespace JitTesting;

class Program
{
    [MethodImpl(MethodImplOptions.NoInlining)] 
    static void Foo(int x)
    {
        if ((x & 0xC000000) == 0xC000000)
            Console.WriteLine("hit");
    }

    static void TestWrapper()
    {
        for (int i = 0x0; i <= 0xF000000; i += 0x1000000)
        {
            Console.WriteLine("Input: {0:X}", i);
            Foo(i);
        }
    }

    static void Main(string[] args)
    {
        Console.WriteLine("Hello, World!");
        TestWrapper();
    }
}

@quantumhu (Author)

Never mind, it looks like it's working now! I think my previously failed attempt at installing the newest RC of .NET 9.0 was interfering with the existing .NET 8.0 on my computer. Thanks for your help!

@quantumhu (Author)

Hi @EgorBo, I tried taking a look at the Lowering code, and I identified this as a good spot to put the peephole.

https://github.com/dotnet/runtime/blob/main/src/coreclr/jit/lower.cpp#L4217

However, further in, there's a note here:
https://github.com/dotnet/runtime/blob/main/src/coreclr/jit/lower.cpp#L4249

that this optimization cannot be applied if the AND is under a JTRUE node or the AND is involved in a conditional. Both of these exclusions are pertinent to the code example we are trying to apply the optimization to:

if ((x & 0xC000000) == 0xC000000)
    Console.WriteLine("hit");

Any suggestions on how I can proceed?

@JulieLeeMSFT (Member)

@EgorBo, please follow up on this community PR.

@EgorBo (Member)

EgorBo commented Dec 1, 2024

I think something like this might work:

From aeb873b824db3dc4f2bce8c7209fc7fde70b4b88 Mon Sep 17 00:00:00 2001
From: EgorBo <egorbo@gmail.com>
Date: Sun, 1 Dec 2024 21:59:49 +0100
Subject: [PATCH] Test

---
 src/coreclr/jit/lower.cpp | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/src/coreclr/jit/lower.cpp b/src/coreclr/jit/lower.cpp
index ce829a0730d..37339b3e251 100644
--- a/src/coreclr/jit/lower.cpp
+++ b/src/coreclr/jit/lower.cpp
@@ -4288,6 +4288,15 @@ GenTree* Lowering::OptimizeConstCompare(GenTree* cmp)
             }
 #endif
         }
+        else if ((andOp2->IsIntegralConst()) && GenTree::Compare(andOp2, op2))
+        {
+            GenTree* notNode = comp->gtNewOperNode(GT_NOT, andOp1->TypeGet(), andOp1);
+            cmp->gtGetOp1()->AsOp()->gtOp1 = notNode;
+            BlockRange().InsertAfter(andOp1, notNode);
+            op2->BashToZeroConst(op2->TypeGet());
+            andOp1   = notNode;
+            op2Value = 0;
+        }
     }
 
 #ifdef TARGET_XARCH
-- 
2.45.2.windows.1

It looks like this transformation definitely shouldn't be done in morph, and the current asm diffs in this PR don't look good.

@JulieLeeMSFT JulieLeeMSFT added the needs-author-action An issue or pull request that requires more info or actions from the author. label Jan 6, 2025
@JulieLeeMSFT (Member)

@quantumhu, please follow up with the recommendation from EgorBo, and share a new diff result with us.

@quantumhu (Author)

Hi @JulieLeeMSFT, sorry for the late reply. I have been tied up with life commitments as of late, and won't be able to dedicate further time to this.

@dotnet-policy-service dotnet-policy-service bot removed the needs-author-action An issue or pull request that requires more info or actions from the author. label Jan 6, 2025
@jakobbotsch (Member)

Thanks for trying @quantumhu! Feel free to reopen if you get some more time on your hands.

@github-actions github-actions bot locked and limited conversation to collaborators Feb 27, 2025
Successfully merging this pull request may close these issues.

JIT: Optimize "x & cns == cns" pattern