Skip to content

ARM64 - Optimizing a % b operations part 2 #66407

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 68 commits into from
Apr 13, 2022
Merged

Conversation

TIHan
Copy link
Contributor

@TIHan TIHan commented Mar 9, 2022

Addressing part of this issue: #34937

Description

There are various ways to optimize % for integers on ARM64.

a % b can be transformed into a specific sequence of instructions for ARM64 if the operation is signed and b is a constant with the power of 2.

Acceptance Criteria

ARM64 diffs based on tests

-            asr     w1, w0, #31
-            and     w1, w1, #15
-            add     w1, w1, w0
-            asr     w1, w1, #4
-            lsl     w1, w1, #4
-            sub     w0, w0, w1
-                        ;; bbWeight=1    PerfScore 4.50
+            and     w1, w0, #15
+            negs    w0, w0
+            and     w0, w0, #15
+            csneg   w0, w1, w0, mi
+                        ;; bbWeight=1    PerfScore 2.00

@ghost ghost added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Mar 9, 2022
@ghost ghost assigned TIHan Mar 9, 2022
@ghost
Copy link

ghost commented Mar 9, 2022

Tagging subscribers to this area: @JulieLeeMSFT
See info in area-owners.md if you want to be subscribed.

Issue Details

Addressing part of this issue: #34937

Description

There are various ways to optimize % for integers on ARM64.

a % b can be transformed into a specific sequence of instructions for ARM64 if the operation is signed and b is a constant with the power of 2.

        negs   x1, x0
        and    x0, x0, #(n - 1)
        and    x1, x1, #(n - 1)
        csneg  x0, x0, x1, mi

Acceptance Criteria

Author: TIHan
Assignees: TIHan
Labels:

area-CodeGen-coreclr

Milestone: -

Copy link
Member

@tannergooding tannergooding left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes look generally good. I left a couple comments around the possibility of using existing helpers rather than directly querying gtFlags

@TIHan
Copy link
Contributor Author

TIHan commented Apr 11, 2022

@kunalspathak @EgorBo this is ready. CI is failing due to unrelated reasons it looks like.

I think @AndyAyersMS and I reviewed this together a few weeks back as well.

Copy link
Member

@kunalspathak kunalspathak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes looks good, just a comment.

Also, can you check why there are regressions in crossgen2 and other collections?

Top method regressions (bytes):
          24 (1.69 % of base) : 85784.dasm - System.Collections.Generic.HashSet`1:SymmetricExceptWithEnumerable(System.Collections.Generic.IEnumerable`1[System.__Canon]):this
          16 (0.60 % of base) : 107306.dasm - System.Text.GB18030Encoding:GetChars(long,int,long,int,System.Text.DecoderNLS):int:this
          12 (0.94 % of base) : 85783.dasm - System.Collections.Generic.HashSet`1:CheckUniqueAndUnfoundElements(System.Collections.Generic.IEnumerable`1[System.__Canon],bool):System.ValueTuple`2[System.Int32, System.Int32]:this
          12 (1.94 % of base) : 146511.dasm - System.Formats.Asn1.AsnWriter:WriteNamedBitList(System.Nullable`1[System.Formats.Asn1.Asn1Tag],long):this
           8 (0.98 % of base) : 5636.dasm - Microsoft.CodeAnalysis.PEModule:IsNoPiaLocalType(System.Reflection.Metadata.TypeDefinitionHandle,byref):bool:this
           8 (7.69 % of base) : 4080.dasm - Microsoft.CodeAnalysis.XmlCharType:SplitSurrogateChar(int,byref,byref)
           8 (0.75 % of base) : 85786.dasm - System.Collections.Generic.HashSet`1:IntersectWithEnumerable(System.Collections.Generic.IEnumerable`1[System.__Canon]):this
           8 (2.47 % of base) : 59774.dasm - System.Data.RBTree`1:FreeNode(int):this
           8 (2.53 % of base) : 62804.dasm - System.Data.RBTree`1:FreeNode(int):this
           8 (0.99 % of base) : 97269.dasm - System.Globalization.HebrewNumber:Append(System.Text.StringBuilder,int)
           8 (0.73 % of base) : 138410.dasm - System.Net.NetworkInformation.PhysicalAddress:TryParse(System.ReadOnlySpan`1[System.Char],byref):bool
           8 (3.92 % of base) : 98078.dasm - System.Net.WebUtility:ConvertSmpToUtf16(int,byref,byref)
           8 (7.69 % of base) : 120074.dasm - System.Xml.XmlCharType:SplitSurrogateChar(int,byref,byref)
           4 (1.59 % of base) : 5634.dasm - Microsoft.CodeAnalysis.PEModule:RecordNoPiaLocalTypeCheck(System.Reflection.Metadata.TypeDefinitionHandle):this
           4 (3.12 % of base) : 85613.dasm - System.Collections.Generic.BitHelper:IsMarked(int):bool:this
           4 (3.23 % of base) : 85614.dasm - System.Collections.Generic.BitHelper:MarkBit(int):this
           4 (1.43 % of base) : 96963.dasm - System.Globalization.OrdinalCasing:InitCasingTable():System.UInt16[][]
           4 (0.83 % of base) : 132782.dasm - System.Net.FtpControlStream:FormatAddress(System.Net.IPAddress,int):System.String
           4 (0.83 % of base) : 138407.dasm - System.Net.NetworkInformation.UnicastIPAddressInformation:PrefixLengthToSubnetMask(ubyte,int):System.Net.IPAddress

@@ -1707,7 +1796,7 @@ void Lowering::ContainCheckMul(GenTreeOp* node)
//
void Lowering::ContainCheckDivOrMod(GenTreeOp* node)
{
assert(node->OperIs(GT_DIV, GT_UDIV));
assert(node->OperIs(GT_DIV, GT_UDIV, GT_MOD));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not also GT_UMOD?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This optimization doesn't impact GT_UMOD, only GT_MOD so that's why the additional GT_MOD check is there.

@ghost ghost added needs-author-action An issue or pull request that requires more info or actions from the author. and removed needs-author-action An issue or pull request that requires more info or actions from the author. labels Apr 11, 2022
@TIHan
Copy link
Contributor Author

TIHan commented Apr 11, 2022

I will look into the regressions that are there to see what was impacted.

@TIHan
Copy link
Contributor Author

TIHan commented Apr 11, 2022

@kunalspathak , yes it looks like there is a bit of a regression here:
regress

@kunalspathak
Copy link
Member

@kunalspathak , yes it looks like there is a bit of a regression here: ![regress](https://user-

Thanks for confirming offline that these are happening for scenarios where there is both a / b and a % b and earlier, we would CSE a / b operation. I think the way you have it in this PR should be fine for now and we will monitor of the method's code size regression results in any perf diff.

Copy link
Member

@kunalspathak kunalspathak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@TIHan
Copy link
Contributor Author

TIHan commented Apr 13, 2022

Yea, the regression only occurs in CSE for 'a / b' operations. If we believe the benefits of this optimization outweighs that of CSE for 'a / b', then we should do it.

I'm not sure the best way to resolve it either; I thought about matching 'a - (a / cns) * cns' and turn it into this specific optimization, but it gets more complicated as 'multiply' can be optimized to a LSH operation, 'a - ((a / cns) << shift' if the 'cns' is the power of 2.

As a test, I tried turning 'a - ((a / cns) << shift' back into a mod op so it would get picked up by the lowering optimization:

                if (tree->TypeIs(TYP_INT, TYP_LONG) && !tree->IsUnsigned() && op1->OperIs(GT_LCL_VAR))
                {
                    GenTree* mul = op2;
                    if (mul->OperIs(GT_LSH) && !gtIsActiveCSE_Candidate(mul))
                    {
                        GenTree* div  = mul->gtGetOp1();
                        GenTree* cns1 = mul->gtGetOp2();
                        if (div->OperIs(GT_DIV) && !gtIsActiveCSE_Candidate(div) && cns1->IsIntegralConst())
                        {
                            GenTree* a    = div->gtGetOp1();
                            GenTree* cns2 = div->gtGetOp2();
                            if (a->OperIs(GT_LCL_VAR) && cns2->IsIntegralConstPow2() &&
                                op1->AsLclVar()->GetLclNum() == a->AsLclVar()->GetLclNum())
                            {
                                size_t cnsValue1 = cns1->AsIntConCommon()->IntegralValue();
                                size_t cnsValue2 = cns2->AsIntConCommon()->IntegralValue();
                                if ((cnsValue2 >> cnsValue1) == 1)
                                {
                                    tree->ChangeOper(GT_MOD);
                                    tree->AsOp()->gtOp2 = cns1;
                                    op2                 = tree->gtGetOp2();
                                }
                            }
                        }
                    }
                }

But, I would like to do this after CSE has occurred; that way if there were CSEs that happened in the expression, it wouldn't match this logic.

The next logically thing to look at is lowering if we could do this, but I think we would have to mark multiple nodes as contained - so it still gets more complicated.

@TIHan
Copy link
Contributor Author

TIHan commented Apr 13, 2022

I made an issue regarding this minor regression: #67983

@AndyAyersMS
Copy link
Member

Possible regression: #68624

@TIHan TIHan mentioned this pull request May 5, 2022
2 tasks
@ghost ghost locked as resolved and limited conversation to collaborators May 28, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants