Skip to content

RyuJIT: Optimize -X and MathF.Abs(X) for floats #42164

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 23 commits into from
Oct 2, 2020

Conversation

EgorBo
Copy link
Member

@EgorBo EgorBo commented Sep 12, 2020

Contributes to #1342

public static float XorTest(float x) => -x;

public static float AbsTest(float x) => MathF.Abs(x);

Current codegen:

; Method Test:XorTest(float):float
       C5F877               vzeroupper 
       C5FA100D0D000000     vmovss   xmm1, dword ptr [reloc @RWD00]
       C5F857C1             vxorps   xmm0, xmm1
       C3                   ret      
RWD00  dd	80000000h
; Total bytes of code: 16


; Method Test:AbsTest(float):float
       C5F877               vzeroupper 
       C5FA100D0D000000     vmovss   xmm1, dword ptr [reloc @RWD00]
       C5F854C1             vandps   xmm0, xmm1
       C3                   ret      
RWD00  dd	7FFFFFFFh
; Total bytes of code: 16

New codegen:

; Method XorTest(float):float
       C5F877               vzeroupper 
       C5F8570505000000     vxorps   xmm0, xmm0, dword ptr [reloc @RWD00]
       C3                   ret      
RWD00  dd	80000000h
; Total bytes of code: 12


; Method AbsTest(float):float
       C5F877               vzeroupper 
       C5F8540505000000     vandps   xmm0, xmm0, dword ptr [reloc @RWD00]
       C3                   ret      
RWD00  dd	7FFFFFFFh
; Total bytes of code: 12

godbolt: https://godbolt.org/z/5K5dGr

jit-diff:

C:\prj>jit-diff diff --output C:\prj\jitdiffs -f --core_root C:\prj\runtime-1\artifacts\tests\coreclr\Windows_NT.x64.Release\Tests\Core_Root --base C:\prj\runtime-1\artifacts\bin\coreclr\Windows_NT.x64.Checked_base --diff C:\prj\runtime-1\artifacts\bin\coreclr\Windows_NT.x64.Checked --pmi
Beginning PMI CodeSize Diffs for System.Private.CoreLib.dll, framework assemblies
- Finished 267/267 Base 267/267 Diff [490.6 sec]
Completed PMI CodeSize Diffs for System.Private.CoreLib.dll, framework assemblies in 490.77s
Diffs (if any) can be viewed by comparing: C:\prj\jitdiffs\dasmset_10\base C:\prj\jitdiffs\dasmset_10\diff
Analyzing CodeSize diffs...
Found 14 files with textual diffs.
PMI CodeSize Diffs for System.Private.CoreLib.dll, framework assemblies for  default jit
Summary of Code Size diffs:
(Lower is better)
Total bytes of diff: -1097 (-0.00% of base)
    diff is an improvement.
Top file improvements (bytes):
        -484 : System.Private.CoreLib.dasm (-0.01% of base)
        -153 : Microsoft.VisualBasic.Core.dasm (-0.03% of base)
        -152 : System.Runtime.Numerics.dasm (-0.21% of base)
         -92 : Microsoft.Diagnostics.Tracing.TraceEvent.dasm (-0.00% of base)
         -84 : System.Drawing.Common.dasm (-0.03% of base)
         -36 : Newtonsoft.Json.dasm (-0.00% of base)
         -28 : System.Data.Common.dasm (-0.00% of base)
         -20 : Microsoft.CodeAnalysis.VisualBasic.dasm (-0.00% of base)
         -16 : System.Private.Xml.dasm (-0.00% of base)
          -8 : FSharp.Core.dasm (-0.00% of base)
          -8 : System.Linq.Expressions.dasm (-0.00% of base)
          -8 : System.Private.DataContractSerialization.dasm (-0.00% of base)
          -4 : Microsoft.CodeAnalysis.CSharp.dasm (-0.00% of base)
          -4 : System.Net.Mail.dasm (-0.00% of base)
14 total files with Code Size differences (14 improved, 0 regressed), 253 unchanged.
Top method improvements (bytes):
         -41 (-1.90% of base) : System.Private.CoreLib.dasm - Matrix4x4:<Invert>g__SoftwareFallback|65_1(Matrix4x4,byref):bool
         -36 (-6.19% of base) : System.Private.CoreLib.dasm - Matrix4x4:CreateShadow(Vector3,Plane):Matrix4x4
         -28 (-1.36% of base) : Microsoft.VisualBasic.Core.dasm - ObjectType:InternalNegObj(Object,IConvertible,int):Object
         -25 (-7.84% of base) : System.Private.CoreLib.dasm - MathF:IEEERemainder(float,float):float
         -24 (-1.21% of base) : Microsoft.VisualBasic.Core.dasm - Operators:NegateObject(Object):Object
         -24 (-5.93% of base) : System.Drawing.Common.dasm - Matrix:RotateAt(float,PointF,int):this
         -24 (-14.20% of base) : System.Private.CoreLib.dasm - Matrix3x2:op_UnaryNegation(Matrix3x2):Matrix3x2
         -21 (-7.00% of base) : System.Private.CoreLib.dasm - Math:IEEERemainder(double,double):double
         -20 (-1.43% of base) : Microsoft.VisualBasic.Core.dasm - Financial:IRR(byref,double):double
         -20 (-1.21% of base) : System.Private.CoreLib.dasm - Matrix4x4:Decompose(Matrix4x4,byref,byref,byref):bool
         -20 (-9.39% of base) : System.Runtime.Numerics.dasm - Complex:Asin(Complex):Complex
         -20 (-3.77% of base) : System.Runtime.Numerics.dasm - Complex:Sqrt(Complex):Complex
         -16 (-0.57% of base) : Microsoft.CodeAnalysis.VisualBasic.dasm - ExpressionEvaluator:EvaluateUnaryExpression(UnaryExpressionSyntax):CConst:this
         -16 (-9.64% of base) : System.Drawing.Common.dasm - PrinterSettings:CreateMeasurementGraphics(bool):Graphics:this
         -16 (-17.20% of base) : System.Private.CoreLib.dasm - Quaternion:op_UnaryNegation(Quaternion):Quaternion
         -16 (-17.20% of base) : System.Private.CoreLib.dasm - Quaternion:Negate(Quaternion):Quaternion
         -16 (-13.68% of base) : System.Private.CoreLib.dasm - Vector4:Abs(Vector4):Vector4
         -16 (-2.26% of base) : System.Private.CoreLib.dasm - Vector`1:Abs(Vector`1):Vector`1 (6 methods)
         -16 (-11.51% of base) : System.Runtime.Numerics.dasm - Complex:op_Division(double,Complex):Complex
         -16 (-7.24% of base) : System.Runtime.Numerics.dasm - Complex:Acos(Complex):Complex
Top method improvements (percentages):
          -4 (-25.00% of base) : FSharp.Core.dasm - -cctor@6135-13:Invoke(double):double:this
          -4 (-25.00% of base) : FSharp.Core.dasm - -cctor@6136-15:Invoke(float):float:this
          -4 (-25.00% of base) : System.Private.CoreLib.dasm - MathF:Abs(float):float
         -16 (-17.20% of base) : System.Private.CoreLib.dasm - Quaternion:op_UnaryNegation(Quaternion):Quaternion
         -16 (-17.20% of base) : System.Private.CoreLib.dasm - Quaternion:Negate(Quaternion):Quaternion
          -8 (-16.33% of base) : System.Runtime.Numerics.dasm - Complex:Negate(Complex):Complex
          -8 (-16.33% of base) : System.Runtime.Numerics.dasm - Complex:op_UnaryNegation(Complex):Complex
          -8 (-15.38% of base) : Microsoft.VisualBasic.Core.dasm - Conversion:Fix(double):double
         -12 (-14.81% of base) : System.Private.CoreLib.dasm - Quaternion:Conjugate(Quaternion):Quaternion
         -24 (-14.20% of base) : System.Private.CoreLib.dasm - Matrix3x2:op_UnaryNegation(Matrix3x2):Matrix3x2
         -16 (-13.68% of base) : System.Private.CoreLib.dasm - Vector4:Abs(Vector4):Vector4
          -8 (-12.50% of base) : Microsoft.Diagnostics.Tracing.TraceEvent.dasm - <>c:<ByIDSortedExclusiveMetric>b__15_0(CallTreeNodeBase,CallTreeNodeBase):int:this
          -8 (-12.50% of base) : Microsoft.Diagnostics.Tracing.TraceEvent.dasm - <>c:<.ctor>b__0_0(CallTreeNodeBase,CallTreeNodeBase):int:this
          -8 (-12.50% of base) : Microsoft.Diagnostics.Tracing.TraceEvent.dasm - <>c:<.ctor>b__0_1(CallTreeNodeBase,CallTreeNodeBase):int:this
          -8 (-12.50% of base) : Microsoft.Diagnostics.Tracing.TraceEvent.dasm - <>c:<GetCallees>b__5_0(CallTreeNode,CallTreeNode):int:this
         -12 (-12.37% of base) : Newtonsoft.Json.dasm - MathUtils:ApproxEquals(double,double):bool
         -12 (-11.88% of base) : System.Private.CoreLib.dasm - Vector3:Abs(Vector3):Vector3
          -8 (-11.76% of base) : Microsoft.VisualBasic.Core.dasm - Conversion:Fix(float):float
          -4 (-11.76% of base) : Newtonsoft.Json.dasm - JsonValidatingReader:IsZero(double):bool
         -16 (-11.51% of base) : System.Runtime.Numerics.dasm - Complex:op_Division(double,Complex):Complex
127 total methods with Code Size differences (127 improved, 0 regressed), 258503 unchanged.
Completed analysis in 28.55s

/cc @tannergooding

Saving 0.0 constant into a temp reg makes sense when it's used more than once but it doesn't work anyway: https://sharplab.io/#v2:EYLgxg9gTgpgtADwGwBYA0AXEBDAzgWwB8ABAJgEYBYAKGIGYACMhgYQYG8aHunGAzADYRsGBgFls2ABSDhohGgayRDAJ4BKBgF4AfA0QMAVPtUBuLjwD0lhgEtRQ7ABNcDAEQAGAHQe3DfHgA1gwYAO62YDCKAJIM2PgMAHYQojAAbjCJ/pkYthCJtokA5voIhnCqDLgAFhAArgJODMAwDBAADrn4tgBeME0YEAxlqjQAvkA===

@am11
Copy link
Member

am11 commented Sep 12, 2020

godbolt: https://godbolt.org/z/5K5dGr

@EgorBo, do you know why clang is producing vxorps, for double precision (as well as for float); while gcc seems to be doing the right thing, producing vxorpd with double and vxorps for float?

@EgorBo
Copy link
Member Author

EgorBo commented Sep 12, 2020

godbolt: https://godbolt.org/z/5K5dGr

@EgorBo, do you know why clang is producing vxorps, for double precision (as well as for float); while gcc seems to be doing the right thing, producing vxorpd with double and vxorps for float?

I guess there is no difference for xor what to xor - it xors bits https://godbolt.org/z/9TsWq1 🙂
UPD: it looks like there is a small difference for the 256bit xor: https://stackoverflow.com/questions/26942952/difference-between-the-avx-instructions-vxorpd-and-vpxor

@tannergooding
Copy link
Member

do you know why clang is producing vxorps, for double precision (as well as for float); while gcc seems to be doing the right thing, producing vxorpd with double and vxorps for float?

For the legacy encoding, the ps version is generally 1 byte smaller than the pd or integral versions. For VEX they should be the same size.
movaps, movups, movntps, xorps, orps, andps, and shufps are all other instructions that can also be substituted due to being a byte smaller but functionally operating on bits rather than on "32-bit floats".

@jkotas jkotas added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Sep 12, 2020
@sandreenko
Copy link
Contributor

@dotnet/jit-contrib

@EgorBo
Copy link
Member Author

EgorBo commented Sep 14, 2020

some tests fail, a simplified repro:

using System;
using System.Runtime.CompilerServices;

class Prog
{
    static void Main()
    {
        Console.WriteLine(
            BitConverter.DoubleToInt64Bits(Egor(0, double.NaN)));
    }

    [MethodImpl(MethodImplOptions.NoInlining)]
    static double Egor(double xmm0, double xmm1) => -xmm1;
}

happens only when AVX (VEX) is not available (e.g R2R).
Output:

Unhandled exception. System.NullReferenceException: Object reference not set to an instance of an object.
   at Prog.Egor(Double xmm0, Double xmm1)
   at Prog.Main()

codegen for Egor:

G_M17335_IG01:
						;; bbWeight=1    PerfScore 0.00
G_M17335_IG02:
       0F28C1               movaps   xmm0, xmm1
       0F57051E000000       xorps    xmm0, qword ptr [reloc @RWD24]
						;; bbWeight=1    PerfScore 2.25
G_M17335_IG03:
       C3                   ret
						;; bbWeight=1    PerfScore 1.00
RWD00  dq	0000000000000000h
RWD08  dq	8000000000000000h
RWD16  dq	0000000000000000h
RWD24  dq	8000000000000000h

@tannergooding
Copy link
Member

Shouldn't that be reloc @RWD00, not @RWD24?

@EgorBo
Copy link
Member Author

EgorBo commented Sep 14, 2020

@tannergooding 🤔 hm.. shouldn't the constant be 0x8000000000000000 (with sign bit on) but the section looks weird indeed.

fixed my code a bit so now it's:

; Assembly listing for method Prog:Egor(double,double):double
; Emitting BLENDED_CODE for X64 CPU with SSE2 - Unix
; optimized code
; rsp based frame
; partially interruptible
; Final local variable assignments
;
;* V00 arg0         [V00    ] (  0,  0   )  double  ->  zero-ref
;  V01 arg1         [V01,T00] (  3,  3   )  double  ->  mm1
;# V02 OutArgs      [V02    ] (  1,  1   )  lclBlk ( 0) [rsp+0x00]   "OutgoingArgSpace"
;
; Lcl frame size = 0

G_M17335_IG01:
						;; bbWeight=1    PerfScore 0.00
G_M17335_IG02:
       0F28C1               movaps   xmm0, xmm1
       0F57050E000000       xorps    xmm0, qword ptr [reloc @RWD08]
						;; bbWeight=1    PerfScore 2.25
G_M17335_IG03:
       C3                   ret
						;; bbWeight=1    PerfScore 1.00
RWD00  dq	0000000000000000h
RWD08  dq	8000000000000000h

but it still NREs.
here is the old (working) codegen that works:

; Assembly listing for method Prog:Egor(double,double):double
; Emitting BLENDED_CODE for X64 CPU with SSE2 - Unix
; optimized code
; rsp based frame
; partially interruptible
; Final local variable assignments
;
;* V00 arg0         [V00    ] (  0,  0   )  double  ->  zero-ref
;  V01 arg1         [V01,T00] (  3,  3   )  double  ->  mm1
;# V02 OutArgs      [V02    ] (  1,  1   )  lclBlk ( 0) [rsp+0x00]   "OutgoingArgSpace"
;
; Lcl frame size = 0

G_M17335_IG01:
						;; bbWeight=1    PerfScore 0.00
G_M17335_IG02:
       F20F100510000000     movsd    xmm0, qword ptr [reloc @RWD08]
       0F57C1               xorps    xmm0, xmm1
						;; bbWeight=1    PerfScore 2.33
G_M17335_IG03:
       C3                   ret
						;; bbWeight=1    PerfScore 1.00
RWD00  dq	0000000000000000h
RWD08  dq	8000000000000000h


; Total bytes of code 12, prolog size 0, PerfScore 4.53, instruction count 3 (MethodHash=71e0bc48) for method Prog:Egor(double,double):double
; ============================================================

@tannergooding
Copy link
Member

There is no scalar xor operation, just a packed version, so the constant must be 16-bytes with at least element 0 being -0.0 (although we currently set all elements to be -0.0 or -0.0f).

@EgorBo
Copy link
Member Author

EgorBo commented Sep 14, 2020

@tannergooding I've updated my previous comment - do you see why the second (current) codegen works fine?
(it saves the constant to the targetReg and then does xorps reg/reg instead of xorps reg/mem for the this PR version)

@tannergooding
Copy link
Member

movsd xmm0, qword ptr [reloc @RWD08] reads a scalar double (hence sd), so it will only read 8-bytes and using RWD08 when the constant at that address is 8-bytes is fine.
xorps xmm0, qword ptr [reloc @RWD08] reads a packed single (hence ps), so it reads 16-bytes. You are reading from RWD08 and so it is reading RWD08-RWD24 (where there is no data at RWD16).

The failure, however, isn't due to the overreading (in this case), its because the data isn't aligned. The VEX encoding allows contained memory operands to be unaligned, while the legacy encoding (generally speaking) requires them to be aligned.

Fixing the overreading issue should also fix the alignment issue. You just need to read from RWD00.

@tannergooding
Copy link
Member

Fixing the overreading issue should also fix the alignment issue

  • Assuming you are specifying the constant should be 16-byte aligned when being emitted, there is an option for this (emitAnyCns takes a cnsAlign parameter that is respected).

@EgorBo
Copy link
Member Author

EgorBo commented Sep 14, 2020

@tannergooding thank you for the explanation!
not sure my fix for it looks good but at least it works now

@EgorBo
Copy link
Member Author

EgorBo commented Sep 14, 2020

@tannergooding here is the current codegen:

Double:

static double Test(double xmm0, double xmm1) => -xmm1;
; VEX
       C5F877               vzeroupper
       C5F0570505000000     vxorps   xmm0, xmm1, qword ptr [reloc @RWD00]
       C3                   ret
RWD00  dq       8000000000000000h
RWD08  dq       8000000000000000h


; legacy
       0F28C1               movaps   xmm0, xmm1
       0F570516000000       xorps    xmm0, qword ptr [reloc @RWD16]
       C3                   ret
RWD00  db       000h, 000h, 000h, 000h, 000h, 000h, 000h, 000h, 000h, 000h, 000h, 000h, 000h, 000h, 000h, 000h
RWD16  dq       8000000000000000h
RWD24  dq       8000000000000000h

Float:

static float Test(float xmm0, float xmm1) => -xmm1;
; VEX
       C5F877               vzeroupper
       C5F0570505000000     vxorps   xmm0, xmm1, dword ptr [reloc @RWD00]
       C3                   ret
RWD00  dd       80000000h
RWD04  dd       80000000h
RWD08  dd       80000000h
RWD12  dd       80000000h


; legacy
       0F28C1               movaps   xmm0, xmm1
       0F570516000000       xorps    xmm0, dword ptr [reloc @RWD16]
       C3                   ret
RWD00  db       000h, 000h, 000h, 000h, 000h, 000h, 000h, 000h, 000h, 000h, 000h, 000h, 000h, 000h, 000h, 000h
RWD16  dd       80000000h
RWD20  dd       80000000h
RWD24  dd       80000000h
RWD28  dd       80000000h

@tannergooding
Copy link
Member

👍, that looks correct now. However, I'm unsure why there are 16-bytes of padding before the constant. It looks like it should just be able to start at RWD00 (with that being properly aligned).

@EgorBo
Copy link
Member Author

EgorBo commented Sep 14, 2020

👍, that looks correct now. However, I'm unsure why there are 16-bytes of padding before the constant. It looks like it should just be able to start at RWD00 (with that being properly aligned).

@tannergooding Fixed! as a bonus it fixes redundant paddings everywhere, e.g:

static double Test(double x) => x * 10 * 10;
       C5F877               vzeroupper 
       C5FB590515000000     vmulsd   xmm0, xmm0, qword ptr [reloc @RWD08]
       C5FB59051D000000     vmulsd   xmm0, xmm0, qword ptr [reloc @RWD24]
       C3                   ret      
RWD00  dq	0000000000000000h
RWD08  dq	4024000000000000h
RWD16  dq	0000000000000000h
RWD24  dq	4024000000000000h

to this:

       C5F877               vzeroupper 
       C5FB59050D000000     vmulsd   xmm0, xmm0, qword ptr [reloc @RWD00]
       C5FB59050D000000     vmulsd   xmm0, xmm0, qword ptr [reloc @RWD08]
       C3                   ret      
RWD00  dq	4024000000000000h
RWD08  dq	4024000000000000h

(I guess jit-diff ignores data sections to get the actual impact of this change)

@EgorBo
Copy link
Member Author

EgorBo commented Sep 15, 2020

@dotnet/jit-contrib @tannergooding PTAL, it's ready for final review

Summary:
This PR cleans up genSSE2BitwiseOp and it now emits 3-operands versions (VEX) of xorps/andps instead of movsd+xorps
Also, it fixes alignment for emitter::emitDataGenBeg, it used to emit paddings even if offset is already aligned or zero.
(my System.Private.Corelib.dll(R2R) is now 4kb less in size, jit-diff doesn't take data section into account) it's also responsible for redundant gaps between other types of constants, e.g. vectors.

A sample to cover both problems this PR fixes:

double Test(double x, double y) => -x * 10 * 5;

Current asm:

       C5F877               vzeroupper 
       C5FB100525000000     vmovsd   xmm0, qword ptr [reloc @RWD08] ; <-- can be eliminated for vxorps reg,reg,[mem]
       C5F857C1             vxorps   xmm0, xmm1
       C5FB590529000000     vmulsd   xmm0, xmm0, qword ptr [reloc @RWD24]
       C5FB590531000000     vmulsd   xmm0, xmm0, qword ptr [reloc @RWD40]
       C3                   ret      

RWD00  dq	0000000000000000h ; <-- redundant padding
RWD08  dq	8000000000000000h
RWD16  dq	0000000000000000h ; <-- redundant padding
RWD24  dq	4024000000000000h
RWD32  dq	0000000000000000h ; <-- redundant padding
RWD40  dq	4014000000000000h

New asm:

G_M60258_IG01:
       C5F877               vzeroupper 
G_M60258_IG02:
       C5F0570515000000     vxorps   xmm0, xmm1, qword ptr [reloc @RWD00]
       C5FB59051D000000     vmulsd   xmm0, xmm0, qword ptr [reloc @RWD16]
       C5FB59051D000000     vmulsd   xmm0, xmm0, qword ptr [reloc @RWD24]
G_M60258_IG03:
       C3                   ret      

RWD00  db	000h, 000h, 000h, 000h, 000h, 000h, 000h, 080h, 000h, 000h, 000h, 000h, 000h, 000h, 000h, 080h
RWD16  dq	4024000000000000h
RWD24  dq	4014000000000000h

New asm for non-VEX cpu:

       0F28C1               movaps   xmm0, xmm1
       0F570516000000       xorps    xmm0, qword ptr [reloc @RWD00]
       F20F59051E000000     mulsd    xmm0, qword ptr [reloc @RWD16]
       F20F59051E000000     mulsd    xmm0, qword ptr [reloc @RWD24]
       C3                   ret      

RWD00  db	000h, 000h, 000h, 000h, 000h, 000h, 000h, 080h, 000h, 000h, 000h, 000h, 000h, 000h, 000h, 080h
RWD16  dq	4024000000000000h
RWD24  dq	4014000000000000h

RWD00 is a 16bytes pack of masks for xorps (it's also 16bytes aligned)

Co-authored-by: Tanner Gooding <tagoo@outlook.com>
Copy link
Contributor

@CarolEidt CarolEidt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM - thanks!

@CarolEidt CarolEidt merged commit 271f008 into dotnet:master Oct 2, 2020
@ghost ghost locked as resolved and limited conversation to collaborators Dec 7, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants