-
Notifications
You must be signed in to change notification settings - Fork 5.1k
RyuJIT: Optimize -X and MathF.Abs(X) for floats #42164
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
214efa9
to
2ec23e6
Compare
@EgorBo, do you know why clang is producing |
I guess there is no difference for |
For the legacy encoding, the |
@dotnet/jit-contrib |
some tests fail, a simplified repro: using System;
using System.Runtime.CompilerServices;
class Prog
{
static void Main()
{
Console.WriteLine(
BitConverter.DoubleToInt64Bits(Egor(0, double.NaN)));
}
[MethodImpl(MethodImplOptions.NoInlining)]
static double Egor(double xmm0, double xmm1) => -xmm1;
} happens only when AVX (VEX) is not available (e.g R2R).
codegen for G_M17335_IG01:
;; bbWeight=1 PerfScore 0.00
G_M17335_IG02:
0F28C1 movaps xmm0, xmm1
0F57051E000000 xorps xmm0, qword ptr [reloc @RWD24]
;; bbWeight=1 PerfScore 2.25
G_M17335_IG03:
C3 ret
;; bbWeight=1 PerfScore 1.00
RWD00 dq 0000000000000000h
RWD08 dq 8000000000000000h
RWD16 dq 0000000000000000h
RWD24 dq 8000000000000000h |
Shouldn't that be |
@tannergooding 🤔 hm.. shouldn't the constant be fixed my code a bit so now it's: ; Assembly listing for method Prog:Egor(double,double):double
; Emitting BLENDED_CODE for X64 CPU with SSE2 - Unix
; optimized code
; rsp based frame
; partially interruptible
; Final local variable assignments
;
;* V00 arg0 [V00 ] ( 0, 0 ) double -> zero-ref
; V01 arg1 [V01,T00] ( 3, 3 ) double -> mm1
;# V02 OutArgs [V02 ] ( 1, 1 ) lclBlk ( 0) [rsp+0x00] "OutgoingArgSpace"
;
; Lcl frame size = 0
G_M17335_IG01:
;; bbWeight=1 PerfScore 0.00
G_M17335_IG02:
0F28C1 movaps xmm0, xmm1
0F57050E000000 xorps xmm0, qword ptr [reloc @RWD08]
;; bbWeight=1 PerfScore 2.25
G_M17335_IG03:
C3 ret
;; bbWeight=1 PerfScore 1.00
RWD00 dq 0000000000000000h
RWD08 dq 8000000000000000h but it still NREs. ; Assembly listing for method Prog:Egor(double,double):double
; Emitting BLENDED_CODE for X64 CPU with SSE2 - Unix
; optimized code
; rsp based frame
; partially interruptible
; Final local variable assignments
;
;* V00 arg0 [V00 ] ( 0, 0 ) double -> zero-ref
; V01 arg1 [V01,T00] ( 3, 3 ) double -> mm1
;# V02 OutArgs [V02 ] ( 1, 1 ) lclBlk ( 0) [rsp+0x00] "OutgoingArgSpace"
;
; Lcl frame size = 0
G_M17335_IG01:
;; bbWeight=1 PerfScore 0.00
G_M17335_IG02:
F20F100510000000 movsd xmm0, qword ptr [reloc @RWD08]
0F57C1 xorps xmm0, xmm1
;; bbWeight=1 PerfScore 2.33
G_M17335_IG03:
C3 ret
;; bbWeight=1 PerfScore 1.00
RWD00 dq 0000000000000000h
RWD08 dq 8000000000000000h
; Total bytes of code 12, prolog size 0, PerfScore 4.53, instruction count 3 (MethodHash=71e0bc48) for method Prog:Egor(double,double):double
; ============================================================
|
There is no scalar |
@tannergooding I've updated my previous comment - do you see why the second (current) codegen works fine? |
The failure, however, isn't due to the overreading (in this case), its because the data isn't aligned. The VEX encoding allows contained memory operands to be unaligned, while the legacy encoding (generally speaking) requires them to be aligned. Fixing the overreading issue should also fix the alignment issue. You just need to read from |
|
@tannergooding thank you for the explanation! |
@tannergooding here is the current codegen: Double: static double Test(double xmm0, double xmm1) => -xmm1; ; VEX
C5F877 vzeroupper
C5F0570505000000 vxorps xmm0, xmm1, qword ptr [reloc @RWD00]
C3 ret
RWD00 dq 8000000000000000h
RWD08 dq 8000000000000000h
; legacy
0F28C1 movaps xmm0, xmm1
0F570516000000 xorps xmm0, qword ptr [reloc @RWD16]
C3 ret
RWD00 db 000h, 000h, 000h, 000h, 000h, 000h, 000h, 000h, 000h, 000h, 000h, 000h, 000h, 000h, 000h, 000h
RWD16 dq 8000000000000000h
RWD24 dq 8000000000000000h Float: static float Test(float xmm0, float xmm1) => -xmm1; ; VEX
C5F877 vzeroupper
C5F0570505000000 vxorps xmm0, xmm1, dword ptr [reloc @RWD00]
C3 ret
RWD00 dd 80000000h
RWD04 dd 80000000h
RWD08 dd 80000000h
RWD12 dd 80000000h
; legacy
0F28C1 movaps xmm0, xmm1
0F570516000000 xorps xmm0, dword ptr [reloc @RWD16]
C3 ret
RWD00 db 000h, 000h, 000h, 000h, 000h, 000h, 000h, 000h, 000h, 000h, 000h, 000h, 000h, 000h, 000h, 000h
RWD16 dd 80000000h
RWD20 dd 80000000h
RWD24 dd 80000000h
RWD28 dd 80000000h |
👍, that looks correct now. However, I'm unsure why there are 16-bytes of padding before the constant. It looks like it should just be able to start at |
@tannergooding Fixed! as a bonus it fixes redundant paddings everywhere, e.g: static double Test(double x) => x * 10 * 10; C5F877 vzeroupper
C5FB590515000000 vmulsd xmm0, xmm0, qword ptr [reloc @RWD08]
C5FB59051D000000 vmulsd xmm0, xmm0, qword ptr [reloc @RWD24]
C3 ret
RWD00 dq 0000000000000000h
RWD08 dq 4024000000000000h
RWD16 dq 0000000000000000h
RWD24 dq 4024000000000000h to this: C5F877 vzeroupper
C5FB59050D000000 vmulsd xmm0, xmm0, qword ptr [reloc @RWD00]
C5FB59050D000000 vmulsd xmm0, xmm0, qword ptr [reloc @RWD08]
C3 ret
RWD00 dq 4024000000000000h
RWD08 dq 4024000000000000h (I guess jit-diff ignores data sections to get the actual impact of this change) |
@dotnet/jit-contrib @tannergooding PTAL, it's ready for final review Summary: A sample to cover both problems this PR fixes: double Test(double x, double y) => -x * 10 * 5; Current asm: C5F877 vzeroupper
C5FB100525000000 vmovsd xmm0, qword ptr [reloc @RWD08] ; <-- can be eliminated for vxorps reg,reg,[mem]
C5F857C1 vxorps xmm0, xmm1
C5FB590529000000 vmulsd xmm0, xmm0, qword ptr [reloc @RWD24]
C5FB590531000000 vmulsd xmm0, xmm0, qword ptr [reloc @RWD40]
C3 ret
RWD00 dq 0000000000000000h ; <-- redundant padding
RWD08 dq 8000000000000000h
RWD16 dq 0000000000000000h ; <-- redundant padding
RWD24 dq 4024000000000000h
RWD32 dq 0000000000000000h ; <-- redundant padding
RWD40 dq 4014000000000000h New asm: G_M60258_IG01:
C5F877 vzeroupper
G_M60258_IG02:
C5F0570515000000 vxorps xmm0, xmm1, qword ptr [reloc @RWD00]
C5FB59051D000000 vmulsd xmm0, xmm0, qword ptr [reloc @RWD16]
C5FB59051D000000 vmulsd xmm0, xmm0, qword ptr [reloc @RWD24]
G_M60258_IG03:
C3 ret
RWD00 db 000h, 000h, 000h, 000h, 000h, 000h, 000h, 080h, 000h, 000h, 000h, 000h, 000h, 000h, 000h, 080h
RWD16 dq 4024000000000000h
RWD24 dq 4014000000000000h New asm for non-VEX cpu: 0F28C1 movaps xmm0, xmm1
0F570516000000 xorps xmm0, qword ptr [reloc @RWD00]
F20F59051E000000 mulsd xmm0, qword ptr [reloc @RWD16]
F20F59051E000000 mulsd xmm0, qword ptr [reloc @RWD24]
C3 ret
RWD00 db 000h, 000h, 000h, 000h, 000h, 000h, 000h, 080h, 000h, 000h, 000h, 000h, 000h, 000h, 000h, 080h
RWD16 dq 4024000000000000h
RWD24 dq 4014000000000000h
|
Co-authored-by: Tanner Gooding <tagoo@outlook.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM - thanks!
Contributes to #1342
Current codegen:
New codegen:
godbolt: https://godbolt.org/z/5K5dGr
jit-diff:
/cc @tannergooding
Saving 0.0 constant into a temp reg makes sense when it's used more than once but it doesn't work anyway: https://sharplab.io/#v2:EYLgxg9gTgpgtADwGwBYA0AXEBDAzgWwB8ABAJgEYBYAKGIGYACMhgYQYG8aHunGAzADYRsGBgFls2ABSDhohGgayRDAJ4BKBgF4AfA0QMAVPtUBuLjwD0lhgEtRQ7ABNcDAEQAGAHQe3DfHgA1gwYAO62YDCKAJIM2PgMAHYQojAAbjCJ/pkYthCJtokA5voIhnCqDLgAFhAArgJODMAwDBAADrn4tgBeME0YEAxlqjQAvkA===