Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enable FEATURE_MULTICASTSTUB_AS_IL for Windows x86 #104192

Merged
merged 13 commits into from
Jul 1, 2024

Conversation

huoyaoyuan
Copy link
Member

Closes #103958 .

Local benchmark result for Action<int> on x86:

Method Job Toolchain Mean Error StdDev Ratio
SingleArg Job-OSFVND \x86-3-TuneBranchPrediction\corerun.exe 3.464 ns 0.0127 ns 0.0119 ns 0.95
SingleArg Job-KBQPJV \x86-main\corerun.exe 3.661 ns 0.0222 ns 0.0208 ns 1.00
ManyCast Job-OSFVND \x86-3-TuneBranchPrediction\corerun.exe 8.188 ns 0.0944 ns 0.0883 ns 0.97
ManyCast Job-KBQPJV \x86-main\corerun.exe 8.483 ns 0.0888 ns 0.0831 ns 1.00
ManyArg_RetFPU Job-OSFVND \x86-3-TuneBranchPrediction\corerun.exe 5.566 ns 0.0338 ns 0.0316 ns 0.96
ManyArg_RetFPU Job-KBQPJV \x86-main\corerun.exe 5.771 ns 0.0212 ns 0.0177 ns 1.00

@huoyaoyuan
Copy link
Member Author

@EgorBot -intel -amd -arm64

using BenchmarkDotNet.Attributes;
using System;

namespace BenchmarkGround
{
    public struct GCStruct
    {
        public object a, b, c, d, e, f, h, g;
    }

    public class Bench
    {
        private readonly object obj = new object();

        private readonly Guid guid = Guid.NewGuid();

        private readonly GCStruct gcStruct = new GCStruct
        {
            a = new object(),
            g = new object()
        };

        private readonly Action<int> singleArg = (Action<int>)delegate { } + delegate { };

        private readonly Action<int> manyCast = (Action<int>)delegate { }
        +
            delegate { }
        + delegate { }
        + delegate { }
        + delegate { }
        + delegate { };

        private readonly Func<int, double, Guid, GCStruct, object, double> manyArg_RetFPU = (Func<int, double, Guid, GCStruct, object, double>)
            delegate { return 123.456; }
        + delegate { return 654.321; };

        [Benchmark]
        public void SingleArg() => singleArg(42);

        [Benchmark]
        public void ManyCast() => manyCast(42);

        [Benchmark]
        public double ManyArg_RetFPU() => manyArg_RetFPU(42, 123.0, guid, gcStruct, obj);
    }
}

@dotnet-policy-service dotnet-policy-service bot added the community-contribution Indicates that the PR has been added by a community member label Jun 29, 2024
Copy link
Contributor

Tagging subscribers to this area: @mangod9
See info in area-owners.md if you want to be subscribed.

@EgorBot
Copy link

EgorBot commented Jun 29, 2024

Benchmark results on Intel
BenchmarkDotNet v0.13.12, Ubuntu 22.04.4 LTS (Jammy Jellyfish)
Intel Xeon Platinum 8370C CPU 2.80GHz, 1 CPU, 8 logical and 4 physical cores
  Job-FAGYWG : .NET 9.0.0 (42.42.42.42424), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI
  Job-LQQTSO : .NET 9.0.0 (42.42.42.42424), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI
Method Toolchain Mean Error Ratio
SingleArg Main 8.084 ns 0.0012 ns 1.00
SingleArg PR 6.078 ns 0.0012 ns 0.75
ManyCast Main 19.581 ns 0.0313 ns 1.00
ManyCast PR 14.887 ns 0.1134 ns 0.76
ManyArg_RetFPU Main 14.538 ns 0.0340 ns 1.00
ManyArg_RetFPU PR 11.450 ns 0.0147 ns 0.79

BDN_Artifacts.zip

@EgorBot
Copy link

EgorBot commented Jun 29, 2024

Benchmark results on Arm64
BenchmarkDotNet v0.13.12, Ubuntu 22.04.4 LTS (Jammy Jellyfish)
Unknown processor
  Job-LOUVQC : .NET 9.0.0 (42.42.42.42424), Arm64 RyuJIT AdvSIMD
  Job-TLYWZG : .NET 9.0.0 (42.42.42.42424), Arm64 RyuJIT AdvSIMD
Method Toolchain Mean Error Ratio
SingleArg Main 15.831 ns 0.0109 ns 1.00
SingleArg PR 8.772 ns 0.0017 ns 0.55
ManyCast Main 38.106 ns 0.0234 ns 1.00
ManyCast PR 21.135 ns 0.0015 ns 0.55
ManyArg_RetFPU Main 17.636 ns 0.0024 ns 1.00
ManyArg_RetFPU PR 16.759 ns 0.0071 ns 0.95

BDN_Artifacts.zip

@EgorBot
Copy link

EgorBot commented Jun 29, 2024

Benchmark results on Amd
BenchmarkDotNet v0.13.12, Ubuntu 22.04.4 LTS (Jammy Jellyfish)
AMD EPYC 7763, 1 CPU, 8 logical and 4 physical cores
  Job-DXSXUL : .NET 9.0.0 (42.42.42.42424), X64 RyuJIT AVX2
  Job-BIEGFU : .NET 9.0.0 (42.42.42.42424), X64 RyuJIT AVX2
Method Toolchain Mean Error Ratio
SingleArg Main 13.008 ns 0.1016 ns 1.00
SingleArg PR 6.490 ns 0.0071 ns 0.50
ManyCast Main 100.460 ns 0.5200 ns 1.00
ManyCast PR 18.849 ns 0.0302 ns 0.19
ManyArg_RetFPU Main 14.358 ns 0.0464 ns 1.00
ManyArg_RetFPU PR 12.682 ns 0.0120 ns 0.88

BDN_Artifacts.zip

Comment on lines 2202 to 2209
pCode->EmitBRTRUE(invokeTraceHelper);
pCode->EmitBR(debuggerCheckEnd); // Tune branch prediction to prefer non-debugging path

pCode->EmitLabel(invokeTraceHelper);

pCode->EmitLoadThis();
pCode->EmitLDLOC(dwLoopCounterNum);
pCode->EmitCALL(METHOD__STUBHELPERS__MULTICAST_DEBUGGER_TRACE_HELPER, 2, 0);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
pCode->EmitBRTRUE(invokeTraceHelper);
pCode->EmitBR(debuggerCheckEnd); // Tune branch prediction to prefer non-debugging path
pCode->EmitLabel(invokeTraceHelper);
pCode->EmitLoadThis();
pCode->EmitLDLOC(dwLoopCounterNum);
pCode->EmitCALL(METHOD__STUBHELPERS__MULTICAST_DEBUGGER_TRACE_HELPER, 2, 0);
pCode->EmitBRTRUE(invokeTraceHelper);

And move the debugging path to be after RET:

        pCode->EmitRET();

#ifdef DEBUGGING_SUPPORTED
        // Emit debugging support at the end of the method for better perf
        pCode->EmitLabel(invokeTraceHelper);

        pCode->EmitLoadThis();
        pCode->EmitLDLOC(dwLoopCounterNum);
        pCode->EmitCALL(METHOD__STUBHELPERS__MULTICAST_DEBUGGER_TRACE_HELPER, 2, 0);

        pCode->EmitBR(debuggerCheckEnd);
#endif

This should be even better

Copy link
Member Author

@huoyaoyuan huoyaoyuan Jun 29, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That was exactly what I did locally, and resulted in exact same codegen with current, at least for x86. The compiled asm code block order did follow IL block order exactly.

@dotnet/jit-contrib Do you have any suggestion on this?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code has atypical loop. It looks like the JIT tried to reorder the basic blocks to turn it a more regular loop. I am not sure whether there is anything to fix in the JIT (the JIT would have to have profile data to do better).

It may be still worth it to move TraceHelper call to be at the end in IL. It provides stronger hint about the desired code layout to the JIT and makes this optimization a bit less fragile.

Copy link
Member Author

@huoyaoyuan huoyaoyuan Jun 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well normalizing the loop closer to a for loop results in about 5% improvement for ManyArg case, but 5% regression for ManyCast.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Codegen for x64 now:

; Assembly listing for method System.Action:IL_STUB_MulticastDelegate_Invoke():this (FullOpts)
; Emitting BLENDED_CODE for X64 with AVX - Windows
; FullOpts code
; optimized code
; optimized using Synthesized PGO
; rsp based frame
; partially interruptible
; with Synthesized PGO: fgCalledCount is 100
; No PGO data

G_M000_IG01:                ;; offset=0x0000
       push     rsi
       push     rbx
       sub      rsp, 40
       mov      rbx, rcx

G_M000_IG02:                ;; offset=0x0009
       xor      esi, esi
       cmp      qword ptr [rbx+0x30], 0
       jle      SHORT G_M000_IG05

G_M000_IG03:                ;; offset=0x0012
       test     dword ptr [(reloc 0x7ff824c55210)], 512
       jne      SHORT G_M000_IG06

G_M000_IG04:                ;; offset=0x001E
       mov      rcx, gword ptr [rbx+0x28]
       cmp      esi, dword ptr [rcx+0x08]
       jae      SHORT G_M000_IG07
       mov      rax, gword ptr [rcx+8*rsi+0x10]
       mov      rcx, gword ptr [rax+0x08]
       call     [rax+0x18]System.Action:Invoke():this
       inc      esi
       movsxd   rcx, esi
       cmp      rcx, qword ptr [rbx+0x30]
       jl       SHORT G_M000_IG03

G_M000_IG05:                ;; offset=0x003E
       add      rsp, 40
       pop      rbx
       pop      rsi
       ret

G_M000_IG06:                ;; offset=0x0045
       mov      rcx, rbx
       mov      edx, esi
       call     System.StubHelpers.StubHelpers:MulticastDebuggerTraceHelper(System.Object,int)
       jmp      SHORT G_M000_IG04

G_M000_IG07:                ;; offset=0x0051
       call     CORINFO_HELP_RNGCHKFAIL
       int3

; Total bytes of code 87

Copy link
Member

@jkotas jkotas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM otherwise. Nice simplification!

src/coreclr/vm/comdelegate.cpp Outdated Show resolved Hide resolved
Comment on lines 2202 to 2209
pCode->EmitBRTRUE(invokeTraceHelper);
pCode->EmitBR(debuggerCheckEnd); // Tune branch prediction to prefer non-debugging path

pCode->EmitLabel(invokeTraceHelper);

pCode->EmitLoadThis();
pCode->EmitLDLOC(dwLoopCounterNum);
pCode->EmitCALL(METHOD__STUBHELPERS__MULTICAST_DEBUGGER_TRACE_HELPER, 2, 0);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code has atypical loop. It looks like the JIT tried to reorder the basic blocks to turn it a more regular loop. I am not sure whether there is anything to fix in the JIT (the JIT would have to have profile data to do better).

It may be still worth it to move TraceHelper call to be at the end in IL. It provides stronger hint about the desired code layout to the JIT and makes this optimization a bit less fragile.

src/coreclr/vm/comdelegate.cpp Outdated Show resolved Hide resolved

// initialize counter
pCode->EmitLDC(0);
pCode->EmitSTLOC(dwLoopCounterNum);

// Make the shape of the loop similar to what C# compiler emits
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new shape seems to be missing the invocation of the trace helper at the end. E.g. if the invocation count is 2, the trace helper should be called 3 times. It is only called 2 times if I am reading the code correctly.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed. Result codegen:

; Assembly listing for method System.Action:IL_STUB_MulticastDelegate_Invoke():this (FullOpts)
; Emitting BLENDED_CODE for X64 with AVX - Windows
; FullOpts code
; optimized code
; optimized using Synthesized PGO
; rsp based frame
; partially interruptible
; with Synthesized PGO: fgCalledCount is 100
; No PGO data

G_M000_IG01:                ;; offset=0x0000
       push     rsi
       push     rbx
       sub      rsp, 40
       mov      rbx, rcx

G_M000_IG02:                ;; offset=0x0009
       xor      esi, esi
       jmp      SHORT G_M000_IG04

G_M000_IG03:                ;; offset=0x000D
       mov      rcx, gword ptr [rbx+0x28]
       cmp      esi, dword ptr [rcx+0x08]
       jae      SHORT G_M000_IG08
       mov      rax, gword ptr [rcx+8*rsi+0x10]
       mov      rcx, gword ptr [rax+0x08]
       call     [rax+0x18]System.Action:Invoke():this
       inc      esi

G_M000_IG04:                ;; offset=0x0024
       test     dword ptr [(reloc 0x7ff84ea05210)], 512
       jne      SHORT G_M000_IG06

G_M000_IG05:                ;; offset=0x0030
       movsxd   rcx, esi
       cmp      rcx, qword ptr [rbx+0x30]
       jl       SHORT G_M000_IG03
       jmp      SHORT G_M000_IG07

G_M000_IG06:                ;; offset=0x003B
       mov      rcx, rbx
       mov      edx, esi
       call     System.StubHelpers.StubHelpers:MulticastDebuggerTraceHelper(System.Object,int)
       jmp      SHORT G_M000_IG05

G_M000_IG07:                ;; offset=0x0047
       add      rsp, 40
       pop      rbx
       pop      rsi
       ret

G_M000_IG08:                ;; offset=0x004E
       call     CORINFO_HELP_RNGCHKFAIL
       int3

; Total bytes of code 84

JIT reorders the debugging block in front of the ret. Benchmark shows no regression. It's just a bit strange with the unconditional jumps.

@jkotas jkotas merged commit 3dd3488 into dotnet:main Jul 1, 2024
86 of 89 checks passed
@huoyaoyuan huoyaoyuan deleted the multicast-stub-x86 branch July 1, 2024 04:39
@github-actions github-actions bot locked and limited conversation to collaborators Jul 31, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-VM-coreclr community-contribution Indicates that the PR has been added by a community member
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Consider to enable FEATURE_MULTICASTSTUB_AS_IL for Windows x86
3 participants