Skip to content
This repository was archived by the owner on Jan 23, 2023. It is now read-only.

Implement LoadHigh, LoadLow, and SetScalarVector128 SSE2 HW intrinsics #16736

Merged
merged 2 commits into from
Mar 6, 2018

Conversation

4creators
Copy link

Addresses code review feedback from #15777

@4creators
Copy link
Author

FYI @CarolEidt @tannergooding @fiigii

Implementation of Compiler::numArgsOfHWIntrinsic was further simplified and due to dropping of default nullptr argument what would prevent inlining fast path part of the function I have refactored it to ensure fast path will be inlined.

// node -- GenTreeHWIntrinsic* node with nullptr default value
//
// Return Value:
// number of arguments
//
int Compiler::numArgsOfHWIntrinsic(NamedIntrinsic intrinsic, GenTreeHWIntrinsic* node)
int Compiler::numArgsOfHWIntrinsicSlow(GenTreeHWIntrinsic* node)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not think we need two member functions of numArgsOfHWIntrinsic, please combine them to one.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I thought there was a general consensus that we should have a single method.

// node -- GenTreeHWIntrinsic* node with nullptr default value
//
// Return Value:
// number of arguments
//
int Compiler::numArgsOfHWIntrinsic(NamedIntrinsic intrinsic, GenTreeHWIntrinsic* node)
int Compiler::numArgsOfHWIntrinsicSlow(GenTreeHWIntrinsic* node)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I thought there was a general consensus that we should have a single method.

@4creators
Copy link
Author

4creators commented Mar 3, 2018

@fiigii @CarolEidt

Yes, the consensus was reached that we should have one function but it was based on the assumption that we will have usages where function may be called with node with explicit nullptr value. This design should ensure that in the case the function will be called with default node value C++ compiler would optimize away unreachable slow code.

discussion link: #15777 (comment)

Unfortunately, function in the current code will be called only with node argument which is not nullptr and function due to it's size will not be inlined. To avoid that problem I have used optimization pattern which is always used in .NET Core to enable inlining of function which has small fast code path and which could be good inlining candidate. This pattern leaves fast path in the function which is inlinable and moves large, slow path to subroutine which is called rarely. This pattern is used throughout .NET Core managed and native parts and was discussed in a couple of .NET Core team blog posts.

numArgsOfHWIntrinsic function is called around 3 times for every compiled HW intrinsic and difference between two functions and one function in number of CPU cycles will be roughly 5 cycles for current implementation to 130 - 140 cycles for single function implementation (I assume context switch for every call costs 120-130 CPU cycles. In total we would get ~390 - 420 CPU cycles for each compiled intrinsic with your proposal while we could keep it down to ~15 CPU cycles for majority of instructions what is 2 600 % faster.

The last arguments in favor of proposed design are:

  • there is general problem with performance of code generation in .NET Core which takes roughly 15% of compilation time
  • numArgsOfHWIntrinsicSlow can be renamed and hidden as private subroutin

Finally I have introduced signature change by removing NamedIntrinsic intrinsicID argument as this information is passed to function with node. This eliminates the need for assertion assert(node->gtHWIntrinsicId == intrinsic).

@4creators
Copy link
Author

On Ubuntu we are hitting failures due to renamed System.Runtime.Intrinsics assembly:

Unhandled Exception: System.IO.FileNotFoundException: Could not load file or assembly 'System.Runtime.Intrinsics, Version=4.0.0.0, Culture=neutral, PublicKeyToken=cc7b13ffcd2ddd51'. The system cannot find the file specified.

@4creators
Copy link
Author

@fiigii
Copy link

fiigii commented Mar 4, 2018

Yes, the consensus was reached that we should have one function but it was based on the assumption that we will have usages where function may be called with node with explicit nullptr value. This design should ensure that in the case the function will be called with default node value C++ compiler would optimize away unreachable slow code.

Not really, our consensus is that the function always needs the node as passed argument.

due to it's size will not be inlined

current implementation to 130 - 140 cycles for single function implementation (I assume context switch for every call costs 120-130 CPU cycles. In total we would get ~390 - 420 CPU cycles for each compiled intrinsic with your proposal while we could keep it down to ~15 CPU cycles for majority of instructions what is 2 600 % faster

Interesting, how did you get the data?

numArgsOfHWIntrinsicSlow can be renamed and hidden as private subroutin

I am not a fan of early optimization, but if you can make sure that the function cannot be in-lined in the release build, this approach is reasonable to me. But the slower path can be a static local function instead of private member.

@4creators
Copy link
Author

test Windows_NT x64 Checked jitx86hwintrinsicnosimd

@4creators
Copy link
Author

test Windows_NT x86 Checked jitx86hwintrinsicnosimd

@tannergooding
Copy link
Member

I agree with @fiigii, if we have some actual data indicating this is a good perf scenario, we should make this change. However, the slow path should be a static local function, rather than a member of Compiler.

@4creators
Copy link
Author

Reverted to single function due to problems with getting static int numArgsOfHWIntrinsic inlined after splitting into two functions. Opened #16744 to track this work item.

@tannergooding @CarolEidt Could you assign it to me?

@4creators
Copy link
Author

The problem we are hitting now is that numArgsOfHWIntrinsic is never inlined while it was inlined before refactoring:

Disassembly of release build of clrjit.dll CodeGen::genHWIntrinsic

.text:0000000180129C88 CodeGen::genHWIntrinsic proc near 

.text:0000000180129C88 var_B0          = dword ptr -0B0h
.text:0000000180129C88 var_AC          = dword ptr -0ACh
.text:0000000180129C88 var_A8          = qword ptr -0A8h
.text:0000000180129C88 var_A0          = dword ptr -0A0h
.text:0000000180129C88 var_98          = qword ptr -98h
.text:0000000180129C88 var_90          = xmmword ptr -90h
.text:0000000180129C88 var_80          = xmmword ptr -80h
.text:0000000180129C88 var_70          = xmmword ptr -70h
.text:0000000180129C88 var_60          = xmmword ptr -60h
.text:0000000180129C88 var_50          = xmmword ptr -50h
.text:0000000180129C88 var_40          = xmmword ptr -40h
.text:0000000180129C88 var_s18         = dword ptr  18h
.text:0000000180129C88 var_s20         = dword ptr  20h
.text:0000000180129C88 var_s28         = dword ptr  28h
.text:0000000180129C88 arg_0           = qword ptr  40h
.text:0000000180129C88
.text:0000000180129C88                 mov     [rsp-38h+arg_0], rbx
.text:0000000180129C8D                 push    rbp
.text:0000000180129C8E                 push    rsi
.text:0000000180129C8F                 push    rdi
.text:0000000180129C90                 push    r12
.text:0000000180129C92                 push    r13
.text:0000000180129C94                 push    r14
.text:0000000180129C96                 push    r15
.text:0000000180129C98                 lea     rbp, [rsp-27h]
.text:0000000180129C9D                 sub     rsp, 0C0h
.text:0000000180129CA4                 mov     r12d, [rdx+48h]
.text:0000000180129CA8                 mov     r14, rcx
.text:0000000180129CAB                 lea     rcx, hwIntrinsicInfoArray
.text:0000000180129CB2                 mov     rsi, rdx
.text:0000000180129CB5                 lea     eax, [r12-6]
.text:0000000180129CBA                 mov     r11d, eax
.text:0000000180129CBD                 lea     rax, [rax+rax*4]
.text:0000000180129CC1                 add     rax, rax
.text:0000000180129CC4                 mov     r10d, [rcx+rax*8+4Ch]
.text:0000000180129CC9                 mov     r8d, [rcx+rax*8+10h]
.text:0000000180129CCE                 mov     r15d, [rcx+rax*8+48h]
.text:0000000180129CD3                 mov     edi, [rcx+rax*8+14h]
.text:0000000180129CD7                 mov     rcx, rdx
.text:0000000180129CDA                 mov     [rbp+57h+var_A0], r10d
.text:0000000180129CDE                 call    Compiler__numArgsOfHWIntrinsic
.text:0000000180129CE3                 lea     ecx, [r15-3]
.text:0000000180129CE7                 mov     r9d, eax
.text:0000000180129CEA                 mov     eax, 4
.text:0000000180129CEF                 test    ecx, 0FFFFFFFAh
.text:0000000180129CF5                 jnz     short loc_180129D00

@fiigii
Copy link

fiigii commented Mar 4, 2018

that numArgsOfHWIntrinsic is never inlined while it was inlined before refactoring:

That's fine, I do not believe that calling numArgsOfHWIntrinsic could be a bottleneck.

@fiigii
Copy link

fiigii commented Mar 4, 2018

As I said, if a helper function can make numArgsOfHWIntrinsic inlined, we can use a static local function.

HARDWARE_INTRINSIC(SSE2_SetZeroVector128, "SetZeroVector128", SSE2, -1, 16, 0, {INS_pxor, INS_pxor, INS_pxor, INS_pxor, INS_pxor, INS_pxor, INS_pxor, INS_pxor, INS_invalid, INS_xorpd}, HW_Category_Helper, HW_Flag_NoRMWSemantics)
HARDWARE_INTRINSIC(SSE2_SetAllVector128, "SetAllVector128", SSE2, -1, 16, 1, {INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid}, HW_Category_Helper, HW_Flag_NoCodeGen|HW_Flag_NoRMWSemantics)
HARDWARE_INTRINSIC(SSE2_SetScalarVector128, "SetScalarVector128", SSE2, -1, 16, 1, {INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_movsdsse2}, HW_Category_Helper, HW_Flag_MultiIns|HW_Flag_NoRMWSemantics)
HARDWARE_INTRINSIC(SSE2_SetVector128, "SetVector128", SSE2, -1, 16, -1, {INS_pxor, INS_pxor, INS_pxor, INS_pxor, INS_pxor, INS_pxor, INS_pxor, INS_pxor, INS_invalid, INS_xorpd}, HW_Category_Helper, HW_Flag_NoFlag)
Copy link

@fiigii fiigii Mar 4, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please remove the intrinsic definitions from this table, if you do not implement in the PR.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ooops, will do.

@4creators
Copy link
Author

@fiigii Addressed code review feedback

Copy link

@fiigii fiigii left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks for the work.

@4creators
Copy link
Author

@dotnet-bot test Tizen armel Cross Checked Innerloop Build and Test

@4creators
Copy link
Author

On Ubuntu there are failures of Windows sub jobs due to:

Test Infrastructure Failure: Insufficient system resources exist to complete the requested service

https://ci.dot.net/job/dotnet_coreclr/job/master/job/checked_windows_nt_prtest/3399/

@4creators
Copy link
Author

@4creators
Copy link
Author

test Ubuntu x64 Checked jitincompletehwintrinsic
test Ubuntu x64 Checked jitx86hwintrinsicnoavx
test Ubuntu x64 Checked jitx86hwintrinsicnoavx2
test Windows_NT x86 Checked jitx86hwintrinsicnoavx2

@4creators
Copy link
Author

Again hitting timeouts after 4 hours runs

@4creators
Copy link
Author

test Ubuntu x64 Checked jitincompletehwintrinsic
test Windows_NT x86 Checked jitx86hwintrinsicnoavx2
test Windows_NT x64 Checked jitx86hwintrinsicnoavx2

assert(baseType == TYP_DOUBLE);
assert(op2 == nullptr);

instruction ins = Compiler::insOfHWIntrinsic(intrinsicID, node->gtSIMDBaseType);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: It might be useful to extract this up-top, like we do in genSSEIntrinsic

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will fix

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tannergooding Checked the pattern and it seems that in Sse instructions are hard coded:

        case NI_SSE_SetScalarVector128:
        {
            assert(baseType == TYP_FLOAT);
            assert(op2 == nullptr);

            if (op1Reg == targetReg)
            {
                regNumber tmpReg = node->GetSingleTempReg();

                // Ensure we aren't overwriting targetReg
                assert(tmpReg != targetReg);

                emit->emitIns_R_R(INS_movaps, emitTypeSize(TYP_SIMD16), tmpReg, op1Reg);
                op1Reg = tmpReg;
            }

            emit->emitIns_SIMD_R_R_R(INS_xorps, emitTypeSize(TYP_SIMD16), targetReg, targetReg, targetReg);
            emit->emitIns_SIMD_R_R_R(INS_movss, emitTypeSize(TYP_SIMD16), targetReg, targetReg, op1Reg);
            break;
        }

I would prefer to not follow that pattern.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmmm, seems I was thinking of another method.

I was suggesting that it not be hard-coded, but you move the instruction ins = Compiler::insOfHWIntrinsic line to the top of genSSE2Intrinsic, so it can be more generally shared.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, I moved it inside case switches due to explicit code review feedback from @CarolEidt

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmmm - I don't recall giving that feedback (though I don't doubt that I did). It makes sense to pull it out in this case, as nearly all the cases depend upon it. If a large number had a one-to-one mapping, it might make sense to push it into the cases that require it.


In reply to: 172245582 [](ancestors = 172245582)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment is here: #16262 (comment)

@CarolEidt Pls confirm if I should refactor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the pointer. At that time, there weren't that many cases, so it seemed better to push it down into the cases that needed it, but now there are many cases with the majority requiring it, so it would make sense to pull it up. However, I don't think it's critical.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am keen to get better implementation. May I do it together with next PRs?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am keen to get better implementation. May I do it together with next PRs?

That would be fine, though we will want to slow the churn in the code as we move closer to release.

{
regNumber tmpReg = node->GetSingleTempReg();

assert(tmpReg != targetReg);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please ensure you copy over comments as well when copying code (missing // Ensure we aren't overwriting targetReg)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will fix

// node unless it is nullptr
// numArgsOfHWIntrinsic: gets the number of arguments for the hardware intrinsic.
// This attempts to do a table based lookup but will fallback to the number
// of operands in 'node' if the table entry is - 1.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Remove the space in - 1

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will fix

@@ -256,8 +261,8 @@ int Compiler::numArgsOfHWIntrinsic(NamedIntrinsic intrinsic, GenTreeHWIntrinsic*
{
if (op1->OperIsList())
{
numArgs = 0;
GenTreeArgList* list = op1->AsArgList();
int numArgs = 0;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why redefine numArgs?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bad merge - will fix

@4creators 4creators force-pushed the sse2part5 branch 2 times, most recently from a8d4288 to a72b412 Compare March 5, 2018 16:27
Copy link
Member

@tannergooding tannergooding left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall LGTM.

I plan on creating a template for the Load/Store tests shortly, which should help in code coverage.

@4creators
Copy link
Author

test Windows_NT x64 Checked jitincompletehwintrinsic
test Windows_NT x64 Checked jitx86hwintrinsicnoavx
test Windows_NT x64 Checked jitx86hwintrinsicnoavx2
test Windows_NT x64 Checked jitx86hwintrinsicnosimd
test Windows_NT x64 Checked jitnox86hwintrinsic

test Windows_NT x86 Checked jitincompletehwintrinsic
test Windows_NT x86 Checked jitx86hwintrinsicnoavx
test Windows_NT x86 Checked jitx86hwintrinsicnoavx2
test Windows_NT x86 Checked jitx86hwintrinsicnosimd
test Windows_NT x86 Checked jitnox86hwintrinsic

test Ubuntu x64 Checked jitincompletehwintrinsic
test Ubuntu x64 Checked jitx86hwintrinsicnoavx
test Ubuntu x64 Checked jitx86hwintrinsicnoavx2
test Ubuntu x64 Checked jitx86hwintrinsicnosimd
test Ubuntu x64 Checked jitnox86hwintrinsic

@@ -218,20 +218,26 @@ unsigned Compiler::simdSizeOfHWIntrinsic(NamedIntrinsic intrinsic, CORINFO_SIG_I
return simdSize;
}

// TODO_XARCH-CQ - refactoring of numArgsOfHWIntrinsic fast path into inlinable
// function and slow local static function may increase performance significantly

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we need this comment. As far as I know there is no evidence that this has a measurable impact on throughput.

Copy link
Author

@4creators 4creators Mar 5, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for review. May I correct it in next PR?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that's fine.

Copy link

@CarolEidt CarolEidt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:shipit:

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants