-
Notifications
You must be signed in to change notification settings - Fork 2.6k
Implement LoadHigh, LoadLow, and SetScalarVector128 SSE2 HW intrinsics #16736
Conversation
FYI @CarolEidt @tannergooding @fiigii Implementation of |
src/jit/hwintrinsicxarch.cpp
Outdated
// node -- GenTreeHWIntrinsic* node with nullptr default value | ||
// | ||
// Return Value: | ||
// number of arguments | ||
// | ||
int Compiler::numArgsOfHWIntrinsic(NamedIntrinsic intrinsic, GenTreeHWIntrinsic* node) | ||
int Compiler::numArgsOfHWIntrinsicSlow(GenTreeHWIntrinsic* node) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do not think we need two member functions of numArgsOfHWIntrinsic, please combine them to one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I thought there was a general consensus that we should have a single method.
src/jit/hwintrinsicxarch.cpp
Outdated
// node -- GenTreeHWIntrinsic* node with nullptr default value | ||
// | ||
// Return Value: | ||
// number of arguments | ||
// | ||
int Compiler::numArgsOfHWIntrinsic(NamedIntrinsic intrinsic, GenTreeHWIntrinsic* node) | ||
int Compiler::numArgsOfHWIntrinsicSlow(GenTreeHWIntrinsic* node) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I thought there was a general consensus that we should have a single method.
Yes, the consensus was reached that we should have one function but it was based on the assumption that we will have usages where function may be called with node with explicit discussion link: #15777 (comment) Unfortunately, function in the current code will be called only with node argument which is not
The last arguments in favor of proposed design are:
Finally I have introduced signature change by removing |
On Ubuntu we are hitting failures due to renamed
|
Not really, our consensus is that the function always needs the node as passed argument.
Interesting, how did you get the data?
I am not a fan of early optimization, but if you can make sure that the function cannot be in-lined in the release build, this approach is reasonable to me. But the slower path can be a static local function instead of private member. |
test Windows_NT x64 Checked jitx86hwintrinsicnosimd |
test Windows_NT x86 Checked jitx86hwintrinsicnosimd |
I agree with @fiigii, if we have some actual data indicating this is a good perf scenario, we should make this change. However, the |
Reverted to single function due to problems with getting @tannergooding @CarolEidt Could you assign it to me? |
The problem we are hitting now is that Disassembly of release build of .text:0000000180129C88 CodeGen::genHWIntrinsic proc near
.text:0000000180129C88 var_B0 = dword ptr -0B0h
.text:0000000180129C88 var_AC = dword ptr -0ACh
.text:0000000180129C88 var_A8 = qword ptr -0A8h
.text:0000000180129C88 var_A0 = dword ptr -0A0h
.text:0000000180129C88 var_98 = qword ptr -98h
.text:0000000180129C88 var_90 = xmmword ptr -90h
.text:0000000180129C88 var_80 = xmmword ptr -80h
.text:0000000180129C88 var_70 = xmmword ptr -70h
.text:0000000180129C88 var_60 = xmmword ptr -60h
.text:0000000180129C88 var_50 = xmmword ptr -50h
.text:0000000180129C88 var_40 = xmmword ptr -40h
.text:0000000180129C88 var_s18 = dword ptr 18h
.text:0000000180129C88 var_s20 = dword ptr 20h
.text:0000000180129C88 var_s28 = dword ptr 28h
.text:0000000180129C88 arg_0 = qword ptr 40h
.text:0000000180129C88
.text:0000000180129C88 mov [rsp-38h+arg_0], rbx
.text:0000000180129C8D push rbp
.text:0000000180129C8E push rsi
.text:0000000180129C8F push rdi
.text:0000000180129C90 push r12
.text:0000000180129C92 push r13
.text:0000000180129C94 push r14
.text:0000000180129C96 push r15
.text:0000000180129C98 lea rbp, [rsp-27h]
.text:0000000180129C9D sub rsp, 0C0h
.text:0000000180129CA4 mov r12d, [rdx+48h]
.text:0000000180129CA8 mov r14, rcx
.text:0000000180129CAB lea rcx, hwIntrinsicInfoArray
.text:0000000180129CB2 mov rsi, rdx
.text:0000000180129CB5 lea eax, [r12-6]
.text:0000000180129CBA mov r11d, eax
.text:0000000180129CBD lea rax, [rax+rax*4]
.text:0000000180129CC1 add rax, rax
.text:0000000180129CC4 mov r10d, [rcx+rax*8+4Ch]
.text:0000000180129CC9 mov r8d, [rcx+rax*8+10h]
.text:0000000180129CCE mov r15d, [rcx+rax*8+48h]
.text:0000000180129CD3 mov edi, [rcx+rax*8+14h]
.text:0000000180129CD7 mov rcx, rdx
.text:0000000180129CDA mov [rbp+57h+var_A0], r10d
.text:0000000180129CDE call Compiler__numArgsOfHWIntrinsic
.text:0000000180129CE3 lea ecx, [r15-3]
.text:0000000180129CE7 mov r9d, eax
.text:0000000180129CEA mov eax, 4
.text:0000000180129CEF test ecx, 0FFFFFFFAh
.text:0000000180129CF5 jnz short loc_180129D00
|
That's fine, I do not believe that calling |
As I said, if a helper function can make |
src/jit/hwintrinsiclistxarch.h
Outdated
HARDWARE_INTRINSIC(SSE2_SetZeroVector128, "SetZeroVector128", SSE2, -1, 16, 0, {INS_pxor, INS_pxor, INS_pxor, INS_pxor, INS_pxor, INS_pxor, INS_pxor, INS_pxor, INS_invalid, INS_xorpd}, HW_Category_Helper, HW_Flag_NoRMWSemantics) | ||
HARDWARE_INTRINSIC(SSE2_SetAllVector128, "SetAllVector128", SSE2, -1, 16, 1, {INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid}, HW_Category_Helper, HW_Flag_NoCodeGen|HW_Flag_NoRMWSemantics) | ||
HARDWARE_INTRINSIC(SSE2_SetScalarVector128, "SetScalarVector128", SSE2, -1, 16, 1, {INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_movsdsse2}, HW_Category_Helper, HW_Flag_MultiIns|HW_Flag_NoRMWSemantics) | ||
HARDWARE_INTRINSIC(SSE2_SetVector128, "SetVector128", SSE2, -1, 16, -1, {INS_pxor, INS_pxor, INS_pxor, INS_pxor, INS_pxor, INS_pxor, INS_pxor, INS_pxor, INS_invalid, INS_xorpd}, HW_Category_Helper, HW_Flag_NoFlag) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please remove the intrinsic definitions from this table, if you do not implement in the PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ooops, will do.
@fiigii Addressed code review feedback |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Thanks for the work.
@dotnet-bot test Tizen armel Cross Checked Innerloop Build and Test |
On Ubuntu there are failures of Windows sub jobs due to:
https://ci.dot.net/job/dotnet_coreclr/job/master/job/checked_windows_nt_prtest/3399/ |
test Ubuntu x64 Checked jitincompletehwintrinsic |
Again hitting timeouts after 4 hours runs |
test Ubuntu x64 Checked jitincompletehwintrinsic |
assert(baseType == TYP_DOUBLE); | ||
assert(op2 == nullptr); | ||
|
||
instruction ins = Compiler::insOfHWIntrinsic(intrinsicID, node->gtSIMDBaseType); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: It might be useful to extract this up-top, like we do in genSSEIntrinsic
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will fix
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tannergooding Checked the pattern and it seems that in Sse instructions are hard coded:
case NI_SSE_SetScalarVector128:
{
assert(baseType == TYP_FLOAT);
assert(op2 == nullptr);
if (op1Reg == targetReg)
{
regNumber tmpReg = node->GetSingleTempReg();
// Ensure we aren't overwriting targetReg
assert(tmpReg != targetReg);
emit->emitIns_R_R(INS_movaps, emitTypeSize(TYP_SIMD16), tmpReg, op1Reg);
op1Reg = tmpReg;
}
emit->emitIns_SIMD_R_R_R(INS_xorps, emitTypeSize(TYP_SIMD16), targetReg, targetReg, targetReg);
emit->emitIns_SIMD_R_R_R(INS_movss, emitTypeSize(TYP_SIMD16), targetReg, targetReg, op1Reg);
break;
}
I would prefer to not follow that pattern.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmmm, seems I was thinking of another method.
I was suggesting that it not be hard-coded, but you move the instruction ins = Compiler::insOfHWIntrinsic
line to the top of genSSE2Intrinsic
, so it can be more generally shared.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, I moved it inside case switches due to explicit code review feedback from @CarolEidt
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmmm - I don't recall giving that feedback (though I don't doubt that I did). It makes sense to pull it out in this case, as nearly all the cases depend upon it. If a large number had a one-to-one mapping, it might make sense to push it into the cases that require it.
In reply to: 172245582 [](ancestors = 172245582)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The comment is here: #16262 (comment)
@CarolEidt Pls confirm if I should refactor
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the pointer. At that time, there weren't that many cases, so it seemed better to push it down into the cases that needed it, but now there are many cases with the majority requiring it, so it would make sense to pull it up. However, I don't think it's critical.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am keen to get better implementation. May I do it together with next PRs?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am keen to get better implementation. May I do it together with next PRs?
That would be fine, though we will want to slow the churn in the code as we move closer to release.
src/jit/hwintrinsiccodegenxarch.cpp
Outdated
{ | ||
regNumber tmpReg = node->GetSingleTempReg(); | ||
|
||
assert(tmpReg != targetReg); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please ensure you copy over comments as well when copying code (missing // Ensure we aren't overwriting targetReg
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will fix
src/jit/hwintrinsicxarch.cpp
Outdated
// node unless it is nullptr | ||
// numArgsOfHWIntrinsic: gets the number of arguments for the hardware intrinsic. | ||
// This attempts to do a table based lookup but will fallback to the number | ||
// of operands in 'node' if the table entry is - 1. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Remove the space in - 1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will fix
src/jit/hwintrinsicxarch.cpp
Outdated
@@ -256,8 +261,8 @@ int Compiler::numArgsOfHWIntrinsic(NamedIntrinsic intrinsic, GenTreeHWIntrinsic* | |||
{ | |||
if (op1->OperIsList()) | |||
{ | |||
numArgs = 0; | |||
GenTreeArgList* list = op1->AsArgList(); | |||
int numArgs = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why redefine numArgs
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Bad merge - will fix
a8d4288
to
a72b412
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall LGTM.
I plan on creating a template for the Load/Store tests shortly, which should help in code coverage.
test Windows_NT x64 Checked jitincompletehwintrinsic test Windows_NT x86 Checked jitincompletehwintrinsic test Ubuntu x64 Checked jitincompletehwintrinsic |
@@ -218,20 +218,26 @@ unsigned Compiler::simdSizeOfHWIntrinsic(NamedIntrinsic intrinsic, CORINFO_SIG_I | |||
return simdSize; | |||
} | |||
|
|||
// TODO_XARCH-CQ - refactoring of numArgsOfHWIntrinsic fast path into inlinable | |||
// function and slow local static function may increase performance significantly |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we need this comment. As far as I know there is no evidence that this has a measurable impact on throughput.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for review. May I correct it in next PR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that's fine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Addresses code review feedback from #15777