-
Notifications
You must be signed in to change notification settings - Fork 5k
ARM64: Fix lsra for AdvSimd_LoadAndInsertScalar #107786
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch |
Checked in a debugger the right functions are now called. @dotnet/arm64-contrib @kunalspathak |
Change-Id: I9579349d76498ed0d73bade48eba66cc23a31ebf
@@ -3846,9 +3846,23 @@ int LinearScan::BuildDelayFreeUses(GenTree* node, | |||
return 0; | |||
} | |||
} | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@kunalspathak : As suggested offline, added some additional checks in the delay slot logic. If the register type of the register does not match that of the delay slot register then do not add the delay free use. This fixes where we call BuildDelayFreeUses()
for op2
which is a integer value.
Instead, we could do these checks in BuildHWIntrinsic()
to not call BuildDelayFreeUses()
, but then BuildDelayFreeUses()
would still need the same checks turned into asserts. So it seemed simpler to do inside.
This will work nicely with BuildHWIntrinsic()
rewrite PR too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ideally this should be assert and the callers should not be calling delay free uses for different register type. Did you turned it into an assert and see how many places are hit?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Difficult to test with an assert because 1) all the hwintrinsic tests will fail at the first assert 2) many of these issues will be due during ConstantExpected
APIs, and for a lot of those we are only testing using hardcoded constants, meaning the assert will never be hit.
Scanning the SVE API, all the methods we have that are RMW
and have a ConstantExpected
are:
AddRotateComplex
DotProductBySelectedScalar
ExtractVector
FusedMultiplyAddBySelectedScalar
FusedMultiplySubtractBySelectedScalar
MultiplyAddRotateComplex
MultiplyAddRotateComplexBySelectedScalar
MultiplyBySelectedScalar
SaturatingDecrementBy16BitElementCount
SaturatingDecrementBy32BitElementCount
SaturatingDecrementBy64BitElementCount
SaturatingDecrementBy8BitElementCount
SaturatingDecrementByActiveElementCount
SaturatingIncrementBy16BitElementCount
SaturatingIncrementBy32BitElementCount
SaturatingIncrementBy64BitElementCount
SaturatingIncrementBy8BitElementCount
SaturatingIncrementByActiveElementCount
ShiftRightArithmeticForDivide
TrigonometricMultiplyAddCoefficient
Then there are AdvSimd ones. Then there are possibly others.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is much easier to fix properly in the lsra rewrite PR, as the check can just be added into the main for loop in BuildHWIntrinsic()
else if (delayFreeOp != nullptr && TheExtraChecksFromThisPR.....)
{
srcCount += BuildDelayFreeUses(operand, delayFreeOp, candidates);
}
else
{
srcCount += BuildOperandUses(operand, candidates);
}
/azp run Fuzzlyn |
Azure Pipelines successfully started running 1 pipeline(s). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@@ -161,6 +165,7 @@ namespace JIT.HardwareIntrinsics.Arm | |||
for (var i = 0; i < Op1ElementCount; i++) { _data1[i] = {NextValueOp1}; } | |||
Unsafe.CopyBlockUnaligned(ref Unsafe.As<{Op1VectorType}<{Op1BaseType}>, byte>(ref _fld1), ref Unsafe.As<{Op1BaseType}, byte>(ref _data1[0]), (uint)Unsafe.SizeOf<{Op1VectorType}<{Op1BaseType}>>()); | |||
|
|||
_fld2 = {ElementIndex}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lot of the tests that takes immediate value is missing this coverage. We should fix it some day.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added an issue: #108060
@@ -3846,9 +3846,23 @@ int LinearScan::BuildDelayFreeUses(GenTree* node, | |||
return 0; | |||
} | |||
} | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ideally this should be assert and the callers should not be calling delay free uses for different register type. Did you turned it into an assert and see how many places are hit?
// Multi register nodes should no go via this route. | ||
assert(!node->IsMultiRegNode()); | ||
if (rmwNode == nullptr || varTypeUsesSameRegType(rmwNode->TypeGet(), node->TypeGet()) || | ||
(rmwNode->IsMultiRegNode() && varTypeUsesFloatReg(node->TypeGet()))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
did not understand the && varTypeUsesFloatReg(node->TypeGet())
here? did you see code where having delay free is ok to have for multi-reg node that are GPR (if there is such a thing)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the RMW node is a multi register, then it'll always be vector registers.
The normal node could be general register or vector register. We want to discount the case where node is a general register. So, varTypeUsesFloatReg()
is confirming node
uses a FP, vector or sve vector register
Eg:
(Vector128<sbyte> Value1, Vector128<sbyte> Value2) LoadAndInsertScalar((Vector128<sbyte>, Vector128<sbyte>) values, [ConstantExpected(Max = (byte)(15))] byte index, sbyte* address)
The RMW node is the multi register values
.
This code makes sure index
is not delay slotted.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am wondering we should have an assert that if it is a multiRegNode, then node
should be floatreg
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added some asserts.
That then broke things because for delayFreeMultiple
intrinsics we do:
BuildDelayFreeUses(use.GetNode(), intrinsicTree);
Which is passing the entire intrinsicTree
down as the rmw
node which feels wrong. Fixed that to pass op1
instead.
Other than this and the comment in #107786 (comment), we should be good. |
In base:
Because of this conflict, it allocates the def of in diff:
Which means it is able to allocate |
@a74nh - seems the unit tests are failing. can you take a look? |
I've restored the change to the call to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
* ARM64: Fix lsra for AdvSimd_LoadAndInsertScalar * Fix comment * Expand LoadAndInsertScalar testing * Only delayfree when register types match * fix formatting * Fix comment * Check that multiregister nodes use fp registers * Revert rmw assert
Spotted during implementation of #107459
The single register version of AdvSimd LoadAndInsertScalar is:
Vector128<X> LoadAndInsertScalar(Vector128<X> value, byte index, X* address);
Currently, the code builds op1 as an address and op3 as a rmw node:
This instead should be: