-
Notifications
You must be signed in to change notification settings - Fork 5.1k
Arm64/Sve: Fix overzealous assert for embedded mask intrinsics with RMW semantics #105541
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
wrap in TARGET_ARM64
@dotnet/jit-contrib @dotnet/arm64-contrib |
<Project Sdk="Microsoft.NET.Sdk"> | ||
<PropertyGroup> | ||
<Optimize>True</Optimize> | ||
</PropertyGroup> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we add <CLRTestTargetUnsupported Condition="'$(TargetArchitecture)' != 'arm64'">true</CLRTestTargetUnsupported>
to avoid building this test unnecessarily?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could, but there are lot of tests in this folder that do not have this. So would rather skip it and do it together for all the tests.
src/coreclr/jit/lsraarm64.cpp
Outdated
srcCount += BuildDelayFreeUses(emitOp2, emitOp1); | ||
srcCount += BuildDelayFreeUses(emitOp3, emitOp1); | ||
srcCount += BuildDelayFreeUses(intrin.op3, emitOp1); | ||
srcCount += BuildDelayFreeUses(emitOp2, emitOp1, intrinsicTree); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand this change, FMLA doesn't have 2x RMW nodes, it has 1.
Given dst = CndSel(mask, fmla(op1, op2, op3), merge)
movprfx (predicated)
fmla (predicated) - contained
movprfx (unpredicated)
fmla (predicated) - contained
fmla (predicated)
sel (predicated) - not contained
In the case of the first one, movprfx (predicated)
works as dst = CndSel(mask, op1, dst)
, so therefore the merge
value is actually the RMW node. We then emit fmla dst, mask, op2, op3
So inactive elements stay as is (values from dst
) and active elements are overwritten (this is the form we emit for merge == zero
and for other merge values in many cases).
In the case of the second one, movprfx (unpredicated)
works simply as dst = op1
, so therefore op1
is the RMW node. We then emit fmla dst, mask, op2, op3
, so inactive elements stay as is (this is the form we emit for mask == ptrue
or merge == op1
)
In the case of the final one we have two independent instruction sequences that get emitted. So the first fmla
only considers with regards to itself and will often execute the movprfx (unpredicated)
path because the mask needs to be considered as all true
only not emitting movprfx (unpredicated)
when dst == op1
already. The sel then independently merges the result
with merge
and there is no RMW consideration
private static void TestMethod2(Vector<double> mask) | ||
{ | ||
var vr1 = Vector128.CreateScalar((double)10).AsVector(); | ||
s_3 = Sve.ConditionalSelect(mask, Sve.FusedMultiplyAdd(vr1, s_3, s_3), s_3); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this testing what we want?
s_3
is a static field, so I'd expect this is relying extra on CSE and other opts to get the "right shape"
Sve.FusedMultiplyAdd
should also be capable of emitting either FMAD
(Zdn = Za + Zdn * Zm
) or FMLA
(Zda = Zda + Zn * Zm
), which is to say between these two instructions any operand can be the RMW one, so it might be good to ensure we're covering the range here with some comments or even assertions (explicit validation using the disassembly checking functionality) of what codegen is expected
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I hit an assert with this scenario so added it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tannergooding how would we go about explicitly validating the codegen for this? On my machine, the decision to use fmad or fmla for a test changes depending on if we're optimizing or not (though I'm not hitting asserts either way with this change).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
By utilizing https://github.com/dotnet/runtime/blob/main/docs/workflow/testing/coreclr/disasm-checks.md
We should be able to add tests that validate each of the operands given Sve.FusedMultiplyAdd(a, b, c)
end up as the target. We should then be able to further expand those base tests over ConditionalSelect
to ensure the right behavior still occurs.
It might be a release and TieredCompilation=0 only test, but it should be possible to get setup and validated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tannergooding thanks for pointing that out -- I've updated the tests to check for expected codegen.
@TIHan the disasm check logic in the test is failing because it cannot find the JitDisasm output file. I can't repro this failure locally, so I think the test script is handling paths correctly. Have you run into issues like this with disasm tests in CI before? |
Is it possible that |
@markples I believe that's right. If the machine doesn't support it, it won't write it. For this scenario, we would need to support alt-jit to support SVE for disasm-checks which should be possible. |
@TIHan @markples good point, thank you for clarifying this -- I'm guessing there's no trivial way to conditionalize the disasm check, right? I tried naively setting @tannergooding since the assert this PR fixes is popping up in Fuzzlyn and Antigen, are you ok with taking this without any disasm checks in the test? I can update the annotations to just be comments documenting the expected codegen. |
For x64 I think you could just disable the overall test. Or, as long as any output file gets generated, none of the I think it's going to be harder for arm64 machines that may or may not have sve support unless there's a way to not run the test at all based on sve rather than arm64. Or Longer term, I wonder if there's a way to add first class support to our testing infrastructure. |
Yeah, this seems to be the primary blocker -- at the moment, I don't think the disasm check infra supports conditionalizing on the ISA, and I don't know of a way to skip this test in CI if the machine doesn't support SVE. I think we should pursue support for disasm checks in this test as a follow-up, since this PR is also fixing a blocking issue. |
I'm fine with not adding it, but I think we should ensure an issue is logged to track the ability to add such support long term. The main goal of requesting it be added was to ensure we're hitting the relevant code paths and generating the expected codegen for each of them, to help ensure no corner cases were missed |
I opened #106093 to track this. |
I agree to get this PR in. If you wanted a short-term fix for here, I think you could change |
@markples thank you for the suggestion! Do you know what subset of the runtime I need to build to pick up changes to |
I spoke with @markples offline, and the @tannergooding are you ok with me merging this as-is? |
ping @tannergooding |
1 similar comment
ping @tannergooding |
Adding @AndyAyersMS as a reviewer. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since this PR is now just changing assert behavior, can we update the description to describe it more accurately?
I'm not at all familiar with this part of the jit, so while I'll conditionally approve this it would be good for somebody who is more familiar to look it over too.
Sure thing; fixed.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM.
-- Sorry for the delay, missed this one in the swarm of other PRs I had been also reviewing
No worries, thanks! |
For RMW intrinsics, we delay free the SVE operands wrt to the RMW operand node, which means for some embedded mask operations like FMLA, the operands can be given the same register as the target of the parent
CndSel
node. Thus, relax some asserts around the embedded mask operands' registers not matching theCndSel
's target register.Resolves: #105474