-
Notifications
You must be signed in to change notification settings - Fork 13.5k
[AMDGPU] Fix canonicalization of truncated values. #83054
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
@llvm/pr-subscribers-llvm-selectiondag @llvm/pr-subscribers-backend-amdgpu Author: Harald van Dijk (hvdijk) ChangesWe were relying on roundings to implicitly canonicalize, which is generally safe, except with roundings that may be optimized away. Fixes #82937. Full diff: https://github.com/llvm/llvm-project/pull/83054.diff 3 Files Affected:
diff --git a/llvm/lib/Target/AMDGPU/SIISelLowering.cpp b/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
index 84ef9679ab9563..f7f64fdc9aae8a 100644
--- a/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
+++ b/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
@@ -12831,6 +12831,20 @@ SDValue SITargetLowering::performFCanonicalizeCombine(
}
}
+ // The TRUNC parameter of FP_ROUND specifies whether the operation may be
+ // optimized away because the operation is known to be exact. Even if the
+ // operation would be considered exact in normal circumstances where we do not
+ // care about SNaN, we do care about SNaN here and must preserve the operation
+ // unless its input is known to be canonicalized.
+ if (SrcOpc == ISD::FP_ROUND || SrcOpc == ISD::STRICT_FP_ROUND) {
+ if (N0.getConstantOperandVal(1) == 0 ||
+ isCanonicalized(DAG, N0.getOperand(0)))
+ return N0;
+ SDLoc SL(N0);
+ return DAG.getNode(SrcOpc, SL, VT, N0.getOperand(0),
+ DAG.getTargetConstant(0, SL, MVT::i32));
+ }
+
return isCanonicalized(DAG, N0) ? N0 : SDValue();
}
diff --git a/llvm/test/CodeGen/AMDGPU/bf16.ll b/llvm/test/CodeGen/AMDGPU/bf16.ll
index c773742a459507..14c5a8c46f7ba5 100644
--- a/llvm/test/CodeGen/AMDGPU/bf16.ll
+++ b/llvm/test/CodeGen/AMDGPU/bf16.ll
@@ -26818,11 +26818,15 @@ define bfloat @v_canonicalize_bf16(bfloat %a) {
; GCN-LABEL: v_canonicalize_bf16:
; GCN: ; %bb.0:
; GCN-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GCN-NEXT: v_mul_f32_e32 v0, 1.0, v0
+; GCN-NEXT: v_and_b32_e32 v0, 0xffff0000, v0
; GCN-NEXT: s_setpc_b64 s[30:31]
;
; GFX7-LABEL: v_canonicalize_bf16:
; GFX7: ; %bb.0:
; GFX7-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX7-NEXT: v_mul_f32_e32 v0, 1.0, v0
+; GFX7-NEXT: v_and_b32_e32 v0, 0xffff0000, v0
; GFX7-NEXT: s_setpc_b64 s[30:31]
;
; GFX8-LABEL: v_canonicalize_bf16:
diff --git a/llvm/test/CodeGen/AMDGPU/fcanonicalize.f16.ll b/llvm/test/CodeGen/AMDGPU/fcanonicalize.f16.ll
index 274621307f540d..f53adf88454bc6 100644
--- a/llvm/test/CodeGen/AMDGPU/fcanonicalize.f16.ll
+++ b/llvm/test/CodeGen/AMDGPU/fcanonicalize.f16.ll
@@ -170,6 +170,35 @@ define amdgpu_kernel void @s_test_canonicalize_var_f16(ptr addrspace(1) %out, i1
ret void
}
+define half @s_test_canonicalize_arg(half %x) #1 {
+; VI-LABEL: s_test_canonicalize_arg:
+; VI: ; %bb.0:
+; VI-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; VI-NEXT: v_max_f16_e32 v0, v0, v0
+; VI-NEXT: s_setpc_b64 s[30:31]
+;
+; GFX9-LABEL: s_test_canonicalize_arg:
+; GFX9: ; %bb.0:
+; GFX9-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX9-NEXT: v_max_f16_e32 v0, v0, v0
+; GFX9-NEXT: s_setpc_b64 s[30:31]
+;
+; CI-LABEL: s_test_canonicalize_arg:
+; CI: ; %bb.0:
+; CI-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; CI-NEXT: v_cvt_f16_f32_e32 v0, v0
+; CI-NEXT: v_cvt_f32_f16_e32 v0, v0
+; CI-NEXT: s_setpc_b64 s[30:31]
+;
+; GFX11-LABEL: s_test_canonicalize_arg:
+; GFX11: ; %bb.0:
+; GFX11-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX11-NEXT: v_max_f16_e32 v0, v0, v0
+; GFX11-NEXT: s_setpc_b64 s[30:31]
+ %canonicalized = call half @llvm.canonicalize.f16(half %x)
+ ret half %canonicalized
+}
+
define <2 x half> @v_test_canonicalize_build_vector_v2f16(half %lo, half %hi) #1 {
; VI-LABEL: v_test_canonicalize_build_vector_v2f16:
; VI: ; %bb.0:
|
197b1b3
to
6c814cc
Compare
✅ With the latest revision this PR passed the C/C++ code formatter. |
6c814cc
to
c799ba0
Compare
My first attempt for this, special casing |
c799ba0
to
c496176
Compare
c496176
to
998a431
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like this approach. Makes a lot of sense to me.
// If fcanonicalize's operand is implicitly canonicalized, we only need a copy. | ||
let AddedComplexity = 1000 in { | ||
def : GCNPat< | ||
(is_canonicalized_1<fcanonicalize> f16:$src), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This might be a dumb question (I have never understood DAG selection patterns) but why can't we write this more like (fcanonicalize (is_canonicalized f16:$src))
, so the predicate applies to the operand instead of to the whole expression? Then we would not need separate _1
and _2
versions of is_canonicalized
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not a dumb question, one that I would like an answer to as well :) Variations of that were what I tried first, but I couldn't get it to work, and TableGen's error messages were not helpful in letting me figure out a way to get it working. I decided to just stick what was already being done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The PatFrag is what owns the predicate; it might be possible to define something using srcvalue as a way of getting a catch-all matcher with a predicate
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had a look and couldn't find any custom predicate logic other than for instructions, which means we cannot do it on the operand level as the operand need not be an instruction. It's possible I missed something simple, but I think I would prefer to just leave it like this over making big changes to make TableGen support operand-level predicates.
class | ||
FPMinCanonMaxPat<Instruction minmaxInst, ValueType vt, SDPatternOperator min_or_max, | ||
SDPatternOperator max_or_min_oneuse> : GCNPat < | ||
(min_or_max (is_canonicalized_1<fcanonicalize> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder how many more fp patterns there are that might need to skip over an fcanonicalize
in the middle.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's really arbitrary operations. fmin/fmax are just of particular interest because we're stuck inserting canonicalizes in the lowering for them. The more I think about it, the more I think we'd better of making this a separate transform (but just moving this to selection patterns is a net improvement for now I think)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There may well be more, I just went for the ones that I came across from the test results. But if there are more, it doesn't result in wrong code, only suboptimal code, which I think is acceptable if the overall code still looks better?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd only consider the min/max legalize case important. There's only a handful of canonicalizes in the math library, they aren't common (and I think we could avoid most of those too if we rewrote the code slightly)
// TODO: This is incorrect as it loses track of the operand's type. We may | ||
// end up effectively bitcasting from f32 to v2f16 or vice versa, and the | ||
// same bits that are canonicalized in one type need not be in the other. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should give up on different element size bitcasts? (separate patch though)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think rather than give up, we should add a parameter specifying the type so that we would be able to handle things like v2f16
-> i32
-> truncate to i16
-> f16
. But yes, separate patch, this is a pre-existing issue.
class | ||
FPMinCanonMaxPat<Instruction minmaxInst, ValueType vt, SDPatternOperator min_or_max, | ||
SDPatternOperator max_or_min_oneuse> : GCNPat < | ||
(min_or_max (is_canonicalized_1<fcanonicalize> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's really arbitrary operations. fmin/fmax are just of particular interest because we're stuck inserting canonicalizes in the lowering for them. The more I think about it, the more I think we'd better of making this a separate transform (but just moving this to selection patterns is a net improvement for now I think)
@@ -12603,20 +12627,21 @@ bool SITargetLowering::isCanonicalized(SelectionDAG &DAG, SDValue Op, | |||
case Intrinsic::amdgcn_trig_preop: | |||
case Intrinsic::amdgcn_log: | |||
case Intrinsic::amdgcn_exp2: | |||
case Intrinsic::amdgcn_sqrt: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
unrelated bonus change?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not unrelated, this is one of the things that came up in LLVM's testsuite where previously we were able to optimize away a fcanonicalize
because we saw fsqrt
, but now we only check later and see amdgcn_sqrt
.
We were relying on roundings to implicitly canonicalize, which is generally safe, except that roundings may be optimized away, and more generally, other operations may be optimized away as well. To solve this, as suggested by @arsenm, keep fcanonicalize nodes around for longer. Some tests revealed cases where we no longer figured out that previous operations already produced canonicalized results but we could easily change that; this commit includes those changes. Other tests revealed cases where we no longer figure out that previous operations already produced canonicalized results but larger changes are needed to detect that; this commit disables those tests or updates the expected results. Fixes llvm#82937.
998a431
to
ed0b46d
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the tablegen can be simplified today, but it shouldn't block this patch
Thanks! If I happen to end up figuring out how to simplify the TableGen I can create a followup PR, otherwise I look forward to seeing how someone else manages to do :) |
I managed to get it to work with this:
|
Oh of course! I should have seen that. Nice work. |
Awesome! I wonder if we could do something similar for is_divergent and is_uniform, so we don't need separate DivergentUnaryFrag/DivergentBinFrag/etc. |
Probably not. I'm running into a few issues trying to clean up the other uses. First, the predicate seems to not actually get applied when this is used as the source of a complex pattern. Plus we need at least another variant with an operand for the non-leaf cases |
We were relying on roundings to implicitly canonicalize, which is generally safe, except with roundings that may be optimized away.
Fixes #82937.