
[AMDGPU] Fix canonicalization of truncated values. #83054


Merged: 1 commit into llvm:main on Mar 13, 2024

Conversation

@hvdijk (Contributor) commented Feb 26, 2024

We were relying on roundings to implicitly canonicalize, which is generally safe, except with roundings that may be optimized away.

Fixes #82937.

@llvmbot (Member) commented Feb 26, 2024

@llvm/pr-subscribers-llvm-selectiondag

@llvm/pr-subscribers-backend-amdgpu

Author: Harald van Dijk (hvdijk)

Changes

We were relying on roundings to implicitly canonicalize, which is generally safe, except with roundings that may be optimized away.

Fixes #82937.


Full diff: https://github.com/llvm/llvm-project/pull/83054.diff

3 Files Affected:

  • (modified) llvm/lib/Target/AMDGPU/SIISelLowering.cpp (+14)
  • (modified) llvm/test/CodeGen/AMDGPU/bf16.ll (+4)
  • (modified) llvm/test/CodeGen/AMDGPU/fcanonicalize.f16.ll (+29)
diff --git a/llvm/lib/Target/AMDGPU/SIISelLowering.cpp b/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
index 84ef9679ab9563..f7f64fdc9aae8a 100644
--- a/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
+++ b/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
@@ -12831,6 +12831,20 @@ SDValue SITargetLowering::performFCanonicalizeCombine(
     }
   }
 
+  // The TRUNC parameter of FP_ROUND specifies whether the operation may be
+  // optimized away because the operation is known to be exact. Even if the
+  // operation would be considered exact in normal circumstances where we do not
+  // care about SNaN, we do care about SNaN here and must preserve the operation
+  // unless its input is known to be canonicalized.
+  if (SrcOpc == ISD::FP_ROUND || SrcOpc == ISD::STRICT_FP_ROUND) {
+    if (N0.getConstantOperandVal(1) == 0 ||
+        isCanonicalized(DAG, N0.getOperand(0)))
+      return N0;
+    SDLoc SL(N0);
+    return DAG.getNode(SrcOpc, SL, VT, N0.getOperand(0),
+                       DAG.getTargetConstant(0, SL, MVT::i32));
+  }
+
   return isCanonicalized(DAG, N0) ? N0 : SDValue();
 }
 
diff --git a/llvm/test/CodeGen/AMDGPU/bf16.ll b/llvm/test/CodeGen/AMDGPU/bf16.ll
index c773742a459507..14c5a8c46f7ba5 100644
--- a/llvm/test/CodeGen/AMDGPU/bf16.ll
+++ b/llvm/test/CodeGen/AMDGPU/bf16.ll
@@ -26818,11 +26818,15 @@ define bfloat @v_canonicalize_bf16(bfloat %a) {
 ; GCN-LABEL: v_canonicalize_bf16:
 ; GCN:       ; %bb.0:
 ; GCN-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GCN-NEXT:    v_mul_f32_e32 v0, 1.0, v0
+; GCN-NEXT:    v_and_b32_e32 v0, 0xffff0000, v0
 ; GCN-NEXT:    s_setpc_b64 s[30:31]
 ;
 ; GFX7-LABEL: v_canonicalize_bf16:
 ; GFX7:       ; %bb.0:
 ; GFX7-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX7-NEXT:    v_mul_f32_e32 v0, 1.0, v0
+; GFX7-NEXT:    v_and_b32_e32 v0, 0xffff0000, v0
 ; GFX7-NEXT:    s_setpc_b64 s[30:31]
 ;
 ; GFX8-LABEL: v_canonicalize_bf16:
diff --git a/llvm/test/CodeGen/AMDGPU/fcanonicalize.f16.ll b/llvm/test/CodeGen/AMDGPU/fcanonicalize.f16.ll
index 274621307f540d..f53adf88454bc6 100644
--- a/llvm/test/CodeGen/AMDGPU/fcanonicalize.f16.ll
+++ b/llvm/test/CodeGen/AMDGPU/fcanonicalize.f16.ll
@@ -170,6 +170,35 @@ define amdgpu_kernel void @s_test_canonicalize_var_f16(ptr addrspace(1) %out, i1
   ret void
 }
 
+define half @s_test_canonicalize_arg(half %x) #1 {
+; VI-LABEL: s_test_canonicalize_arg:
+; VI:       ; %bb.0:
+; VI-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; VI-NEXT:    v_max_f16_e32 v0, v0, v0
+; VI-NEXT:    s_setpc_b64 s[30:31]
+;
+; GFX9-LABEL: s_test_canonicalize_arg:
+; GFX9:       ; %bb.0:
+; GFX9-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX9-NEXT:    v_max_f16_e32 v0, v0, v0
+; GFX9-NEXT:    s_setpc_b64 s[30:31]
+;
+; CI-LABEL: s_test_canonicalize_arg:
+; CI:       ; %bb.0:
+; CI-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; CI-NEXT:    v_cvt_f16_f32_e32 v0, v0
+; CI-NEXT:    v_cvt_f32_f16_e32 v0, v0
+; CI-NEXT:    s_setpc_b64 s[30:31]
+;
+; GFX11-LABEL: s_test_canonicalize_arg:
+; GFX11:       ; %bb.0:
+; GFX11-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX11-NEXT:    v_max_f16_e32 v0, v0, v0
+; GFX11-NEXT:    s_setpc_b64 s[30:31]
+  %canonicalized = call half @llvm.canonicalize.f16(half %x)
+  ret half %canonicalized
+}
+
 define <2 x half> @v_test_canonicalize_build_vector_v2f16(half %lo, half %hi) #1 {
 ; VI-LABEL: v_test_canonicalize_build_vector_v2f16:
 ; VI:       ; %bb.0:

@hvdijk hvdijk force-pushed the amdgpu-fcanonicalize-fp_round branch from 197b1b3 to 6c814cc Compare February 27, 2024 00:22
@llvmbot added the llvm:SelectionDAG label Feb 27, 2024
github-actions bot commented Feb 27, 2024

✅ With the latest revision this PR passed the C/C++ code formatter.

@hvdijk hvdijk force-pushed the amdgpu-fcanonicalize-fp_round branch from 6c814cc to c799ba0 Compare February 27, 2024 00:31
@hvdijk
Copy link
Contributor Author

hvdijk commented Feb 27, 2024

My first attempt at this, special-casing FP_ROUND operands, worked for the tests we had but not in the general case, because it did not take into account that we may have nodes that will later be optimized to just FP_ROUND. This second attempt instead keeps track, while determining whether an operand is already canonicalized, of whether we saw any FP_ROUND with TRUNC=1. If we did, we can still optimize FCANONICALIZE to a no-op by inserting a FREEZE node to block optimizations. This should be correct in all cases and avoids redundant recanonicalization in the most common cases, but as the test updates show, it does not avoid it in all cases. I will attempt to update this further.

@hvdijk hvdijk force-pushed the amdgpu-fcanonicalize-fp_round branch from c799ba0 to c496176 Compare February 27, 2024 23:21
@hvdijk hvdijk force-pushed the amdgpu-fcanonicalize-fp_round branch from c496176 to 998a431 Compare March 5, 2024 20:58
@jayfoad (Contributor) left a comment:
I like this approach. Makes a lot of sense to me.

```tablegen
// If fcanonicalize's operand is implicitly canonicalized, we only need a copy.
let AddedComplexity = 1000 in {
def : GCNPat<
  (is_canonicalized_1<fcanonicalize> f16:$src),
```
Contributor:
This might be a dumb question (I have never understood DAG selection patterns) but why can't we write this more like (fcanonicalize (is_canonicalized f16:$src)), so the predicate applies to the operand instead of to the whole expression? Then we would not need separate _1 and _2 versions of is_canonicalized.

hvdijk (Contributor, Author):

Not a dumb question; it is one that I would like an answer to as well :) Variations of that were what I tried first, but I couldn't get them to work, and TableGen's error messages were not helpful in letting me figure out a way to get it working. I decided to just stick with what was already being done.

Contributor:

The PatFrag is what owns the predicate; it might be possible to define something using srcvalue as a way of getting a catch-all matcher with a predicate

hvdijk (Contributor, Author):

I had a look and couldn't find any custom predicate logic other than for instructions, which means we cannot do it on the operand level as the operand need not be an instruction. It's possible I missed something simple, but I think I would prefer to just leave it like this over making big changes to make TableGen support operand-level predicates.

```tablegen
class
FPMinCanonMaxPat<Instruction minmaxInst, ValueType vt, SDPatternOperator min_or_max,
                 SDPatternOperator max_or_min_oneuse> : GCNPat <
  (min_or_max (is_canonicalized_1<fcanonicalize>
```
Contributor:
I wonder how many more fp patterns there are that might need to skip over an fcanonicalize in the middle.

Contributor:

It's really arbitrary operations. fmin/fmax are just of particular interest because we're stuck inserting canonicalizes in the lowering for them. The more I think about it, the more I think we'd be better off making this a separate transform (but just moving this to selection patterns is a net improvement for now, I think).

hvdijk (Contributor, Author):

There may well be more, I just went for the ones that I came across from the test results. But if there are more, it doesn't result in wrong code, only suboptimal code, which I think is acceptable if the overall code still looks better?

Contributor:
I'd only consider the min/max legalize case important. There's only a handful of canonicalizes in the math library, they aren't common (and I think we could avoid most of those too if we rewrote the code slightly)

Comment on lines +12598 to +12692
// TODO: This is incorrect as it loses track of the operand's type. We may
// end up effectively bitcasting from f32 to v2f16 or vice versa, and the
// same bits that are canonicalized in one type need not be in the other.
Contributor:

should give up on different element size bitcasts? (separate patch though)

hvdijk (Contributor, Author):

I think rather than give up, we should add a parameter specifying the type so that we would be able to handle things like v2f16 -> i32 -> truncate to i16 -> f16. But yes, separate patch, this is a pre-existing issue.

@@ -12603,20 +12627,21 @@ bool SITargetLowering::isCanonicalized(SelectionDAG &DAG, SDValue Op,
case Intrinsic::amdgcn_trig_preop:
case Intrinsic::amdgcn_log:
case Intrinsic::amdgcn_exp2:
case Intrinsic::amdgcn_sqrt:
Contributor:

unrelated bonus change?

hvdijk (Contributor, Author) commented Mar 6, 2024:

Not unrelated: this is one of the things that came up in LLVM's test suite, where previously we were able to optimize away an fcanonicalize because we saw fsqrt, but now we only check later and see amdgcn_sqrt.

We were relying on roundings to implicitly canonicalize, which is
generally safe, except that roundings may be optimized away, and more
generally, other operations may be optimized away as well.

To solve this, as suggested by @arsenm, keep fcanonicalize nodes around
for longer. Some tests revealed cases where we no longer figured out
that previous operations already produced canonicalized results but we
could easily change that; this commit includes those changes. Other
tests revealed cases where we no longer figure out that previous
operations already produced canonicalized results but larger changes are
needed to detect that; this commit disables those tests or updates the
expected results.

Fixes llvm#82937.
@hvdijk hvdijk force-pushed the amdgpu-fcanonicalize-fp_round branch from 998a431 to ed0b46d Compare March 6, 2024 19:40
@arsenm (Contributor) left a comment:
I think the tablegen can be simplified today, but it shouldn't block this patch

@hvdijk hvdijk merged commit ceb744e into llvm:main Mar 13, 2024
@hvdijk hvdijk deleted the amdgpu-fcanonicalize-fp_round branch March 13, 2024 12:08
@hvdijk (Contributor, Author) commented Mar 13, 2024

Thanks! If I happen to figure out how to simplify the TableGen I can create a followup PR; otherwise I look forward to seeing how someone else manages to do it :)

@arsenm (Contributor) commented Mar 14, 2024

> Thanks! If I happen to end up figuring out how to simplify the TableGen I can create a followup PR, otherwise I look forward to seeing how someone else manages to do :)

I managed to get it to work with this:

```tablegen
def is_canonicalized : PatLeaf<(fAny srcvalue:$src),
  [{
    const SITargetLowering &Lowering =
        *static_cast<const SITargetLowering *>(getTargetLowering());
    return Lowering.isCanonicalized(*CurDAG, SDValue(N, 0));
  }]
>;

def : GCNPat<
  (fcanonicalize (f32 is_canonicalized:$src)),
  (COPY f32:$src)
>;
```

@hvdijk (Contributor, Author) commented Mar 14, 2024

Oh of course! I should have seen that. Nice work.

@jayfoad (Contributor) commented Mar 14, 2024

> Thanks! If I happen to end up figuring out how to simplify the TableGen I can create a followup PR, otherwise I look forward to seeing how someone else manages to do :)
>
> I managed to get it to work with this: [the `is_canonicalized` PatLeaf snippet above]

Awesome! I wonder if we could do something similar for is_divergent and is_uniform, so we don't need separate DivergentUnaryFrag/DivergentBinFrag/etc.

@arsenm (Contributor) commented Mar 15, 2024

> Awesome! I wonder if we could do something similar for is_divergent and is_uniform, so we don't need separate DivergentUnaryFrag/DivergentBinFrag/etc.

Probably not. I'm running into a few issues trying to clean up the other uses. First, the predicate seems not to actually get applied when it is used as the source of a complex pattern. Also, we need at least one more variant with an operand for the non-leaf cases.

Labels: backend:AMDGPU, llvm:SelectionDAG

Successfully merging this pull request may close these issues.

AMDGPU: Wrong code for fcanonicalize
4 participants