Skip to content

[SDISel][Combine] Constant fold FP16_TO_FP #94790

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 3 commits into from
Jun 8, 2024

Conversation

qcolombet
Copy link
Collaborator

In some case, constant can survive early constant folding optimization because they are hidden behind several layers of type changes.

E.g., consider the following sequence (extracted from the arm test that this commit changes):

    t2: v1f16 = BUILD_VECTOR ConstantFP:f16<APFloat(0)>
    t4: v1f16 = insert_vector_elt t2, ConstantFP:f16<APFloat(0)>, Constant:i32<0>
  t5: f16 = bitcast t4
t6: f32 = fp_extend t5

Because the constant (APFloat(0)) is hidden behind a <1 x ty> type, all the constant folding that normally happen for scalar nodes when using SelectionDAG::getNode are blocked.

As a result the constant manages to survive as an actual conversion instruction down to the select phase:

t11: f32 = fp16_to_fp Constant:i32<0>

With the change in this patch, we try to do constant folding one more time during dag combine, which in the motivating example result in the much better sequence:

t7: ch = CopyToReg t0, Register:f32 %0, ConstantFP:f32<0.000000e+00>

Note: I'm sure we have this problem in a lot of other places. Generally speaking I believe SDISel is not that good with <1 x ty> compared to pure scalar. However, I only changed what I could easily test.

In some case, constant can survive early constant folding optimization
because they are hidden behind several layers of type changes.

E.g., consider the following sequence (extracted from the arm test
that this commit changes):
```
    t2: v1f16 = BUILD_VECTOR ConstantFP:f16<APFloat(0)>
    t4: v1f16 = insert_vector_elt t2, ConstantFP:f16<APFloat(0)>, Constant:i32<0>
  t5: f16 = bitcast t4
t6: f32 = fp_extend t5
```

Because the constant (APFloat(0)) is hidden behind a <1 x ty> type, all the
constant folding that normally happen for scalar nodes when using
`SelectionDAG::getNode` are blocked.

As a result the constant manages to survive as an actual conversion
instruction down to the select phase:
```
t11: f32 = fp16_to_fp Constant:i32<0>
```

With the change in this patch, we try to do constant folding one more
time during dag combine, which in the motivating example result in
the much better sequence:
```
t7: ch = CopyToReg t0, Register:f32 %0, ConstantFP:f32<0.000000e+00>
```

Note: I'm sure we have this problem in a lot of other places. Generally
speaking I believe SDISel is not that good with <1 x ty> compared to pure
scalar. However, I only changed what I could easily test.
@llvmbot
Copy link
Member

llvmbot commented Jun 7, 2024

@llvm/pr-subscribers-backend-arm

@llvm/pr-subscribers-backend-amdgpu

Author: Quentin Colombet (qcolombet)

Changes

In some case, constant can survive early constant folding optimization because they are hidden behind several layers of type changes.

E.g., consider the following sequence (extracted from the arm test that this commit changes):

    t2: v1f16 = BUILD_VECTOR ConstantFP:f16&lt;APFloat(0)&gt;
    t4: v1f16 = insert_vector_elt t2, ConstantFP:f16&lt;APFloat(0)&gt;, Constant:i32&lt;0&gt;
  t5: f16 = bitcast t4
t6: f32 = fp_extend t5

Because the constant (APFloat(0)) is hidden behind a <1 x ty> type, all the constant folding that normally happen for scalar nodes when using SelectionDAG::getNode are blocked.

As a result the constant manages to survive as an actual conversion instruction down to the select phase:

t11: f32 = fp16_to_fp Constant:i32&lt;0&gt;

With the change in this patch, we try to do constant folding one more time during dag combine, which in the motivating example result in the much better sequence:

t7: ch = CopyToReg t0, Register:f32 %0, ConstantFP:f32&lt;0.000000e+00&gt;

Note: I'm sure we have this problem in a lot of other places. Generally speaking I believe SDISel is not that good with <1 x ty> compared to pure scalar. However, I only changed what I could easily test.


Full diff: https://github.com/llvm/llvm-project/pull/94790.diff

4 Files Affected:

  • (modified) llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp (+6-1)
  • (modified) llvm/test/CodeGen/AMDGPU/clamp-modifier.ll (+1-2)
  • (modified) llvm/test/CodeGen/AMDGPU/select-phi-s16-fp.ll (+1-2)
  • (modified) llvm/test/CodeGen/ARM/arm-half-promote.ll (+1-3)
diff --git a/llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp b/llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
index 2d5968bf5c2ea..ed6eefbaf89c5 100644
--- a/llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
+++ b/llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
@@ -26586,7 +26586,12 @@ SDValue DAGCombiner::visitFP16_TO_FP(SDNode *N) {
     }
   }
 
-  return SDValue();
+  // Sometimes constants manage to survive very late in the pipeline, e.g.,
+  // because they are wrapped inside the <1 x f16> type. Try one last time to
+  // get rid of them.
+  SDValue Folded = DAG.FoldConstantArithmetic(
+      N->getOpcode(), SDLoc(N), N->getValueType(0), ArrayRef<SDValue>(&N0, 1));
+  return Folded;
 }
 
 SDValue DAGCombiner::visitFP_TO_BF16(SDNode *N) {
diff --git a/llvm/test/CodeGen/AMDGPU/clamp-modifier.ll b/llvm/test/CodeGen/AMDGPU/clamp-modifier.ll
index 0a0179e866cd3..84bd9b6f6c5d4 100644
--- a/llvm/test/CodeGen/AMDGPU/clamp-modifier.ll
+++ b/llvm/test/CodeGen/AMDGPU/clamp-modifier.ll
@@ -1489,9 +1489,8 @@ define amdgpu_kernel void @v_no_clamp_add_src_v2f16_f16_src(ptr addrspace(1) %ou
 ; SI-NEXT:    s_waitcnt lgkmcnt(0)
 ; SI-NEXT:    s_mov_b64 s[4:5], s[2:3]
 ; SI-NEXT:    buffer_load_ushort v1, v[1:2], s[4:7], 0 addr64
-; SI-NEXT:    v_cvt_f32_f16_e64 v3, s6 clamp
+; SI-NEXT:    v_cvt_f16_f32_e32 v3, 0
 ; SI-NEXT:    s_mov_b64 s[2:3], s[6:7]
-; SI-NEXT:    v_cvt_f16_f32_e32 v3, v3
 ; SI-NEXT:    s_waitcnt vmcnt(0)
 ; SI-NEXT:    v_cvt_f32_f16_e32 v1, v1
 ; SI-NEXT:    v_add_f32_e32 v1, 1.0, v1
diff --git a/llvm/test/CodeGen/AMDGPU/select-phi-s16-fp.ll b/llvm/test/CodeGen/AMDGPU/select-phi-s16-fp.ll
index 0d6e987165d87..ba04cdb795ce3 100644
--- a/llvm/test/CodeGen/AMDGPU/select-phi-s16-fp.ll
+++ b/llvm/test/CodeGen/AMDGPU/select-phi-s16-fp.ll
@@ -14,9 +14,8 @@ define void @phi_vec1half_to_f32_with_const_folding(ptr addrspace(1) %dst) #0 {
 ; CHECK:       ; %bb.0: ; %entry
 ; CHECK-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
 ; CHECK-NEXT:    s_mov_b32 s4, 0
-; CHECK-NEXT:    v_cvt_f32_f16_e64 v2, s4
 ; CHECK-NEXT:  ; %bb.1: ; %bb
-; CHECK-NEXT:    v_cvt_f16_f32_e64 v2, v2
+; CHECK-NEXT:    v_cvt_f16_f32_e64 v2, s4
 ; CHECK-NEXT:    s_mov_b32 s7, 0xf000
 ; CHECK-NEXT:    s_mov_b32 s6, 0
 ; CHECK-NEXT:    s_mov_b32 s4, s6
diff --git a/llvm/test/CodeGen/ARM/arm-half-promote.ll b/llvm/test/CodeGen/ARM/arm-half-promote.ll
index a5fafd4238616..e1ab75b2ac7f1 100644
--- a/llvm/test/CodeGen/ARM/arm-half-promote.ll
+++ b/llvm/test/CodeGen/ARM/arm-half-promote.ll
@@ -116,9 +116,7 @@ define fastcc { <8 x half>, <8 x half> } @f3() {
 
 define void @extract_insert(ptr %dst) optnone noinline {
 ; CHECK-LABEL: extract_insert:
-; CHECK: movs r1, #0
-; CHECK: vmov s0, r1
-; CHECK: vcvtb.f32.f16 s0, s0
+; CHECK: vmov.i32 d0, #0x0
 ; CHECK: vcvtb.f16.f32 s0, s0
 ; CHECK: vmov r1, s0
 ; CHECK: strh r1, [r0]

@llvmbot
Copy link
Member

llvmbot commented Jun 7, 2024

@llvm/pr-subscribers-llvm-selectiondag

Author: Quentin Colombet (qcolombet)

Changes

In some case, constant can survive early constant folding optimization because they are hidden behind several layers of type changes.

E.g., consider the following sequence (extracted from the arm test that this commit changes):

    t2: v1f16 = BUILD_VECTOR ConstantFP:f16&lt;APFloat(0)&gt;
    t4: v1f16 = insert_vector_elt t2, ConstantFP:f16&lt;APFloat(0)&gt;, Constant:i32&lt;0&gt;
  t5: f16 = bitcast t4
t6: f32 = fp_extend t5

Because the constant (APFloat(0)) is hidden behind a <1 x ty> type, all the constant folding that normally happen for scalar nodes when using SelectionDAG::getNode are blocked.

As a result the constant manages to survive as an actual conversion instruction down to the select phase:

t11: f32 = fp16_to_fp Constant:i32&lt;0&gt;

With the change in this patch, we try to do constant folding one more time during dag combine, which in the motivating example result in the much better sequence:

t7: ch = CopyToReg t0, Register:f32 %0, ConstantFP:f32&lt;0.000000e+00&gt;

Note: I'm sure we have this problem in a lot of other places. Generally speaking I believe SDISel is not that good with <1 x ty> compared to pure scalar. However, I only changed what I could easily test.


Full diff: https://github.com/llvm/llvm-project/pull/94790.diff

4 Files Affected:

  • (modified) llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp (+6-1)
  • (modified) llvm/test/CodeGen/AMDGPU/clamp-modifier.ll (+1-2)
  • (modified) llvm/test/CodeGen/AMDGPU/select-phi-s16-fp.ll (+1-2)
  • (modified) llvm/test/CodeGen/ARM/arm-half-promote.ll (+1-3)
diff --git a/llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp b/llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
index 2d5968bf5c2ea..ed6eefbaf89c5 100644
--- a/llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
+++ b/llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
@@ -26586,7 +26586,12 @@ SDValue DAGCombiner::visitFP16_TO_FP(SDNode *N) {
     }
   }
 
-  return SDValue();
+  // Sometimes constants manage to survive very late in the pipeline, e.g.,
+  // because they are wrapped inside the <1 x f16> type. Try one last time to
+  // get rid of them.
+  SDValue Folded = DAG.FoldConstantArithmetic(
+      N->getOpcode(), SDLoc(N), N->getValueType(0), ArrayRef<SDValue>(&N0, 1));
+  return Folded;
 }
 
 SDValue DAGCombiner::visitFP_TO_BF16(SDNode *N) {
diff --git a/llvm/test/CodeGen/AMDGPU/clamp-modifier.ll b/llvm/test/CodeGen/AMDGPU/clamp-modifier.ll
index 0a0179e866cd3..84bd9b6f6c5d4 100644
--- a/llvm/test/CodeGen/AMDGPU/clamp-modifier.ll
+++ b/llvm/test/CodeGen/AMDGPU/clamp-modifier.ll
@@ -1489,9 +1489,8 @@ define amdgpu_kernel void @v_no_clamp_add_src_v2f16_f16_src(ptr addrspace(1) %ou
 ; SI-NEXT:    s_waitcnt lgkmcnt(0)
 ; SI-NEXT:    s_mov_b64 s[4:5], s[2:3]
 ; SI-NEXT:    buffer_load_ushort v1, v[1:2], s[4:7], 0 addr64
-; SI-NEXT:    v_cvt_f32_f16_e64 v3, s6 clamp
+; SI-NEXT:    v_cvt_f16_f32_e32 v3, 0
 ; SI-NEXT:    s_mov_b64 s[2:3], s[6:7]
-; SI-NEXT:    v_cvt_f16_f32_e32 v3, v3
 ; SI-NEXT:    s_waitcnt vmcnt(0)
 ; SI-NEXT:    v_cvt_f32_f16_e32 v1, v1
 ; SI-NEXT:    v_add_f32_e32 v1, 1.0, v1
diff --git a/llvm/test/CodeGen/AMDGPU/select-phi-s16-fp.ll b/llvm/test/CodeGen/AMDGPU/select-phi-s16-fp.ll
index 0d6e987165d87..ba04cdb795ce3 100644
--- a/llvm/test/CodeGen/AMDGPU/select-phi-s16-fp.ll
+++ b/llvm/test/CodeGen/AMDGPU/select-phi-s16-fp.ll
@@ -14,9 +14,8 @@ define void @phi_vec1half_to_f32_with_const_folding(ptr addrspace(1) %dst) #0 {
 ; CHECK:       ; %bb.0: ; %entry
 ; CHECK-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
 ; CHECK-NEXT:    s_mov_b32 s4, 0
-; CHECK-NEXT:    v_cvt_f32_f16_e64 v2, s4
 ; CHECK-NEXT:  ; %bb.1: ; %bb
-; CHECK-NEXT:    v_cvt_f16_f32_e64 v2, v2
+; CHECK-NEXT:    v_cvt_f16_f32_e64 v2, s4
 ; CHECK-NEXT:    s_mov_b32 s7, 0xf000
 ; CHECK-NEXT:    s_mov_b32 s6, 0
 ; CHECK-NEXT:    s_mov_b32 s4, s6
diff --git a/llvm/test/CodeGen/ARM/arm-half-promote.ll b/llvm/test/CodeGen/ARM/arm-half-promote.ll
index a5fafd4238616..e1ab75b2ac7f1 100644
--- a/llvm/test/CodeGen/ARM/arm-half-promote.ll
+++ b/llvm/test/CodeGen/ARM/arm-half-promote.ll
@@ -116,9 +116,7 @@ define fastcc { <8 x half>, <8 x half> } @f3() {
 
 define void @extract_insert(ptr %dst) optnone noinline {
 ; CHECK-LABEL: extract_insert:
-; CHECK: movs r1, #0
-; CHECK: vmov s0, r1
-; CHECK: vcvtb.f32.f16 s0, s0
+; CHECK: vmov.i32 d0, #0x0
 ; CHECK: vcvtb.f16.f32 s0, s0
 ; CHECK: vmov r1, s0
 ; CHECK: strh r1, [r0]

@qcolombet
Copy link
Collaborator Author

@arsenm @phoebewang This is the fix for the "perf" regression exposed by the correctness fix from #94591.

Copy link

github-actions bot commented Jun 7, 2024

✅ With the latest revision this PR passed the C/C++ code formatter.

@qcolombet qcolombet merged commit 25506f4 into llvm:main Jun 8, 2024
7 checks passed
@qcolombet qcolombet deleted the missing_folding branch June 8, 2024 09:31
nekoshirro pushed a commit to nekoshirro/Alchemist-LLVM that referenced this pull request Jun 9, 2024
In some case, constant can survive early constant folding optimization
because they are hidden behind several layers of type changes.

E.g., consider the following sequence (extracted from the arm test that
this commit changes):
```
    t2: v1f16 = BUILD_VECTOR ConstantFP:f16<APFloat(0)>
    t4: v1f16 = insert_vector_elt t2, ConstantFP:f16<APFloat(0)>, Constant:i32<0>
  t5: f16 = bitcast t4
t6: f32 = fp_extend t5
```

Because the constant (APFloat(0)) is hidden behind a <1 x ty> type, all
the constant folding that normally happen for scalar nodes when using
`SelectionDAG::getNode` are blocked.

As a result the constant manages to survive as an actual conversion
instruction down to the select phase:
```
t11: f32 = fp16_to_fp Constant:i32<0>
```

With the change in this patch, we try to do constant folding one more
time during dag combine, which in the motivating example result in the
much better sequence:
```
t7: ch = CopyToReg t0, Register:f32 %0, ConstantFP:f32<0.000000e+00>
```

Note: I'm sure we have this problem in a lot of other places. Generally
speaking I believe SDISel is not that good with <1 x ty> compared to
pure scalar. However, I only changed what I could easily test.

Signed-off-by: Hafidz Muzakky <ais.muzakky@gmail.com>
@HerrCai0907 HerrCai0907 mentioned this pull request Jun 13, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants