[RISCV] Avoid vl toggles when lowering vector_splice/experimental_vp_splice #146746
Conversation
…splice

When vectorizing a loop with a fixed-order recurrence we use a splice, which gets lowered to a vslidedown and vslideup pair. However with the way we lower it today we end up with extra vl toggles in the loop, especially with EVL tail folding, e.g.:

.LBB0_5: # %vector.body
         # =>This Inner Loop Header: Depth=1
    sub a5, a2, a3
    sh2add a6, a3, a1
    zext.w a7, a4
    vsetvli a4, a5, e8, mf2, ta, ma
    vle32.v v10, (a6)
    addi a7, a7, -1
    vsetivli zero, 1, e32, m2, ta, ma
    vslidedown.vx v8, v8, a7
    sh2add a6, a3, a0
    vsetvli zero, a5, e32, m2, ta, ma
    vslideup.vi v8, v10, 1
    vadd.vv v8, v10, v8
    add a3, a3, a4
    vse32.v v8, (a6)
    vmv2r.v v8, v10
    bne a3, a2, .LBB0_5

Because the vslideup overwrites all but UpOffset elements from the vslidedown, we currently set the vslidedown's AVL to said offset. But in the vslideup we use either VLMAX or the EVL, which causes a toggle. This patch increases the AVL of the vslidedown so it matches the vslideup, even if the extra elements are overwritten, to avoid the toggle. This is operating under the assumption that a vl toggle is more expensive than performing the vslidedown at a higher vl on the average microarchitecture. If we wanted to aggressively optimise for vl at the expense of introducing more toggles, we could probably look at doing this in RISCVVLOptimizer.
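To see why the higher AVL on the vslidedown is harmless, here is a standalone model of the element-level semantics (plain C++, not LLVM code; the helper names and concrete values are purely illustrative): every element the wider vslidedown produces at index >= UpOffset is overwritten by the vslideup anyway.

```cpp
// Standalone sketch (not LLVM code): models vslidedown/vslideup on plain
// arrays to show that raising the vslidedown's AVL from UpOffset to the
// vslideup's AVL does not change the splice result.
#include <cassert>
#include <cstdio>
#include <vector>

using Vec = std::vector<int>;

// vslidedown.vx with undef passthru: vd[i] = vs[i+offset] for i < avl
// (reads past the source are 0); elements at i >= avl are left undisturbed,
// modeled here as an arbitrary junk value.
static Vec slidedown(const Vec &vs, unsigned offset, unsigned avl, int junk) {
  Vec vd(vs.size(), junk);
  for (unsigned i = 0; i < avl; ++i)
    vd[i] = (i + offset < vs.size()) ? vs[i + offset] : 0;
  return vd;
}

// vslideup.vx (tail agnostic): vd[i] = vs[i-offset] for offset <= i < avl;
// elements below offset keep their old value.
static Vec slideup(Vec vd, const Vec &vs, unsigned offset, unsigned avl) {
  for (unsigned i = offset; i < avl; ++i)
    vd[i] = vs[i - offset];
  return vd;
}

int main() {
  const unsigned VL = 4, UpOffset = 1, DownOffset = VL - UpOffset;
  Vec prev = {10, 11, 12, 13}, cur = {20, 21, 22, 23};

  // Current lowering: vslidedown with AVL = UpOffset, vslideup with AVL = VL.
  Vec a = slideup(slidedown(prev, DownOffset, /*avl=*/UpOffset, /*junk=*/-1),
                  cur, UpOffset, VL);
  // Proposed lowering: vslidedown with the same AVL as the vslideup.
  Vec b = slideup(slidedown(prev, DownOffset, /*avl=*/VL, /*junk=*/-1),
                  cur, UpOffset, VL);

  assert(a == b); // both are {13, 20, 21, 22}
  for (int x : a)
    printf("%d ", x);
  printf("\n");
}
```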
@llvm/pr-subscribers-backend-risc-v

Author: Luke Lau (lukel97)

Patch is 133.19 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/146746.diff

6 Files Affected:
diff --git a/llvm/lib/Target/RISCV/RISCVISelLowering.cpp b/llvm/lib/Target/RISCV/RISCVISelLowering.cpp
index 326dd7149ef96..989a2cd237262 100644
--- a/llvm/lib/Target/RISCV/RISCVISelLowering.cpp
+++ b/llvm/lib/Target/RISCV/RISCVISelLowering.cpp
@@ -12331,7 +12331,7 @@ SDValue RISCVTargetLowering::lowerVECTOR_SPLICE(SDValue Op,
SDValue SlideDown =
getVSlidedown(DAG, Subtarget, DL, VecVT, DAG.getUNDEF(VecVT), V1,
- DownOffset, TrueMask, UpOffset);
+ DownOffset, TrueMask, DAG.getRegister(RISCV::X0, XLenVT));
return getVSlideup(DAG, Subtarget, DL, VecVT, SlideDown, V2, UpOffset,
TrueMask, DAG.getRegister(RISCV::X0, XLenVT),
RISCVVType::TAIL_AGNOSTIC);
@@ -13354,8 +13354,7 @@ RISCVTargetLowering::lowerVPSpliceExperimental(SDValue Op,
if (ImmValue != 0)
Op1 = getVSlidedown(DAG, Subtarget, DL, ContainerVT,
- DAG.getUNDEF(ContainerVT), Op1, DownOffset, Mask,
- UpOffset);
+ DAG.getUNDEF(ContainerVT), Op1, DownOffset, Mask, EVL2);
SDValue Result = getVSlideup(DAG, Subtarget, DL, ContainerVT, Op1, Op2,
UpOffset, Mask, EVL2, RISCVVType::TAIL_AGNOSTIC);
diff --git a/llvm/test/CodeGen/RISCV/rvv/fixed-vectors-vp-splice.ll b/llvm/test/CodeGen/RISCV/rvv/fixed-vectors-vp-splice.ll
index 8160e62a43106..79fbdb007a70c 100644
--- a/llvm/test/CodeGen/RISCV/rvv/fixed-vectors-vp-splice.ll
+++ b/llvm/test/CodeGen/RISCV/rvv/fixed-vectors-vp-splice.ll
@@ -7,10 +7,9 @@
define <2 x i64> @test_vp_splice_v2i64(<2 x i64> %va, <2 x i64> %vb, i32 zeroext %evla, i32 zeroext %evlb) {
; CHECK-LABEL: test_vp_splice_v2i64:
; CHECK: # %bb.0:
-; CHECK-NEXT: addi a0, a0, -5
-; CHECK-NEXT: vsetvli zero, a0, e64, m1, ta, ma
-; CHECK-NEXT: vslidedown.vi v8, v8, 5
; CHECK-NEXT: vsetvli zero, a1, e64, m1, ta, ma
+; CHECK-NEXT: vslidedown.vi v8, v8, 5
+; CHECK-NEXT: addi a0, a0, -5
; CHECK-NEXT: vslideup.vx v8, v9, a0
; CHECK-NEXT: ret
@@ -22,9 +21,8 @@ define <2 x i64> @test_vp_splice_v2i64_negative_offset(<2 x i64> %va, <2 x i64>
; CHECK-LABEL: test_vp_splice_v2i64_negative_offset:
; CHECK: # %bb.0:
; CHECK-NEXT: addi a0, a0, -5
-; CHECK-NEXT: vsetivli zero, 5, e64, m1, ta, ma
-; CHECK-NEXT: vslidedown.vx v8, v8, a0
; CHECK-NEXT: vsetvli zero, a1, e64, m1, ta, ma
+; CHECK-NEXT: vslidedown.vx v8, v8, a0
; CHECK-NEXT: vslideup.vi v8, v9, 5
; CHECK-NEXT: ret
@@ -46,10 +44,10 @@ define <2 x i64> @test_vp_splice_v2i64_zero_offset(<2 x i64> %va, <2 x i64> %vb,
define <2 x i64> @test_vp_splice_v2i64_masked(<2 x i64> %va, <2 x i64> %vb, <2 x i1> %mask, i32 zeroext %evla, i32 zeroext %evlb) {
; CHECK-LABEL: test_vp_splice_v2i64_masked:
; CHECK: # %bb.0:
-; CHECK-NEXT: addi a0, a0, -5
-; CHECK-NEXT: vsetvli zero, a0, e64, m1, ta, ma
+; CHECK-NEXT: vsetvli zero, a1, e64, m1, ta, ma
; CHECK-NEXT: vslidedown.vi v8, v8, 5, v0.t
-; CHECK-NEXT: vsetvli zero, a1, e64, m1, ta, mu
+; CHECK-NEXT: addi a0, a0, -5
+; CHECK-NEXT: vsetvli zero, zero, e64, m1, ta, mu
; CHECK-NEXT: vslideup.vx v8, v9, a0, v0.t
; CHECK-NEXT: ret
%v = call <2 x i64> @llvm.experimental.vp.splice.v2i64(<2 x i64> %va, <2 x i64> %vb, i32 5, <2 x i1> %mask, i32 %evla, i32 %evlb)
@@ -59,10 +57,9 @@ define <2 x i64> @test_vp_splice_v2i64_masked(<2 x i64> %va, <2 x i64> %vb, <2 x
define <4 x i32> @test_vp_splice_v4i32(<4 x i32> %va, <4 x i32> %vb, i32 zeroext %evla, i32 zeroext %evlb) {
; CHECK-LABEL: test_vp_splice_v4i32:
; CHECK: # %bb.0:
-; CHECK-NEXT: addi a0, a0, -5
-; CHECK-NEXT: vsetvli zero, a0, e32, m1, ta, ma
-; CHECK-NEXT: vslidedown.vi v8, v8, 5
; CHECK-NEXT: vsetvli zero, a1, e32, m1, ta, ma
+; CHECK-NEXT: vslidedown.vi v8, v8, 5
+; CHECK-NEXT: addi a0, a0, -5
; CHECK-NEXT: vslideup.vx v8, v9, a0
; CHECK-NEXT: ret
@@ -74,9 +71,8 @@ define <4 x i32> @test_vp_splice_v4i32_negative_offset(<4 x i32> %va, <4 x i32>
; CHECK-LABEL: test_vp_splice_v4i32_negative_offset:
; CHECK: # %bb.0:
; CHECK-NEXT: addi a0, a0, -5
-; CHECK-NEXT: vsetivli zero, 5, e32, m1, ta, ma
-; CHECK-NEXT: vslidedown.vx v8, v8, a0
; CHECK-NEXT: vsetvli zero, a1, e32, m1, ta, ma
+; CHECK-NEXT: vslidedown.vx v8, v8, a0
; CHECK-NEXT: vslideup.vi v8, v9, 5
; CHECK-NEXT: ret
@@ -87,10 +83,10 @@ define <4 x i32> @test_vp_splice_v4i32_negative_offset(<4 x i32> %va, <4 x i32>
define <4 x i32> @test_vp_splice_v4i32_masked(<4 x i32> %va, <4 x i32> %vb, <4 x i1> %mask, i32 zeroext %evla, i32 zeroext %evlb) {
; CHECK-LABEL: test_vp_splice_v4i32_masked:
; CHECK: # %bb.0:
-; CHECK-NEXT: addi a0, a0, -5
-; CHECK-NEXT: vsetvli zero, a0, e32, m1, ta, ma
+; CHECK-NEXT: vsetvli zero, a1, e32, m1, ta, ma
; CHECK-NEXT: vslidedown.vi v8, v8, 5, v0.t
-; CHECK-NEXT: vsetvli zero, a1, e32, m1, ta, mu
+; CHECK-NEXT: addi a0, a0, -5
+; CHECK-NEXT: vsetvli zero, zero, e32, m1, ta, mu
; CHECK-NEXT: vslideup.vx v8, v9, a0, v0.t
; CHECK-NEXT: ret
%v = call <4 x i32> @llvm.experimental.vp.splice.v4i32(<4 x i32> %va, <4 x i32> %vb, i32 5, <4 x i1> %mask, i32 %evla, i32 %evlb)
@@ -100,10 +96,9 @@ define <4 x i32> @test_vp_splice_v4i32_masked(<4 x i32> %va, <4 x i32> %vb, <4 x
define <8 x i16> @test_vp_splice_v8i16(<8 x i16> %va, <8 x i16> %vb, i32 zeroext %evla, i32 zeroext %evlb) {
; CHECK-LABEL: test_vp_splice_v8i16:
; CHECK: # %bb.0:
-; CHECK-NEXT: addi a0, a0, -5
-; CHECK-NEXT: vsetvli zero, a0, e16, m1, ta, ma
-; CHECK-NEXT: vslidedown.vi v8, v8, 5
; CHECK-NEXT: vsetvli zero, a1, e16, m1, ta, ma
+; CHECK-NEXT: vslidedown.vi v8, v8, 5
+; CHECK-NEXT: addi a0, a0, -5
; CHECK-NEXT: vslideup.vx v8, v9, a0
; CHECK-NEXT: ret
@@ -115,9 +110,8 @@ define <8 x i16> @test_vp_splice_v8i16_negative_offset(<8 x i16> %va, <8 x i16>
; CHECK-LABEL: test_vp_splice_v8i16_negative_offset:
; CHECK: # %bb.0:
; CHECK-NEXT: addi a0, a0, -5
-; CHECK-NEXT: vsetivli zero, 5, e16, m1, ta, ma
-; CHECK-NEXT: vslidedown.vx v8, v8, a0
; CHECK-NEXT: vsetvli zero, a1, e16, m1, ta, ma
+; CHECK-NEXT: vslidedown.vx v8, v8, a0
; CHECK-NEXT: vslideup.vi v8, v9, 5
; CHECK-NEXT: ret
@@ -128,10 +122,10 @@ define <8 x i16> @test_vp_splice_v8i16_negative_offset(<8 x i16> %va, <8 x i16>
define <8 x i16> @test_vp_splice_v8i16_masked(<8 x i16> %va, <8 x i16> %vb, <8 x i1> %mask, i32 zeroext %evla, i32 zeroext %evlb) {
; CHECK-LABEL: test_vp_splice_v8i16_masked:
; CHECK: # %bb.0:
-; CHECK-NEXT: addi a0, a0, -5
-; CHECK-NEXT: vsetvli zero, a0, e16, m1, ta, ma
+; CHECK-NEXT: vsetvli zero, a1, e16, m1, ta, ma
; CHECK-NEXT: vslidedown.vi v8, v8, 5, v0.t
-; CHECK-NEXT: vsetvli zero, a1, e16, m1, ta, mu
+; CHECK-NEXT: addi a0, a0, -5
+; CHECK-NEXT: vsetvli zero, zero, e16, m1, ta, mu
; CHECK-NEXT: vslideup.vx v8, v9, a0, v0.t
; CHECK-NEXT: ret
%v = call <8 x i16> @llvm.experimental.vp.splice.v8i16(<8 x i16> %va, <8 x i16> %vb, i32 5, <8 x i1> %mask, i32 %evla, i32 %evlb)
@@ -141,10 +135,9 @@ define <8 x i16> @test_vp_splice_v8i16_masked(<8 x i16> %va, <8 x i16> %vb, <8 x
define <16 x i8> @test_vp_splice_v16i8(<16 x i8> %va, <16 x i8> %vb, i32 zeroext %evla, i32 zeroext %evlb) {
; CHECK-LABEL: test_vp_splice_v16i8:
; CHECK: # %bb.0:
-; CHECK-NEXT: addi a0, a0, -5
-; CHECK-NEXT: vsetvli zero, a0, e8, m1, ta, ma
-; CHECK-NEXT: vslidedown.vi v8, v8, 5
; CHECK-NEXT: vsetvli zero, a1, e8, m1, ta, ma
+; CHECK-NEXT: vslidedown.vi v8, v8, 5
+; CHECK-NEXT: addi a0, a0, -5
; CHECK-NEXT: vslideup.vx v8, v9, a0
; CHECK-NEXT: ret
@@ -156,9 +149,8 @@ define <16 x i8> @test_vp_splice_v16i8_negative_offset(<16 x i8> %va, <16 x i8>
; CHECK-LABEL: test_vp_splice_v16i8_negative_offset:
; CHECK: # %bb.0:
; CHECK-NEXT: addi a0, a0, -5
-; CHECK-NEXT: vsetivli zero, 5, e8, m1, ta, ma
-; CHECK-NEXT: vslidedown.vx v8, v8, a0
; CHECK-NEXT: vsetvli zero, a1, e8, m1, ta, ma
+; CHECK-NEXT: vslidedown.vx v8, v8, a0
; CHECK-NEXT: vslideup.vi v8, v9, 5
; CHECK-NEXT: ret
@@ -169,10 +161,10 @@ define <16 x i8> @test_vp_splice_v16i8_negative_offset(<16 x i8> %va, <16 x i8>
define <16 x i8> @test_vp_splice_v16i8_masked(<16 x i8> %va, <16 x i8> %vb, <16 x i1> %mask, i32 zeroext %evla, i32 zeroext %evlb) {
; CHECK-LABEL: test_vp_splice_v16i8_masked:
; CHECK: # %bb.0:
-; CHECK-NEXT: addi a0, a0, -5
-; CHECK-NEXT: vsetvli zero, a0, e8, m1, ta, ma
+; CHECK-NEXT: vsetvli zero, a1, e8, m1, ta, ma
; CHECK-NEXT: vslidedown.vi v8, v8, 5, v0.t
-; CHECK-NEXT: vsetvli zero, a1, e8, m1, ta, mu
+; CHECK-NEXT: addi a0, a0, -5
+; CHECK-NEXT: vsetvli zero, zero, e8, m1, ta, mu
; CHECK-NEXT: vslideup.vx v8, v9, a0, v0.t
; CHECK-NEXT: ret
%v = call <16 x i8> @llvm.experimental.vp.splice.v16i8(<16 x i8> %va, <16 x i8> %vb, i32 5, <16 x i1> %mask, i32 %evla, i32 %evlb)
@@ -182,10 +174,9 @@ define <16 x i8> @test_vp_splice_v16i8_masked(<16 x i8> %va, <16 x i8> %vb, <16
define <2 x double> @test_vp_splice_v2f64(<2 x double> %va, <2 x double> %vb, i32 zeroext %evla, i32 zeroext %evlb) {
; CHECK-LABEL: test_vp_splice_v2f64:
; CHECK: # %bb.0:
-; CHECK-NEXT: addi a0, a0, -5
-; CHECK-NEXT: vsetvli zero, a0, e64, m1, ta, ma
-; CHECK-NEXT: vslidedown.vi v8, v8, 5
; CHECK-NEXT: vsetvli zero, a1, e64, m1, ta, ma
+; CHECK-NEXT: vslidedown.vi v8, v8, 5
+; CHECK-NEXT: addi a0, a0, -5
; CHECK-NEXT: vslideup.vx v8, v9, a0
; CHECK-NEXT: ret
@@ -197,9 +188,8 @@ define <2 x double> @test_vp_splice_v2f64_negative_offset(<2 x double> %va, <2 x
; CHECK-LABEL: test_vp_splice_v2f64_negative_offset:
; CHECK: # %bb.0:
; CHECK-NEXT: addi a0, a0, -5
-; CHECK-NEXT: vsetivli zero, 5, e64, m1, ta, ma
-; CHECK-NEXT: vslidedown.vx v8, v8, a0
; CHECK-NEXT: vsetvli zero, a1, e64, m1, ta, ma
+; CHECK-NEXT: vslidedown.vx v8, v8, a0
; CHECK-NEXT: vslideup.vi v8, v9, 5
; CHECK-NEXT: ret
@@ -210,10 +200,10 @@ define <2 x double> @test_vp_splice_v2f64_negative_offset(<2 x double> %va, <2 x
define <2 x double> @test_vp_splice_v2f64_masked(<2 x double> %va, <2 x double> %vb, <2 x i1> %mask, i32 zeroext %evla, i32 zeroext %evlb) {
; CHECK-LABEL: test_vp_splice_v2f64_masked:
; CHECK: # %bb.0:
-; CHECK-NEXT: addi a0, a0, -5
-; CHECK-NEXT: vsetvli zero, a0, e64, m1, ta, ma
+; CHECK-NEXT: vsetvli zero, a1, e64, m1, ta, ma
; CHECK-NEXT: vslidedown.vi v8, v8, 5, v0.t
-; CHECK-NEXT: vsetvli zero, a1, e64, m1, ta, mu
+; CHECK-NEXT: addi a0, a0, -5
+; CHECK-NEXT: vsetvli zero, zero, e64, m1, ta, mu
; CHECK-NEXT: vslideup.vx v8, v9, a0, v0.t
; CHECK-NEXT: ret
%v = call <2 x double> @llvm.experimental.vp.splice.v2f64(<2 x double> %va, <2 x double> %vb, i32 5, <2 x i1> %mask, i32 %evla, i32 %evlb)
@@ -223,10 +213,9 @@ define <2 x double> @test_vp_splice_v2f64_masked(<2 x double> %va, <2 x double>
define <4 x float> @test_vp_splice_v4f32(<4 x float> %va, <4 x float> %vb, i32 zeroext %evla, i32 zeroext %evlb) {
; CHECK-LABEL: test_vp_splice_v4f32:
; CHECK: # %bb.0:
-; CHECK-NEXT: addi a0, a0, -5
-; CHECK-NEXT: vsetvli zero, a0, e32, m1, ta, ma
-; CHECK-NEXT: vslidedown.vi v8, v8, 5
; CHECK-NEXT: vsetvli zero, a1, e32, m1, ta, ma
+; CHECK-NEXT: vslidedown.vi v8, v8, 5
+; CHECK-NEXT: addi a0, a0, -5
; CHECK-NEXT: vslideup.vx v8, v9, a0
; CHECK-NEXT: ret
@@ -238,9 +227,8 @@ define <4 x float> @test_vp_splice_v4f32_negative_offset(<4 x float> %va, <4 x f
; CHECK-LABEL: test_vp_splice_v4f32_negative_offset:
; CHECK: # %bb.0:
; CHECK-NEXT: addi a0, a0, -5
-; CHECK-NEXT: vsetivli zero, 5, e32, m1, ta, ma
-; CHECK-NEXT: vslidedown.vx v8, v8, a0
; CHECK-NEXT: vsetvli zero, a1, e32, m1, ta, ma
+; CHECK-NEXT: vslidedown.vx v8, v8, a0
; CHECK-NEXT: vslideup.vi v8, v9, 5
; CHECK-NEXT: ret
@@ -251,10 +239,10 @@ define <4 x float> @test_vp_splice_v4f32_negative_offset(<4 x float> %va, <4 x f
define <4 x float> @test_vp_splice_v4f32_masked(<4 x float> %va, <4 x float> %vb, <4 x i1> %mask, i32 zeroext %evla, i32 zeroext %evlb) {
; CHECK-LABEL: test_vp_splice_v4f32_masked:
; CHECK: # %bb.0:
-; CHECK-NEXT: addi a0, a0, -5
-; CHECK-NEXT: vsetvli zero, a0, e32, m1, ta, ma
+; CHECK-NEXT: vsetvli zero, a1, e32, m1, ta, ma
; CHECK-NEXT: vslidedown.vi v8, v8, 5, v0.t
-; CHECK-NEXT: vsetvli zero, a1, e32, m1, ta, mu
+; CHECK-NEXT: addi a0, a0, -5
+; CHECK-NEXT: vsetvli zero, zero, e32, m1, ta, mu
; CHECK-NEXT: vslideup.vx v8, v9, a0, v0.t
; CHECK-NEXT: ret
%v = call <4 x float> @llvm.experimental.vp.splice.v4f32(<4 x float> %va, <4 x float> %vb, i32 5, <4 x i1> %mask, i32 %evla, i32 %evlb)
@@ -264,10 +252,9 @@ define <4 x float> @test_vp_splice_v4f32_masked(<4 x float> %va, <4 x float> %vb
define <8 x half> @test_vp_splice_v8f16(<8 x half> %va, <8 x half> %vb, i32 zeroext %evla, i32 zeroext %evlb) {
; CHECK-LABEL: test_vp_splice_v8f16:
; CHECK: # %bb.0:
-; CHECK-NEXT: addi a0, a0, -5
-; CHECK-NEXT: vsetvli zero, a0, e16, m1, ta, ma
-; CHECK-NEXT: vslidedown.vi v8, v8, 5
; CHECK-NEXT: vsetvli zero, a1, e16, m1, ta, ma
+; CHECK-NEXT: vslidedown.vi v8, v8, 5
+; CHECK-NEXT: addi a0, a0, -5
; CHECK-NEXT: vslideup.vx v8, v9, a0
; CHECK-NEXT: ret
@@ -279,9 +266,8 @@ define <8 x half> @test_vp_splice_v8f16_negative_offset(<8 x half> %va, <8 x hal
; CHECK-LABEL: test_vp_splice_v8f16_negative_offset:
; CHECK: # %bb.0:
; CHECK-NEXT: addi a0, a0, -5
-; CHECK-NEXT: vsetivli zero, 5, e16, m1, ta, ma
-; CHECK-NEXT: vslidedown.vx v8, v8, a0
; CHECK-NEXT: vsetvli zero, a1, e16, m1, ta, ma
+; CHECK-NEXT: vslidedown.vx v8, v8, a0
; CHECK-NEXT: vslideup.vi v8, v9, 5
; CHECK-NEXT: ret
@@ -292,10 +278,10 @@ define <8 x half> @test_vp_splice_v8f16_negative_offset(<8 x half> %va, <8 x hal
define <8 x half> @test_vp_splice_v8f16_masked(<8 x half> %va, <8 x half> %vb, <8 x i1> %mask, i32 zeroext %evla, i32 zeroext %evlb) {
; CHECK-LABEL: test_vp_splice_v8f16_masked:
; CHECK: # %bb.0:
-; CHECK-NEXT: addi a0, a0, -5
-; CHECK-NEXT: vsetvli zero, a0, e16, m1, ta, ma
+; CHECK-NEXT: vsetvli zero, a1, e16, m1, ta, ma
; CHECK-NEXT: vslidedown.vi v8, v8, 5, v0.t
-; CHECK-NEXT: vsetvli zero, a1, e16, m1, ta, mu
+; CHECK-NEXT: addi a0, a0, -5
+; CHECK-NEXT: vsetvli zero, zero, e16, m1, ta, mu
; CHECK-NEXT: vslideup.vx v8, v9, a0, v0.t
; CHECK-NEXT: ret
%v = call <8 x half> @llvm.experimental.vp.splice.v8f16(<8 x half> %va, <8 x half> %vb, i32 5, <8 x i1> %mask, i32 %evla, i32 %evlb)
@@ -364,10 +350,9 @@ define <4 x half> @test_vp_splice_nxv2f16_with_firstelt(half %first, <4 x half>
define <8 x bfloat> @test_vp_splice_v8bf16(<8 x bfloat> %va, <8 x bfloat> %vb, i32 zeroext %evla, i32 zeroext %evlb) {
; CHECK-LABEL: test_vp_splice_v8bf16:
; CHECK: # %bb.0:
-; CHECK-NEXT: addi a0, a0, -5
-; CHECK-NEXT: vsetvli zero, a0, e16, m1, ta, ma
-; CHECK-NEXT: vslidedown.vi v8, v8, 5
; CHECK-NEXT: vsetvli zero, a1, e16, m1, ta, ma
+; CHECK-NEXT: vslidedown.vi v8, v8, 5
+; CHECK-NEXT: addi a0, a0, -5
; CHECK-NEXT: vslideup.vx v8, v9, a0
; CHECK-NEXT: ret
@@ -379,9 +364,8 @@ define <8 x bfloat> @test_vp_splice_v8bf16_negative_offset(<8 x bfloat> %va, <8
; CHECK-LABEL: test_vp_splice_v8bf16_negative_offset:
; CHECK: # %bb.0:
; CHECK-NEXT: addi a0, a0, -5
-; CHECK-NEXT: vsetivli zero, 5, e16, m1, ta, ma
-; CHECK-NEXT: vslidedown.vx v8, v8, a0
; CHECK-NEXT: vsetvli zero, a1, e16, m1, ta, ma
+; CHECK-NEXT: vslidedown.vx v8, v8, a0
; CHECK-NEXT: vslideup.vi v8, v9, 5
; CHECK-NEXT: ret
@@ -392,10 +376,10 @@ define <8 x bfloat> @test_vp_splice_v8bf16_negative_offset(<8 x bfloat> %va, <8
define <8 x bfloat> @test_vp_splice_v8bf16_masked(<8 x bfloat> %va, <8 x bfloat> %vb, <8 x i1> %mask, i32 zeroext %evla, i32 zeroext %evlb) {
; CHECK-LABEL: test_vp_splice_v8bf16_masked:
; CHECK: # %bb.0:
-; CHECK-NEXT: addi a0, a0, -5
-; CHECK-NEXT: vsetvli zero, a0, e16, m1, ta, ma
+; CHECK-NEXT: vsetvli zero, a1, e16, m1, ta, ma
; CHECK-NEXT: vslidedown.vi v8, v8, 5, v0.t
-; CHECK-NEXT: vsetvli zero, a1, e16, m1, ta, mu
+; CHECK-NEXT: addi a0, a0, -5
+; CHECK-NEXT: vsetvli zero, zero, e16, m1, ta, mu
; CHECK-NEXT: vslideup.vx v8, v9, a0, v0.t
; CHECK-NEXT: ret
%v = call <8 x bfloat> @llvm.experimental.vp.splice.v8bf16(<8 x bfloat> %va, <8 x bfloat> %vb, i32 5, <8 x i1> %mask, i32 %evla, i32 %evlb)
diff --git a/llvm/test/CodeGen/RISCV/rvv/vector-splice.ll b/llvm/test/CodeGen/RISCV/rvv/vector-splice.ll
index 90d798b167cfc..87b6442c38a42 100644
--- a/llvm/test/CodeGen/RISCV/rvv/vector-splice.ll
+++ b/llvm/test/CodeGen/RISCV/rvv/vector-splice.ll
@@ -42,10 +42,8 @@ define <vscale x 1 x i1> @splice_nxv1i1_offset_max(<vscale x 1 x i1> %a, <vscale
; CHECK-NEXT: vmv1r.v v0, v9
; CHECK-NEXT: vmerge.vim v8, v8, 1, v0
; CHECK-NEXT: srli a0, a0, 3
-; CHECK-NEXT: addi a0, a0, -1
-; CHECK-NEXT: vsetvli zero, a0, e8, mf8, ta, ma
; CHECK-NEXT: vslidedown.vi v8, v8, 1
-; CHECK-NEXT: vsetvli a1, zero, e8, mf8, ta, ma
+; CHECK-NEXT: addi a0, a0, -1
; CHECK-NEXT: vslideup.vx v8, v10, a0
; CHECK-NEXT: vand.vi v8, v8, 1
; CHECK-NEXT: vmsne.vi v0, v8, 0
@@ -90,10 +88,8 @@ define <vscale x 2 x i1> @splice_nxv2i1_offset_max(<vscale x 2 x i1> %a, <vscale
; CHECK-NEXT: vmv1r.v v0, v9
; CHECK-NEXT: vmerge.vim v8, v8, 1, v0
; CHECK-NEXT: srli a0, a0, 2
-; CHECK-NEXT: addi a0, a0, -3
-; CHECK-NEXT: vsetvli zero, a0, e8, mf4, ta, ma
; CHECK-NEXT: vslidedown.vi v8, v8, 3
-; CHECK-NEXT: vsetvli a1, zero, e8, mf4, ta, ma
+; CHECK-NEXT: addi a0, a0, -3
; CHECK-NEXT: vslideup.vx v8, v10, a0
; CHECK-NEXT: vand.vi v8, v8, 1
; CHECK-NEXT: vmsne.vi v0, v8, 0
@@ -138,10 +134,8 @@ define <vscale x 4 x i1> @splice_nxv4i1_offset_max(<vscale x 4 x i1> %a, <vscale
; CHECK-NEXT: vmv1r.v v0, v9
; CHECK-NEXT: vmerge.vim v8, v8, 1, v0
; CHECK-NEXT: srli a0, a0, 1
-; CHECK-NEXT: addi a0, a0, -7
-; CHECK-NEXT: vsetvli zero, a0, e8, mf2, ta, ma
; CHECK-NEXT: vslidedown.vi v8, v8, 7
-; CHECK-NEXT: vsetvli a1, zero, e8, mf2, ta, ma
+; CHECK-NEXT: addi a0, a0, -7
; CHECK-NEXT: vslideup.vx v8, v10, a0
; CHECK-NEXT: vand.vi v8, v8, 1
; CHECK-NEXT: vmsne.vi v0, v8, 0
@@ -184,10 +178,8 @@ define <vscale x 8 x i1> @splice_nxv8i1_offset_max(<vscale x 8 x i1> %a, <vscale
; CHECK-NEXT: vmerge.vim v10, v8, 1, v0
; CHECK-NEXT: vmv1r.v v0, v9
; CHECK-NEXT: vmerge.vim v8, v8, 1, v0
-; CHECK-NEXT: addi a0, a0, -15
-; CHECK-NEXT: vsetvli zero, a0, e8, m1, ta, ma
; CHECK-NEXT: vslidedown.vi v8, v8, 15
-; CHECK-NEXT: vsetvli a1, zero, e8, m1, ta, ma
+; CHECK-NEXT: addi a0, a0, -15
; CHECK-NEXT: vslideup.vx v8, v10, a0
; CHECK-NEXT: vand.vi v8, v8, 1
; CHECK-NEXT: vmsne.vi v0, v8, 0
@@ -232,10 +224,8 @@ define <vscale x 16 x i1> @splice_nxv16i1_offset_max(<vscale x 16 x i1> %a, <vsc
; CHECK-NEXT: vmv1r.v v0, v9
; CHECK-NEXT: vmerge.vim v8, v10, 1, v0
; CHECK-NEXT: slli a0, a0, 1
-; CHECK-NEXT: addi a0, a0, -31
-; CHECK-NEXT: vsetvli zero, a0, e8, m2, ta, ma
; CHECK-NEXT: vslidedown.vi v8, v8, 31
-; CHECK-NEXT: vsetvli a1, zero, e8, m2, ta, ma
+; CHECK-NEXT: addi a0, a0, -31
; CHECK-NEXT: vslideup.vx v8, v12, a0
; CHECK-NEXT: vand.vi v8, v8, 1
; CHECK-NEXT: vmsne.vi v0, v8, 0
@@ -272,19 +262,19 @@ define <vscale x 32 x i1> @splice_nxv32i1_offset_max(<vscale x 32 x i1> %a, <vsc
; CHECK-LABEL: splice_nxv32i1_offset_max:
; CHECK: # %bb.0:
; CHECK-NEXT: vsetvli a0, zero, e8, m4, ta, ma
+; CHECK-NEXT: vmv1r.v v9, v0
+; CHECK-NEXT: vmv1r.v v0, v8
; CHECK-NEXT: vmv.v.i v12, 0
-; CHECK-NEXT: csrr a0, vlenb
-; CHECK-NEXT: li a1, 63
+; CHECK-NEXT: li a0, 63
; CHECK-NEXT: vmerge.vim v16, v12, 1, v0
+; CHECK-NEXT: vmv1r.v v0, v9
+; CHECK-NEXT: vmerge.vim v8, v12, 1, v0
+; CHECK-NEXT: vslidedown.vx v8, v8, a0
+; CHECK-NEXT: csrr a0, vlenb
; CHECK-NEXT: slli a0, a0, 2
; CHECK-NEXT: addi a0, a0, -63
-; CHECK-NEXT: vsetvli zero, a0, e8, m4, ta, ma
-; CHECK-NEXT: vslidedown.vx v16, v16, a1
-; CHECK-NEXT: vmv1r.v v0, v8
...
[truncated]
On X280, the VL toggle is better. The scalar unit is running ahead of the vector unit. The VL has been computed before the vector unit needs it. The VL is used to reduce the number of uops in the vector unit.
In RISCVInsertVSETVLI we increase vl for vslides with an immediate of 1:
So for non-EVL loops today we actually already omit the toggle. (I'm not sure why the EVL path doesn't trigger this.) Should we add a tuning feature to gate this?
That code has an LMUL=1 restriction so that helps. Though X280's ALU width is VLEN/2.
I think so.
; CHECK-NEXT: vsetvli zero, a1, e64, m1, ta, ma
; CHECK-NEXT: vslidedown.vi v8, v8, 5
; CHECK-NEXT: addi a0, a0, -5
; CHECK-NEXT: vslideup.vx v8, v9, a0
; CHECK-NEXT: ret
Isn't this test (and a bunch of others in this file) ill-defined? The specification of this intrinsic says the offset starts the selected region from the concatenation. Since the concatenation is at most 4 elements long, isn't the result always poison?
Yeah, I think so. I guess it's because we just decided to use 5 as the offset for all these examples.
I assume a copy-paste mistake. The immediate for 2xi64 needs to be in [-2, 1]. The splice result must include at least one element from the first vector of the concatenation.
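For reference, a tiny standalone helper (illustrative only, not proposed for the tree) that computes that bound: for an N-element fixed vector, and EVLs that may be as large as N, the trailing offset must lie in [-N, N-1] for the result to still contain an element of the first operand.

```cpp
// Illustration only: the range of vp.splice trailing offsets that can keep at
// least one element of the first operand when both EVLs can be as large as
// the fixed vector length N. For <2 x i64>, this prints [-2, 1].
#include <cstdio>

static void printValidSpliceOffsets(int NumElts) {
  printf("valid offsets for a %d-element vector: [%d, %d]\n", NumElts,
         -NumElts, NumElts - 1);
}

int main() { printValidSpliceOffsets(2); }
```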
For hardware which scales with LMUL (e.g. bp3) this is probably net profitable. Yeah, seems like we might need a tuning flag.
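If a tuning flag does materialize, the lowering change could be keyed off a subtarget query. A minimal sketch, assuming a hypothetical preferVLToggleOverLongVSlide() hook (not an existing RISCVSubtarget API) and reusing the values lowerVECTOR_SPLICE already computes in the diff above:

```cpp
// Sketch only: preferVLToggleOverLongVSlide() is a hypothetical tuning hook.
// DownOffset, UpOffset, TrueMask, V1, VecVT and XLenVT are the values already
// present in RISCVTargetLowering::lowerVECTOR_SPLICE above.
SDValue DownAVL = Subtarget.preferVLToggleOverLongVSlide()
                      ? UpOffset                            // minimal AVL, extra toggle
                      : DAG.getRegister(RISCV::X0, XLenVT); // VLMAX, matches the vslideup
SDValue SlideDown =
    getVSlidedown(DAG, Subtarget, DL, VecVT, DAG.getUNDEF(VecVT), V1,
                  DownOffset, TrueMask, DownAVL);
```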