[RISCV] Avoid vl toggles when lowering vector_splice/experimental_vp_splice #146746

Open

lukel97 wants to merge 1 commit into main

Conversation

Contributor

@lukel97 lukel97 commented Jul 2, 2025

When vectorizing a loop with a fixed-order recurrence we use a splice, which gets lowered to a vslidedown and vslideup pair.
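For reference, a minimal C++ loop with such a recurrence might look like the sketch below (illustrative only; the function and variable names are made up):

    // Each iteration reads the element loaded by the previous iteration, so the
    // vectorizer has to splice the tail of the previous vector iteration onto
    // the front of the current one.
    void fo_recurrence(const int *a, int *b, int n) {
      for (int i = 1; i < n; ++i)
        b[i] = a[i] + a[i - 1];
    }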

However, with the way we lower it today we end up with extra vl toggles in the loop, especially with EVL tail folding, e.g.:

.LBB0_5:                                # %vector.body
                                        # =>This Inner Loop Header: Depth=1
	sub	a5, a2, a3
	sh2add	a6, a3, a1
	zext.w	a7, a4
	vsetvli	a4, a5, e8, mf2, ta, ma
	vle32.v	v10, (a6)
	addi	a7, a7, -1
	vsetivli	zero, 1, e32, m2, ta, ma
	vslidedown.vx	v8, v8, a7
	sh2add	a6, a3, a0
	vsetvli	zero, a5, e32, m2, ta, ma
	vslideup.vi	v8, v10, 1
	vadd.vv	v8, v10, v8
	add	a3, a3, a4
	vse32.v	v8, (a6)
	vmv2r.v	v8, v10
	bne	a3, a2, .LBB0_5

Because the vslideup overwrites all but the first UpOffset elements of the vslidedown's result, we currently set the vslidedown's AVL to that offset.

But the vslideup uses either VLMAX or the EVL, which causes a toggle.

This patch increases the AVL of the vslidedown so that it matches the vslideup's, even though the extra elements are overwritten, to avoid the toggle.
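
As a scalar model of the lowering (a sketch only, with simplified element handling; this is not the actual SelectionDAG code):

    #include <cstddef>
    #include <vector>

    // vector_splice(V1, V2, Offset) == concat(V1, V2)[Offset .. Offset + VL).
    // It is lowered as a vslidedown of V1 by Offset followed by a vslideup of
    // V2 by UpOffset = VL - Offset.
    std::vector<int> spliceModel(const std::vector<int> &V1,
                                 const std::vector<int> &V2, size_t Offset) {
      size_t VL = V1.size();
      size_t UpOffset = VL - Offset;
      std::vector<int> Res(VL);
      // vslidedown: only the first UpOffset lanes survive, so an AVL of
      // UpOffset is enough for correctness (the old behaviour)...
      for (size_t i = 0; i < UpOffset; ++i)
        Res[i] = V1[i + Offset];
      // ...but the vslideup runs at the full VL/EVL and overwrites lanes
      // [UpOffset, VL) anyway, so running the vslidedown at that same AVL
      // changes nothing observable and removes the vsetvli toggle between them.
      for (size_t i = UpOffset; i < VL; ++i)
        Res[i] = V2[i - UpOffset];
      return Res;
    }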

This is operating under the assumption that a vl toggle is more expensive than performing the vslidedown at a higher vl on the average microarchitecture.

If we wanted to aggressively optimise for vl at the expense of introducing more toggles we could probably look at doing this in RISCVVLOptimizer.

@llvmbot
Member

llvmbot commented Jul 2, 2025

@llvm/pr-subscribers-backend-risc-v

Author: Luke Lau (lukel97)

Changes


Patch is 133.19 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/146746.diff

6 Files Affected:

  • (modified) llvm/lib/Target/RISCV/RISCVISelLowering.cpp (+2-3)
  • (modified) llvm/test/CodeGen/RISCV/rvv/fixed-vectors-vp-splice.ll (+48-64)
  • (modified) llvm/test/CodeGen/RISCV/rvv/vector-splice.ll (+196-316)
  • (modified) llvm/test/CodeGen/RISCV/rvv/vp-splice-mask-fixed-vectors.ll (+20-28)
  • (modified) llvm/test/CodeGen/RISCV/rvv/vp-splice-mask-vectors.ll (+35-49)
  • (modified) llvm/test/CodeGen/RISCV/rvv/vp-splice.ll (+54-72)
diff --git a/llvm/lib/Target/RISCV/RISCVISelLowering.cpp b/llvm/lib/Target/RISCV/RISCVISelLowering.cpp
index 326dd7149ef96..989a2cd237262 100644
--- a/llvm/lib/Target/RISCV/RISCVISelLowering.cpp
+++ b/llvm/lib/Target/RISCV/RISCVISelLowering.cpp
@@ -12331,7 +12331,7 @@ SDValue RISCVTargetLowering::lowerVECTOR_SPLICE(SDValue Op,
 
   SDValue SlideDown =
       getVSlidedown(DAG, Subtarget, DL, VecVT, DAG.getUNDEF(VecVT), V1,
-                    DownOffset, TrueMask, UpOffset);
+                    DownOffset, TrueMask, DAG.getRegister(RISCV::X0, XLenVT));
   return getVSlideup(DAG, Subtarget, DL, VecVT, SlideDown, V2, UpOffset,
                      TrueMask, DAG.getRegister(RISCV::X0, XLenVT),
                      RISCVVType::TAIL_AGNOSTIC);
@@ -13354,8 +13354,7 @@ RISCVTargetLowering::lowerVPSpliceExperimental(SDValue Op,
 
   if (ImmValue != 0)
     Op1 = getVSlidedown(DAG, Subtarget, DL, ContainerVT,
-                        DAG.getUNDEF(ContainerVT), Op1, DownOffset, Mask,
-                        UpOffset);
+                        DAG.getUNDEF(ContainerVT), Op1, DownOffset, Mask, EVL2);
   SDValue Result = getVSlideup(DAG, Subtarget, DL, ContainerVT, Op1, Op2,
                                UpOffset, Mask, EVL2, RISCVVType::TAIL_AGNOSTIC);
 
diff --git a/llvm/test/CodeGen/RISCV/rvv/fixed-vectors-vp-splice.ll b/llvm/test/CodeGen/RISCV/rvv/fixed-vectors-vp-splice.ll
index 8160e62a43106..79fbdb007a70c 100644
--- a/llvm/test/CodeGen/RISCV/rvv/fixed-vectors-vp-splice.ll
+++ b/llvm/test/CodeGen/RISCV/rvv/fixed-vectors-vp-splice.ll
@@ -7,10 +7,9 @@
 define <2 x i64> @test_vp_splice_v2i64(<2 x i64> %va, <2 x i64> %vb, i32 zeroext %evla, i32 zeroext %evlb) {
 ; CHECK-LABEL: test_vp_splice_v2i64:
 ; CHECK:       # %bb.0:
-; CHECK-NEXT:    addi a0, a0, -5
-; CHECK-NEXT:    vsetvli zero, a0, e64, m1, ta, ma
-; CHECK-NEXT:    vslidedown.vi v8, v8, 5
 ; CHECK-NEXT:    vsetvli zero, a1, e64, m1, ta, ma
+; CHECK-NEXT:    vslidedown.vi v8, v8, 5
+; CHECK-NEXT:    addi a0, a0, -5
 ; CHECK-NEXT:    vslideup.vx v8, v9, a0
 ; CHECK-NEXT:    ret
 
@@ -22,9 +21,8 @@ define <2 x i64> @test_vp_splice_v2i64_negative_offset(<2 x i64> %va, <2 x i64>
 ; CHECK-LABEL: test_vp_splice_v2i64_negative_offset:
 ; CHECK:       # %bb.0:
 ; CHECK-NEXT:    addi a0, a0, -5
-; CHECK-NEXT:    vsetivli zero, 5, e64, m1, ta, ma
-; CHECK-NEXT:    vslidedown.vx v8, v8, a0
 ; CHECK-NEXT:    vsetvli zero, a1, e64, m1, ta, ma
+; CHECK-NEXT:    vslidedown.vx v8, v8, a0
 ; CHECK-NEXT:    vslideup.vi v8, v9, 5
 ; CHECK-NEXT:    ret
 
@@ -46,10 +44,10 @@ define <2 x i64> @test_vp_splice_v2i64_zero_offset(<2 x i64> %va, <2 x i64> %vb,
 define <2 x i64> @test_vp_splice_v2i64_masked(<2 x i64> %va, <2 x i64> %vb, <2 x i1> %mask, i32 zeroext %evla, i32 zeroext %evlb) {
 ; CHECK-LABEL: test_vp_splice_v2i64_masked:
 ; CHECK:       # %bb.0:
-; CHECK-NEXT:    addi a0, a0, -5
-; CHECK-NEXT:    vsetvli zero, a0, e64, m1, ta, ma
+; CHECK-NEXT:    vsetvli zero, a1, e64, m1, ta, ma
 ; CHECK-NEXT:    vslidedown.vi v8, v8, 5, v0.t
-; CHECK-NEXT:    vsetvli zero, a1, e64, m1, ta, mu
+; CHECK-NEXT:    addi a0, a0, -5
+; CHECK-NEXT:    vsetvli zero, zero, e64, m1, ta, mu
 ; CHECK-NEXT:    vslideup.vx v8, v9, a0, v0.t
 ; CHECK-NEXT:    ret
   %v = call <2 x i64> @llvm.experimental.vp.splice.v2i64(<2 x i64> %va, <2 x i64> %vb, i32 5, <2 x i1> %mask, i32 %evla, i32 %evlb)
@@ -59,10 +57,9 @@ define <2 x i64> @test_vp_splice_v2i64_masked(<2 x i64> %va, <2 x i64> %vb, <2 x
 define <4 x i32> @test_vp_splice_v4i32(<4 x i32> %va, <4 x i32> %vb, i32 zeroext %evla, i32 zeroext %evlb) {
 ; CHECK-LABEL: test_vp_splice_v4i32:
 ; CHECK:       # %bb.0:
-; CHECK-NEXT:    addi a0, a0, -5
-; CHECK-NEXT:    vsetvli zero, a0, e32, m1, ta, ma
-; CHECK-NEXT:    vslidedown.vi v8, v8, 5
 ; CHECK-NEXT:    vsetvli zero, a1, e32, m1, ta, ma
+; CHECK-NEXT:    vslidedown.vi v8, v8, 5
+; CHECK-NEXT:    addi a0, a0, -5
 ; CHECK-NEXT:    vslideup.vx v8, v9, a0
 ; CHECK-NEXT:    ret
 
@@ -74,9 +71,8 @@ define <4 x i32> @test_vp_splice_v4i32_negative_offset(<4 x i32> %va, <4 x i32>
 ; CHECK-LABEL: test_vp_splice_v4i32_negative_offset:
 ; CHECK:       # %bb.0:
 ; CHECK-NEXT:    addi a0, a0, -5
-; CHECK-NEXT:    vsetivli zero, 5, e32, m1, ta, ma
-; CHECK-NEXT:    vslidedown.vx v8, v8, a0
 ; CHECK-NEXT:    vsetvli zero, a1, e32, m1, ta, ma
+; CHECK-NEXT:    vslidedown.vx v8, v8, a0
 ; CHECK-NEXT:    vslideup.vi v8, v9, 5
 ; CHECK-NEXT:    ret
 
@@ -87,10 +83,10 @@ define <4 x i32> @test_vp_splice_v4i32_negative_offset(<4 x i32> %va, <4 x i32>
 define <4 x i32> @test_vp_splice_v4i32_masked(<4 x i32> %va, <4 x i32> %vb, <4 x i1> %mask, i32 zeroext %evla, i32 zeroext %evlb) {
 ; CHECK-LABEL: test_vp_splice_v4i32_masked:
 ; CHECK:       # %bb.0:
-; CHECK-NEXT:    addi a0, a0, -5
-; CHECK-NEXT:    vsetvli zero, a0, e32, m1, ta, ma
+; CHECK-NEXT:    vsetvli zero, a1, e32, m1, ta, ma
 ; CHECK-NEXT:    vslidedown.vi v8, v8, 5, v0.t
-; CHECK-NEXT:    vsetvli zero, a1, e32, m1, ta, mu
+; CHECK-NEXT:    addi a0, a0, -5
+; CHECK-NEXT:    vsetvli zero, zero, e32, m1, ta, mu
 ; CHECK-NEXT:    vslideup.vx v8, v9, a0, v0.t
 ; CHECK-NEXT:    ret
   %v = call <4 x i32> @llvm.experimental.vp.splice.v4i32(<4 x i32> %va, <4 x i32> %vb, i32 5, <4 x i1> %mask, i32 %evla, i32 %evlb)
@@ -100,10 +96,9 @@ define <4 x i32> @test_vp_splice_v4i32_masked(<4 x i32> %va, <4 x i32> %vb, <4 x
 define <8 x i16> @test_vp_splice_v8i16(<8 x i16> %va, <8 x i16> %vb, i32 zeroext %evla, i32 zeroext %evlb) {
 ; CHECK-LABEL: test_vp_splice_v8i16:
 ; CHECK:       # %bb.0:
-; CHECK-NEXT:    addi a0, a0, -5
-; CHECK-NEXT:    vsetvli zero, a0, e16, m1, ta, ma
-; CHECK-NEXT:    vslidedown.vi v8, v8, 5
 ; CHECK-NEXT:    vsetvli zero, a1, e16, m1, ta, ma
+; CHECK-NEXT:    vslidedown.vi v8, v8, 5
+; CHECK-NEXT:    addi a0, a0, -5
 ; CHECK-NEXT:    vslideup.vx v8, v9, a0
 ; CHECK-NEXT:    ret
 
@@ -115,9 +110,8 @@ define <8 x i16> @test_vp_splice_v8i16_negative_offset(<8 x i16> %va, <8 x i16>
 ; CHECK-LABEL: test_vp_splice_v8i16_negative_offset:
 ; CHECK:       # %bb.0:
 ; CHECK-NEXT:    addi a0, a0, -5
-; CHECK-NEXT:    vsetivli zero, 5, e16, m1, ta, ma
-; CHECK-NEXT:    vslidedown.vx v8, v8, a0
 ; CHECK-NEXT:    vsetvli zero, a1, e16, m1, ta, ma
+; CHECK-NEXT:    vslidedown.vx v8, v8, a0
 ; CHECK-NEXT:    vslideup.vi v8, v9, 5
 ; CHECK-NEXT:    ret
 
@@ -128,10 +122,10 @@ define <8 x i16> @test_vp_splice_v8i16_negative_offset(<8 x i16> %va, <8 x i16>
 define <8 x i16> @test_vp_splice_v8i16_masked(<8 x i16> %va, <8 x i16> %vb, <8 x i1> %mask, i32 zeroext %evla, i32 zeroext %evlb) {
 ; CHECK-LABEL: test_vp_splice_v8i16_masked:
 ; CHECK:       # %bb.0:
-; CHECK-NEXT:    addi a0, a0, -5
-; CHECK-NEXT:    vsetvli zero, a0, e16, m1, ta, ma
+; CHECK-NEXT:    vsetvli zero, a1, e16, m1, ta, ma
 ; CHECK-NEXT:    vslidedown.vi v8, v8, 5, v0.t
-; CHECK-NEXT:    vsetvli zero, a1, e16, m1, ta, mu
+; CHECK-NEXT:    addi a0, a0, -5
+; CHECK-NEXT:    vsetvli zero, zero, e16, m1, ta, mu
 ; CHECK-NEXT:    vslideup.vx v8, v9, a0, v0.t
 ; CHECK-NEXT:    ret
   %v = call <8 x i16> @llvm.experimental.vp.splice.v8i16(<8 x i16> %va, <8 x i16> %vb, i32 5, <8 x i1> %mask, i32 %evla, i32 %evlb)
@@ -141,10 +135,9 @@ define <8 x i16> @test_vp_splice_v8i16_masked(<8 x i16> %va, <8 x i16> %vb, <8 x
 define <16 x i8> @test_vp_splice_v16i8(<16 x i8> %va, <16 x i8> %vb, i32 zeroext %evla, i32 zeroext %evlb) {
 ; CHECK-LABEL: test_vp_splice_v16i8:
 ; CHECK:       # %bb.0:
-; CHECK-NEXT:    addi a0, a0, -5
-; CHECK-NEXT:    vsetvli zero, a0, e8, m1, ta, ma
-; CHECK-NEXT:    vslidedown.vi v8, v8, 5
 ; CHECK-NEXT:    vsetvli zero, a1, e8, m1, ta, ma
+; CHECK-NEXT:    vslidedown.vi v8, v8, 5
+; CHECK-NEXT:    addi a0, a0, -5
 ; CHECK-NEXT:    vslideup.vx v8, v9, a0
 ; CHECK-NEXT:    ret
 
@@ -156,9 +149,8 @@ define <16 x i8> @test_vp_splice_v16i8_negative_offset(<16 x i8> %va, <16 x i8>
 ; CHECK-LABEL: test_vp_splice_v16i8_negative_offset:
 ; CHECK:       # %bb.0:
 ; CHECK-NEXT:    addi a0, a0, -5
-; CHECK-NEXT:    vsetivli zero, 5, e8, m1, ta, ma
-; CHECK-NEXT:    vslidedown.vx v8, v8, a0
 ; CHECK-NEXT:    vsetvli zero, a1, e8, m1, ta, ma
+; CHECK-NEXT:    vslidedown.vx v8, v8, a0
 ; CHECK-NEXT:    vslideup.vi v8, v9, 5
 ; CHECK-NEXT:    ret
 
@@ -169,10 +161,10 @@ define <16 x i8> @test_vp_splice_v16i8_negative_offset(<16 x i8> %va, <16 x i8>
 define <16 x i8> @test_vp_splice_v16i8_masked(<16 x i8> %va, <16 x i8> %vb, <16 x i1> %mask, i32 zeroext %evla, i32 zeroext %evlb) {
 ; CHECK-LABEL: test_vp_splice_v16i8_masked:
 ; CHECK:       # %bb.0:
-; CHECK-NEXT:    addi a0, a0, -5
-; CHECK-NEXT:    vsetvli zero, a0, e8, m1, ta, ma
+; CHECK-NEXT:    vsetvli zero, a1, e8, m1, ta, ma
 ; CHECK-NEXT:    vslidedown.vi v8, v8, 5, v0.t
-; CHECK-NEXT:    vsetvli zero, a1, e8, m1, ta, mu
+; CHECK-NEXT:    addi a0, a0, -5
+; CHECK-NEXT:    vsetvli zero, zero, e8, m1, ta, mu
 ; CHECK-NEXT:    vslideup.vx v8, v9, a0, v0.t
 ; CHECK-NEXT:    ret
   %v = call <16 x i8> @llvm.experimental.vp.splice.v16i8(<16 x i8> %va, <16 x i8> %vb, i32 5, <16 x i1> %mask, i32 %evla, i32 %evlb)
@@ -182,10 +174,9 @@ define <16 x i8> @test_vp_splice_v16i8_masked(<16 x i8> %va, <16 x i8> %vb, <16
 define <2 x double> @test_vp_splice_v2f64(<2 x double> %va, <2 x double> %vb, i32 zeroext %evla, i32 zeroext %evlb) {
 ; CHECK-LABEL: test_vp_splice_v2f64:
 ; CHECK:       # %bb.0:
-; CHECK-NEXT:    addi a0, a0, -5
-; CHECK-NEXT:    vsetvli zero, a0, e64, m1, ta, ma
-; CHECK-NEXT:    vslidedown.vi v8, v8, 5
 ; CHECK-NEXT:    vsetvli zero, a1, e64, m1, ta, ma
+; CHECK-NEXT:    vslidedown.vi v8, v8, 5
+; CHECK-NEXT:    addi a0, a0, -5
 ; CHECK-NEXT:    vslideup.vx v8, v9, a0
 ; CHECK-NEXT:    ret
 
@@ -197,9 +188,8 @@ define <2 x double> @test_vp_splice_v2f64_negative_offset(<2 x double> %va, <2 x
 ; CHECK-LABEL: test_vp_splice_v2f64_negative_offset:
 ; CHECK:       # %bb.0:
 ; CHECK-NEXT:    addi a0, a0, -5
-; CHECK-NEXT:    vsetivli zero, 5, e64, m1, ta, ma
-; CHECK-NEXT:    vslidedown.vx v8, v8, a0
 ; CHECK-NEXT:    vsetvli zero, a1, e64, m1, ta, ma
+; CHECK-NEXT:    vslidedown.vx v8, v8, a0
 ; CHECK-NEXT:    vslideup.vi v8, v9, 5
 ; CHECK-NEXT:    ret
 
@@ -210,10 +200,10 @@ define <2 x double> @test_vp_splice_v2f64_negative_offset(<2 x double> %va, <2 x
 define <2 x double> @test_vp_splice_v2f64_masked(<2 x double> %va, <2 x double> %vb, <2 x i1> %mask, i32 zeroext %evla, i32 zeroext %evlb) {
 ; CHECK-LABEL: test_vp_splice_v2f64_masked:
 ; CHECK:       # %bb.0:
-; CHECK-NEXT:    addi a0, a0, -5
-; CHECK-NEXT:    vsetvli zero, a0, e64, m1, ta, ma
+; CHECK-NEXT:    vsetvli zero, a1, e64, m1, ta, ma
 ; CHECK-NEXT:    vslidedown.vi v8, v8, 5, v0.t
-; CHECK-NEXT:    vsetvli zero, a1, e64, m1, ta, mu
+; CHECK-NEXT:    addi a0, a0, -5
+; CHECK-NEXT:    vsetvli zero, zero, e64, m1, ta, mu
 ; CHECK-NEXT:    vslideup.vx v8, v9, a0, v0.t
 ; CHECK-NEXT:    ret
   %v = call <2 x double> @llvm.experimental.vp.splice.v2f64(<2 x double> %va, <2 x double> %vb, i32 5, <2 x i1> %mask, i32 %evla, i32 %evlb)
@@ -223,10 +213,9 @@ define <2 x double> @test_vp_splice_v2f64_masked(<2 x double> %va, <2 x double>
 define <4 x float> @test_vp_splice_v4f32(<4 x float> %va, <4 x float> %vb, i32 zeroext %evla, i32 zeroext %evlb) {
 ; CHECK-LABEL: test_vp_splice_v4f32:
 ; CHECK:       # %bb.0:
-; CHECK-NEXT:    addi a0, a0, -5
-; CHECK-NEXT:    vsetvli zero, a0, e32, m1, ta, ma
-; CHECK-NEXT:    vslidedown.vi v8, v8, 5
 ; CHECK-NEXT:    vsetvli zero, a1, e32, m1, ta, ma
+; CHECK-NEXT:    vslidedown.vi v8, v8, 5
+; CHECK-NEXT:    addi a0, a0, -5
 ; CHECK-NEXT:    vslideup.vx v8, v9, a0
 ; CHECK-NEXT:    ret
 
@@ -238,9 +227,8 @@ define <4 x float> @test_vp_splice_v4f32_negative_offset(<4 x float> %va, <4 x f
 ; CHECK-LABEL: test_vp_splice_v4f32_negative_offset:
 ; CHECK:       # %bb.0:
 ; CHECK-NEXT:    addi a0, a0, -5
-; CHECK-NEXT:    vsetivli zero, 5, e32, m1, ta, ma
-; CHECK-NEXT:    vslidedown.vx v8, v8, a0
 ; CHECK-NEXT:    vsetvli zero, a1, e32, m1, ta, ma
+; CHECK-NEXT:    vslidedown.vx v8, v8, a0
 ; CHECK-NEXT:    vslideup.vi v8, v9, 5
 ; CHECK-NEXT:    ret
 
@@ -251,10 +239,10 @@ define <4 x float> @test_vp_splice_v4f32_negative_offset(<4 x float> %va, <4 x f
 define <4 x float> @test_vp_splice_v4f32_masked(<4 x float> %va, <4 x float> %vb, <4 x i1> %mask, i32 zeroext %evla, i32 zeroext %evlb) {
 ; CHECK-LABEL: test_vp_splice_v4f32_masked:
 ; CHECK:       # %bb.0:
-; CHECK-NEXT:    addi a0, a0, -5
-; CHECK-NEXT:    vsetvli zero, a0, e32, m1, ta, ma
+; CHECK-NEXT:    vsetvli zero, a1, e32, m1, ta, ma
 ; CHECK-NEXT:    vslidedown.vi v8, v8, 5, v0.t
-; CHECK-NEXT:    vsetvli zero, a1, e32, m1, ta, mu
+; CHECK-NEXT:    addi a0, a0, -5
+; CHECK-NEXT:    vsetvli zero, zero, e32, m1, ta, mu
 ; CHECK-NEXT:    vslideup.vx v8, v9, a0, v0.t
 ; CHECK-NEXT:    ret
   %v = call <4 x float> @llvm.experimental.vp.splice.v4f32(<4 x float> %va, <4 x float> %vb, i32 5, <4 x i1> %mask, i32 %evla, i32 %evlb)
@@ -264,10 +252,9 @@ define <4 x float> @test_vp_splice_v4f32_masked(<4 x float> %va, <4 x float> %vb
 define <8 x half> @test_vp_splice_v8f16(<8 x half> %va, <8 x half> %vb, i32 zeroext %evla, i32 zeroext %evlb) {
 ; CHECK-LABEL: test_vp_splice_v8f16:
 ; CHECK:       # %bb.0:
-; CHECK-NEXT:    addi a0, a0, -5
-; CHECK-NEXT:    vsetvli zero, a0, e16, m1, ta, ma
-; CHECK-NEXT:    vslidedown.vi v8, v8, 5
 ; CHECK-NEXT:    vsetvli zero, a1, e16, m1, ta, ma
+; CHECK-NEXT:    vslidedown.vi v8, v8, 5
+; CHECK-NEXT:    addi a0, a0, -5
 ; CHECK-NEXT:    vslideup.vx v8, v9, a0
 ; CHECK-NEXT:    ret
 
@@ -279,9 +266,8 @@ define <8 x half> @test_vp_splice_v8f16_negative_offset(<8 x half> %va, <8 x hal
 ; CHECK-LABEL: test_vp_splice_v8f16_negative_offset:
 ; CHECK:       # %bb.0:
 ; CHECK-NEXT:    addi a0, a0, -5
-; CHECK-NEXT:    vsetivli zero, 5, e16, m1, ta, ma
-; CHECK-NEXT:    vslidedown.vx v8, v8, a0
 ; CHECK-NEXT:    vsetvli zero, a1, e16, m1, ta, ma
+; CHECK-NEXT:    vslidedown.vx v8, v8, a0
 ; CHECK-NEXT:    vslideup.vi v8, v9, 5
 ; CHECK-NEXT:    ret
 
@@ -292,10 +278,10 @@ define <8 x half> @test_vp_splice_v8f16_negative_offset(<8 x half> %va, <8 x hal
 define <8 x half> @test_vp_splice_v8f16_masked(<8 x half> %va, <8 x half> %vb, <8 x i1> %mask, i32 zeroext %evla, i32 zeroext %evlb) {
 ; CHECK-LABEL: test_vp_splice_v8f16_masked:
 ; CHECK:       # %bb.0:
-; CHECK-NEXT:    addi a0, a0, -5
-; CHECK-NEXT:    vsetvli zero, a0, e16, m1, ta, ma
+; CHECK-NEXT:    vsetvli zero, a1, e16, m1, ta, ma
 ; CHECK-NEXT:    vslidedown.vi v8, v8, 5, v0.t
-; CHECK-NEXT:    vsetvli zero, a1, e16, m1, ta, mu
+; CHECK-NEXT:    addi a0, a0, -5
+; CHECK-NEXT:    vsetvli zero, zero, e16, m1, ta, mu
 ; CHECK-NEXT:    vslideup.vx v8, v9, a0, v0.t
 ; CHECK-NEXT:    ret
   %v = call <8 x half> @llvm.experimental.vp.splice.v8f16(<8 x half> %va, <8 x half> %vb, i32 5, <8 x i1> %mask, i32 %evla, i32 %evlb)
@@ -364,10 +350,9 @@ define <4 x half> @test_vp_splice_nxv2f16_with_firstelt(half %first, <4 x half>
 define <8 x bfloat> @test_vp_splice_v8bf16(<8 x bfloat> %va, <8 x bfloat> %vb, i32 zeroext %evla, i32 zeroext %evlb) {
 ; CHECK-LABEL: test_vp_splice_v8bf16:
 ; CHECK:       # %bb.0:
-; CHECK-NEXT:    addi a0, a0, -5
-; CHECK-NEXT:    vsetvli zero, a0, e16, m1, ta, ma
-; CHECK-NEXT:    vslidedown.vi v8, v8, 5
 ; CHECK-NEXT:    vsetvli zero, a1, e16, m1, ta, ma
+; CHECK-NEXT:    vslidedown.vi v8, v8, 5
+; CHECK-NEXT:    addi a0, a0, -5
 ; CHECK-NEXT:    vslideup.vx v8, v9, a0
 ; CHECK-NEXT:    ret
 
@@ -379,9 +364,8 @@ define <8 x bfloat> @test_vp_splice_v8bf16_negative_offset(<8 x bfloat> %va, <8
 ; CHECK-LABEL: test_vp_splice_v8bf16_negative_offset:
 ; CHECK:       # %bb.0:
 ; CHECK-NEXT:    addi a0, a0, -5
-; CHECK-NEXT:    vsetivli zero, 5, e16, m1, ta, ma
-; CHECK-NEXT:    vslidedown.vx v8, v8, a0
 ; CHECK-NEXT:    vsetvli zero, a1, e16, m1, ta, ma
+; CHECK-NEXT:    vslidedown.vx v8, v8, a0
 ; CHECK-NEXT:    vslideup.vi v8, v9, 5
 ; CHECK-NEXT:    ret
 
@@ -392,10 +376,10 @@ define <8 x bfloat> @test_vp_splice_v8bf16_negative_offset(<8 x bfloat> %va, <8
 define <8 x bfloat> @test_vp_splice_v8bf16_masked(<8 x bfloat> %va, <8 x bfloat> %vb, <8 x i1> %mask, i32 zeroext %evla, i32 zeroext %evlb) {
 ; CHECK-LABEL: test_vp_splice_v8bf16_masked:
 ; CHECK:       # %bb.0:
-; CHECK-NEXT:    addi a0, a0, -5
-; CHECK-NEXT:    vsetvli zero, a0, e16, m1, ta, ma
+; CHECK-NEXT:    vsetvli zero, a1, e16, m1, ta, ma
 ; CHECK-NEXT:    vslidedown.vi v8, v8, 5, v0.t
-; CHECK-NEXT:    vsetvli zero, a1, e16, m1, ta, mu
+; CHECK-NEXT:    addi a0, a0, -5
+; CHECK-NEXT:    vsetvli zero, zero, e16, m1, ta, mu
 ; CHECK-NEXT:    vslideup.vx v8, v9, a0, v0.t
 ; CHECK-NEXT:    ret
   %v = call <8 x bfloat> @llvm.experimental.vp.splice.v8bf16(<8 x bfloat> %va, <8 x bfloat> %vb, i32 5, <8 x i1> %mask, i32 %evla, i32 %evlb)
diff --git a/llvm/test/CodeGen/RISCV/rvv/vector-splice.ll b/llvm/test/CodeGen/RISCV/rvv/vector-splice.ll
index 90d798b167cfc..87b6442c38a42 100644
--- a/llvm/test/CodeGen/RISCV/rvv/vector-splice.ll
+++ b/llvm/test/CodeGen/RISCV/rvv/vector-splice.ll
@@ -42,10 +42,8 @@ define <vscale x 1 x i1> @splice_nxv1i1_offset_max(<vscale x 1 x i1> %a, <vscale
 ; CHECK-NEXT:    vmv1r.v v0, v9
 ; CHECK-NEXT:    vmerge.vim v8, v8, 1, v0
 ; CHECK-NEXT:    srli a0, a0, 3
-; CHECK-NEXT:    addi a0, a0, -1
-; CHECK-NEXT:    vsetvli zero, a0, e8, mf8, ta, ma
 ; CHECK-NEXT:    vslidedown.vi v8, v8, 1
-; CHECK-NEXT:    vsetvli a1, zero, e8, mf8, ta, ma
+; CHECK-NEXT:    addi a0, a0, -1
 ; CHECK-NEXT:    vslideup.vx v8, v10, a0
 ; CHECK-NEXT:    vand.vi v8, v8, 1
 ; CHECK-NEXT:    vmsne.vi v0, v8, 0
@@ -90,10 +88,8 @@ define <vscale x 2 x i1> @splice_nxv2i1_offset_max(<vscale x 2 x i1> %a, <vscale
 ; CHECK-NEXT:    vmv1r.v v0, v9
 ; CHECK-NEXT:    vmerge.vim v8, v8, 1, v0
 ; CHECK-NEXT:    srli a0, a0, 2
-; CHECK-NEXT:    addi a0, a0, -3
-; CHECK-NEXT:    vsetvli zero, a0, e8, mf4, ta, ma
 ; CHECK-NEXT:    vslidedown.vi v8, v8, 3
-; CHECK-NEXT:    vsetvli a1, zero, e8, mf4, ta, ma
+; CHECK-NEXT:    addi a0, a0, -3
 ; CHECK-NEXT:    vslideup.vx v8, v10, a0
 ; CHECK-NEXT:    vand.vi v8, v8, 1
 ; CHECK-NEXT:    vmsne.vi v0, v8, 0
@@ -138,10 +134,8 @@ define <vscale x 4 x i1> @splice_nxv4i1_offset_max(<vscale x 4 x i1> %a, <vscale
 ; CHECK-NEXT:    vmv1r.v v0, v9
 ; CHECK-NEXT:    vmerge.vim v8, v8, 1, v0
 ; CHECK-NEXT:    srli a0, a0, 1
-; CHECK-NEXT:    addi a0, a0, -7
-; CHECK-NEXT:    vsetvli zero, a0, e8, mf2, ta, ma
 ; CHECK-NEXT:    vslidedown.vi v8, v8, 7
-; CHECK-NEXT:    vsetvli a1, zero, e8, mf2, ta, ma
+; CHECK-NEXT:    addi a0, a0, -7
 ; CHECK-NEXT:    vslideup.vx v8, v10, a0
 ; CHECK-NEXT:    vand.vi v8, v8, 1
 ; CHECK-NEXT:    vmsne.vi v0, v8, 0
@@ -184,10 +178,8 @@ define <vscale x 8 x i1> @splice_nxv8i1_offset_max(<vscale x 8 x i1> %a, <vscale
 ; CHECK-NEXT:    vmerge.vim v10, v8, 1, v0
 ; CHECK-NEXT:    vmv1r.v v0, v9
 ; CHECK-NEXT:    vmerge.vim v8, v8, 1, v0
-; CHECK-NEXT:    addi a0, a0, -15
-; CHECK-NEXT:    vsetvli zero, a0, e8, m1, ta, ma
 ; CHECK-NEXT:    vslidedown.vi v8, v8, 15
-; CHECK-NEXT:    vsetvli a1, zero, e8, m1, ta, ma
+; CHECK-NEXT:    addi a0, a0, -15
 ; CHECK-NEXT:    vslideup.vx v8, v10, a0
 ; CHECK-NEXT:    vand.vi v8, v8, 1
 ; CHECK-NEXT:    vmsne.vi v0, v8, 0
@@ -232,10 +224,8 @@ define <vscale x 16 x i1> @splice_nxv16i1_offset_max(<vscale x 16 x i1> %a, <vsc
 ; CHECK-NEXT:    vmv1r.v v0, v9
 ; CHECK-NEXT:    vmerge.vim v8, v10, 1, v0
 ; CHECK-NEXT:    slli a0, a0, 1
-; CHECK-NEXT:    addi a0, a0, -31
-; CHECK-NEXT:    vsetvli zero, a0, e8, m2, ta, ma
 ; CHECK-NEXT:    vslidedown.vi v8, v8, 31
-; CHECK-NEXT:    vsetvli a1, zero, e8, m2, ta, ma
+; CHECK-NEXT:    addi a0, a0, -31
 ; CHECK-NEXT:    vslideup.vx v8, v12, a0
 ; CHECK-NEXT:    vand.vi v8, v8, 1
 ; CHECK-NEXT:    vmsne.vi v0, v8, 0
@@ -272,19 +262,19 @@ define <vscale x 32 x i1> @splice_nxv32i1_offset_max(<vscale x 32 x i1> %a, <vsc
 ; CHECK-LABEL: splice_nxv32i1_offset_max:
 ; CHECK:       # %bb.0:
 ; CHECK-NEXT:    vsetvli a0, zero, e8, m4, ta, ma
+; CHECK-NEXT:    vmv1r.v v9, v0
+; CHECK-NEXT:    vmv1r.v v0, v8
 ; CHECK-NEXT:    vmv.v.i v12, 0
-; CHECK-NEXT:    csrr a0, vlenb
-; CHECK-NEXT:    li a1, 63
+; CHECK-NEXT:    li a0, 63
 ; CHECK-NEXT:    vmerge.vim v16, v12, 1, v0
+; CHECK-NEXT:    vmv1r.v v0, v9
+; CHECK-NEXT:    vmerge.vim v8, v12, 1, v0
+; CHECK-NEXT:    vslidedown.vx v8, v8, a0
+; CHECK-NEXT:    csrr a0, vlenb
 ; CHECK-NEXT:    slli a0, a0, 2
 ; CHECK-NEXT:    addi a0, a0, -63
-; CHECK-NEXT:    vsetvli zero, a0, e8, m4, ta, ma
-; CHECK-NEXT:    vslidedown.vx v16, v16, a1
-; CHECK-NEXT:    vmv1r.v v0, v8
...
[truncated]

@topperc
Collaborator

topperc commented Jul 2, 2025

> This is operating under the assumption that a vl toggle is more expensive than performing the vslidedown at a higher vl on the average microarchitecture.

On X280, the VL toggle is better. The scalar unit is running ahead of the vector unit. The VL has been computed before the vector unit needs it. The VL is used to reduce the number of uops in the vector unit.

@lukel97
Contributor Author

lukel97 commented Jul 2, 2025

> > This is operating under the assumption that a vl toggle is more expensive than performing the vslidedown at a higher vl on the average microarchitecture.
>
> On X280, the VL toggle is better. The scalar unit is running ahead of the vector unit. The VL has been computed before the vector unit needs it. The VL is used to reduce the number of uops in the vector unit.

In RISCVInsertVSETVLI we increase vl for vslides with an immediate of 1:

So for non-EVL loops today we actually already omit the toggle. (I'm not sure why the EVL path doesn't trigger this.)

Should we add a tuning feature to gate this?

@topperc
Collaborator

topperc commented Jul 2, 2025

> > > This is operating under the assumption that a vl toggle is more expensive than performing the vslidedown at a higher vl on the average microarchitecture.
> >
> > On X280, the VL toggle is better. The scalar unit is running ahead of the vector unit. The VL has been computed before the vector unit needs it. The VL is used to reduce the number of uops in the vector unit.
>
> In RISCVInsertVSETVLI we increase vl for vslides with an immediate of 1:

That code has an LMUL=1 restriction, so that helps. Though X280's ALU width is VLEN/2.

> So for non-EVL loops today we actually already omit the toggle. (I'm not sure why the EVL path doesn't trigger this.)
>
> Should we add a tuning feature to gate this?

I think so.

; CHECK-NEXT: vsetvli zero, a1, e64, m1, ta, ma
; CHECK-NEXT: vslidedown.vi v8, v8, 5
; CHECK-NEXT: addi a0, a0, -5
; CHECK-NEXT: vslideup.vx v8, v9, a0
; CHECK-NEXT: ret

Collaborator

Isn't this test (and a bunch of others in this file) ill-defined? The specification of this intrinsic says the offset starts the selected region from the concatenation. Since the concatenation is at most 4 elements long, isn't the result always poison?

Contributor Author

Yeah, I think so. I guess it's because we just decided to use 5 as the offset for all these examples.

Collaborator

I assume a copy-paste mistake. The immediate for 2xi64 needs to be in [-2, 1]. The splice result must include at least one element from the first vector of the concatenation.

@preames
Collaborator

preames commented Jul 2, 2025

For hardware which scales with LMUL (e.g. bp3) this is probably net profitable. Yeah, it seems like we might need a tuning flag.
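
A gate on the new behaviour could plausibly take the shape of the fragment below inside lowerVECTOR_SPLICE (sketch only; the subtarget query name is hypothetical and not an existing RISCVSubtarget API, while getVSlidedown and DAG.getRegister are used as in the patch):

    // Prefer the wider vslidedown over an extra vsetvli only on subtargets
    // that say a vl toggle is the more expensive of the two.
    SDValue DownAVL = Subtarget.preferSlideAtHigherVLOverVLToggle()
                          ? DAG.getRegister(RISCV::X0, XLenVT) // VLMAX, matches the vslideup
                          : UpOffset;                          // minimal AVL, extra toggle
    SDValue SlideDown =
        getVSlidedown(DAG, Subtarget, DL, VecVT, DAG.getUNDEF(VecVT), V1,
                      DownOffset, TrueMask, DownAVL);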
