[RISCV] Optimize two source deinterleave2 via ri.vunzip2{a,b} #142667

Merged: 4 commits into llvm:main, Jun 4, 2025

Conversation

preames (Collaborator) commented Jun 3, 2025

As done for the existing vnsrl cases, we can split a two-source deinterleave2
into two single-source deinterleave2 operations and a slideup. We can also use
a concat-then-deinterleave2 tactic. Both are equally valid (except in the m8
source type case), and the concat-then-deinterleave2 saves one instruction
for fractional LMUL cases.

Additionally, if we happen to know the exact VLEN and our fixed vectors span
an even number of vector registers, we can avoid the need to split or concat
entirely and just use both register sources directly.

In the review, I included these as separate changes since I find that slightly
easier to follow. I can either land these squashed or individually as reviewers
prefer.

preames added 2 commits June 3, 2025 13:10
The motivation is basically the same as for the vnsrl cases; we'd rather do
three simple linear-in-LMUL operations than fall back to a vrgather on at
least one source.
This allows us to use a single instruction instead of needing to
split and slide.
llvmbot (Member) commented Jun 3, 2025

@llvm/pr-subscribers-backend-risc-v

Author: Philip Reames (preames)

Changes

As done for the existing vnsrl cases, we can split a two-source deinterleave2
into two single-source deinterleave2 operations and a slideup. Additionally, if
we happen to know the exact VLEN and our fixed vectors span an even number of
vector registers, we can avoid the need to split and just use both register sources.

In the review, I included these as separate changes since I find that slightly
easier to follow. I can either land these squashed or individually as reviewers
prefer.


Full diff: https://github.com/llvm/llvm-project/pull/142667.diff

2 Files Affected:

  • (modified) llvm/lib/Target/RISCV/RISCVISelLowering.cpp (+17)
  • (modified) llvm/test/CodeGen/RISCV/rvv/fixed-vectors-shuffle-deinterleave2.ll (+21-61)
diff --git a/llvm/lib/Target/RISCV/RISCVISelLowering.cpp b/llvm/lib/Target/RISCV/RISCVISelLowering.cpp
index f74ca2a1c5492..777f4f91908d4 100644
--- a/llvm/lib/Target/RISCV/RISCVISelLowering.cpp
+++ b/llvm/lib/Target/RISCV/RISCVISelLowering.cpp
@@ -5830,6 +5830,9 @@ static SDValue lowerVECTOR_SHUFFLE(SDValue Op, SelectionDAG &DAG,
         Index == 0 ? RISCVISD::RI_VUNZIP2A_VL : RISCVISD::RI_VUNZIP2B_VL;
     if (V2.isUndef())
       return lowerVZIP(Opc, V1, V2, DL, DAG, Subtarget);
+    if (auto VLEN = Subtarget.getRealVLen();
+        VLEN && VT.getSizeInBits().getKnownMinValue() % *VLEN == 0)
+      return lowerVZIP(Opc, V1, V2, DL, DAG, Subtarget);
     if (SDValue Src = foldConcatVector(V1, V2)) {
       EVT NewVT = VT.getDoubleNumVectorElementsVT();
       Src = DAG.getExtractSubvector(DL, NewVT, Src, 0);
@@ -5837,6 +5840,20 @@ static SDValue lowerVECTOR_SHUFFLE(SDValue Op, SelectionDAG &DAG,
           lowerVZIP(Opc, Src, DAG.getUNDEF(NewVT), DL, DAG, Subtarget);
       return DAG.getExtractSubvector(DL, VT, Res, 0);
     }
+    // Narrow each source and concatenate them.
+    // FIXME: For small LMUL it is better to concatenate first.
+    if (1 < count_if(Mask,
+                     [&Mask](int Idx) { return Idx < (int)Mask.size(); }) &&
+        1 < count_if(Mask,
+                     [&Mask](int Idx) { return Idx >= (int)Mask.size(); })) {
+      SDValue Lo = lowerVZIP(Opc, V1, DAG.getUNDEF(VT), DL, DAG, Subtarget);
+      SDValue Hi = lowerVZIP(Opc, V2, DAG.getUNDEF(VT), DL, DAG, Subtarget);
+
+      MVT SubVT = VT.getHalfNumVectorElementsVT();
+      return DAG.getNode(ISD::CONCAT_VECTORS, DL, VT,
+                         DAG.getExtractSubvector(DL, SubVT, Lo, 0),
+                         DAG.getExtractSubvector(DL, SubVT, Hi, 0));
+    }
   }
 
   if (SDValue V =
diff --git a/llvm/test/CodeGen/RISCV/rvv/fixed-vectors-shuffle-deinterleave2.ll b/llvm/test/CodeGen/RISCV/rvv/fixed-vectors-shuffle-deinterleave2.ll
index 9c884454aa025..14b0e8352efa3 100644
--- a/llvm/test/CodeGen/RISCV/rvv/fixed-vectors-shuffle-deinterleave2.ll
+++ b/llvm/test/CodeGen/RISCV/rvv/fixed-vectors-shuffle-deinterleave2.ll
@@ -1364,13 +1364,11 @@ define <4 x i64> @unzip2a_dual_v4i64(<4 x i64> %a, <4 x i64> %b) {
 ;
 ; ZIP-LABEL: unzip2a_dual_v4i64:
 ; ZIP:       # %bb.0: # %entry
-; ZIP-NEXT:    vsetivli zero, 4, e64, m1, ta, mu
-; ZIP-NEXT:    vmv.v.i v0, 8
-; ZIP-NEXT:    vslideup.vi v10, v9, 2
-; ZIP-NEXT:    vslideup.vi v10, v9, 1, v0.t
-; ZIP-NEXT:    vmv.v.i v0, 12
-; ZIP-NEXT:    ri.vunzip2a.vv v11, v8, v9
-; ZIP-NEXT:    vmerge.vvm v8, v11, v10, v0
+; ZIP-NEXT:    vsetivli zero, 4, e64, m1, ta, ma
+; ZIP-NEXT:    ri.vunzip2a.vv v11, v9, v10
+; ZIP-NEXT:    ri.vunzip2a.vv v9, v8, v10
+; ZIP-NEXT:    vslideup.vi v9, v11, 2
+; ZIP-NEXT:    vmv.v.v v8, v9
 ; ZIP-NEXT:    ret
 entry:
   %c = shufflevector <4 x i64> %a, <4 x i64> %b, <4 x i32> <i32 0, i32 2, i32 4, i32 6>
@@ -1502,16 +1500,11 @@ define <16 x i64> @unzip2a_dual_v16i64(<16 x i64> %a, <16 x i64> %b) {
 ; ZIP-LABEL: unzip2a_dual_v16i64:
 ; ZIP:       # %bb.0: # %entry
 ; ZIP-NEXT:    vsetivli zero, 8, e64, m2, ta, ma
-; ZIP-NEXT:    ri.vunzip2a.vv v16, v8, v10
-; ZIP-NEXT:    vsetivli zero, 16, e16, m1, ta, ma
-; ZIP-NEXT:    vid.v v8
-; ZIP-NEXT:    li a0, -256
-; ZIP-NEXT:    vadd.vv v8, v8, v8
-; ZIP-NEXT:    vmv.s.x v0, a0
-; ZIP-NEXT:    vadd.vi v8, v8, -16
-; ZIP-NEXT:    vsetvli zero, zero, e64, m4, ta, mu
-; ZIP-NEXT:    vrgatherei16.vv v16, v12, v8, v0.t
-; ZIP-NEXT:    vmv.v.v v8, v16
+; ZIP-NEXT:    ri.vunzip2a.vv v16, v12, v14
+; ZIP-NEXT:    ri.vunzip2a.vv v12, v8, v10
+; ZIP-NEXT:    vsetivli zero, 16, e64, m4, ta, ma
+; ZIP-NEXT:    vslideup.vi v12, v16, 8
+; ZIP-NEXT:    vmv.v.v v8, v12
 ; ZIP-NEXT:    ret
 entry:
   %c = shufflevector <16 x i64> %a, <16 x i64> %b, <16 x i32> <i32 0, i32 2, i32 4, i32 6, i32 8, i32 10, i32 12, i32 14, i32 16, i32 18, i32 20, i32 22, i32 24, i32 26, i32 28, i32 30>
@@ -1557,13 +1550,9 @@ define <4 x i64> @unzip2a_dual_v4i64_exact(<4 x i64> %a, <4 x i64> %b) vscale_ra
 ;
 ; ZIP-LABEL: unzip2a_dual_v4i64_exact:
 ; ZIP:       # %bb.0: # %entry
-; ZIP-NEXT:    vsetivli zero, 4, e64, m1, ta, mu
-; ZIP-NEXT:    vmv.v.i v0, 8
-; ZIP-NEXT:    vslideup.vi v10, v9, 2
-; ZIP-NEXT:    vslideup.vi v10, v9, 1, v0.t
-; ZIP-NEXT:    vmv.v.i v0, 12
-; ZIP-NEXT:    ri.vunzip2a.vv v11, v8, v9
-; ZIP-NEXT:    vmerge.vvm v8, v11, v10, v0
+; ZIP-NEXT:    vsetivli zero, 4, e64, m1, ta, ma
+; ZIP-NEXT:    ri.vunzip2a.vv v10, v8, v9
+; ZIP-NEXT:    vmv.v.v v8, v10
 ; ZIP-NEXT:    ret
 entry:
   %c = shufflevector <4 x i64> %a, <4 x i64> %b, <4 x i32> <i32 0, i32 2, i32 4, i32 6>
@@ -1609,13 +1598,12 @@ define <4 x i64> @unzip2a_dual_v4i64_exact_nf2(<4 x i64> %a, <4 x i64> %b) vscal
 ;
 ; ZIP-LABEL: unzip2a_dual_v4i64_exact_nf2:
 ; ZIP:       # %bb.0: # %entry
-; ZIP-NEXT:    vsetivli zero, 4, e64, m1, ta, mu
-; ZIP-NEXT:    vmv.v.i v0, 8
-; ZIP-NEXT:    vslideup.vi v10, v9, 2
-; ZIP-NEXT:    vslideup.vi v10, v9, 1, v0.t
-; ZIP-NEXT:    vmv.v.i v0, 12
-; ZIP-NEXT:    ri.vunzip2a.vv v11, v8, v9
-; ZIP-NEXT:    vmerge.vvm v8, v11, v10, v0
+; ZIP-NEXT:    vsetivli zero, 4, e64, m1, ta, ma
+; ZIP-NEXT:    ri.vunzip2a.vv v11, v9, v10
+; ZIP-NEXT:    ri.vunzip2a.vv v9, v8, v10
+; ZIP-NEXT:    vsetvli zero, zero, e64, m1, tu, ma
+; ZIP-NEXT:    vslideup.vi v9, v11, 2
+; ZIP-NEXT:    vmv1r.v v8, v9
 ; ZIP-NEXT:    ret
 entry:
   %c = shufflevector <4 x i64> %a, <4 x i64> %b, <4 x i32> <i32 0, i32 2, i32 4, i32 6>
@@ -1740,37 +1728,9 @@ define <16 x i64> @unzip2a_dual_v16i64_exact(<16 x i64> %a, <16 x i64> %b) vscal
 ;
 ; ZIP-LABEL: unzip2a_dual_v16i64_exact:
 ; ZIP:       # %bb.0: # %entry
-; ZIP-NEXT:    vsetivli zero, 4, e64, m1, ta, mu
-; ZIP-NEXT:    vslideup.vi v18, v15, 2
-; ZIP-NEXT:    vmv.v.i v16, 8
-; ZIP-NEXT:    vmv.v.i v17, 12
-; ZIP-NEXT:    vslideup.vi v20, v13, 2
-; ZIP-NEXT:    vmv.v.v v0, v16
-; ZIP-NEXT:    vslideup.vi v18, v15, 1, v0.t
-; ZIP-NEXT:    ri.vunzip2a.vv v15, v14, v19
-; ZIP-NEXT:    vmv.v.v v0, v17
-; ZIP-NEXT:    vmerge.vvm v15, v15, v18, v0
-; ZIP-NEXT:    vmv.v.v v0, v16
-; ZIP-NEXT:    vslideup.vi v20, v13, 1, v0.t
-; ZIP-NEXT:    ri.vunzip2a.vv v14, v12, v13
-; ZIP-NEXT:    vslideup.vi v12, v11, 2
-; ZIP-NEXT:    vslideup.vi v18, v9, 2
-; ZIP-NEXT:    vmv.v.v v0, v17
-; ZIP-NEXT:    vmerge.vvm v14, v14, v20, v0
-; ZIP-NEXT:    li a0, -256
-; ZIP-NEXT:    ri.vunzip2a.vv v20, v10, v13
-; ZIP-NEXT:    ri.vunzip2a.vv v10, v8, v19
-; ZIP-NEXT:    vmv.v.v v0, v16
-; ZIP-NEXT:    vslideup.vi v12, v11, 1, v0.t
-; ZIP-NEXT:    vmv.v.v v0, v17
-; ZIP-NEXT:    vmerge.vvm v13, v20, v12, v0
-; ZIP-NEXT:    vmv.v.v v0, v16
-; ZIP-NEXT:    vslideup.vi v18, v9, 1, v0.t
-; ZIP-NEXT:    vmv.v.v v0, v17
-; ZIP-NEXT:    vmerge.vvm v12, v10, v18, v0
-; ZIP-NEXT:    vmv.s.x v0, a0
 ; ZIP-NEXT:    vsetivli zero, 16, e64, m4, ta, ma
-; ZIP-NEXT:    vmerge.vvm v8, v12, v12, v0
+; ZIP-NEXT:    ri.vunzip2a.vv v16, v8, v12
+; ZIP-NEXT:    vmv.v.v v8, v16
 ; ZIP-NEXT:    ret
 entry:
   %c = shufflevector <16 x i64> %a, <16 x i64> %b, <16 x i32> <i32 0, i32 2, i32 4, i32 6, i32 8, i32 10, i32 12, i32 14, i32 16, i32 18, i32 20, i32 22, i32 24, i32 26, i32 28, i32 30>

This saves one unzip instruction, and avoids a vsetvl toggle.
topperc (Collaborator) commented Jun 3, 2025

Are we intentionally missing coverage for vunzip2b?

preames (Collaborator, Author) commented Jun 3, 2025

Are we intentionally missing coverage for vunzip2b?

Not other than the fact it didn't seem interesting given the opcode and matching bits weren't changing. Happy to duplicate the tests if you like?

topperc (Collaborator) left a comment

LGTM

topperc (Collaborator) commented Jun 4, 2025

Are we intentionally missing coverage for vunzip2b?

Not other than the fact it didn't seem interesting given the opcode and matching bits weren't changing. Happy to duplicate the tests if you like?

As long as it was intentional I don't care that much. I wanted to make sure there weren't tests that should have updated but didn't.

@preames preames merged commit 88738a7 into llvm:main Jun 4, 2025
7 of 10 checks passed
@preames preames deleted the pr-xrivosvizip-split-vunzip2ab branch June 4, 2025 03:18
llvm-ci (Collaborator) commented Jun 4, 2025

LLVM Buildbot has detected a new failure on builder lldb-aarch64-ubuntu running on linaro-lldb-aarch64-ubuntu while building llvm at step 6 "test".

Full details are available at: https://lab.llvm.org/buildbot/#/builders/59/builds/18797

Here is the relevant piece of the build log for reference:
Step 6 (test) failure: build (failure)
...
PASS: lldb-unit :: ValueObject/./LLDBValueObjectTests/8/12 (2239 of 2248)
PASS: lldb-unit :: ValueObject/./LLDBValueObjectTests/9/12 (2240 of 2248)
PASS: lldb-unit :: tools/lldb-server/tests/./LLDBServerTests/0/2 (2241 of 2248)
PASS: lldb-unit :: tools/lldb-server/tests/./LLDBServerTests/1/2 (2242 of 2248)
PASS: lldb-unit :: Utility/./UtilityTests/4/9 (2243 of 2248)
PASS: lldb-unit :: Host/./HostTests/10/13 (2244 of 2248)
PASS: lldb-unit :: Target/./TargetTests/11/14 (2245 of 2248)
PASS: lldb-unit :: Host/./HostTests/2/13 (2246 of 2248)
PASS: lldb-unit :: Process/gdb-remote/./ProcessGdbRemoteTests/8/9 (2247 of 2248)
TIMEOUT: lldb-shell :: Settings/TestCxxFrameFormatRecursive.test (2248 of 2248)
******************** TEST 'lldb-shell :: Settings/TestCxxFrameFormatRecursive.test' FAILED ********************
Exit Code: -9
Timeout: Reached timeout of 600 seconds

Command Output (stderr):
--
split-file /home/tcwg-buildbot/worker/lldb-aarch64-ubuntu/llvm-project/lldb/test/Shell/Settings/TestCxxFrameFormatRecursive.test /home/tcwg-buildbot/worker/lldb-aarch64-ubuntu/build/tools/lldb/test/Shell/Settings/Output/TestCxxFrameFormatRecursive.test.tmp # RUN: at line 6
+ split-file /home/tcwg-buildbot/worker/lldb-aarch64-ubuntu/llvm-project/lldb/test/Shell/Settings/TestCxxFrameFormatRecursive.test /home/tcwg-buildbot/worker/lldb-aarch64-ubuntu/build/tools/lldb/test/Shell/Settings/Output/TestCxxFrameFormatRecursive.test.tmp
/home/tcwg-buildbot/worker/lldb-aarch64-ubuntu/build/bin/clang --target=specify-a-target-or-use-a-_host-substitution --target=aarch64-unknown-linux-gnu -pthread -fmodules-cache-path=/home/tcwg-buildbot/worker/lldb-aarch64-ubuntu/build/lldb-test-build.noindex/module-cache-clang/lldb-shell -g -gdwarf /home/tcwg-buildbot/worker/lldb-aarch64-ubuntu/build/tools/lldb/test/Shell/Settings/Output/TestCxxFrameFormatRecursive.test.tmp/main.cpp -o /home/tcwg-buildbot/worker/lldb-aarch64-ubuntu/build/tools/lldb/test/Shell/Settings/Output/TestCxxFrameFormatRecursive.test.tmp.out # RUN: at line 7
+ /home/tcwg-buildbot/worker/lldb-aarch64-ubuntu/build/bin/clang --target=specify-a-target-or-use-a-_host-substitution --target=aarch64-unknown-linux-gnu -pthread -fmodules-cache-path=/home/tcwg-buildbot/worker/lldb-aarch64-ubuntu/build/lldb-test-build.noindex/module-cache-clang/lldb-shell -g -gdwarf /home/tcwg-buildbot/worker/lldb-aarch64-ubuntu/build/tools/lldb/test/Shell/Settings/Output/TestCxxFrameFormatRecursive.test.tmp/main.cpp -o /home/tcwg-buildbot/worker/lldb-aarch64-ubuntu/build/tools/lldb/test/Shell/Settings/Output/TestCxxFrameFormatRecursive.test.tmp.out
clang: warning: argument unused during compilation: '-fmodules-cache-path=/home/tcwg-buildbot/worker/lldb-aarch64-ubuntu/build/lldb-test-build.noindex/module-cache-clang/lldb-shell' [-Wunused-command-line-argument]
/home/tcwg-buildbot/worker/lldb-aarch64-ubuntu/build/bin/lldb --no-lldbinit -S /home/tcwg-buildbot/worker/lldb-aarch64-ubuntu/build/tools/lldb/test/Shell/lit-lldb-init-quiet -o "settings set interpreter.stop-command-source-on-error false"        -x -b -s /home/tcwg-buildbot/worker/lldb-aarch64-ubuntu/build/tools/lldb/test/Shell/Settings/Output/TestCxxFrameFormatRecursive.test.tmp/commands.input /home/tcwg-buildbot/worker/lldb-aarch64-ubuntu/build/tools/lldb/test/Shell/Settings/Output/TestCxxFrameFormatRecursive.test.tmp.out -o exit 2>&1        | /home/tcwg-buildbot/worker/lldb-aarch64-ubuntu/build/bin/FileCheck /home/tcwg-buildbot/worker/lldb-aarch64-ubuntu/llvm-project/lldb/test/Shell/Settings/TestCxxFrameFormatRecursive.test # RUN: at line 8
+ /home/tcwg-buildbot/worker/lldb-aarch64-ubuntu/build/bin/FileCheck /home/tcwg-buildbot/worker/lldb-aarch64-ubuntu/llvm-project/lldb/test/Shell/Settings/TestCxxFrameFormatRecursive.test
+ /home/tcwg-buildbot/worker/lldb-aarch64-ubuntu/build/bin/lldb --no-lldbinit -S /home/tcwg-buildbot/worker/lldb-aarch64-ubuntu/build/tools/lldb/test/Shell/lit-lldb-init-quiet -o 'settings set interpreter.stop-command-source-on-error false' -x -b -s /home/tcwg-buildbot/worker/lldb-aarch64-ubuntu/build/tools/lldb/test/Shell/Settings/Output/TestCxxFrameFormatRecursive.test.tmp/commands.input /home/tcwg-buildbot/worker/lldb-aarch64-ubuntu/build/tools/lldb/test/Shell/Settings/Output/TestCxxFrameFormatRecursive.test.tmp.out -o exit

--

********************
********************
Timed Out Tests (1):
  lldb-shell :: Settings/TestCxxFrameFormatRecursive.test


Testing Time: 847.97s

Total Discovered Tests: 33274
  Skipped          :     1 (0.00%)
  Unsupported      :   520 (1.56%)
  Passed           : 32727 (98.36%)
  Expectedly Failed:    25 (0.08%)
  Timed Out        :     1 (0.00%)
FAILED: tools/lldb/test/CMakeFiles/check-lldb /home/tcwg-buildbot/worker/lldb-aarch64-ubuntu/build/tools/lldb/test/CMakeFiles/check-lldb 
cd /home/tcwg-buildbot/worker/lldb-aarch64-ubuntu/build/tools/lldb/test && /usr/bin/python3.10 /home/tcwg-buildbot/worker/lldb-aarch64-ubuntu/build/./bin/llvm-lit -v /home/tcwg-buildbot/worker/lldb-aarch64-ubuntu/build/tools/lldb/test
ninja: build stopped: subcommand failed.
