Description
On the spacemit-x60, GCC 14 is ~24% faster on the 525.x264_r SPEC CPU 2017 benchmark than a recent build of Clang.
A big chunk of this difference is due to GCC tail folding its loops with VL, whereas LLVM doesn't by default.
Because LLVM doesn't tail fold its loops, it generates both a vectorized body and a scalar epilogue. A minimum trip count >= VF is required to execute the vectorized body; otherwise only the scalar epilogue runs.
On 525.x264_r, there are some very hot functions (e.g. `get_ref`) which never meet the minimum trip count, so the vector code is never run. Tail folding avoids this issue and allows us to run the vectorized body every time.
There are likely other performance benefits to be had with tail folding with VL, so it seems worthwhile exploring.
"EVL tail folding" (LLVM's vector-predication terminology for VL tail folding), can be enabled from Clang with -mllvm -prefer-predicate-over-epilogue=predicate-else-scalar-epilogue -mllvm -force-tail-folding-style=data-with-evl
. It initially landed in #76172 but it isn't enabled by default yet due to support for it not being fully complete, both in the loop vectorizer and elsewhere in the RISC-V backend.
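For reference, a full invocation might look like this; the target triple, `-march` string, and file names are placeholders, and only the two `-mllvm` flags come from this issue:

```shell
# Placeholders everywhere except the two -mllvm flags described above.
clang --target=riscv64-linux-gnu -march=rv64gcv -O3 \
  -mllvm -prefer-predicate-over-epilogue=predicate-else-scalar-epilogue \
  -mllvm -force-tail-folding-style=data-with-evl \
  loop.c -o loop
```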
This issue aims to track what work is needed across the LLVM project to bring it up to a stable state, at which point we can evaluate its performance to see if it should be enabled by default.
It's not a complete list and only contains the tasks that I've noticed so far. Please feel free to edit and add to it!
I presume we will find more things that need to be addressed as time goes on.
- Set up CI infrastructure for -force-tail-folding-style=data-with-evl
- Likely need a buildbot that runs llvm-test-suite in this configuration, similar to the AArch64 sve2 buildbots
- Need to make sure to test with `rvv_vl_half_avl` to catch bugs where `vl` may be any value in `[ceil(AVL/2), VLMAX]` when `VLMAX < AVL < 2*VLMAX`
- Igalia is running nightly SPEC CPU 2017 benchmarking with EVL tail folding via LNT on the spacemit-x60
- The spacemit-x60 doesn't implement the `[ceil(AVL/2), VLMAX]` if `VLMAX < AVL < 2*VLMAX` behaviour, so we may be missing bugs here. We probably want to also test SPEC on qemu with `rvv_vl_half_avl`
- Igalia is also hosting a 2-stage Clang RVA23 EVL tail folding buildbot
- Complete the redeployment of the Igalia rva23 EVL tail folding buildbot. See Finalise setup of buildbot for RISC-V RVA23 EVL tail folding #123947 and [RISCV] Move rva23 evl builder over to cross-compile and execute under qemu-system setup llvm-zorg#358
- Address known miscompiles
- [LV][EVL] Incorrect behavior of fixed-order recurrence idiom with EVL tail folding #122461
- Handle multi-exit loops
- Fix cases that abort vectorization entirely
- On SPEC CPU 2017 as of 02403f4, with EVL tail folding we vectorize 57% fewer loops than were previously vectorized. This is likely due to vectorization aborting when it encounters unimplemented cases:
- VPWidenIntOrFpInductionRecipe
- VPWidenPointerInductionRecipe
- Fixed-length VFs: There are cases where scalable vectorization isn’t possible and we currently don't allow fixed-length VFs, so presumably nothing gets vectorized in this case.
- Cases where the RISC-V cost model may have become unprofitable with EVL tail folding
- Implement support for EVL tail folding in other parts of the loop vectorizer
- Fixed-order recurrences (will fall back to DataWithoutLaneMask style after [LV][EVL] Disable fixed-order recurrence idiom with EVL tail folding. #122458)
- [LV][EVL] Support interleaved accesses for EVL tail folding. #123201 (for performance)
- [LV] Enable max safe distance in predicated DataWithEVL vectorization mode. #100755
- [VPlan] Use VPWidenIntrinsicRecipe to support binary and unary operations with EVL-vectorization #114205 (see note on RISCVVLOptimizer below)
- Extend RISC-V codegen
- Handle VP intrinsics in more places
- Segmented accesses: [IA][RISCV] Support VP loads/stores in InterleavedAccessPass #120490 for power-of-two cases
- [IR][RISCV] Add llvm.vector.(de)interleave3/5/7 #124825 codegen support for (de)interleave3/5/7. We still need to teach vectorizers and InterleavedAccessPass about them.
- Strided accesses in RISCVGatherScatterLowering
- [RISCV] Allow non-loop invariant steps in RISCVGatherScatterLowering #122244
- [RISCV] Support vp.{gather,scatter} in RISCVGatherScatterLowering #122232
- [LV][EVL] Generate negative strided load/store for reversed load/store #123608
- Eventually, the loop vectorizer should be taught to emit `vp.strided.{load,store}` intrinsics directly (cc @nikolaypanchenko)
- [RISCV] Fold vp.reverse(vp.load(ADDR, MASK)) -> vp.strided.load(ADDR, -1, MASK). #123115
- [RISCV] Fold vp.store(vp.reverse(VAL), ADDR, MASK) -> vp.strided.store(VAL, NEW_ADDR, -1, MASK) #123123
- RISCVVLOptimizer
- The VL optimizer may have made non-trapping VP intrinsics redundant. We should evaluate if we still need to transform intrinsics/calls/binops to VP intrinsics in the LV
- Handling tied operands in ternary pseudos in RISCVVLOptimizer #123760
- [LV] Introduce the EVLIVSimplify Pass for EVL-vectorized loops #91796
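As a sketch of the `rvv_vl_half_avl` qemu testing mentioned in the CI section above (the `vlen` value and binary name are placeholders; I'm assuming the qemu user-mode CPU property spelling here):

```shell
# Run a RISC-V binary with V enabled and the rvv_vl_half_avl CPU
# property, so vsetvli may return ceil(AVL/2) when
# VLMAX < AVL < 2*VLMAX, exposing code that wrongly assumes
# vl == min(AVL, VLMAX).
qemu-riscv64 -cpu rv64,v=on,vlen=256,rvv_vl_half_avl=on ./loop
```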