Description
On the spacemit-x60, GCC 14 is ~24% faster on the 525.x264_r SPEC CPU 2017 benchmark than a recent build of Clang.
A big chunk of this difference is due to GCC tail folding its loops with VL, whereas LLVM doesn't by default.
Because LLVM doesn't tail fold its loops, it generates both a vectorized body and a scalar epilogue. A minimum trip count >= VF is required to execute the vectorized body; otherwise only the scalar epilogue runs.
On 525.x264_r, there are some very hot functions (e.g. `get_ref`) which never meet the minimum trip count, so the vector code is never run. Tail folding avoids this issue and allows us to run the vectorized body every time.
There are likely other performance benefits to be had with tail folding with VL, so it seems worthwhile exploring.
"EVL tail folding" (LLVM's vector-predication terminology for VL tail folding), can be enabled from Clang with -mllvm -prefer-predicate-over-epilogue=predicate-else-scalar-epilogue -mllvm -force-tail-folding-style=data-with-evl
. It initially landed in #76172 but it isn't enabled by default yet due to support for it not being fully complete, both in the loop vectorizer and elsewhere in the RISC-V backend.
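For reference, a full invocation might look like this; the target triple, `-march` string, and file names are placeholders, and only the two `-mllvm` flags come from this issue:

```shell
# Placeholders everywhere except the two -mllvm flags described above.
clang --target=riscv64-linux-gnu -march=rv64gcv -O3 \
  -mllvm -prefer-predicate-over-epilogue=predicate-else-scalar-epilogue \
  -mllvm -force-tail-folding-style=data-with-evl \
  loop.c -o loop
```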
This issue aims to track what work is needed across the LLVM project to bring it up to a stable state, at which point we can evaluate its performance to see if it should be enabled by default.
It's not a complete list and only contains the tasks that I've noticed so far. Please feel free to edit and add to it!
I presume we will find more things that need to be addressed as time goes on.
- Set up CI infrastructure for -force-tail-folding-style=data-with-evl
- Likely need a buildbot that runs llvm-test-suite in this configuration, similar to the AArch64 sve2 buildbots
- Need to make sure to test with `rvv_vl_half_avl` to catch bugs where `vl` may be any value in `[ceil(AVL/2), VLMAX]` when `VLMAX < AVL < 2*VLMAX`
- Igalia is running nightly SPEC CPU 2017 benchmarking with EVL tail folding via LNT on the spacemit-x60
- The spacemit-x60 doesn't implement the `[ceil(AVL/2), VLMAX]` if `VLMAX < AVL < 2*VLMAX` behaviour, so we may be missing bugs here. We probably want to also test SPEC on qemu with `rvv_vl_half_avl`
- Igalia is also hosting a 2-stage Clang RVA23 EVL tail folding buildbot
- Complete the redeployment of the Igalia rva23 EVL tail folding buildbot. See Finalise setup of buildbot for RISC-V RVA23 EVL tail folding #123947 and [RISCV] Move rva23 evl builder over to cross-compile and execute under qemu-system setup llvm-zorg#358
- Address known miscompiles
- [LV][EVL] Incorrect behavior of fixed-order recurrence idiom with EVL tail folding #122461
- Handle multi-exit loops
- Fix cases that abort vectorization entirely
- On SPEC CPU 2017 as of 02403f4, with EVL tail folding we vectorize 57% fewer loops than were previously vectorized. This is likely due to vectorization aborting when it encounters unimplemented cases:
- VPWidenIntOrFpInductionRecipe
- VPWidenPointerInductionRecipe
- Fixed-length VFs: There are cases where scalable vectorization isn’t possible and we currently don't allow fixed-length VFs, so presumably nothing gets vectorized in this case.
- Cases where the RISC-V cost model may have become unprofitable with EVL tail folding
- Implement support for EVL tail folding in other parts of the loop vectorizer
- Fixed-order recurrences (will fall back to DataWithoutLaneMask style after [LV][EVL] Disable fixed-order recurrence idiom with EVL tail folding. #122458)
- [LV][EVL] Support interleaved accesses for EVL tail folding. #123201 (for performance)
- [LV] Enable max safe distance in predicated DataWithEVL vectorization mode. #100755
- [VPlan] Use VPWidenIntrinsicRecipe to support binary and unary operations with EVL-vectorization #114205 (see note on RISCVVLOptimizer below)
- Extend RISC-V codegen
- Handle VP intrinsics in more places
- Segmented accesses: [IA][RISCV] Support VP loads/stores in InterleavedAccessPass #120490 for power-of-two cases
- [IR][RISCV] Add llvm.vector.(de)interleave3/5/7 #124825 codegen support for (de)interleave3/5/7. We still need to teach vectorizers and InterleavedAccessPass about them.
- Strided accesses in RISCVGatherScatterLowering
- [RISCV] Allow non-loop invariant steps in RISCVGatherScatterLowering #122244
- [RISCV] Support vp.{gather,scatter} in RISCVGatherScatterLowering #122232
- [LV][EVL] Generate negative strided load/store for reversed load/store #123608
- Eventually, the loop vectorizer should be taught to emit `vp.strided.{load,store}` intrinsics directly (cc @nikolaypanchenko)
- [RISCV] Fold vp.reverse(vp.load(ADDR, MASK)) -> vp.strided.load(ADDR, -1, MASK). #123115
- [RISCV] Fold vp.store(vp.reverse(VAL), ADDR, MASK) -> vp.strided.store(VAL, NEW_ADDR, -1, MASK) #123123
- RISCVVLOptimizer
- The VL optimizer may have made non-trapping VP intrinsics redundant. We should evaluate if we still need to transform intrinsics/calls/binops to VP intrinsics in the LV
- Handling tied operands in ternary pseudos in RISCVVLOptimizer #123760
- [LV] Introduce the EVLIVSimplify Pass for EVL-vectorized loops #91796
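As a sketch of the `rvv_vl_half_avl` qemu testing mentioned in the CI section above (the `vlen` value and binary name are placeholders; I'm assuming the qemu user-mode CPU property spelling here):

```shell
# Run a RISC-V binary with V enabled and the rvv_vl_half_avl CPU
# property, so vsetvli may return ceil(AVL/2) when
# VLMAX < AVL < 2*VLMAX, exposing code that wrongly assumes
# vl == min(AVL, VLMAX).
qemu-riscv64 -cpu rv64,v=on,vlen=256,rvv_vl_half_avl=on ./loop
```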