Description
Hi Xianyi and Martin,
A colleague and I have spent the past few months working on an updated RISC-V Vector (RVV) implementation and would now like to contribute it back. Our motivation was to support OpenBLAS on the SiFive X280, which is RVV 1.0 compliant. We ran into a number of limitations in the existing C910V implementation, so we decided it was better to create *_rvv.c implementations within kernel/riscv64 and leave the existing *_vector.c implementation for C910V untouched.
Some of the improvements with this RVV implementation:
- Support for vector ABS (apparently not supported on C910V).
- Support for segment loads/stores for complex data types, which improves performance.
- Alternative (in our view better) implementation choices. Rather than looping with a predetermined vector length (VL) and then handling a separate tail, we found that determining VL inside the loop was actually more efficient (and gives cleaner, smaller code).
- Generally faster for most kernels.
- Significantly faster version of GEMM, modeled after what was done for SVE (8xVLEN).
- All implementations use vector intrinsics, even the faster GEMM. No assembler required.
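To illustrate the point about determining VL inside the loop, here is a minimal sketch in portable C. It is not taken from the proposed kernels and does not use RVV intrinsics: `model_vsetvl`, `axpy_dynamic_vl`, and `axpy_fixed_vl_with_tail` are hypothetical names, and `model_vsetvl` is a simplified scalar stand-in for what `vsetvl` provides on real hardware (it just returns the smaller of the remaining count and an assumed maximum VL).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the RVV vsetvl instruction: returns how many
 * elements the current iteration should process. Simplified model only:
 * min(remaining, vlmax). */
static size_t model_vsetvl(size_t remaining, size_t vlmax)
{
    assert(vlmax > 0);
    return remaining < vlmax ? remaining : vlmax;
}

/* y[i] += alpha * x[i], with VL chosen on every iteration.
 * The last iteration simply gets a shorter VL, so no tail loop is needed. */
void axpy_dynamic_vl(size_t n, float alpha, const float *x, float *y,
                     size_t vlmax)
{
    while (n > 0) {
        size_t vl = model_vsetvl(n, vlmax);
        /* This inner loop models one vector operation over vl lanes. */
        for (size_t i = 0; i < vl; i++)
            y[i] += alpha * x[i];
        n -= vl;
        x += vl;
        y += vl;
    }
}

/* Same operation with a fixed VL in the main loop and a separate tail. */
void axpy_fixed_vl_with_tail(size_t n, float alpha, const float *x, float *y,
                             size_t vlmax)
{
    assert(vlmax > 0);
    size_t i = 0;
    /* Main loop: only full blocks of vlmax elements. */
    for (; i + vlmax <= n; i += vlmax)
        for (size_t j = 0; j < vlmax; j++)
            y[i + j] += alpha * x[i + j];
    /* Tail: leftover elements handled one at a time. */
    for (; i < n; i++)
        y[i] += alpha * x[i];
}
```

Both functions compute the same result; the first needs only one loop body, which is where the smaller code comes from. Whether it is also faster depends on the hardware, as noted above.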
Before we submit a PR for your and others' review, what tests should we run? All of the CTESTs currently pass. In the course of our work, we’ve also run the benchmark suite many times.
Thank you,
Ken
P.S. Our eventual PR would add the following files, all of which are copyright OpenBLAS 2022:
kernel/riscv64/KERNEL.x280
kernel/riscv64/amax_rvv.c
kernel/riscv64/amin_rvv.c
kernel/riscv64/asum_rvv.c
kernel/riscv64/axpby_rvv.c
kernel/riscv64/axpy_rvv.c
kernel/riscv64/copy_rvv.c
kernel/riscv64/dot_rvv.c
kernel/riscv64/gemm_beta_rvv.c
kernel/riscv64/gemm_ncopy_rvv_v1.c
kernel/riscv64/gemm_tcopy_rvv_v1.c
kernel/riscv64/gemmkernel_rvv_v1x8.c
kernel/riscv64/gemv_n_rvv.c
kernel/riscv64/gemv_t_rvv.c
kernel/riscv64/iamax_rvv.c
kernel/riscv64/iamin_rvv.c
kernel/riscv64/imax_rvv.c
kernel/riscv64/imin_rvv.c
kernel/riscv64/izamax_rvv.c
kernel/riscv64/izamin_rvv.c
kernel/riscv64/max_rvv.c
kernel/riscv64/min_rvv.c
kernel/riscv64/nrm2_rvv.c
kernel/riscv64/rot_rvv.c
kernel/riscv64/scal_rvv.c
kernel/riscv64/sum_rvv.c
kernel/riscv64/swap_rvv.c
kernel/riscv64/symm_lcopy_rvv_v1.c
kernel/riscv64/symm_ucopy_rvv_v1.c
kernel/riscv64/symv_L_rvv.c
kernel/riscv64/symv_U_rvv.c
kernel/riscv64/trmm_lncopy_rvv_v1.c
kernel/riscv64/trmm_ltcopy_rvv_v1.c
kernel/riscv64/trmm_uncopy_rvv_v1.c
kernel/riscv64/trmm_utcopy_rvv_v1.c
kernel/riscv64/trmmkernel_rvv_v1x8.c
kernel/riscv64/trsm_kernel_LN_rvv_v1.c
kernel/riscv64/trsm_kernel_LT_rvv_v1.c
kernel/riscv64/trsm_kernel_RN_rvv_v1.c
kernel/riscv64/trsm_kernel_RT_rvv_v1.c
kernel/riscv64/trsm_lncopy_rvv_v1.c
kernel/riscv64/trsm_ltcopy_rvv_v1.c
kernel/riscv64/trsm_uncopy_rvv_v1.c
kernel/riscv64/trsm_utcopy_rvv_v1.c
kernel/riscv64/zamax_rvv.c
kernel/riscv64/zamin_rvv.c
kernel/riscv64/zasum_rvv.c
kernel/riscv64/zaxpby_rvv.c
kernel/riscv64/zaxpy_rvv.c
kernel/riscv64/zcopy_rvv.c
kernel/riscv64/zdot_rvv.c
kernel/riscv64/zgemm_beta_rvv.c
kernel/riscv64/zgemmkernel_rvv_v1x4.c
kernel/riscv64/zgemv_n_rvv.c
kernel/riscv64/zgemv_t_rvv.c
kernel/riscv64/znrm2_rvv.c
kernel/riscv64/zrot_rvv.c
kernel/riscv64/zscal_rvv.c
kernel/riscv64/zsum.c
kernel/riscv64/zsum_rvv.c
kernel/riscv64/zswap_rvv.c
kernel/riscv64/ztrmmkernel_2x2_rvv.c