Add rvv support for zsymv and active rvv support for zhemv #4049

sh-zheng · 2023-05-19T17:29:54Z

No description provided.

xianyi · 2023-05-20T05:45:03Z

Thank you for the patch.

Could you provide the toolchain and emulator for x280? Therefore, I could test it.

sh-zheng · 2023-05-20T06:02:20Z

Hi, xianyi
The toolchain is llvm&clang 16.0.0 and the emulator is qemu 7.0.
I have done some tests for c/z hemv and symv, for lower and upper cases. Since the c/z symv are not standard BLAS-2 subroutines, for testing, I made some changes in OpenBLAS/interface to run them as standard subroutines.

martin-frbg · 2023-05-20T16:43:04Z

Would you happen to have a suggestion for cross-compiling from x86_64 to the x280 ? I followed the instructions at https://github.com/sifive/riscv-llvm/blob/dev/README.md for building a LLVM16 toolchain with gcc sysroot, but the test executables (openblas_utest etc) all run into a SIGILL with the qemu-riscv64 (v7.1.0) built as part of the process.

martin-frbg · 2023-05-20T21:13:50Z

(similar result with the prebuilt Sifive LLVM16/gcc12.2/qemu7.0 toolchain from https://github.com/sifive/prepare-riscv-toolchain-qemu/releases - only difference is that I get SIGSEGV in the tests with that one instead of SIGILL)

sh-zheng · 2023-05-21T02:01:40Z

Hi, @martin-frbg
My llvm 16 and qemu 7.0 are downloaded from official site, where the supports for compiling and running in new rvv spec have been added. I have not tried the sifive version toolchain and emulator.
To build the lib, I use "make BINARY=64 CC=clang-path HOSTCC=gcc TARGET=x280 CFLAGS="-march=rv64gcv --sysroot=/opt/riscv/sysroot --gcc-toolchain=/opt/riscv" NO_FORTRAN=1 NO_SHARED=1 USE_THREAD=0"
To run, I use "qemu-riscv64 -L /opt/riscv/sysroot -cpu rv64,v=true,vlen=128,elen=64 ./a.out"

xianyi · 2023-05-21T07:09:29Z

Thank you. I will try it next week.

@martin-frbg I think I need to setup a CI for RISC-V via qemu.

angsch · 2023-05-21T07:42:53Z

kernel/riscv64/zhemv_LM_rvv.c

+
+                                v_res = VFREDSUM_FLOAT(vr0, v_z0, gvl);
+                                temp_r2 += VFMVFS_FLOAT(v_res);
+                                v_res = VFREDSUM_FLOAT(vr1, v_z0, gvl);


Since horizontal reductions are expensive, it may be advisable to do only one VFREDSUM instruction per real/imaginary part instead of one after the main loop (lines 146, 148) and in the tail (lines 184, 186). Let me do a code sketch:

if(len > 0){ // currently line 101 gvl = VSETVL(len); // typically gvl = VL_MAX [...] for(k = 0; k < len / gvl; k++){ // main loop as in code } // **No** reduction after the main loop. if(i < m){ // currently line 150 BLASLONG remainder_vl = VSETVL(m-i); // Introduce new variable instead of overwriting gvl // lines 152...182 use remainder_vl instead of gvl // **No** reduction in the tail } // Do the reduction as the final step using gvl: v_res = VFREDSUM_FLOAT(vr0, v_z0, gvl); temp_r2 += VFMVFS_FLOAT(v_res); v_res = VFREDSUM_FLOAT(vr1, v_z0, gvl); temp_i2 += VFMVFS_FLOAT(v_res); }

Exactly! I will extract the reductions outside, and do them just once. Thank you for the advice.

Maybe there is a problem.
If in the branch "if(i < m)", we accumulate vr0 and vr1 with another "remainder_vl", we have to set "tail undisturbed" to avoid tail overwrited. Such a tail policy may lead to more cost in registers copying in instruction scheduling (note in rvv spec 3.4.3).
In most of applications, the datasets are big and "len" is a large value. The cost of one more time reduction after the main loop may be acceptable.

Since the x280 is in-order, shouldn't

A simple in-order implementation can ignore the settings and simply execute all vector instructions using the undisturbed policy. The vta and vma state bits must still be provided in vtype for compatibility and to support thread migration.

hold? The statement is taken from the notes in the above-referenced rvv specs 3.4.3.

I don't know how much of a performance penalty horizontal reductions carry on x280. If you say that it is not an issue on x280 and the workloads you target, feel free to disregard the comment. If it turns out to be an issue, it can be revised later.

After having checked the vector length for x280, the easiest and cheapest solution may actually be to do the tail with scalar instructions. The vector length is 512 bits, allowing up to 8 single complex or 4 double complex numbers to be processed.

In my original idea, the first version of rvv 1.0 support of openblas may be better if it adapt to various cpus as many as it can (different VLEN, in order/out of order), not only for sifive x280. And for specific cpu, we can make some optimizations based on it later.

But I think that the total amount of specific optimizations shoule be controled, to avoid the dilate of codes. So we maybe have to make a balance between the performance and architecture flexibility.

It may make sense to have some kind of rvv-generic in addition to x280-specific target in the future, but so far we seem to be far away from the cpu-specific code bloat of x86_64 or even arm64. (What does bother me a bit, is that the earlier support for the C910/C906 appears to have been broken both in mainline develop and the risc-v branch. Maybe that is unavoidable if standards have progressed beyond the earlier hardware, but the corresponding boards appear to still be easier to buy than any more recent designs)

I have updated the code and combine the two reductions into one, with the explicit _tu policy. @angsch @martin-frbg

martin-frbg · 2023-05-21T15:15:17Z

Yes, as far as I can tell @HellerZheng has not claimed that these kernels in their present form are good for anything except x280 (and I also notice that Makefile.riscv64 adds -riscv-v-vector-bits-min=512 to the CFLAGS for x280 target)

martin-frbg · 2023-05-21T15:25:17Z

BTW thanks for the build/qemu instructions, it seems I had an error in my qemu command line. I now experience a hang in the "fork safety" part of utest/openblas_utest though - obviously unrelated to your PR, but I wonder if you noticed anything similar in your tests, if you ran the utest at all ?

sh-zheng · 2023-05-21T15:39:29Z

BTW thanks for the build/qemu instructions, it seems I had an error in my qemu command line. I now experience a hang in the "fork safety" part of utest/openblas_utest though - obviously unrelated to your PR, but I wonder if you noticed anything similar in your tests, if you ran the utest at all ?

Sorry, my qemu was builded last year and I have forgotten some details about the building. In my memory I did not meet this kind of error. I did't run utest. I just wrote some simple calls of these subroutines.

sh-zheng · 2023-05-21T15:59:42Z

Yes, as far as I can tell @HellerZheng has not claimed that these kernels in their present form are good for anything except x280 (and I also notice that Makefile.riscv64 adds -riscv-v-vector-bits-min=512 to the CFLAGS for x280 target)

I have just found that there is no accumulation operation in symv_L_rvv.c, for the register clearing in line 97
vr = VFMVVF_FLOAT(0, vl);. But the clearing here seems incorrect, because we really need an accumulation.

martin-frbg · 2023-05-21T18:35:01Z

Indeed both the SSYMV and DSYMV ctests fail, now that I get them to run. (And with the -ffast-math in Makefile.riscv64 I get lots of failures in the BLAS1 ctests as well, which go away when I remove that option... On the other hand, then openblas_utest reports failures in some variants of the MIN and MAX tests, so something is clearly amiss in the x280 rvv kernels).

martin-frbg · 2023-05-21T20:40:07Z

Need to correct myself, sorry - the ctest failures happen only when I copy your qemu option vlen=128, all ctests and all utests pass when I run qemu with vlen=512 (which I expect corresponds to the riscv-v-vector-bits-min=512 specified to the compiler).

sh-zheng · 2023-05-22T01:24:20Z

Since the kernels for x280 are complete enough, maybe we could build a rvv-generic support based on it, by removing the architecture-specific feature, such as in-order & tail/mask undisturbed assumotion, and -riscv-v-vector-bits-min=512 & -ffast-math cflags.

sh-zheng · 2023-05-22T08:17:29Z

Issue #4050 was raised to make more discussions about the rvv 1.0 support.

…sturbed setted.

ken-unger · 2023-06-03T16:30:54Z

I have just found that there is no accumulation operation in symv_L_rvv.c, for the register clearing in line 97
vr = VFMVVF_FLOAT(0, vl);. But the clearing here seems incorrect, because we really need an accumulation.

Hi @sh-zheng, indeed this is a bug and Heller will issue a PR for this shortly. Not caught in the C tests, but caught by Xianyi's BLAS-tester.

angsch · 2023-06-09T17:50:03Z

Great work @sh-zheng and an amazing bug spot

Add rvv support for zsymv and active rvv support for zhemv

18d7afe

angsch reviewed May 21, 2023

View reviewed changes

sh-zheng mentioned this pull request May 22, 2023

risc-v vector v1.0 support #4050

Open

Combine two reduction operations of zhe/symv into one, with tail undi…

d3bf5a5

…sturbed setted.

HellerZheng mentioned this pull request Jun 5, 2023

fix wrong vr = VFMVVF_FLOAT(0, vl); in symv_L_rvv.c and symv_U_rvv.c #4074

Merged

Merge branch 'xianyi:risc-v' into risc-v

2a8bc38

martin-frbg merged commit 62f0f50 into OpenMathLib:risc-v Jun 9, 2023

Add rvv support for zsymv and active rvv support for zhemv #4049

Add rvv support for zsymv and active rvv support for zhemv #4049

Uh oh!

Conversation

sh-zheng commented May 19, 2023

Uh oh!

xianyi commented May 20, 2023

Uh oh!

sh-zheng commented May 20, 2023

Uh oh!

martin-frbg commented May 20, 2023

Uh oh!

martin-frbg commented May 20, 2023

Uh oh!

sh-zheng commented May 21, 2023

Uh oh!

xianyi commented May 21, 2023

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

martin-frbg commented May 21, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

martin-frbg commented May 21, 2023

Uh oh!

sh-zheng commented May 21, 2023

Uh oh!

sh-zheng commented May 21, 2023

Uh oh!

martin-frbg commented May 21, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

martin-frbg commented May 21, 2023

Uh oh!

sh-zheng commented May 22, 2023

Uh oh!

sh-zheng commented May 22, 2023

Uh oh!

ken-unger commented Jun 3, 2023

Uh oh!

angsch commented Jun 9, 2023

Uh oh!

Uh oh!

martin-frbg commented May 21, 2023 •

edited

Loading

martin-frbg commented May 21, 2023 •

edited

Loading