
Add rvv support for zsymv and activate rvv support for zhemv #4049


Merged
3 commits merged into OpenMathLib:risc-v on Jun 9, 2023

Conversation

sh-zheng
Contributor

No description provided.

@xianyi
Collaborator

xianyi commented May 20, 2023

Hi @sh-zheng

Thank you for the patch.

Could you provide the toolchain and emulator for the x280, so that I can test it?

@sh-zheng
Contributor Author

Hi, xianyi
The toolchain is LLVM/Clang 16.0.0 and the emulator is QEMU 7.0.
I have run some tests for c/z hemv and symv, for both the lower and upper cases. Since c/z symv are not standard BLAS-2 subroutines, I made some changes in OpenBLAS/interface so that they can be called as standard subroutines for testing.

@martin-frbg
Collaborator

Would you happen to have a suggestion for cross-compiling from x86_64 to the x280? I followed the instructions at https://github.com/sifive/riscv-llvm/blob/dev/README.md for building an LLVM16 toolchain with gcc sysroot, but the test executables (openblas_utest etc.) all run into a SIGILL with the qemu-riscv64 (v7.1.0) built as part of the process.

@martin-frbg
Collaborator

(Similar result with the prebuilt SiFive LLVM16/gcc12.2/qemu7.0 toolchain from https://github.com/sifive/prepare-riscv-toolchain-qemu/releases - the only difference is that I get SIGSEGV in the tests with that one instead of SIGILL.)

@sh-zheng
Contributor Author

Hi, @martin-frbg
My LLVM 16 and QEMU 7.0 were downloaded from the official sites, where support for compiling and running the new rvv spec has been added. I have not tried the SiFive toolchain and emulator.
To build the lib, I use: make BINARY=64 CC=clang-path HOSTCC=gcc TARGET=x280 CFLAGS="-march=rv64gcv --sysroot=/opt/riscv/sysroot --gcc-toolchain=/opt/riscv" NO_FORTRAN=1 NO_SHARED=1 USE_THREAD=0
To run, I use: qemu-riscv64 -L /opt/riscv/sysroot -cpu rv64,v=true,vlen=128,elen=64 ./a.out

@xianyi
Collaborator

xianyi commented May 21, 2023

Thank you. I will try it next week.

@martin-frbg I think I need to set up a CI for RISC-V via qemu.


v_res = VFREDSUM_FLOAT(vr0, v_z0, gvl);
temp_r2 += VFMVFS_FLOAT(v_res);
v_res = VFREDSUM_FLOAT(vr1, v_z0, gvl);
Contributor

Since horizontal reductions are expensive, it may be advisable to do only one VFREDSUM instruction per real/imaginary part instead of one after the main loop (lines 146, 148) and in the tail (lines 184, 186). Let me do a code sketch:

if(len > 0){ // currently line 101
    gvl = VSETVL(len);  // typically gvl = VL_MAX
    [...]
    for(k = 0; k < len / gvl; k++){
        // main loop as in code
    }
    // **No** reduction after the main loop.
    if(i < m){ // currently line 150
        BLASLONG remainder_vl = VSETVL(m-i);  // Introduce new variable instead of overwriting gvl
        // lines 152...182 use remainder_vl instead of gvl
        // **No** reduction in the tail
    }
    // Do the reduction as the final step using gvl:
    v_res = VFREDSUM_FLOAT(vr0, v_z0, gvl);
    temp_r2 += VFMVFS_FLOAT(v_res);
    v_res = VFREDSUM_FLOAT(vr1, v_z0, gvl);
    temp_i2 += VFMVFS_FLOAT(v_res);
}

Contributor Author

Exactly! I will move the reductions outside the loops and do them just once. Thank you for the advice.

Contributor Author

Maybe there is a problem.
If, in the "if(i < m)" branch, we accumulate vr0 and vr1 with a different "remainder_vl", we have to use the "tail undisturbed" policy to avoid the tail elements being overwritten. Such a tail policy may add register-copying cost during instruction scheduling (see the note in rvv spec 3.4.3).
In most applications the datasets are large and "len" is a large value, so the cost of one extra reduction after the main loop may be acceptable.
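
For illustration, here is a minimal sketch of the tail-undisturbed accumulate being discussed, assuming the __riscv_ policy-intrinsic naming supported by clang 16; the function and variable names are generic examples, not the macros used in this PR:

#include <riscv_vector.h>

// Accumulate a remainder of n doubles into acc without touching the
// elements past vl (tail undisturbed), so that a single full-length
// reduction over acc afterwards still sees the main-loop partial sums.
vfloat64m1_t accumulate_tail(vfloat64m1_t acc, const double *x,
                             const double *y, size_t n)
{
    size_t vl = __riscv_vsetvl_e64m1(n);             // vl <= VLMAX
    vfloat64m1_t vx = __riscv_vle64_v_f64m1(x, vl);
    vfloat64m1_t vy = __riscv_vle64_v_f64m1(y, vl);
    // _tu: elements of acc beyond vl are left undisturbed
    return __riscv_vfmacc_vv_f64m1_tu(acc, vx, vy, vl);
}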

Contributor

Since the x280 is in-order, shouldn't

A simple in-order implementation can ignore the settings and simply execute all vector instructions using the undisturbed policy. The vta and vma state bits must still be provided in vtype for compatibility and to support thread migration.

hold? The statement is taken from the notes in the above-referenced rvv specs 3.4.3.

I don't know how much of a performance penalty horizontal reductions carry on x280. If you say that it is not an issue on x280 and the workloads you target, feel free to disregard the comment. If it turns out to be an issue, it can be revised later.

Contributor

After having checked the vector length for x280, the easiest and cheapest solution may actually be to do the tail with scalar instructions. The vector length is 512 bits, allowing up to 8 single complex or 4 double complex numbers to be processed.
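
To spell out the arithmetic: 512 bits hold 8 pairs of 32-bit floats (8 single complex) or 4 pairs of 64-bit doubles (4 double complex), so a scalar tail runs at most 7 or 3 iterations, respectively. A minimal sketch of such a scalar tail for the dot-like accumulation, with illustrative names rather than the PR's actual variables:

// Scalar tail: accumulate the remaining complex elements, where a and x
// store interleaved (re, im) pairs and [start, m) is the remainder range.
static void scalar_tail(const double *a, const double *x,
                        long start, long m,
                        double *temp_r2, double *temp_i2)
{
    for (long i = start; i < m; i++) {
        double a_re = a[2 * i], a_im = a[2 * i + 1];
        double x_re = x[2 * i], x_im = x[2 * i + 1];
        // plain complex multiply-accumulate; any conjugation required
        // by the hemv/symv variant is omitted in this sketch
        *temp_r2 += a_re * x_re - a_im * x_im;
        *temp_i2 += a_re * x_im + a_im * x_re;
    }
}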

Contributor Author

In my original idea, the first version of OpenBLAS's rvv 1.0 support would be better if it adapted to as many CPUs as possible (different VLEN, in-order/out-of-order), not only the SiFive x280. For a specific CPU, we can then add optimizations on top of it later.

Contributor Author

But I think the total amount of CPU-specific optimization should be kept under control, to avoid code bloat. So we may have to strike a balance between performance and architectural flexibility.

Collaborator

It may make sense to have some kind of rvv-generic target in addition to the x280-specific target in the future, but so far we seem to be far away from the cpu-specific code bloat of x86_64 or even arm64. (What does bother me a bit is that the earlier support for the C910/C906 appears to have been broken both in mainline develop and in the risc-v branch. Maybe that is unavoidable if standards have progressed beyond the earlier hardware, but the corresponding boards still appear to be easier to buy than any more recent designs.)

Contributor Author

I have updated the code and combined the two reductions into one, with the explicit _tu policy. @angsch @martin-frbg

@martin-frbg
Collaborator

martin-frbg commented May 21, 2023

Yes, as far as I can tell @HellerZheng has not claimed that these kernels in their present form are good for anything except x280 (and I also notice that Makefile.riscv64 adds -riscv-v-vector-bits-min=512 to the CFLAGS for x280 target)

@martin-frbg
Collaborator

BTW thanks for the build/qemu instructions - it seems I had an error in my qemu command line. I now experience a hang in the "fork safety" part of utest/openblas_utest though - obviously unrelated to your PR, but I wonder if you noticed anything similar in your tests, if you ran the utest at all?

@sh-zheng
Contributor Author

BTW thanks for the build/qemu instructions - it seems I had an error in my qemu command line. I now experience a hang in the "fork safety" part of utest/openblas_utest though - obviously unrelated to your PR, but I wonder if you noticed anything similar in your tests, if you ran the utest at all?

Sorry, my qemu was built last year and I have forgotten some of the details of the build. As far as I remember, I did not run into this kind of error. I didn't run utest; I just wrote some simple calls to these subroutines.

@sh-zheng
Contributor Author

Yes, as far as I can tell @HellerZheng has not claimed that these kernels in their present form are good for anything except x280 (and I also notice that Makefile.riscv64 adds -riscv-v-vector-bits-min=512 to the CFLAGS for x280 target)

I have just found that there is no accumulation operation in symv_L_rvv.c, because of the register clearing at line 97,
vr = VFMVVF_FLOAT(0, vl);. The clearing there seems incorrect, because we really need an accumulation.
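
A scalar analogue of the bug being described (a generic sketch, not the actual symv_L_rvv.c code): the accumulator has to be cleared once before the loop and only accumulated inside it.

// The accumulator is zeroed once before the loop; re-zeroing it inside
// the loop (analogous to the misplaced clear discussed above) would
// discard all previous partial sums.
static double row_dot(const double *a, const double *x, long n)
{
    double sum = 0.0;          // correct: clear once, outside the loop
    for (long j = 0; j < n; j++) {
        /* sum = 0.0; */       // BUG: a per-iteration clear loses the running sum
        sum += a[j] * x[j];
    }
    return sum;
}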

@martin-frbg
Collaborator

martin-frbg commented May 21, 2023

Indeed both the SSYMV and DSYMV ctests fail, now that I get them to run. (And with the -ffast-math in Makefile.riscv64 I get lots of failures in the BLAS1 ctests as well, which go away when I remove that option... On the other hand, openblas_utest then reports failures in some variants of the MIN and MAX tests, so something is clearly amiss in the x280 rvv kernels.)

@martin-frbg
Collaborator

Need to correct myself, sorry - the ctest failures happen only when I copy your qemu option vlen=128; all ctests and all utests pass when I run qemu with vlen=512 (which I expect corresponds to the riscv-v-vector-bits-min=512 specified to the compiler).

@sh-zheng
Contributor Author

Since the kernels for the x280 are fairly complete, maybe we could build rvv-generic support based on them, by removing the architecture-specific features such as the in-order and tail/mask-undisturbed assumptions and the -riscv-v-vector-bits-min=512 and -ffast-math cflags.

@sh-zheng
Contributor Author

Issue #4050 was opened for further discussion of the rvv 1.0 support.

@ken-unger

I have just found that there is no accumulation operation in symv_L_rvv.c, because of the register clearing at line 97,
vr = VFMVVF_FLOAT(0, vl);. The clearing there seems incorrect, because we really need an accumulation.

Hi @sh-zheng, indeed this is a bug and Heller will issue a PR for this shortly. Not caught in the C tests, but caught by Xianyi's BLAS-tester.

@martin-frbg martin-frbg merged commit 62f0f50 into OpenMathLib:risc-v Jun 9, 2023
@angsch
Contributor

angsch commented Jun 9, 2023

Great work @sh-zheng and an amazing bug spot
