-
Notifications
You must be signed in to change notification settings - Fork 1.6k
Add rvv support for zsymv and active rvv support for zhemv #4049
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Hi @sh-zheng Thank you for the patch. Could you provide the toolchain and emulator for x280? Therefore, I could test it. |
Hi, xianyi |
Would you happen to have a suggestion for cross-compiling from x86_64 to the x280 ? I followed the instructions at https://github.com/sifive/riscv-llvm/blob/dev/README.md for building a LLVM16 toolchain with gcc sysroot, but the test executables (openblas_utest etc) all run into a SIGILL with the qemu-riscv64 (v7.1.0) built as part of the process. |
(similar result with the prebuilt Sifive LLVM16/gcc12.2/qemu7.0 toolchain from https://github.com/sifive/prepare-riscv-toolchain-qemu/releases - only difference is that I get SIGSEGV in the tests with that one instead of SIGILL) |
Hi, @martin-frbg |
Thank you. I will try it next week. @martin-frbg I think I need to setup a CI for RISC-V via qemu. |
kernel/riscv64/zhemv_LM_rvv.c
Outdated
|
||
v_res = VFREDSUM_FLOAT(vr0, v_z0, gvl); | ||
temp_r2 += VFMVFS_FLOAT(v_res); | ||
v_res = VFREDSUM_FLOAT(vr1, v_z0, gvl); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since horizontal reductions are expensive, it may be advisable to do only one VFREDSUM
instruction per real/imaginary part instead of one after the main loop (lines 146, 148) and in the tail (lines 184, 186). Let me do a code sketch:
if(len > 0){ // currently line 101
gvl = VSETVL(len); // typically gvl = VL_MAX
[...]
for(k = 0; k < len / gvl; k++){
// main loop as in code
}
// **No** reduction after the main loop.
if(i < m){ // currently line 150
BLASLONG remainder_vl = VSETVL(m-i); // Introduce new variable instead of overwriting gvl
// lines 152...182 use remainder_vl instead of gvl
// **No** reduction in the tail
}
// Do the reduction as the final step using gvl:
v_res = VFREDSUM_FLOAT(vr0, v_z0, gvl);
temp_r2 += VFMVFS_FLOAT(v_res);
v_res = VFREDSUM_FLOAT(vr1, v_z0, gvl);
temp_i2 += VFMVFS_FLOAT(v_res);
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Exactly! I will extract the reductions outside, and do them just once. Thank you for the advice.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe there is a problem.
If in the branch "if(i < m)", we accumulate vr0 and vr1 with another "remainder_vl", we have to set "tail undisturbed" to avoid tail overwrited. Such a tail policy may lead to more cost in registers copying in instruction scheduling (note in rvv spec 3.4.3).
In most of applications, the datasets are big and "len" is a large value. The cost of one more time reduction after the main loop may be acceptable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since the x280 is in-order, shouldn't
A simple in-order implementation can ignore the settings and simply execute all vector instructions using the undisturbed policy. The vta and vma state bits must still be provided in vtype for compatibility and to support thread migration.
hold? The statement is taken from the notes in the above-referenced rvv specs 3.4.3.
I don't know how much of a performance penalty horizontal reductions carry on x280. If you say that it is not an issue on x280 and the workloads you target, feel free to disregard the comment. If it turns out to be an issue, it can be revised later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After having checked the vector length for x280, the easiest and cheapest solution may actually be to do the tail with scalar instructions. The vector length is 512 bits, allowing up to 8 single complex or 4 double complex numbers to be processed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In my original idea, the first version of rvv 1.0 support of openblas may be better if it adapt to various cpus as many as it can (different VLEN, in order/out of order), not only for sifive x280. And for specific cpu, we can make some optimizations based on it later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But I think that the total amount of specific optimizations shoule be controled, to avoid the dilate of codes. So we maybe have to make a balance between the performance and architecture flexibility.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It may make sense to have some kind of rvv-generic in addition to x280-specific target in the future, but so far we seem to be far away from the cpu-specific code bloat of x86_64 or even arm64. (What does bother me a bit, is that the earlier support for the C910/C906 appears to have been broken both in mainline develop
and the risc-v
branch. Maybe that is unavoidable if standards have progressed beyond the earlier hardware, but the corresponding boards appear to still be easier to buy than any more recent designs)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have updated the code and combine the two reductions into one, with the explicit _tu
policy. @angsch @martin-frbg
Yes, as far as I can tell @HellerZheng has not claimed that these kernels in their present form are good for anything except x280 (and I also notice that Makefile.riscv64 adds |
BTW thanks for the build/qemu instructions, it seems I had an error in my qemu command line. I now experience a hang in the "fork safety" part of utest/openblas_utest though - obviously unrelated to your PR, but I wonder if you noticed anything similar in your tests, if you ran the utest at all ? |
Sorry, my qemu was builded last year and I have forgotten some details about the building. In my memory I did not meet this kind of error. I did't run utest. I just wrote some simple calls of these subroutines. |
I have just found that there is no accumulation operation in symv_L_rvv.c, for the register clearing in line 97 |
|
Need to correct myself, sorry - the ctest failures happen only when I copy your qemu option |
Since the kernels for x280 are complete enough, maybe we could build a rvv-generic support based on it, by removing the architecture-specific feature, such as in-order & tail/mask undisturbed assumotion, and -riscv-v-vector-bits-min=512 & -ffast-math cflags. |
Issue #4050 was raised to make more discussions about the rvv 1.0 support. |
Hi @sh-zheng, indeed this is a bug and Heller will issue a PR for this shortly. Not caught in the C tests, but caught by Xianyi's BLAS-tester. |
Great work @sh-zheng and an amazing bug spot |
No description provided.