Description
The discussion in #4049 inspired me to create an issue for further discussion.
Unlike commercial ISAs, which have a clear development plan, the total number of products supporting RVV may be large. Optimizing for every individual product may lead to code bloat, and runs contrary to the purpose of the vector ISA, which is expected to be length-adaptive.
The RVV 1.0 intrinsic spec is now stable enough to develop code against, and RVV 1.0 support has been fully submitted to OpenBLAS, based on the SiFive X280, an in-order CPU with VLEN=512.
Would it be better to do further development on top of this X280 version? The end goal may be compatibility across different VLENs, instruction execution orders, and tail/mask policies. Of course, pursuing compatibility may lead to suboptimal performance, so a balance has to be struck.
Some of the X280 kernels rely on CPU-specific features that may lead to incorrect results on other CPUs. They are listed below:
- Architecture-specific cflags, such as `-riscv-v-vector-bits-min=512` and `-ffast-math`.
- Changing vl inside a loop, which leads to the tail elements being overwritten when the tail-undisturbed policy is not set. For example, `vl = VSETVL(k);` in symv_L_rvv.c, line 96.
- Setting vl to an immediate value under the assumption of VLEN=512. For example, `size_t vl = 8;` in gemm_tcopy_8_rvv.c, line 84.
In addition to the above, the register tiling in GEMM for different VLENs should be considered. We currently set `GEMM_UNROLL_N_SHIFT 8`, which may leave other vector registers unused. Would 12 or 14 be better?