8322174: RISC-V: C2 VectorizedHashCode RVV Version #17413

Status: Open (wants to merge 3 commits into master)

@ygaevsky ygaevsky commented Jan 13, 2024

This patch makes it possible to use RVV instructions for faster vectorizedHashCode calculations on hardware that supports RVV v1.0.0.

Testing: hotspot/jtreg/compiler/ under QEMU-8.1 with RVV v1.0.0.
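
For context, here is a minimal scalar sketch of the computation the intrinsic accelerates, assuming the usual Java polynomial hash `h = 31 * h + a[i]` with 32-bit wrap-around arithmetic. The name `hash_scalar` is hypothetical and is not part of the patch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical reference helper (not from the patch): Java-style polynomial
// hash over 32-bit elements, h = 31 * h + a[i], with wrap-around arithmetic.
// Unsigned math is used so that overflow is well defined in C++.
inline std::uint32_t hash_scalar(const std::int32_t* a, std::size_t n,
                                 std::uint32_t initial) {
    assert(a != nullptr || n == 0);
    std::uint32_t h = initial;
    for (std::size_t i = 0; i < n; ++i) {
        // 31 * h is computed as (h << 5) - h, which avoids a multiply.
        h = (h << 5) - h + static_cast<std::uint32_t>(a[i]);
    }
    return h;
}
```

With an initial value of 1 this should match `Arrays.hashCode(int[])` for the same elements.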


Progress

  • Change must be properly reviewed (1 review required, with at least 1 Reviewer)
  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue

Warning

 ⚠️ Patch contains a binary file (test/jdk/javax/management/loading/LibraryLoader/native.jar)

Issue

  • JDK-8322174: RISC-V: C2 VectorizedHashCode RVV Version (Enhancement - P4)

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/17413/head:pull/17413
$ git checkout pull/17413

Update a local copy of the PR:
$ git checkout pull/17413
$ git pull https://git.openjdk.org/jdk.git pull/17413/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 17413

View PR using the GUI difftool:
$ git pr show -t 17413

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/17413.diff

Webrev

Link to Webrev Comment

bridgekeeper bot commented Jan 13, 2024

👋 Welcome back ygaevsky! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

ygaevsky commented Jan 13, 2024

NB: I have no access to RVV v1.0.0 hardware, so to estimate the performance improvement
I adapted the patch to the RVV v0.7.1 ISA [1] on OpenJDK 21 and ran the JMH test
org.openjdk.bench.java.lang.ArraysHashCode on a LicheePi-4A (TH1520), which supports
RVV v0.7.1.

[1] https://mail.openjdk.org/pipermail/riscv-port-dev/2024-January/001220.html

The results are below. Hopefully they will be similar on RVV v1.0.0 hardware.

Legend: UseVHI ==> UseVectorizedHashCodeIntrinsic

----------------------------------------------------------------------------------------------------------------------------------------------
                                [-XX:-UseVHI -XX:-UseRVV] [-XX:-UseVHI -XX:+UseRVV] [-XX:+UseVHI -XX:-UseRVV] [-XX:+UseVHI -XX:+UseRVV]
----------------------------------------------------------------------------------------------------------------------------------------------
Benchmark    (size)  Mode  Cnt |       Score      Error  |       Score      Error  |       Score      Error  |       Score      Error  |Units|
----------------------------------------------------------------------------------------------------------------------------------------------
bytes             1  avgt   10 |      20.292 ±    0.524  |      20.693 ±    1.706  |      20.458 ±    0.718  |      20.276 ±    0.525  |ns/op|
bytes            10  avgt   10 |      35.107 ±    0.180  |      35.054 ±    0.029  |      30.898 ±    0.109  |      31.033 ±    0.132  |ns/op|
bytes           100  avgt   10 |     188.190 ±    4.192  |     188.805 ±    4.345  |     152.324 ±    2.205  |      97.673 ±    3.145  |ns/op|
bytes          1000  avgt   10 |    1664.569 ±    1.662  |    1663.711 ±    2.229  |    1184.224 ±    0.731  |     656.340 ±    1.908  |ns/op|
bytes         10000  avgt   10 |   16419.434 ±   68.995  |   16407.357 ±   43.737  |   11599.876 ±   23.574  |    6171.500 ±   16.633  |ns/op|
bytes        100000  avgt   10 |  167738.927 ± 3313.255  |  166577.887 ± 1552.963  |  119475.413 ± 1358.363  |   62061.873 ±  130.268  |ns/op|
chars             1  avgt   10 |      20.420 ±    1.031  |      20.294 ±    0.527  |      20.402 ±    0.992  |      21.267 ±    0.027  |ns/op|
chars            10  avgt   10 |      35.800 ±    0.032  |      35.778 ±    0.049  |      31.170 ±    0.199  |      31.744 ±    0.169  |ns/op|
chars           100  avgt   10 |     185.715 ±    0.674  |     184.531 ±    1.152  |     143.918 ±    1.147  |      90.613 ±    0.092  |ns/op|
chars          1000  avgt   10 |    1683.711 ±   46.493  |    1668.926 ±    6.850  |    1120.730 ±    3.017  |     652.677 ±    2.026  |ns/op|
chars         10000  avgt   10 |   16402.007 ±   16.654  |   16468.497 ±  136.411  |   10939.505 ±   72.647  |    6174.555 ±   28.879  |ns/op|
chars        100000  avgt   10 |  164826.072 ±  381.240  |  165807.663 ± 4328.908  |  114787.826 ± 4217.557  |   61724.436 ±   45.819  |ns/op|
ints              1  avgt   10 |      20.730 ±    2.375  |      20.506 ±    1.458  |      20.277 ±    0.517  |      20.169 ±    0.015  |ns/op|
ints             10  avgt   10 |      36.878 ±    0.059  |      36.162 ±    1.033  |      31.338 ±    0.243  |      32.511 ±    0.165  |ns/op|
ints            100  avgt   10 |     184.288 ±    0.790  |     184.939 ±    0.624  |     143.794 ±    0.708  |      80.406 ±    6.987  |ns/op|
ints           1000  avgt   10 |    1669.219 ±    3.559  |    1670.992 ±   13.830  |    1118.856 ±    1.086  |     486.305 ±    4.471  |ns/op|
ints          10000  avgt   10 |   16432.730 ±   62.326  |   16710.540 ±   68.028  |   11128.766 ±   57.448  |    5232.062 ±  291.835  |ns/op|
ints         100000  avgt   10 |  165387.705 ±  431.814  |  165597.050 ±  278.567  |  115605.648 ± 8245.853  |   45468.032 ± 1793.979  |ns/op|
multibytes        1  avgt   10 |       3.459 ±    0.020  |       3.473 ±    0.055  |       3.477 ±    0.145  |       3.480 ±    0.043  |ns/op|
multibytes       10  avgt   10 |      16.983 ±    0.264  |      17.526 ±    0.375  |      12.325 ±    0.117  |      13.415 ±    0.136  |ns/op|
multibytes      100  avgt   10 |     105.251 ±    0.250  |     105.032 ±    0.180  |      78.795 ±    0.260  |      53.210 ±    1.024  |ns/op|
multibytes     1000  avgt   10 |     948.171 ±    5.950  |     957.757 ±   12.117  |     700.407 ±    1.928  |     440.352 ±    2.248  |ns/op|
multibytes    10000  avgt   10 |    8829.949 ±   64.161  |    9007.879 ±  510.217  |    6406.776 ±   17.982  |    3430.480 ±   35.108  |ns/op|
multibytes   100000  avgt   10 |   89545.793 ± 6151.064  |   88335.319 ±   51.310  |   64236.061 ±   46.572  |   33380.485 ±   56.708  |ns/op|
multichars        1  avgt   10 |       3.475 ±    0.054  |       3.453 ±    0.066  |       3.492 ±    0.122  |       3.495 ±    0.047  |ns/op|
multichars       10  avgt   10 |      17.719 ±    0.645  |      17.201 ±    0.152  |      12.318 ±    0.141  |      13.093 ±    0.147  |ns/op|
multichars      100  avgt   10 |     106.735 ±    0.283  |     106.625 ±    0.177  |      77.695 ±    0.212  |      51.495 ±    0.166  |ns/op|
multichars     1000  avgt   10 |     927.573 ±    6.839  |     932.211 ±    3.445  |     696.374 ±    1.757  |     471.226 ±    1.499  |ns/op|
multichars    10000  avgt   10 |    9846.872 ±   20.840  |    9909.611 ±  188.165  |    6392.901 ±    4.849  |    3978.730 ±  180.130  |ns/op|
multichars   100000  avgt   10 |   88110.303 ±   41.764  |   88892.543 ± 2534.299  |   60615.033 ±   94.002  |   33956.859 ±  199.178  |ns/op|
multiints         1  avgt   10 |       3.450 ±    0.328  |       3.382 ±    0.150  |       3.345 ±    0.024  |       3.380 ±    0.040  |ns/op|
multiints        10  avgt   10 |      18.265 ±    0.424  |      18.644 ±    1.433  |      12.036 ±    0.041  |      13.773 ±    0.114  |ns/op|
multiints       100  avgt   10 |     107.500 ±    0.636  |     107.318 ±    0.466  |      77.971 ±    0.296  |      47.700 ±    0.408  |ns/op|
multiints      1000  avgt   10 |     924.920 ±    9.106  |     937.609 ±   44.303  |     695.427 ±    2.075  |     449.475 ±    2.061  |ns/op|
multiints     10000  avgt   10 |    9322.880 ±   49.589  |    9277.425 ±   91.828  |    7009.704 ±  297.983  |    6196.819 ±  367.531  |ns/op|
multiints    100000  avgt   10 |   88154.281 ±  279.258  |   88272.818 ±  103.608  |   64118.963 ± 6445.702  |   55317.212 ±  916.179  |ns/op|
multishorts       1  avgt   10 |       3.488 ±    0.034  |       3.531 ±    0.227  |       3.521 ±    0.051  |       3.512 ±    0.054  |ns/op|
multishorts      10  avgt   10 |      17.907 ±    0.380  |      17.408 ±    0.659  |      12.252 ±    0.110  |      13.445 ±    0.102  |ns/op|
multishorts     100  avgt   10 |     106.588 ±    0.188  |     107.500 ±    0.531  |      79.630 ±    0.428  |      53.886 ±    3.243  |ns/op|
multishorts    1000  avgt   10 |     931.732 ±    6.891  |     923.814 ±   11.836  |     701.534 ±    1.742  |     470.312 ±    2.117  |ns/op|
multishorts   10000  avgt   10 |    9663.105 ± 1017.387  |    9859.034 ±   66.672  |    6422.864 ±    7.486  |    3785.710 ±   37.656  |ns/op|
multishorts  100000  avgt   10 |   88799.262 ± 2363.672  |   88015.545 ±   52.795  |   60541.966 ±  155.521  |   33888.677 ±  127.071  |ns/op|
shorts            1  avgt   10 |      20.199 ±    0.083  |      20.190 ±    0.027  |      21.389 ±    0.600  |      21.250 ±    0.024  |ns/op|
shorts           10  avgt   10 |      35.842 ±    0.189  |      35.806 ±    0.167  |      30.960 ±    0.186  |      31.451 ±    0.182  |ns/op|
shorts          100  avgt   10 |     184.323 ±    0.488  |     185.318 ±    0.776  |     143.652 ±    1.057  |      90.657 ±    0.052  |ns/op|
shorts         1000  avgt   10 |    1664.583 ±    2.016  |    1666.803 ±    3.100  |    1118.623 ±    0.661  |     652.112 ±    0.346  |ns/op|
shorts        10000  avgt   10 |   16395.042 ±   39.388  |   16426.231 ±   75.461  |   10933.090 ±   16.165  |    6200.135 ±  116.218  |ns/op|
shorts       100000  avgt   10 |  165037.332 ±  226.003  |  167782.156 ± 8844.288  |  114329.012 ± 4326.851  |   61693.056 ±   93.278  |ns/op|
----------------------------------------------------------------------------------------------------------------------------------------------

@openjdk openjdk bot added the rfr Pull request is ready for review label Jan 13, 2024

openjdk bot commented Jan 13, 2024

@ygaevsky The following label will be automatically applied to this pull request:

  • hotspot-compiler

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

@openjdk openjdk bot added the hotspot-compiler hotspot-compiler-dev@openjdk.org label Jan 13, 2024

mlbridge bot commented Jan 13, 2024

Webrevs

@RealFYang RealFYang left a comment

Some initial comments from a brief look.

src/hotspot/cpu/riscv/stubGenerator_riscv.cpp (outdated, resolved)
src/hotspot/cpu/riscv/riscv_v.ad (outdated, resolved)
// 31^^(MaxVectorSize-1)...31^^0 ==> vector registers
la(pows31, ExternalAddress(adr_pows31));
mv(t1, num_8b_elems_in_vec);
vsetvli(t0, t1, Assembler::e32, Assembler::m4);
@RealFYang RealFYang Jan 17, 2024

I wonder whether the scalar code for handling WIDE_TAIL could be eliminated by using RVV's stripmining design [1]. The current code doesn't seem to take advantage of it, since the new vl returned by vsetvli is neither checked nor used.

[1] https://github.com/riscv/riscv-v-spec/blob/master/v-spec.adoc#sec-vector-config

> One of the common approaches to handling a large number of elements is "stripmining" where each iteration of a loop handles some number of elements, and the iterations continue until all elements have been processed. The RISC-V vector specification provides direct, portable support for this approach. The application specifies the total number of elements to be processed (the application vector length or AVL) as a candidate value for vl, and the hardware responds via a general-purpose register with the (frequently smaller) number of elements that the hardware will handle per iteration (stored in vl), based on the microarchitectural implementation and the vtype setting. A straightforward loop structure, shown in [Example of stripmining and changes to SEW](https://github.com/riscv/riscv-v-spec/blob/master/v-spec.adoc#example-stripmine-sew), depicts the ease with which the code keeps track of the remaining number of elements and the amount per iteration handled by hardware.
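
The loop shape described in the quoted passage can be sketched in plain C++ as a software model rather than RVV code. `model_vsetvl` is a hypothetical stand-in for `vsetvli` that simply returns `min(AVL, VLMAX)`; real hardware may choose a different legal vl for some AVL values. The element-wise add is only a placeholder for the vector work done per strip.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical software model of vsetvli: the caller passes the remaining
// element count (AVL) and gets back the number of elements handled in this
// iteration (vl). Assumption: vl = min(AVL, VLMAX).
inline std::size_t model_vsetvl(std::size_t avl, std::size_t vlmax) {
    return std::min(avl, vlmax);
}

// Stripmined c[i] = a[i] + b[i]: one loop covers full strips and the tail.
inline std::vector<std::int32_t> add_stripmined(const std::vector<std::int32_t>& a,
                                                const std::vector<std::int32_t>& b,
                                                std::size_t vlmax) {
    assert(a.size() == b.size() && vlmax > 0);
    std::vector<std::int32_t> c(a.size());
    std::size_t done = 0;
    std::size_t remaining = a.size();
    while (remaining > 0) {
        const std::size_t vl = model_vsetvl(remaining, vlmax);
        for (std::size_t i = 0; i < vl; ++i) {  // stands in for one vector op
            c[done + i] = a[done + i] + b[done + i];
        }
        done += vl;       // advance by the amount actually processed
        remaining -= vl;  // the last strip is simply shorter
    }
    return c;
}
```

With 11 elements and a VLMAX of 4, the model runs strips of 4, 4 and 3 elements through the same loop body, so no separate tail code is needed.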

Contributor Author

Thank you for your comments, @RealFYang. I have tried using vector instructions (m4 ==> m2) for the tail calculations, but that only makes the performance numbers worse. :-(

I've made additional measurements with more granularity:

                                            [ -XX:-UseRVV ]  [-XX:+UseRVV ]
ArraysHashCode.multiints      10  avgt   30  12.460 ± 0.155  13.836 ± 0.054  ns/op
ArraysHashCode.multiints      11  avgt   30  14.541 ± 0.140  14.613 ± 0.084  ns/op
ArraysHashCode.multiints      12  avgt   30  15.097 ± 0.052  15.517 ± 0.097  ns/op
ArraysHashCode.multiints      13  avgt   30  13.632 ± 0.137  14.486 ± 0.181  ns/op
ArraysHashCode.multiints      14  avgt   30  15.771 ± 0.108  16.153 ± 0.092  ns/op
ArraysHashCode.multiints      15  avgt   30  14.726 ± 0.088  15.930 ± 0.077  ns/op
ArraysHashCode.multiints      16  avgt   30  15.533 ± 0.067  15.496 ± 0.083  ns/op
ArraysHashCode.multiints      17  avgt   30  15.875 ± 0.173  16.878 ± 0.172  ns/op
ArraysHashCode.multiints      18  avgt   30  15.740 ± 0.114  16.465 ± 0.089  ns/op
ArraysHashCode.multiints      19  avgt   30  17.252 ± 0.051  17.628 ± 0.155  ns/op
ArraysHashCode.multiints      20  avgt   30  20.193 ± 0.282  19.039 ± 0.441  ns/op
ArraysHashCode.multiints      25  avgt   30  20.209 ± 0.070  20.513 ± 0.071  ns/op 
ArraysHashCode.multiints      30  avgt   30  23.157 ± 0.068  23.290 ± 0.165  ns/op
ArraysHashCode.multiints      35  avgt   30  28.671 ± 0.116  26.198 ± 0.127  ns/op <---
ArraysHashCode.multiints      40  avgt   30  30.992 ± 0.068  27.342 ± 0.072  ns/op
ArraysHashCode.multiints      45  avgt   30  39.408 ± 1.428  32.170 ± 0.230  ns/op
ArraysHashCode.multiints      50  avgt   30  41.976 ± 0.442  33.103 ± 0.090  ns/op
ArraysHashCode.multiints      55  avgt   30  45.379 ± 0.236  35.899 ± 0.692  ns/op
ArraysHashCode.multiints      60  avgt   30  48.615 ± 0.249  35.709 ± 0.477  ns/op
ArraysHashCode.multiints      65  avgt   30  51.455 ± 0.213  38.275 ± 0.266  ns/op
ArraysHashCode.multiints      70  avgt   30  54.032 ± 0.324  37.985 ± 0.264  ns/op
ArraysHashCode.multiints      75  avgt   30  56.759 ± 0.164  39.446 ± 0.425  ns/op
ArraysHashCode.multiints      80  avgt   30  61.334 ± 0.267  41.521 ± 0.310  ns/op
ArraysHashCode.multiints      85  avgt   30  66.177 ± 0.299  44.136 ± 0.407  ns/op
ArraysHashCode.multiints      90  avgt   30  67.444 ± 0.282  42.909 ± 0.275  ns/op
ArraysHashCode.multiints      95  avgt   30  77.312 ± 0.303  49.078 ± 1.166  ns/op
ArraysHashCode.multiints     100  avgt   30  78.405 ± 0.220  47.499 ± 0.553  ns/op
ArraysHashCode.multiints     105  avgt   30  75.706 ± 0.265  46.029 ± 0.579  ns/op

As you can see, the numbers become better with +UseRVV only after length >= 30, which may explain why my attempt to improve the tail with RVV instructions was unsuccessful: the cost of setting up the vector unit for small lengths is too high. :-(

@RealFYang RealFYang Jan 26, 2024

Hi, I don't quite understand why LMUL needs to change from m4 to m2 if we switch to the stripmining approach. The tail calculation should normally share the code for VEC_LOOP, which also means we need to use some vector mask instructions to filter out the active elements in each loop iteration, especially the iteration that handles the tail elements. The vl returned by vsetvli tells us the number of elements that can be processed in parallel in a given iteration ([1] is one example). I am not sure whether this is what you are trying. Do you have more details or code changes to share? Thanks.

[1] https://github.com/riscv/riscv-v-spec/blob/master/v-spec.adoc#example-stripmine-sew
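
One way to read this suggestion is as the scalar model below; it is not RVV code and not the patch. It assumes that a strip of vl elements can be folded in with `h = h * 31^vl + a[0] * 31^(vl-1) + ... + a[vl-1]`, and that vl is `min(remaining, vlmax)`. The name `hash_stripmined` is hypothetical, and masking is not modelled.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical sketch: a single stripmined loop for the polynomial hash.
// Unsigned arithmetic gives the 32-bit wrap-around the Java hash relies on.
inline std::uint32_t hash_stripmined(const std::vector<std::int32_t>& a,
                                     std::uint32_t initial, std::size_t vlmax) {
    assert(vlmax > 0);
    // pow31[k] = 31^k for k = 0..vlmax, precomputed once.
    std::vector<std::uint32_t> pow31(vlmax + 1, 1u);
    for (std::size_t k = 1; k <= vlmax; ++k) pow31[k] = pow31[k - 1] * 31u;

    std::uint32_t h = initial;
    std::size_t pos = 0;
    while (pos < a.size()) {
        // Models vsetvli: vl is the strip length for this iteration.
        const std::size_t vl = std::min(a.size() - pos, vlmax);
        std::uint32_t sum = 0;  // models the multiply and reduction of a strip
        for (std::size_t i = 0; i < vl; ++i) {
            sum += static_cast<std::uint32_t>(a[pos + i]) * pow31[vl - 1 - i];
        }
        h = h * pow31[vl] + sum;  // the tail strip uses a smaller power
        pos += vl;
    }
    return h;
}
```

The result does not depend on `vlmax`, which is what lets the last, shorter strip go through the same loop. Whether this is faster than a scalar tail on real hardware is exactly the open question in this thread.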

@ygaevsky ygaevsky Jan 30, 2024

I used the m4->m2 change to process 8 elements in the tail with vector instructions after the main vector loop. IIUC, changing m4->m2 at runtime is very costly, so I've created another patch with the same goal but without the m4->m2 change:

void C2_MacroAssembler::arrays_hashcode_v(Register ary, Register cnt, Register result,
                                          Register tmp1, Register tmp2, Register tmp3,
                                          Register tmp4, Register tmp5, Register tmp6,
                                          BasicType eltype)
{
...
  const int nof_vec_elems = MaxVectorSize;
  const int hof_vec_elems = nof_vec_elems >> 1;
  const int elsize_bytes = arrays_hashcode_elsize(eltype);
  const int elsize_shift = exact_log2(elsize_bytes);
  const int vec_step_bytes = nof_vec_elems << elsize_shift;
  const int half_vec_step_bytes = vec_step_bytes >> 1;
  const address adr_pows31 = StubRoutines::riscv::arrays_hashcode_powers_of_31()
                           + sizeof(jint);
 
...

  const Register chunks = tmp1;
  const Register chunks_end = chunks;
  const Register pows31 = tmp2;
  const Register powmax = tmp3;

  const VectorRegister v_coeffs =  v4;
  const VectorRegister v_src    =  v8;
  const VectorRegister v_sum    = v12;
  const VectorRegister v_powmax = v16;
  const VectorRegister v_result = v20;
  const VectorRegister v_tmp    = v24;
  const VectorRegister v_zred   = v28;

  Label DONE, TAIL, TAIL_LOOP, PRE_TAIL, SAVE_VRESULT, WIDE_TAIL, VEC_LOOP;

  // result has a value initially

  beqz(cnt, DONE);

  andi(chunks, cnt, ~(hof_vec_elems-1));
  beqz(chunks, TAIL);

  // load pre-calculated powers of 31
  la(pows31, ExternalAddress(adr_pows31));
  mv(t1, nof_vec_elems);
  vsetvli(t0, t1, Assembler::e32, Assembler::m4);
  vle32_v(v_coeffs, pows31);
  // clear vector registers used in intermediate calculations
  vmv_v_i(v_sum, 0);
  vmv_v_i(v_powmax, 0);
  vmv_v_i(v_result, 0);
  // set initial values
  vmv_s_x(v_result, result);
  vmv_s_x(v_zred, x0);

  andi(chunks, cnt, ~(nof_vec_elems-1));
  beqz(chunks, WIDE_TAIL);

  subw(cnt, cnt, chunks);
  slli(chunks_end, chunks, elsize_shift);
  add(chunks_end, ary, chunks_end);
  // get value of 31^^nof_vec_elems
  lw(powmax, Address(pows31, -1 * sizeof(jint)));
  vmv_s_x(v_powmax, powmax);

  bind(VEC_LOOP);
  // result = result * 31^^(nof_vec_elems) + v_src[0] * 31^^(nof_vec_elems-1)
  //                                + ...  + v_src[nof_vec_elems-1] * 31^^(0)
  vmul_vv(v_result, v_result, v_powmax);
  arrays_hashcode_vec_elload(v_src, v_tmp, ary, eltype);
  vmul_vv(v_src, v_src, v_coeffs);
  vredsum_vs(v_sum, v_src, v_zred);
  vadd_vv(v_result, v_result, v_sum);
  addi(ary, ary, vec_step_bytes); // bump array pointer
  bne(ary, chunks_end, VEC_LOOP); // reached the end of chunks?
  beqz(cnt, SAVE_VRESULT);

  bind(WIDE_TAIL);
  andi(chunks, cnt, ~(hof_vec_elems-1));
  beqz(chunks, PRE_TAIL);

  mv(t1, hof_vec_elems);
  subw(cnt, cnt, t1);
  vslidedown_vx(v_coeffs, v_coeffs, t1);
  // get value of 31^^hof_vec_elems
  lw(powmax, Address(pows31, sizeof(jint)*(hof_vec_elems - 1)));
  vmv_s_x(v_powmax, powmax);
  vsetvli(t0, t1, Assembler::e32, Assembler::m4);
  // result = result * 31^^(hof_vec_elems) + v_src[0] * 31^^(hof_vec_elems-1)
  //                                + ...  + v_src[hof_vec_elems-1] * 31^^(0)
  vmul_vv(v_result, v_result, v_powmax);
  arrays_hashcode_vec_elload(v_src, v_tmp, ary, eltype);
  vmul_vv(v_src, v_src, v_coeffs);
  vredsum_vs(v_sum, v_src, v_zred);
  vadd_vv(v_result, v_result, v_sum);
  beqz(cnt, SAVE_VRESULT);
  addi(ary, ary, half_vec_step_bytes); // bump array pointer

  bind(PRE_TAIL);
  vmv_x_s(result, v_result);

  bind(TAIL);
  slli(chunks_end, cnt, elsize_shift);
  add(chunks_end, ary, chunks_end);

  bind(TAIL_LOOP);
  arrays_hashcode_elload(t0, Address(ary), eltype);
  slli(t1, result, 5);           // optimize 31 * result
  subw(result, t1, result);      // with result<<5 - result
  addw(result, result, t0);
  addi(ary, ary, elsize_bytes);
  bne(ary, chunks_end, TAIL_LOOP);
  j(DONE);

  bind(SAVE_VRESULT);
  vmv_x_s(result, v_result);

  bind(DONE);
...
}

and got the following numbers:

[ -XX:+UseVectorizedHashCodeIntrinsic -XX:-UseRVV ]
Benchmark                  (size)  Mode  Cnt   Score   Error  Units
ArraysHashCode.multibytes       8  avgt   10  11.020 ± 0.225  ns/op
ArraysHashCode.multibytes       9  avgt   10  12.578 ± 0.117  ns/op
ArraysHashCode.multibytes      16  avgt   10  15.505 ± 0.273  ns/op
ArraysHashCode.multibytes      17  avgt   10  16.603 ± 0.164  ns/op
ArraysHashCode.multibytes      24  avgt   10  21.005 ± 0.271  ns/op
ArraysHashCode.multibytes      25  avgt   10  21.428 ± 0.227  ns/op
ArraysHashCode.multibytes      32  avgt   10  27.985 ± 0.356  ns/op
ArraysHashCode.multibytes      33  avgt   10  29.669 ± 0.145  ns/op
ArraysHashCode.multibytes      48  avgt   10  37.575 ± 0.318  ns/op
ArraysHashCode.multibytes      49  avgt   10  40.121 ± 0.229  ns/op
ArraysHashCode.multibytes      56  avgt   10  48.637 ± 0.274  ns/op
ArraysHashCode.multibytes      57  avgt   10  45.931 ± 0.305  ns/op
ArraysHashCode.multibytes      64  avgt   10  48.362 ± 0.315  ns/op
ArraysHashCode.multibytes      65  avgt   10  52.228 ± 0.320  ns/op
ArraysHashCode.multibytes      72  avgt   10  49.523 ± 0.287  ns/op
ArraysHashCode.multibytes      73  avgt   10  54.788 ± 0.437  ns/op
ArraysHashCode.multibytes      80  avgt   10  62.087 ± 0.289  ns/op
ArraysHashCode.multibytes      81  avgt   10  62.570 ± 0.211  ns/op
[ -XX:+UseVectorizedHashCodeIntrinsic -XX:+UseRVV ]
Benchmark                  (size)  Mode  Cnt   Score   Error  Units
ArraysHashCode.multibytes       8  avgt   10  15.700 ± 0.181  ns/op
ArraysHashCode.multibytes       9  avgt   10  20.743 ± 0.419  ns/op
ArraysHashCode.multibytes      16  avgt   10  30.189 ± 0.301  ns/op
ArraysHashCode.multibytes      17  avgt   10  32.639 ± 0.601  ns/op
ArraysHashCode.multibytes      24  avgt   10  36.358 ± 0.628  ns/op
ArraysHashCode.multibytes      25  avgt   10  34.486 ± 0.563  ns/op
ArraysHashCode.multibytes      32  avgt   10  42.667 ± 0.473  ns/op
ArraysHashCode.multibytes      33  avgt   10  44.858 ± 0.413  ns/op
ArraysHashCode.multibytes      48  avgt   10  47.132 ± 0.443  ns/op
ArraysHashCode.multibytes      49  avgt   10  51.528 ± 0.519  ns/op
ArraysHashCode.multibytes      56  avgt   10  52.133 ± 0.225  ns/op
ArraysHashCode.multibytes      57  avgt   10  48.549 ± 0.411  ns/op
ArraysHashCode.multibytes      64  avgt   10  57.399 ± 0.546  ns/op
ArraysHashCode.multibytes      65  avgt   10  57.680 ± 0.158  ns/op
ArraysHashCode.multibytes      72  avgt   10  50.890 ± 0.327  ns/op
ArraysHashCode.multibytes      73  avgt   10  54.338 ± 0.378  ns/op
ArraysHashCode.multibytes      80  avgt   10  59.218 ± 0.301  ns/op
ArraysHashCode.multibytes      81  avgt   10  63.889 ± 0.344  ns/op

As you can see, the numbers are worse even in cases where scalar code is not used at all, i.e. for lengths 16, 24, 32, 48, 56, 64, etc. It seems possible to change the code so that it contains no scalar code at all, e.g. by using the vslidedown instruction to move the pre-calculated powers of 31 in v_coeffs according to the count of remaining elements, and performing the calculation:

  vmul_vv(v_result, v_result, v_powmax);
  arrays_hashcode_vec_elload(v_src, v_tmp, ary, eltype);
  vmul_vv(v_src, v_src, v_coeffs);
  vredsum_vs(v_sum, v_src, v_zred);
  vadd_vv(v_result, v_result, v_sum);

for all of them at once. However, as I pointed out above in the notes about lengths 24/36/..., that is unlikely to change the performance numbers.
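
The slide-down idea can be illustrated with a scalar model; this is a sketch under stated assumptions, not the patch. It assumes a table of descending powers 31^n ... 31^0, and reading it from offset n - k stands in for sliding the coefficient vector down by n - k lanes. `make_pow31_table` and `fold_tail` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: table[j] = 31^(n - j) for j = 0..n, so the table
// reads 31^n, 31^(n-1), ..., 31^1, 31^0 (32-bit wrap-around arithmetic).
inline std::vector<std::uint32_t> make_pow31_table(std::size_t n) {
    std::vector<std::uint32_t> table(n + 1, 1u);
    for (std::size_t j = n; j-- > 0;) table[j] = table[j + 1] * 31u;
    return table;
}

// Folds the last k elements (k <= n) into h in one step:
//   h = h * 31^k + tail[0] * 31^(k-1) + ... + tail[k-1] * 31^0
inline std::uint32_t fold_tail(std::uint32_t h, const std::int32_t* tail,
                               std::size_t k,
                               const std::vector<std::uint32_t>& table) {
    assert(!table.empty() && k <= table.size() - 1);
    const std::size_t n = table.size() - 1;
    const std::size_t slide = n - k;  // lanes the coefficients slide down
    std::uint32_t sum = 0;
    for (std::size_t i = 0; i < k; ++i) {
        // table[slide + 1 + i] == 31^(k - 1 - i)
        sum += static_cast<std::uint32_t>(tail[i]) * table[slide + 1 + i];
    }
    return h * table[slide] + sum;  // table[slide] == 31^k
}
```

For example, with n = 8 and a 3-element tail, the coefficients used are 961, 31 and 1, and the running result is multiplied by 29791.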

Contributor Author

Of course, any ideas for improving the code are very welcome.

@RealFYang RealFYang Feb 5, 2024

> Of course, any ideas for improving the code are very welcome.

Hi, I am afraid that the local changes you posted are still not in stripmining form. Normally I would expect a single loop with masked vector instructions to handle all cases, including the tail. See my previous comment [1]. Note that I am not saying that stripmining performs best here, but we will need the numbers to evaluate the different approaches.

[1] #17413 (comment)

@VladimirKempik VladimirKempik mentioned this pull request Feb 7, 2024
3 tasks

bridgekeeper bot commented Mar 4, 2024

@ygaevsky This pull request has been inactive for more than 4 weeks and will be automatically closed if another 4 weeks passes without any activity. To avoid this, simply add a new comment to the pull request. Feel free to ask for assistance if you need help with progressing this pull request towards integration!

ygaevsky commented Mar 4, 2024

"Please keep me active" comment.

openjdk bot commented Mar 13, 2024

❗ This change is not yet ready to be integrated.
See the Progress checklist in the description for automated requirements.

openjdk bot commented Apr 10, 2024

@ygaevsky this pull request can not be integrated into master due to one or more merge conflicts. To resolve these merge conflicts and update this pull request you can run the following commands in the local repository for your personal fork:

git checkout JDK-8322174
git fetch https://git.openjdk.org/jdk.git master
git merge FETCH_HEAD
# resolve conflicts and follow the instructions given by git merge
git commit -m "Merge master"
git push

@openjdk openjdk bot added merge-conflict Pull request has merge conflict with target branch and removed rfr Pull request is ready for review labels Apr 10, 2024
