Like a scalar shift, a vector shift does nothing when the shift count is
zero.
This patch implements the 'Identity' method for all kinds of vector
shift nodes to optimize out shifts whose 'ShiftVCntNode' input is 0, which
otherwise show up as a redundant 'mov' in the final generated code, like below:
```
add x17, x12, x14
ldr q16, [x17, #16]
mov v16.16b, v16.16b
add x14, x13, x14
str q16, [x14, #16]
```
With this patch, the code above could be optimized as below:
```
add x17, x12, x14
ldr q16, [x17, #16]
add x14, x13, x14
str q16, [x14, #16]
```
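One hypothetical way to reach a zero-count vector shift is through the Vector API; this sketch is illustrative only (it assumes the incubating jdk.incubator.vector module and is not taken from the patch's tests):
```java
import jdk.incubator.vector.IntVector;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

public class ShiftByZeroExample {
    static final VectorSpecies<Integer> SPECIES = IntVector.SPECIES_128;

    // A vector shift whose count is the constant 0; with the new Identity()
    // the shift node collapses to its input instead of emitting a vector 'mov'.
    static void copyViaZeroShift(int[] src, int[] dst) {
        for (int i = 0; i < SPECIES.loopBound(src.length); i += SPECIES.length()) {
            IntVector v = IntVector.fromArray(SPECIES, src, i);
            v.lanewise(VectorOperators.LSHL, 0).intoArray(dst, i);
        }
    }
}
```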
[TESTS]
compiler/vectorapi/TestVectorShiftImm.java, jdk/incubator/vector,
hotspot::tier1 passed without new failures.
Change-Id: I7657c0daaa5f758966936b9ede670c8b9ad94c48
The vector shift count was defined by two separate nodes (LShiftCntV and
RShiftCntV), which prevented the count from being shared even when the
shift counts are the same.
```
public static void test_shiftv(int sh) {
for (int i = 0; i < N; i+=1) {
a0[i] = a1[i] << sh;
b0[i] = b1[i] >> sh;
}
}
```
In the example above, by merging identical shift counts into one
node, the count can be shared by the shift nodes (RShiftV or LShiftV), as shown
below:
```
Before:
1184 LShiftCntV === _ 1189 [[ 1185 ... ]]
1190 RShiftCntV === _ 1189 [[ 1191 ... ]]
1185 LShiftVI === _ 1181 1184 [[ 1186 ]]
1191 RShiftVI === _ 1187 1190 [[ 1192 ]]
After:
1190 ShiftCntV === _ 1189 [[ 1191 1204 ... ]]
1204 LShiftVI === _ 1211 1190 [[ 1203 ]]
1191 RShiftVI === _ 1187 1190 [[ 1192 ]]
```
The final code removes one redundant "dup" (scalar->vector)
and saves one register.
```
Before:
dup v16.16b, w12
dup v17.16b, w12
...
ldr q18, [x13, #16]
sshl v18.4s, v18.4s, v16.4s
add x18, x16, x12 ; iaload
add x4, x15, x12
str q18, [x4, #16] ; iastore
ldr q18, [x18, #16]
add x12, x14, x12
neg v19.16b, v17.16b
sshl v18.4s, v18.4s, v19.4s
str q18, [x12, #16] ; iastore
After:
dup v16.16b, w11
...
ldr q17, [x13, #16]
sshl v17.4s, v17.4s, v16.4s
add x2, x22, x11 ; iaload
add x4, x16, x11
str q17, [x4, #16] ; iastore
ldr q17, [x2, #16]
add x11, x21, x11
neg v18.16b, v16.16b
sshl v17.4s, v17.4s, v18.4s
str q17, [x11, #16] ; iastore
```
Change-Id: I047f3f32df9535d706a9920857d212610e8ce315
r18 should not be used as it is reserved as a platform register. Linux is fine with userspace using it, but Windows and also recently macOS ( openjdk/jdk11u-dev#301 (comment) ) are actually using it on the kernel side.

The macro assembler uses the bit pattern `0x7fffffff` (== `r0-r30`) to specify which registers to spill; fortunately this helper is only used here:
https://github.com/openjdk/jdk/blob/c05dc268acaf87236f30cf700ea3ac778e3b20e5/src/hotspot/cpu/aarch64/templateInterpreterGenerator_aarch64.cpp#L1400-L1404

I haven't seen this particular instance causing any issues in practice _yet_, presumably because it looks hard to align the stars in order to trigger a problem (between stp and ldp of r18 a transition to kernel space must happen *and* the kernel needs to do something with r18). But jdk11u-dev has more usages of the `::pusha`/`::popa` macro and that causes trouble as explained in the link above.

Output of `-XX:+PrintInterpreter` before this change:
```
----------------------------------------------------------------------
method entry point (kind = native) [0x0000000138809b00, 0x000000013880a280] 1920 bytes

--------------------------------------------------------------------------------
0x0000000138809b00: ldr x2, [x12, #16]
0x0000000138809b04: ldrh w2, [x2, #44]
0x0000000138809b08: add x24, x20, x2, uxtx #3
0x0000000138809b0c: sub x24, x24, #0x8
[...]
0x0000000138809fa4: stp x16, x17, [sp, #128]
0x0000000138809fa8: stp x18, x19, [sp, #144]
0x0000000138809fac: stp x20, x21, [sp, #160]
[...]
0x0000000138809fc0: stp x30, xzr, [sp, #240]
0x0000000138809fc4: mov x0, x28
;; 0x10864ACCC
0x0000000138809fc8: mov x9, #0xaccc // #44236
0x0000000138809fcc: movk x9, #0x864, lsl #16
0x0000000138809fd0: movk x9, #0x1, lsl #32
0x0000000138809fd4: blr x9
0x0000000138809fd8: ldp x2, x3, [sp, #16]
[...]
0x0000000138809ff4: ldp x16, x17, [sp, #128]
0x0000000138809ff8: ldp x18, x19, [sp, #144]
0x0000000138809ffc: ldp x20, x21, [sp, #160]
```
After:
```
----------------------------------------------------------------------
method entry point (kind = native) [0x0000000108e4db00, 0x0000000108e4e280] 1920 bytes

--------------------------------------------------------------------------------
0x0000000108e4db00: ldr x2, [x12, #16]
0x0000000108e4db04: ldrh w2, [x2, #44]
0x0000000108e4db08: add x24, x20, x2, uxtx #3
0x0000000108e4db0c: sub x24, x24, #0x8
[...]
0x0000000108e4dfa4: stp x16, x17, [sp, #128]
0x0000000108e4dfa8: stp x19, x20, [sp, #144]
0x0000000108e4dfac: stp x21, x22, [sp, #160]
[...]
0x0000000108e4dfbc: stp x29, x30, [sp, #224]
0x0000000108e4dfc0: mov x0, x28
;; 0x107E4A06C
0x0000000108e4dfc4: mov x9, #0xa06c // #41068
0x0000000108e4dfc8: movk x9, #0x7e4, lsl #16
0x0000000108e4dfcc: movk x9, #0x1, lsl #32
0x0000000108e4dfd0: blr x9
0x0000000108e4dfd4: ldp x2, x3, [sp, #16]
[...]
0x0000000108e4dff0: ldp x16, x17, [sp, #128]
0x0000000108e4dff4: ldp x19, x20, [sp, #144]
0x0000000108e4dff8: ldp x21, x22, [sp, #160]
[...]
```
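As a purely illustrative bit of arithmetic (not code from the patch), excluding r18 from the r0-r30 spill mask mentioned above just amounts to clearing bit 18:
```java
public class SpillMaskSketch {
    public static void main(String[] args) {
        int allR0toR30 = 0x7fffffff;               // bits 0..30 set, one bit per register r0-r30
        int withoutR18 = allR0toR30 & ~(1 << 18);  // drop the platform register r18
        System.out.printf("0x%08x%n", withoutR18); // prints 0x7ffbffff
    }
}
```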
Restore looks like this now:
```
0x0000000106e4dfcc: movk x9, #0x5e4, lsl #16
0x0000000106e4dfd0: movk x9, #0x1, lsl #32
0x0000000106e4dfd4: blr x9
0x0000000106e4dfd8: ldp x2, x3, [sp, #16]
0x0000000106e4dfdc: ldp x4, x5, [sp, #32]
0x0000000106e4dfe0: ldp x6, x7, [sp, #48]
0x0000000106e4dfe4: ldp x8, x9, [sp, #64]
0x0000000106e4dfe8: ldp x10, x11, [sp, #80]
0x0000000106e4dfec: ldp x12, x13, [sp, #96]
0x0000000106e4dff0: ldp x14, x15, [sp, #112]
0x0000000106e4dff4: ldp x16, x17, [sp, #128]
0x0000000106e4dff8: ldp x0, x1, [sp], #144
0x0000000106e4dffc: ldp xzr, x19, [sp], #16
0x0000000106e4e000: ldp x22, x23, [sp, #16]
0x0000000106e4e004: ldp x24, x25, [sp, #32]
0x0000000106e4e008: ldp x26, x27, [sp, #48]
0x0000000106e4e00c: ldp x28, x29, [sp, #64]
0x0000000106e4e010: ldp x30, xzr, [sp, #80]
0x0000000106e4e014: ldp x20, x21, [sp], #96
0x0000000106e4e018: ldur x12, [x29, #-24]
0x0000000106e4e01c: ldr x22, [x12, #16]
0x0000000106e4e020: add x22, x22, #0x30
0x0000000106e4e024: ldr x8, [x28, #8]
```
The patch aims to optimize Math.abs() mainly in these three ways:
1) Remove redundant instructions for abs with constant values
2) Remove redundant instructions for abs with char type
3) Convert some common abs operations to ideal forms
1. Remove redundant instructions for abs with constant values
If we can determine the value of the input node of Math.abs()
at compile time, we can substitute the Abs node with the absolute
value of the constant and avoid calculating it at runtime.
For example,
int[] a
for (int i = 0; i < SIZE; i++) {
a[i] = Math.abs(-38);
}
Before the patch, the generated code for the testcase above is:
...
mov w10, #0xffffffda
cmp w10, wzr
cneg w17, w10, lt
dup v16.8h, w17
...
After the patch, the generated code for the testcase above is:
...
movi v16.4s, #0x26
...
2. Remove redundant instructions for abs with char type
In Java semantics, since the char type is always non-negative, we
can actually remove the AbsI node in the C2 middle end.
As for the vectorization part, in the current SLP, the vectorization of
Math.abs() with char type was intentionally disabled after
JDK-8261022 because it generated incorrect results before. After
removing the AbsI node in the middle end, Math.abs(char) can be
vectorized naturally.
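A quick standalone check of that invariant (illustrative only, not part of the patch's tests):
```java
public class CharAbsCheck {
    public static void main(String[] args) {
        // A char widens to an int in [0, 65535], so Math.abs() is the identity on it
        // and the AbsI node produced for Math.abs(char) can be removed.
        for (int c = Character.MIN_VALUE; c <= Character.MAX_VALUE; c++) {
            if (Math.abs(c) != c) {
                throw new AssertionError("unexpected at c=" + c);
            }
        }
        System.out.println("Math.abs is the identity for all char values");
    }
}
```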
For example,
char[] a;
char[] b;
for (int i = 0; i < SIZE; i++) {
b[i] = (char) Math.abs(a[i]);
}
Before the patch, the generated assembly code for the testcase
above is:
B15:
add x13, x21, w20, sxtw #1
ldrh w11, [x13, #16]
cmp w11, wzr
cneg w10, w11, lt
strh w10, [x13, #16]
ldrh w10, [x13, #18]
cmp w10, wzr
cneg w10, w10, lt
strh w10, [x13, #18]
...
add w20, w20, #0x1
cmp w20, w17
b.lt B15
After the patch, the generated assembly code is:
B15:
sbfiz x18, x19, #1, #32
add x0, x14, x18
ldr q16, [x0, #16]
add x18, x21, x18
str q16, [x18, #16]
ldr q16, [x0, #32]
str q16, [x18, #32]
...
add w19, w19, #0x40
cmp w19, w17
b.lt B15
3. Convert some common abs operations to ideal forms
The patch overrides some virtual support functions of AbsNode
so that GVN optimizations can work on it. Here are the optimizable
forms:
a) abs(0 - x) => abs(x)
Before the patch:
...
ldr w13, [x13, #16]
neg w13, w13
cmp w13, wzr
cneg w14, w13, lt
...
After the patch:
...
ldr w13, [x13, #16]
cmp w13, wzr
cneg w13, w13, lt
...
b) abs(abs(x)) => abs(x)
Before the patch:
...
ldr w12, [x12, #16]
cmp w12, wzr
cneg w12, w12, lt
cmp w12, wzr
cneg w12, w12, lt
...
After the patch:
...
ldr w13, [x13, #16]
cmp w13, wzr
cneg w13, w13, lt
...
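Two tiny Java shapes that exercise those GVN rewrites (illustrative, not from the patch):
```java
public class AbsIdealForms {
    // abs(0 - x) => abs(x): the negation is folded away.
    static int negThenAbs(int x) {
        return Math.abs(0 - x);
    }

    // abs(abs(x)) => abs(x): the outer abs is redundant.
    static int absOfAbs(int x) {
        return Math.abs(Math.abs(x));
    }

    public static void main(String[] args) {
        System.out.println(negThenAbs(-38) + " " + absOfAbs(-38)); // 38 38
    }
}
```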
Change-Id: I5434c01a225796caaf07ffbb19983f4fe2e206bd
*** Implementation
In AArch64 NEON, vector shift right is implemented by the vector shift left
instructions (SSHL[1] and USHL[2]) with a negative shift count value. In
the C2 backend, we generate a `neg` of the given shift value followed by an
`sshl` or `ushl` instruction.
For vector shift right, the vector shift count has two origins:
1) it can be duplicated from a scalar variable/immediate (case-1),
2) it can be loaded directly from a vector (case-2).
This patch aims to optimize case-1. Specifically, we move the negate
from the RShiftV* rules to the RShiftCntV rule. As a result, the negate can be
hoisted outside of the loop if it's a loop invariant.
In this patch,
1) we split the vshiftcnt* rules into vslcnt* and vsrcnt* rules to handle
shift left and shift right respectively. Compared to the vslcnt* rules, the
negate is performed in vsrcnt*.
2) for each of the vsra* and vsrl* rules, we create one variant, i.e. vsra*_var
and vsrl*_var. We use the vsra* and vsrl* rules to handle case-1, and use the
vsra*_var and vsrl*_var rules to handle case-2 (see the sketch after this
list). Note that ShiftVNode::is_var_shift() can be used to distinguish case-1
from case-2.
3) we add one assertion for the vs*_imm rules as we have done on
ARM32[3].
4) several style issues are resolved.
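As a rough illustration of the two shift-count origins in Vector API terms (assumed example, not from the patch; it requires the incubating jdk.incubator.vector module):
```java
import jdk.incubator.vector.IntVector;
import jdk.incubator.vector.VectorOperators;

public class ShiftCountOrigins {
    // Case-1: the shift count is a broadcast scalar (matched by the vsra*/vsrl* rules).
    static IntVector scalarCount(IntVector v, int count) {
        return v.lanewise(VectorOperators.ASHR, count);
    }

    // Case-2: per-lane shift counts come from another vector (matched by the *_var rules).
    static IntVector vectorCount(IntVector v, IntVector counts) {
        return v.lanewise(VectorOperators.ASHR, counts);
    }
}
```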
*** Example
Take function `rShiftInt()` in the newly added micro benchmark
VectorShiftRight.java as an example.
```
public void rShiftInt() {
for (int i = 0; i < SIZE; i++) {
intsB[i] = intsA[i] >> count;
}
}
```
Arithmetic shift right is performed inside a big loop. The following
code snippet shows the disassembly generated by auto-vectorization
before applying the current patch. We can see that `neg` is performed inside the
loop body.
```
0x0000ffff89057a64: dup v16.16b, w13 <-- dup
0x0000ffff89057a68: mov w12, #0x7d00 // #32000
0x0000ffff89057a6c: sub w13, w2, w10
0x0000ffff89057a70: cmp w2, w10
0x0000ffff89057a74: csel w13, wzr, w13, lt
0x0000ffff89057a78: mov w8, #0x7d00 // #32000
0x0000ffff89057a7c: cmp w13, w8
0x0000ffff89057a80: csel w13, w12, w13, hi
0x0000ffff89057a84: add w14, w13, w10
0x0000ffff89057a88: nop
0x0000ffff89057a8c: nop
0x0000ffff89057a90: sbfiz x13, x10, #2, #32 <-- loop entry
0x0000ffff89057a94: add x15, x17, x13
0x0000ffff89057a98: ldr q17, [x15,#16]
0x0000ffff89057a9c: add x13, x0, x13
0x0000ffff89057aa0: neg v18.16b, v16.16b <-- neg
0x0000ffff89057aa4: sshl v17.4s, v17.4s, v18.4s <-- shift right
0x0000ffff89057aa8: str q17, [x13,#16]
0x0000ffff89057aac: ...
0x0000ffff89057b1c: add w10, w10, #0x20
0x0000ffff89057b20: cmp w10, w14
0x0000ffff89057b24: b.lt 0x0000ffff89057a90 <-- loop end
```
Here is the disassembly after applying the current patch. We can see
that the negate is no longer performed inside the loop; it has been
hoisted outside.
```
0x0000ffff8d053a68: neg w14, w13 <---- neg
0x0000ffff8d053a6c: dup v16.16b, w14 <---- dup
0x0000ffff8d053a70: sub w14, w2, w10
0x0000ffff8d053a74: cmp w2, w10
0x0000ffff8d053a78: csel w14, wzr, w14, lt
0x0000ffff8d053a7c: mov w8, #0x7d00 // #32000
0x0000ffff8d053a80: cmp w14, w8
0x0000ffff8d053a84: csel w14, w12, w14, hi
0x0000ffff8d053a88: add w13, w14, w10
0x0000ffff8d053a8c: nop
0x0000ffff8d053a90: sbfiz x14, x10, #2, #32 <-- loop entry
0x0000ffff8d053a94: add x15, x17, x14
0x0000ffff8d053a98: ldr q17, [x15,#16]
0x0000ffff8d053a9c: sshl v17.4s, v17.4s, v16.4s <-- shift right
0x0000ffff8d053aa0: add x14, x0, x14
0x0000ffff8d053aa4: str q17, [x14,#16]
0x0000ffff8d053aa8: ...
0x0000ffff8d053afc: add w10, w10, #0x20
0x0000ffff8d053b00: cmp w10, w13
0x0000ffff8d053b04: b.lt 0x0000ffff8d053a90 <-- loop end
```
*** Testing
Tier1~3 tests passed on Linux/AArch64 platform.
*** Performance Evaluation
- Auto-vectorization
One micro benchmark, i.e. VectorShiftRight.java, is added by this patch
in order to evaluate the optimization on vector shift right.
The following table shows the result. Column `Score-1` shows the score
before applying the current patch, and column `Score-2` shows the score after
applying it.
We observe about a 30% ~ 53% improvement on the microbenchmarks.
```
Benchmark Units Score-1 Score-2
VectorShiftRight.rShiftByte ops/ms 10601.980 13816.353
VectorShiftRight.rShiftInt ops/ms 3592.831 5502.941
VectorShiftRight.rShiftLong ops/ms 1584.012 2425.247
VectorShiftRight.rShiftShort ops/ms 6643.414 9728.762
VectorShiftRight.urShiftByte ops/ms 2066.965 2048.336 (*)
VectorShiftRight.urShiftChar ops/ms 6660.805 9728.478
VectorShiftRight.urShiftInt ops/ms 3592.909 5514.928
VectorShiftRight.urShiftLong ops/ms 1583.995 2422.991
*: Logical shift right for the Byte type (urShiftByte) is not vectorized, as
discussed in [4].
```
- VectorAPI
Furthermore, we also evaluate the impact of this patch on VectorAPI
benchmarks, e.g., [5]. Details can be found in the table below. Columns
`Score-1` and `Score-2` show the scores before and after applying the
current patch.
```
Benchmark Units Score-1 Score-2
Byte128Vector.LSHL ops/ms 10867.666 10873.993
Byte128Vector.LSHLShift ops/ms 10945.729 10945.741
Byte128Vector.LSHR ops/ms 8629.305 8629.343
Byte128Vector.LSHRShift ops/ms 8245.864 10303.521 <--
Byte128Vector.ASHR ops/ms 8619.691 8629.438
Byte128Vector.ASHRShift ops/ms 8245.860 10305.027 <--
Int128Vector.LSHL ops/ms 3104.213 3103.702
Int128Vector.LSHLShift ops/ms 3114.354 3114.371
Int128Vector.LSHR ops/ms 2380.717 2380.693
Int128Vector.LSHRShift ops/ms 2312.871 2992.377 <--
Int128Vector.ASHR ops/ms 2380.668 2380.647
Int128Vector.ASHRShift ops/ms 2312.894 2992.332 <--
Long128Vector.LSHL ops/ms 1586.907 1587.591
Long128Vector.LSHLShift ops/ms 1589.469 1589.540
Long128Vector.LSHR ops/ms 1209.754 1209.687
Long128Vector.LSHRShift ops/ms 1174.718 1527.502 <--
Long128Vector.ASHR ops/ms 1209.713 1209.669
Long128Vector.ASHRShift ops/ms 1174.712 1527.174 <--
Short128Vector.LSHL ops/ms 5945.542 5943.770
Short128Vector.LSHLShift ops/ms 5984.743 5984.640
Short128Vector.LSHR ops/ms 4613.378 4613.577
Short128Vector.LSHRShift ops/ms 4486.023 5746.466 <--
Short128Vector.ASHR ops/ms 4613.389 4613.478
Short128Vector.ASHRShift ops/ms 4486.019 5746.368 <--
```
1) For the logical shift left cases (LSHL and LSHLShift) and the shift right with
variable vector shift count cases (LSHR and ASHR), we didn't see much
change, which is expected.
2) For the shift right with scalar shift count cases (LSHRShift and ASHRShift),
about a 25% ~ 30% improvement can be observed, and this benefit is
introduced by the current patch.
[1] https://developer.arm.com/documentation/ddi0596/2020-12/SIMD-FP-Instructions/SSHL--Signed-Shift-Left--register--
[2] https://developer.arm.com/documentation/ddi0596/2020-12/SIMD-FP-Instructions/USHL--Unsigned-Shift-Left--register--
[3] openjdk/jdk18#41
[4] openjdk#1087
[5] https://github.com/openjdk/panama-vector/blob/vectorIntrinsics/test/micro/org/openjdk/bench/jdk/incubator/vector/operation/Byte128Vector.java#L509
After JDK-8275317, C2's SLP vectorizer supports type conversion
between types of the same data size. We can also support conversions between
different data sizes like:
int <-> double
float <-> long
int <-> long
float <-> double
A typical test case:
int[] a;
double[] b;
for (int i = start; i < limit; i++) {
b[i] = (double) a[i];
}
Our expected OptoAssembly code for one iteration is like below:
add R12, R2, R11, LShiftL #2
vector_load V16,[R12, #16]
vectorcast_i2d V16, V16 # convert I to D vector
add R11, R1, R11, LShiftL #3 # ptr
add R13, R11, #16 # ptr
vector_store [R13], V16
To enable the vectorization, the patch solves the following problems
in the SLP.
There are three main operations in the case above: LoadI, ConvI2D and
StoreD. Assuming that the vector length is 128 bits, how many scalar
nodes should be packed together into a vector? If we decide it
separately for each operation node, as we did before the patch
in SuperWord::combine_packs(), a 128-bit vector will support 4 LoadI
or 2 ConvI2D or 2 StoreD nodes. However, if we put these packed nodes
into a vector node sequence, like loading 4 elements to a vector, then
typecasting 2 elements and lastly storing these 2 elements, they become
invalid. As a result, we should look through the whole def-use chain
and pick the minimum of these element counts, as the function
SuperWord::max_vector_size_in_ud_chain() does in superword.cpp.
In this case, we pack 2 LoadI, 2 ConvI2D and 2 StoreD nodes, and then
generate a valid vector node sequence, like loading 2 elements,
converting the 2 elements to another type and storing the 2 elements
with the new type.
After this, LoadI nodes don't make full use of the whole vector and
only occupy part of it. So we adapt the code in
SuperWord::get_vw_bytes_special() to this situation.
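A rough arithmetic sketch of that minimum (illustrative Java, not the SuperWord code; the helper names are made up):
```java
public class PackSizeSketch {
    // Lanes of a given element width that fit in a vector of the assumed width.
    static int lanes(int vectorBits, int elemBits) {
        return vectorBits / elemBits;
    }

    public static void main(String[] args) {
        int vw = 128;                                          // assumed vector width in bits
        int loadI    = lanes(vw, 32);                          // 4 ints per vector
        int convI2D  = Math.min(lanes(vw, 32), lanes(vw, 64)); // bounded by the wider type: 2
        int storeD   = lanes(vw, 64);                          // 2 doubles per vector
        int packSize = Math.min(loadI, Math.min(convI2D, storeD));
        System.out.println("elements per pack = " + packSize); // 2
    }
}
```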
In SLP, we calculate a kind of alignment as a position trace for each
scalar node in the whole vector. In this case, the alignments for the 2
LoadI nodes are 0, 4 while the alignments for the 2 ConvI2D nodes are 0, 8.
Sometimes, 4 for LoadI and 8 for ConvI2D mean the same thing, both
marking that this node is the second node in the whole vector, and the
difference between 4 and 8 is just due to their own data sizes. In
this situation, we should try to remove the impact caused by different
data sizes in SLP. For example, in the stage of
SuperWord::extend_packlist(), while determining whether it is possible to
pack a pair of def nodes in the function SuperWord::follow_use_defs(),
we remove the side effect of different data sizes by transforming the
target alignment from the use node. We believe that, assuming
the vector length is 512 bits, if the ConvI2D use nodes have
alignments of 16-24 and their def nodes, LoadI, have alignments of 8-12,
these two LoadI nodes should be packed as a pair as well.
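The rescaling can be pictured with the numbers from the paragraph above (illustrative only; the helper is made up):
```java
public class AlignmentScalingSketch {
    // Map a use-node byte alignment to the corresponding def-node alignment by
    // removing the effect of the differing element sizes (same lane index).
    static int scaleAlignment(int useAlign, int useElemBytes, int defElemBytes) {
        return useAlign / useElemBytes * defElemBytes;
    }

    public static void main(String[] args) {
        // ConvI2D uses at byte offsets 16 and 24 (8-byte doubles) correspond to
        // LoadI defs at offsets 8 and 12 (4-byte ints), so the pair can still be packed.
        System.out.println(scaleAlignment(16, 8, 4)); // 8
        System.out.println(scaleAlignment(24, 8, 4)); // 12
    }
}
```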
Similarly, when determining whether the vectorization is profitable, type
conversion between different data sizes takes a type of one size and
produces a type of another size, hence special checks on alignment
and size should be applied, as we do in SuperWord::is_vector_use().
After solving these problems, we successfully implemented the
vectorization of type conversion between different data sizes.
Here is the test data on NEON:
Before the patch:
Benchmark (length) Mode Cnt Score Error Units
VectorLoop.convertD2F 523 avgt 15 216.431 ± 0.131 ns/op
VectorLoop.convertD2I 523 avgt 15 220.522 ± 0.311 ns/op
VectorLoop.convertF2D 523 avgt 15 217.034 ± 0.292 ns/op
VectorLoop.convertF2L 523 avgt 15 231.634 ± 1.881 ns/op
VectorLoop.convertI2D 523 avgt 15 229.538 ± 0.095 ns/op
VectorLoop.convertI2L 523 avgt 15 214.822 ± 0.131 ns/op
VectorLoop.convertL2F 523 avgt 15 230.188 ± 0.217 ns/op
VectorLoop.convertL2I 523 avgt 15 162.234 ± 0.235 ns/op
After the patch:
Benchmark (length) Mode Cnt Score Error Units
VectorLoop.convertD2F 523 avgt 15 124.352 ± 1.079 ns/op
VectorLoop.convertD2I 523 avgt 15 557.388 ± 8.166 ns/op
VectorLoop.convertF2D 523 avgt 15 118.082 ± 4.026 ns/op
VectorLoop.convertF2L 523 avgt 15 225.810 ± 11.180 ns/op
VectorLoop.convertI2D 523 avgt 15 166.247 ± 0.120 ns/op
VectorLoop.convertI2L 523 avgt 15 119.699 ± 2.925 ns/op
VectorLoop.convertL2F 523 avgt 15 220.847 ± 0.053 ns/op
VectorLoop.convertL2I 523 avgt 15 122.339 ± 2.738 ns/op
perf data on X86:
Before the patch:
Benchmark (length) Mode Cnt Score Error Units
VectorLoop.convertD2F 523 avgt 15 279.466 ± 0.069 ns/op
VectorLoop.convertD2I 523 avgt 15 551.009 ± 7.459 ns/op
VectorLoop.convertF2D 523 avgt 15 276.066 ± 0.117 ns/op
VectorLoop.convertF2L 523 avgt 15 545.108 ± 5.697 ns/op
VectorLoop.convertI2D 523 avgt 15 745.303 ± 0.185 ns/op
VectorLoop.convertI2L 523 avgt 15 260.878 ± 0.044 ns/op
VectorLoop.convertL2F 523 avgt 15 502.016 ± 0.172 ns/op
VectorLoop.convertL2I 523 avgt 15 261.654 ± 3.326 ns/op
After the patch:
Benchmark (length) Mode Cnt Score Error Units
VectorLoop.convertD2F 523 avgt 15 106.975 ± 0.045 ns/op
VectorLoop.convertD2I 523 avgt 15 546.866 ± 9.287 ns/op
VectorLoop.convertF2D 523 avgt 15 82.414 ± 0.340 ns/op
VectorLoop.convertF2L 523 avgt 15 542.235 ± 2.785 ns/op
VectorLoop.convertI2D 523 avgt 15 92.966 ± 1.400 ns/op
VectorLoop.convertI2L 523 avgt 15 79.960 ± 0.528 ns/op
VectorLoop.convertL2F 523 avgt 15 504.712 ± 4.794 ns/op
VectorLoop.convertL2I 523 avgt 15 129.753 ± 0.094 ns/op
perf data on AVX512:
Before the patch:
Benchmark (length) Mode Cnt Score Error Units
VectorLoop.convertD2F 523 avgt 15 282.984 ± 4.022 ns/op
VectorLoop.convertD2I 523 avgt 15 543.080 ± 3.873 ns/op
VectorLoop.convertF2D 523 avgt 15 273.950 ± 0.131 ns/op
VectorLoop.convertF2L 523 avgt 15 539.568 ± 2.747 ns/op
VectorLoop.convertI2D 523 avgt 15 745.238 ± 0.069 ns/op
VectorLoop.convertI2L 523 avgt 15 260.935 ± 0.169 ns/op
VectorLoop.convertL2F 523 avgt 15 501.870 ± 0.359 ns/op
VectorLoop.convertL2I 523 avgt 15 257.508 ± 0.174 ns/op
After the patch:
Benchmark (length) Mode Cnt Score Error Units
VectorLoop.convertD2F 523 avgt 15 76.687 ± 0.530 ns/op
VectorLoop.convertD2I 523 avgt 15 545.408 ± 4.657 ns/op
VectorLoop.convertF2D 523 avgt 15 273.935 ± 0.099 ns/op
VectorLoop.convertF2L 523 avgt 15 540.534 ± 3.032 ns/op
VectorLoop.convertI2D 523 avgt 15 745.234 ± 0.053 ns/op
VectorLoop.convertI2L 523 avgt 15 260.865 ± 0.104 ns/op
VectorLoop.convertL2F 523 avgt 15 63.834 ± 4.777 ns/op
VectorLoop.convertL2I 523 avgt 15 48.183 ± 0.990 ns/op
Change-Id: I93e60fd956547dad9204ceec90220145c58a72ef
This patch fixes the wrong matching rule of replicate2L_zero. It
matched "ReplicateI" by mistake, so that long immediates (not only zero)
had to be moved to a register first and finally matched to replicate2L. To
fix this trivial bug, this patch fixes the typo and extends the rule of
replicate2L_zero to replicate2L_imm, which now supports all possible
long immediate values.
The final code changes are shown as below:
replicate2L_imm:
mov x13, #0xff
movk x13, #0xff, lsl #16
movk x13, #0xff, lsl #32
dup v16.2d, x13
=>
movi v16.2d, #0xff00ff00ff
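A hypothetical Vector API broadcast that would exercise the new replicate2L_imm rule (illustrative; assumes the incubating jdk.incubator.vector module):
```java
import jdk.incubator.vector.LongVector;

public class LongBroadcastExample {
    public static void main(String[] args) {
        // Broadcasting a non-zero long constant; previously the immediate had to be
        // materialized in a general-purpose register and then duplicated.
        LongVector v = LongVector.broadcast(LongVector.SPECIES_128, 0xff00ff00ffL);
        System.out.println(v);
    }
}
```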
[Test]
test/jdk/jdk/incubator/vector, test/hotspot/jtreg/compiler/vectorapi
passed without failure.
Change-Id: Ieac92820dea560239a968de3d7430003f01726bd
```
public short[] vectorUnsignedShiftRight(short[] shorts) {
short[] res = new short[SIZE];
for (int i = 0; i < SIZE; i++) {
res[i] = (short) (shorts[i] >>> 3);
}
return res;
}
```
In C2's SLP, vectorization of unsigned shift right on signed
subword types (byte/short) like the case above is intentionally
disabled[1], because the vector unsigned shift on signed
subword types behaves differently from the Java spec. It's
worthwhile to vectorize more such cases at quite low cost. Also,
unsigned shift right on signed subwords is not uncommon and we
can find similar cases in a Lucene benchmark[2].
Taking unsigned right shift on short type as an example,
Short:
| <- 16 bits -> | <- 16 bits -> |
| 1 1 1 ... 1 1 | data |
when the shift amount is a constant not greater than the number
of sign-extended bits (the 16 higher bits for the short type, shown
above), the unsigned shift on signed subword types can be
transformed into a signed shift and hence becomes vectorizable.
Here is the transformation:
For T_SHORT (shift <= 16):
src RShiftCntV shift src RShiftCntV shift
\ / ==> \ /
URShiftVS RShiftVS
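A small standalone check of why the rewrite is legal for short (illustrative, not part of the patch):
```java
public class ShortShiftEquivalence {
    public static void main(String[] args) {
        // For shift counts up to 16, logical and arithmetic right shifts agree once
        // the result is narrowed back to short, because the bits that can differ
        // all sit above bit 15 and are discarded by the narrowing cast.
        for (int sh = 0; sh <= 16; sh++) {
            for (int v = Short.MIN_VALUE; v <= Short.MAX_VALUE; v++) {
                short s = (short) v;
                if ((short) (s >>> sh) != (short) (s >> sh)) {
                    throw new AssertionError("mismatch at v=" + v + ", sh=" + sh);
                }
            }
        }
        System.out.println("URShift and RShift agree for short when shift <= 16");
    }
}
```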
This patch does the transformation in SuperWord::implemented() and
SuperWord::output(). It helps vectorize the short cases above. We
can handle unsigned right shift on byte type in a similar way. The
generated assembly code for one iteration on aarch64 is like:
```
...
sbfiz x13, x10, #1, #32
add x15, x11, x13
ldr q16, [x15, #16]
sshr v16.8h, v16.8h, #3
add x13, x17, x13
str q16, [x13, #16]
...
```
Here is the performance data for micro-benchmark before and after
this patch on both AArch64 and x64 machines. We can observe about
~80% improvement with this patch.
The perf data on AArch64:
Before the patch:
Benchmark (SIZE) (shiftCount) Mode Cnt Score Error Units
urShiftImmByte 1024 3 avgt 5 295.711 ± 0.117 ns/op
urShiftImmShort 1024 3 avgt 5 284.559 ± 0.148 ns/op
After the patch:
Benchmark (SIZE) (shiftCount) Mode Cnt Score Error Units
urShiftImmByte 1024 3 avgt 5 45.111 ± 0.047 ns/op
urShiftImmShort 1024 3 avgt 5 55.294 ± 0.072 ns/op
The perf data on X86:
Before the patch:
Benchmark (SIZE) (shiftCount) Mode Cnt Score Error Units
urShiftImmByte 1024 3 avgt 5 361.374 ± 4.621 ns/op
urShiftImmShort 1024 3 avgt 5 365.390 ± 3.595 ns/op
After the patch:
Benchmark (SIZE) (shiftCount) Mode Cnt Score Error Units
urShiftImmByte 1024 3 avgt 5 105.489 ± 0.488 ns/op
urShiftImmShort 1024 3 avgt 5 43.400 ± 0.394 ns/op
[1] https://github.com/openjdk/jdk/blob/002e3667443d94e2303c875daf72cf1ccbbb0099/src/hotspot/share/opto/vectornode.cpp#L190
[2] https://github.com/jpountz/decode-128-ints-benchmark/
Change-Id: I9bd0cfdfcd9c477e8905a4c877d5e7ff14e39161
This patch optimizes the backend implementation of VectorMaskToLong for
AArch64, providing a more efficient approach to move the mask bits from a
predicate register to a general purpose register, as x86 PMOVMSK[1] does,
by using BEXT[2], which is available in SVE2.
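The Java-level entry point for this node is VectorMask.toLong() in the incubating Vector API; a minimal usage sketch (assumed example, not from the patch):
```java
import jdk.incubator.vector.ByteVector;
import jdk.incubator.vector.VectorMask;
import jdk.incubator.vector.VectorSpecies;

public class MaskToLongExample {
    static final VectorSpecies<Byte> SPECIES = ByteVector.SPECIES_128;

    public static void main(String[] args) {
        boolean[] bits = new boolean[SPECIES.length()];
        bits[0] = true;
        bits[3] = true;
        VectorMask<Byte> mask = VectorMask.fromArray(SPECIES, bits, 0);
        // toLong() is what C2 intrinsifies as VectorMaskToLong.
        System.out.println(Long.toBinaryString(mask.toLong())); // 1001
    }
}
```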
With this patch, the final code (input mask is byte type with
SPECIES_512, generated on a QEMU emulator with an SVE vector register size of
512 bits) changes as below:
Before:
mov z16.b, p0/z, #1
fmov x0, d16
orr x0, x0, x0, lsr #7
orr x0, x0, x0, lsr #14
orr x0, x0, x0, lsr #28
and x0, x0, #0xff
fmov x8, v16.d[1]
orr x8, x8, x8, lsr #7
orr x8, x8, x8, lsr #14
orr x8, x8, x8, lsr #28
and x8, x8, #0xff
orr x0, x0, x8, lsl #8
orr x8, xzr, #0x2
whilele p1.d, xzr, x8
lastb x8, p1, z16.d
orr x8, x8, x8, lsr #7
orr x8, x8, x8, lsr #14
orr x8, x8, x8, lsr #28
and x8, x8, #0xff
orr x0, x0, x8, lsl #16
orr x8, xzr, #0x3
whilele p1.d, xzr, x8
lastb x8, p1, z16.d
orr x8, x8, x8, lsr #7
orr x8, x8, x8, lsr #14
orr x8, x8, x8, lsr #28
and x8, x8, #0xff
orr x0, x0, x8, lsl #24
orr x8, xzr, #0x4
whilele p1.d, xzr, x8
lastb x8, p1, z16.d
orr x8, x8, x8, lsr #7
orr x8, x8, x8, lsr #14
orr x8, x8, x8, lsr #28
and x8, x8, #0xff
orr x0, x0, x8, lsl #32
mov x8, #0x5
whilele p1.d, xzr, x8
lastb x8, p1, z16.d
orr x8, x8, x8, lsr #7
orr x8, x8, x8, lsr #14
orr x8, x8, x8, lsr #28
and x8, x8, #0xff
orr x0, x0, x8, lsl #40
orr x8, xzr, #0x6
whilele p1.d, xzr, x8
lastb x8, p1, z16.d
orr x8, x8, x8, lsr #7
orr x8, x8, x8, lsr #14
orr x8, x8, x8, lsr #28
and x8, x8, #0xff
orr x0, x0, x8, lsl #48
orr x8, xzr, #0x7
whilele p1.d, xzr, x8
lastb x8, p1, z16.d
orr x8, x8, x8, lsr #7
orr x8, x8, x8, lsr #14
orr x8, x8, x8, lsr #28
and x8, x8, #0xff
orr x0, x0, x8, lsl #56
After:
mov z16.b, p0/z, #1
mov z17.b, #1
bext z16.d, z16.d, z17.d
mov z17.d, #0
uzp1 z16.s, z16.s, z17.s
uzp1 z16.h, z16.h, z17.h
uzp1 z16.b, z16.b, z17.b
mov x0, v16.d[0]
[1] https://www.felixcloutier.com/x86/pmovmskb
[2] https://developer.arm.com/documentation/ddi0602/2020-12/SVE-Instructions/BEXT--Gather-lower-bits-from-positions-selected-by-bitmask-
Change-Id: Ia983a20c89f76403e557ac21328f2f2e05dd08e0
After JDK-8283091, the loop below can be vectorized partially.
Statement 1 can be vectorized but statement 2 can't.
```
// int[] iArr; long[] lArrFld; int i1,i2;
for (i1 = 6; i1 < 227; i1++) {
iArr[i1] += lArrFld[i1]++; // statement 1
iArr[i1 + 1] -= (i2++); // statement 2
}
```
But we got incorrect results because the vector packs of iArr are
scheduled incorrectly like:
```
...
load_vector XMM1,[R8 + #16 + R11 << #2]
movl RDI, [R8 + #20 + R11 << #2] # int
load_vector XMM2,[R9 + #8 + R11 << #3]
subl RDI, R11 # int
vpaddq XMM3,XMM2,XMM0 ! add packedL
store_vector [R9 + #8 + R11 << #3],XMM3
vector_cast_l2x XMM2,XMM2 !
vpaddd XMM1,XMM2,XMM1 ! add packedI
addl RDI, #228 # int
movl [R8 + #20 + R11 << #2], RDI # int
movl RBX, [R8 + #24 + R11 << #2] # int
subl RBX, R11 # int
addl RBX, #227 # int
movl [R8 + #24 + R11 << #2], RBX # int
...
movl RBX, [R8 + #40 + R11 << #2] # int
subl RBX, R11 # int
addl RBX, #223 # int
movl [R8 + #40 + R11 << #2], RBX # int
movl RDI, [R8 + #44 + R11 << #2] # int
subl RDI, R11 # int
addl RDI, #222 # int
movl [R8 + #44 + R11 << #2], RDI # int
store_vector [R8 + #16 + R11 << #2],XMM1
...
```
simplified as:
```
load_vector iArr in statement 1
unvectorized loads/stores in statement 2
store_vector iArr in statement 1
```
We cannot pick the memory state from the first load for the LoadI pack
here, as the LoadI vector operation must load the new values in memory
after iArr writes 'iArr[i1 + 1] - (i2++)' to 'iArr[i1 + 1]' (statement 2).
We must take the memory state of the last load, where we have assigned the
new values ('iArr[i1 + 1] - (i2++)') to the iArr array.
In JDK-8240281, we picked the memory state of the first load. Different
from the scenario in JDK-8240281, the store, which is dependent on an
earlier load here, is in a pack to be scheduled and the LoadI pack
depends on the last_mem. As designed[2], to schedule the StoreI pack,
all memory operations in another single pack should be moved in the same
direction. We know that the store in the pack depends on one of loads in
the LoadI pack, so the LoadI pack should be scheduled before the StoreI
pack. And the LoadI pack depends on the last_mem, so the last_mem must
be scheduled before the LoadI pack and also before the store pack.
Therefore, we need to take the memory state of the last load for the
LoadI pack here.
To fix it, the patch adds additional checks while picking the memory state
of the first load. When the store is located in a pack and the load pack
relies on the last_mem, we shouldn't choose the memory state of the
first load but rather the memory state of the last load.
[1]https://github.com/openjdk/jdk/blob/0ae834105740f7cf73fe96be22e0f564ad29b18d/src/hotspot/share/opto/superword.cpp#L2380
[2]https://github.com/openjdk/jdk/blob/0ae834105740f7cf73fe96be22e0f564ad29b18d/src/hotspot/share/opto/superword.cpp#L2232
Jira: ENTLLT-5482
Change-Id: I341d10b91957b60a1b4aff8116723e54083a5fb8
CustomizedGitHooks: yes
…nodes

Recently we found that the rotate left/right benchmarks with vectorapi emit a redundant "and" instruction on both aarch64 and x86_64 machines which can be done away with. For example, and(and(a, b), b) generates two "and" instructions which can be reduced to a single "and" operation, and(a, b), since "and" (and "or") operations are commutative and idempotent in nature. This can help improve performance for all those workloads which have multiple "and"/"or" operations with the same value by reducing them to fewer "and"/"or" operations accordingly.

This patch adds the following transformations for the vector logical operations AndV and OrV:
(OpV (OpV a b) b)       => (OpV a b)
(OpV (OpV a b) a)       => (OpV a b)
(OpV (OpV a b m1) b m1) => (OpV a b m1)
(OpV (OpV a b m1) a m1) => (OpV a b m1)
(OpV a (OpV a b))       => (OpV a b)
(OpV b (OpV a b))       => (OpV a b)
(OpV a (OpV a b m) m)   => (OpV a b m)
where Op = "And", "Or"

Links for benchmarks tested are given below:
https://github.com/openjdk/panama-vector/blob/2aade73adeabdf6a924136b17fd96ccc95c1d160/test/micro/org/openjdk/bench/jdk/incubator/vector/operation/IntMaxVector.java#L728
https://github.com/openjdk/panama-vector/blob/2aade73adeabdf6a924136b17fd96ccc95c1d160/test/micro/org/openjdk/bench/jdk/incubator/vector/operation/IntMaxVector.java#L764
https://github.com/openjdk/panama-vector/blob/2aade73adeabdf6a924136b17fd96ccc95c1d160/test/micro/org/openjdk/bench/jdk/incubator/vector/operation/LongMaxVector.java#L728
https://github.com/openjdk/panama-vector/blob/2aade73adeabdf6a924136b17fd96ccc95c1d160/test/micro/org/openjdk/bench/jdk/incubator/vector/operation/LongMaxVector.java#L764

Before this patch, the disassembly for one of these testcases (IntMaxVector.ROR) for Neon is shown below:
ldr q16, [x12, #16]
and v16.16b, v16.16b, v20.16b
and v16.16b, v16.16b, v20.16b
add x12, x16, x11
sub v17.4s, v21.4s, v16.4s
ldr q18, [x12, #16]
sshl v17.4s, v18.4s, v17.4s
add x11, x18, x11
neg v19.16b, v16.16b
ushl v19.4s, v18.4s, v19.4s
orr v16.16b, v17.16b, v19.16b
str q16, [x11, #16]

After this patch, the disassembly for the same testcase above is shown below:
ldr q16, [x12, #16]
and v16.16b, v16.16b, v20.16b
add x12, x16, x11
sub v17.4s, v21.4s, v16.4s
ldr q18, [x12, #16]
sshl v17.4s, v18.4s, v17.4s
add x11, x18, x11
neg v19.16b, v16.16b
ushl v19.4s, v18.4s, v19.4s
orr v16.16b, v17.16b, v19.16b
str q16, [x11, #16]

The other tests also emit an extra "and" instruction as shown above for the vector ROR/ROL operations.

Below are the performance results for the vectorapi rotate tests (tests given in the links above) with this patch on aarch64 and x86_64 machines (for int and long types):
Benchmark          aarch64   x86_64
IntMaxVector.ROL   25.57%    26.09%
IntMaxVector.ROR   23.75%    24.15%
LongMaxVector.ROL  28.91%    28.51%
LongMaxVector.ROR  16.51%    29.11%

The percentage indicates the percent gain/improvement in performance (ops/ms) with this patch over the master build without this patch. The machine descriptions are given below:
aarch64 - 128-bit aarch64 machine
x86_64 - 256-bit x86 machine
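A minimal Vector API shape that produces the (OpV (OpV a b) b) pattern (illustrative sketch; the new Ideal rules fold the second AND away):
```java
import jdk.incubator.vector.IntVector;
import jdk.incubator.vector.VectorOperators;

public class RedundantAndExample {
    // (a & b) & b folds to a & b after the new AndV Ideal transformation.
    static IntVector redundantAnd(IntVector a, IntVector b) {
        return a.lanewise(VectorOperators.AND, b)
                .lanewise(VectorOperators.AND, b);
    }
}
```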
Fix failing tests
…erOfTrailingZeros/numberOfLeadingZeros()`

Background: The Java API[1] for `Long.bitCount/numberOfTrailingZeros/numberOfLeadingZeros()` returns int type, while the Vector API[2] for them returns long type. Currently, to support auto-vectorization of the Java API and the Vector API at the same time, some vector platforms, namely aarch64 and x86, provide two types of vector nodes taking long type: one produces a long vector type for the Vector API, and the other one produces an int vector type by casting the long-type result from the first one. We can move the casting work for auto-vectorization of the Java API to the mid-end so that we can unify the vector implementation in the backend, reducing extra code. The patch does the refactoring and also fixes several issues below.

1. Refine the auto-vectorization of `Long.bitCount/numberOfTrailingZeros/numberOfLeadingZeros()`
In the patch, during the stage of generating the vector node for the candidate pack, to implement the complete behavior of these Java APIs, superword will make two consecutive vector nodes: the first one, the same as the Vector API, does the real execution to produce a long-type result, and the second one casts the result to an int vector type. For those platforms which already supported correctly vectorizing these Java APIs before, the patch has no real impact on the final generated assembly code and, consequently, has no performance regression.

2. Fix the IR check failure of `compiler/vectorization/TestPopCountVectorLong.java` on 128-bit sve platforms
These Java APIs take a long type and produce an int type, like conversion nodes between different data sizes do. In superword, the alignment of their input nodes is different from their own. As a result, these APIs can't be vectorized when `-XX:MaxVectorSize=16`, so the IR check for vector nodes in `compiler/vectorization/TestPopCountVectorLong.java` would fail. To fix the issue of alignment, the patch corrects their related alignment, just like it did for conversion nodes between different data sizes. After the patch, these Java APIs can be vectorized on 128-bit platforms, as long as the auto-vectorization is profitable.

3. Fix the incorrect vectorization of `numberOfTrailingZeros/numberOfLeadingZeros()` on aarch64 platforms with more than 128 bits
Although `Long.NumberOfLeadingZeros/NumberOfTrailingZeros()` can be vectorized on sve platforms when `-XX:MaxVectorSize=32` or `-XX:MaxVectorSize=64` even before the patch, the aarch64 backend didn't provide a special vector implementation for the Java API and thus the generated code is not correct, like:
```
LOOP:
sxtw x13, w12
add x14, x15, x13, uxtx #3
add x17, x14, #0x10
ld1d {z16.d}, p7/z, [x17]
// Incorrectly use integer rbit/clz insn for long type vector
*rbit z16.s, p7/m, z16.s
*clz z16.s, p7/m, z16.s
add x13, x16, x13, uxtx #2
str q16, [x13, #16]
...
add w12, w12, #0x20
cmp w12, w3
b.lt LOOP
```
It causes a runtime failure of the testcase `compiler/vectorization/TestNumberOfContinuousZeros.java` added in the patch. After the refactoring, the testcase can pass and the code is corrected:
```
LOOP:
sxtw x13, w12
add x14, x15, x13, uxtx #3
add x17, x14, #0x10
ld1d {z16.d}, p7/z, [x17]
// Compute with long vector type and convert to int vector type
*rbit z16.d, p7/m, z16.d
*clz z16.d, p7/m, z16.d
*mov z24.d, #0
*uzp1 z25.s, z16.s, z24.s
add x13, x16, x13, uxtx #2
str q25, [x13, #16]
...
add w12, w12, #0x20
cmp w12, w3
b.lt LOOP
```

4. Fix an assertion failure on the x86 avx2 platform
Before, on the x86 avx2 platform, there was an assertion failure when C2 tried to vectorize loops like:
```
// long[] ia;
// int[] ic;
for (int i = 0; i < LENGTH; ++i) {
    ic[i] = Long.numberOfLeadingZeros(ia[i]);
}
```
The X86 backend supports vectorizing `numberOfLeadingZeros()` on the avx2 platform, but it uses `evpmovqd()` to do casting for `CountLeadingZerosV`[3], which can only be used when `UseAVX > 2`[4]. After the refactoring, the failure can be fixed naturally.

Tier 1~3 passed with no new failures on Linux AArch64/X86 platforms.

[1] https://docs.oracle.com/en/java/javase/19/docs/api/java.base/java/lang/Long.html#bitCount(long)
https://docs.oracle.com/en/java/javase/19/docs/api/java.base/java/lang/Long.html#numberOfTrailingZeros(long)
https://docs.oracle.com/en/java/javase/19/docs/api/java.base/java/lang/Long.html#numberOfLeadingZeros(long)
[2] https://github.com/openjdk/jdk/blob/544e31722528d12fae0eb19271f85886680801a6/src/jdk.incubator.vector/share/classes/jdk/incubator/vector/LongVector.java#L687
[3] https://github.com/openjdk/jdk/blob/544e31722528d12fae0eb19271f85886680801a6/src/hotspot/cpu/x86/x86.ad#L9418
[4] https://github.com/openjdk/jdk/blob/fc616588c1bf731150a9d9b80033bb589bcb231f/src/hotspot/cpu/x86/assembler_x86.cpp#L2239
…dk#16)

* Only use conditional far branch in copy_memory for zgc
* Remove unused code
Co-authored-by: Xin Liu <xxinliu@amazon.com>
…ng into ldp/stp on AArch64

The macro-assembler on aarch64 can merge adjacent loads or stores into ldp/stp[1]. For example, it can merge:
```
str w20, [sp, #16]
str w10, [sp, #20]
```
into
```
stp w20, w10, [sp, #16]
```
But C2 may generate a sequence like:
```
str x21, [sp, #8]
str w20, [sp, #16]
str x19, [sp, #24]  <---
str w10, [sp, #20]  <--- Before sorting
str x11, [sp, #40]
str w13, [sp, #48]
str x16, [sp, #56]
```
We can't do any merging for non-adjacent loads or stores. The patch sorts the spilling or unspilling sequence in the order of offset during the instruction scheduling and bundling phase. After that, we get a new sequence:
```
str x21, [sp, #8]
str w20, [sp, #16]
str w10, [sp, #20]  <---
str x19, [sp, #24]  <--- After sorting
str x11, [sp, #40]
str w13, [sp, #48]
str x16, [sp, #56]
```
Then the macro-assembler can do ld/st merging:
```
str x21, [sp, #8]
stp w20, w10, [sp, #16]  <--- Merged
str x19, [sp, #24]
str x11, [sp, #40]
str w13, [sp, #48]
str x16, [sp, #56]
```
To justify the patch, we run `HelloWorld.java`
```
public class HelloWorld {
    public static void main(String [] args) {
        System.out.println("Hello World!");
    }
}
```
with `java -Xcomp -XX:-TieredCompilation HelloWorld`. Before the patch, the macro-assembler does ld/st merging 3688 times. After the patch, the number of ld/st merges increases to 3871, i.e. by ~5%.

Tested tier1~3 on x86 and AArch64.

[1] https://github.com/openjdk/jdk/blob/a95062b39a431b4937ab6e9e73de4d2b8ea1ac49/src/hotspot/cpu/aarch64/macroAssembler_aarch64.cpp#L2079
Add framework for other platforms. Moved fill_to_memory_atomic back to the .cpp from the .hpp in order to get 32-bit builds fixed.
* Initial cut for repeatable builds
* Fix line wrapping
* Fix line wrapping
* Fix line wrapping
* Fix line wrapping
https://bugs.openjdk.java.net/browse/JDK-8252543