Conversation

e1iu commented Jan 25, 2022

No description provided.

Rob McKenna and others added 30 commits February 4, 2022 13:07
…_block

Reviewed-by: tschatzl, iwalulya, sjohanss
Reviewed-by: jnimeh, hchao
…NPrintWriter.println(...) method

Reviewed-by: prr
Reviewed-by: chagedorn, thartmann, xliu
…t.java is failing

Reviewed-by: sspitsyn, dcubed, lmesnik
…emException on Windows 11

Reviewed-by: dfuchs
…dlock/JavaDeadlock001/TestDescription.java from problemlist.

Reviewed-by: sspitsyn
Reviewed-by: mgronlun
Reviewed-by: djelinski, alanb, dfuchs, aefimov
Reviewed-by: dcubed, coleenp, lfoltan
Alexander Matveev and others added 22 commits February 25, 2022 20:49
Reviewed-by: stuefe, coleenp, dholmes
Reviewed-by: iris, rriggs, bpb, lancea, mchung, scolebourne
…/compiler threads

Reviewed-by: kvn, thartmann
…ames and X509Certificate::getIssuerAlternativeNames in otherName

6776681: Invalid encoding of an OtherName in X509Certificate.getAlternativeNames()

Reviewed-by: mullan
This patch aims to optimize the extract operation on vectors for AArch64
according to the Neoverse N2 and V1 software optimization guides[1][2].

Currently, the extract operation is used by "Vector.lane"[3]. As SVE
doesn't have direct instruction support for such an operation, unlike
"pextr"[4] in x86, the final code is as below:

```
        Byte512Vector.lane(7)

        orr     x8, xzr, #0x7
        whilele p0.b, xzr, x8
        lastb   w10, p0, z16.b
        sxtb    w10, w10
```

This patch uses a NEON instruction instead if the target lane is
located within the NEON 128-bit range. For the same example above, the
generated code is much simpler:

```
        smov    x11, v16.b[7]
```

For cases where the target lane is located outside the NEON 128-bit
range, this patch uses EXT to shift the target to the lowest lane. The
generated code is as below:

```
        Byte512Vector.lane(63)

        mov     z17.d, z16.d
        ext     z17.b, z17.b, z17.b, #63
        smov    x10, v17.b[0]
```
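
For reference, a minimal Java-level sketch (not part of this patch; the
demo class and values are hypothetical) of the Vector API calls that
exercise both extract paths; run with --add-modules jdk.incubator.vector:

```
import jdk.incubator.vector.ByteVector;
import jdk.incubator.vector.VectorSpecies;

public class LaneExtractDemo {
    static final VectorSpecies<Byte> SPECIES = ByteVector.SPECIES_512;

    public static void main(String[] args) {
        byte[] data = new byte[SPECIES.length()];
        for (int i = 0; i < data.length; i++) data[i] = (byte) i;
        ByteVector v = ByteVector.fromArray(SPECIES, data, 0);
        byte lo = v.lane(7);   // lane within the NEON 128-bit range: single SMOV
        byte hi = v.lane(63);  // lane beyond 128 bits: EXT + SMOV
        System.out.println(lo + " " + hi);  // prints "7 63"
    }
}
```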

hotspot/compiler/vectorapi and jdk/incubator/vector passed on an SVE machine.

Refs:

 // From Arm Neoverse N2 and V1 Software Optimization Guide
 // NOTE: The data inside "()" belongs to V1
 +--------------+---------+------------+--------------------+
 |  Instruction | Latency | Throughput | Utilized Pipelines |
 +--------------+---------+------------+--------------------+
 |WHILELE       |        3|      1(1/2)|               M(M0)|
 +--------------+---------+------------+--------------------+
 |LASTB(scalar) |     5(6)|           1|               V1,M0|
 +--------------+---------+------------+--------------------+
 |EXT           |        2|           2|              V(V01)|
 +--------------+---------+------------+--------------------+
 |UMOV,SMOV     |        2|           1|                   V|
 +--------------+---------+------------+--------------------+
 |ORR           |        2|           2|              V(V01)|
 +--------------+---------+------------+--------------------+
 |INS           |        2|        2(4)|                   V|
 +--------------+---------+------------+--------------------+

[1] https://developer.arm.com/documentation/pjdoc466751330-9685/latest/
[2] https://developer.arm.com/documentation/PJDOC-466751330-18256/0001
[3] https://github.com/openjdk/jdk/blob/master/src/jdk.incubator.vector/share/classes/jdk/incubator/vector/IntVector.java#L2693
[4] https://www.felixcloutier.com/x86/pextrb:pextrd:pextrq

Change-Id: I90cfc1f8deb84145f42132d58d3b211c4a8933ad
This patch optimizes Vector.withLane for the 64- and 128-bit species.
For 64- and 128-bit vectors, the insert operation can be implemented
with ASIMD instructions for better performance. E.g., for
IntVector.SPECIES_128, "IntVector.withLane(0, (int)4)" generates code
as below:

```
        Before:
        orr     w10, wzr, #0x4
        index   z17.s, #-16, #1
        cmpeq   p0.s, p7/z, z17.s, #-16
        mov     z17.d, z16.d
        mov     z17.s, p0/m, w10

        After:
        orr     w10, wzr, #0x4
        mov     v16.s[0], w10
```

This patch also makes a small enhancement for vectors whose sizes are
greater than 128 bits: it can save one "DUP" if the target index is
smaller than 32. E.g., for ByteVector.SPECIES_512,
"ByteVector.withLane(0, (byte)4)" generates code as below:

```
        Before:
        index   z18.b, #0, #1
        mov     z17.b, #0
        cmpeq   p0.b, p7/z, z18.b, z17.b
        mov     z17.d, z16.d
        mov     z17.b, p0/m, w16

        After:
        index   z17.b, #-16, #1
        cmpeq   p0.b, p7/z, z17.b, #-16
        mov     z17.d, z16.d
        mov     z17.b, p0/m, w16
```
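
At the Java level the entry point is Vector.withLane. A minimal sketch
(hypothetical demo class, not part of this patch):

```
import jdk.incubator.vector.IntVector;
import jdk.incubator.vector.VectorSpecies;

public class WithLaneDemo {
    static final VectorSpecies<Integer> SPECIES = IntVector.SPECIES_128;

    public static void main(String[] args) {
        IntVector v = IntVector.zero(SPECIES);
        // On a 128-bit species this now compiles to a single NEON
        // "mov v.s[0], w" instead of the SVE index/cmpeq/mov sequence.
        IntVector r = v.withLane(0, 4);
        System.out.println(r.lane(0));  // prints 4
    }
}
```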

Change-Id: I700a28bc2fc15b6baca03b8d8574bb17992bf4a7
This patch speeds up add/mul/min/max reductions for the 64- and 128-bit
species on N2 machines.

According to the Neoverse N2 software optimization guide[1], ASIMD
reduction instructions are faster than SVE's for a 128-bit vector size.
This patch adds rules to distinguish the 64-bit and 128-bit vector
sizes, so that these two special cases generate the same code as NEON.
E.g., for ByteVector.SPECIES_128,
"ByteVector.reduceLanes(VectorOperators.ADD)" generates code as below:

```
        Before:
        orr     x8, xzr, #0x10
        whilelo p0.b, xzr, x8
        uaddv   d17, p0, z16.b
        smov    x15, v17.b[0]
        add     w15, w14, w15, sxtb

        After:
        addv    b17, v16.16b
        smov    x12, v17.b[0]
        add     w12, w12, w16, sxtb
```

Performance improves by 60% ~ 100% on my test machine.
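
For reference, a minimal Java sketch (hypothetical demo class) of a
reduction that takes the new path:

```
import jdk.incubator.vector.ByteVector;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

public class ReduceDemo {
    static final VectorSpecies<Byte> SPECIES = ByteVector.SPECIES_128;

    public static void main(String[] args) {
        byte[] data = new byte[SPECIES.length()];
        for (int i = 0; i < data.length; i++) data[i] = 1;
        ByteVector v = ByteVector.fromArray(SPECIES, data, 0);
        // An ADD reduction on a 128-bit byte vector now lowers to NEON ADDV.
        byte sum = v.reduceLanes(VectorOperators.ADD);
        System.out.println(sum);  // prints 16
    }
}
```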

[1] https://developer.arm.com/documentation/PJDOC-466751330-18256/0001

Change-Id: Id4637cd4b0b7948864780eafa84150787697e4df
This patch uses the PTRUE instruction to create predicate registers for
partial vector operations, which is more efficient than the WHILELO
instruction according to the software optimization guides of N2[1] and
V1[2].

[1] https://developer.arm.com/documentation/PJDOC-466751330-18256/0001
[2] https://developer.arm.com/documentation/pjdoc466751330-9685/latest/
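
A partial predicate arises, for example, when the Java species is
narrower than the hardware SVE register. A minimal sketch (hypothetical
demo class, assuming SVE registers wider than 128 bits):

```
import jdk.incubator.vector.IntVector;
import jdk.incubator.vector.VectorSpecies;

public class PartialVectorDemo {
    // On hardware with SVE registers wider than 128 bits, operations on
    // SPECIES_128 need a governing predicate covering only the low 128
    // bits; this patch materializes it with PTRUE instead of WHILELO.
    static final VectorSpecies<Integer> SPECIES = IntVector.SPECIES_128;

    public static void main(String[] args) {
        int[] a = {1, 2, 3, 4};
        IntVector v = IntVector.fromArray(SPECIES, a, 0).add(1);
        System.out.println(v.lane(3));  // prints 5
    }
}
```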

Change-Id: I9be5aa82ab567e19c62c698dc9c4d852efd2f607
e1iu closed this Mar 2, 2022
e1iu pushed a commit that referenced this pull request Mar 24, 2022
…or 64/128-bit vector sizes

This patch optimizes the SVE backend implementations of Vector.lane and
Vector.withLane for 64/128-bit vector sizes. The basic idea is to use
lower-cost NEON instructions when the vector size is 64/128 bits.

1. Vector.lane(int i) (Gets the lane element at lane index i)

As SVE doesn't have direct instruction support for extraction, unlike
"pextr"[1] in x86, the final code was as below:

```
        Byte512Vector.lane(7)

        orr     x8, xzr, #0x7
        whilele p0.b, xzr, x8
        lastb   w10, p0, z16.b
        sxtb    w10, w10
```

This patch uses a NEON instruction instead if the target lane is
located within the NEON 128-bit range. For the same example above, the
generated code is now much simpler:

```
        smov    x11, v16.b[7]
```

For cases where the target lane is located outside the NEON 128-bit
range, this patch uses EXT to shift the target to the lowest lane. The
generated code is as below:

```
        Byte512Vector.lane(63)

        mov     z17.d, z16.d
        ext     z17.b, z17.b, z17.b, #63
        smov    x10, v17.b[0]
```

2. Vector.withLane(int i, E e) (Replaces the lane element of this vector
                                at lane index i with value e)

For 64/128-bit vectors, the insert operation can be implemented with
NEON instructions to get better performance. E.g., for
IntVector.SPECIES_128, "IntVector.withLane(0, (int)4)" generates code
as below:

```
        Before:
        orr     w10, wzr, #0x4
        index   z17.s, #-16, #1
        cmpeq   p0.s, p7/z, z17.s, #-16
        mov     z17.d, z16.d
        mov     z17.s, p0/m, w10

        After:
        orr     w10, wzr, #0x4
        mov     v16.s[0], w10
```

This patch also makes a small enhancement for vectors whose sizes are
greater than 128 bits: it can save one "DUP" if the target index is
smaller than 32. E.g., for ByteVector.SPECIES_512,
"ByteVector.withLane(0, (byte)4)" generates code as below:

```
        Before:
        index   z18.b, #0, #1
        mov     z17.b, #0
        cmpeq   p0.b, p7/z, z18.b, z17.b
        mov     z17.d, z16.d
        mov     z17.b, p0/m, w16

        After:
        index   z17.b, #-16, #1
        cmpeq   p0.b, p7/z, z17.b, #-16
        mov     z17.d, z16.d
        mov     z17.b, p0/m, w16
```

With this patch, we see up to a 200% performance gain on specific
vector microbenchmarks on my SVE test system.
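
The numbers come from vector microbenchmarks; the following JMH-style
sketch only illustrates the kind of measurement involved (class and
method names are hypothetical, not the actual JDK micros):

```
import jdk.incubator.vector.ByteVector;
import jdk.incubator.vector.VectorSpecies;
import org.openjdk.jmh.annotations.*;

@State(Scope.Thread)
public class LaneBench {
    static final VectorSpecies<Byte> SPECIES = ByteVector.SPECIES_512;
    ByteVector v;

    @Setup
    public void setup() {
        byte[] data = new byte[SPECIES.length()];
        for (int i = 0; i < data.length; i++) data[i] = (byte) i;
        v = ByteVector.fromArray(SPECIES, data, 0);
    }

    @Benchmark
    public byte lane() { return v.lane(7); }  // extract path

    @Benchmark
    public ByteVector withLane() { return v.withLane(0, (byte) 4); }  // insert path
}
```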

[TEST]
test/jdk/jdk/incubator/vector, test/hotspot/jtreg/compiler/vectorapi
passed without failure.

[1] https://www.felixcloutier.com/x86/pextrb:pextrd:pextrq

Change-Id: Ic2a48f852011978d0f252db040371431a339d73c
e1iu pushed a commit that referenced this pull request Mar 29, 2022
This patch optimizes the backend implementation of VectorMaskToLong for
AArch64, providing a more efficient way to move mask bits from a
predicate register to a general-purpose register, as x86 PMOVMSKB[1]
does, by using BEXT[2], which is available in SVE2.

With this patch, the final code (input mask is of byte type with
SPECIES_512, generated on a QEMU emulator with a 512-bit SVE vector
register size) changes as below:

Before:

        mov     z16.b, p0/z, #1
        fmov    x0, d16
        orr     x0, x0, x0, lsr #7
        orr     x0, x0, x0, lsr #14
        orr     x0, x0, x0, lsr #28
        and     x0, x0, #0xff
        fmov    x8, v16.d[1]
        orr     x8, x8, x8, lsr #7
        orr     x8, x8, x8, lsr #14
        orr     x8, x8, x8, lsr #28
        and     x8, x8, #0xff
        orr     x0, x0, x8, lsl #8

        orr     x8, xzr, #0x2
        whilele p1.d, xzr, x8
        lastb   x8, p1, z16.d
        orr     x8, x8, x8, lsr #7
        orr     x8, x8, x8, lsr #14
        orr     x8, x8, x8, lsr #28
        and     x8, x8, #0xff
        orr     x0, x0, x8, lsl #16

        orr     x8, xzr, #0x3
        whilele p1.d, xzr, x8
        lastb   x8, p1, z16.d
        orr     x8, x8, x8, lsr #7
        orr     x8, x8, x8, lsr #14
        orr     x8, x8, x8, lsr #28
        and     x8, x8, #0xff
        orr     x0, x0, x8, lsl #24

        orr     x8, xzr, #0x4
        whilele p1.d, xzr, x8
        lastb   x8, p1, z16.d
        orr     x8, x8, x8, lsr #7
        orr     x8, x8, x8, lsr #14
        orr     x8, x8, x8, lsr #28
        and     x8, x8, #0xff
        orr     x0, x0, x8, lsl #32

        mov     x8, #0x5
        whilele p1.d, xzr, x8
        lastb   x8, p1, z16.d
        orr     x8, x8, x8, lsr #7
        orr     x8, x8, x8, lsr #14
        orr     x8, x8, x8, lsr #28
        and     x8, x8, #0xff
        orr     x0, x0, x8, lsl #40

        orr     x8, xzr, #0x6
        whilele p1.d, xzr, x8
        lastb   x8, p1, z16.d
        orr     x8, x8, x8, lsr #7
        orr     x8, x8, x8, lsr #14
        orr     x8, x8, x8, lsr #28
        and     x8, x8, #0xff
        orr     x0, x0, x8, lsl #48

        orr     x8, xzr, #0x7
        whilele p1.d, xzr, x8
        lastb   x8, p1, z16.d
        orr     x8, x8, x8, lsr #7
        orr     x8, x8, x8, lsr #14
        orr     x8, x8, x8, lsr #28
        and     x8, x8, #0xff
        orr     x0, x0, x8, lsl #56

After:

        mov     z16.b, p0/z, #1
        mov     z17.b, #1
        bext    z16.d, z16.d, z17.d
        mov     z17.d, #0
        uzp1    z16.s, z16.s, z17.s
        uzp1    z16.h, z16.h, z17.h
        uzp1    z16.b, z16.b, z17.b
        mov     x0, v16.d[0]

[1] https://www.felixcloutier.com/x86/pmovmskb
[2] https://developer.arm.com/documentation/ddi0602/2020-12/SVE-Instructions/BEXT--Gather-lower-bits-from-positions-selected-by-bitmask-
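
At the Java level, the optimized node backs VectorMask.toLong. A
minimal sketch (hypothetical demo class; the BEXT path additionally
requires SVE2 BitPerm hardware):

```
import jdk.incubator.vector.ByteVector;
import jdk.incubator.vector.VectorMask;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

public class MaskToLongDemo {
    static final VectorSpecies<Byte> SPECIES = ByteVector.SPECIES_512;

    public static void main(String[] args) {
        byte[] data = new byte[SPECIES.length()];
        data[0] = data[3] = 1;
        ByteVector v = ByteVector.fromArray(SPECIES, data, 0);
        // The comparison yields a predicate; toLong() is the call that
        // now lowers to the BEXT-based sequence above.
        VectorMask<Byte> m = v.compare(VectorOperators.NE, (byte) 0);
        System.out.println(Long.toBinaryString(m.toLong()));  // prints 1001
    }
}
```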

Change-Id: Ia983a20c89f76403e557ac21328f2f2e05dd08e0
e1iu pushed a commit that referenced this pull request Mar 29, 2022
…th SVE2

This patch implements the AArch64 codegen for VectorLongToMask using
the SVE2 BitPerm feature. With this patch, the final code (generated on
a QEMU emulator with a 512-bit SVE vector register size) is shown below:

        mov     z17.b, #0
        mov     v17.d[0], x13
        sunpklo z17.h, z17.b
        sunpklo z17.s, z17.h
        sunpklo z17.d, z17.s
        mov     z16.b, #1
        bdep    z17.d, z17.d, z16.d
        cmpne   p0.b, p7/z, z17.b, #0
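
At the Java level, this codegen backs VectorMask.fromLong. A minimal
sketch (hypothetical demo class):

```
import jdk.incubator.vector.ByteVector;
import jdk.incubator.vector.VectorMask;
import jdk.incubator.vector.VectorSpecies;

public class LongToMaskDemo {
    static final VectorSpecies<Byte> SPECIES = ByteVector.SPECIES_512;

    public static void main(String[] args) {
        // fromLong is lowered to the BDEP sequence above on SVE2 hardware.
        VectorMask<Byte> m = VectorMask.fromLong(SPECIES, 0b1011L);
        System.out.println(m.trueCount());  // prints 3
    }
}
```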

Change-Id: Ia83e80bbd879f86fef5dd607e44c530f2ce143d0
e1iu pushed a commit that referenced this pull request Apr 21, 2022
e1iu pushed a commit that referenced this pull request May 19, 2022
…th SVE2

Change-Id: I9135fce39c8a08c72b757c78b258f5d968baa7ff
e1iu deleted the vectorapi-n2 branch July 10, 2023 09:53