Internal review for the performance of VectorAPI on N2 #1
Closed
This patch aims to optimize the extract operation on vectors for AArch64
according to the Neoverse N2 and V1 software optimization guides[1][2].
Currently, the extract operation is used by "Vector.lane"[3]. As SVE
doesn't have direct instruction support for such an operation, unlike
"pextr"[4] in x86, the generated code is as below:
```
Byte512Vector.lane(7)
orr x8, xzr, #0x7
whilele p0.b, xzr, x8
lastb w10, p0, z16.b
sxtb w10, w10
```
This patch uses a NEON instruction instead when the target lane lies
within the NEON 128-bit range. For the same example above, the generated
code is much simpler:
```
smov x11, v16.b[7]
```
For cases where the target lane lies outside the NEON 128-bit range,
this patch uses EXT to shift the target lane to the lowest position. The
generated code is as below:
```
Byte512Vector.lane(63)
mov z17.d, z16.d
ext z17.b, z17.b, z17.b, #63
smov x10, v17.b[0]
```
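As a point of reference, here is a minimal Java sketch (illustrative only, not taken from the patch or its tests) of the Vector API usage that exercises this extract path; it requires --add-modules jdk.incubator.vector:
```
import jdk.incubator.vector.ByteVector;
import jdk.incubator.vector.VectorSpecies;

// Hypothetical example: reading lanes from a 512-bit byte vector, the pattern
// that compiles down to the extract operation discussed above.
public class LaneExtractExample {
    static final VectorSpecies<Byte> SPECIES = ByteVector.SPECIES_512;

    public static void main(String[] args) {
        byte[] data = new byte[SPECIES.length()];
        for (int i = 0; i < data.length; i++) {
            data[i] = (byte) i;
        }
        ByteVector v = ByteVector.fromArray(SPECIES, data, 0);
        // Lane 7 lies within the low 128 bits, so the patched backend can use a
        // single NEON smov; lane 63 lies beyond 128 bits and needs EXT first.
        byte low = v.lane(7);
        byte high = v.lane(63);
        System.out.println(low + " " + high);
    }
}
```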
Tests: hotspot/compiler/vectorapi and jdk/incubator/vector passed on an SVE machine.
Refs:
// From the Arm Neoverse N2 and V1 Software Optimization Guides
// NOTE: The data inside "()" belongs to V1

| Instruction    | Latency | Throughput | Utilized Pipelines |
|----------------|---------|------------|--------------------|
| WHILELE        | 3       | 1 (1/2)    | M (M0)             |
| LASTB (scalar) | 5 (6)   | 1          | V1, M0             |
| EXT            | 2       | 2          | V (V01)            |
| UMOV, SMOV     | 2       | 1          | V                  |
| ORR            | 2       | 2          | V (V01)            |
| INS            | 2       | 2 (4)      | V                  |
[1] https://developer.arm.com/documentation/pjdoc466751330-9685/latest/
[2] https://developer.arm.com/documentation/PJDOC-466751330-18256/0001
[3] https://github.com/openjdk/jdk/blob/master/src/jdk.incubator.vector/share/classes/jdk/incubator/vector/IntVector.java#L2693
[4] https://www.felixcloutier.com/x86/pextrb:pextrd:pextrq
Change-Id: I90cfc1f8deb84145f42132d58d3b211c4a8933ad
This patch optimizes Vector.withLane for the 64- and 128-bit species. For
64- and 128-bit vectors, the insert operation can be implemented with ASIMD
instructions for better performance. E.g., for IntVector.SPECIES_128,
"IntVector.withLane(0, (int)4)" generates code as below:
```
Before:
orr w10, wzr, #0x4
index z17.s, #-16, #1
cmpeq p0.s, p7/z, z17.s, #-16
mov z17.d, z16.d
mov z17.s, p0/m, w10
After:
orr w10, wzr, #0x4
mov v16.s[0], w10
```
This patch also makes a small enhancement for vectors whose size is
greater than 128 bits: it can save one "DUP" if the target index is
smaller than 32. E.g., for ByteVector.SPECIES_512,
"ByteVector.withLane(0, (byte)4)" generates code as below:
```
Before:
index z18.b, #0, #1
mov z17.b, #0
cmpeq p0.b, p7/z, z18.b, z17.b
mov z17.d, z16.d
mov z17.b, p0/m, w16
After:
index z17.b, #-16, #1
cmpeq p0.b, p7/z, z17.b, #-16
mov z17.d, z16.d
mov z17.b, p0/m, w16
```
Change-Id: I700a28bc2fc15b6baca03b8d8574bb17992bf4a7
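For context, a minimal Java sketch (illustrative only, not part of the patch) of the withLane call discussed above; it requires --add-modules jdk.incubator.vector:
```
import jdk.incubator.vector.IntVector;
import jdk.incubator.vector.VectorSpecies;

// Hypothetical example: inserting a value into lane 0 of a 128-bit int vector.
public class WithLaneExample {
    static final VectorSpecies<Integer> SPECIES = IntVector.SPECIES_128;

    public static void main(String[] args) {
        IntVector v = IntVector.zero(SPECIES);
        // For a 128-bit species the patched backend can lower this insert to a
        // single NEON "mov v.s[0], w" instead of an SVE predicated move.
        IntVector w = v.withLane(0, 4);
        System.out.println(w.lane(0)); // prints 4
    }
}
```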
This patch speeds up add/mul/min/max reductions for the 64- and 128-bit
species on N2 machines.
According to the Neoverse N2 software optimization guide[1], ASIMD
reduction instructions are faster than SVE's for a 128-bit vector size.
This patch adds rules to distinguish the 64-bit and 128-bit vector sizes,
so that for these two special cases the generated code is the same as
NEON's. E.g., for ByteVector.SPECIES_128,
"ByteVector.reduceLanes(VectorOperators.ADD)" generates code as below:
```
Before:
orr x8, xzr, #0x10
whilelo p0.b, xzr, x8
uaddv d17, p0, z16.b
smov x15, v17.b[0]
add w15, w14, w15, sxtb
After:
addv b17, v16.16b
smov x12, v17.b[0]
add w12, w12, w16, sxtb
```
The performance improves by 60% ~ 100% on my test machine.
[1] https://developer.arm.com/documentation/PJDOC-466751330-18256/0001
Change-Id: Id4637cd4b0b7948864780eafa84150787697e4df
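For context, a minimal Java sketch (illustrative only, not one of the benchmarks behind the numbers above) of the reduction pattern this commit targets; it requires --add-modules jdk.incubator.vector:
```
import jdk.incubator.vector.ByteVector;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

// Hypothetical example: an ADD reduction over a 128-bit byte vector, which the
// patched backend can lower to NEON addv instead of SVE uaddv with a predicate.
public class ReduceLanesExample {
    static final VectorSpecies<Byte> SPECIES = ByteVector.SPECIES_128;

    public static void main(String[] args) {
        byte[] data = new byte[SPECIES.length()];
        for (int i = 0; i < data.length; i++) {
            data[i] = 1;
        }
        ByteVector v = ByteVector.fromArray(SPECIES, data, 0);
        byte sum = v.reduceLanes(VectorOperators.ADD);
        System.out.println(sum); // prints 16
    }
}
```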
This patch uses the PTRUE instruction to create predicate registers for
partial vector operations, which is more efficient than the WHILELO
instruction according to the software optimization guides of N2[1] and V1[2].
[1] https://developer.arm.com/documentation/PJDOC-466751330-18256/0001
[2] https://developer.arm.com/documentation/pjdoc466751330-9685/latest/
Change-Id: I9be5aa82ab567e19c62c698dc9c4d852efd2f607
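A minimal Java sketch of where such partial predicates arise, assuming the code runs on an SVE machine whose vector registers are wider than the fixed 128-bit species used here (illustrative only; requires --add-modules jdk.incubator.vector):
```
import java.util.Arrays;
import jdk.incubator.vector.FloatVector;
import jdk.incubator.vector.VectorSpecies;

// Hypothetical example: operating on a fixed 128-bit species. On wider SVE
// hardware the backend needs a partial predicate for these operations, which
// the patch now materializes with PTRUE instead of WHILELO.
public class PartialSpeciesExample {
    static final VectorSpecies<Float> SPECIES = FloatVector.SPECIES_128;

    public static void main(String[] args) {
        float[] a = {1f, 2f, 3f, 4f};
        float[] b = {5f, 6f, 7f, 8f};
        float[] c = new float[SPECIES.length()];
        FloatVector va = FloatVector.fromArray(SPECIES, a, 0);
        FloatVector vb = FloatVector.fromArray(SPECIES, b, 0);
        va.add(vb).intoArray(c, 0);
        System.out.println(Arrays.toString(c));
    }
}
```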
e1iu pushed a commit that referenced this pull request on Mar 24, 2022:
…or 64/128-bit vector sizes
This patch optimizes the SVE backend implementations of Vector.lane and
Vector.withLane for 64/128-bit vector sizes. The basic idea is to use
lower-cost NEON instructions when the vector size is 64 or 128 bits.
1. Vector.lane(int i) (Gets the lane element at lane index i)
As SVE doesn't have direct instruction support for extraction, unlike
"pextr"[1] in x86, the generated code was as below:
```
Byte512Vector.lane(7)
orr x8, xzr, #0x7
whilele p0.b, xzr, x8
lastb w10, p0, z16.b
sxtb w10, w10
```
This patch uses a NEON instruction instead when the target lane lies
within the NEON 128-bit range. For the same example above, the generated
code is now much simpler:
```
smov x11, v16.b[7]
```
For cases where the target lane lies outside the NEON 128-bit range,
this patch uses EXT to shift the target lane to the lowest position. The
generated code is as below:
```
Byte512Vector.lane(63)
mov z17.d, z16.d
ext z17.b, z17.b, z17.b, #63
smov x10, v17.b[0]
```
2. Vector.withLane(int i, E e) (Replaces the lane element of this vector
at lane index i with value e)
For 64/128-bit vectors, the insert operation can be implemented with NEON
instructions for better performance. E.g., for IntVector.SPECIES_128,
"IntVector.withLane(0, (int)4)" generates code as below:
```
Before:
orr w10, wzr, #0x4
index z17.s, #-16, #1
cmpeq p0.s, p7/z, z17.s, #-16
mov z17.d, z16.d
mov z17.s, p0/m, w10
After:
orr w10, wzr, #0x4
mov v16.s[0], w10
```
This patch also makes a small enhancement for vectors whose sizes are
greater than 128 bits: it can save one "DUP" if the target index is
smaller than 32. E.g., for ByteVector.SPECIES_512,
"ByteVector.withLane(0, (byte)4)" generates code as below:
```
Before:
index z18.b, #0, #1
mov z17.b, #0
cmpeq p0.b, p7/z, z18.b, z17.b
mov z17.d, z16.d
mov z17.b, p0/m, w16
After:
index z17.b, #-16, #1
cmpeq p0.b, p7/z, z17.b, #-16
mov z17.d, z16.d
mov z17.b, p0/m, w16
```
With this patch, we can see up to a 200% performance gain for specific
vector microbenchmarks on my SVE test system.
[TEST]
test/jdk/jdk/incubator/vector and test/hotspot/jtreg/compiler/vectorapi
passed without failure.
[1] https://www.felixcloutier.com/x86/pextrb:pextrd:pextrq
Change-Id: Ic2a48f852011978d0f252db040371431a339d73c
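A minimal JMH-style sketch of the kind of microbenchmark that would exercise these paths (hypothetical names and setup; not the actual benchmarks behind the quoted numbers):
```
import jdk.incubator.vector.ByteVector;
import jdk.incubator.vector.VectorSpecies;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;

// Hypothetical benchmark measuring the lane/withLane paths on a 128-bit species.
@State(Scope.Thread)
public class LaneBench {
    static final VectorSpecies<Byte> SPECIES = ByteVector.SPECIES_128;
    byte[] data = new byte[SPECIES.length()];
    ByteVector v;

    @Setup
    public void setup() {
        for (int i = 0; i < data.length; i++) {
            data[i] = (byte) i;
        }
        v = ByteVector.fromArray(SPECIES, data, 0);
    }

    @Benchmark
    public byte lane() {
        return v.lane(7);
    }

    @Benchmark
    public ByteVector withLane() {
        return v.withLane(0, (byte) 4);
    }
}
```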
e1iu pushed a commit that referenced this pull request on Mar 29, 2022:
This patch optimizes the backend implementation of VectorMaskToLong for
AArch64 by using BEXT[2], which is available in SVE2, providing a more
efficient way to move mask bits from a predicate register to a
general-purpose register, as x86 PMOVMSK[1] does.
With this patch, the final code (input mask is byte type with
SPECIES_512, generated on a QEMU emulator with a 512-bit SVE vector
register size) changes as below:
Before:
mov z16.b, p0/z, #1
fmov x0, d16
orr x0, x0, x0, lsr #7
orr x0, x0, x0, lsr #14
orr x0, x0, x0, lsr #28
and x0, x0, #0xff
fmov x8, v16.d[1]
orr x8, x8, x8, lsr #7
orr x8, x8, x8, lsr #14
orr x8, x8, x8, lsr #28
and x8, x8, #0xff
orr x0, x0, x8, lsl #8
orr x8, xzr, #0x2
whilele p1.d, xzr, x8
lastb x8, p1, z16.d
orr x8, x8, x8, lsr #7
orr x8, x8, x8, lsr #14
orr x8, x8, x8, lsr #28
and x8, x8, #0xff
orr x0, x0, x8, lsl #16
orr x8, xzr, #0x3
whilele p1.d, xzr, x8
lastb x8, p1, z16.d
orr x8, x8, x8, lsr #7
orr x8, x8, x8, lsr #14
orr x8, x8, x8, lsr #28
and x8, x8, #0xff
orr x0, x0, x8, lsl #24
orr x8, xzr, #0x4
whilele p1.d, xzr, x8
lastb x8, p1, z16.d
orr x8, x8, x8, lsr #7
orr x8, x8, x8, lsr #14
orr x8, x8, x8, lsr #28
and x8, x8, #0xff
orr x0, x0, x8, lsl #32
mov x8, #0x5
whilele p1.d, xzr, x8
lastb x8, p1, z16.d
orr x8, x8, x8, lsr #7
orr x8, x8, x8, lsr #14
orr x8, x8, x8, lsr #28
and x8, x8, #0xff
orr x0, x0, x8, lsl #40
orr x8, xzr, #0x6
whilele p1.d, xzr, x8
lastb x8, p1, z16.d
orr x8, x8, x8, lsr #7
orr x8, x8, x8, lsr #14
orr x8, x8, x8, lsr #28
and x8, x8, #0xff
orr x0, x0, x8, lsl #48
orr x8, xzr, #0x7
whilele p1.d, xzr, x8
lastb x8, p1, z16.d
orr x8, x8, x8, lsr #7
orr x8, x8, x8, lsr #14
orr x8, x8, x8, lsr #28
and x8, x8, #0xff
orr x0, x0, x8, lsl #56
After:
mov z16.b, p0/z, #1
mov z17.b, #1
bext z16.d, z16.d, z17.d
mov z17.d, #0
uzp1 z16.s, z16.s, z17.s
uzp1 z16.h, z16.h, z17.h
uzp1 z16.b, z16.b, z17.b
mov x0, v16.d[0]
[1] https://www.felixcloutier.com/x86/pmovmskb
[2] https://developer.arm.com/documentation/ddi0602/2020-12/SVE-Instructions/BEXT--Gather-lower-bits-from-positions-selected-by-bitmask-
Change-Id: Ia983a20c89f76403e557ac21328f2f2e05dd08e0
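For context, a minimal Java sketch (illustrative only) of the mask-to-long conversion this commit speeds up; it requires --add-modules jdk.incubator.vector:
```
import jdk.incubator.vector.ByteVector;
import jdk.incubator.vector.VectorMask;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

// Hypothetical example: collapsing a 64-lane byte mask to a long bitmap, which
// the patched backend lowers with SVE2 BEXT instead of a per-lane sequence.
public class MaskToLongExample {
    static final VectorSpecies<Byte> SPECIES = ByteVector.SPECIES_512;

    public static void main(String[] args) {
        byte[] data = new byte[SPECIES.length()];
        for (int i = 0; i < data.length; i++) {
            data[i] = (byte) (i % 2);
        }
        ByteVector v = ByteVector.fromArray(SPECIES, data, 0);
        VectorMask<Byte> mask = v.compare(VectorOperators.NE, (byte) 0);
        long bits = mask.toLong();
        System.out.println(Long.toBinaryString(bits));
    }
}
```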
e1iu pushed a commit that referenced this pull request on Mar 29, 2022:
…th SVE2
This patch implements AArch64 codegen for VectorLongToMask using the
SVE2 BitPerm feature. With this patch, the final code (generated on a
QEMU emulator with a 512-bit SVE vector register size) is as below:
mov z17.b, #0
mov v17.d[0], x13
sunpklo z17.h, z17.b
sunpklo z17.s, z17.h
sunpklo z17.d, z17.s
mov z16.b, #1
bdep z17.d, z17.d, z16.d
cmpne p0.b, p7/z, z17.b, #0
Change-Id: Ia83e80bbd879f86fef5dd607e44c530f2ce143d0
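For context, a minimal Java sketch (illustrative only) of the long-to-mask conversion this commit implements with SVE2 BDEP; it requires --add-modules jdk.incubator.vector:
```
import jdk.incubator.vector.ByteVector;
import jdk.incubator.vector.VectorMask;
import jdk.incubator.vector.VectorSpecies;

// Hypothetical example: building a 64-lane byte mask from a long bitmap.
public class MaskFromLongExample {
    static final VectorSpecies<Byte> SPECIES = ByteVector.SPECIES_512;

    public static void main(String[] args) {
        // Set every other lane of the mask from the bit pattern 0101...
        VectorMask<Byte> mask = VectorMask.fromLong(SPECIES, 0x5555555555555555L);
        System.out.println(mask.trueCount()); // prints 32
    }
}
```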