Add intrinsic for SpanHelpers.Char.IndexOfAny on AArch64 #73788
Tagging subscribers to this area: @dotnet/area-system-memory
// So the bit position in 'matches' corresponds to the element offset.
if (matches == 0)
combinedVector = (Vector128.Equals(values0, search) | Vector128.Equals(values1, search)).AsByte();
if (!VectorContainsMatch(combinedVector))
Does this need a helper like this? Other methods appear to achieve the same thing by using e.g. combinedVector.AsByte().ExtractMostSignificantBits() == 0.
Is that not feasible here, or does it not perform well? For example:
runtime/src/libraries/System.Private.CoreLib/src/System/SpanHelpers.Char.cs
Lines 655 to 656 in 3e0a5ad
uint matches = Vector128.Equals(values, search).AsByte().ExtractMostSignificantBits();
if (matches == 0)
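For reference, the existing pattern quoted above can be sketched in isolation roughly as follows. This is a minimal illustration, assuming .NET 7 or later; the class name MsbMaskSketch and the method IndexInBlock are hypothetical and do not appear in SpanHelpers.Char.cs, and only a single 8-element block is handled rather than the full search loop.

```csharp
using System;
using System.Numerics;
using System.Runtime.Intrinsics;

public static class MsbMaskSketch
{
    // Hypothetical helper: index of the first element equal to 'value' within
    // the first 8 elements of 'block', or -1 if none of them matches.
    public static int IndexInBlock(ReadOnlySpan<ushort> block, ushort value)
    {
        if (block.Length < Vector128<ushort>.Count)
            throw new ArgumentException("Need at least 8 elements.", nameof(block));

        Vector128<ushort> search = Vector128.Create(block);  // data being searched
        Vector128<ushort> values = Vector128.Create(value);  // broadcast needle

        // A matching ushort lane becomes 0xFFFF, i.e. two 0xFF bytes, so the
        // byte-wise mask carries two set bits per matching element.
        uint matches = Vector128.Equals(values, search).AsByte().ExtractMostSignificantBits();
        if (matches == 0)
            return -1;

        // Two mask bits per char: halve the bit position to get the element offset.
        return BitOperations.TrailingZeroCount(matches) / sizeof(char);
    }
}
```

Calling MsbMaskSketch.IndexInBlock with the block { 1, 2, 3, 4, 5, 6, 7, 8 } and the value 4 returns 3.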
The helper emits a shorter instruction sequence for detecting a match, which results in higher performance.
with VectorContainsMatch:
...
umaxp v19.16b, v18.16b, v18.16b
umov x7, v19.d[0]
cbnz x7, G_M000_IG17
...
with ExtractMostSignificantBits:
...
ldr q18, [@RWD00]
and v18.16b, v16.16b, v18.16b
ldr q17, [@RWD16]
ushl v16.16b, v18.16b, v17.16b
movi v17.4s, #0x00
ext v17.16b, v16.16b, v17.16b, #8
addv b17, v17.8b
umov w0, v17.b[0]
lsl w0, w0, #8
addv b16, v16.8b
umov w1, v16.b[0]
orr w1, w0, w1
cbz w1, G_M000_IG08
...
RWD00 dq 8080808080808080h, 8080808080808080h
RWD16 dq 00FFFEFDFCFBFAF9h, 00FFFEFDFCFBFAF9h
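The helper's body is not shown in this thread, so the following is only a sketch of a helper that would produce the umaxp / umov sequence above, assuming .NET 7 or later. The class name ContainsMatchSketch, the FirstMatchIndex method, and the non-Arm64 fallbacks are assumptions made for illustration; FirstMatchIndex mirrors the rbit / clz / shift-by-3 index computation visible in the full listing.

```csharp
using System.Numerics;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.Arm;

public static class ContainsMatchSketch
{
    // Assumed shape of the helper: true if any byte of 'vector' is non-zero.
    public static bool VectorContainsMatch(Vector128<byte> vector)
    {
        if (AdvSimd.Arm64.IsSupported)
        {
            // UMAXP takes the max of each adjacent byte pair; with the same
            // vector as both operands, the low 64 bits cover all 16 bytes.
            Vector128<byte> folded = AdvSimd.Arm64.MaxPairwise(vector, vector);
            return folded.AsUInt64().ToScalar() != 0;
        }

        // Fallback for other platforms (an assumption, not taken from the PR).
        return vector != Vector128<byte>.Zero;
    }

    // Element index of the first matching ushort lane. Call only after
    // VectorContainsMatch returned true for the same comparison result.
    public static int FirstMatchIndex(Vector128<ushort> comparison)
    {
        Vector128<byte> bytes = comparison.AsByte();
        if (AdvSimd.Arm64.IsSupported)
        {
            // After folding, each ushort lane owns one byte (8 bits) of the
            // low 64 bits, hence the shift by 3.
            ulong folded = AdvSimd.Arm64.MaxPairwise(bytes, bytes).AsUInt64().ToScalar();
            return BitOperations.TrailingZeroCount(folded) >> 3;
        }

        // Fallback: one mask bit per byte, two bytes per ushort lane.
        return BitOperations.TrailingZeroCount(bytes.ExtractMostSignificantBits()) >> 1;
    }
}
```

The Arm64 path needs no constant loads and only one vector-to-general-register move, which is consistent with the shorter listing; whether that fully explains the measured difference is not established here.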
On an Altra machine (not configured for benchmarking):
| Method | Job | Toolchain | Size | Mean | Error | StdDev | Median | Min | Max | Ratio | MannWhitney(2%) | Allocated | Alloc Ratio |
|---------------------- |----------- |---------------------------------------------------------------------------------------------------------- |----- |----------:|---------:|---------:|----------:|----------:|----------:|------:|---------------- |----------:|------------:|
| IndexOfAnyTwoValues | Job-YNXVVV | /Extract_MSB/bin/testhost/net7.0-Linux-Release-arm64/shared/Microsoft.NETCore.App/7.0.0/corerun | 512 | 94.40 ns | 0.073 ns | 0.068 ns | 94.42 ns | 94.19 ns | 94.46 ns | 1.62 | Slower | - | NA |
| IndexOfAnyTwoValues | Job-TMIMPY | /VectorContainsMatch/bin/testhost/net7.0-Linux-Release-arm64/shared/Microsoft.NETCore.App/7.0.0/corerun | 512 | 32.40 ns | 0.019 ns | 0.015 ns | 32.39 ns | 32.38 ns | 32.43 ns | 0.56 | Faster | - | NA |
| IndexOfAnyTwoValues | Job-EKBZGE | /unchecked_main/bin/testhost/net7.0-Linux-Release-arm64/shared/Microsoft.NETCore.App/7.0.0/corerun | 512 | 58.32 ns | 0.021 ns | 0.019 ns | 58.33 ns | 58.29 ns | 58.36 ns | 1.00 | Base | - | NA |
| | | | | | | | | | | | | | |
| IndexOfAnyThreeValues | Job-YNXVVV | /Extract_MSB/bin/testhost/net7.0-Linux-Release-arm64/shared/Microsoft.NETCore.App/7.0.0/corerun | 512 | 109.25 ns | 0.328 ns | 0.307 ns | 109.40 ns | 108.48 ns | 109.51 ns | 1.58 | Slower | - | NA |
| IndexOfAnyThreeValues | Job-TMIMPY | /VectorContainsMatch/bin/testhost/net7.0-Linux-Release-arm64/shared/Microsoft.NETCore.App/7.0.0/corerun | 512 | 43.68 ns | 0.028 ns | 0.026 ns | 43.68 ns | 43.64 ns | 43.74 ns | 0.63 | Faster | - | NA |
| IndexOfAnyThreeValues | Job-EKBZGE | /unchecked_main/bin/testhost/net7.0-Linux-Release-arm64/shared/Microsoft.NETCore.App/7.0.0/corerun | 512 | 69.29 ns | 0.086 ns | 0.080 ns | 69.30 ns | 69.13 ns | 69.40 ns | 1.00 | Base | - | NA |
| | | | | | | | | | | | | | |
| IndexOfAnyFourValues | Job-YNXVVV | /Extract_MSB/bin/testhost/net7.0-Linux-Release-arm64/shared/Microsoft.NETCore.App/7.0.0/corerun | 512 | 129.95 ns | 0.024 ns | 0.022 ns | 129.94 ns | 129.92 ns | 129.99 ns | 1.53 | Slower | - | NA |
| IndexOfAnyFourValues | Job-TMIMPY | /VectorContainsMatch/bin/testhost/net7.0-Linux-Release-arm64/shared/Microsoft.NETCore.App/7.0.0/corerun | 512 | 57.56 ns | 0.066 ns | 0.062 ns | 57.58 ns | 57.49 ns | 57.64 ns | 0.68 | Faster | - | NA |
| IndexOfAnyFourValues | Job-EKBZGE | /unchecked_main/bin/testhost/net7.0-Linux-Release-arm64/shared/Microsoft.NETCore.App/7.0.0/corerun | 512 | 84.85 ns | 0.109 ns | 0.102 ns | 84.87 ns | 84.61 ns | 84.99 ns | 1.00 | Base | - | NA |
I need to confirm the benchmark numbers above on a better-configured system; @adamsitnik can probably help.
Full Assembly:
VectorContainsMatch
; Assembly listing for method SpanHelpers:IndexOfAny(byref,ushort,ushort,int):int
; Emitting BLENDED_CODE for generic ARM64 CPU - Unix
; optimized code
; fp based frame
; fully interruptible
; No PGO data
; 0 inlinees with PGO data; 6 single block inlinees; 4 inlinees without PGO data
G_M000_IG01: ;; offset=0000H
A9BF7BFD stp fp, lr, [sp,#-16]!
910003FD mov fp, sp
G_M000_IG02: ;; offset=0008H
AA1F03E4 mov x4, xzr
2A0303E5 mov w5, w3
93407C63 sxtw x3, w3
D1002063 sub x3, x3, #8
F100007F cmp x3, #0
540003AB blt G_M000_IG05
G_M000_IG03: ;; offset=0020H
AA0303E5 mov x5, x3
14000033 b G_M000_IG14
align [0 bytes for IG07]
align [0 bytes]
align [0 bytes]
align [0 bytes]
G_M000_IG04: ;; offset=0028H
D37FF883 lsl x3, x4, #1
8B030003 add x3, x0, x3
79400066 ldrh w6, [x3]
53003C27 uxth w7, w1
6B0600FF cmp w7, w6
54000580 beq G_M000_IG13
53003C48 uxth w8, w2
6B06011F cmp w8, w6
54000520 beq G_M000_IG13
79400466 ldrh w6, [x3,#2]
6B0600FF cmp w7, w6
54000480 beq G_M000_IG12
6B06011F cmp w8, w6
54000440 beq G_M000_IG12
79400866 ldrh w6, [x3,#4]
6B0600FF cmp w7, w6
540003A0 beq G_M000_IG11
6B06011F cmp w8, w6
54000360 beq G_M000_IG11
79400C66 ldrh w6, [x3,#6]
6B0600FF cmp w7, w6
540002C0 beq G_M000_IG10
6B06011F cmp w8, w6
54000280 beq G_M000_IG10
91001084 add x4, x4, #4
D10010A5 sub x5, x5, #4
G_M000_IG05: ;; offset=0090H
F10010BF cmp x5, #4
54FFFCA2 bhs G_M000_IG04
G_M000_IG06: ;; offset=0098H
B4000185 cbz x5, G_M000_IG08
53003C27 uxth w7, w1
G_M000_IG07: ;; offset=00A0H
D37FF886 lsl x6, x4, #1
78666806 ldrh w6, [x0, x6]
6B0600FF cmp w7, w6
54000200 beq G_M000_IG13
53003C48 uxth w8, w2
6B06011F cmp w8, w6
540001A0 beq G_M000_IG13
91000484 add x4, x4, #1
D10004A5 sub x5, x5, #1
B5FFFEE5 cbnz x5, G_M000_IG07
G_M000_IG08: ;; offset=00C8H
12800000 movn w0, #0
G_M000_IG09: ;; offset=00CCH
A8C17BFD ldp fp, lr, [sp],#16
D65F03C0 ret lr
G_M000_IG10: ;; offset=00D4H
11000C84 add w4, w4, #3
14000025 b G_M000_IG18
align [0 bytes for IG15]
align [0 bytes]
align [0 bytes]
align [0 bytes]
G_M000_IG11: ;; offset=00DCH
11000884 add w4, w4, #2
14000023 b G_M000_IG18
G_M000_IG12: ;; offset=00E4H
11000484 add w4, w4, #1
14000021 b G_M000_IG18
G_M000_IG13: ;; offset=00ECH
14000020 b G_M000_IG18
G_M000_IG14: ;; offset=00F0H
53003C27 uxth w7, w1
4E020CF0 dup v16.8h, w7
53003C48 uxth w8, w2
4E020D11 dup v17.8h, w8
B4000185 cbz x5, G_M000_IG16
G_M000_IG15: ;; offset=0104H
D37FF881 lsl x1, x4, #1
3CE16812 ldr q18, [x0, x1]
6E728E13 cmeq v19.8h, v16.8h, v18.8h
6E728E32 cmeq v18.8h, v17.8h, v18.8h
4EB21E72 orr v18.8h, v19.8h, v18.8h
6E32A653 umaxp v19.16b, v18.16b, v18.16b
4E083E67 umov x7, v19.d[0]
B50001A7 cbnz x7, G_M000_IG17
91002084 add x4, x4, #8
EB0400BF cmp x5, x4
54FFFEC8 bhi G_M000_IG15
G_M000_IG16: ;; offset=0130H
D37FF8A4 lsl x4, x5, #1
3CE46812 ldr q18, [x0, x4]
AA0503E4 mov x4, x5
6E728E10 cmeq v16.8h, v16.8h, v18.8h
6E728E31 cmeq v17.8h, v17.8h, v18.8h
4EB11E12 orr v18.8h, v16.8h, v17.8h
6E32A650 umaxp v16.16b, v18.16b, v18.16b
4E083E00 umov x0, v16.d[0]
B4FFFBC0 cbz x0, G_M000_IG08
G_M000_IG17: ;; offset=0154H
6E32A650 umaxp v16.16b, v18.16b, v18.16b
4E083E00 umov x0, v16.d[0]
DAC00000 rbit x0, x0
DAC01000 clz x0, x0
13037C00 asr w0, w0, #3
0B040004 add w4, w0, w4
G_M000_IG18: ;; offset=016CH
2A0403E0 mov w0, w4
G_M000_IG19: ;; offset=0170H
A8C17BFD ldp fp, lr, [sp],#16
D65F03C0 ret lr
; Total bytes of code 376
ExtractMostSignificantBits
; Assembly listing for method SpanHelpers:IndexOfAny(byref,ushort,ushort,int):int
; Emitting BLENDED_CODE for generic ARM64 CPU - Unix
; optimized code
; fp based frame
; fully interruptible
; No PGO data
; 0 inlinees with PGO data; 7 single block inlinees; 1 inlinees without PGO data
G_M000_IG01: ;; offset=0000H
A9BF7BFD stp fp, lr, [sp,#-16]!
910003FD mov fp, sp
G_M000_IG02: ;; offset=0008H
AA1F03E4 mov x4, xzr
2A0303E5 mov w5, w3
93407C63 sxtw x3, w3
D1002063 sub x3, x3, #8
F100007F cmp x3, #0
540003AB blt G_M000_IG05
G_M000_IG03: ;; offset=0020H
AA0303E5 mov x5, x3
14000033 b G_M000_IG14
align [0 bytes for IG07]
align [0 bytes]
align [0 bytes]
align [0 bytes]
G_M000_IG04: ;; offset=0028H
D37FF883 lsl x3, x4, #1
8B030003 add x3, x0, x3
79400066 ldrh w6, [x3]
53003C27 uxth w7, w1
6B0600FF cmp w7, w6
54000580 beq G_M000_IG13
53003C48 uxth w8, w2
6B06011F cmp w8, w6
54000520 beq G_M000_IG13
79400466 ldrh w6, [x3,#2]
6B0600FF cmp w7, w6
54000480 beq G_M000_IG12
6B06011F cmp w8, w6
54000440 beq G_M000_IG12
79400866 ldrh w6, [x3,#4]
6B0600FF cmp w7, w6
540003A0 beq G_M000_IG11
6B06011F cmp w8, w6
54000360 beq G_M000_IG11
79400C66 ldrh w6, [x3,#6]
6B0600FF cmp w7, w6
540002C0 beq G_M000_IG10
6B06011F cmp w8, w6
54000280 beq G_M000_IG10
91001084 add x4, x4, #4
D10010A5 sub x5, x5, #4
G_M000_IG05: ;; offset=0090H
F10010BF cmp x5, #4
54FFFCA2 bhs G_M000_IG04
G_M000_IG06: ;; offset=0098H
B4000185 cbz x5, G_M000_IG08
53003C27 uxth w7, w1
G_M000_IG07: ;; offset=00A0H
D37FF886 lsl x6, x4, #1
78666806 ldrh w6, [x0, x6]
6B0600FF cmp w7, w6
54000200 beq G_M000_IG13
53003C48 uxth w8, w2
6B06011F cmp w8, w6
540001A0 beq G_M000_IG13
91000484 add x4, x4, #1
D10004A5 sub x5, x5, #1
B5FFFEE5 cbnz x5, G_M000_IG07
G_M000_IG08: ;; offset=00C8H
12800000 movn w0, #0
G_M000_IG09: ;; offset=00CCH
A8C17BFD ldp fp, lr, [sp],#16
D65F03C0 ret lr
G_M000_IG10: ;; offset=00D4H
11000C84 add w4, w4, #3
14000039 b G_M000_IG18
align [0 bytes for IG15]
align [0 bytes]
align [0 bytes]
align [0 bytes]
G_M000_IG11: ;; offset=00DCH
11000884 add w4, w4, #2
14000037 b G_M000_IG18
G_M000_IG12: ;; offset=00E4H
11000484 add w4, w4, #1
14000035 b G_M000_IG18
G_M000_IG13: ;; offset=00ECH
14000034 b G_M000_IG18
G_M000_IG14: ;; offset=00F0H
53003C27 uxth w7, w1
4E020CF0 dup v16.8h, w7
53003C48 uxth w8, w2
4E020D11 dup v17.8h, w8
B40002C5 cbz x5, G_M000_IG16
9C000672 ldr q18, [@RWD00]
G_M000_IG15: ;; offset=0108H
D37FF881 lsl x1, x4, #1
3CE16813 ldr q19, [x0, x1]
6E738E14 cmeq v20.8h, v16.8h, v19.8h
6E738E33 cmeq v19.8h, v17.8h, v19.8h
4EB31E93 orr v19.8h, v20.8h, v19.8h
4E321E73 and v19.16b, v19.16b, v18.16b
9C000614 ldr q20, [@RWD16]
6E344673 ushl v19.16b, v19.16b, v20.16b
4F000414 movi v20.4s, #0x00
6E144274 ext v20.16b, v19.16b, v20.16b, #8
0E31BA94 addv b20, v20.8b
0E013E81 umov w1, v20.b[0]
53185C21 lsl w1, w1, #8
0E31BA73 addv b19, v19.8b
0E013E62 umov w2, v19.b[0]
2A020021 orr w1, w1, w2
350002E1 cbnz w1, G_M000_IG17
91002084 add x4, x4, #8
EB0400BF cmp x5, x4
54FFFDA8 bhi G_M000_IG15
G_M000_IG16: ;; offset=0158H
D37FF8A4 lsl x4, x5, #1
3CE46812 ldr q18, [x0, x4]
AA0503E4 mov x4, x5
6E728E10 cmeq v16.8h, v16.8h, v18.8h
6E728E32 cmeq v18.8h, v17.8h, v18.8h
4EB21E10 orr v16.8h, v16.8h, v18.8h
9C000312 ldr q18, [@RWD00]
4E321E12 and v18.16b, v16.16b, v18.16b
9C000351 ldr q17, [@RWD16]
6E314650 ushl v16.16b, v18.16b, v17.16b
4F000411 movi v17.4s, #0x00
6E114211 ext v17.16b, v16.16b, v17.16b, #8
0E31BA31 addv b17, v17.8b
0E013E20 umov w0, v17.b[0]
53185C00 lsl w0, w0, #8
0E31BA10 addv b16, v16.8b
0E013E01 umov w1, v16.b[0]
2A010001 orr w1, w0, w1
34FFF941 cbz w1, G_M000_IG08
G_M000_IG17: ;; offset=01A4H
5AC00020 rbit w0, w1
5AC01000 clz w0, w0
2A0003E0 mov w0, w0
D341FC00 lsr x0, x0, #1
8B000084 add x4, x4, x0
17FFFFCD b G_M000_IG13
G_M000_IG18: ;; offset=01BCH
2A0403E0 mov w0, w4
G_M000_IG19: ;; offset=01C0H
A8C17BFD ldp fp, lr, [sp],#16
D65F03C0 ret lr
RWD00 dq 8080808080808080h, 8080808080808080h
RWD16 dq 00FFFEFDFCFBFAF9h, 00FFFEFDFCFBFAF9h
; Total bytes of code 456
Is this PR needed given #73469?
#73469 has been merged, so should this be closed now?
I'm going to close this as it doesn't appear to be actionable now. @SwapnilGaikwad, if there are specific pieces that should be ported over, can you open a new PR for that? Thanks!
I slightly mistimed the updates. It seems we can squeeze out some more performance at the cost of readability.