Add intrinsic for SpanHelpers.Char.IndexOfAny on AArch64 #73788

SwapnilGaikwad
Contributor

No description provided.

@ghost added the area-System.Memory and community-contribution labels Aug 11, 2022
@ghost
ghost commented Aug 11, 2022

Tagging subscribers to this area: @dotnet/area-system-memory
See info in area-owners.md if you want to be subscribed.

Issue Details

null

Author: SwapnilGaikwad
Assignees: -
Labels:

area-System.Memory

Milestone: -

// So the bit position in 'matches' corresponds to the element offset.
if (matches == 0)
combinedVector = (Vector128.Equals(values0, search) | Vector128.Equals(values1, search)).AsByte();
if (!VectorContainsMatch(combinedVector))
Member

Does this need a helper like this? Other methods appear to achieve the same thing with e.g. combinedVector.AsByte().ExtractMostSignificantBits() == 0. Is that not feasible here, or does it not perform well? For example:

uint matches = Vector128.Equals(values, search).AsByte().ExtractMostSignificantBits();
if (matches == 0)

Contributor Author

The helper emits a shorter instruction sequence for detecting a match, which results in higher performance.

with VectorContainsMatch:

...
umaxp   v19.16b, v18.16b, v18.16b
umov    x7, v19.d[0]
cbnz    x7, G_M000_IG17
...

with ExtractMostSignificantBits:

...
ldr     q18, [@RWD00]
and     v18.16b, v16.16b, v18.16b
ldr     q17, [@RWD16]
ushl    v16.16b, v18.16b, v17.16b
movi    v17.4s, #0x00
ext     v17.16b, v16.16b, v17.16b, #8
addv    b17, v17.8b
umov    w0, v17.b[0]
lsl     w0, w0, #8
addv    b16, v16.8b
umov    w1, v16.b[0]
orr     w1, w0, w1
cbz     w1, G_M000_IG08
...
RWD00  	dq	8080808080808080h, 8080808080808080h
RWD16  	dq	00FFFEFDFCFBFAF9h, 00FFFEFDFCFBFAF9h
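
For readers following the comparison above, here is a minimal sketch of what a helper like `VectorContainsMatch` could look like. It is not the code from this PR: the class name `MatchSketch`, the method `ContainsEither`, and the portable fallback branch are assumptions made for illustration. On Arm64 it folds the comparison result with `AdvSimd.Arm64.MaxPairwise` (the `umaxp` in the listing) and tests the lower 64 bits; elsewhere it falls back to a plain inequality check against zero.

```csharp
using System;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.Arm;

static class MatchSketch
{
    // Hypothetical helper: returns true if any byte of a comparison result is non-zero.
    public static bool ContainsMatch(Vector128<byte> compareResult)
    {
        if (AdvSimd.Arm64.IsSupported)
        {
            // Pairwise max of the vector with itself folds all 16 bytes
            // into the lower 8 bytes, so one 64-bit scalar test is enough.
            Vector128<byte> folded = AdvSimd.Arm64.MaxPairwise(compareResult, compareResult);
            return folded.AsUInt64().ToScalar() != 0;
        }

        // Assumed portable fallback for non-Arm64 hardware.
        return compareResult != Vector128<byte>.Zero;
    }

    // Illustration only: checks the first 8 chars of 'block' for either value.
    // 'block' must contain at least 8 chars.
    public static bool ContainsEither(ReadOnlySpan<char> block, char value0, char value1)
    {
        Vector128<ushort> search = Vector128.Create(MemoryMarshal.Cast<char, ushort>(block));
        Vector128<ushort> values0 = Vector128.Create((ushort)value0);
        Vector128<ushort> values1 = Vector128.Create((ushort)value1);

        Vector128<byte> combined =
            (Vector128.Equals(values0, search) | Vector128.Equals(values1, search)).AsByte();
        return ContainsMatch(combined);
    }
}
```

The sketch only answers "is there a match in this block?". Locating the matching element is a separate step, which is why the loop body in the listing can stay short when no match is found.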

On an Altra machine (not configured for benchmarking):

|                Method |        Job |                                                                                                 Toolchain | Size |      Mean |    Error |   StdDev |    Median |       Min |       Max | Ratio | MannWhitney(2%) | Allocated | Alloc Ratio |
|---------------------- |----------- |---------------------------------------------------------------------------------------------------------- |----- |----------:|---------:|---------:|----------:|----------:|----------:|------:|---------------- |----------:|------------:|
|   IndexOfAnyTwoValues | Job-YNXVVV |           /Extract_MSB/bin/testhost/net7.0-Linux-Release-arm64/shared/Microsoft.NETCore.App/7.0.0/corerun |  512 |  94.40 ns | 0.073 ns | 0.068 ns |  94.42 ns |  94.19 ns |  94.46 ns |  1.62 |          Slower |         - |          NA |
|   IndexOfAnyTwoValues | Job-TMIMPY |   /VectorContainsMatch/bin/testhost/net7.0-Linux-Release-arm64/shared/Microsoft.NETCore.App/7.0.0/corerun |  512 |  32.40 ns | 0.019 ns | 0.015 ns |  32.39 ns |  32.38 ns |  32.43 ns |  0.56 |          Faster |         - |          NA |
|   IndexOfAnyTwoValues | Job-EKBZGE |        /unchecked_main/bin/testhost/net7.0-Linux-Release-arm64/shared/Microsoft.NETCore.App/7.0.0/corerun |  512 |  58.32 ns | 0.021 ns | 0.019 ns |  58.33 ns |  58.29 ns |  58.36 ns |  1.00 |            Base |         - |          NA |
|                       |            |                                                                                                           |      |           |          |          |           |           |           |       |                 |           |             |
| IndexOfAnyThreeValues | Job-YNXVVV |           /Extract_MSB/bin/testhost/net7.0-Linux-Release-arm64/shared/Microsoft.NETCore.App/7.0.0/corerun |  512 | 109.25 ns | 0.328 ns | 0.307 ns | 109.40 ns | 108.48 ns | 109.51 ns |  1.58 |          Slower |         - |          NA |
| IndexOfAnyThreeValues | Job-TMIMPY |   /VectorContainsMatch/bin/testhost/net7.0-Linux-Release-arm64/shared/Microsoft.NETCore.App/7.0.0/corerun |  512 |  43.68 ns | 0.028 ns | 0.026 ns |  43.68 ns |  43.64 ns |  43.74 ns |  0.63 |          Faster |         - |          NA |
| IndexOfAnyThreeValues | Job-EKBZGE |        /unchecked_main/bin/testhost/net7.0-Linux-Release-arm64/shared/Microsoft.NETCore.App/7.0.0/corerun |  512 |  69.29 ns | 0.086 ns | 0.080 ns |  69.30 ns |  69.13 ns |  69.40 ns |  1.00 |            Base |         - |          NA |
|                       |            |                                                                                                           |      |           |          |          |           |           |           |       |                 |           |             |
|  IndexOfAnyFourValues | Job-YNXVVV |           /Extract_MSB/bin/testhost/net7.0-Linux-Release-arm64/shared/Microsoft.NETCore.App/7.0.0/corerun |  512 | 129.95 ns | 0.024 ns | 0.022 ns | 129.94 ns | 129.92 ns | 129.99 ns |  1.53 |          Slower |         - |          NA |
|  IndexOfAnyFourValues | Job-TMIMPY |   /VectorContainsMatch/bin/testhost/net7.0-Linux-Release-arm64/shared/Microsoft.NETCore.App/7.0.0/corerun |  512 |  57.56 ns | 0.066 ns | 0.062 ns |  57.58 ns |  57.49 ns |  57.64 ns |  0.68 |          Faster |         - |          NA |
|  IndexOfAnyFourValues | Job-EKBZGE |        /unchecked_main/bin/testhost/net7.0-Linux-Release-arm64/shared/Microsoft.NETCore.App/7.0.0/corerun |  512 |  84.85 ns | 0.109 ns | 0.102 ns |  84.87 ns |  84.61 ns |  84.99 ns |  1.00 |            Base |         - |          NA |

I need to confirm these benchmark numbers on a better-configured system; @adamsitnik can probably help.

Full Assembly:

VectorContainsMatch
; Assembly listing for method SpanHelpers:IndexOfAny(byref,ushort,ushort,int):int
; Emitting BLENDED_CODE for generic ARM64 CPU - Unix
; optimized code
; fp based frame
; fully interruptible
; No PGO data
; 0 inlinees with PGO data; 6 single block inlinees; 4 inlinees without PGO data

G_M000_IG01:                ;; offset=0000H
        A9BF7BFD          stp     fp, lr, [sp,#-16]!
        910003FD          mov     fp, sp

G_M000_IG02:                ;; offset=0008H
        AA1F03E4          mov     x4, xzr
        2A0303E5          mov     w5, w3
        93407C63          sxtw    x3, w3
        D1002063          sub     x3, x3, #8
        F100007F          cmp     x3, #0
        540003AB          blt     G_M000_IG05

G_M000_IG03:                ;; offset=0020H
        AA0303E5          mov     x5, x3
        14000033          b       G_M000_IG14
                          align   [0 bytes for IG07]
                          align   [0 bytes]
                          align   [0 bytes]
                          align   [0 bytes]

G_M000_IG04:                ;; offset=0028H
        D37FF883          lsl     x3, x4, #1
        8B030003          add     x3, x0, x3
        79400066          ldrh    w6, [x3]
        53003C27          uxth    w7, w1
        6B0600FF          cmp     w7, w6
        54000580          beq     G_M000_IG13
        53003C48          uxth    w8, w2
        6B06011F          cmp     w8, w6
        54000520          beq     G_M000_IG13
        79400466          ldrh    w6, [x3,#2]
        6B0600FF          cmp     w7, w6
        54000480          beq     G_M000_IG12
        6B06011F          cmp     w8, w6
        54000440          beq     G_M000_IG12
        79400866          ldrh    w6, [x3,#4]
        6B0600FF          cmp     w7, w6
        540003A0          beq     G_M000_IG11
        6B06011F          cmp     w8, w6
        54000360          beq     G_M000_IG11
        79400C66          ldrh    w6, [x3,#6]
        6B0600FF          cmp     w7, w6
        540002C0          beq     G_M000_IG10
        6B06011F          cmp     w8, w6
        54000280          beq     G_M000_IG10
        91001084          add     x4, x4, #4
        D10010A5          sub     x5, x5, #4

G_M000_IG05:                ;; offset=0090H
        F10010BF          cmp     x5, #4
        54FFFCA2          bhs     G_M000_IG04

G_M000_IG06:                ;; offset=0098H
        B4000185          cbz     x5, G_M000_IG08
        53003C27          uxth    w7, w1

G_M000_IG07:                ;; offset=00A0H
        D37FF886          lsl     x6, x4, #1
        78666806          ldrh    w6, [x0, x6]
        6B0600FF          cmp     w7, w6
        54000200          beq     G_M000_IG13
        53003C48          uxth    w8, w2
        6B06011F          cmp     w8, w6
        540001A0          beq     G_M000_IG13
        91000484          add     x4, x4, #1
        D10004A5          sub     x5, x5, #1
        B5FFFEE5          cbnz    x5, G_M000_IG07

G_M000_IG08:                ;; offset=00C8H
        12800000          movn    w0, #0

G_M000_IG09:                ;; offset=00CCH
        A8C17BFD          ldp     fp, lr, [sp],#16
        D65F03C0          ret     lr

G_M000_IG10:                ;; offset=00D4H
        11000C84          add     w4, w4, #3
        14000025          b       G_M000_IG18
                          align   [0 bytes for IG15]
                          align   [0 bytes]
                          align   [0 bytes]
                          align   [0 bytes]

G_M000_IG11:                ;; offset=00DCH
        11000884          add     w4, w4, #2
        14000023          b       G_M000_IG18

G_M000_IG12:                ;; offset=00E4H
        11000484          add     w4, w4, #1
        14000021          b       G_M000_IG18

G_M000_IG13:                ;; offset=00ECH
        14000020          b       G_M000_IG18

G_M000_IG14:                ;; offset=00F0H
        53003C27          uxth    w7, w1
        4E020CF0          dup     v16.8h, w7
        53003C48          uxth    w8, w2
        4E020D11          dup     v17.8h, w8
        B4000185          cbz     x5, G_M000_IG16

G_M000_IG15:                ;; offset=0104H
        D37FF881          lsl     x1, x4, #1
        3CE16812          ldr     q18, [x0, x1]
        6E728E13          cmeq    v19.8h, v16.8h, v18.8h
        6E728E32          cmeq    v18.8h, v17.8h, v18.8h
        4EB21E72          orr     v18.8h, v19.8h, v18.8h
        6E32A653          umaxp   v19.16b, v18.16b, v18.16b
        4E083E67          umov    x7, v19.d[0]
        B50001A7          cbnz    x7, G_M000_IG17
        91002084          add     x4, x4, #8
        EB0400BF          cmp     x5, x4
        54FFFEC8          bhi     G_M000_IG15

G_M000_IG16:                ;; offset=0130H
        D37FF8A4          lsl     x4, x5, #1
        3CE46812          ldr     q18, [x0, x4]
        AA0503E4          mov     x4, x5
        6E728E10          cmeq    v16.8h, v16.8h, v18.8h
        6E728E31          cmeq    v17.8h, v17.8h, v18.8h
        4EB11E12          orr     v18.8h, v16.8h, v17.8h
        6E32A650          umaxp   v16.16b, v18.16b, v18.16b
        4E083E00          umov    x0, v16.d[0]
        B4FFFBC0          cbz     x0, G_M000_IG08

G_M000_IG17:                ;; offset=0154H
        6E32A650          umaxp   v16.16b, v18.16b, v18.16b
        4E083E00          umov    x0, v16.d[0]
        DAC00000          rbit    x0, x0
        DAC01000          clz     x0, x0
        13037C00          asr     w0, w0, #3
        0B040004          add     w4, w0, w4

G_M000_IG18:                ;; offset=016CH
        2A0403E0          mov     w0, w4

G_M000_IG19:                ;; offset=0170H
        A8C17BFD          ldp     fp, lr, [sp],#16
        D65F03C0          ret     lr

; Total bytes of code 376

ExtractMostSignificantBits
; Assembly listing for method SpanHelpers:IndexOfAny(byref,ushort,ushort,int):int
; Emitting BLENDED_CODE for generic ARM64 CPU - Unix
; optimized code
; fp based frame
; fully interruptible
; No PGO data
; 0 inlinees with PGO data; 7 single block inlinees; 1 inlinees without PGO data

G_M000_IG01:                ;; offset=0000H
        A9BF7BFD          stp     fp, lr, [sp,#-16]!
        910003FD          mov     fp, sp

G_M000_IG02:                ;; offset=0008H
        AA1F03E4          mov     x4, xzr
        2A0303E5          mov     w5, w3
        93407C63          sxtw    x3, w3
        D1002063          sub     x3, x3, #8
        F100007F          cmp     x3, #0
        540003AB          blt     G_M000_IG05

G_M000_IG03:                ;; offset=0020H
        AA0303E5          mov     x5, x3
        14000033          b       G_M000_IG14
                          align   [0 bytes for IG07]
                          align   [0 bytes]
                          align   [0 bytes]
                          align   [0 bytes]

G_M000_IG04:                ;; offset=0028H
        D37FF883          lsl     x3, x4, #1
        8B030003          add     x3, x0, x3
        79400066          ldrh    w6, [x3]
        53003C27          uxth    w7, w1
        6B0600FF          cmp     w7, w6
        54000580          beq     G_M000_IG13
        53003C48          uxth    w8, w2
        6B06011F          cmp     w8, w6
        54000520          beq     G_M000_IG13
        79400466          ldrh    w6, [x3,#2]
        6B0600FF          cmp     w7, w6
        54000480          beq     G_M000_IG12
        6B06011F          cmp     w8, w6
        54000440          beq     G_M000_IG12
        79400866          ldrh    w6, [x3,#4]
        6B0600FF          cmp     w7, w6
        540003A0          beq     G_M000_IG11
        6B06011F          cmp     w8, w6
        54000360          beq     G_M000_IG11
        79400C66          ldrh    w6, [x3,#6]
        6B0600FF          cmp     w7, w6
        540002C0          beq     G_M000_IG10
        6B06011F          cmp     w8, w6
        54000280          beq     G_M000_IG10
        91001084          add     x4, x4, #4
        D10010A5          sub     x5, x5, #4

G_M000_IG05:                ;; offset=0090H
        F10010BF          cmp     x5, #4
        54FFFCA2          bhs     G_M000_IG04

G_M000_IG06:                ;; offset=0098H
        B4000185          cbz     x5, G_M000_IG08
        53003C27          uxth    w7, w1

G_M000_IG07:                ;; offset=00A0H
        D37FF886          lsl     x6, x4, #1
        78666806          ldrh    w6, [x0, x6]
        6B0600FF          cmp     w7, w6
        54000200          beq     G_M000_IG13
        53003C48          uxth    w8, w2
        6B06011F          cmp     w8, w6
        540001A0          beq     G_M000_IG13
        91000484          add     x4, x4, #1
        D10004A5          sub     x5, x5, #1
        B5FFFEE5          cbnz    x5, G_M000_IG07

G_M000_IG08:                ;; offset=00C8H
        12800000          movn    w0, #0

G_M000_IG09:                ;; offset=00CCH
        A8C17BFD          ldp     fp, lr, [sp],#16
        D65F03C0          ret     lr

G_M000_IG10:                ;; offset=00D4H
        11000C84          add     w4, w4, #3
        14000039          b       G_M000_IG18
                          align   [0 bytes for IG15]
                          align   [0 bytes]
                          align   [0 bytes]
                          align   [0 bytes]

G_M000_IG11:                ;; offset=00DCH
        11000884          add     w4, w4, #2
        14000037          b       G_M000_IG18

G_M000_IG12:                ;; offset=00E4H
        11000484          add     w4, w4, #1
        14000035          b       G_M000_IG18

G_M000_IG13:                ;; offset=00ECH
        14000034          b       G_M000_IG18

G_M000_IG14:                ;; offset=00F0H
        53003C27          uxth    w7, w1
        4E020CF0          dup     v16.8h, w7
        53003C48          uxth    w8, w2
        4E020D11          dup     v17.8h, w8
        B40002C5          cbz     x5, G_M000_IG16
        9C000672          ldr     q18, [@RWD00]

G_M000_IG15:                ;; offset=0108H
        D37FF881          lsl     x1, x4, #1
        3CE16813          ldr     q19, [x0, x1]
        6E738E14          cmeq    v20.8h, v16.8h, v19.8h
        6E738E33          cmeq    v19.8h, v17.8h, v19.8h
        4EB31E93          orr     v19.8h, v20.8h, v19.8h
        4E321E73          and     v19.16b, v19.16b, v18.16b
        9C000614          ldr     q20, [@RWD16]
        6E344673          ushl    v19.16b, v19.16b, v20.16b
        4F000414          movi    v20.4s, #0x00
        6E144274          ext     v20.16b, v19.16b, v20.16b, #8
        0E31BA94          addv    b20, v20.8b
        0E013E81          umov    w1, v20.b[0]
        53185C21          lsl     w1, w1, #8
        0E31BA73          addv    b19, v19.8b
        0E013E62          umov    w2, v19.b[0]
        2A020021          orr     w1, w1, w2
        350002E1          cbnz    w1, G_M000_IG17
        91002084          add     x4, x4, #8
        EB0400BF          cmp     x5, x4
        54FFFDA8          bhi     G_M000_IG15

G_M000_IG16:                ;; offset=0158H
        D37FF8A4          lsl     x4, x5, #1
        3CE46812          ldr     q18, [x0, x4]
        AA0503E4          mov     x4, x5
        6E728E10          cmeq    v16.8h, v16.8h, v18.8h
        6E728E32          cmeq    v18.8h, v17.8h, v18.8h
        4EB21E10          orr     v16.8h, v16.8h, v18.8h
        9C000312          ldr     q18, [@RWD00]
        4E321E12          and     v18.16b, v16.16b, v18.16b
        9C000351          ldr     q17, [@RWD16]
        6E314650          ushl    v16.16b, v18.16b, v17.16b
        4F000411          movi    v17.4s, #0x00
        6E114211          ext     v17.16b, v16.16b, v17.16b, #8
        0E31BA31          addv    b17, v17.8b
        0E013E20          umov    w0, v17.b[0]
        53185C00          lsl     w0, w0, #8
        0E31BA10          addv    b16, v16.8b
        0E013E01          umov    w1, v16.b[0]
        2A010001          orr     w1, w0, w1
        34FFF941          cbz     w1, G_M000_IG08

G_M000_IG17:                ;; offset=01A4H
        5AC00020          rbit    w0, w1
        5AC01000          clz     w0, w0
        2A0003E0          mov     w0, w0
        D341FC00          lsr     x0, x0, #1
        8B000084          add     x4, x4, x0
        17FFFFCD          b       G_M000_IG13

G_M000_IG18:                ;; offset=01BCH
        2A0403E0          mov     w0, w4

G_M000_IG19:                ;; offset=01C0H
        A8C17BFD          ldp     fp, lr, [sp],#16
        D65F03C0          ret     lr

RWD00  	dq	8080808080808080h, 8080808080808080h
RWD16  	dq	00FFFEFDFCFBFAF9h, 00FFFEFDFCFBFAF9h

; Total bytes of code 456
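
The two listings also differ in how `G_M000_IG17` turns the match mask into an element index: the `umaxp` path shifts the trailing-zero count right by 3 (`asr w0, w0, #3`), while the `ExtractMostSignificantBits` path shifts by 1 (`lsr x0, x0, #1`). A minimal sketch of that arithmetic follows; the class and method names are hypothetical, and both methods assume a non-zero mask.

```csharp
using System;
using System.Numerics;

static class MatchIndexSketch
{
    // Mask from the pairwise-max path: the lower 64 bits hold one byte
    // (0xFF or 0x00) per 16-bit element, so 8 mask bits map to one element.
    public static int IndexFromFoldedMask(ulong foldedMask)
        => BitOperations.TrailingZeroCount(foldedMask) >> 3;

    // Mask from ExtractMostSignificantBits on the byte view: one bit per byte,
    // so 2 mask bits map to one 16-bit element.
    public static int IndexFromMsbMask(uint msbMask)
        => BitOperations.TrailingZeroCount(msbMask) >> 1;
}
```

For a match at element 3, the folded mask has its lowest set bit at position 24 and the most-significant-bits mask has it at position 6, so both methods yield 3.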

@stephentoub
Member

Is this PR needed given #73469?

@SwapnilGaikwad
Contributor Author

Is this PR needed given #73469?

Sure, we don't need this PR. I will close this one once we transfer the useful parts of this to #73469.

@bartonjs
Member

I will close this one once we transfer the useful parts of this to #73469.

#73469 has been merged, so should this be closed now?

@stephentoub
Member

I'm going to close this as it doesn't appear to be actionable now. @SwapnilGaikwad, if there are specific pieces that should be ported over, can you open a new PR for that? Thanks!

@SwapnilGaikwad
Contributor Author

I slightly mistimed the updates. It seems we can squeeze out some more performance at the cost of readability.
I created a new PR, #74010.

@ghost locked as resolved and limited conversation to collaborators Sep 15, 2022