Add Neon implementation of search_n #6108

Merged
StephanTLavavej merged 3 commits into microsoft:main from hazzlim:search_n-neon-pr
Feb 28, 2026

@hazzlim hazzlim commented Feb 25, 2026

This PR adds a Neon implementation of search_n, for small values of n.

The notable difference from the SSE4.2/AVX2 paths is that here we use a 64-bit nibble mask rather than a packed bitmask, in order to avoid expensive MOVEMASK emulation. As a result, we handle the carry between blocks via CLZ rather than mask concatenation.
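
For readers unfamiliar with the nibble-mask idea, here is a minimal scalar sketch of it. This is not the code from this PR: `nibble_mask`, `clz64`, `trailing_match_lanes`, and `search_n_sketch` are hypothetical names, the lane loop stands in for what would be Neon compare and narrowing-shift instructions, and it assumes 16 byte-sized lanes per block. It only illustrates how a 4-bits-per-lane mask can be built and how a count of leading zeros on the inverted mask yields the number of matching lanes to carry into the next block.

```cpp
#include <cstddef>
#include <cstdint>

// Scalar stand-in for a vector compare followed by a narrowing shift:
// each of the 16 byte lanes contributes one nibble (0xF if equal, 0x0 if not).
inline std::uint64_t nibble_mask(const std::uint8_t* block, std::uint8_t value) {
    std::uint64_t mask = 0;
    for (std::size_t lane = 0; lane < 16; ++lane) {
        if (block[lane] == value) {
            mask |= std::uint64_t{0xF} << (4 * lane);
        }
    }
    return mask;
}

// Portable count-leading-zeros; returns 64 for x == 0.
inline int clz64(std::uint64_t x) {
    int n = 0;
    for (std::uint64_t bit = std::uint64_t{1} << 63; bit != 0 && (x & bit) == 0; bit >>= 1) {
        ++n;
    }
    return n;
}

// Matching lanes at the end of the block: leading ones of the mask, 4 bits per lane.
inline std::size_t trailing_match_lanes(std::uint64_t mask) {
    return static_cast<std::size_t>(clz64(~mask)) / 4;
}

// Returns the index of the first run of n elements equal to value, or size if none.
inline std::size_t search_n_sketch(const std::uint8_t* data, std::size_t size,
                                   std::size_t n, std::uint8_t value) {
    if (n == 0) {
        return 0;
    }
    const std::uint64_t all_ones = ~std::uint64_t{0};
    std::size_t carry = 0; // matching lanes that ended the previous block
    std::size_t pos = 0;
    for (; pos + 16 <= size; pos += 16) {
        const std::uint64_t mask = nibble_mask(data + pos, value);
        std::size_t run = carry;
        for (std::size_t lane = 0; lane < 16; ++lane) {
            if ((mask >> (4 * lane)) & 1) {
                if (++run == n) {
                    return pos + lane + 1 - n;
                }
            } else {
                run = 0;
            }
        }
        // A fully matching block extends the carry; otherwise CLZ gives the new carry.
        carry = (mask == all_ones) ? carry + 16 : trailing_match_lanes(mask);
    }
    // Scalar tail for the final partial block.
    std::size_t run = carry;
    for (; pos < size; ++pos) {
        if (data[pos] == value) {
            if (++run == n) {
                return pos + 1 - n;
            }
        } else {
            run = 0;
        }
    }
    return size;
}
```

In this sketch a run that straddles two blocks is found because the second block starts counting from the carry instead of from zero, which is the role mask concatenation plays on the x86 paths.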

I also included a 64-bit path, as this did actually seem to be a (modest) gain.

Results are a little mixed and variable, and contain some 'spurious' results for n > 8 where the vector implementation is not hit, so the differences between the header implementation and the fallback are being measured instead. However, they seem to roughly match the results for SSE4.2 in #5544.

Benchmark results 🕐:

Benchmark MSVC Clang
bm<uint8_t, AlgType::Std, PatternType::TwoZones>/3000/40 0.89 0.897
bm<uint8_t, AlgType::Std, PatternType::TwoZones>/3000/18 0.878 0.95
bm<uint8_t, AlgType::Std, PatternType::TwoZones>/3000/16 0.909 1
bm<uint8_t, AlgType::Std, PatternType::TwoZones>/3000/14 0.88 0.979
bm<uint8_t, AlgType::Std, PatternType::TwoZones>/3000/10 0.87 0.994
bm<uint8_t, AlgType::Std, PatternType::TwoZones>/3000/8 1.024 1.299
bm<uint8_t, AlgType::Std, PatternType::TwoZones>/3000/5 1.638 2
bm<uint8_t, AlgType::Std, PatternType::TwoZones>/3000/4 1.964 2.363
bm<uint8_t, AlgType::Std, PatternType::TwoZones>/3000/3 2.601 3.22
bm<uint8_t, AlgType::Std, PatternType::TwoZones>/3000/2 3.884 4.781
bm<uint8_t, AlgType::Std, PatternType::TwoZones>/3000/1 1 1
bm<uint8_t, AlgType::Rng, PatternType::TwoZones>/3000/40 1.094 1.35
bm<uint8_t, AlgType::Rng, PatternType::TwoZones>/3000/18 1.122 1.469
bm<uint8_t, AlgType::Rng, PatternType::TwoZones>/3000/16 1.111 1.5
bm<uint8_t, AlgType::Rng, PatternType::TwoZones>/3000/14 1.146 1.464
bm<uint8_t, AlgType::Rng, PatternType::TwoZones>/3000/10 1.146 1.467
bm<uint8_t, AlgType::Rng, PatternType::TwoZones>/3000/8 1.398 1.829
bm<uint8_t, AlgType::Rng, PatternType::TwoZones>/3000/5 2.14 2.795
bm<uint8_t, AlgType::Rng, PatternType::TwoZones>/3000/4 2.683 3.419
bm<uint8_t, AlgType::Rng, PatternType::TwoZones>/3000/3 3.405 4.498
bm<uint8_t, AlgType::Rng, PatternType::TwoZones>/3000/2 5 6.707
bm<uint8_t, AlgType::Rng, PatternType::TwoZones>/3000/1 1 1
bm<uint8_t, AlgType::Std, PatternType::DenseSmallSequences>/3000/40 1.93 1.736
bm<uint8_t, AlgType::Std, PatternType::DenseSmallSequences>/3000/18 2.051 1.823
bm<uint8_t, AlgType::Std, PatternType::DenseSmallSequences>/3000/16 1.979 1.793
bm<uint8_t, AlgType::Std, PatternType::DenseSmallSequences>/3000/14 1.952 1.727
bm<uint8_t, AlgType::Std, PatternType::DenseSmallSequences>/3000/10 1.72 1.508
bm<uint8_t, AlgType::Std, PatternType::DenseSmallSequences>/3000/8 3.636 3.043
bm<uint8_t, AlgType::Std, PatternType::DenseSmallSequences>/3000/5 3.953 3.083
bm<uint8_t, AlgType::Std, PatternType::DenseSmallSequences>/3000/4 3.824 3.113
bm<uint8_t, AlgType::Std, PatternType::DenseSmallSequences>/3000/3 3.739 3.083
bm<uint8_t, AlgType::Std, PatternType::DenseSmallSequences>/3000/2 3.794 3.251
bm<uint8_t, AlgType::Std, PatternType::DenseSmallSequences>/3000/1 0.976 0.986
bm<uint8_t, AlgType::Rng, PatternType::DenseSmallSequences>/3000/40 0.972 1.122
bm<uint8_t, AlgType::Rng, PatternType::DenseSmallSequences>/3000/18 0.995 1.166
bm<uint8_t, AlgType::Rng, PatternType::DenseSmallSequences>/3000/16 0.977 1.131
bm<uint8_t, AlgType::Rng, PatternType::DenseSmallSequences>/3000/14 1.04 1.19
bm<uint8_t, AlgType::Rng, PatternType::DenseSmallSequences>/3000/10 0.995 1.122
bm<uint8_t, AlgType::Rng, PatternType::DenseSmallSequences>/3000/8 2.174 2.312
bm<uint8_t, AlgType::Rng, PatternType::DenseSmallSequences>/3000/5 2.609 2.846
bm<uint8_t, AlgType::Rng, PatternType::DenseSmallSequences>/3000/4 2.905 3.327
bm<uint8_t, AlgType::Rng, PatternType::DenseSmallSequences>/3000/3 3.25 3.824
bm<uint8_t, AlgType::Rng, PatternType::DenseSmallSequences>/3000/2 4.444 5.687
bm<uint8_t, AlgType::Rng, PatternType::DenseSmallSequences>/3000/1 1 1
bm<uint16_t, AlgType::Std, PatternType::TwoZones>/3000/40 0.963 0.844
bm<uint16_t, AlgType::Std, PatternType::TwoZones>/3000/18 0.953 0.907
bm<uint16_t, AlgType::Std, PatternType::TwoZones>/3000/16 0.979 0.875
bm<uint16_t, AlgType::Std, PatternType::TwoZones>/3000/14 0.977 0.928
bm<uint16_t, AlgType::Std, PatternType::TwoZones>/3000/10 0.953 0.902
bm<uint16_t, AlgType::Std, PatternType::TwoZones>/3000/8 0.943 0.937
bm<uint16_t, AlgType::Std, PatternType::TwoZones>/3000/5 0.936 0.938
bm<uint16_t, AlgType::Std, PatternType::TwoZones>/3000/4 1.449 1.579
bm<uint16_t, AlgType::Std, PatternType::TwoZones>/3000/3 1.917 2.134
bm<uint16_t, AlgType::Std, PatternType::TwoZones>/3000/2 2.803 3.095
bm<uint16_t, AlgType::Std, PatternType::TwoZones>/3000/1 1.011 1.064
bm<uint16_t, AlgType::Rng, PatternType::TwoZones>/3000/40 1.163 1.19
bm<uint16_t, AlgType::Rng, PatternType::TwoZones>/3000/18 1.312 1.209
bm<uint16_t, AlgType::Rng, PatternType::TwoZones>/3000/16 1.283 1.225
bm<uint16_t, AlgType::Rng, PatternType::TwoZones>/3000/14 1.333 1.244
bm<uint16_t, AlgType::Rng, PatternType::TwoZones>/3000/10 1.343 1.286
bm<uint16_t, AlgType::Rng, PatternType::TwoZones>/3000/8 1.352 1.256
bm<uint16_t, AlgType::Rng, PatternType::TwoZones>/3000/5 1.368 1.267
bm<uint16_t, AlgType::Rng, PatternType::TwoZones>/3000/4 2.091 2.091
bm<uint16_t, AlgType::Rng, PatternType::TwoZones>/3000/3 2.8 2.833
bm<uint16_t, AlgType::Rng, PatternType::TwoZones>/3000/2 4.167 4.051
bm<uint16_t, AlgType::Rng, PatternType::TwoZones>/3000/1 0.989 1.049
bm<uint16_t, AlgType::Std, PatternType::DenseSmallSequences>/3000/40 1.8 1.795
bm<uint16_t, AlgType::Std, PatternType::DenseSmallSequences>/3000/18 1.95 1.98
bm<uint16_t, AlgType::Std, PatternType::DenseSmallSequences>/3000/16 2.013 1.875
bm<uint16_t, AlgType::Std, PatternType::DenseSmallSequences>/3000/14 1.905 1.833
bm<uint16_t, AlgType::Std, PatternType::DenseSmallSequences>/3000/10 1.853 1.607
bm<uint16_t, AlgType::Std, PatternType::DenseSmallSequences>/3000/8 1.673 1.495
bm<uint16_t, AlgType::Std, PatternType::DenseSmallSequences>/3000/5 1.515 1.273
bm<uint16_t, AlgType::Std, PatternType::DenseSmallSequences>/3000/4 2.174 1.87
bm<uint16_t, AlgType::Std, PatternType::DenseSmallSequences>/3000/3 2.13 1.957
bm<uint16_t, AlgType::Std, PatternType::DenseSmallSequences>/3000/2 2.2 2.087
bm<uint16_t, AlgType::Std, PatternType::DenseSmallSequences>/3000/1 1 1.067
bm<uint16_t, AlgType::Rng, PatternType::DenseSmallSequences>/3000/40 0.974 0.948
bm<uint16_t, AlgType::Rng, PatternType::DenseSmallSequences>/3000/18 0.998 1.059
bm<uint16_t, AlgType::Rng, PatternType::DenseSmallSequences>/3000/16 0.989 1.027
bm<uint16_t, AlgType::Rng, PatternType::DenseSmallSequences>/3000/14 1 1.01
bm<uint16_t, AlgType::Rng, PatternType::DenseSmallSequences>/3000/10 0.98 1
bm<uint16_t, AlgType::Rng, PatternType::DenseSmallSequences>/3000/8 1 0.977
bm<uint16_t, AlgType::Rng, PatternType::DenseSmallSequences>/3000/5 1.034 0.977
bm<uint16_t, AlgType::Rng, PatternType::DenseSmallSequences>/3000/4 1.789 1.598
bm<uint16_t, AlgType::Rng, PatternType::DenseSmallSequences>/3000/3 2.152 2.005
bm<uint16_t, AlgType::Rng, PatternType::DenseSmallSequences>/3000/2 4.065 3.272
bm<uint16_t, AlgType::Rng, PatternType::DenseSmallSequences>/3000/1 0.978 1.054
bm<uint32_t, AlgType::Std, PatternType::TwoZones>/3000/40 1.091 0.798
bm<uint32_t, AlgType::Std, PatternType::TwoZones>/3000/18 0.945 0.852
bm<uint32_t, AlgType::Std, PatternType::TwoZones>/3000/16 0.987 0.87
bm<uint32_t, AlgType::Std, PatternType::TwoZones>/3000/14 0.932 0.891
bm<uint32_t, AlgType::Std, PatternType::TwoZones>/3000/10 0.956 0.875
bm<uint32_t, AlgType::Std, PatternType::TwoZones>/3000/8 0.942 0.894
bm<uint32_t, AlgType::Std, PatternType::TwoZones>/3000/5 0.955 0.894
bm<uint32_t, AlgType::Std, PatternType::TwoZones>/3000/4 0.924 0.923
bm<uint32_t, AlgType::Std, PatternType::TwoZones>/3000/3 0.937 0.937
bm<uint32_t, AlgType::Std, PatternType::TwoZones>/3000/2 2.065 2.053
bm<uint32_t, AlgType::Std, PatternType::TwoZones>/3000/1 0.994 1
bm<uint32_t, AlgType::Rng, PatternType::TwoZones>/3000/40 1.216 1.182
bm<uint32_t, AlgType::Rng, PatternType::TwoZones>/3000/18 1.336 1.237
bm<uint32_t, AlgType::Rng, PatternType::TwoZones>/3000/16 1.3 1.277
bm<uint32_t, AlgType::Rng, PatternType::TwoZones>/3000/14 1.33 1.275
bm<uint32_t, AlgType::Rng, PatternType::TwoZones>/3000/10 1.395 1.308
bm<uint32_t, AlgType::Rng, PatternType::TwoZones>/3000/8 1.391 1.26
bm<uint32_t, AlgType::Rng, PatternType::TwoZones>/3000/5 1.4 1.272
bm<uint32_t, AlgType::Rng, PatternType::TwoZones>/3000/4 1.452 1.306
bm<uint32_t, AlgType::Rng, PatternType::TwoZones>/3000/3 1.439 1.321
bm<uint32_t, AlgType::Rng, PatternType::TwoZones>/3000/2 3.125 2.826
bm<uint32_t, AlgType::Rng, PatternType::TwoZones>/3000/1 0.969 1
bm<uint32_t, AlgType::Std, PatternType::DenseSmallSequences>/3000/40 2.108 1.895
bm<uint32_t, AlgType::Std, PatternType::DenseSmallSequences>/3000/18 2.114 2.022
bm<uint32_t, AlgType::Std, PatternType::DenseSmallSequences>/3000/16 1.985 1.875
bm<uint32_t, AlgType::Std, PatternType::DenseSmallSequences>/3000/14 1.913 1.878
bm<uint32_t, AlgType::Std, PatternType::DenseSmallSequences>/3000/10 1.796 1.643
bm<uint32_t, AlgType::Std, PatternType::DenseSmallSequences>/3000/8 1.677 1.491
bm<uint32_t, AlgType::Std, PatternType::DenseSmallSequences>/3000/5 1.562 1.364
bm<uint32_t, AlgType::Std, PatternType::DenseSmallSequences>/3000/4 1.408 1.224
bm<uint32_t, AlgType::Std, PatternType::DenseSmallSequences>/3000/3 1.232 1.026
bm<uint32_t, AlgType::Std, PatternType::DenseSmallSequences>/3000/2 1.396 1.247
bm<uint32_t, AlgType::Std, PatternType::DenseSmallSequences>/3000/1 0.978 1.022
bm<uint32_t, AlgType::Rng, PatternType::DenseSmallSequences>/3000/40 0.998 1
bm<uint32_t, AlgType::Rng, PatternType::DenseSmallSequences>/3000/18 1.05 1.053
bm<uint32_t, AlgType::Rng, PatternType::DenseSmallSequences>/3000/16 1.038 1.024
bm<uint32_t, AlgType::Rng, PatternType::DenseSmallSequences>/3000/14 1.024 1.024
bm<uint32_t, AlgType::Rng, PatternType::DenseSmallSequences>/3000/10 1.026 1.016
bm<uint32_t, AlgType::Rng, PatternType::DenseSmallSequences>/3000/8 1 0.977
bm<uint32_t, AlgType::Rng, PatternType::DenseSmallSequences>/3000/5 1.045 0.974
bm<uint32_t, AlgType::Rng, PatternType::DenseSmallSequences>/3000/4 1.061 1.045
bm<uint32_t, AlgType::Rng, PatternType::DenseSmallSequences>/3000/3 1.125 1.022
bm<uint32_t, AlgType::Rng, PatternType::DenseSmallSequences>/3000/2 1.831 1.864
bm<uint32_t, AlgType::Rng, PatternType::DenseSmallSequences>/3000/1 1 1.002
bm<uint64_t, AlgType::Std, PatternType::TwoZones>/3000/40 0.942 0.809
bm<uint64_t, AlgType::Std, PatternType::TwoZones>/3000/18 0.953 0.855
bm<uint64_t, AlgType::Std, PatternType::TwoZones>/3000/16 0.957 0.855
bm<uint64_t, AlgType::Std, PatternType::TwoZones>/3000/14 0.95 0.852
bm<uint64_t, AlgType::Std, PatternType::TwoZones>/3000/10 0.956 0.869
bm<uint64_t, AlgType::Std, PatternType::TwoZones>/3000/8 0.937 0.942
bm<uint64_t, AlgType::Std, PatternType::TwoZones>/3000/5 0.936 0.906
bm<uint64_t, AlgType::Std, PatternType::TwoZones>/3000/4 0.924 0.895
bm<uint64_t, AlgType::Std, PatternType::TwoZones>/3000/3 0.917 0.941
bm<uint64_t, AlgType::Std, PatternType::TwoZones>/3000/2 0.902 0.919
bm<uint64_t, AlgType::Std, PatternType::TwoZones>/3000/1 1.022 0.932
bm<uint64_t, AlgType::Rng, PatternType::TwoZones>/3000/40 1.27 1.228
bm<uint64_t, AlgType::Rng, PatternType::TwoZones>/3000/18 1.391 1.227
bm<uint64_t, AlgType::Rng, PatternType::TwoZones>/3000/16 1.436 1.225
bm<uint64_t, AlgType::Rng, PatternType::TwoZones>/3000/14 1.393 1.219
bm<uint64_t, AlgType::Rng, PatternType::TwoZones>/3000/10 1.438 1.232
bm<uint64_t, AlgType::Rng, PatternType::TwoZones>/3000/8 1.489 1.25
bm<uint64_t, AlgType::Rng, PatternType::TwoZones>/3000/5 1.471 1.239
bm<uint64_t, AlgType::Rng, PatternType::TwoZones>/3000/4 1.5 1.278
bm<uint64_t, AlgType::Rng, PatternType::TwoZones>/3000/3 1.481 1.292
bm<uint64_t, AlgType::Rng, PatternType::TwoZones>/3000/2 1.488 1.255
bm<uint64_t, AlgType::Rng, PatternType::TwoZones>/3000/1 1 0.978
bm<uint64_t, AlgType::Std, PatternType::DenseSmallSequences>/3000/40 1.875 1.895
bm<uint64_t, AlgType::Std, PatternType::DenseSmallSequences>/3000/18 2.068 1.964
bm<uint64_t, AlgType::Std, PatternType::DenseSmallSequences>/3000/16 1.979 1.875
bm<uint64_t, AlgType::Std, PatternType::DenseSmallSequences>/3000/14 1.933 1.833
bm<uint64_t, AlgType::Std, PatternType::DenseSmallSequences>/3000/10 1.764 1.695
bm<uint64_t, AlgType::Std, PatternType::DenseSmallSequences>/3000/8 1.677 1.495
bm<uint64_t, AlgType::Std, PatternType::DenseSmallSequences>/3000/5 1.466 1.303
bm<uint64_t, AlgType::Std, PatternType::DenseSmallSequences>/3000/4 1.347 1.197
bm<uint64_t, AlgType::Std, PatternType::DenseSmallSequences>/3000/3 1.143 1.071
bm<uint64_t, AlgType::Std, PatternType::DenseSmallSequences>/3000/2 0.942 0.853
bm<uint64_t, AlgType::Std, PatternType::DenseSmallSequences>/3000/1 1 0.991
bm<uint64_t, AlgType::Rng, PatternType::DenseSmallSequences>/3000/40 0.949 1.026
bm<uint64_t, AlgType::Rng, PatternType::DenseSmallSequences>/3000/18 1.029 1.071
bm<uint64_t, AlgType::Rng, PatternType::DenseSmallSequences>/3000/16 1.075 1.073
bm<uint64_t, AlgType::Rng, PatternType::DenseSmallSequences>/3000/14 1.049 1.034
bm<uint64_t, AlgType::Rng, PatternType::DenseSmallSequences>/3000/10 0.949 0.995
bm<uint64_t, AlgType::Rng, PatternType::DenseSmallSequences>/3000/8 1 0.982
bm<uint64_t, AlgType::Rng, PatternType::DenseSmallSequences>/3000/5 1.023 1.023
bm<uint64_t, AlgType::Rng, PatternType::DenseSmallSequences>/3000/4 1.073 1.024
bm<uint64_t, AlgType::Rng, PatternType::DenseSmallSequences>/3000/3 1.167 1.095
bm<uint64_t, AlgType::Rng, PatternType::DenseSmallSequences>/3000/2 1.4 1.3
bm<uint64_t, AlgType::Rng, PatternType::DenseSmallSequences>/3000/1 0.991 0.991

@hazzlim hazzlim requested a review from a team as a code owner February 25, 2026 09:45
@github-project-automation github-project-automation bot moved this to Initial Review in STL Code Reviews Feb 25, 2026
@StephanTLavavej StephanTLavavej added performance Must go faster ARM64 Related to the ARM64 architecture ARM64EC I can't believe it's not x64! labels Feb 25, 2026
@StephanTLavavej StephanTLavavej self-assigned this Feb 25, 2026
@StephanTLavavej StephanTLavavej removed their assignment Feb 25, 2026
@StephanTLavavej StephanTLavavej moved this from Initial Review to Ready To Merge in STL Code Reviews Feb 25, 2026
@StephanTLavavej StephanTLavavej moved this from Ready To Merge to Merging in STL Code Reviews Feb 26, 2026
@StephanTLavavej (Member)

I'm mirroring this to the MSVC-internal repo. Please notify me if any further changes are pushed, otherwise no action is required.

@StephanTLavavej StephanTLavavej merged commit d37e8bb into microsoft:main Feb 28, 2026
49 checks passed
@github-project-automation github-project-automation bot moved this from Merging to Done in STL Code Reviews Feb 28, 2026
@StephanTLavavej (Member)

🦾 🕵️ 🖖
