Skip to content

Conversation

@Henr1k80
Copy link
Contributor

@Henr1k80 Henr1k80 commented Oct 2, 2025

I found that there were room for improvement in Array.FindAll

Now a small amount of matches are stack allocated and the List is only optionally allocated.

It is always faster with less allocations.

The stack allocated size can be made bigger, but it is fixed size, so it will have a price for no-match schenarios.

InlineArray16<T> is only in .NET 10, but if this should be backported to .NET 8, we can create our local InlineMatchArrayX

BenchmarkDotNet v0.15.8, Windows 11 (10.0.26200.6899/25H2/2025Update/HudsonValley2)
Intel Core i7-10750H CPU 2.60GHz (Max: 2.59GHz), 1 CPU, 12 logical and 6 physical cores
.NET SDK 10.0.102
  [Host]             : .NET 10.0.2 (10.0.2, 10.0.225.61305), X64 RyuJIT x86-64-v3
  ShortRun-.NET 10.0 : .NET 10.0.2 (10.0.2, 10.0.225.61305), X64 RyuJIT x86-64-v3
Method Size Mean Error StdDev Median Ratio RatioSD Gen0 Allocated Alloc Ratio
NewFindAllMatchAlways 1 8.612 ns 0.1054 ns 0.0986 ns 8.649 ns 0.44 0.01 0.0051 32 B 0.31
OldFindAllMatchAlways 1 19.612 ns 0.1859 ns 0.1553 ns 19.612 ns 1.00 0.01 0.0166 104 B 1.00
NewFindAllMatchAlways 4 11.605 ns 0.1418 ns 0.1257 ns 11.583 ns 0.47 0.01 0.0064 40 B 0.36
OldFindAllMatchAlways 4 24.465 ns 0.1829 ns 0.1711 ns 24.481 ns 1.00 0.01 0.0179 112 B 1.00
NewFindAllMatchAlways 5 12.565 ns 0.1595 ns 0.1332 ns 12.556 ns 0.32 0.00 0.0076 48 B 0.27
OldFindAllMatchAlways 5 38.987 ns 0.2646 ns 0.2346 ns 39.031 ns 1.00 0.01 0.0280 176 B 1.00
NewFindAllMatchAlways 8 16.465 ns 0.5138 ns 1.5068 ns 15.716 ns 0.36 0.03 0.0089 56 B 0.30
OldFindAllMatchAlways 8 46.208 ns 0.6164 ns 0.5147 ns 46.041 ns 1.00 0.02 0.0293 184 B 1.00
NewFindAllMatchAlways 9 23.627 ns 0.3274 ns 0.2902 ns 23.615 ns 0.38 0.01 0.0191 120 B 0.43
OldFindAllMatchAlways 9 62.204 ns 1.2656 ns 1.1838 ns 61.899 ns 1.00 0.03 0.0446 280 B 1.00
NewFindAllMatchHalf 1 9.687 ns 0.2352 ns 0.2310 ns 9.703 ns 0.46 0.01 0.0051 32 B 0.31
OldFindAllMatchHalf 1 20.935 ns 0.4699 ns 0.4615 ns 20.813 ns 1.00 0.03 0.0166 104 B 1.00
NewFindAllMatchHalf 4 13.175 ns 0.3239 ns 0.3977 ns 13.039 ns 0.55 0.02 0.0051 32 B 0.31
OldFindAllMatchHalf 4 23.811 ns 0.5150 ns 0.5058 ns 23.746 ns 1.00 0.03 0.0166 104 B 1.00
NewFindAllMatchHalf 5 12.973 ns 0.3205 ns 0.3291 ns 12.894 ns 0.50 0.02 0.0063 40 B 0.36
OldFindAllMatchHalf 5 26.193 ns 0.5801 ns 0.5426 ns 26.252 ns 1.00 0.03 0.0179 112 B 1.00
NewFindAllMatchHalf 8 14.521 ns 0.3448 ns 0.3832 ns 14.484 ns 0.51 0.02 0.0063 40 B 0.36
OldFindAllMatchHalf 8 28.599 ns 0.6119 ns 0.7047 ns 28.575 ns 1.00 0.03 0.0178 112 B 1.00
NewFindAllMatchHalf 9 23.859 ns 0.2977 ns 0.3655 ns 23.859 ns 0.54 0.01 0.0166 104 B 0.59
OldFindAllMatchHalf 9 44.463 ns 0.8785 ns 0.7788 ns 44.551 ns 1.00 0.02 0.0280 176 B 1.00
NewFindAllMatchNever 1 1.506 ns 0.0304 ns 0.0269 ns 1.502 ns 0.40 0.01 - - 0.00
OldFindAllMatchNever 1 3.785 ns 0.0430 ns 0.0381 ns 3.786 ns 1.00 0.01 0.0051 32 B 1.00
NewFindAllMatchNever 4 2.249 ns 0.0378 ns 0.0354 ns 2.251 ns 0.48 0.01 - - 0.00
OldFindAllMatchNever 4 4.736 ns 0.0892 ns 0.0834 ns 4.730 ns 1.00 0.02 0.0051 32 B 1.00
NewFindAllMatchNever 5 2.738 ns 0.0298 ns 0.0264 ns 2.733 ns 0.49 0.01 - - 0.00
OldFindAllMatchNever 5 5.622 ns 0.0644 ns 0.0603 ns 5.612 ns 1.00 0.01 0.0051 32 B 1.00
NewFindAllMatchNever 8 3.424 ns 0.0244 ns 0.0216 ns 3.426 ns 0.66 0.01 - - 0.00
OldFindAllMatchNever 8 5.212 ns 0.0459 ns 0.0429 ns 5.205 ns 1.00 0.01 0.0051 32 B 1.00
NewFindAllMatchNever 9 3.798 ns 0.0202 ns 0.0169 ns 3.796 ns 0.67 0.02 - - 0.00
OldFindAllMatchNever 9 5.714 ns 0.1782 ns 0.1667 ns 5.688 ns 1.00 0.04 0.0051 32 B 1.00

@github-actions github-actions bot added the needs-area-label An area label is needed to ensure this gets routed to the appropriate area owners label Oct 2, 2025
@dotnet-policy-service dotnet-policy-service bot added the community-contribution Indicates that the PR has been added by a community member label Oct 2, 2025
@huoyaoyuan
Copy link
Member

Thanks you for your contribution. The code is not performing well with best practices. Instead, just replace the List with ValueListBuilder and add a Dispose call to the builder will serve you all the optimizations.

@huoyaoyuan huoyaoyuan added area-System.Runtime and removed needs-area-label An area label is needed to ensure this gets routed to the appropriate area owners labels Oct 3, 2025
@huoyaoyuan
Copy link
Member

huoyaoyuan commented Oct 3, 2025

There may be more unintentional affect after rethinking. ArrayPool<T> is quite heavy for each T, especially when the pool is not widely shared. That's also why the BCL only uses ArrayPool over limited set of types (byte, char and other integers). Even ArrayPool<object> is never used in BCL.

Initializing the array pool for each T has not acceptable overhead to whole program. Microbenchmark improvement doesn't result in benefit to whole program. This PR could not be accepted because of this.

@Henr1k80
Copy link
Contributor Author

Henr1k80 commented Oct 3, 2025

There may be more unintentional affect after rethinking. ArrayPool<T> is quite heavy for each T, especially when the pool is not widely shared. That's also why the BCL only uses ArrayPool over limited set of types (byte, char and other integers). Even ArrayPool<object> is never used in BCL.

Initializing the array pool for each T has not acceptable overhead to whole program. Microbenchmark improvement doesn't result in benefit to whole program. This PR could not be accepted because of this.

SegmentedArrayBuilder uses ArrayPool<T> and SegmentedArrayBuilder is used within LINQ

@Henr1k80 Henr1k80 marked this pull request as ready for review October 3, 2025 14:14
@tarekgh tarekgh added the needs-author-action An issue or pull request that requires more info or actions from the author. label Jan 21, 2026
@jkotas
Copy link
Member

jkotas commented Jan 21, 2026

I wanted to use this method, dived into the implementation and found that I was better off not using this method, unless improved

What would be the typical distribution of the array lengths passed into this API and returned by this API in your case?

could you please guide how can do that to ensure no regression?

This type of change tends to be shifting the cost. It improves the cases that we expect to matter more, and degrades the cases that we expect to be unimportant.

My gut-feel is that:

  • 16 is probably too big of a fixed buffer. Something like 4 may be more appropriate.
  • Depending on List pulls in a bunch of extra code. It may be leaner to open code array resizing as part of the method (without any helpers). It would save both List<T> allocation and List code.

@dotnet-policy-service dotnet-policy-service bot removed the needs-author-action An issue or pull request that requires more info or actions from the author. label Jan 22, 2026
@Henr1k80
Copy link
Contributor Author

@jkotas

16 is probably too big of a fixed buffer. Something like 4 may be more appropriate.

Does it make sense to check the length of the incoming array first?
If it is less than or equal to 16, we could call an implementation that only stack allocates, without using any List code at all.
There could also be implementations for less than or equal to 4 and/or 8, at the cost of more code & branches.

I presume that the dotnet runtime will move the unused code paths out of the way, when it is optimizing.
I do not know enough about NativeAOT to tell if it would do the same?

@Henr1k80
Copy link
Contributor Author

@jkotas

What would be the typical distribution of the array lengths passed into this API and returned by this API in your case?

I can no longer remember the context, but likely less than 4 and almost certainly less than 16 inputted.
As this is reasonable to stack allocate, the output size doesn't matter that much.

@tarekgh
Copy link
Member

tarekgh commented Jan 22, 2026

@jkotas would the following implementation match what you are suggesting?

        public static T[] FindAll<T>(T[] array, Predicate<T> match)
        {
            if (array == null)
            {
                ThrowHelper.ThrowArgumentNullException(ExceptionArgument.array);
            }

            if (match == null)
            {
                ThrowHelper.ThrowArgumentNullException(ExceptionArgument.match);
            }

            InlineArray4<T> stackAllocatedMatches = default;
            Span<T> span = stackAllocatedMatches;
            int foundCount = 0;

            for (int i = 0; i < array.Length; i++)
            {
                T value = array[i];
                if (match(value))
                {
                    if (foundCount >= span.Length)
                    {
                        T[] values = new T[span.Length * 2];
                        span.CopyTo(values);
                        span = values;
                    }

                    span[foundCount++] = value;
                }
            }

            return span.Slice(0, foundCount).ToArray();
        }

@jkotas
Copy link
Member

jkotas commented Jan 22, 2026

Yes, I think something like this would have a better balance of perf characteristics. You do not need to special case Array,Empty in the implementation. Span.ToArray has that special case already.

@tarekgh
Copy link
Member

tarekgh commented Jan 22, 2026

@Henr1k80 could you please try the implementation #120336 (comment) and get the benchmark numbers for it? Thanks!

@jkotas
Copy link
Member

jkotas commented Jan 22, 2026

If it is less than or equal to 16, we could call an implementation that only stack allocates, without using any List code at all.
There could also be implementations for less than or equal to 4 and/or 8, at the cost of more code & branches.

This is a convenience API that is always going to be leaving perf on the table. I do not think it makes sense to overengineer the implementation like this.

@Henr1k80
Copy link
Contributor Author

@tarekgh & @jkotas, I have updated the code & updated the benchmarks

Copy link
Member

@tarekgh tarekgh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jkotas the latest update LGTM. Please let us know if you have any more feedback.

Copy link
Member

@jkotas jkotas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM otherwise. Thanks!

@jkotas jkotas merged commit 687c8d9 into dotnet:main Jan 24, 2026
148 of 150 checks passed
@jkotas jkotas added the tenet-performance Performance related issue label Jan 24, 2026
@Henr1k80
Copy link
Contributor Author

I just updated the bench results in the original comment to the newest suggestions.
I wish there was an easy way to compare with Toubs last change, it could avoid some overallocations, especially compared to the original implementation.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area-System.Runtime community-contribution Indicates that the PR has been added by a community member tenet-performance Performance related issue

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants