Skip to content

Enumerable.ToArray performance improvement using InlineArray #90459

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 1 commit into from

Conversation

neuecc
Copy link

@neuecc neuecc commented Aug 12, 2023

I've optimized the Enumerable.ToArray method for the case when the source is a pure IEnumerable<T>.
Typically, when the buffer overflows, an array with double the previous capacity is created, and the copying process repeats.
In this PR, although we still create an array of double the size, instead of copying immediately, we add the array to a list and perform all the copying at the end.
By reducing the number of copy operations, we see a significant performance improvement.

Benchmark

BenchmarkDotNet v0.13.7, Windows 10 (10.0.19045.3324/22H2/2022Update)
AMD Ryzen 9 5950X, 1 CPU, 32 logical and 16 physical cores
.NET SDK 8.0.100-preview.7.23376.3
  [Host]     : .NET 8.0.0 (8.0.23.37506), X64 RyuJIT AVX2
  Job-NGJHUQ : .NET 8.0.0 (8.0.23.37506), X64 RyuJIT AVX2

IterationCount=1  WarmupCount=1  
Method Length Mean Error Gen0 Gen1 Gen2 Allocated
ToList 10 53.12 ns NA 0.0153 - - 256 B
ToArray 10 71.83 ns NA 0.0153 - - 256 B
PR_ToArray 10 38.40 ns NA 0.0119 - - 200 B
ToList 100 220.14 ns NA 0.0730 - - 1224 B
ToArray 100 326.60 ns NA 0.0710 - - 1192 B
PR_ToArray 100 180.50 ns NA 0.0644 - - 1080 B
ToList 1000 1,679.32 ns NA 0.5054 0.0057 - 8464 B
ToArray 1000 2,081.20 ns NA 0.5074 - - 8536 B
PR_ToArray 1000 1,384.58 ns NA 0.4978 - - 8336 B
ToList 10000 17,214.66 ns NA 7.8430 1.2817 - 131440 B
ToArray 10000 21,014.18 ns NA 6.3171 - - 106224 B
PR_ToArray 10000 13,196.72 ns NA 6.3171 - - 105872 B
ToList 100000 338,986.18 ns NA 285.6445 285.6445 285.6445 1049112 B
ToArray 100000 382,722.36 ns NA 249.5117 249.5117 249.5117 925132 B
PR_ToArray 100000 358,964.21 ns NA 249.5117 249.5117 249.5117 924780 B
ToList 1000000 4,533,538.28 ns NA 1984.3750 1984.3750 1984.3750 8389748 B
ToArray 1000000 3,067,601.56 ns NA 796.8750 796.8750 796.8750 8195588 B
PR_ToArray 1000000 2,418,546.88 ns NA 800.7813 800.7813 800.7813 8195040 B
ToList 10000000 197,990,525.00 ns NA 3875.0000 3875.0000 3875.0000 134219664 B
ToArray 10000000 229,693,275.00 ns NA 1875.0000 1875.0000 1875.0000 107110743 B
PR_ToArray 10000000 29,184,353.12 ns NA 1968.7500 1968.7500 1968.7500 107110478 B
ToList 100000000 583,235,200.00 ns NA 6000.0000 6000.0000 6000.0000 1073744896 B
ToArray 100000000 501,066,800.00 ns NA 3000.0000 3000.0000 3000.0000 936873600 B
PR_ToArray 100000000 270,801,200.00 ns NA 2000.0000 2000.0000 2000.0000 936872432 B
ToList 1000000000 4,677,241,700.00 ns NA 9000.0000 9000.0000 9000.0000 8589938744 B
ToArray 1000000000 5,229,577,600.00 ns NA 4000.0000 4000.0000 4000.0000 8294970392 B
PR_ToArray 1000000000 3,136,817,900.00 ns NA 4000.0000 4000.0000 4000.0000 8294969760 B
public class ToArrayBenchmark
{
    [Params(10, 100, 1000, 10000, 100000, 1000000, 10000000, 100000000, 1000000000)]
    public int Length { get; set; }

    [Benchmark]
    public List<int> ToList()
    {
        return GenerateNumber(Length).ToList();
    }

    [Benchmark]
    public int[] ToArray()
    {
        return GenerateNumber(Length).ToArray();
    }

    [Benchmark]
    public int[] PR_ToArray()
    {
        return EnumerableHelpers.ToArray2(GenerateNumber(Length));
    }

    public static IEnumerable<int> GenerateNumber(int count)
    {
        for (int i = 0; i < count; i++)
        {
            yield return i;
        }
    }
}

Implementation detail

In this scenario, there's a fixed maximum length for the list of arrays that should be allocated.
Starting with 4 and doubling the array size each time, we reach the maximum length after 29 arrays.
Therefore, using [InlineArray(29)], I've reserved a fixed-length space for the T[].

Option

It's also possible to use ArrayPool<T>.Shared.Rent instead of new T[].
In that case, by returning the array when calling ToArray, both copying and returning can be efficiently handled.
I wasn't sure if using ArrayPool was appropriate for this scenario, so in this PR, I've opted for creating new arrays.

use InlineArray byte[] sequence instead of LargeArrayBuilder
@ghost ghost added area-System.Collections community-contribution Indicates that the PR has been added by a community member labels Aug 12, 2023
@ghost
Copy link

ghost commented Aug 12, 2023

Tagging subscribers to this area: @dotnet/area-system-collections
See info in area-owners.md if you want to be subscribed.

Issue Details

I've optimized the Enumerable.ToArray method for the case when the source is a pure IEnumerable<T>.
Typically, when the buffer overflows, an array with double the previous capacity is created, and the copying process repeats.
In this PR, although we still create an array of double the size, instead of copying immediately, we add the array to a list and perform all the copying at the end.
By reducing the number of copy operations, we see a significant performance improvement.

Benchmark

BenchmarkDotNet v0.13.7, Windows 10 (10.0.19045.3324/22H2/2022Update)
AMD Ryzen 9 5950X, 1 CPU, 32 logical and 16 physical cores
.NET SDK 8.0.100-preview.7.23376.3
  [Host]     : .NET 8.0.0 (8.0.23.37506), X64 RyuJIT AVX2
  Job-NGJHUQ : .NET 8.0.0 (8.0.23.37506), X64 RyuJIT AVX2

IterationCount=1  WarmupCount=1  
Method Length Mean Error Gen0 Gen1 Gen2 Allocated
ToList 10 53.12 ns NA 0.0153 - - 256 B
ToArray 10 71.83 ns NA 0.0153 - - 256 B
PR_ToArray 10 38.40 ns NA 0.0119 - - 200 B
ToList 100 220.14 ns NA 0.0730 - - 1224 B
ToArray 100 326.60 ns NA 0.0710 - - 1192 B
PR_ToArray 100 180.50 ns NA 0.0644 - - 1080 B
ToList 1000 1,679.32 ns NA 0.5054 0.0057 - 8464 B
ToArray 1000 2,081.20 ns NA 0.5074 - - 8536 B
PR_ToArray 1000 1,384.58 ns NA 0.4978 - - 8336 B
ToList 10000 17,214.66 ns NA 7.8430 1.2817 - 131440 B
ToArray 10000 21,014.18 ns NA 6.3171 - - 106224 B
PR_ToArray 10000 13,196.72 ns NA 6.3171 - - 105872 B
ToList 100000 338,986.18 ns NA 285.6445 285.6445 285.6445 1049112 B
ToArray 100000 382,722.36 ns NA 249.5117 249.5117 249.5117 925132 B
PR_ToArray 100000 358,964.21 ns NA 249.5117 249.5117 249.5117 924780 B
ToList 1000000 4,533,538.28 ns NA 1984.3750 1984.3750 1984.3750 8389748 B
ToArray 1000000 3,067,601.56 ns NA 796.8750 796.8750 796.8750 8195588 B
PR_ToArray 1000000 2,418,546.88 ns NA 800.7813 800.7813 800.7813 8195040 B
ToList 10000000 197,990,525.00 ns NA 3875.0000 3875.0000 3875.0000 134219664 B
ToArray 10000000 229,693,275.00 ns NA 1875.0000 1875.0000 1875.0000 107110743 B
PR_ToArray 10000000 29,184,353.12 ns NA 1968.7500 1968.7500 1968.7500 107110478 B
ToList 100000000 583,235,200.00 ns NA 6000.0000 6000.0000 6000.0000 1073744896 B
ToArray 100000000 501,066,800.00 ns NA 3000.0000 3000.0000 3000.0000 936873600 B
PR_ToArray 100000000 270,801,200.00 ns NA 2000.0000 2000.0000 2000.0000 936872432 B
ToList 1000000000 4,677,241,700.00 ns NA 9000.0000 9000.0000 9000.0000 8589938744 B
ToArray 1000000000 5,229,577,600.00 ns NA 4000.0000 4000.0000 4000.0000 8294970392 B
PR_ToArray 1000000000 3,136,817,900.00 ns NA 4000.0000 4000.0000 4000.0000 8294969760 B
public class ToArrayBenchmark
{
    [Params(10, 100, 1000, 10000, 100000, 1000000, 10000000, 100000000, 1000000000)]
    public int Length { get; set; }

    [Benchmark]
    public List<int> ToList()
    {
        return GenerateNumber(Length).ToList();
    }

    [Benchmark]
    public int[] ToArray()
    {
        return GenerateNumber(Length).ToArray();
    }

    [Benchmark]
    public int[] PR_ToArray()
    {
        return EnumerableHelpers.ToArray2(GenerateNumber(Length));
    }

    public static IEnumerable<int> GenerateNumber(int count)
    {
        for (int i = 0; i < count; i++)
        {
            yield return i;
        }
    }
}

Implementation detail

In this scenario, there's a fixed maximum length for the list of arrays that should be allocated.
Starting with 4 and doubling the array size each time, we reach the maximum length after 29 arrays.
Therefore, using [InlineArray(29)], I've reserved a fixed-length space for the T[].

Option

It's also possible to use ArrayPool<T>.Shared.Rent instead of new T[].
In that case, by returning the array when calling ToArray, both copying and returning can be efficiently handled.
I wasn't sure if using ArrayPool was appropriate for this scenario, so in this PR, I've opted for creating new arrays.

Author: neuecc
Assignees: -
Labels:

area-System.Collections

Milestone: -

@@ -80,9 +80,112 @@ internal static T[] ToArray<T>(IEnumerable<T> source)
return result;
}

LargeArrayBuilder<T> builder = new();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. What would it take to augment your PR to the point where we could delete LargeArrayBuilder? That'd make me much more interested in this.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LargeArrayBuilder is used in many places, and given that it includes numerous methods specific to certain scenarios, the amount of rewrites, including from the user's side, would likely be extensive. Ideally, we'd like to replace it entirely, but we want to exclude that from the initial implementation.

Using the ArrayPool as suggested in the option might not be advisable if we replace the LargeArrayBuilder.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My concern is that these code paths are used in a bunch of places on common paths, and adding more generic types will increase code size. It'd be nice to get rid of the LargeArrayBuilder so that it's a net win for both throughput and size, rather than improving throughput at the expense of size.

There's no rush here as it's not going to make .NET 8, anyway, so there's plenty of time to make and evaluate the larger change.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you.
If this implementation is acceptable as a basic policy,
I'll take on the challenge of replacing the LargeArrayBuilder.
(Indeed, I was hoping it might make it into .NET 8! That would be great!)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this implementation is acceptable as a basic policy,

We'd support it in principle. Would you like to build on top of this PR or would you prefer to open a new one?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am willing to work on this PR.
NET 8 release seems to be coming soon, so I will do it from the branch there when it is released. ......

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

main is the correct branch. Any such change will be for .NET 9.

@neuecc
Copy link
Author

neuecc commented Aug 12, 2023

@dotnet-policy-service agree

@Windows10CE
Copy link
Contributor

Windows10CE commented Aug 12, 2023

Your benchmarks show this being faster than ToList (almost) across the board, could it use the same (or similar) optimization?


public T[] ToArray(int lastBlockCount)
{
T[] array = GC.AllocateUninitializedArray<T>(_count + lastBlockCount);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AllocateUninitializedArray will introduce additional overhead for small arrays here.

Copy link
Contributor

@reflectronic reflectronic Aug 13, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code for AllocateUninitializedArray already has a special case for small-ish arrays:

// small arrays are allocated using `new[]` as that is generally faster.
#pragma warning disable 8500 // sizeof of managed types
if (length < 2048 / sizeof(T))
#pragma warning restore 8500
{
return new T[length];
}

Given that AllocateUninitializedArray is marked as AggressiveInlining, there's likely no discernable perf impact here (modulo tuning on the threshold).

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

when length(_count + lastBlockCount) == 0, return Array.Empty<byte>().

Comment on lines +181 to +185
for (int i = 0; i < _index; i++)
{
_blocks[i].CopyTo(dest);
dest = dest.Slice(_blocks[i].Length);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
for (int i = 0; i < _index; i++)
{
_blocks[i].CopyTo(dest);
dest = dest.Slice(_blocks[i].Length);
}
ReadOnlySpan<T[]> blocks = _blocks;
foreach (T[] block in blocks.Slice(0, _index))
{
block.CopyTo(dest);
dest = dest.Slice(block.Length);
}

This should remove the bounds checks here too. Not sure whether this is the best syntax possible here.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use ref var item = ref _blocks[i].

@reflectronic
Copy link
Contributor

reflectronic commented Aug 13, 2023

Something that'd be neat to add here (though the feasibility is questionable) is a special case for small arrays (say, fewer than 32 elements) using an InlineArray of T. Something like:

struct ArrayBuilder<T>
{
    [InlineArray(32)] struct ElementsBlock { T _field; }
    [InlineArray(32)] struct ArraysBlock { T[] _field; }

    ElementsBlock _elements;
    ArraysBlock _arrays;

   // ...
}

Inside the implementation of ArrayBuilder, you'd first fill up ElementsBlock before adding any new elements to ArraysBlock. This eliminates all intermediate allocations for small arrays.

The problem with this idea is that it'll chew through the stack for large structs. This could cause stack overflows when user-defined types come into the picture, even if the entire ElementsBlock is not filled.

The solutions are not pretty. You'd have to defer the stack allocation of ElementsBlock until you're sure that its size is reasonable. But, as far as I'm aware, the only way to reliably do this is with a non-inlineable method at every callsite of ArrayBuilder, plus a very unpleasant contortion of the callsite's actual logic. Doesn't seem very appealing.

I'm curious to hear if there's a way to salvage this idea.

@eiriktsarpalis eiriktsarpalis added this to the 9.0.0 milestone Aug 13, 2023
@eiriktsarpalis eiriktsarpalis added the needs-author-action An issue or pull request that requires more info or actions from the author. label Oct 27, 2023
@ghost ghost removed the needs-author-action An issue or pull request that requires more info or actions from the author. label Oct 31, 2023
@eiriktsarpalis eiriktsarpalis added the needs-author-action An issue or pull request that requires more info or actions from the author. label Oct 31, 2023
@ghost ghost added the no-recent-activity label Nov 14, 2023
@ghost
Copy link

ghost commented Nov 14, 2023

This pull request has been automatically marked no-recent-activity because it has not had any activity for 14 days. It will be closed if no further activity occurs within 14 more days. Any new comment (by anyone, not necessarily the author) will remove no-recent-activity.

@eiriktsarpalis eiriktsarpalis self-assigned this Nov 15, 2023
@ghost ghost removed the no-recent-activity label Nov 15, 2023
@ghost ghost added the no-recent-activity label Nov 29, 2023
@ghost
Copy link

ghost commented Nov 29, 2023

This pull request has been automatically marked no-recent-activity because it has not had any activity for 14 days. It will be closed if no further activity occurs within 14 more days. Any new comment (by anyone, not necessarily the author) will remove no-recent-activity.

@ghost
Copy link

ghost commented Dec 13, 2023

This pull request will now be closed since it had been marked no-recent-activity but received no further activity in the past 14 days. It is still possible to reopen or comment on the pull request, but please note that it will be locked if it remains inactive for another 30 days.

This pull request was closed.
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-System.Collections community-contribution Indicates that the PR has been added by a community member needs-author-action An issue or pull request that requires more info or actions from the author. no-recent-activity
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants