Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

System.Linq Orderby performance optimization #46414

Closed
wants to merge 1 commit into from

Conversation

johnthcall
Copy link
Contributor

When performing OrderBy an int[] map is used to track the order of the items in the buffer. In the cases when there isn't a secondary sort however the items in the buffer can be sorted in place using ArraySortHelper which takes a Span of keys and a separate Span of items. In GetEnumerator and ToList this removes the need for the int[] map allocation and in ToArray it reuses the buffer array instead of allocating a new one.
The two open issues are:

  1. Is there a better way to decide when to apply this optimization.
  2. Whether classes should be allowed, for lower length it is slightly slower but has significantly lower memory allocation for ToArray
    https://github.com/dotnet/performance/compare/master...johnthcall:OrderBySingleSort?expand=1
Method Toolchain N Mean Error Ratio Allocated
StringToList pr 10 2,558.9 ns 135.43 ns 1.00 488 B
StringToList upstream 10 2,524.6 ns 144.37 ns 0.99 616 B
StringToArray pr 10 2,503.1 ns 102.37 ns 1.00 352 B
StringToArray upstream 10 2,516.3 ns 100.05 ns 1.01 584 B
LongToList pr 10 244.6 ns 11.17 ns 1.00 456 B
LongToList upstream 10 381.1 ns 22.69 ns 1.56 584 B
LongToArray pr 10 185.5 ns 8.24 ns 1.00 320 B
LongToArray upstream 10 368.7 ns 16.02 ns 1.99 552 B
ByteToList pr 10 208.5 ns 7.97 ns 1.00 264 B
ByteToList upstream 10 281.9 ns 14.82 ns 1.35 392 B
ByteToArray pr 10 162.1 ns 10.05 ns 1.00 192 B
ByteToArray upstream 10 265.9 ns 12.39 ns 1.65 360 B
StringToList pr 200 93,643.9 ns 3,403.92 ns 1.00 5048 B
StringToList upstream 200 97,151.2 ns 4,114.54 ns 1.04 5936 B
StringToArray pr 200 94,181.4 ns 4,961.81 ns 1.00 3392 B
StringToArray upstream 200 94,923.1 ns 3,305.63 ns 1.02 5904 B
LongToList pr 200 2,787.4 ns 121.50 ns 1.00 5016 B
LongToList upstream 200 13,183.1 ns 494.75 ns 4.76 5904 B
LongToArray pr 200 2,635.8 ns 90.17 ns 1.00 3360 B
LongToArray upstream 200 13,197.0 ns 598.67 ns 5.01 5872 B
ByteToList pr 200 2,565.9 ns 116.72 ns 1.00 816 B
ByteToList upstream 200 12,553.4 ns 538.90 ns 4.90 1704 B
ByteToArray pr 200 2,501.8 ns 158.85 ns 1.00 560 B
ByteToArray upstream 200 12,389.3 ns 617.73 ns 4.97 1672 B
StringToList pr 3000 2,445,032.7 ns 95,625.27 ns 1.00 72249 B
StringToList upstream 3000 2,445,737.2 ns 118,153.78 ns 1.00 84337 B
StringToArray pr 3000 2,435,115.9 ns 128,939.81 ns 1.00 48193 B
StringToArray upstream 3000 2,482,575.2 ns 101,295.32 ns 1.03 84305 B
LongToList pr 3000 147,456.0 ns 6,842.61 ns 1.00 72216 B
LongToList upstream 3000 347,817.5 ns 18,424.76 ns 2.36 84304 B
LongToArray pr 3000 142,437.5 ns 5,539.96 ns 1.00 48160 B
LongToArray upstream 3000 358,586.4 ns 17,107.91 ns 2.53 84272 B
ByteToList pr 3000 125,136.6 ns 6,217.83 ns 1.00 9216 B
ByteToList upstream 3000 366,918.6 ns 17,157.38 ns 2.93 21304 B
ByteToArray pr 3000 121,837.0 ns 3,701.15 ns 1.00 6160 B
ByteToArray upstream 3000 357,398.5 ns 13,957.32 ns 2.94 21272 B

@ghost
Copy link

ghost commented Dec 27, 2020

Tagging subscribers to this area: @eiriktsarpalis
See info in area-owners.md if you want to be subscribed.

Issue Details

When performing OrderBy an int[] map is used to track the order of the items in the buffer. In the cases when there isn't a secondary sort however the items in the buffer can be sorted in place using ArraySortHelper which takes a Span of keys and a separate Span of items. In GetEnumerator and ToList this removes the need for the int[] map allocation and in ToArray it reuses the buffer array instead of allocating a new one.
The two open issues are:

  1. Is there a better way to decide when to apply this optimization.
  2. Whether classes should be allowed, for lower length it is slightly slower but has significantly lower memory allocation for ToArray
    https://github.com/dotnet/performance/compare/master...johnthcall:OrderBySingleSort?expand=1
Method Toolchain N Mean Error Ratio Allocated
StringToList pr 10 2,558.9 ns 135.43 ns 1.00 488 B
StringToList upstream 10 2,524.6 ns 144.37 ns 0.99 616 B
StringToArray pr 10 2,503.1 ns 102.37 ns 1.00 352 B
StringToArray upstream 10 2,516.3 ns 100.05 ns 1.01 584 B
LongToList pr 10 244.6 ns 11.17 ns 1.00 456 B
LongToList upstream 10 381.1 ns 22.69 ns 1.56 584 B
LongToArray pr 10 185.5 ns 8.24 ns 1.00 320 B
LongToArray upstream 10 368.7 ns 16.02 ns 1.99 552 B
ByteToList pr 10 208.5 ns 7.97 ns 1.00 264 B
ByteToList upstream 10 281.9 ns 14.82 ns 1.35 392 B
ByteToArray pr 10 162.1 ns 10.05 ns 1.00 192 B
ByteToArray upstream 10 265.9 ns 12.39 ns 1.65 360 B
StringToList pr 200 93,643.9 ns 3,403.92 ns 1.00 5048 B
StringToList upstream 200 97,151.2 ns 4,114.54 ns 1.04 5936 B
StringToArray pr 200 94,181.4 ns 4,961.81 ns 1.00 3392 B
StringToArray upstream 200 94,923.1 ns 3,305.63 ns 1.02 5904 B
LongToList pr 200 2,787.4 ns 121.50 ns 1.00 5016 B
LongToList upstream 200 13,183.1 ns 494.75 ns 4.76 5904 B
LongToArray pr 200 2,635.8 ns 90.17 ns 1.00 3360 B
LongToArray upstream 200 13,197.0 ns 598.67 ns 5.01 5872 B
ByteToList pr 200 2,565.9 ns 116.72 ns 1.00 816 B
ByteToList upstream 200 12,553.4 ns 538.90 ns 4.90 1704 B
ByteToArray pr 200 2,501.8 ns 158.85 ns 1.00 560 B
ByteToArray upstream 200 12,389.3 ns 617.73 ns 4.97 1672 B
StringToList pr 3000 2,445,032.7 ns 95,625.27 ns 1.00 72249 B
StringToList upstream 3000 2,445,737.2 ns 118,153.78 ns 1.00 84337 B
StringToArray pr 3000 2,435,115.9 ns 128,939.81 ns 1.00 48193 B
StringToArray upstream 3000 2,482,575.2 ns 101,295.32 ns 1.03 84305 B
LongToList pr 3000 147,456.0 ns 6,842.61 ns 1.00 72216 B
LongToList upstream 3000 347,817.5 ns 18,424.76 ns 2.36 84304 B
LongToArray pr 3000 142,437.5 ns 5,539.96 ns 1.00 48160 B
LongToArray upstream 3000 358,586.4 ns 17,107.91 ns 2.53 84272 B
ByteToList pr 3000 125,136.6 ns 6,217.83 ns 1.00 9216 B
ByteToList upstream 3000 366,918.6 ns 17,157.38 ns 2.93 21304 B
ByteToArray pr 3000 121,837.0 ns 3,701.15 ns 1.00 6160 B
ByteToArray upstream 3000 357,398.5 ns 13,957.32 ns 2.94 21272 B
Author: johnthcall
Assignees: -
Labels:

area-System.Linq

Milestone: -

@stephentoub
Copy link
Member

Do you have numbers for the case where there is a secondary sort? e.g. a small array with a trivial primary and secondary sort? In that case we're paying for the virtual dispatch and reflection checks but not getting anything for them.

@johnthcall
Copy link
Contributor Author

@stephentoub
For 10 strings containing random int's
StringToList is Strings.OrderBy(s => s).ThenBy(s => s).ToList();
StringToArray is Strings.OrderBy(s => s).ThenBy(s => s).ToArray();

Method Toolchain N Mean Error Ratio Allocated
StringToList new 10 2.880 us 0.2544 us 1.00 864 B
StringToList upstream 10 2.801 us 0.2216 us 0.99 864 B
StringToArray new 10 2.913 us 0.2185 us 1.00 832 B
StringToArray upstream 10 2.934 us 0.2236 us 1.01 832 B

@eiriktsarpalis eiriktsarpalis self-assigned this Feb 8, 2021
Base automatically changed from master to main March 1, 2021 09:07
@johnthcall
Copy link
Contributor Author

@eiriktsarpalis Is anything else needed here?

Copy link
Member

@eiriktsarpalis eiriktsarpalis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some additional feedback:

  • I'd be interested to see benchmark results when the elements are large-ish structs.
  • There were build failures but I couldn't see them due to the age of the build. Would it be possible to rebase your changes over the latest main?

return false;
}
Type elementType = typeof(TElement);
return elementType.IsValueType || (elementType.IsClass && !elementType.IsArray);
Copy link
Member

@eiriktsarpalis eiriktsarpalis Mar 8, 2021

@@ -158,6 +173,16 @@ internal override CachingComparer<TElement> GetComparer(CachingComparer<TElement
: new CachingComparerWithChild<TElement, TKey>(_keySelector, _comparer, _descending, childComparer);
return _parent != null ? _parent.GetComparer(cmp) : cmp;
}

protected override bool CanInPlaceSort()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would be a good place to write a comment detailing the motivations for this optimization.

@@ -16,6 +16,8 @@ internal abstract partial class OrderedEnumerable<TElement> : IOrderedEnumerable

private int[] SortedMap(Buffer<TElement> buffer) => GetEnumerableSorter().Sort(buffer._items, buffer._count);

private void SortElements(Buffer<TElement> buffer) => GetEnumerableSorter().SortElements(buffer._items, buffer._count);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would use a name that emphasizes that this is an in-place sort, i.e. buffer is going to get mutated.

@@ -343,6 +376,9 @@ internal override int CompareAnyKeys(int index1, int index2)
protected override void QuickSort(int[] keys, int lo, int hi) =>
new Span<int>(keys, lo, hi - lo + 1).Sort(CompareAnyKeys);

protected override void QuickSortElements(TElement[] element, int lo, int hi) =>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similarly, I would call this something like QuickSortInPlace or something to that effect.

@@ -343,6 +376,9 @@ internal override int CompareAnyKeys(int index1, int index2)
protected override void QuickSort(int[] keys, int lo, int hi) =>
new Span<int>(keys, lo, hi - lo + 1).Sort(CompareAnyKeys);

protected override void QuickSortElements(TElement[] element, int lo, int hi) =>
new Span<TKey>(_keys, lo, hi - lo + 1).Sort(new Span<TElement>(element, lo, hi-lo +1), _comparer);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This does not seem to be taking descending sorts into account.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the catch, I tried using the overload taking Comparison<TKey> and made a function like CompareAnyKeys. This is causing a huge perf degredation, it appears that it is due to the additional method call to the Comparable<TKey> created in Sort and and the call to new new Comparison<TKey>. Since Sort doesn't support descending the only way to maintain this Perf is to restrict this to only when doing ascending which I think is a bad idea. If you don't have any further suggestion I'll abandon this PR.

Method Job Toolchain N Mean Error StdDev Median Min Max Gen 0 Gen 1 Gen 2 Allocated
LongToList Job-KMOQSK IComparison<TKey> 200 15,966.5 ns 736.64 ns 788.20 ns 15,986.1 ns 14,693.0 ns 17,493.0 ns 1.1421 - - 5104 B
LongToList Job-BJFYTT upstream 200 15,546.1 ns 773.05 ns 859.24 ns 15,410.0 ns 14,292.5 ns 17,462.7 ns 1.3152 - - 5904 B
LongToList Job-BCHDBG Comparable<TKey> 200 3,087.8 ns 177.38 ns 189.79 ns 3,065.8 ns 2,846.0 ns 3,512.2 ns 1.1559 - - 5016 B

@stephentoub
Copy link
Member

If you don't have any further suggestion I'll abandon this PR.

Thanks for all your efforts to try to improve things here, @johnthcall. Given the data, let's go ahead and close this.

@ghost ghost locked as resolved and limited conversation to collaborators Apr 15, 2021
@karelz karelz added this to the 6.0.0 milestone May 20, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants