Skip to content
This repository was archived by the owner on Jan 23, 2023. It is now read-only.

[WIP] Further improve performance of Memory<T>.Span property getter #22829

Closed

Conversation

GrabYourPitchforks
Copy link
Member

Working through further experiments with minimizing the overhead of the {ReadOnly}Memory<T>.Span property getter. Trying to minimize the number of comparisons performed, minimize the stack space required, and utilize register allocations more effectively.

Results from an x64 run are below. I didn't perform an x86 run yet.

Method Toolchain Mean Error StdDev Ratio RatioSD
GetSpan_MemOfBytesFromArray 3.0-preview3 1,955.4 us 26.541 us 23.528 us 1.00 0.00
GetSpan_MemOfBytesFromArray exp. 27423-58 1,708.6 us 12.064 us 10.074 us 0.87 0.01
GetSpan_MemOfBytesFromMemMgr 3.0-preview3 3,132.1 us 38.245 us 31.937 us 1.00 0.00
GetSpan_MemOfBytesFromMemMgr exp. 27423-58 4,304.9 us 50.077 us 46.842 us 1.38 0.02
GetSpan_MemOfBytesEmpty 3.0-preview3 1,128.9 us 7.760 us 6.879 us 1.00 0.00
GetSpan_MemOfBytesEmpty exp. 27423-58 860.6 us 5.582 us 5.222 us 0.76 0.01
GetSpan_MemOfCharsFromArray 3.0-preview3 2,364.8 us 36.351 us 32.225 us 1.00 0.00
GetSpan_MemOfCharsFromArray exp. 27423-58 1,788.2 us 32.345 us 27.010 us 0.76 0.02
GetSpan_MemOfCharsFromString 3.0-preview3 1,791.3 us 24.862 us 23.256 us 1.00 0.00
GetSpan_MemOfCharsFromString exp. 27423-58 1,642.9 us 12.501 us 11.694 us 0.92 0.01
GetSpan_MemOfCharsFromMemMgr 3.0-preview3 3,152.8 us 23.027 us 21.540 us 1.00 0.00
GetSpan_MemOfCharsFromMemMgr exp. 27423-58 4,573.2 us 64.132 us 59.989 us 1.45 0.02
GetSpan_MemOfCharsEmpty 3.0-preview3 1,427.8 us 11.145 us 8.701 us 1.00 0.00
GetSpan_MemOfCharsEmpty exp. 27423-58 923.9 us 18.260 us 16.187 us 0.65 0.01

Codegen for the x64 implementation of the ReadOnlyMemory<char>.Span property getter is also around 20% smaller: 118 bytes before, 95 bytes after.

The regression in the from MemoryManager<T> runs isn't surprising due to the extra call now required here. This was a deliberate trade-off: if MemoryManager<T> usage is expected to be rare, deprioritize it and optimize the codegen for the more common cases. Since we already made improvements in this code path between 2.1 and 3.0 (see #20386), we're still within around 5% of the 2.1 performance numbers.

Marked WIP because it's not fully implemented (haven't touched Memory<T>.Span yet) or tested.

@GrabYourPitchforks GrabYourPitchforks changed the title Further improve performance of Memory<T>.Span property getter [WIP] Further improve performance of Memory<T>.Span property getter Feb 25, 2019
@stephentoub
Copy link
Member

if MemoryManager usage is expected to be rare

I'm not convinced it is. This is the only way today to use a pointer/length with Memory, and we're already seeing developers relying on that.

It of course still might be worth a trade-off here, but we should still aim not to regress it.

// it occupies the same position as String.Length and SzArray.Length; but because
// this value is negative and those values are not, we can easily determine whether
// any given Memory<T> is backed by a variable-length object (string, array) or
// a fixed-length object (MemoryManager).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couldn't the runtime decide to rearrange the fields in this class?

Copy link
Member Author

@GrabYourPitchforks GrabYourPitchforks Feb 25, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we make it a single IntPtr field per your suggestion this should address the rearrangement concern within the base class. The runtime cannot rearrange fields between a base class and its derived types, so we're good there.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still, this padding is penalizing MemoryManager path to make the other path a bit faster. And it is pretty hacky change. I do not think this is a good tradeoff.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had believed the most common usage of Memory<T> was to be backed by String or SzArray. (In fact, I had believed this to be overwhelmingly so.) If this assumption was invalid, then yeah, we shouldn't take the bulk of this PR.

There is still one thing we can take advantage of if we wish: in 32-bit procs, String and SzArray have the exact same layout. This means we could condition the if (typeof(T) == typeof(char)) check on 64-bit only, which should give a perf boost to 32-bit procs.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had believed the most common usage of Memory was to be backed by String or SzArray

I think that's true for workloads like those in ASP.NET. But I expect that won't be true for certain apps or domains, in particular those that interop with native memory as the lingua franca. The improvements cited for the strings/arrays cases appears to generally be < 10%, whereas the regression for the MemoryManager case looks to be closer to ~40%.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The improvements cited for the strings/arrays cases appears to generally be < 10%, whereas the regression for the MemoryManager case looks to be closer to ~40%.

There are three very specific improvements this PR provides when compared to baseline: (a) improving the performance for String/SzArray, (b) reducing the codegen size by around ~20%, and (c) reducing stack spillage in the calling method due to needing to have temporary Span instances on the stack.

The trade-off for gaining (b) and (c) was to introduce the indirection in the MemoryManager code path. (And remember: even with this regression we're still within 5% of 2.1 RTM since we already made substantial improvements in this code path between 2.1 and 3.0 preview earlier.)

It is possible to make this PR so that we give up (b) and (c). This will improve perf for all three cases: String / SzArray / and MemoryManager. The trade-off is that we reduce the codegen size by only an insignificant amount (~3 - 5%), and we'd still have the dummy -1 field in MemoryManager.

If this still isn't an appropriate trade-off, that's fine. We can just close the PR - no harm. :)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose I shouldn't have said "only an insignificant amount (~3 - 5%)" above. Since this property getter is marked with aggressive inlining it's not unreasonable to think it'd be expanded multiple times in the calling method, so even small improvements would add up.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, @GrabYourPitchforks. If I'm understanding you correctly, you're saying there's a version of this change that improves the throughout for strings and arrays equivalent to this one, improves rather than regresses the throughput for MemoryManager, and still shrinks code size some albeit not as much as in this change? That sounds better to me, personally.

private readonly int _dummy = -1;
#if BIT64
private readonly int _dummyPadding = 0;
#endif
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than an ifdef with padding, could this just be a -1 IntPtr?

/// it means the caller only has to allocate space for a single Int32 on the stack.
/// </summary>
[MethodImpl(MethodImplOptions.NoInlining)]
internal ByReference<T> GetSpanAndDeconstruct(out int spanLength)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not convinced it's ok to penalize the MemoryManager case in this way.

@GrabYourPitchforks
Copy link
Member Author

GrabYourPitchforks commented Mar 25, 2019

@jkotas @stephentoub Is it worthwhile for me to experiment with a change where we can make all three scenarios faster, but at the cost of increasing the size of MemoryManager<T> by sizeof(IntPtr)? Or is increasing the size of that type so unpalatable as to make this a non-starter?

Edit: Jan, you mentioned that smuggling a fake Length field through this type was hacky and you seemed unsettled by it. Is there something that would help ease your concerns here? It's also ok for us to say "no - everything is fine as is." :)

@jkotas
Copy link
Member

jkotas commented Mar 25, 2019

I do not think that this optimization is worth looking into.

I think there needs to be a ballance between performance and understandability/maintanability of the code. The fake length field does not strike the right ballance for me.

@GrabYourPitchforks
Copy link
Member Author

Thanks both for the feedback! I'm going to go ahead and close this since the Length trick really was the primary mechanism to increase perf, and if that doesn't sit well with folks then this probably isn't worth pursuing further.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants