Description
Brief intro:
I'm developing a binary serialization library with two major requirements:
- offering high performance
- minimizing user errors by utilizing the type system as much as possible.
For requirement #1, I'm heavily using Span. I'm quite happy with the performance and how clean the internal serializer implementation is.
For requirement #2, I'm trying to prevent users from accidentally passing the wrong Span to a method.
As a solution, I'm wrapping Span in my own struct: WrappedSpan. All my methods require a WrappedSpan and won't accept a Span.
This helps prevent mistakes via compile-time errors when the user passes a Span instead of a WrappedSpan to any of the serialization methods.
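To illustrate the compile-time check, here is a minimal sketch. The WriteByte helper and the TypeSafetySketch class are hypothetical names used only for this illustration and are not part of the library; WrappedSpan is the same wrapper shown further down.

```csharp
using System;

public ref struct WrappedSpan
{
    public Span<byte> Span;
}

public static class TypeSafetySketch
{
    // Hypothetical serializer entry point: it only accepts the wrapper type.
    public static void WriteByte(ref WrappedSpan wrapper, byte value)
    {
        wrapper.Span[0] = value;
        wrapper.Span = wrapper.Span.Slice(1);
    }

    public static int Demo()
    {
        Span<byte> raw = new byte[4];
        var wrapped = new WrappedSpan { Span = raw };

        // Accepted: the argument is a WrappedSpan.
        WriteByte(ref wrapped, 0x2A);

        // Rejected by the compiler: a plain Span<byte> is not a WrappedSpan.
        // WriteByte(ref raw, 0x2A);

        // One byte was consumed, so three bytes remain in the wrapper.
        return wrapped.Span.Length;
    }
}
```

Uncommenting the second call produces a compile-time error, which is the behavior I'm after.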
The problem:
I noticed that a regular Span<byte> is about twice as fast as a Span wrapped in an otherwise-empty struct in some scenarios (many small, chained method calls).
The method using Span<byte>:
    public MethodHost_Span WriteInt32_Span(ref Span<byte> span, int value)
    {
        MemoryMarshal.Cast<byte, int>(span)[0] = value;
        span = span.Slice(sizeof(int));
        // Return of zero-size struct is needed for method chaining.
        return default(MethodHost_Span);
    }
The method using WrappedSpan:
    public MethodHost_WrappedSpan WriteInt32_WrappedSpan(ref WrappedSpan wrapper, int value)
    {
        MemoryMarshal.Cast<byte, int>(wrapper.Span)[0] = value;
        wrapper.Span = wrapper.Span.Slice(sizeof(int));
        // Return of zero-size struct is needed for method chaining.
        return default(MethodHost_WrappedSpan);
    }

    public ref struct WrappedSpan
    {
        public Span<byte> Span;
    }
Actual benchmark method is here.
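For readers who don't follow the link, here is a rough, self-contained sketch of the chained-call shape being compared. It is not the actual benchmark: the ChainingSketch class, the WriteMany_Span and WriteMany_WrappedSpan helpers, and the assumption that the write methods are instance methods on the empty host structs are all illustrative.

```csharp
using System;
using System.Runtime.InteropServices;

public ref struct WrappedSpan
{
    public Span<byte> Span;
}

// Empty host structs (assumed layout); returning them enables chaining.
public struct MethodHost_Span
{
    public MethodHost_Span WriteInt32_Span(ref Span<byte> span, int value)
    {
        MemoryMarshal.Cast<byte, int>(span)[0] = value;
        span = span.Slice(sizeof(int));
        return default(MethodHost_Span);
    }
}

public struct MethodHost_WrappedSpan
{
    public MethodHost_WrappedSpan WriteInt32_WrappedSpan(ref WrappedSpan wrapper, int value)
    {
        MemoryMarshal.Cast<byte, int>(wrapper.Span)[0] = value;
        wrapper.Span = wrapper.Span.Slice(sizeof(int));
        return default(MethodHost_WrappedSpan);
    }
}

public static class ChainingSketch
{
    // Hypothetical helper: writes three ints through the plain Span path
    // and returns the number of bytes consumed.
    public static int WriteMany_Span(byte[] buffer)
    {
        Span<byte> span = buffer;
        default(MethodHost_Span)
            .WriteInt32_Span(ref span, 1)
            .WriteInt32_Span(ref span, 2)
            .WriteInt32_Span(ref span, 3);
        return buffer.Length - span.Length;
    }

    // Hypothetical helper: the same writes through the wrapper path.
    public static int WriteMany_WrappedSpan(byte[] buffer)
    {
        var wrapper = new WrappedSpan { Span = buffer };
        default(MethodHost_WrappedSpan)
            .WriteInt32_WrappedSpan(ref wrapper, 1)
            .WriteInt32_WrappedSpan(ref wrapper, 2)
            .WriteInt32_WrappedSpan(ref wrapper, 3);
        return buffer.Length - wrapper.Span.Length;
    }

    public static void Main()
    {
        var a = new byte[16];
        var b = new byte[16];
        Console.WriteLine(WriteMany_Span(a));        // 12
        Console.WriteLine(WriteMany_WrappedSpan(b)); // 12
        // Both paths should produce identical bytes.
        Console.WriteLine(a.AsSpan().SequenceEqual(b));
    }
}
```

Both paths do the same work and produce the same bytes; the only difference is whether the Span is passed directly or through the one-field wrapper.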
More details:
- All this is happening on .NET 7.0.100-preview.3.22179.4, but I got similar results on .NET 6 and 5.
- Setting DOTNET_TieredPGO doesn't impact the performance.
- The full benchmark project can be found here.
- Outputs from Disasmo can be found here in the same repo.
- Local runtime build used by Disasmo is at commit: 3535e0769f202ae4cd820bea24afd20cee313966
- I'm on Windows 10, Version 10.0.19044 Build 19044
- Building for amd64.
BenchmarkDotNet results with different data types:
Method | Mean | Error | StdDev |
---|---|---|---|
WriteMany_Int32_Span | 1.163 us | 0.0122 us | 0.0114 us |
WriteMany_Int32_WrappedSpan | 2.169 us | 0.0194 us | 0.0172 us |
WriteMany_Single_Span | 1.085 us | 0.0119 us | 0.0106 us |
WriteMany_Single_WrappedSpan | 2.183 us | 0.0206 us | 0.0161 us |
WriteMany_Double_Span | 1.096 us | 0.0215 us | 0.0221 us |
WriteMany_Double_WrappedSpan | 2.192 us | 0.0181 us | 0.0161 us |
WriteMany_Mixed_Span | 1.134 us | 0.0127 us | 0.0119 us |
WriteMany_Mixed_WrappedSpan | 2.183 us | 0.0242 us | 0.0226 us |
Expected
I would expect the wrapping struct to have no impact on the generated machine code. In other words, I would expect WrappedSpan to perform just as fast as a regular Span.
My questions
- Considering JIT internals, is this an expected result? If so, could you share the JIT's decision-making process that results in such a perf difference?
- Are there any options/tricks that I can use to get better results with WrappedSpan?
- Would you consider improving the JIT to generate better-performing code for such scenarios?