Convert JIT_Box* to C# #115134
Conversation
Tagging subscribers to this area: @mangod9
Force-pushed from 543d2ae to ad38529
@EgorBot -amd -arm

```csharp
using System.Reflection;
using BenchmarkDotNet.Attributes;

public class Bench
{
    public readonly int AReadonlyIntField = 42;

    [Benchmark]
    public int ReflectionFieldBoxing()
    {
        FieldInfo fieldInfo = typeof(Bench).GetTypeInfo().GetDeclaredField(nameof(AReadonlyIntField));
        Bench myInstance = new();
        object current = fieldInfo.GetValue(myInstance);
        fieldInfo.SetValue(myInstance, int.MinValue);
        return (int)current + (int)fieldInfo.GetValue(myInstance);
    }
}
```
Force-pushed from aaadc26 to bf23c6d
@EgorBot -windows_intel -amd -arm

```csharp
using System.Runtime.CompilerServices;
using BenchmarkDotNet.Attributes;

public class Bench
{
    [Benchmark] public object Box8() => Box(new Buffer8());
    [Benchmark] public object Box16() => Box(new Buffer16());
    [Benchmark] public object Box32() => Box(new Buffer32());
    [Benchmark] public object Box256() => Box(new Buffer256());

    [MethodImpl(MethodImplOptions.NoInlining | MethodImplOptions.NoOptimization)]
    static object Box<T>(T value) => (object)value;
}

[InlineArray(1)] public struct Buffer8 { long _element0; }
[InlineArray(2)] public struct Buffer16 { long _element0; }
[InlineArray(4)] public struct Buffer32 { long _element0; }
[InlineArray(32)] public struct Buffer256 { long _element0; }
```
```csharp
Debug.Assert(typeMT != null);
Debug.Assert(typeMT->IsValueType);

object boxed = RuntimeTypeHandle.InternalAllocNoChecks(typeMT);
```
This is quite an expensive QCall, with a coop -> preemptive -> coop transition. We may need to fix that.
I've added allocation via a thread-local-variable FCall, as we had in the original HCall.
It helped significantly for the microbenchmark numbers. Unfortunately the performance gap is still quite steep and I would like to understand where the overhead comes from...
I was wondering if we could take some inspiration from BoxCache and its GetBoxInfo. It's not usable as-is, since it's a relatively expensive QCall, but it does some interesting things: it passes the allocator back to managed code (not necessarily useful for a single-use box), and it also passes back GetNumInstanceFieldBytes, which could avoid an extra FCall...
@EgorBot -windows_intel -amd -arm (same Box benchmark as above)
arm32 doesn't seem to be happy with the TLV commit, despite the fact that arm32 had the special alignment code path (RequiresAlign8).
Force-pushed from 10ef59b to 841c89e
Review comment on ...coreclr/System.Private.CoreLib/src/System/Runtime/CompilerServices/RuntimeHelpers.CoreCLR.cs (outdated, resolved)
```csharp
else
{
    SpanHelpers.Memmove(ref boxed.GetRawData(), ref unboxedData, typeMT->GetNumInstanceFieldBytes());
}
```
We can unroll it manually if this is the bottleneck:
Suggested change:

```csharp
else
{
    ref byte dst = ref boxed.GetRawData();
    nuint bytes = typeMT->GetNumInstanceFieldBytes();
    switch (bytes)
    {
        case 1:
            Unsafe.As<byte, byte>(ref dst) = Unsafe.As<byte, byte>(ref unboxedData);
            break;
        case 2:
            Unsafe.As<byte, ushort>(ref dst) = Unsafe.As<byte, ushort>(ref unboxedData);
            break;
        case 4:
            Unsafe.As<byte, uint>(ref dst) = Unsafe.As<byte, uint>(ref unboxedData);
            break;
        case 8:
            Unsafe.As<byte, ulong>(ref dst) = Unsafe.As<byte, ulong>(ref unboxedData);
            break;
        default:
            SpanHelpers.Memmove(ref dst, ref unboxedData, bytes);
            break;
    }
}
```
Applied it for testing: EgorBot/runtime-utils#348 (comment)
@filipnavara, it showed a slight improvement, but still this is not it. 🥲
Is InternalAllocNoChecks getting inlined? You can try marking it no-inline.
It shows some interesting data 👍 I was mainly wondering whether the overhead is in the allocation or the copying. The code paths for small sizes improved to the point that 2/3 of the platforms are nearly on par so that seems to point to the copying being the primary bottleneck.
Force-pushed from 0b8d8f9 to 0f4cd04
Two review comments on src/coreclr/System.Private.CoreLib/src/System/Runtime/CompilerServices/CastHelpers.cs (outdated, resolved)
Force-pushed from 0f4cd04 to 7468c56
```csharp
            Unsafe.As<byte, ulong>(ref dst) = Unsafe.As<byte, ulong>(ref unboxedData);
            break;
        default:
            SpanHelpers.Memmove(ref dst, ref unboxedData, bytes);
```
Out of curiosity, did you try a version with Unsafe.CopyBlock? I know they are both intrinsics, but SpanHelpers.Memmove may get special treatment just for constant sizes (+ PGO expansion for constant sizes, possibly), while CopyBlock is expanded somewhat differently. I'm asking mainly because the NativeAOT version of the helper uses Unsafe.CopyBlock.
https://godbolt.org/z/7dzo5hdYY — M3 calls CORINFO_HELP_MEMCPY, which is METHOD__SPAN_HELPERS__MEMCOPY. The HCall was calling this:

runtime/src/coreclr/vm/object.cpp, lines 351 to 393 at 7468c56:
```cpp
void STDCALL CopyValueClassUnchecked(void* dest, void* src, MethodTable *pMT)
{
    STATIC_CONTRACT_NOTHROW;
    STATIC_CONTRACT_GC_NOTRIGGER;
    STATIC_CONTRACT_FORBID_FAULT;
    STATIC_CONTRACT_MODE_COOPERATIVE;

    _ASSERTE(!pMT->IsArray());  // bunch of assumptions about arrays wrong.

    if (pMT->ContainsGCPointers())
    {
        memmoveGCRefs(dest, src, pMT->GetNumInstanceFieldBytesIfContainsGCPointers());
    }
    else
    {
        DWORD numInstanceFieldBytes = pMT->GetNumInstanceFieldBytes();

        switch (numInstanceFieldBytes)
        {
        case 1:
            *(UINT8*)dest = *(UINT8*)src;
            break;
#ifndef ALIGN_ACCESS
        // we can hit an alignment fault if the value type has multiple
        // smaller fields. Example: if there are two I4 fields, the
        // value class can be aligned to 4-byte boundaries, yet the
        // NumInstanceFieldBytes is 8
        case 2:
            *(UINT16*)dest = *(UINT16*)src;
            break;
        case 4:
            *(UINT32*)dest = *(UINT32*)src;
            break;
        case 8:
            *(UINT64*)dest = *(UINT64*)src;
            break;
#endif // !ALIGN_ACCESS
        default:
            memcpyNoGCRefs(dest, src, numInstanceFieldBytes);
            break;
        }
    }
}
```
runtime/src/coreclr/vm/qcall.cpp, lines 45 to 54 at 7468c56:

```cpp
void QCall::ObjectHandleOnStack::SetByteArray(const BYTE * p, COUNT_T length)
{
    STANDARD_VM_CONTRACT;

    GCX_COOP();
    BASEARRAYREF arr = (BASEARRAYREF) AllocatePrimitiveArray(ELEMENT_TYPE_U1, length);
    memcpyNoGCRefs(arr->GetDataPtr(), p, length * sizeof(BYTE));
    Set(arr);
}
```
I do not think the memcpy is where the perf loss is. I think the perf loss is due to the fact that there are two layers of calls with this change, where there was just one layer before.

I would not worry about matching the perf. The JIT will inline the copy in optimized code, and Tier 0 getting a bit slower is not a big deal.
> I would not worry about matching the perf.

While I agree in principle, I'd still like to understand why ARM64 has a 25% regression across the board while x64 is pretty okay except for the SpanHelpers.Memmove branch. I can JITDump it myself later; I just wanted to rule out any obvious culprits first and not waste time on something that was already an explored dead end.
Seems like on x64 the remaining overhead is mainly the non-inlined GetNumInstanceFieldBytes (since it's an FCall). That accounts for all the (amortized) overhead on my AMD Ryzen 9950X machine.
Yeah, you can check the flamegraph/asm here: EgorBot/runtime-utils#348 (comment). Seems like the PR has quite a few JIT_InitPinvokeFrame samples.
Seems like the inlining mentioned in #115134 (comment). Note that on x64 the inlining is not deemed profitable, so it doesn't happen.