Convert JIT_Box* to C# #115134

Open · wants to merge 2 commits into base: main

Conversation

am11
Member

@am11 am11 commented Apr 28, 2025

No description provided.

@dotnet-policy-service dotnet-policy-service bot added the community-contribution Indicates that the PR has been added by a community member label Apr 28, 2025
Contributor

Tagging subscribers to this area: @mangod9
See info in area-owners.md if you want to be subscribed.

@am11
Member Author

am11 commented Apr 29, 2025

@EgorBot -amd -arm

using System.Reflection;
using BenchmarkDotNet.Attributes;

public class Bench
{
    public readonly int AReadonlyIntField = 42;

    [Benchmark]
    public int ReflectionFieldBoxing()
    {
        FieldInfo fieldInfo = typeof(Bench).GetTypeInfo().GetDeclaredField(nameof(AReadonlyIntField));
        Bench myInstance = new();

        object current = fieldInfo.GetValue(myInstance);

        fieldInfo.SetValue(myInstance, int.MinValue);

        return (int)current + (int)fieldInfo.GetValue(myInstance);
    }
}

@am11
Member Author

am11 commented Apr 29, 2025

@jkotas, @EgorBo, WDYT? We can get rid of this HMF and later remove the managed helpers when the JIT starts inlining boxes (in Tier 0, as it already does in Tier 1).

@am11 am11 marked this pull request as ready for review April 29, 2025 08:45
@am11 am11 force-pushed the feature/HMF-removal/JIT_Box branch from aaadc26 to bf23c6d Compare April 29, 2025 11:19
@EgorBo
Member

EgorBo commented Apr 29, 2025

@EgorBot -windows_intel -amd -arm

using System.Runtime.CompilerServices;
using BenchmarkDotNet.Attributes;

public class Bench
{
    [Benchmark] public object Box8() => Box(new Buffer8());
    [Benchmark] public object Box16() => Box(new Buffer16());
    [Benchmark] public object Box32() => Box(new Buffer32());
    [Benchmark] public object Box256() => Box(new Buffer256());


    [MethodImpl(MethodImplOptions.NoInlining | 
                MethodImplOptions.NoOptimization)]
    static object Box<T>(T value) => (object)value;
}

[InlineArray(1)] public struct Buffer8 {
    long _element0;
}
[InlineArray(2)] public struct Buffer16 {
    long _element0;
}
[InlineArray(4)] public struct Buffer32 {
    long _element0;
}
[InlineArray(32)] public struct Buffer256 {
    long _element0;
}

Debug.Assert(typeMT != null);
Debug.Assert(typeMT->IsValueType);

object boxed = RuntimeTypeHandle.InternalAllocNoChecks(typeMT);
Member

@filipnavara filipnavara Apr 29, 2025


This is quite an expensive QCall with coop -> pre-emptive -> coop transition. We may need to fix that.

Member Author


I've added allocation via thread local variable FCall, as we had in the original HCall.

Member

@filipnavara filipnavara Apr 29, 2025


It helped significantly for the microbenchmark numbers. Unfortunately the performance gap is still quite steep and I would like to understand where the overhead comes from...

Member


I was wondering if we could perhaps take some inspiration from BoxCache and its GetBoxInfo. It's not usable as-is since it's a relatively expensive QCall, but it does some interesting things, like passing the allocator back to managed code (not necessarily useful for a single-use box) and also passing back GetNumInstanceFieldBytes, which could avoid an extra FCall...

@am11
Member Author

am11 commented Apr 29, 2025

@EgorBot -windows_intel -amd -arm

using System.Runtime.CompilerServices;
using BenchmarkDotNet.Attributes;

public class Bench
{
    [Benchmark] public object Box8() => Box(new Buffer8());
    [Benchmark] public object Box16() => Box(new Buffer16());
    [Benchmark] public object Box32() => Box(new Buffer32());
    [Benchmark] public object Box256() => Box(new Buffer256());


    [MethodImpl(MethodImplOptions.NoInlining | 
                MethodImplOptions.NoOptimization)]
    static object Box<T>(T value) => (object)value;
}

[InlineArray(1)] public struct Buffer8 {
    long _element0;
}
[InlineArray(2)] public struct Buffer16 {
    long _element0;
}
[InlineArray(4)] public struct Buffer32 {
    long _element0;
}
[InlineArray(32)] public struct Buffer256 {
    long _element0;
}

@am11
Member Author

am11 commented Apr 29, 2025

arm32 doesn't seem to be happy with the TLV commit, even though src/coreclr/vm/arm/stubs.cpp was using JIT_Box_MP_FastPortable. 🤔

@filipnavara
Member

filipnavara commented Apr 29, 2025

arm32 doesn't seem to be happy with the TLV commit

arm32 had the special alignment code path (RequiresAlign8)

@am11 am11 force-pushed the feature/HMF-removal/JIT_Box branch from 10ef59b to 841c89e Compare April 29, 2025 17:52
Comment on lines 568 to 574

    else
    {
        SpanHelpers.Memmove(ref boxed.GetRawData(), ref unboxedData, typeMT->GetNumInstanceFieldBytes());
    }
Member Author

@am11 am11 Apr 29, 2025


We can unroll it manually if this is the bottleneck

Suggested change

    else
    {
        SpanHelpers.Memmove(ref boxed.GetRawData(), ref unboxedData, typeMT->GetNumInstanceFieldBytes());
    }

    else
    {
        ref byte dst = ref boxed.GetRawData();
        nuint bytes = typeMT->GetNumInstanceFieldBytes();
        switch (bytes)
        {
            case 1:
                Unsafe.As<byte, byte>(ref dst) = Unsafe.As<byte, byte>(ref unboxedData);
                break;
            case 2:
                Unsafe.As<byte, ushort>(ref dst) = Unsafe.As<byte, ushort>(ref unboxedData);
                break;
            case 4:
                Unsafe.As<byte, uint>(ref dst) = Unsafe.As<byte, uint>(ref unboxedData);
                break;
            case 8:
                Unsafe.As<byte, ulong>(ref dst) = Unsafe.As<byte, ulong>(ref unboxedData);
                break;
            default:
                SpanHelpers.Memmove(ref dst, ref unboxedData, bytes);
                break;
        }
    }

Member Author


Applied it for testing: EgorBot/runtime-utils#348 (comment)

Member Author


@filipnavara, it showed a slight improvement, but still this is not it. 🥲

Member


Is InternalAllocNoChecks getting inlined? You can try to mark it no-inline

Member


It shows some interesting data 👍 I was mainly wondering whether the overhead is in the allocation or the copying. The code paths for small sizes improved to the point that 2/3 of the platforms are nearly on par so that seems to point to the copying being the primary bottleneck.

@am11 am11 force-pushed the feature/HMF-removal/JIT_Box branch from 0b8d8f9 to 0f4cd04 Compare April 29, 2025 18:09
@am11 am11 force-pushed the feature/HMF-removal/JIT_Box branch from 0f4cd04 to 7468c56 Compare April 29, 2025 18:49
        Unsafe.As<byte, ulong>(ref dst) = Unsafe.As<byte, ulong>(ref unboxedData);
        break;
    default:
        SpanHelpers.Memmove(ref dst, ref unboxedData, bytes);
Member

@filipnavara filipnavara Apr 29, 2025


Out of curiosity, did you try a version with Unsafe.CopyBlock? I know they are both intrinsics but SpanHelpers.Memmove may get special treatment just for constant sizes (+ PGO expansion for constant sizes, possibly) while CopyBlock is expanded somewhat differently.

I'm asking mainly because NativeAOT version of the helper uses Unsafe.CopyBlock.

Member Author


Per https://godbolt.org/z/7dzo5hdYY, M3 calls CORINFO_HELP_MEMCPY, which is METHOD__SPAN_HELPERS__MEMCOPY. The HCall was calling this:

void STDCALL CopyValueClassUnchecked(void* dest, void* src, MethodTable *pMT)
{
    STATIC_CONTRACT_NOTHROW;
    STATIC_CONTRACT_GC_NOTRIGGER;
    STATIC_CONTRACT_FORBID_FAULT;
    STATIC_CONTRACT_MODE_COOPERATIVE;

    _ASSERTE(!pMT->IsArray());  // bunch of assumptions about arrays wrong.

    if (pMT->ContainsGCPointers())
    {
        memmoveGCRefs(dest, src, pMT->GetNumInstanceFieldBytesIfContainsGCPointers());
    }
    else
    {
        DWORD numInstanceFieldBytes = pMT->GetNumInstanceFieldBytes();
        switch (numInstanceFieldBytes)
        {
        case 1:
            *(UINT8*)dest = *(UINT8*)src;
            break;
#ifndef ALIGN_ACCESS
        // we can hit an alignment fault if the value type has multiple
        // smaller fields. Example: if there are two I4 fields, the
        // value class can be aligned to 4-byte boundaries, yet the
        // NumInstanceFieldBytes is 8
        case 2:
            *(UINT16*)dest = *(UINT16*)src;
            break;
        case 4:
            *(UINT32*)dest = *(UINT32*)src;
            break;
        case 8:
            *(UINT64*)dest = *(UINT64*)src;
            break;
#endif // !ALIGN_ACCESS
        default:
            memcpyNoGCRefs(dest, src, numInstanceFieldBytes);
            break;
        }
    }
}
which uses memcpyNoGCRefs -> memcpy in the end, which is optimized by the native compiler. We could use something like:

void QCall::ObjectHandleOnStack::SetByteArray(const BYTE * p, COUNT_T length)
{
    STANDARD_VM_CONTRACT;

    GCX_COOP();
    BASEARRAYREF arr = (BASEARRAYREF) AllocatePrimitiveArray(ELEMENT_TYPE_U1, length);
    memcpyNoGCRefs(arr->GetDataPtr(), p, length * sizeof(BYTE));
    Set(arr);
}

with an FCall to use the same memcpy.

Member


I do not think the memcpy is where the perf loss is. I think that the perf loss is due to the fact that there are two layers of calls with this change, and there was just one layer before this change.

I would not worry about matching the perf. The JIT will inline the copy in optimized code, and Tier 0 getting a bit slower is not a big deal.

Member


I would not worry about matching the perf.

While I agree in principle, I'd still like to understand why ARM64 has a 25% regression across the board while x64 is pretty okay except for the SpanHelpers.Memmove branch. I can JITDump it myself later; I just wanted to rule out any obvious culprits first and not waste time on something that was already an explored dead-end.

Member

@filipnavara filipnavara Apr 30, 2025


Seems like on x64 the remaining overhead is mainly the non-inlined GetNumInstanceFieldBytes (since it's an FCall). That accounts for all the (amortized) overhead on my AMD Ryzen 9950x machine.

Member


Yeah, you can check flamegraph/asm here EgorBot/runtime-utils#348 (comment)

seems like the PR has quite a few JIT_InitPinvokeFrame samples

Member


Seems like the inlining mentioned in #115134 (comment). Note that on x64 the inlining is not deemed profitable so it doesn't happen.

Labels
area-VM-coreclr community-contribution Indicates that the PR has been added by a community member
4 participants