Skip to content

Move JIT_InitPInvokeFrame implementation to assembler on x86 and ARM32 #113791

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Mar 23, 2025
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
6 changes: 1 addition & 5 deletions src/coreclr/inc/jithelpers.h
Original file line number Diff line number Diff line change
Expand Up @@ -30,7 +30,7 @@
#endif

// pfnHelper is set to NULL if it is a stubbed helper.
// It will be set in InitJITHelpers2
// It will be set in InitJITHelpers1

JITHELPER(CORINFO_HELP_UNDEF, NULL, METHOD__NIL)

Expand Down Expand Up @@ -214,11 +214,7 @@

JITHELPER(CORINFO_HELP_GETCURRENTMANAGEDTHREADID, JIT_GetCurrentManagedThreadId, METHOD__NIL)

#ifdef TARGET_64BIT
JITHELPER(CORINFO_HELP_INIT_PINVOKE_FRAME, JIT_InitPInvokeFrame, METHOD__NIL)
#else
DYNAMICJITHELPER(CORINFO_HELP_INIT_PINVOKE_FRAME, NULL, METHOD__NIL)
#endif

DYNAMICJITHELPER(CORINFO_HELP_MEMSET, NULL, METHOD__SPAN_HELPERS__MEMSET)
DYNAMICJITHELPER(CORINFO_HELP_MEMZERO, NULL, METHOD__SPAN_HELPERS__MEMZERO)
Expand Down
5 changes: 0 additions & 5 deletions src/coreclr/vm/arm/cgencpu.h
Original file line number Diff line number Diff line change
Expand Up @@ -52,9 +52,6 @@ struct ArgLocDesc;

extern PCODE GetPreStubEntryPoint();

// CPU-dependent functions
Stub * GenerateInitPInvokeFrameHelper();

EXTERN_C void checkStack(void);

#define THUMB_CODE 1
Expand Down Expand Up @@ -555,8 +552,6 @@ class StubLinkerCPU : public StubLinker
ThumbEmitJumpRegister(thumbRegLr);
}

void ThumbEmitGetThread(ThumbReg dest);

void ThumbEmitNop()
{
// nop
Expand Down
36 changes: 36 additions & 0 deletions src/coreclr/vm/arm/pinvokestubs.S
Original file line number Diff line number Diff line change
Expand Up @@ -154,6 +154,42 @@ LOCAL_LABEL(RarePath):

LEAF_END JIT_PInvokeEnd, _TEXT

// ------------------------------------------------------------------
// IN:
// InlinedCallFrame (r4) = pointer to the InlinedCallFrame data
// OUT:
// Thread (r5) = pointer to Thread
//
//
LEAF_ENTRY JIT_InitPInvokeFrame, _TEXT
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it actually help perf to have this method in assembly on Linux arm32? I would expect that the C/C++ version is going to have better perf thanks to better optimized TLS access that will outweigh the few extra instructions in JITed code to deal with the standard calling convention.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dtto for Linux x86

Copy link
Member Author

@filipnavara filipnavara Mar 22, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They use custom calling convention, unfortunately, and I didn't want to modify JIT to account for that. On x86 this generated equivalent code to the old stub. On arm32 I moved some register saves to prolog/epilog because it was easier and more efficient.

Moreover, the C version currently behaves differently in regards to who pushes the frame so that would need to be unified too.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(We can optimize the TLS access; the JIT_PInvokeBegin helper used on R2R would benefit from that too; and it's very similar code also written in assembly.)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They use custom calling convention, unfortunately

My guess is that it is a left-over from fragile NGen where we were happy to do a lot of complicated things to save a few bytes of binary size here and there.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the asm implementation is ok for win-x86 since win-x86 is weird in many different ways, but I would switch to regular C++ for everything else.

Copy link
Member Author

@filipnavara filipnavara Mar 22, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be a much larger change to do that. I'm not necessarily opposed to that in principle, but I went with a change that is less disruptive for a few reasons. Namely, there's already assembler code for this in the R2R PInvoke helpers, so this puts similar code in a similar place which makes any changes to it easier to keep in sync. It removes some of the odd cases for x86/ARM32 even if not all. It produces slightly more optimal code on ARM32 at no expense to readability. It removes unnecessary startup cost (albeit a tiny one) and one place with dynamic code generation.

If we were to revisit this with JIT changes then I would prefer to address more things. Currently we have disparity with usage of USE_PER_FRAME_PINVOKE_INIT on various platforms. There's also the fact that JIT_PInvokeBegin and JIT_InitPInvokeFrame helpers do almost the same thing, except in the later case the JIT has to know more of the internal layout of InlinedCallFrame. I tried to switch the JIT to use the R2R P/Invoke helpers for all P/Invokes but that resulted in triggering hidden JIT bug (fix submitted in PR) and it painfully made me aware of the fact that JIT_PInvokeBegin/JIT_PInvokeEnd is not aware of SuppressGCTransition. Compared to all these open questions I consider the custom calling convention just a small problem.

(Lastly, the reason I run into this in the first place is that in my x86/new-EH experiment I need to errect a SEH frame on P/Invoke transitions. That's difficult to address with the current state of the code. At very least this removes the need to write a custom code generator for that.)

Copy link
Member Author

@filipnavara filipnavara Mar 23, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On second look it seems that going with JIT_PInvokeBegin/JIT_PInvokeEnd for all P/Invokes may be a viable alternative for a long term solution (unless it causes some unforeseen regressions; possibly restricted to 32-bit platforms if there are reasons to keep the frame optimizations in 64-bit JIT). There's still the issue with the behavior not being the same with SuppressGCTransition, but it seems that the GC starvation issues I have seen are unrelated (caused by the recent JIT_PollGC rewrite, issue #113807).

That said, I would still like to get this in as an improvement over status quo that is unlikely to cause any regressions.


PROLOG_PUSH "{r0-r4, lr}"

bl C_FUNC(GetThreadHelper)
mov r5, r0

// set first slot to the value of InlinedCallFrame identifier (checked by runtime code)
mov r6, #FRAMETYPE_InlinedCallFrame
str r6, [r4]

// pFrame->m_Next = pThread->m_pFrame;
ldr r6, [r5, #Thread_m_pFrame]
str r6, [r4, #Frame__m_Next]

str r11, [r4, #InlinedCallFrame__m_pCalleeSavedFP]
str r9, [r4, #InlinedCallFrame__m_pSPAfterProlog]
mov r6, 0
str r6, [r4, #InlinedCallFrame__m_pCallerReturnAddress]
add r6, sp, 24
str r6, [r4, #InlinedCallFrame__m_pCallSiteSP]

// pThread->m_pFrame = pFrame;
str r4, [r5, #Thread_m_pFrame]

EPILOG_POP "{r0-r4, pc}"

LEAF_END JIT_InitPInvokeFrame, _TEXT

// ------------------------------------------------------------------
// VarargPInvokeStub & VarargPInvokeGenILStub
// There is a separate stub when the method has a hidden return buffer arg.
Expand Down
82 changes: 0 additions & 82 deletions src/coreclr/vm/arm/stubs.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -1105,88 +1105,6 @@ void ResolveHolder::Initialize(ResolveHolder* pResolveHolderRX,
_ASSERTE(patcherTarget == (PCODE)NULL);
}

Stub *GenerateInitPInvokeFrameHelper()
{
CONTRACT(Stub*)
{
THROWS;
GC_NOTRIGGER;
MODE_ANY;

POSTCONDITION(CheckPointer(RETVAL));
}
CONTRACT_END;

CPUSTUBLINKER sl;
CPUSTUBLINKER *psl = &sl;

CORINFO_EE_INFO::InlinedCallFrameInfo FrameInfo;
InlinedCallFrame::GetEEInfo(&FrameInfo);

ThumbReg regFrame = ThumbReg(4);
ThumbReg regThread = ThumbReg(5);
ThumbReg regScratch = ThumbReg(6);
ThumbReg regR9 = ThumbReg(9);

// Erect frame to perform call to GetThread
psl->ThumbEmitProlog(1, sizeof(ArgumentRegisters), FALSE); // Save r4 for aligned stack

// Save argument registers around the GetThread call. Don't bother with using ldm/stm since this inefficient path anyway.
for (int reg = 0; reg < 4; reg++)
psl->ThumbEmitStoreRegIndirect(ThumbReg(reg), thumbRegSp, offsetof(ArgumentRegisters, r) + sizeof(*ArgumentRegisters::r) * reg);

psl->ThumbEmitGetThread(regThread);

for (int reg = 0; reg < 4; reg++)
psl->ThumbEmitLoadRegIndirect(ThumbReg(reg), thumbRegSp, offsetof(ArgumentRegisters, r) + sizeof(*ArgumentRegisters::r) * reg);

// mov [regFrame], FrameIdentifier::InlinedCallFrame
psl->ThumbEmitMovConstant(regScratch, (DWORD)FrameIdentifier::InlinedCallFrame);
psl->ThumbEmitStoreRegIndirect(regScratch, regFrame, 0);

// ldr regScratch, [regThread + offsetof(Thread, m_pFrame)]
// str regScratch, [regFrame + FrameInfo.offsetOfFrameLink]
psl->ThumbEmitLoadRegIndirect(regScratch, regThread, offsetof(Thread, m_pFrame));
psl->ThumbEmitStoreRegIndirect(regScratch, regFrame, FrameInfo.offsetOfFrameLink);

// str FP, [regFrame + FrameInfo.offsetOfCalleeSavedFP]
psl->ThumbEmitStoreRegIndirect(thumbRegFp, regFrame, FrameInfo.offsetOfCalleeSavedFP);

// str R9, [regFrame + FrameInfo.offsetOfSPAfterProlog]
psl->ThumbEmitStoreRegIndirect(regR9, regFrame, FrameInfo.offsetOfSPAfterProlog);

// mov [regFrame + FrameInfo.offsetOfReturnAddress], 0
psl->ThumbEmitMovConstant(regScratch, 0);
psl->ThumbEmitStoreRegIndirect(regScratch, regFrame, FrameInfo.offsetOfReturnAddress);

DWORD cbSavedRegs = sizeof(ArgumentRegisters) + 2 * 4; // r0-r3, r4, lr
psl->ThumbEmitAdd(regScratch, thumbRegSp, cbSavedRegs);
psl->ThumbEmitStoreRegIndirect(regScratch, regFrame, FrameInfo.offsetOfCallSiteSP);

// mov [regThread + offsetof(Thread, m_pFrame)], regFrame
psl->ThumbEmitStoreRegIndirect(regFrame, regThread, offsetof(Thread, m_pFrame));

// leave current Thread in R4

psl->ThumbEmitEpilog();

// A single process-wide stub that will never unload
RETURN psl->Link(SystemDomain::GetGlobalLoaderAllocator()->GetStubHeap());
}

void StubLinkerCPU::ThumbEmitGetThread(ThumbReg dest)
{
ThumbEmitMovConstant(ThumbReg(0), (TADDR)GetThreadHelper);

ThumbEmitCallRegister(ThumbReg(0));

if (dest != ThumbReg(0))
{
ThumbEmitMovRegReg(dest, ThumbReg(0));
}
}


// Emits code to adjust for a static delegate target.
VOID StubLinkerCPU::EmitShuffleThunk(ShuffleEntry *pShuffleEntryArray)
{
Expand Down
1 change: 0 additions & 1 deletion src/coreclr/vm/ceemain.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -887,7 +887,6 @@ void EEStartupHelper()
// Before setting up the execution manager initialize the first part
// of the JIT helpers.
InitJITHelpers1();
InitJITHelpers2();

SyncBlockCache::Attach();

Expand Down
2 changes: 1 addition & 1 deletion src/coreclr/vm/hosting.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -115,7 +115,7 @@ BOOL ClrVirtualProtect(LPVOID lpAddress, SIZE_T dwSize, DWORD flNewProtect, PDWO
// JIT_PatchedCode. Thus, their pages have the same protection, they live
// in the same region (and thus, its size is the same).
//
// In EEStartupHelper, when we setup the UEF and then invoke InitJitHelpers1 and InitJitHelpers2,
// In EEStartupHelper, when we setup the UEF and then invoke InitJitHelpers1,
// they perform some optimizations that result in the memory page protection being changed. When
// the UEF is to be invoked, the OS does the check on the UEF's cached details against the current
// memory pages. This check used to fail when on 64bit retail builds when JIT_PatchedCode was
Expand Down
33 changes: 33 additions & 0 deletions src/coreclr/vm/i386/PInvokeStubs.asm
Original file line number Diff line number Diff line change
Expand Up @@ -98,4 +98,37 @@ RarePath:

_JIT_PInvokeEnd@4 ENDP

;
; in:
; InlinedCallFrame (edi) = pointer to the InlinedCallFrame data
; out:
; Thread (esi) = pointer to Thread data
;
;
_JIT_InitPInvokeFrame@4 PROC public

;; esi = GetThread(). Trashes eax
INLINE_GETTHREAD esi, eax

;; edi = pFrame
;; esi = pThread

;; set first slot to the value of InlinedCallFrame identifier (checked by runtime code)
mov dword ptr [edi], FRAMETYPE_InlinedCallFrame

;; pFrame->m_Next = pThread->m_pFrame;
mov eax, dword ptr [esi + Thread_m_pFrame]
mov dword ptr [edi + Frame__m_Next], eax

mov dword ptr [edi + InlinedCallFrame__m_pCalleeSavedFP], ebp
mov dword ptr [edi + InlinedCallFrame__m_pCallerReturnAddress], 0

;; pThread->m_pFrame = pFrame;
mov dword ptr [esi + Thread_m_pFrame], edi

;; leave current Thread in ESI
ret

_JIT_InitPInvokeFrame@4 ENDP

end
3 changes: 0 additions & 3 deletions src/coreclr/vm/i386/cgencpu.h
Original file line number Diff line number Diff line change
Expand Up @@ -30,9 +30,6 @@ class FramedMethodFrame;
class Module;
class ComCallMethodDesc;

// CPU-dependent functions
Stub * GenerateInitPInvokeFrameHelper();

#define GetEEFuncEntryPoint(pfn) GFN_TADDR(pfn)

#define COMMETHOD_PREPAD 8 // # extra bytes to allocate in addition to sizeof(ComCallMethodDesc)
Expand Down
46 changes: 0 additions & 46 deletions src/coreclr/vm/i386/cgenx86.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -862,52 +862,6 @@ WORD GetUnpatchedCodeData(LPCBYTE pAddr)

#ifndef DACCESS_COMPILE

Stub *GenerateInitPInvokeFrameHelper()
{
CONTRACT(Stub*)
{
STANDARD_VM_CHECK;
POSTCONDITION(CheckPointer(RETVAL));
}
CONTRACT_END;

CPUSTUBLINKER sl;
CPUSTUBLINKER *psl = &sl;

CORINFO_EE_INFO::InlinedCallFrameInfo FrameInfo;
InlinedCallFrame::GetEEInfo(&FrameInfo);

// EDI contains address of the frame on stack

// mov esi, GetThread()
psl->X86EmitCurrentThreadFetch(kESI, (1 << kEDI) | (1 << kEBX) | (1 << kECX) | (1 << kEDX));

// mov [edi], InlinedCallFrame::GetFrameVtable()
psl->X86EmitOffsetModRM(0xc7, (X86Reg)0x0, kEDI, 0);
psl->Emit32((DWORD)FrameIdentifier::InlinedCallFrame);

// mov eax, [esi + offsetof(Thread, m_pFrame)]
// mov [edi + FrameInfo.offsetOfFrameLink], eax
psl->X86EmitIndexRegLoad(kEAX, kESI, offsetof(Thread, m_pFrame));
psl->X86EmitIndexRegStore(kEDI, FrameInfo.offsetOfFrameLink, kEAX);

// mov [edi + FrameInfo.offsetOfCalleeSavedEbp], ebp
psl->X86EmitIndexRegStore(kEDI, FrameInfo.offsetOfCalleeSavedFP, kEBP);

// mov [edi + FrameInfo.offsetOfReturnAddress], 0
psl->X86EmitOffsetModRM(0xc7, (X86Reg)0x0, kEDI, FrameInfo.offsetOfReturnAddress);
psl->Emit32(0);

// mov [esi + offsetof(Thread, m_pFrame)], edi
psl->X86EmitIndexRegStore(kESI, offsetof(Thread, m_pFrame), kEDI);

// leave current Thread in ESI
psl->X86EmitReturn(0);

// A single process-wide stub that will never unload
RETURN psl->Link(SystemDomain::GetGlobalLoaderAllocator()->GetExecutableHeap());
}

//////////////////////////////////////////////////////////////////////////////
//
// JITInterface
Expand Down
38 changes: 38 additions & 0 deletions src/coreclr/vm/i386/pinvokestubs.S
Original file line number Diff line number Diff line change
Expand Up @@ -28,3 +28,41 @@ LEAF_ENTRY JIT_PInvokeEnd, _TEXT
int 3
ret
LEAF_END JIT_PInvokeEnd, _TEXT

//
// IN:
// InlinedCallFrame (edi) = pointer to the InlinedCallFrame data
// OUT:
// Thread (esi) = pointer to Thread data
//
//
LEAF_ENTRY JIT_InitPInvokeFrame, _TEXT

// esi = GetThread(). Trashes eax
push ecx
push edx
call GetThreadHelper
pop edx
pop ecx
mov esi, eax

// edi = pFrame
// esi = pThread

// set first slot to the value of InlinedCallFrame identifier (checked by runtime code)
mov dword ptr [edi], FRAMETYPE_InlinedCallFrame

// pFrame->m_Next = pThread->m_pFrame;
mov eax, dword ptr [esi + Thread_m_pFrame]
mov dword ptr [edi + Frame__m_Next], eax

mov dword ptr [edi + InlinedCallFrame__m_pCalleeSavedFP], ebp
mov dword ptr [edi + InlinedCallFrame__m_pCallerReturnAddress], 0

// pThread->m_pFrame = pFrame;
mov dword ptr [esi + Thread_m_pFrame], edi

// leave current Thread in ESI
ret

LEAF_END JIT_InitPInvokeFrame, _TEXT
52 changes: 0 additions & 52 deletions src/coreclr/vm/i386/stublinkerx86.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -2144,58 +2144,6 @@ static const X86Reg c_argRegs[] = {

#ifdef TARGET_X86

VOID StubLinkerCPU::X86EmitCurrentThreadFetch(X86Reg dstreg, unsigned preservedRegSet)
{
CONTRACTL
{
STANDARD_VM_CHECK;

// It doesn't make sense to have the destination register be preserved
PRECONDITION((preservedRegSet & (1 << dstreg)) == 0);
AMD64_ONLY(PRECONDITION(dstreg < 8)); // code below doesn't support high registers
}
CONTRACTL_END;

#ifdef TARGET_UNIX

X86EmitPushRegs(preservedRegSet & ((1 << kEAX) | (1 << kEDX) | (1 << kECX)));

// call GetThread
X86EmitCall(NewExternalCodeLabel((LPVOID)GetThreadHelper), 0);

// mov dstreg, eax
X86EmitMovRegReg(dstreg, kEAX);

X86EmitPopRegs(preservedRegSet & ((1 << kEAX) | (1 << kEDX) | (1 << kECX)));

#ifdef _DEBUG
// Trash caller saved regs that we were not told to preserve, and that aren't the dstreg.
preservedRegSet |= 1 << dstreg;
if (!(preservedRegSet & (1 << kEAX)))
X86EmitDebugTrashReg(kEAX);
if (!(preservedRegSet & (1 << kEDX)))
X86EmitDebugTrashReg(kEDX);
if (!(preservedRegSet & (1 << kECX)))
X86EmitDebugTrashReg(kECX);
#endif // _DEBUG

#else // TARGET_UNIX

BYTE code[] = { 0x64,0x8b,0x05 }; // mov dstreg, dword ptr fs:[IMM32]
static const int regByteIndex = 2;

code[regByteIndex] |= (dstreg << 3);

EmitBytes(code, sizeof(code));
Emit32(offsetof(TEB, ThreadLocalStoragePointer));

X86EmitIndexRegLoad(dstreg, dstreg, sizeof(void *) * _tls_index);

X86EmitIndexRegLoad(dstreg, dstreg, (int)Thread::GetOffsetOfThreadStatic(&gCurrentThreadInfo));

#endif // TARGET_UNIX
}

#ifdef TARGET_UNIX
namespace
{
Expand Down
2 changes: 0 additions & 2 deletions src/coreclr/vm/i386/stublinkerx86.h
Original file line number Diff line number Diff line change
Expand Up @@ -209,8 +209,6 @@ class StubLinkerCPU : public StubLinker
VOID X86EmitCall(CodeLabel *target, int iArgBytes);
VOID X86EmitReturn(WORD wArgBytes);

VOID X86EmitCurrentThreadFetch(X86Reg dstreg, unsigned preservedRegSet);

VOID X86EmitCurrentThreadAllocContextFetch(X86Reg dstreg, unsigned preservedRegSet);

VOID X86EmitIndexRegLoad(X86Reg dstreg, X86Reg srcreg, int32_t ofs = 0);
Expand Down
Loading
Loading