-
Notifications
You must be signed in to change notification settings - Fork 2.6k
Remove relocations for vtable chunks #17147
Remove relocations for vtable chunks #17147
Conversation
@dotnet/arm32-contrib PTAL |
a3caca9
to
58af16f
Compare
cc @vitek-karas This private memory optimization is specific to the FragileNonVersionable NGen. If it is not used (the default), this change is a regression with no benefits. We have been working towards including Linux ARM among officially supported .NET Core platforms for .NET Core 2.1, and just recently started looking at performance. I think we need to step back and agree on a plan how to evolve the native binary file format(s). Here are some points that come to mind:
Thoughts? |
What do you think about adding this optimization under This way, FragileNonVersionable or R2R specific optimizations, which affect each other, could be defined under different
Does this mean that precodes won't be saved to FNV images and will be created at run time? |
Yes, it is what can be done for now (together with other FragileNonVersionable optimizations that are under ARM && UNIX ifdef today). I am worried that these ifdefs become unmanageable over time.
The prototype that we have done saves the precodes as read-only (https://github.com/vitek-karas/coreclr/tree/NGenWX). This reminds me that we need an issue for this. I have just created one: https://github.com/dotnet/coreclr/issues/17230 |
58af16f
to
a39f5c5
Compare
We'll consider applying this approach on ARM after May, as we have to measure performance and memory consumption of such changes. Could you, please, take a look at this pull request? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you need to make arm LEGACY_BACKEND
work with this? It's likely we will delete this code path soon.
@@ -2342,7 +2342,15 @@ void MethodDesc::Reset() | |||
|
|||
InterlockedUpdateFlags2(enum_flag2_HasStableEntryPoint | enum_flag2_HasPrecode, FALSE); | |||
|
|||
*GetAddrOfSlot() = GetTemporaryEntryPoint(); | |||
TADDR slot = GetAddrOfSlot(); | |||
if (IsVtableSlot()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any reason why you have not converted all slots to this scheme?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Non-virtual slots and non-vtable slots are plain pointers in non-NGEN mode and relative pointers in NGEN mode, so, for IsVtableSlot()
slot is always a relative pointer (for ARM with FEATURE_FNV_MEM_OPTIMIZATIONS
), otherwise it's not.
src/vm/ceeload.cpp
Outdated
@@ -9088,7 +9088,7 @@ void Module::PlaceType(DataImage *image, TypeHandle th, DWORD profilingFlags) | |||
MethodTable::VtableIndirectionSlotIterator it = pMT->IterateVtableIndirectionSlots(); | |||
while (it.Next()) | |||
{ | |||
image->PlaceInternedStructureForAddress(it.GetIndirectionSlot(), CORCOMPILE_SECTION_READONLY_SHARED_HOT, CORCOMPILE_SECTION_READONLY_HOT); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why can't you use PlaceInternedStructureForAddress anymore? PlaceInternedStructureForAddress was used here as binary size optimization.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've missed the check for MethodTable::VTableIndir2_t::isRelative
there: it should be PlaceStructureForAddress
if this condition is met, otherwise, as before. I'll fix this.
When vtable chunks consist of relative pointers, they can't be reused, as these chunks can be the same for different method tables, but base addresses must be different.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The base address is always the address of the slot itself. I do not see what prevents reusing them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What I meant was that without this optimization if there are two method tables MT1
with chunk1
and MT2
with chunk2
, these chunks differ, as they consist of different absolute addresses. If addresses are relative, memory blocks of these chunks might become equal, when all methods in both chunks have the same relative offsets. Then, base addresses of these chunks have to be different in order to store different addresses in them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If addresses are relative, memory blocks of these chunks might become equal
PlaceInternedStructureForAddress
does not look at bytes inside the blocks. It just looks at the addresses. I do not think that the addresses can ever become accidentally equal.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree, this place could be left unchanged, as m_reusedStructures
in PlaceInternedStructureForAddress
will not contain vtable chunks, and this change is unneeded. Thank you.
src/jit/morph.cpp
Outdated
&isRelative); | ||
&isRelative, &isRelativeAfterIndirection); | ||
|
||
if (isRelativeAfterIndirection) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we have just one flag for it that turns it on/off for the whole thing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Two separate flags could be useful for enabling more or less aggressive memory optimizations for FNV. Now there are next possible combinations:
isRelative==false
and isRelativeAfterIndirection==false
isRelative==true
and isRelativeAfterIndirection==false
isRelative==true
and isRelativeAfterIndirection==true
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I understand that it is more flexibility, but I doubt that we will ever take advantage of it and it increases the number of options one has to worry about when working on the code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We target different devices and this option might be useful, as memory reduction of this optimization comes with performance reduction on benchmarks.
Maybe this and similar places could be refactored, so that it will be easier to maintain them. What do you think about this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Historically, fine grained controls like these did not survive for long. I think we just need two options: fully position independent data structures on/off. Similar to how position-independent-code works in C/C++. You either turn it on or off. There are no fine-grained controls.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
a39f5c5
to
bb8ae14
Compare
src/jit/codegencommon.cpp
Outdated
@@ -9740,6 +9740,33 @@ void CodeGen::genFnEpilog(BasicBlock* block) | |||
} | |||
break; | |||
|
|||
case IAT_RELPVALUE: | |||
{ | |||
#if defined(_TARGET_ARM_) && defined(FEATURE_FNV_MEM_OPTIMIZATIONS) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The FEATURE_FNV_MEM_OPTIMIZATION
ifdefs should be just on the VM side. They are not necessary on the JIT side since the JIT gets instructions on what to do from the VM side, and the cost of the extra checks is negligible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
Yes, without legacy backend part of this change we won't be able to run legacy jit with |
clrdefinitions.cmake
Outdated
@@ -195,6 +195,9 @@ add_definitions(-DFEATURE_STRONGNAME_MIGRATION) | |||
if (CLR_CMAKE_PLATFORM_UNIX OR CLR_CMAKE_TARGET_ARCH_ARM64) | |||
add_definitions(-DFEATURE_STUBS_AS_IL) | |||
endif () | |||
if (FEATURE_FNV_MEM_OPTIMIZATIONS) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the name of the ifdef should be something like FEATURE_NGEN_RELOCS_OPTIMIZATIONS
to better describe what it does.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And switch the existing places that use #if defined(PLATFORM_UNIX) && defined(_TARGET_ARM_)
to use it as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we add build with FEATURE_NGEN_RELOCS_OPTIMIZATIONS for Linux ARM to default set of tests on pull requests?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's certainly possible.
1cb5a66
to
93459f6
Compare
@dotnet-bot help |
Welcome to the dotnet/coreclr Repository The following is a list of valid commands on this PR. To invoke a command, comment the indicated phrase on the PR The following commands are valid for all PRs and repositories. Click to expand
The following jobs are launched by default for each PR against dotnet/coreclr:master. Click to expand
The following optional jobs are available in PRs against dotnet/coreclr:master. Click to expand
Have a nice day! |
Welcome to the dotnet/coreclr Repository The following is a list of valid commands on this PR. To invoke a command, comment the indicated phrase on the PR The following commands are valid for all PRs and repositories. Click to expand
The following jobs are launched by default for each PR against dotnet/coreclr:master. Click to expand
The following optional jobs are available in PRs against dotnet/coreclr:master. Click to expand
Have a nice day! |
Welcome to the dotnet/coreclr Perf help The following is a list of valid commands on this PR. To invoke a command, comment the indicated phrase on the PR The following commands are valid for all PRs and repositories. Click to expand
The following jobs are launched by default for each PR against dotnet/coreclr:master. Click to expand
The following optional jobs are available in PRs against dotnet/coreclr:master. Click to expand
Have a nice day! |
Could you, please, take a look? |
src/inc/corinfo.h
Outdated
IAT_PVALUE, // The value needs to be accessed via an indirection | ||
IAT_PPVALUE // The value needs to be accessed via a double indirection | ||
IAT_PVALUE, // The value needs to be accessed via an indirection | ||
IAT_RELPVALUE, // The value needs to be accessed via a relative indirection |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is changing JIT/EE interface definition. The JIT/EE interface definition will need new GUID.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be better to add IAT_RELPVALUE
at the end of the enum rather than in the middle.
It less of a breaking change that way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
The non-JIT part of the changes looks good to me otherwise. @BruceForstall Could you please take a look at the JIT part of the changes? |
8241858
to
6d886f7
Compare
@jkotas @BruceForstall @briansull |
src/jit/codegencommon.cpp
Outdated
|
||
regNumber vptrReg1 = REG_R11; | ||
regMaskTP vptrReg1Mask = genRegMask(vptrReg1); | ||
inst_IV(INS_push, (int)vptrReg1Mask); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's a fundamental problem here, or maybe two: (1) you are temporarily trashing R11, which is the frame pointer. This will break stack walking, and local variable access (e.g., debugging), and if an exception occurs after FP has been trashed, FP wouldn't get restored, and the stack would also be off, and (2) we never change the stack pointer outside of the prolog or epilog. I think that's required, also for stack walking.
Presumably if you ran GCStress in a fully-interruptible method with this type of call, you would hit a problem (since GC would be attempted after the push
). Actually, I'm guessing GC would fail dramatically because we don't track stack pointer moves outside the prolog/epilog, so the GC info would be incorrect.
The correct way to do this would be to make information known to the register allocator that we need a temporary register here, and then grab that temporary. This might be a case where we've never done that, so I don't know if there's an established mechanism for communicating this.
I notice also that StubLinkerCPU::ThumbEmitCallManagedMethod
has apparently the same call sequence, which uses R11 in the same way. I'm not sure what the requirements are on stubs, and whether this is ok in that context or not.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@BruceForstall, what if we use some other register for this instead of R11? For example, R1, to fix problem (1), which you have mentioned.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@BruceForstall For problem (2), we can reserve additional space on stack in prolog, and store helper register there instead of push
. This will allow to not change stack pointer. What do you think about this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Those two things would probably make this work. However, it wouldn't be the "right" way to do it. The right way is to get the register allocator to allocate a temporary register that would be available at this point. It done, then it would automatically save/restore any conflicting register (if needed), and allocate stack space (if needed).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@BruceForstall I've pushed new commits related to these. Could you, please, take a look?
0510018
to
67a651a
Compare
src/jit/codegencommon.cpp
Outdated
@@ -8734,6 +8774,22 @@ void CodeGen::genFnEpilog(BasicBlock* block) | |||
} | |||
break; | |||
|
|||
case IAT_RELPVALUE: | |||
#ifdef FEATURE_NGEN_RELOCS_OPTIMIZATIONS |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We do not use FEATURE ifdefs for cases like these when the VM asked the JIT to do something. We just leave the handling in unconditionally, even though it creates a bit of unreachable code in the JIT. It simplifies the matrix of different build combinations we need to worry about. It is nice to have one JIT library that can handle all of them (per host/target combination).
…TURE_NGEN_RELOCS_OPTIMIZATIONS is enabled. Introduce FEATURE_NGEN_RELOCS_OPTIMIZATIONS, under which NGEN specific relocations optimizations are enabled
- str/ldr of R4 in space reserved in epilog for non-tail calls - usage of R4 with hybrid-tail calls (same as for EmitShuffleThunk)
…r register right before its restore from stack
032544b
to
0f5c702
Compare
@dotnet-bot test this please |
@dotnet-bot test Ubuntu arm Cross Checked normal Build and Test |
@jkotas do you have any more concerns with this change, or is it ready to merge from your perspective? |
src/jit/codegencommon.cpp
Outdated
@@ -8644,6 +8644,45 @@ void CodeGen::genFnEpilog(BasicBlock* block) | |||
unwindStarted = true; | |||
} | |||
|
|||
#ifdef FEATURE_NGEN_RELOCS_OPTIMIZATIONS |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We try to avoid FEATURE ifdefs like these in the JIT. Can this be driven by what the VM tells JIT to do?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jkotas, yes, this would work without #ifdef FEATURE_NGEN_RELOCS_OPTIMIZATIONS
. I'll remove it
src/vm/methodtable.cpp
Outdated
@@ -5050,7 +5057,7 @@ void MethodTable::Fixup(DataImage *image) | |||
ZapNode * importThunk = image->GetVirtualImportThunk(pMD->GetMethodTable(), pMD, slotNumber); | |||
// On ARM, make sure that the address to the virtual thunk that we write into the | |||
// vtable "chunk" has the Thumb bit set. | |||
image->FixupFieldToNode(slotBase, slotOffset, importThunk ARM_ARG(THUMB_CODE)); | |||
image->FixupFieldToNode(slotBase, slotOffset, importThunk ARM_ARG_OR_ZERO(THUMB_CODE), relocType); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be better to write this as ARM_ARG
+ NOT_ARM_ARG
instead of introducing a new macro for this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jkotas, thank you! i'll fix that
src/jit/codegencommon.cpp
Outdated
|
||
CORINFO_METHOD_HANDLE methHnd = (CORINFO_METHOD_HANDLE)jmpNode->gtVal.gtVal1; | ||
CORINFO_CONST_LOOKUP addrInfo; | ||
compiler->info.compCompHnd->getFunctionEntryPoint(methHnd, &addrInfo); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We will call getFunctionEntryPoint
again later when jmpEpilog
is set. Should we remember the result from here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good idea, fixed
src/jit/codegencommon.cpp
Outdated
@@ -8644,6 +8644,43 @@ void CodeGen::genFnEpilog(BasicBlock* block) | |||
unwindStarted = true; | |||
} | |||
|
|||
#if !FEATURE_FASTTAILCALL |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If/once FEATURE_FASTTAILCALL
is enabled for ARM, this will need to change. Would it make sense to change this to something like this so that it is ready?
#if FEATURE_FASTTAILCALL
if (jmpEpilog && jmpNode->gtOper == GT_JMP)
#else
if (jmpEpilog)
#endif
Or maybe not have this under ifdef at all and just do if (jmpEpilog && jmpNode->gtOper == GT_JMP)
always to keep the code simple.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've updated latest commit with your last suggestion, thank you!
b54399e
to
ca2576d
Compare
ca2576d
to
a3992d9
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks
Is anybody fixing desktop errors introduced by this change and porting |
I will take care of it |
@jkotas Thank you! |
* Separate sections READONLY_VCHUNKS and READONLY_DICTIONARY * Remove relocations for second-level indirection of Vtable in case FEATURE_NGEN_RELOCS_OPTIMIZATIONS is enabled. Introduce FEATURE_NGEN_RELOCS_OPTIMIZATIONS, under which NGEN specific relocations optimizations are enabled * Replace push/pop of R11 in stubs with - str/ldr of R4 in space reserved in epilog for non-tail calls - usage of R4 with hybrid-tail calls (same as for EmitShuffleThunk) * Replace push/pop of R11 for function epilog with usage of LR as helper register right before its restore from stack Commit migrated from dotnet/coreclr@61146b5
This pull request replaces absolute pointers with relative in vtable chunks for Linux ARM.
On application, which is referenced in #10380 (comment), the following memory consumption changes occur for mappings of images (based on 08c87c5):
Rss: 8872 -> 8756 (1.3% improvement)
Private_Dirty: 2048 -> 1796 (12.3% improvement)
Private_Clean: 2028 -> 2036
Shared_Clean: 4796 -> 4924
No difference in startup time for GUI applications in case system dlls are crossgened. In case no dlls are crossgened: 0.7% increase of startup time. On micro benchmarks with lots of virtual calls: up to 29% increase of execution time.
cc @Dmitri-Botcharnikov @alpencolt @kvochko