Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Segmentation fault with TR_StressTrampolines enabled #11296

Open
dmitry-ten opened this issue Nov 27, 2020 · 8 comments · May be fixed by #11426
Open

Segmentation fault with TR_StressTrampolines enabled #11296

dmitry-ten opened this issue Nov 27, 2020 · 8 comments · May be fixed by #11426

Comments

@dmitry-ten
Copy link
Contributor

dmitry-ten commented Nov 27, 2020

The segfault happens every time when running -version with TR_StressTrampolines enabled.

$JAVA_BIN/java -Xjit:count=0,stressTrampolines,verbose -version

The crash occurs on x86, but I haven't checked other platforms.
The stack trace:

Thread 2 "java" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffff7193700 (LWP 31893)]
0x00007fffeedd8cbf in OMR::CodeCache::replaceTrampoline (this=0x7ffff00b3c80, method=0x51718, oldTrampoline=0x0, oldTargetPC=0x0,
    newTargetPC=0x7fffd9434b48, needSync=false) at /home/dmitry-ten/src/omr/compiler/runtime/OMRCodeCache.cpp:623
623           entry->_info._resolved._currentTrampoline = trampoline;
(gdb) bt
#0  0x00007fffeedd8cbf in OMR::CodeCache::replaceTrampoline (this=0x7ffff00b3c80, method=0x51718, oldTrampoline=0x0, oldTargetPC=0x0,
    newTargetPC=0x7fffd9434b48, needSync=false) at /home/dmitry-ten/src/omr/compiler/runtime/OMRCodeCache.cpp:623
#1  0x00007fffeeddca1b in OMR::CodeCacheManager::replaceTrampoline (this=0x7ffff00a8cc0, method=0x51718, callSite=0x7fffd9433170,
    oldTrampoline=0x0, oldTargetPC=0x0, newTargetPC=0x7fffd9434b48, needSync=false)
    at /home/dmitry-ten/src/omr/compiler/runtime/OMRCodeCacheManager.cpp:560
#2  0x00007fffeed71a78 in mcc_replaceTrampoline (method=0x51718, callSite=0x7fffd9433170, oldTrampoline=0x0, oldTargetPC=0x0,
    newTargetPC=0x7fffd9434b48, needSync=false) at /home/dmitry-ten/src/compiler/runtime/J9CodeCache.cpp:755
#3  0x00007fffeedd2735 in amd64CodePatching (theMethod=0x51718, callSite=0x7fffd9433170, currentPC=0x0, currentTramp=0x0,
    newPC=0x7fffd9434b48, extra=0x0) at /home/dmitry-ten/src/compiler/runtime/Trampoline.cpp:632
#4  0x00007fffeedd90ee in OMR::CodeCache::patchCallPoint (this=0x7ffff00b3c80, method=0x51718, callSite=0x7fffd9433170,
    newStartPC=0x7fffd9434b48, extraArg=0x0) at /home/dmitry-ten/src/omr/compiler/runtime/OMRCodeCache.cpp:764
#5  0x00007fffeed7191b in mcc_callPointPatching_unwrapper (argsPtr=0x17d28, resPtr=0x17d28)
    at /home/dmitry-ten/src/compiler/runtime/J9CodeCache.cpp:726
#6  0x00007fffef63c888 in old_slow_jitCallCFunction (currentThread=0x18100) at cnathelp.cpp:2868
#7  0x00007fffef65b076 in jitCallCFunction () at xnathelp.s:9797
#8  0x0000000000018100 in ?? ()
#9  0x00007ffff718fa40 in ?? ()
#10 0x00007fffeed718a6 in J9::CodeCache::resetCodeCache (this=0x7ffff718fa40) at /home/dmitry-ten/src/compiler/runtime/J9CodeCache.cpp:713

I'm pretty sure that the last frame should be ignored, as resetCodeCache is only called in one place on JITServer server, so it's impossible for it to be called in this case and it's probably gdb messing up the stack trace.

entry is NULL, which indicates that method 0x51718 hasn't been added to the hash table of methods for which we know we might need a trampoline. To reserve a trampoline, TR_J9VMBase::reserveTrampolineIfNecessary should be called.

All of this information leads me to conclude that somewhere we force a call through trampoline under TR_StressTrampolines but don't account for it with reserveTrampolineIfNecessary.

@dmitry-ten
Copy link
Contributor Author

Using limit files and tracing, I narrowed down the crash to compilation of 3 methods:

+ (warm) java/util/HashMap.putVal(ILjava/lang/Object;Ljava/lang/Object;ZZ)Ljava/lang/Object; @ 00007FFFD9433068-00007FFFD9434315 OrdinaryMethod - Q_SZ=0 Q_SZI=0 QW=12 j9m=000000000004E998 bcsz=300 sync GCR compThreadID=0 CpuLoad=5%(0%avg) JvmCpu=0%
+ (warm) java/util/HashMap.resize()[Ljava/util/HashMap$Node; @ 00007FFFD9434360-00007FFFD9434AEA OrdinaryMethod - Q_SZ=0 Q_SZI=0 QW=12 j9m=000000000004E9B8 bcsz=359 sync GCR compThreadID=0 CpuLoad=101%(16%avg) JvmCpu=98%
addResolvedMethod=0x3a8c0
+ (warm) java/util/HashMap$Node.<init>(ILjava/lang/Object;Ljava/lang/Object;Ljava/util/HashMap$Node;)V @ 00007FFFD9434B48-00007FFFD9434E9C OrdinaryMethod - Q_SZ=0 Q_SZI=0 QW=6 j9m=0000000000051718 bcsz=26 sync GCR compThreadID=0 CpuLoad=101%(16%avg) JvmCpu=98%

As you can see the problematic method 0x51718 is java/util/HashMap$Node.<init>(ILjava/lang/Object;Ljava/lang/Object;Ljava/util/HashMap$Node;)V.

Looking at the trace files for these methods the address of the method is directly used here:

[0x7fffd852c820]       lea     GPR_0193, qword ptr [$0x0000000000051718]               # LEA8RegMem, SymRef  <J9Method>[#310  Static] [flags 0x
[0x7fffd852c9c0]       lea     GPR_0194, qword ptr [$0x0000000000000000]               # LEA8RegMem, SymRef  <start-PC>[#309  Static] [flags 0x
       4000303 0x40 ]
[0x7fffd852cad0]       mov     GPR_0195, 0x00080000    # MOV4RegImm4
[0x7fffd852cd40]       Label L0096:                    # LABEL # (Start of internal control flow)
         PRE: [GPR_0193 : eax] [GPR_0194 : esi] [GPR_0195 : edx]
[0x7fffd852cc20]       call    jitRetranslateCallerWithPreparation# CALLImm4 (00007FFFEF657B30)# CALLImm4

@dmitry-ten
Copy link
Contributor Author

And this is another where the address of the method is used:

[0x7fffd8490210]                                    Label L0020:                        # LABEL
[0x7fffd8490380] 48 bf 18 17 05 00 00 00 00 00      mov rdi, 0x0000000000051718 # MOV8RegImm64
[0x7fffd84906d0] 48 89 44 24 28                     mov qword ptr [rsp+0x28], rax               # S8MemReg, SymRef [#367 +40]
[0x7fffd8490910] 89 74 24 20                        mov dword ptr [rsp+0x20], esi               # S4MemReg, SymRef [#368 +32]
[0x7fffd8490b50] 48 89 54 24 18                     mov qword ptr [rsp+0x18], rdx               # S8MemReg, SymRef [#369 +24]
[0x7fffd8490d90] 48 89 4c 24 10                     mov qword ptr [rsp+0x10], rcx               # S8MemReg, SymRef [#370 +16]
[0x7fffd8491070]                                    assocRegs                       # ASSOCREGS
[0x7fffd8490e20] e9 9e d8 1f 00                     jmp j2iTransition           # JMP4 (00007FFFEF65BD40)# JMP4

@dmitry-ten
Copy link
Contributor Author

The actual reason this segfault happens is because when we generate a call to java/util/HashMap$Node.<init> from java/util/HashMap.putVal, the constructor is still unresolved and not compiled but once it gets compiled, the code to patch the code looks for a resolved trampoline where only unresolved one exists.

In this part of the call snippet code, we reserve a trampoline, however since J9Method is NULL at this point, we will reserve unresolved trampoline.
https://github.com/eclipse/openj9/blob/bec65f62c6ac6375b50a71a8ad083a2109463f61/runtime/compiler/x/codegen/CallSnippet.cpp#L809-L814

However, when the method gets compiled, the patching code expects a reserved trampoline:
https://github.com/eclipse/openj9/blob/bec65f62c6ac6375b50a71a8ad083a2109463f61/runtime/compiler/x/runtime/X86Unresolveds.nasm#L987-L1005

There is a method called mcc_reservationAdjustment_unwrapper that calls J9::CodeCache::adjustTrampolineReservation which will add resolved method entry to the method hashtable and allow resolved trampoline patching. However, I don't see it used anywhere in the x86 code.

@dmitry-ten
Copy link
Contributor Author

It seems like the solution would be to modify assembly of interpreterStaticAndSpecialGlue to call mcc_reservationAdjustment_unwrapper before calling mcc_callPointPatching_unwrapper but I'm not sure if that's correct or how to proceed from here.

@dmitry-ten
Copy link
Contributor Author

@mpirvu @0xdaryl could you take a look? You can ignore everything but the last 2 comments.

@0xdaryl
Copy link
Contributor

0xdaryl commented Dec 8, 2020

There used to be a call to the trampoline adjustment code from PicBuilder in pre-open source releases, but I see it has been removed for some reason. I'll try and find out when/why.

@mpirvu mpirvu added the arch:x86 label Dec 8, 2020
@0xdaryl
Copy link
Contributor

0xdaryl commented Dec 9, 2020

This is a good find and good detective work to get to the bottom of it.

The original implementation of the unresolved dispatch logic had the call to adjust the reservation but when the code was rewritten almost 15 years ago that logic was missed for some reason. I can't think of a reason why it would have been intentionally omitted. I think it is needed--and in the spot that you identified--because otherwise I can't see how a trampoline would move from the unresolved to the resolved hash table.

The original code looked like this. I imagine a lot of it could be re-used with some freshening up for the current implementation (register usage, stack offsets, comments, etc.)

        ; Call to adjust trampoline reservations, since this callsite is now resolved
        ; The parameter layout:
        ; rsp+24    cpIndex
        ; rsp+16    cpAddress
        ; rsp+8     j9method
        ; rsp       callSite
        ;
        mov     rax, qword ptr[rsp]                ; load return address
        push    rsi                                ; p) preserve rsi
        mov     edx, dword ptr[rdi + 9]            ; load cpIndex
        push    rdx
        mov     rdx, qword ptr[rdi + 1]            ; load cpAddress
        push    rdx
        push    rsi                                ; resolved J9Method
        push    rax
        MoveHelper rax, mcc_reservationAdjustment_unwrapper
        lea     rsi, [rsp]
        lea     rdx, [rsp]
        call    jitCallCFunction
        add     rsp, 32
        pop     rsi

@dmitry-ten
Copy link
Contributor Author

Alright, thanks for providing the original implementation.
I'll try to adapt it to the current code.

@mpirvu mpirvu linked a pull request Dec 10, 2020 that will close this issue
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants