JIT SIGTRAP Signal Handler Issue #10744

Closed
babsingh opened this issue Sep 29, 2020 · 14 comments

@babsingh
Contributor

babsingh commented Sep 29, 2020

Issue

Incorrect behaviour: OpenJ9 terminates abruptly or generates a core file for TRAP instructions when run under the RPM package manager on RHEL8.2 ppc64le. This behaviour is not seen outside of the RPM package manager. Similar behaviour is also seen on SLES15 ppc64le.

The issue is introduced because of eclipse-omr/omr#4554, which impacts JIT's SIGTRAP signal handler.

Reverting eclipse-omr/omr#4554 corrects the behaviour. eclipse-omr/omr#4554 can't be reverted since it fixes #7749.

There is no direct link between unblocking a signal and JIT's SIGTRAP signal handler. Also, SIGTRAP is categorized as a synchronous signal, and only asynchronous signals are unblocked.

We need to investigate how eclipse-omr/omr#4554 impacts JIT's SIGTRAP signal handler.

The below workaround, test case and instructions to reproduce the behaviour are provided by @klangman.

Workaround

The JVM command line option, -Xjit:NoResumableTrapHandler, will avoid the above reported problem. This option prevents the JIT from using TRAP instructions.

Steps to reproduce

  • Perform the machine setup
  • Create a WORK directory: mkdir /root/work; export WORK=/root/work
  • cd $WORK
  • Unzip OpenJ9 JDK8 in $WORK/jdk8
  • Create the test case: NPETest.java
  • Create the RPM spec: test.jvmtrap.spec
  • Create the run script: run.sh
  • Execute ./run.sh, which should reproduce the incorrect behaviour

Machine setup

  • RHEL 8.2 ppc64le
  • rpmbuild
  • rpm
  • OpenJ9 JDK8 build for Linux ppc64le from AdoptOpenJDK

Create the test case: NPETest.java

public class NPETest {
    static String[] strings = {"Kevin", "Alice", "Phill", "Tilly", "Noel", "Anthony", "Bruce", "Julie", null, null};
    public static void main(String[] args) {
        for( int i=0 ; i<20000 ; i++){
            throwMaybe(i%10);
        }
        System.out.println("Test finished without a crash!");
    }
    static void throwMaybe(int x){
        try{
           System.out.println( "Name: " + strings[x].toString() );
        }catch(NullPointerException e){
            // Do nothing
        }
    }
}

Create the RPM spec: test.jvmtrap.spec

Name:       TEST
Version:        8.1.11
Release:        15
License:        None
Summary:        Testing if I can see a SEGV using JVM in a scriptlet
Group:      Utilities/Archiving
Vendor:     N/A
URL:        n/a
%description

%files

%post
#!/bin/bash
echo "Hello World!"
#/etc/init.d/webserver start
$WORK/jdk8/bin/java -Xjit:verbose -cp $WORK NPETest |& tee $WORK/output.log

Create the run script: run.sh

cd $WORK
jdk8/bin/javac NPETest.java
rpm -evh TEST-8.1.11-15.ppc64le
rpmbuild -ba test.jvmtrap.spec
rpm -ivh /root/rpmbuild/RPMS/ppc64le/TEST-8.1.11-15.ppc64le.rpm
@klangman
Contributor

The problem also happens on SLES 15 for PPCle.

@andrewcraik
Contributor

@babsingh / @klangman so which platforms are affected and what is the next step on this?

@babsingh
Contributor Author

babsingh commented Nov 3, 2020

> so which platforms are affected and what is the next step on this?

@andrewcraik The failure was only observed on ppc64le platforms, specifically RHEL8.2 and SLES15, since JIT's SIGTRAP signal handler only gets used on these platforms. There is a workaround, -Xjit:NoResumableTrapHandler, so resources are being prioritized towards other high priority tasks. The next step would be to investigate how eclipse-omr/omr#4554 impacts JIT's SIGTRAP signal handler since reverting eclipse-omr/omr#4554 resolves the issue.

@klangman
Contributor

klangman commented Nov 3, 2020

I tested zLinux and was unable to recreate the problem. It's unknown whether the problem exists on LinuxPPCbe because there is no up-to-date distribution for PPC BE. The problem only appears on newer distributions (older SLES and RHEL distributions for PPC LE do not show the problem).

@andrewcraik
Contributor

OK, since this is P specific: FYI @gita-omr. I am going to tag this with the arch:p label.

@babsingh
Contributor Author

babsingh commented Dec 2, 2020

> The problem also happens on SLES 15 for PPCle.

@klangman SLES has YaST as the package manager. It also supports RPM. Was the problem recreated using YaST or RPM on SLES 15?

@klangman
Contributor

klangman commented Dec 2, 2020

> @klangman SLES has YaST as the package manager. It also supports RPM. Was the problem recreated using YaST or RPM on SLES 15?

@babsingh I used RPM on SLES 15 to recreate the problem.

@tajila
Contributor

tajila commented Dec 4, 2020

I'm going to move this to the next release as it is unlikely that it will be resolved for the code split.

@babsingh @klangman Let me know if you think otherwise.

@rmnattas
Contributor

rmnattas commented Dec 24, 2020

Looking at the signal masks in /proc/<PID>/status every ~0.02 seconds I see the following changes:

Timestamp | java process (PID=7527)    | main thread (PID=7530)
          | SigBlk   SigIgn   SigCgt   | SigBlk   SigIgn   SigCgt
32.262    | 0xfa17   0x0006   0x0000   | ---      ---      ---
32.352    | //       0x1006   0x04f8   | 0xfa17   0x1006   0x04f8
32.448    | //       0x1002   0x44fd   | 0xba12   0x1002   0x44fd
33.083    | //       //       0x44ed   | 0xba02   //       0x44ed
33.084    | SIGTRAP received in main thread

* SIGTRAP bit = 0x0010
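
The bit arithmetic behind that footnote can be sketched in C as follows. This is only an illustration: the helper names signal_bit and mask_has_signal are hypothetical, and the sketch assumes the Linux convention that bit n-1 of a /proc mask corresponds to signal number n (SIGTRAP is signal 5, hence 0x0010).

```c
#include <assert.h>
#include <signal.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: bit n-1 of a /proc status mask stands for signal n. */
static uint64_t signal_bit(int signo)
{
    assert(signo >= 1 && signo <= 64);
    return (uint64_t)1 << (signo - 1);
}

/* True if the signal's bit is set in a SigBlk/SigIgn/SigCgt value. */
static bool mask_has_signal(uint64_t mask, int signo)
{
    return (mask & signal_bit(signo)) != 0;
}
```

Applied to the main thread columns, mask_has_signal(0x44fd, SIGTRAP) is true while mask_has_signal(0x44ed, SIGTRAP) is false, and the same bit drops out of SigBlk between 0xba12 and 0xba02.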

What is of interest here is the last change: the SIGTRAP handler is no longer registered with the OS, and SIGTRAP has been unblocked in the main thread.
The main difference between crashed runs and successful runs (using -Xjit:NoResumableTrapHandler, reverting the offending commit, or by chance, as it is not failing 100% of the time for me) is that successful runs do not show the last change (the 33.083 row).
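
For anyone wanting to repeat the sampling of /proc/<PID>/status described above, a minimal C sketch of reading one mask field might look like this. It is not the tool used for the table; the function name read_sig_mask is hypothetical, and it assumes the usual Linux layout of lines such as "SigBlk:" followed by a hexadecimal value.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: read one hexadecimal mask field ("SigBlk", "SigIgn"
 * or "SigCgt") from a /proc status file. Returns 0 on success and -1 if the
 * file cannot be opened or the field is absent.
 */
static int read_sig_mask(const char *status_path, const char *field, uint64_t *out)
{
    assert(status_path != NULL && field != NULL && out != NULL);

    FILE *fp = fopen(status_path, "r");
    if (fp == NULL) {
        return -1;
    }

    char line[256];
    size_t field_len = strlen(field);
    int rc = -1;

    while (fgets(line, sizeof line, fp) != NULL) {
        /* Match "<field>:" at the start of the line, then parse the hex value. */
        if (strncmp(line, field, field_len) == 0 && line[field_len] == ':') {
            *out = strtoull(line + field_len + 1, NULL, 16);
            rc = 0;
            break;
        }
    }

    fclose(fp);
    return rc;
}
```

Calling it in a loop with a short sleep on /proc/<PID>/status (or /proc/<PID>/task/<TID>/status for a specific thread) would give one row of such a table per sample.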

There is no strace output related to that change:

java(7527)-+-{JIT Compilation}(7548)
           ...
           |-{JIT Compilation}(7555)
           |-{JIT IProfiler}(7557)
           |-{JIT Sampler}(7556)
           |-{Signal Reporter}(7538)
           `-{main}(7530)



[pid  7530 {main}] 15:57:32.264961 rt_sigaction(SIGTRAP, NULL, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=0}, 8) = 0
[pid  7530 {main}] 15:57:32.265027 rt_sigaction(SIGTRAP, {sa_handler=0x7fff9c7835c0, sa_mask=[], sa_flags=SA_RESTART|SA_NODEFER|SA_SIGINFO}, NULL, 8) = 0

[pid  7530 {main}] 15:57:33.084218 --- SIGTRAP {si_signo=SIGTRAP, si_code=TRAP_BRKPT, si_pid=1929882744, si_uid=32767} ---
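
The two rt_sigaction calls in the trace above (a query with a NULL new action, then an install with SA_RESTART|SA_NODEFER|SA_SIGINFO and an empty sa_mask) correspond roughly to the following C sketch. This is not the JIT's or OMR's code; install_sigtrap_handler and on_sigtrap are hypothetical names used only for illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <string.h>

static volatile sig_atomic_t trap_count = 0;

/* Hypothetical handler: it only counts deliveries. */
static void on_sigtrap(int signo, siginfo_t *info, void *context)
{
    (void)signo;
    (void)info;
    (void)context;
    trap_count++;
}

/*
 * Query the current SIGTRAP disposition, then install a SA_SIGINFO handler
 * with the same flags and empty mask seen in the trace.
 * Returns 0 on success, -1 on failure.
 */
static int install_sigtrap_handler(void (*handler)(int, siginfo_t *, void *),
                                   struct sigaction *previous)
{
    assert(handler != NULL && previous != NULL);

    /* First call in the trace: query only (the new action is NULL). */
    if (sigaction(SIGTRAP, NULL, previous) != 0) {
        return -1;
    }

    /* Second call in the trace: install the handler. */
    struct sigaction action;
    memset(&action, 0, sizeof action);
    action.sa_sigaction = handler;
    sigemptyset(&action.sa_mask);
    action.sa_flags = SA_RESTART | SA_NODEFER | SA_SIGINFO;
    return sigaction(SIGTRAP, &action, NULL);
}
```

A later sigaction(SIGTRAP, NULL, &cur) query is one way to confirm from inside the process whether the handler is still registered, which is the property the SigCgt column reports from outside.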

@gita-omr
Contributor

gita-omr commented Jan 4, 2021

@rmnattas is looking at it.

@rmnattas
Contributor

rmnattas commented Jan 4, 2021

Adding an OR condition, || portLibrarySignalNo == OMRPORT_SIG_FLAG_SIGTRAP, to line [1], which allows SIGTRAP to get unblocked, avoids the problem (the test finishes without crashing). For some reason, when SIGTRAP is unblocked (only in the main thread), the handler does not get unset and the signals are received by the JVM rather than the OS.

[1]
https://github.com/eclipse/openj9-omr/blob/d19d2fff4fab1bc47f7288c029793468969e9ef3/port/unix/omrsignal.c#L1278

@rmnattas
Contributor

rmnattas commented Jan 5, 2021

@babsingh Is there any reason why a change in a signal mask/handler, like the last row change in #10744 (comment), would happen in OpenJ9/OMR? No related strace output shows up, which could mean the change comes from something outside the java process and its threads.

@klangman mentioned that it could be a PPCle kernel bug. I agree that it's a plausible theory, given the strange circumstances: the signal mask/handler changes without the java process or its threads being seen making the change. I'm trying to test that with a C process.

Also, if it's urgent, I think the change in my last comment could serve as a temporary fix. Allowing synchronous signals (or maybe just SIGTRAP) to get unblocked under registerSignalHandlerWithOS should satisfy "a signal should not be unblocked if it is not used i.e. a signal handler is not registered for it."

rmnattas added a commit to rmnattas/omr that referenced this issue Jan 5, 2021
Allow synchronous signals to get unblocked when
a signal handler is set for them.

Issue: eclipse-openj9/openj9#10744

Signed-off-by: Abdulrahman Alattas <rmnattas@gmail.com>
@babsingh
Contributor Author

babsingh commented Jan 5, 2021

> Any reason why would a change ...

None. I agree that there is an external entity responsible for interfering with the SIGTRAP handler, which is environment specific (RPM + RHEL + PPCLE).

> ... temporary fix ... unblocking SIGTRAP fixes the behaviour of the SIGTRAP handler

As per the Notes in https://man7.org/linux/man-pages/man2/sigprocmask.2.html, side-effects only arise from blocking signals such as SIGBUS, SIGFPE, SIGILL and SIGSEGV. Unblocking signals doesn't have any side-effects. So, we can unblock all signals (synchronous + asynchronous) after registering a handler against the signal in OMR. This can be treated as a permanent fix. This will also add consistency to unblocking signals.

@rmnattas is working on the FIX: https://github.com/rmnattas/omr/commits/unix-signal-unblock. Currently, only asynchronous signals are unblocked. The FIX will unblock ALL signals. The FIX will also unblock after the signal handler is registered in order to avoid receiving signals between unblock and registration.
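
The ordering described above (register the handler first, then unblock the signal) can be sketched in C as follows. This is not the code from the FIX branch; register_then_unblock and record_signal are hypothetical names, and sigprocmask is used on the assumption of a single-threaded process (a multithreaded program would use pthread_sigmask instead).

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <string.h>

static volatile sig_atomic_t handled_signal = 0;

/* Hypothetical handler: records which signal was delivered. */
static void record_signal(int signo, siginfo_t *info, void *context)
{
    (void)info;
    (void)context;
    handled_signal = signo;
}

/*
 * Register the handler first and only then unblock the signal, so the signal
 * cannot be delivered while the default disposition is still in place.
 * Returns 0 on success, -1 on failure.
 */
static int register_then_unblock(int signo, void (*handler)(int, siginfo_t *, void *))
{
    assert(handler != NULL);

    struct sigaction action;
    memset(&action, 0, sizeof action);
    action.sa_sigaction = handler;
    sigemptyset(&action.sa_mask);
    action.sa_flags = SA_SIGINFO;
    if (sigaction(signo, &action, NULL) != 0) {
        return -1;
    }

    sigset_t unblock_set;
    sigemptyset(&unblock_set);
    if (sigaddset(&unblock_set, signo) != 0) {
        return -1;
    }
    /* Assumption: single-threaded caller; otherwise use pthread_sigmask. */
    return sigprocmask(SIG_UNBLOCK, &unblock_set, NULL);
}
```

With SIGTRAP blocked beforehand, calling register_then_unblock(SIGTRAP, record_signal) and then raise(SIGTRAP) should run the handler instead of terminating the process, regardless of the mask the process started with.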

The following tests have been suggested to verify functionality:

  1. SunMiscSignalTest (checks registration and invocation of handlers): https://github.com/ibmruntimes/openj9-openjdk-jdk15/blob/openj9/test/jdk/sun/misc/SunMiscSignalTest.java
  2. SigIntTest (checks shutdown hooks): Investigate JVM signal handling functions #54 (comment)
  3. For checking generation of core files, you can just run sanity+extended functional testing in a personal build.

rmnattas added a commit to rmnattas/omr that referenced this issue Jan 6, 2021
Allow synchronous signals to get unblocked when
a signal handler is set for them.

Issue: eclipse-openj9/openj9#10744

Signed-off-by: Abdulrahman Alattas <rmnattas@gmail.com>
rmnattas added a commit to rmnattas/omr that referenced this issue Jan 11, 2021
Allow synchronous signals to get unblocked when
a signal handler is set for them.

Issue: eclipse-openj9/openj9#10744

Signed-off-by: Abdulrahman Alattas <rmnattas@gmail.com>
@babsingh
Contributor Author

babsingh commented Jan 18, 2021

On a RHEL 8.3 (Ootpa) ppc64le machine, I ran the failing test, #10744 (comment), 100 times using a nightly build (https://ci.eclipse.org/openj9/job/Build_JDK8_ppc64le_linux_Nightly/636/), which has @rmnattas's fix (eclipse-omr/omr#5742).

No failures were seen, which verifies that eclipse-omr/omr#5742 works as a fix. Thus, closing this issue.
