Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

JCL: ensure JNI cache is initialized only once #17800

Merged

Conversation

gireeshpunathil
Copy link
Contributor

@gireeshpunathil gireeshpunathil commented Jul 16, 2023

initIDCache is currently subjected for race condition, and if concurrently entered by multiple threads can result in reference leaks. Streamline the cache population by invoking it from a class static initialiser which is inherently synchronised.

runtime/jcl/common/mgmtthread.c Outdated Show resolved Hide resolved
runtime/jcl/common/mgmtthread.c Outdated Show resolved Hide resolved
runtime/jcl/common/mgmtthread.c Outdated Show resolved Hide resolved
runtime/jcl/common/mgmtthread.c Outdated Show resolved Hide resolved
@gireeshpunathil
Copy link
Contributor Author

@keithc-ca - thanks for the review, all the suggestions have been addressed. regarding squashing of mixup commits - not sure what is customary in this project (squash right away or wait for the review to complete and squash it before landing, or the committer squashes the commits while landing) . pls suggest.

runtime/jcl/common/mgmtthread.c Outdated Show resolved Hide resolved
runtime/jcl/common/mgmtthread.c Outdated Show resolved Hide resolved
@keithc-ca
Copy link
Contributor

My preference is that you squash the commits before this change is merged. You can do it as the change evolves or all at once at the end - either way works for me.

@gireeshpunathil
Copy link
Contributor Author

@keithc-ca - done, addressed review comments and squashed commits into one, pls have a look.

@keithc-ca keithc-ca requested a review from gacholio July 19, 2023 16:49
@keithc-ca
Copy link
Contributor

I'd like a second opinion from @gacholio.

@gacholio
Copy link
Contributor

I don't think any of these changes are actually necessary - the changes are in two static functions that are called only from a JNI native, so looping case is simply not going to happen.

Copy link
Contributor

@gacholio gacholio left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approved for correctness, but I'm still not convinced the extra code provides any benefit.

@gireeshpunathil
Copy link
Contributor Author

import java.lang.management.*;

public class TX {
	public static void main(String args[]) {
		while(true) {
			start();
			collect();
			sleep();
		}
	}
	public static void start() {
		Thread t[] = new Thread[32];
		for(int i=0; i<32; i++) {
			t[i]  = new Thread() {
				public void run() {
					measure();
				}
			};
		}
		for(int i=0; i<32; i++) {
			t[i].start();
		}
		for(int i=0; i<32; i++) {
			try {
				t[i].join();
			} catch(Exception e) {}
		}
	}

	public static void measure() {
           ThreadMXBean tb = ManagementFactory.getThreadMXBean();
           long[] ids = tb.getAllThreadIds();
           ThreadInfo[] info = tb.getThreadInfo(ids, true, true);
	}
	public static void collect() {
		System.gc();
	}
	public static void sleep() {
		try {
			Thread.sleep(1);
		} catch(Exception e) {}
	}
}

I don't think any of these changes are actually necessary - the changes are in two static functions that are called only from a JNI native, so looping case is simply not going to happen.

@gacholio - here is a test case that manifests the leak. You can take dumps and look at the number of jniLocalReferences or their pool counts. The actual test case is much larger, in which case the program makes use of a service loop in the JNI code (not in Java), and frequently invoke leaking functions or its caller. Given the fact that there is no exit into Java, the frames are never auto-cleared.

@gacholio
Copy link
Contributor

Calling VM-provided natives from a C loop is not something we can support - they will change or be removed without notice.

@tajila Comments?

@gireeshpunathil
Copy link
Contributor Author

Calling VM-provided natives from a C loop is not something we can support - they will change or be removed without notice.

sorry, I should have been more explicit. The C loop is calling the public JNI function - in this case Java_openj9_internal_tools_attach_target_DiagnosticUtils_dumpAllThreadsImpl, not the helper function getThreadInfo.

@gacholio
Copy link
Contributor

My comment stands - this is not a dependable way to use the VM.

@gireeshpunathil
Copy link
Contributor Author

My comment stands - this is not a dependable way to use the VM.

ok, fair point, let us focus only on the above test case then.

@gacholio
Copy link
Contributor

Oh, the java test leaks as well?

@gireeshpunathil
Copy link
Contributor Author

Oh, the java test leaks as well?

yes. reviewing my original use case, I think it is slightly different:

  • [1] a C service loop calls a Java method java.lang.management.ThreadMXBean.getThreadInfo through JNI semantics
  • [2] this java method internally invokes the JNI method getThreadInfoImpl
  • [3] this leverages the private static helper functiongetThreadInfo to get the information

looks like the local references are not cleared when the JNI returns to immediate calling Java frame [2], instead it expects all the native frames (including [1]) to be unwound?

I can share my test case if need be.

@gacholio
Copy link
Contributor

That case makes a lot more sense, though I can't see a reason why the references aren't automatically freed by the native return (there's nothing special about that native).

@gacholio
Copy link
Contributor

And yes, please do share your testcase - maybe there's something unexpected there that's exposing this issue.

@gireeshpunathil
Copy link
Contributor Author

gireeshpunathil commented Jul 20, 2023

java -Xhealthcenter -Dcom.ibm.java.diagnostics.healthcenter.thread.collection.interval=1 -Dcom.ibm.diagnostics.healthcenter.data.threads=on <any long running app>

produced ~2k leak in a day's time for me. the leak can be seen through walking through the memory pools attached with the vmthread->jniLocalReferences.

edit: the leaking objects found were of type java/lang/StackTraceElement

@gacholio
Copy link
Contributor

Stepping through the JNI code performs as expected:

2300				J9SFJNINativeMethodFrame *nativeMethodFrame = recordJNIReturn(REGISTER_ARGS, bp);
(gdb) p _currentThread->jniLocalReferences
$12 = (UDATA *) 0x7f2e6c003ea8
(gdb) next
2301				_currentThread->jitStackFrameFlags = nativeMethodFrame->specialFrameFlags & J9_SSF_JIT_NATIVE_TRANSITION_FRAME;
(gdb) p _currentThread->jniLocalReferences
$13 = (UDATA *) 0x0

@gacholio
Copy link
Contributor

You mention java/lang/StackTraceElement - is it instances of references to the class? initIDCache can leak global refs if it's run concurrently in multiple threads. One of those global refs is to the java/lang/StackTraceElement class.

@gireeshpunathil
Copy link
Contributor Author

Stepping through the JNI code performs as expected:

ok - so you mean the jniLocalReferences is fully cleared when we return back to Java? that is great!

You mention java/lang/StackTraceElement - is it instances of references to the class?

Yes, those are class references, not object references. So is it root caused by a different issue?

@gireeshpunathil
Copy link
Contributor Author

I guess I addressed all the review comments, pls have a look @gacholio @keithc-ca !

@gireeshpunathil gireeshpunathil changed the title jcl: remove dangling local references icl: streamline JNI handle cache population Aug 3, 2023
runtime/jcl/common/mgmtthread.c Outdated Show resolved Hide resolved
runtime/jcl/common/mgmtthread.c Outdated Show resolved Hide resolved
runtime/jcl/common/mgmtthread.c Outdated Show resolved Hide resolved
@keithc-ca keithc-ca changed the title icl: streamline JNI handle cache population JCL: streamline JNI handle cache population Aug 3, 2023
@gireeshpunathil gireeshpunathil force-pushed the fix-jni-local-ref-leak branch 2 times, most recently from 02ce70d to a22141f Compare August 7, 2023 11:53
@gacholio
Copy link
Contributor

gacholio commented Aug 8, 2023

Back to you @keithc-ca

@gireeshpunathil gireeshpunathil force-pushed the fix-jni-local-ref-leak branch 2 times, most recently from d6a9bd2 to 6094ead Compare August 8, 2023 03:33
goto initIDCache_fail;
}
if (!(gcls = (*env)->NewGlobalRef(env, cls))) {
err = JNI_ENOMEM;
goto initIDCache_fail;
}
(*env)->DeleteLocalRef(env, cls);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please remove all of these from this function.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done!

@gacholio
Copy link
Contributor

gacholio commented Aug 8, 2023

I think the PR title and description should be updated - the history of this change isn't really relevant.

@gireeshpunathil
Copy link
Contributor Author

I think the PR title and description should be updated - the history of this change isn't really relevant.

the title is already modified to reflect the current change set. pls let me know if you think it can be improved.

the description is changed to reflect the current change set, the history is removed.

@gacholio
Copy link
Contributor

gacholio commented Aug 8, 2023

I question "streamline" when we're actually fixing a leak problem. Something more like "ensure JNI cache is initialized only once".

@gireeshpunathil gireeshpunathil changed the title JCL: streamline JNI handle cache population JCL: ensure JNI cache is initialized only once Aug 9, 2023
@keithc-ca
Copy link
Contributor

Please update the commit message as well: this isn't a change to "streamline" anything.

initIDCache is currently subjected for race condition,
and if concurrently entered by multiple threads can
result in reference leaks. Fix this by introducing a
wrapper class that acts as a bridge, invoking the method
from the static initializer of the wrapper class, so that
the synchronization is taken care through class initializer.
@gireeshpunathil
Copy link
Contributor Author

@keithc-ca - done

@keithc-ca
Copy link
Contributor

Jenkins test sanity amac jdk17

@keithc-ca keithc-ca merged commit 1f59317 into eclipse-openj9:master Aug 14, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants