
[BUG] OpenSearch 2.0 possible JIT Compiler memory leak or related issue #2791

Closed
dbbaughe opened this issue Apr 6, 2022 · 18 comments
Labels: bug (Something isn't working), Severity-Critical, v2.0.0 (Version 2.0.0)


dbbaughe commented Apr 6, 2022

Describe the bug
When running the Index Management Dashboards cypress tests against an OpenSearch 2.0 cluster, the tests all pass, but if I leave the cluster up for 0-10 minutes then at some point the memory usage starts to spike. More specifically, the Compiler and Arena Chunk memory grow incredibly fast until my entire instance runs out of memory and the OS kills the process.

I haven't tried this with other plugins yet, or with OpenSearch by itself, so I can't say with 100% certainty that the problem is purely in core and not something in Index Management or Job Scheduler. So far, while debugging, after the Index Management tests ran I deleted the .opendistro-ism-config index (which is where all the IM jobs are located, so deleting it deschedules anything IM related), and the issue still happened after a few minutes. That, along with the major changes done in core versus the minimal ones in Index Management, makes me think it's currently something in core, but I'm still trying to root cause it. If no other plugin is able to replicate this at all, then perhaps it is something in IM after all.

Here's the NMT baseline:
https://gist.github.com/dbbaughe/b3eb7457e8380f39db7cfa13ae22c5e7

Here's a diff taken after memory had been spiking, about a minute before going OOM:
https://gist.github.com/dbbaughe/85975886d291e8edc0c63bdf85247b72

Went from 5GB committed to 50GB committed, with a 32GB increase in Compiler and 16GB increase in Arena Chunk.

This has happened on my dev machine using JDK 11 with the Log4j hotpatch disabled.

dmesg:

[535702.922079] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/user.slice,task=java,pid=30080,uid=6606827
[535702.931813] Out of memory: Killed process 30080 (java) total-vm:70035316kB, anon-rss:60797644kB, file-rss:0kB, shmem-rss:20kB, UID:6606827 pgtables:120164kB oom_score_adj:0
[535704.307968] oom_reaper: reaped process 30080 (java), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

To Reproduce
Steps to reproduce the behavior (see further below for updated, shortened steps):

  1. Set the JDK to 11.
  2. Pull down Index Management.
    2.1 While on the main branch, run ./gradlew run and make sure JDK 11 is being used.
    2.2 If you'd like to see NMT info, use ./gradlew run -Dtests.jvm.argline="-XX:NativeMemoryTracking=summary" or ./gradlew run -Dtests.jvm.argline="-XX:NativeMemoryTracking=detail".
    2.3 While the cluster is running you can execute jcmd <pid> VM.native_memory summary or jcmd <pid> VM.native_memory detail to see the information, and jcmd <pid> VM.native_memory baseline to take a baseline that you can later compare against with jcmd <pid> VM.native_memory summary.diff or jcmd <pid> VM.native_memory detail.diff (see the sketch after these steps).
  3. Pull down OpenSearch Dashboards and make sure you're on the 2.0 version (as of my testing that was the main branch).
    3.0 Note: I specifically had the backend running on my dev machine and Dashboards running on my Mac, with an ssh tunnel to my dev machine so they could communicate, i.e. ssh -L 9200:localhost:9200 <dev host>
    3.1 cd into the plugins directory and pull down Index Management Dashboards.
    3.2 Make sure your node version is set to 14.18.2 (nvm use 14.18.2).
    3.3 Run yarn osd bootstrap while inside the Index Management Dashboards directory.
    3.4 Once that's done, from the OpenSearch Dashboards directory run yarn start --no-base-path --no-watch
    3.5 Once OpenSearch Dashboards is up, run yarn run cypress run from the Index Management Dashboards plugin to run the cypress test suite.
    3.6 After the cypress tests are done, it usually takes anywhere from 0 to 10 minutes until memory starts spiking, and the process goes OOM after a short while. During this time you can run the jcmd commands above to see if the committed memory is growing.
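
For reference, a minimal sketch of the NMT workflow from step 2.3, assuming the cluster was started with one of the arglines from step 2.2 and that <pid> is a placeholder for the OpenSearch java process id:

# Take a baseline while memory usage still looks normal
jcmd <pid> VM.native_memory baseline

# Once committed memory starts climbing, diff against that baseline
jcmd <pid> VM.native_memory summary.diff

# With -XX:NativeMemoryTracking=detail, the diff also breaks usage down per callsite
jcmd <pid> VM.native_memory detail.diff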

I'm trying to see if I can scope the reproduction steps down to just some backend API calls.
Leaving the cluster up by itself with no calls to it did not eventually cause it to go OOM.

It looks like I was also able to reproduce this by just running our backend integration tests against the cluster:

[599373.011209] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/user.slice,task=java,pid=23132,uid=6606827
[599373.022039] Out of memory: Killed process 23132 (java) total-vm:68547404kB, anon-rss:59905372kB, file-rss:0kB, shmem-rss:20kB, UID:6606827 pgtables:118544kB oom_score_adj:0
[599374.185090] oom_reaper: reaped process 23132 (java), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

So the shortened steps would be steps 1 and 2 above, and then just:
./gradlew integTestRemote -Dtests.rest.cluster=localhost:9200 -Dtests.cluster=localhost:9200 -Dtests.clustername=integTest
Run that from another terminal in the same directory (Index Management).
This run wasn't able to make it through the IM tests. I believe we do have CI running in Index Management; checking it, there looks to be a recent failure from 17 hours ago for a commit merged to main:
https://github.com/opensearch-project/index-management/actions/runs/2099364761

Expected behavior
Not going OOM

Plugins
Index Management / Job Scheduler

dbbaughe added the bug (Something isn't working) and untriaged labels on Apr 6, 2022
dblock added the v2.0.0 (Version 2.0.0) and Severity-Critical labels and removed the untriaged label on Apr 6, 2022

nknize commented Apr 6, 2022

Just out of curiosity: did you try setting -XX:MaxDirectMemorySize to something reasonable (e.g., 40GB)? It will limit the direct memory usage. A few other things that come to mind:

-Djdk.nio.maxCachedBufferSize will limit the memory used by the temporary buffer cache. This should keep netty from eating direct memory through its direct memory buffers and force reuse of existing buffers.
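
For anyone trying this, a rough sketch of how those flags could be passed through the same ./gradlew run convention from the reproduction steps; the values here are arbitrary placeholders rather than recommendations:

# Cap direct memory and the temporary direct-buffer cache on the test cluster JVM (placeholder values)
./gradlew run -Dtests.jvm.argline="-XX:MaxDirectMemorySize=4g -Djdk.nio.maxCachedBufferSize=262144 -XX:NativeMemoryTracking=summary"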

Were you able to see anything in the slow logs (e.g., "index.search.slowlog.level", "index.search.slowlog.threshold.query.trace": "0s", "index.indexing.slowlog.level", etc)?
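
In case it helps, a hedged example of enabling those slow logs on a single index through the dynamic index settings API; my-index is a placeholder, and 0s trace thresholds will log every operation:

# Turn on trace-level search and indexing slow logs for one index
curl -XPUT "localhost:9200/my-index/_settings" -H 'Content-Type: application/json' -d'
{
  "index.search.slowlog.level": "trace",
  "index.search.slowlog.threshold.query.trace": "0s",
  "index.indexing.slowlog.level": "trace",
  "index.indexing.slowlog.threshold.index.trace": "0s"
}'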

Your profiling logs from before testing to right before the crash showed the majority of direct memory consumption in the JIT compiler. This is likely just a side effect of some other issue (e.g., a job thread not terminating), but for anyone wanting to dig further into the JVM, this sounds like an offending lambda or functional interface, similar to this StackOverflow discussion.

-                  Compiler (reserved=6472956KB, committed=6472956KB)
                            (malloc=1167KB #1891) 
                            (arena=6471789KB #10)
vs
-                  Compiler (reserved=34817884KB, committed=34817884KB)
                            (malloc=1180KB #1832) 
                            (arena=34816704KB #11)


nknize commented Apr 6, 2022

Looks like Index Management also depends on job scheduler? I, admittedly, know zero about the job scheduler. Are there trace logs that can be enabled here to ensure jobs aren't going rogue or zombie and continuing after all tests "finish" despite dashboards reporting otherwise?

I'm also curious if Kotlin is doing some JIT funniness here. Did this behavior occur in 1.3? Does the Kotlin compiler try to be smart and re-interpret things into Java functional interfaces or lambdas that might be doing something funny, leading to the compiler direct memory explosion?

/cc @rmuir @uschindler enjoy digging into the jvm and may have some thoughts


dbbaughe commented Apr 6, 2022

Looks like Index Management also depends on job scheduler? I, admittedly, know zero about the job scheduler. Are there trace logs that can be enabled here to ensure jobs aren't going rogue or zombie and continuing after all tests "finish" despite dashboards reporting otherwise?

I'm also curious if Kotlin is doing some JIT funniness here. Did this behavior occur in 1.3? Does the Kotlin compiler try to be smart and re-interpret things into Java functional interfaces or lambdas that might be doing something funny, leading to the compiler direct memory explosion?

/cc @rmuir @uschindler enjoy digging into the jvm and may have some thoughts

Yes, it uses job scheduler. I did add logs at one point to confirm no jobs were continuously running on their own, but if there was some method inside of a job that somehow caused it then I could have missed it.

I did enable trace logs the other day while looking; here are two separate occurrences and the last 5000 logs before the node died:
TRACE logs 1
TRACE logs 2

Yep, IM is using Kotlin. I would like to see whether any other plugins are affected by this at all if they run their tests against a cluster and then don't kill the cluster.

We have run more tests and so far have only been able to repro on JDK 11, not JDK 17 or 14.

I tested once on JDK 11 for 1.3 and did not see it, and also have never seen this specific error previously in general.

uschindler commented:

Did you try the latest 11.0.14.1 release or an older one? Also, I hope the garbage collector is the default one and not any of the new ones!


dbbaughe commented Apr 6, 2022

Did you try the latest 11.0.14.1 release or an older one? Also, I hope the garbage collector is the default one and not any of the new ones!
Using OpenJDK
JDK Version : 11 (OpenJDK)
JAVA_HOME : /local/home/dbbaughe/jdk-11.0.2

Is 11.0.2 newer than 11.0.14.1 or is 11.0.14.1 for a different distribution?

I see JVM arg:
-XX:+UseConcMarkSweepGC

That being said, I am not passing in any new args or customizing anything on my end; this is the same ./gradlew run command that has been used since the 1.x and ODFE versions, so if something changed with those args, it would mean something possibly changed in the buildSrc in core.

uschindler commented:

11.0.2 is stone aged. 😆

11.0.14.1 is the newest one. The 1 at the end is because of a bugfix for a regression in 11.0.14.


dbbaughe commented Apr 6, 2022

Ah I think I had a brainfart moment and put a dot between the 14 and figured 2 was greater than 1.. lol

We can try testing with a later version.


downsrob commented Apr 6, 2022

I was running these tests on JDK 11.0.2 on three different ec2 boxes and I was able to reproduce the failure fairly consistently on all of them. When running ./gradlew integTestRemote -Dtests.rest.cluster=localhost:9200 -Dtests.cluster=localhost:9200 -Dtests.clustername=integTest --tests "org.opensearch.indexmanagement.transform.TransformRunnerIT.*" on repeat, I was typically able to reproduce this failure in around five minutes.

After upgrading to 11.0.14 I have not yet been able to reproduce the OOM failure while looping the same set of tests for over half an hour. I am going to keep trying for a bit longer and with the full test suite to see if I am able to reproduce the OOM failure in any way on 11.0.14


dbbaughe commented Apr 6, 2022

If it does turn out to be fixed w/ newer JDK 11 version.. wondering what our recourse here is.
I had just gone to https://jdk.java.net/11/ which pointed me to the OpenJDK Archive and the latest JDK 11 there is 11.0.2 which I figured was the newest one. I would assume others would possibly be doing the same... unless I just went to the complete wrong resource to try and grab a JDK 11 download.

Do we just release OS 2.0 potentially w/ a caveat that says only JDK 11.x.x.x and up is supported?
@dblock @nknize
Ideally we could still try to track down what the issue is.. but if it's something in the JDK that fixes itself then it might not be worth the effort.

Edit: Seems like Nick mentioned there's a minimum jdk variable we can set so it's more enforced.


dblock commented Apr 6, 2022

If it does turn out to be fixed w/ newer JDK 11 version.. wondering what our recourse here is. I had just gone to https://jdk.java.net/11/ which pointed me to the OpenJDK Archive and the latest JDK 11 there is 11.0.2 which I figured was the newest one. I would assume others would possibly be doing the same... unless I just went to the complete wrong resource to try and grab a JDK 11 download.

We should probably point people to https://adoptium.net/temurin/archive ... but we also know that OpenSearch works well with other distributions such as Corretto.

Do we just release OS 2.0 potentially w/ a caveat that says only JDK 11.x.x.x and up is supported? @dblock @nknize Ideally we could still try to track down what the issue is.. but if it's something in the JDK that fixes itself then it might not be worth the effort.

I think it's not worth the effort. I recommend closing this issue.

If you think something is worth saying, add to https://opensearch.org/docs/latest/opensearch/install/compatibility/?


uschindler commented Apr 7, 2022

Hi,

We should probably point people to https://adoptium.net/temurin/archive ... but we also know that OpenSearch works well with other distributions such as Corretto.

Generally: yes. I would always tell users to go and use the latest Adoptium/Temurin JDK. Those shipped with Ubuntu's regular updates are also good (Ubuntu 20.04 LTS has 11.0.14.1). Corretto is also fine! Actually there are no real differences between those variants; it is just the original OpenJDK source code built with custom build scripts and some tools added on top.

For Corretto there also seems to be an included "log4j patcher agent" (not 100% sure whether it is bundled, haven't checked), but it does not work well with the security manager as used by OpenSearch, so I would not recommend enabling it.

The problem with the OpenJDK home page: they explicitly state on https://jdk.java.net/11/ that it is end of life and you should not use it in production: "The JDK 11 Early Access Program has concluded. Please visit Java SE Downloads for production ready builds. Older releases, which do not include the most up to date security vulnerability fixes and are no longer recommended for use in production, remain available in the OpenJDK Archive."

If OpenSearch ships prebuilt binaries/images like Elasticsearch does, they should for sure contain the latest OpenJDK version provided by Corretto or Adoptium/Temurin. The README should maybe give instructions about where to get a version. I would simply say: use Java 11, but make sure it is at least 11.0.x (x = 14), the one out on the date of release and the one we tested with.

uschindler commented:

I think it's not worth the effort. I recommend closing this issue.

Yes. This is not a problem in OpenSearch; it is just a buggy runtime. We may improve the documentation, but you will always find people using some crazy JDK versions rotting on their machines. This is why I generally recommend asking users to first update to the latest version before submitting bug reports like this one. Maybe add this to the bug report template.


uschindler commented Apr 7, 2022

Edit: Seems like Nick mentioned there's a minimum jdk variable we can set so it's more enforced.

Yes, this can be done. If JDK 8 is no longer the minimum requirement for OpenSearch, we can for sure use the new Java 9 feature to detect and compare versions: https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/lang/Runtime.html#version()

This allows doing the startup checks. Parsing system properties is risky, but using this class makes it very easy to maintain denylists of versions or enforce minimum versions. In Apache Lucene we will also change to this soon.

I am not sure if I am allowed to mention this here, but the other fork named "Elasticsearch" already implemented support for the java.lang.Runtime.Version class and replaced the risky manual parsing: elastic/elasticsearch#85361


dblock commented Apr 7, 2022

I'll close this.

@uschindler Care to open a feature request describing startup checks? OpenSearch 2.0 is min JDK 11 so that can work.

dblock closed this as completed on Apr 7, 2022
uschindler commented:

I'll close this.

@uschindler Care to open a feature request describing startup checks? OpenSearch 2.0 is min JDK 11 so that can work.

Will do when at home. The version detection stuff is already available (class JavaVersion), but it should be removed and replaced by Runtime.Version.

uschindler commented:

I opened: #2822


reta commented Apr 18, 2022

Interesting, could it be related to https://bugs.openjdk.java.net/browse/JDK-8259541 ? @dbbaughe out of curiosity, did you enable JVM compiler logging to see what is going on? Thank you.
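
For anyone who retries this on 11.0.2, a hedged sketch of how JIT compiler logging could be enabled on the test cluster, reusing the argline convention from the reproduction steps (standard HotSpot diagnostic flags; the log path is a placeholder):

# Print JIT compilations to stdout and write the detailed compilation log to a file
./gradlew run -Dtests.jvm.argline="-XX:+PrintCompilation -XX:+UnlockDiagnosticVMOptions -XX:+LogCompilation -XX:LogFile=/tmp/hotspot-compilation.log"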

dbbaughe commented:

Interesting, could it be related to https://bugs.openjdk.java.net/browse/JDK-8259541 ? @dbbaughe out of curiosity, did you enable JVM compiler logging to see what is going on? Thank you.

Yeah, I did, but I was only cross-checking against other similar issues that others had reported. I didn't have enough expertise with those compiler logs to be able to find something on my own without spending a significant amount of time.
