
[BUG] OpenSearch 2.0 possible JIT Compiler memory leak or related issue #2791

Closed
dbbaughe opened this issue Apr 6, 2022 · 18 comments
Labels: bug (Something isn't working), Severity-Critical, v2.0.0 (Version 2.0.0)


dbbaughe commented Apr 6, 2022

Describe the bug
When running the Index Management Dashboards cypress tests against an OpenSearch 2.0 cluster, the tests all pass, but if I leave the cluster up for 0-10 minutes then at some point the memory usage starts to spike. More specifically, the Compiler and Arena Chunk memory grow incredibly fast until my entire instance runs out of memory and the OS kills the process.

I haven't tried this with other plugins yet, or with OpenSearch by itself, so I can't say with 100% certainty that the problem is purely in core and not something in Index Management or Job Scheduler. So far, while debugging, after the Index Management tests ran I deleted the .opendistro-ism-config index (which is where all the IM jobs are located, so deleting it deschedules anything IM related), and the issue still happened after a few minutes. That, along with the major changes done in core versus the minimal ones in Index Management, makes me think it's currently something in core, but I'm still trying to root cause it. If no other plugin is able to replicate this at all, then perhaps it is something in IM after all.

Here's the NMT baseline:
https://gist.github.com/dbbaughe/b3eb7457e8380f39db7cfa13ae22c5e7

Here's a diff taken after memory had been spiking, about a minute before going OOM:
https://gist.github.com/dbbaughe/85975886d291e8edc0c63bdf85247b72

Went from 5GB committed to 50GB committed, with a 32GB increase in Compiler and 16GB increase in Arena Chunk.

This has happened on my dev machine using JDK 11 with the Log4j hotpatch disabled.

dmesg:

[535702.922079] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/user.slice,task=java,pid=30080,uid=6606827
[535702.931813] Out of memory: Killed process 30080 (java) total-vm:70035316kB, anon-rss:60797644kB, file-rss:0kB, shmem-rss:20kB, UID:6606827 pgtables:120164kB oom_score_adj:0
[535704.307968] oom_reaper: reaped process 30080 (java), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

To Reproduce
Steps to reproduce the behavior (see further below for updated, shortened steps):

  1. Set the JDK to 11.
  2. Pull down Index Management.
    2.1 While on the main branch, run ./gradlew run and make sure JDK 11 is being used.
    2.2 If you'd like to see NMT info, use ./gradlew run -Dtests.jvm.argline="-XX:NativeMemoryTracking=summary" or ./gradlew run -Dtests.jvm.argline="-XX:NativeMemoryTracking=detail".
    2.3 While the cluster is running you can execute jcmd <pid> VM.native_memory summary or jcmd <pid> VM.native_memory detail to see the information, and jcmd <pid> VM.native_memory baseline to take a baseline that you can later compare against with jcmd <pid> VM.native_memory summary.diff or jcmd <pid> VM.native_memory detail.diff (see the sketch after these steps).
  3. Pull down OpenSearch Dashboards and make sure you're on the 2.0 version (as of my testing that was the main branch).
    3.0 Note: I specifically had the backend running on my dev machine and Dashboards running on my Mac, with an ssh tunnel to my dev machine so they could communicate, i.e. ssh -L 9200:localhost:9200 <dev host>
    3.1 cd into the plugins directory and pull down Index Management Dashboards.
    3.2 Make sure your node version is set to 14.18.2 (nvm use 14.18.2).
    3.3 Run yarn osd bootstrap while inside the Index Management Dashboards directory.
    3.4 Once that's done, from the OpenSearch Dashboards directory run yarn start --no-base-path --no-watch
    3.5 Once OpenSearch Dashboards is up, run yarn run cypress run from the Index Management Dashboards plugin to run the cypress test suite.
    3.6 After the cypress tests are done, it usually takes anywhere from 0 to 10 minutes until memory starts spiking, and the process goes OOM after a short while. During this time you can run the jcmd commands above to see if the committed memory is growing.
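
For reference, a minimal sketch of the NMT workflow from step 2.3, assuming the cluster was started with one of the arglines from step 2.2 and that <pid> is a placeholder for the OpenSearch java process id:

# Take a baseline while memory usage still looks normal
jcmd <pid> VM.native_memory baseline

# Once committed memory starts climbing, diff against that baseline
jcmd <pid> VM.native_memory summary.diff

# With -XX:NativeMemoryTracking=detail, the diff also breaks usage down per callsite
jcmd <pid> VM.native_memory detail.diff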

I'm trying to see if I can scope the reproduction steps down to just some backend API calls.
Leaving the cluster up by itself with no calls to it did not eventually cause it to go OOM.

It looks like I was also able to reproduce this by just running our backend integration tests against the cluster:

[599373.011209] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/user.slice,task=java,pid=23132,uid=6606827
[599373.022039] Out of memory: Killed process 23132 (java) total-vm:68547404kB, anon-rss:59905372kB, file-rss:0kB, shmem-rss:20kB, UID:6606827 pgtables:118544kB oom_score_adj:0
[599374.185090] oom_reaper: reaped process 23132 (java), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

So the shortened steps would be steps 1 and 2 above, and then just:
./gradlew integTestRemote -Dtests.rest.cluster=localhost:9200 -Dtests.cluster=localhost:9200 -Dtests.clustername=integTest
Run that from another terminal in the same directory (Index Management).
This run wasn't able to make it through the IM tests. I believe we do have CI running in Index Management; checking it, there looks to be a recent failure from 17 hours ago for a commit merged to main:
https://github.com/opensearch-project/index-management/actions/runs/2099364761

Expected behavior
Not going OOM

Plugins
Index Management / Job Scheduler

dbbaughe added the bug (Something isn't working) and untriaged labels on Apr 6, 2022
dblock added the v2.0.0 (Version 2.0.0) and Severity-Critical labels and removed the untriaged label on Apr 6, 2022

nknize commented Apr 6, 2022

Just out of curiosity: did you try setting -XX:MaxDirectMemorySize to something reasonable (e.g., 40GB)? It will limit the direct memory usage. A few other things that come to mind:

-Djdk.nio.maxCachedBufferSize will limit the memory used by the temporary buffer cache. This should keep netty from eating direct memory through its direct memory buffers and force reuse of existing buffers.
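
For anyone trying this, a rough sketch of how those flags could be passed through the same ./gradlew run convention from the reproduction steps; the values here are arbitrary placeholders rather than recommendations:

# Cap direct memory and the temporary direct-buffer cache on the test cluster JVM (placeholder values)
./gradlew run -Dtests.jvm.argline="-XX:MaxDirectMemorySize=4g -Djdk.nio.maxCachedBufferSize=262144 -XX:NativeMemoryTracking=summary"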

Were you able to see anything in the slow logs (e.g., "index.search.slowlog.level", "index.search.slowlog.threshold.query.trace": "0s", "index.indexing.slowlog.level", etc)?
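
In case it helps, a hedged example of enabling those slow logs on a single index through the dynamic index settings API; my-index is a placeholder, and 0s trace thresholds will log every operation:

# Turn on trace-level search and indexing slow logs for one index
curl -XPUT "localhost:9200/my-index/_settings" -H 'Content-Type: application/json' -d'
{
  "index.search.slowlog.level": "trace",
  "index.search.slowlog.threshold.query.trace": "0s",
  "index.indexing.slowlog.level": "trace",
  "index.indexing.slowlog.threshold.index.trace": "0s"
}'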

Your profiling logs from before testing to right before the crash showed the majority of direct memory consumption in the JIT compiler. This is likely just a side effect of some other issue (e.g., a job thread not terminating), but for anyone wanting to dig further into the JVM, this sounds like an offending lambda or functional interface, similar to this StackOverflow discussion.

-                  Compiler (reserved=6472956KB, committed=6472956KB)
                            (malloc=1167KB #1891) 
                            (arena=6471789KB #10)
vs
-                  Compiler (reserved=34817884KB, committed=34817884KB)
                            (malloc=1180KB #1832) 
                            (arena=34816704KB #11)


nknize commented Apr 6, 2022

Looks like Index Management also depends on job scheduler? I, admittedly, know zero about the job scheduler. Are there trace logs that can be enabled here to ensure jobs aren't going rogue or zombie and continuing after all tests "finish" despite dashboards reporting otherwise?

I'm also curious if Kotlin is doing some JIT funniness here. Did this behavior occur in 1.3? Does the Kotlin compiler try to be smart and re-interpret things into Java functional interfaces or lambdas that might be doing something funny, leading to the compiler direct memory explosion?

/cc @rmuir @uschindler enjoy digging into the jvm and may have some thoughts


dbbaughe commented Apr 6, 2022

Looks like Index Management also depends on job scheduler? I, admittedly, know zero about the job scheduler. Are there trace logs that can be enabled here to ensure jobs aren't going rogue or zombie and continuing after all tests "finish" despite dashboards reporting otherwise?

I'm also curious if Kotlin is doing some JIT funniness here. Did this behavior occur in 1.3? Does the Kotlin compiler try to be smart and re-interpret things into Java functional interfaces or lambdas that might be doing something funny, leading to the compiler direct memory explosion?

/cc @rmuir @uschindler enjoy digging into the jvm and may have some thoughts

Yes, it uses job scheduler. I did add logs at one point to confirm no jobs were continuously running on their own, but if there was some method inside of a job that somehow caused it then I could have missed it.

I did enable trace logs the other day while looking; here are two separate occurrences and the last 5000 logs before the node died:
TRACE logs 1
TRACE logs 2

Yep, IM is using Kotlin. I would like to see whether any other plugins are affected by this at all if they run their tests against a cluster and then don't kill the cluster.

We have run more tests and so far have only been able to repro on JDK 11, not JDK 17 or 14.

I tested once on JDK 11 for 1.3 and did not see it, and also have never seen this specific error previously in general.

uschindler commented:

Did you try the latest 11.0.14.1 release or an older one? Also, I hope the garbage collector is the default one and not any of the new ones!


dbbaughe commented Apr 6, 2022

Did you try the latest 11.0.14.1 release or an older one? Also, I hope the garbage collector is the default one and not any of the new ones!
Using OpenJDK
JDK Version : 11 (OpenJDK)
JAVA_HOME : /local/home/dbbaughe/jdk-11.0.2

Is 11.0.2 newer than 11.0.14.1 or is 11.0.14.1 for a different distribution?

I see JVM arg:
-XX:+UseConcMarkSweepGC

That being said, I am not passing in any new args or customizing anything on my end; this is the same ./gradlew run command that has been used since the 1.x and ODFE versions, so if something changed with those args, it would mean something possibly changed in the buildSrc in core.

uschindler commented:

11.0.2 is stone aged. 😆

11.0.14.1 is the newest one. The 1 at the end is because of a bugfix for a regression in 11.0.14.


dbbaughe commented Apr 6, 2022

Ah I think I had a brainfart moment and put a dot between the 14 and figured 2 was greater than 1.. lol

We can try testing with a later version.


downsrob commented Apr 6, 2022

I was running these tests on JDK 11.0.2 on three different ec2 boxes and I was able to reproduce the failure fairly consistently on all of them. When running ./gradlew integTestRemote -Dtests.rest.cluster=localhost:9200 -Dtests.cluster=localhost:9200 -Dtests.clustername=integTest --tests "org.opensearch.indexmanagement.transform.TransformRunnerIT.*" on repeat, I was typically able to reproduce this failure in around five minutes.

After upgrading to 11.0.14 I have not yet been able to reproduce the OOM failure while looping the same set of tests for over half an hour. I am going to keep trying for a bit longer and with the full test suite to see if I am able to reproduce the OOM failure in any way on 11.0.14


dbbaughe commented Apr 6, 2022

If it does turn out to be fixed w/ newer JDK 11 version.. wondering what our recourse here is.
I had just gone to https://jdk.java.net/11/ which pointed me to the OpenJDK Archive and the latest JDK 11 there is 11.0.2 which I figured was the newest one. I would assume others would possibly be doing the same... unless I just went to the complete wrong resource to try and grab a JDK 11 download.

Do we just release OS 2.0 potentially w/ a caveat that says only JDK 11.x.x.x and up is supported?
@dblock @nknize
Ideally we could still try to track down what the issue is.. but if it's something in the JDK that fixes itself then it might not be worth the effort.

Edit: Seems like Nick mentioned there's a minimum jdk variable we can set so it's more enforced.


dblock commented Apr 6, 2022

If it does turn out to be fixed w/ newer JDK 11 version.. wondering what our recourse here is. I had just gone to https://jdk.java.net/11/ which pointed me to the OpenJDK Archive and the latest JDK 11 there is 11.0.2 which I figured was the newest one. I would assume others would possibly be doing the same... unless I just went to the complete wrong resource to try and grab a JDK 11 download.

We should probably point people to https://adoptium.net/temurin/archive ... but we also know that OpenSearch works well with other distributions such as Corretto.

Do we just release OS 2.0 potentially w/ a caveat that says only JDK 11.x.x.x and up is supported? @dblock @nknize Ideally we could still try to track down what the issue is.. but if it's something in the JDK that fixes itself then it might not be worth the effort.

I think it's not worth the effort. I recommend closing this issue.

If you think something is worth saying, add to https://opensearch.org/docs/latest/opensearch/install/compatibility/?


uschindler commented Apr 7, 2022

Hi,

We should probably point people to https://adoptium.net/temurin/archive ... but we also know that OpenSearch works well with other distributions such as Corretto.

Generally: yes. I would always tell users to go and use the latest Adoptium/Temurin JDK. Those shipped with Ubuntu's regular updates are also good (Ubuntu 20.04 LTS has 11.0.14.1). Corretto is also fine! Actually there are no real differences between those variants; it is just the original OpenJDK source code built with custom build scripts and some tools added on top.

For Corretto there also seems to be an included "log4j patcher agent" (not 100% sure whether it is bundled, haven't checked), but it does not work well with the security manager as used by OpenSearch, so I would not recommend enabling it.

The problem with the OpenJDK home page: they explicitly state on https://jdk.java.net/11/ that it is end of life and you should not use it in production: "The JDK 11 Early Access Program has concluded. Please visit Java SE Downloads for production ready builds. Older releases, which do not include the most up to date security vulnerability fixes and are no longer recommended for use in production, remain available in the OpenJDK Archive."

If OpenSearch ships prebuilt binaries/images like Elasticsearch does, they should for sure contain the latest OpenJDK version provided by Corretto or Adoptium/Temurin. The README should maybe give instructions about where to get a version. I would simply say: use Java 11, but make sure it is at least 11.0.x (x = 14), the one out on the date of release and the one we tested with.

uschindler commented:

I think it's not worth the effort. I recommend closing this issue.

Yes. This is not a problem in OpenSearch; it is just a buggy runtime. We may improve the documentation, but you will always find people using some crazy JDK versions rotting on their machines. This is why I generally recommend asking users to first update to the latest version before submitting bug reports like this one. Maybe add this to the bug report template.


uschindler commented Apr 7, 2022

Edit: Seems like Nick mentioned there's a minimum jdk variable we can set so it's more enforced.

Yes, this can be done. If JDK 8 is no longer the minimum requirement for OpenSearch, we can for sure use the new Java 9 feature to detect and compare versions: https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/lang/Runtime.html#version()

This allows doing the startup checks. Parsing system properties is risky, but using this class makes it very easy to maintain denylists of versions or enforce minimum versions. In Apache Lucene we will also change to this soon.

I am not sure if I am allowed to mention this here, but the other fork named "Elasticsearch" already implemented support for the java.lang.Runtime.Version class and replaced the risky manual parsing: elastic/elasticsearch#85361


dblock commented Apr 7, 2022

I'll close this.

@uschindler Care to open a feature request describing startup checks? OpenSearch 2.0 is min JDK 11 so that can work.

dblock closed this as completed on Apr 7, 2022
uschindler commented:

I'll close this.

@uschindler Care to open a feature request describing startup checks? OpenSearch 2.0 is min JDK 11 so that can work.

Will do when at home. The version detection stuff is already available (class JavaVersion), but it should be removed and replaced by Runtime.Version.

uschindler commented:

I opened: #2822


reta commented Apr 18, 2022

Interesting, could it be related to https://bugs.openjdk.java.net/browse/JDK-8259541 ? @dbbaughe out of curiosity, did you enable JVM compiler logging to see what is going on? Thank you.
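
For anyone who retries this on 11.0.2, a hedged sketch of how JIT compiler logging could be enabled on the test cluster, reusing the argline convention from the reproduction steps (standard HotSpot diagnostic flags; the log path is a placeholder):

# Print JIT compilations to stdout and write the detailed compilation log to a file
./gradlew run -Dtests.jvm.argline="-XX:+PrintCompilation -XX:+UnlockDiagnosticVMOptions -XX:+LogCompilation -XX:LogFile=/tmp/hotspot-compilation.log"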

dbbaughe commented:

Interesting, could it be related to https://bugs.openjdk.java.net/browse/JDK-8259541 ? @dbbaughe out of curiosity, did you enable JVM compiler logging to see what is going on? Thank you.

Yeah, I did, but I was only cross-checking against other similar issues that others had reported. I didn't have enough expertise with those compiler logs to be able to find something on my own without spending a significant amount of time.
