JDK21 sanity.openjdk failures on x86-64_linux on some machines #5012
The deep history view indicates that something happened between Dec 16 and 23 (either in the test material or in machine configuration) to make this issue arise.
Ok, the following unit tests seem to fail uniquely on the new machine (test-docker-fedora39-x64-1), but passed when run on other Fedora machines in the past (example1, example2): sun/security/krb5/MicroTime. The Grinder re-run Stewart launched tells us that all of those were infrequent failures, as they passed when rerun on the same machine. In short, I see no consistent failures in sanity.openjdk that have not occurred on existing Fedora machines in the past month, so there's nothing uniquely wrong with this new Fedora machine.
@adamfarley - please continue to investigate this issue to understand what changed between Dec 16 and 23 that introduced these failures.
Will do.
Summary

I think this could be a memory issue caused by a massive concurrency spike.

Details

Ok, I've taken a look at the failing test targets, and I'm seeing a 2-3x increase in runtime on Fedora, along with the most consistently failing test targets on Fedora as well. Ubuntu and CentOS seem to pass the test targets at least some of the time, so I'm concentrating on Fedora for the first pass.

Since exit code 137 is what Linux reports when a process is killed with SIGKILL (128 + 9), most commonly by the kernel OOM killer when memory runs out, this may also explain the socket error if, hypothetically, the VM we're trying to "get" has also been killed. Still, that's a hypothesis, not proof, so let's see what changed during the Dec 16 to 23 period that could affect memory usage and/or networking.

The first thing I noticed was that the failures have a concurrency rating of 25 and the passes have a concurrency rating of 3. This would explain the timeouts "getting" a VM, and also the memory problems. Will kick off some tests with reduced concurrency. If that solves the issue, let's tweak the code to make sure this doesn't happen again.

EDIT: Ok, this isn't working. Will rerun with the fix proposed below.
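As a side note, the 137 reading above can be checked directly; a minimal sketch (illustrative only, not part of the test harness) decoding a shell exit status of 137:

```python
import signal

# Shells report "killed by signal N" as exit status 128 + N,
# so an exit code of 137 corresponds to signal 9.
status = 137
sig = signal.Signals(status - 128)
print(sig.name)  # SIGKILL -- on Linux, frequently sent by the kernel OOM killer
```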
Ok, I'm confident that the "Potential candidate" is the cause of our headache here. The code there was originally designed to ensure that the concurrency could not exceed the memory limits of the machine in question. However, that code change made us use "megabytes" of memory instead of "gigabytes", so as long as the number of processors is less than the number of megabytes of memory, we use (cores/2)+1 concurrency. Since this machine has 6.5 GB of memory and 48 cores, this results in us setting concurrency to 25 (i.e. (48/2)+1). So where CORE is now 25 and MEM is the number of megabytes of memory the system has:
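(The original snippet itself wasn't captured in this thread; the following is only a rough sketch of the logic being described, with names and exact arithmetic assumed rather than taken from the real script.)

```python
def concurrency(cores: int, mem: int) -> int:
    # Sketch of the cap described above: 'mem' was meant to be in GB
    # (roughly 2 GB per concurrent process), but is now passed in MB,
    # so the guard below almost never triggers and (cores/2)+1 wins.
    if cores < mem:
        return cores // 2 + 1
    return max(1, mem // 2)

print(concurrency(48, 6656))  # memory in MB -> 25 (the failing runs)
print(concurrency(48, 6))     # memory in GB -> 3  (the passing runs)
```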
So we go from guaranteeing each core 2 GB of memory to guaranteeing 0.26 GB per core. I think this code change needs to be adjusted to avoid starving threads of memory. Will put together a fix and run it past sxa. Test run: https://ci.adoptium.net/job/Test_openjdk21_hs_sanity.openjdk_x86-64_linux/162/
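For the per-core figures above, the arithmetic (using the 6.5 GB total on this machine and the observed concurrency values) works out roughly as follows:

```python
mem_gb = 6.5
print(mem_gb / 3)   # ~2.2 GB per concurrent process at concurrency 3
print(mem_gb / 25)  # ~0.26 GB per concurrent process at concurrency 25
```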
Ok, there appears to be a bug preventing test pipeline jobs from running anything from personal forks, so here's a grinder: |
The grinder successfully reduces the number of concurrent threads in the test. Now to see if reducing the concurrency reduces the number of failures. Leaving to run.
Another note: the deep history shows it mainly failed on the Docker agents. The Jan 30th, 2024 run on test-ibmcloud-ubuntu1604-x64-1 succeeds. PR #5035 might also be part of the fix, which reduces the
Seems likely. I /think/ the correct code should be:
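(The proposed snippet itself wasn't captured here; purely as an illustrative sketch of the idea rather than the actual change that went into PR #5063, the comparison would convert memory to GB first so each concurrent process keeps roughly 2 GB.)

```python
def concurrency_fixed(cores: int, mem_mb: int) -> int:
    # Sketch only: cap concurrency by memory expressed in GB,
    # aiming for roughly 2 GB per concurrent test process.
    mem_gb = mem_mb // 1024
    return max(1, min(cores // 2 + 1, mem_gb // 2))

print(concurrency_fixed(48, 6656))  # -> 3 on the 48-core / 6.5 GB machine
```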
So that original PR wasn't correct for all situations.
Ok, here's the PR: #5063. Update 2024/02/16: Merged.
Ok, the concurrency level should now be within the bounds of sanity. Will kick some jobs off. |
Closed via #5063 |
Some test cases from the jdk21 sanity.openjdk targets are failing on x86-64_linux on certain machines. The failures are in the jdk_lang, jdk_util and jdk_foreign test targets; details can be found in the Jan CPU JDK21 AQA triage, see #4983 (comment).
sanity.openjdk - 5 targets fail (jdk_lang_0, jdk_lang_1, jdk_util_0, jdk_util_1 and jdk_foreign_0); the rerun in Grinder/8539 fails, while the rerun on test-docker-ubuntu2204-x64-1 in Grinder/8557 passes.
Grinder_20240118175328_jdk21_x64Linux.tap.txt
Marking https://ci.adoptium.net/job/Test_openjdk21_hs_sanity.openjdk_x86-64_linux/151/ as a keep forever to show a good passing run on test-docker-fedora35-x64-1
Failing in https://ci.adoptium.net/job/Test_openjdk21_hs_sanity.openjdk_x86-64_linux/153/ with lots of testcases timing out on test-docker-fedora37-x64-3, with lots of

03:32:59 ACTION: testng -- Failed. Unexpected exit from test [exit code: 137]

and

Error. Agent communication error: java.net.SocketException: Broken pipe; check console log for any additional details

issues.

Failing in https://ci.adoptium.net/job/Test_openjdk21_hs_sanity.openjdk_x86-64_linux/155/ on test-docker-fedora35-x64-1 with the same types of issues as in Test_openjdk21_hs_sanity.openjdk_x86-64_linux/153 above.
List of failing test cases from Test_openjdk21_hs_sanity.openjdk_x86-64_linux/155: