JDK21 sanity.openjdk failures on x86-64_linux on some machines #5012
The deep history view indicates that something happened between Dec 16 and 23 (either in the test material or in machine configuration) to make this issue arise.
Ok, the following unit tests seem to fail uniquely on the new machine (test-docker-fedora39-x64-1), but passed when run on other Fedora machines in the past (example1, example2): sun/security/krb5/MicroTime. The Grinder re-run Stewart launched tells us that all of those were infrequent failures, as they passed when rerun on the same machine. In short, I see no consistent failures in sanity.openjdk that have not occurred on existing Fedora machines in the past month, so there's nothing uniquely wrong with this new Fedora machine.
@adamfarley - please continue to investigate this issue to understand what changed between Dec 16 and 23 that introduced these failures.
Will do.
Summary

I think this could be a memory issue caused by a massive concurrency spike.

Details

Ok, I've taken a look at the failing test targets, and I'm seeing a 2-3x increase in runtime on Fedora, along with the most consistently failing test targets on Fedora as well. Ubuntu and CentOS seem to pass the test targets at least some of the time, so I'm concentrating on Fedora for the first pass.

Since exit code 137 is what Linux reports when a process is killed with SIGKILL (128 + 9), most commonly by the kernel OOM killer when memory runs out, this may also explain the socket error if, hypothetically, the VM we're trying to "get" has also been killed. Still, that's a hypothesis, not proof, so let's see what changed during the Dec 16 to 23 period that could affect memory usage and/or networking.

The first thing I noticed was that the failures have a concurrency rating of 25 and the passes have a concurrency rating of 3. This would explain the timeouts "getting" a VM, and also the memory problems. Will kick off some tests with reduced concurrency. If that solves the issue, let's tweak the code to make sure this doesn't happen again.

EDIT: Ok, this isn't working. Will rerun with the fix proposed below.
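As a side note, the 137 reading above can be checked directly; a minimal sketch (illustrative only, not part of the test harness) decoding a shell exit status of 137:

```python
import signal

# Shells report "killed by signal N" as exit status 128 + N,
# so an exit code of 137 corresponds to signal 9.
status = 137
sig = signal.Signals(status - 128)
print(sig.name)  # SIGKILL -- on Linux, frequently sent by the kernel OOM killer
```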
Ok, I'm confident that the "Potential candidate" is the cause of our headache here. The code there was originally designed to ensure that the concurrency could not exceed the memory limits of the machine in question. However, that code change made us use "megabytes" of memory instead of "gigabytes", so as long as the number of processors is less than the number of megabytes of memory, we use (cores/2)+1 concurrency. Since this machine has 6.5 GB of memory and 48 cores, this results in us setting concurrency to 25 (i.e. (48/2)+1). So where CORE is now 25 and MEM is the number of megabytes of memory the system has:
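(The original snippet itself wasn't captured in this thread; the following is only a rough sketch of the logic being described, with names and exact arithmetic assumed rather than taken from the real script.)

```python
def concurrency(cores: int, mem: int) -> int:
    # Sketch of the cap described above: 'mem' was meant to be in GB
    # (roughly 2 GB per concurrent process), but is now passed in MB,
    # so the guard below almost never triggers and (cores/2)+1 wins.
    if cores < mem:
        return cores // 2 + 1
    return max(1, mem // 2)

print(concurrency(48, 6656))  # memory in MB -> 25 (the failing runs)
print(concurrency(48, 6))     # memory in GB -> 3  (the passing runs)
```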
So we go from guaranteeing each core 2 GB of memory to guaranteeing 0.26 GB per core. I think this code change needs to be adjusted to avoid starving threads of memory. Will put together a fix and run it past sxa. Test run: https://ci.adoptium.net/job/Test_openjdk21_hs_sanity.openjdk_x86-64_linux/162/
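For the per-core figures above, the arithmetic (using the 6.5 GB total on this machine and the observed concurrency values) works out roughly as follows:

```python
mem_gb = 6.5
print(mem_gb / 3)   # ~2.2 GB per concurrent process at concurrency 3
print(mem_gb / 25)  # ~0.26 GB per concurrent process at concurrency 25
```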
Ok, there appears to be a bug preventing test pipeline jobs from running anything from personal forks, so here's a grinder: |
The grinder successfully reduces the number of concurrent threads in the test. Now to see if reducing the concurrency reduces the number of failures. Leaving to run.
Another note: the deep history shows it mainly failed on the Docker agents. The Jan 30th, 2024 run on test-ibmcloud-ubuntu1604-x64-1 succeeds. PR #5035 might also be part of the fix, which reduces the
Seems likely. I /think/ the correct code should be:
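(The proposed snippet itself wasn't captured here; purely as an illustrative sketch of the idea rather than the actual change that went into PR #5063, the comparison would convert memory to GB first so each concurrent process keeps roughly 2 GB.)

```python
def concurrency_fixed(cores: int, mem_mb: int) -> int:
    # Sketch only: cap concurrency by memory expressed in GB,
    # aiming for roughly 2 GB per concurrent test process.
    mem_gb = mem_mb // 1024
    return max(1, min(cores // 2 + 1, mem_gb // 2))

print(concurrency_fixed(48, 6656))  # -> 3 on the 48-core / 6.5 GB machine
```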
So that original PR wasn't correct for all situations.
Ok, here's the PR: #5063. Update 2024/02/16: Merged.
Ok, the concurrency level should now be within the bounds of sanity. Will kick some jobs off. |
Closed via #5063 |
Some test cases from the jdk21 sanity.openjdk targets are failing on x86-64_linux on certain machines. The failures are in the jdk_lang, jdk_util and jdk_foreign test targets; details can be found in the Jan CPU JDK21 AQA triage, see #4983 (comment).
sanity.openjdk - 5 targets fail (jdk_lang_0, jdk_lang_1, jdk_util_0, jdk_util_1 and jdk_foreign_0); the rerun in Grinder/8539 fails, while the rerun on test-docker-ubuntu2204-x64-1 in Grinder/8557 passes.
Grinder_20240118175328_jdk21_x64Linux.tap.txt
Marking https://ci.adoptium.net/job/Test_openjdk21_hs_sanity.openjdk_x86-64_linux/151/ as a keep forever to show a good passing run on test-docker-fedora35-x64-1
Failing in https://ci.adoptium.net/job/Test_openjdk21_hs_sanity.openjdk_x86-64_linux/153/ with lots of testcases timing out on test-docker-fedora37-x64-3, with lots of

03:32:59 ACTION: testng -- Failed. Unexpected exit from test [exit code: 137]

and

Error. Agent communication error: java.net.SocketException: Broken pipe; check console log for any additional details

issues.

Failing in https://ci.adoptium.net/job/Test_openjdk21_hs_sanity.openjdk_x86-64_linux/155/ on test-docker-fedora35-x64-1 with the same types of issues as in Test_openjdk21_hs_sanity.openjdk_x86-64_linux/153 above.
List of failing test cases from Test_openjdk21_hs_sanity.openjdk_x86-64_linux/155: