Identify which tests seem unstable in docker containers #2138
NOTE - runs on the Fedora docker image testing after patching and rebooting the server: |
Also trying on a couple of X64 docker images (Fedora 33 and Ubuntu 20.04) |
NUMA interrogation is failing in Docker [EDIT: Issue shows up with just |
Core dump generation is also failing (I've tried starting the container with various options that might help but to no avail ... so far) ... potentially the same as described in adoptium/run-aqa#59 [EDIT: The (host) systems on which core files were not being produced had |
Also not specific to docker, but we have seen instances of this when
This will be progressed via adoptium/run-aqa#59 |
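A minimal sketch of two settings worth checking when core files are not produced, assuming a Linux host or container with /proc mounted. It is not the diagnosis from adoptium/run-aqa#59; the helper name `core_pattern_kind` is hypothetical. The `kernel.core_pattern` value is generally shared between the host and its containers, so a pattern that pipes to a helper on the host can mean no core file appears inside the container.

```shell
#!/bin/sh
# Sketch only: report settings that commonly stop core files being written.
# Assumption: Linux with /proc mounted. core_pattern_kind is a hypothetical helper.

# Classify a core_pattern value: "pipe" means the kernel hands the dump to a
# helper program, "file" means it writes a core file directly.
core_pattern_kind() {
    case "$1" in
        "|"*) echo "pipe" ;;
        *)    echo "file" ;;
    esac
}

pattern_file=/proc/sys/kernel/core_pattern
if [ -r "$pattern_file" ]; then
    pattern=$(cat "$pattern_file")
    echo "core_pattern: $pattern ($(core_pattern_kind "$pattern"))"
fi

# A value of 0 means core files are disabled for this shell and its children.
echo "core size limit (ulimit -c): $(ulimit -c)"
```

Running it both on the host and inside the container makes it easy to compare the two.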
Ran a Grinder on testc-packet-fedora33-amd-2 and got
https://ci.adoptopenjdk.net/view/work-in-progress/job/grinder_sandbox/203/console I assume testc-packet-fedora33-amd-2 is a docker container? |
Yes - it's a docker container. Hmmm, that's a bit odd ... It also has nothing to do with the test if it's failing that early in the process. I've re-run it as 205 and it completed without any fatal failures, so hopefully it won't recur, but if you see any further instances let me know so we can see whether it happens regularly. |
From https://adoptopenjdk.slack.com/archives/C5219G28G/p1612761729068300, we should check whether the timeouthandler added to openj9 openjdk test runs is able to write a System dump in a dockerized environment. |
I wonder if eclipse-openj9/openj9#12038 is another example of failure in docker environments or not. |
Hmmm, interesting thought. Certainly possible, but this is the first I've heard of it. Some of the containers we have are capped in terms of CPU and RAM, which could explain why you wouldn't necessarily be able to replicate locally without doing the same. |
sanity.openjdk on JDK 8 (Hotspot) seems to randomly fail for these tests:
Especially LFSingleThreadCachingTest.java looks like an OOM kill. Would be nice to overlay that failure with the kernel OOM kill logs. |
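One way to look for kernel OOM kills around the time of a failure, as a rough sketch. It assumes it is run on the docker host (the kernel log is usually not readable inside an unprivileged container) and that `dmesg` is available; the helper name `oom_events` is hypothetical and the search patterns are a guess at typical kernel wording, not an exhaustive list.

```shell
#!/bin/sh
# Sketch only: filter kernel log text for lines that look like OOM-killer events.
# oom_events is a hypothetical helper; the patterns cover common kernel messages.
oom_events() {
    grep -Ei 'out of memory|oom-kill|killed process'
}

# Run on the docker host; reading the kernel log may require root.
# -T asks util-linux dmesg for human-readable timestamps, which makes it
# easier to line events up with the test failure time.
if command -v dmesg >/dev/null 2>&1; then
    dmesg -T 2>/dev/null | oom_events \
        || echo "no OOM kill events found (or kernel log not readable)"
fi
```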
Above error was on test-docker-fedora33-x64-2 hosted on test-packet-ubuntu2004-amd-1. Those systems were all started with 4 cores and 6GB allocated to them. Re-testing at @smlambert In the log Severin referenced above it gives the Grinder re-run link for the individual test as https://ci.adoptopenjdk.net/job/Grinder/parambuild/?JDK_VERSION=8&JDK_IMPL=hotspot&JDK_VENDOR=oracle&BUILD_LIST=openjdk&PLATFORM=x86-64_linux_xl&TARGET=jdk_lang_1 which is clearly wrong as it doesn't reference upstream and the EDIT: https://ci.adoptopenjdk.net/job/Grinder/7353/console passed on a real machine (IBMCLOUD RHEL8) but https://ci.adoptopenjdk.net/job/Grinder/7350/console failed on the machine mentioned above (Both |
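For anyone trying to reproduce this locally, a sketch of starting a container with the same kind of caps (4 cores and 6GB, as mentioned above). The `fedora:33` image name and the helper `capped_run_cmd` are illustrative assumptions, not the actual setup used for these test containers. The helper only prints the command so it can be reviewed before running.

```shell
#!/bin/sh
# Sketch only: print (rather than run) a docker command that starts a container
# with CPU and memory caps. capped_run_cmd is a hypothetical helper; the image
# and limit values are illustrative.
capped_run_cmd() {
    cpus=$1
    mem=$2
    image=$3
    # --cpus limits CPU time to the given number of cores,
    # --memory sets a hard memory limit for the container.
    echo "docker run --rm -it --cpus=$cpus --memory=$mem $image /bin/bash"
}

capped_run_cmd 4 6g fedora:33
# prints: docker run --rm -it --cpus=4 --memory=6g fedora:33 /bin/bash
```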
Potential resource starvation reported by @lumpfish on build-docker-fedora33-armv8-3 in adoptium/infrastructure#2002 - I see a "docker day" in my near future ... (Will diagnose using
|
At the moment at least some docker images hosted on build-packet-ubuntu1804-armv8-1 (U1804b_2223 in particular, with a job currently running) and docker-packet-ubuntu2004-amd-1 (U2004_2224 in particular, with a job currently running) are using a lot of CPU, so they potentially need to be properly capped. The failures being seen above may well only be occurring on those systems. When the systems are quiesced tomorrow (since we're running the weekend pipelines for JDK16 again due to adoptium/ci-jenkins-pipelines#87) I can look at adjusting the capping of the tests. Related to @lumpfish's |
OK I've brought the following offline for now while investigations occur as some of these have shown problems with
|
This looks to be the same issue that's covered in #2310 and not specific to docker |
With the merging of #2345 I've brought most systems back online - I've left [EDIT: Load on the machine during the nightly testing is sitting at under 16 and there are 64 cores so I have re-enabled these three remaining executors] |
Another one adoptium/adoptium#63 (comment) |
@sophia-guo That looks like the tests have a dependency on the |
Example run in Grinder: https://ci.adoptopenjdk.net/job/Grinder/1203 |
@sxa if I log in to the test machine I can run |
On arm jdk11: passes on non-docker machines and fails consistently on docker ones. https://ci.adoptopenjdk.net/job/Test_openjdk11_hs_extended.openjdk_arm_linux_testList_2/9/ |
java/beans/PropertyEditor/TestFontClassJava.java.TestFontClassJava error message:
https://ci.adoptopenjdk.net/job/Test_openjdk18_hs_extended.openjdk_x86-64_linux_testList_2/26/ |
This is partially for my own notes, but needs to be looked at, and may also be covered elsewhere. Looks like the DDR stuff (not too surprising) will need some work:
- testDDR*
- cmdLineTester and jit_hw_2 fail
- cmdLineTester tests -2 and not -1 - unrelated to docker?
Others (on initial look - not too deep!) seem OK.
Memo to self - how to check for RAM/CPU limits in a container:
- `wc -l /sys/fs/cgroup/cpu,cpuacct/cgroup.procs` (not accurate)
- `cat /sys/fs/cgroup/memory/memory.limit_in_bytes`, then divide the result by 1024 / 1024 / 1024 (or by 1073741824) to get GiB
- `while true; do clear && uptime && docker stats --no-stream; sleep 60; done`
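The checks in the memo can be wrapped into a small script. This is a sketch under assumptions: it reads the cgroup v1 paths used above, falls back to the cgroup v2 `memory.max` file for memory, and uses the CFS quota and period files for CPU instead of counting lines in `cgroup.procs`. The helper names `bytes_to_gib` and `cpus_from_quota` are hypothetical.

```shell
#!/bin/sh
# Sketch only: report memory and CPU caps as seen from inside a container.
# Assumes cgroup v1 paths (with a cgroup v2 fallback for memory).
# bytes_to_gib and cpus_from_quota are hypothetical helper names.

# Convert a byte count to GiB with two decimals (1 GiB = 1073741824 bytes).
bytes_to_gib() {
    awk -v b="$1" 'BEGIN { printf "%.2f\n", b / 1073741824 }'
}

# Convert a CFS quota and period (both in microseconds) to a CPU count.
cpus_from_quota() {
    awk -v q="$1" -v p="$2" 'BEGIN { printf "%.2f\n", q / p }'
}

v1_mem=/sys/fs/cgroup/memory/memory.limit_in_bytes
v2_mem=/sys/fs/cgroup/memory.max
if [ -r "$v1_mem" ]; then
    # An uncapped cgroup v1 container reports a very large number here.
    echo "memory limit: $(bytes_to_gib "$(cat "$v1_mem")") GiB"
elif [ -r "$v2_mem" ]; then
    limit=$(cat "$v2_mem")
    if [ "$limit" = "max" ]; then
        echo "memory limit: none"
    else
        echo "memory limit: $(bytes_to_gib "$limit") GiB"
    fi
fi

quota_file=/sys/fs/cgroup/cpu,cpuacct/cpu.cfs_quota_us
period_file=/sys/fs/cgroup/cpu,cpuacct/cpu.cfs_period_us
if [ -r "$quota_file" ] && [ -r "$period_file" ]; then
    quota=$(cat "$quota_file")
    period=$(cat "$period_file")
    # A quota of -1 means no CPU cap has been set.
    if [ "$quota" -gt 0 ]; then
        echo "cpu limit: $(cpus_from_quota "$quota" "$period") CPUs"
    else
        echo "cpu limit: none"
    fi
fi
```

For a container started with 6GB and 4 cores, the helpers would be expected to report 6.00 GiB and 4.00 CPUs.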