Tests running on test-docker* machines get terminated mid-run #2888
Comments
@smlambert I suspect the begin/end process cleanup logic. I've run some process queries on the docker container node test-docker-ubuntu1804-armv8l-4, and they show 2 Jenkins agents visible, which would seem to imply that this docker container can see processes in the other containers on the same host, dockerhost-equinix-ubuntu2004-armv8-1? Agents from CONTAINER test-docker-ubuntu1804-armv8l-4: Agents from HOST dockerhost-equinix-ubuntu2004-armv8-1: |
What looks odd above is that some containers appear to have 2 Jenkins agents, these 2 containers: Whereas these containers only have 1 Jenkins agent: If Jenkins schedules 2 tasks within the same container, could they end up terminating each other's processes? |
Using this command from Scripting Console: println "ps -o cgroup,pid,state,tname,time,command -u jenkins".execute().text |
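The output of that command can be summarised per cgroup to spot containers that host more than one agent. The following is a minimal sketch, assuming the agent's command line contains `agent.jar` and that the columns appear in the order requested above. The function name `count_agents_per_cgroup` and the sample text are invented for illustration; the sample is not output from the real machines.

```python
from collections import Counter

def count_agents_per_cgroup(ps_text, marker="agent.jar"):
    """Count processes whose command contains `marker`, grouped by the cgroup column."""
    counts = Counter()
    lines = ps_text.strip().splitlines()
    for line in lines[1:]:  # skip the header row
        # Columns requested: cgroup, pid, state, tname, time, command
        fields = line.split(None, 5)
        if len(fields) < 6:
            continue  # ignore rows that do not have all six columns
        cgroup, command = fields[0], fields[5]
        if marker in command:
            counts[cgroup] += 1
    return counts

# Hypothetical sample, shaped like the ps output described above.
SAMPLE = """\
CGROUP                 PID S TTY          TIME COMMAND
0::/docker/aaa         101 S ?        00:01:02 java -jar agent.jar
0::/docker/aaa         202 S ?        00:00:40 java -jar agent.jar
0::/docker/bbb         303 S ?        00:00:55 java -jar agent.jar
0::/docker/bbb         404 R ?        00:00:01 ps -o cgroup
"""

counts = count_agents_per_cgroup(SAMPLE)
multi_agent_cgroups = [cg for cg, n in counts.items() if n > 1]
print(multi_agent_cgroups)
```

Any cgroup listed in `multi_agent_cgroups` would be a candidate for the "2 agents in one container" situation described above.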
https://ci.adoptopenjdk.net/job/Test_openjdk19_hs_special.functional_aarch64_linux/46/console
|
https://ci.adoptopenjdk.net/job/Test_openjdk19_hs_extended.system_aarch64_linux/163/console
|
https://ci.adoptopenjdk.net/job/Test_openjdk19_hs_sanity.system_aarch64_linux/163/
|
https://ci.adoptopenjdk.net/job/Test_openjdk11_hs_extended.openjdk_aarch64_linux/110/
|
https://ci.adoptopenjdk.net/job/Test_openjdk8_hs_extended.system_aarch64_linux/773/console
|
@smlambert I am fairly sure this is the process cleanup killing visible Jenkins processes from other containers. There must be something special about the container environment here that I need to investigate. |
thanks @andrew-m-leonard yes please. I will keep adding examples to this issue as I find them, in case it helps us for a revised solution. |
https://ci.adoptopenjdk.net/computer/test%2Ddocker%2Dubuntu1804%2Darmv8l%2D2/ https://ci.adoptopenjdk.net/job/Test_openjdk19_hs_sanity.system_aarch64_linux/163/consoleFull, which was run on https://ci.adoptopenjdk.net/computer/test%2Ddocker%2Dfedora35%2Darmv8l%2D1, does not have the same problem, so that issue is not the same. Apart from this one, has this occurred anywhere other than on the two "duplicate" agent definitions, which I've resolved? |
I've just looked on test-docker-fedora35-armv8l-1, and it looks as though it has multiple Jenkins agents as well?
I can't check ps on test-docker-ubi8-armv8-1. The aqa-tests terminateProcess.sh logic assumes that if:
So my suspicion is that docker containers serving 2 Jenkins agents will kill off each other's processes. |
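The suspected failure mode can be illustrated with a small simulation. This is only a sketch: the script's real conditions are not reproduced in this thread, so `user_wide_targets` stands in for "terminate anything owned by the jenkins user that is not an agent", and `descendant_targets` shows, purely for contrast, a selection limited to one agent's own child processes. All names, PIDs and commands are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Proc:
    pid: int
    ppid: int
    user: str
    command: str

def user_wide_targets(procs, user="jenkins", agent_marker="agent.jar"):
    """Assumed cleanup rule: every non-agent process owned by `user`."""
    return {p.pid for p in procs if p.user == user and agent_marker not in p.command}

def descendant_targets(procs, agent_pid):
    """Contrast case: only processes descended from one agent's PID."""
    children = {}
    for p in procs:
        children.setdefault(p.ppid, []).append(p.pid)
    found, stack = set(), [agent_pid]
    while stack:
        for child in children.get(stack.pop(), []):
            if child not in found:
                found.add(child)
                stack.append(child)
    return found

# Two agents in one container, each running its own test job (hypothetical).
PROCS = [
    Proc(100, 1, "jenkins", "java -jar agent.jar"),  # agent A
    Proc(200, 1, "jenkins", "java -jar agent.jar"),  # agent B
    Proc(110, 100, "jenkins", "java TestA"),         # job started by agent A
    Proc(111, 110, "jenkins", "sh helper.sh"),       # child of agent A's job
    Proc(210, 200, "jenkins", "java TestB"),         # job started by agent B
]

print(sorted(user_wide_targets(PROCS)))        # includes agent B's job (210)
print(sorted(descendant_targets(PROCS, 100)))  # only agent A's own processes
```

Under the assumed user-wide rule, a cleanup started by agent A also selects PID 210, which belongs to agent B's running job.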
@sxa There are two jenkins agents launched in container port 2235 on host 147.75.35.203, because there are two Node definitions targeting that container: |
@sxa The process cleaning on a "host" rests on the assumption that any jenkins-owned Test processes found running should be terminated. I think that assumption is incorrect: a host could have multiple "Executors" (Agents), each running independent Test job processes under the jenkins user, so the cleanup could incorrectly terminate another executor's processes. The docker containers with multiple Agents illustrate the same problem, although I suspect that is not intentional. @steelhead31 The above Node definitions using the same containers don't seem right? I suspect test-docker-ubi8-armv8-1 on 139.178.86.243 port 2247 has a duplicate, although I have not found one; it's not easy to search all node configurations by host and port. |
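Searching node configurations by host and port becomes straightforward once the definitions are available as data. The sketch below assumes each node has already been exported as a dictionary with `name`, `host` and `port` keys; that shape, the function name `find_duplicate_endpoints`, and the sample node names and addresses are all hypothetical and do not come from the real Jenkins configuration.

```python
from collections import defaultdict

def find_duplicate_endpoints(nodes):
    """Group node names by (host, port) and keep endpoints used more than once."""
    by_endpoint = defaultdict(list)
    for node in nodes:
        by_endpoint[(node["host"], node["port"])].append(node["name"])
    return {ep: names for ep, names in by_endpoint.items() if len(names) > 1}

# Hypothetical node definitions (documentation-range IP addresses).
NODES = [
    {"name": "example-node-a", "host": "192.0.2.10", "port": 2201},
    {"name": "example-node-b", "host": "192.0.2.10", "port": 2201},  # same container as node-a
    {"name": "example-node-c", "host": "192.0.2.10", "port": 2202},
    {"name": "example-node-d", "host": "192.0.2.20", "port": 2201},
]

for (host, port), names in find_duplicate_endpoints(NODES).items():
    print(f"{host}:{port} is targeted by {', '.join(names)}")
```

Two definitions that resolve to the same host and port would launch two agents in the same container, which is the situation described above.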
@andrew-m-leonard I'll have a look at these duplicates; I suspect something has gone awry. I can probably also find the duplicates via the Jenkins API :) |
I'm currently performing an audit of all the docker nodes in Jenkins. Once I have this, we can remove any defunct ones, identify any duplicates and sort those out too. Once this is done, we can retry some tests and determine any further actions. |
I've produced an audit of the docker related hosts and machines: https://drive.google.com/file/d/1hNtQ_BOrAfV4FWj961dgT8hH9zw4EFcn/view?usp=sharing |
have 6 machines / 3 duplicates, which looks to be caused by labelling. I'll remove the duplicates from Jenkins. test-docker-ubuntu2004-aarch64-1 |
Now removed ( as these 3 are duplicates ) test-docker-ubuntu2204-armv8-1 test-docker-fedora36-aarch64-1 |
Just to be clear on this, no systems labelled for test should have more than one executor. If they do, it's definitely a bug that needs to be resolved, so thanks Scott for dealing with these :-) |
I've resolved the docker agent/multiple executors problem, and successfully run several test suites on these problematic machines without issue. I'll close this issue for now. @smlambert if you find any more occurrences of this after today, please let me know. |
Please set the title to indicate the test name and machine name where known.
To make it easy for the infrastructure team to repeat and diagnose, please
answer the following questions:
Test_ job on https://ci.adoptopenjdk.net which showed the failure: https://ci.adoptopenjdk.net/job/Test_openjdk17_hs_extended.system_aarch64_linux/257/
Any other details:
There were several cases seen during release triage. I will add more examples to this issue shortly.