jdk_container left files owned by root #5358

Open
llxia opened this issue May 30, 2024 · 13 comments
Comments

@llxia
Contributor

llxia commented May 30, 2024

jdk_container leaves files on the host machine that are owned by root. The Jenkins job cannot clean up these files, which causes the job to fail.

12:05:52  ERROR: Cannot delete workspace :Unable to delete '/home/jenkins/workspace/Grinder/aqa-tests/TKG/output_17113857837219/jdk_container_0/work/scratch/2/jdk-sharedtmp/.com_ibm_tools_attach/_controller'. Tried 3 times (of a maximum of 3) waiting 0.1 sec between attempts.
[Pipeline] echo
12:05:52  Exception: hudson.AbortException: Cannot delete workspace: Unable to delete '/home/jenkins/workspace/Grinder/aqa-tests/TKG/output_17113857837219/jdk_container_0/work/scratch/2/jdk-sharedtmp/.com_ibm_tools_attach/_controller'. Tried 3 times (of a maximum of 3) waiting 0.1 sec between attempts.
[Pipeline] sh
12:05:53  + rm -rf /home/jenkins/workspace/Grinder/aqa-tests/TKG
12:05:53  rm: cannot remove '/home/jenkins/workspace/Grinder/aqa-tests/TKG/output_17113857837219/jdk_container_0/work/scratch/2/jdk-sharedtmp/.com_ibm_tools_attach/_controller': Operation not permitted
12:05:53  rm: cannot remove '/home/jenkins/workspace/Grinder/aqa-tests/TKG/output_17113857837219/jdk_container_0/work/scratch/2/jdk-sharedtmp/.com_ibm_tools_attach/_notifier': Operation not permitted

@sophia-guo @smlambert do you also see a similar issue at Adoptium Jenkins? Is there a better way to resolve this?
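
To see how many files are affected before the workspace cleanup runs, a small helper can list every path under a directory that is not owned by the current user. This is only a sketch: the function name `find_foreign_owned` is hypothetical and not part of aqa-tests or TKG, and it assumes a POSIX host where the job runs as the jenkins user.

```python
import os
from pathlib import Path
from typing import List, Optional


def find_foreign_owned(root: str, uid: Optional[int] = None) -> List[Path]:
    """Return paths under root whose owner differs from uid.

    uid defaults to the user running this script (for example, jenkins).
    """
    expected = os.getuid() if uid is None else uid
    foreign = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = Path(dirpath) / name
            # lstat judges a symlink by its own owner, not its target's.
            if path.lstat().st_uid != expected:
                foreign.append(path)
    return sorted(foreign)
```

Running it against the TKG output directory from the log above would be expected to list entries such as `_controller` and `_notifier` if they are indeed owned by root.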

@llxia llxia changed the title jdk_container left files under root jdk_container left files owned by root May 30, 2024
@sophia-guo
Contributor

These tests were added around one and a half years ago. As they are dev level, they may not run frequently, so I hadn't noticed this issue. I checked recent jdk21 runs and they do not seem to have this issue.

https://ci.adoptium.net/view/Test_openjdk/job/Test_openjdk21_hs_dev.openjdk_x86-64_linux/

@llxia
Contributor Author

llxia commented Jun 7, 2024

We should mark the node offline automatically when the error "Cannot delete workspace: Unable to delete ..." occurs.
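
Detecting that condition could look roughly like the sketch below. It only recognizes the error message and extracts the offending path. How the node is then taken offline is not shown in this thread and is left out. The pattern is modeled on the two log lines quoted earlier (one has the colon before the space, one after), and the name `undeletable_path` is hypothetical.

```python
import re
from typing import Optional

# Assumed message format, copied from the log excerpt in this issue.
# This is plain text matching, not a Jenkins API.
_UNDELETABLE = re.compile(
    r"Cannot delete workspace\s*:\s*Unable to delete '([^']+)'"
)


def undeletable_path(log_line: str) -> Optional[str]:
    """Return the path Jenkins failed to delete, or None if the line does not match."""
    match = _UNDELETABLE.search(log_line)
    return match.group(1) if match else None
```

A caller could treat any non-None result as the trigger for taking the node out of rotation.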

llxia added a commit to llxia/aqa-tests that referenced this issue Jun 7, 2024
related: adoptium#5358
Signed-off-by: Lan Xia <Lan_Xia@ca.ibm.com>
llxia added a commit to llxia/aqa-tests that referenced this issue Jun 7, 2024
related: adoptium#5358
Signed-off-by: Lan Xia <Lan_Xia@ca.ibm.com>
llxia added a commit to llxia/aqa-tests that referenced this issue Jun 10, 2024
related: adoptium#5358
Signed-off-by: Lan Xia <Lan_Xia@ca.ibm.com>
karianna pushed a commit that referenced this issue Jun 12, 2024
related: #5358

Signed-off-by: Lan Xia <Lan_Xia@ca.ibm.com>
@AswathySK

Is there any other way to clean up the crash files in the test code itself, instead of marking the node offline, @llxia?

@smlambert
Contributor

Is there any other way to clean up the crash files in the test code itself, instead of marking the node offline, @llxia?

llxia is on vacation

Related:
https://stackoverflow.com/questions/42423999/cant-delete-file-created-via-docker

@sophia-guo
Contributor

I think we also need to know why this happens. Does it only happen when impl=openj9|ibm? No issue has been reported with jdk_container running against impl=hotspot.

Normally this permission issue happens when processes run as root inside the container and write to a volume mapped from the host. The jdk_container tests' volume options look like --volume /home/jenkins/workspace/jenkinsjobname/aqa-tests/TKG/output_***/jdk_container_0/work/classes/2/...., which is not the workdir, so they shouldn't have this issue. Is there something specific to openj9|ibm that causes this?
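
For reference, the general mechanism described above can be avoided by starting the container with the host user's uid and gid, so that anything written to the mapped volume stays deletable by that user. The sketch below only builds the command line. The function name and all arguments are hypothetical, and it is an open question whether the jdk_container tests, which launch their own containers, would accept such an option.

```python
import os
from typing import List, Optional


def docker_run_as_host_user(image: str, host_dir: str, container_dir: str,
                            command: List[str],
                            uid: Optional[int] = None,
                            gid: Optional[int] = None) -> List[str]:
    """Build a `docker run` argv that runs as the given uid:gid (default: current user)."""
    uid = os.getuid() if uid is None else uid
    gid = os.getgid() if gid is None else gid
    return [
        "docker", "run", "--rm",
        # Files created in the mapped volume will be owned by uid:gid, not root.
        "--user", f"{uid}:{gid}",
        "--volume", f"{host_dir}:{container_dir}",
        image, *command,
    ]
```

With user namespace remapping or rootless Docker the ownership seen on the host can differ, so this would need checking on the actual test machines.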

@AswathySK

Are there any updates on this issue?
Is @llxia back from vacation?

@smlambert
Contributor

I think we also need to know why this happens. Does it only happen when impl=openj9|ibm? No issue has been reported with jdk_container running against impl=hotspot.

If I had to guess, it happens when a testcase fails and doesn't clean up after itself, so the workspace cannot be deleted. So @AswathySK, perhaps check whether that is the case and exclude the failing testcases.

Lan is not back from vacation and no one is pursuing this issue further at this time. I suggest you dig in to answer some of the questions in this issue if you are interested in a different approach than taking the machine offline.

@sophia-guo
Contributor

Just a note that the PR that marks the node offline has also been reverted, which might help your investigation, @AswathySK.

@AswathySK

@smlambert, when a test case fails it cannot clean up after itself, because the files created when it crashes are owned by the root user. And yes, I will investigate further to find out which test cases show this issue.

@smlambert
Contributor

So my point is, the reason we do not have a cleanup problem for Temurin is that there is not a failing/crashing testcase.

So your first task would be to see which testcase is crashing/failing, triage it by gathering any extra data you can, report the issue in the openj9 repo if it doesn't already exist, and exclude the test in the ProblemList files while the issue is being investigated and fixed by the openj9 team.

llxia added a commit to llxia/aqa-tests that referenced this issue Sep 19, 2024
related: adoptium#5358 and infrastructure/issues/9874
Signed-off-by: Lan Xia <Lan_Xia@ca.ibm.com>
llxia added a commit to llxia/aqa-tests that referenced this issue Sep 19, 2024
related: adoptium#5358 and infrastructure/issues/9874
Signed-off-by: Lan Xia <Lan_Xia@ca.ibm.com>
llxia added a commit to llxia/aqa-tests that referenced this issue Sep 19, 2024
related: adoptium#5358 and infrastructure/issues/9874
Signed-off-by: Lan Xia <Lan_Xia@ca.ibm.com>
JasonFengJ9 pushed a commit that referenced this issue Sep 19, 2024
related: #5358 and infrastructure/issues/9874

Signed-off-by: Lan Xia <Lan_Xia@ca.ibm.com>
@AdamBrousseau
Contributor

Does the cleanWs() happen inside the container or after the container is exited? Still a bit odd that it is only a few files that are owned by root. 🤔

@llxia
Contributor Author

llxia commented Sep 20, 2024

cleanWs() (at the Groovy level) happens after the jdk_container container has exited. jdk_container is a set of openjdk tests, which do not belong to us.

@AdamBrousseau
Contributor

Has anyone tried removing the files, while logged in as the jenkins user, by running any random container as the root user? As much as that could be a hack workaround, it is probably easy to add a step to our cleanup scripts to remove everything inside the workspace dir via a container. cc @AswathySK
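
A rough sketch of that workaround follows. It mounts the parent of the directory to be removed into a throwaway container and deletes the directory there as root. The function name is hypothetical, the `alpine` image is an assumption (any image with `rm` would do), and it presumes a default Docker setup in which root inside the container maps to root on the host.

```python
import os
from typing import List


def docker_cleanup_command(target: str, image: str = "alpine") -> List[str]:
    """Build a `docker run` argv that removes `target` as root inside a container."""
    target = os.path.abspath(target)
    parent, name = os.path.split(target)
    if not name:
        # abspath("/") splits into ("/", ""); never try to remove that.
        raise ValueError("refusing to remove a filesystem root")
    return [
        "docker", "run", "--rm",
        "--user", "0:0",  # root inside the container
        # Mount the parent so the target itself can be deleted.
        "--volume", f"{parent}:/cleanup",
        image, "rm", "-rf", f"/cleanup/{name}",
    ]
```

A cleanup script could pass the result to `subprocess.run(cmd, check=True)` before `cleanWs()` runs. That requires the jenkins user to be allowed to run `docker`, which appears to be the case on machines that run jdk_container.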

llxia added a commit to llxia/aqa-tests that referenced this issue Sep 23, 2024
related: adoptium#5358 and infrastructure/issues/9874

Signed-off-by: Lan Xia <Lan_Xia@ca.ibm.com>
LongyuZhang pushed a commit that referenced this issue Sep 23, 2024
related: #5358 and infrastructure/issues/9874

Signed-off-by: Lan Xia <Lan_Xia@ca.ibm.com>