JDK17 Extended Test Failures On test-osuosl-aix72-ppc64-5 due to filling /tmp #3129

Open
steelhead31 opened this issue Jul 9, 2023 · 6 comments

Comments

@steelhead31
Contributor

The JDK17 extended test suites failed when running on test-osuosl-aix72-ppc64-5 because /tmp filled up.

The error can be seen here (as well as in Nagios): https://ci.adoptium.net/job/Test_openjdk17_hs_extended.openjdk_ppc64_aix_testList_2/

The test job appears to have created 2 x 2.1 GB temporary files in /tmp, filling the entire file system and causing tests to fail.

@aixtools
Contributor

I am looking at all the systems - and as they are all recently cloned it appears there have been (undocumented?) changes to the system configurations.

When there are issues these should not be hacked at on the fly. At a minimum, what was done needs to be reported in the issue - and perhaps the playbooks need an update.

As an example: the size of 4G for /tmp was chosen because the test used to be smaller - and 4G was sufficient by nearly 2G. If the test is now doing 2x 2+G, obviously 4G is not going to work.

YET: when I look at the systems /tmp has not been increased, but /var has been increased on two systems.

I cannot second guess what needs to be done when changes are made on the fly.

So, no (known) action taken to resolve this issue. And it looks like it is just waiting to happen again - on different systems.

@aixtools
Contributor

Seems to be affecting some, but not all, systems (note 100% used below):

root@osunim:[/root]dsh-adopt "/usr/bin/df -g /tmp"
adopt01:
Filesystem    GB blocks      Free %Used    Iused %Iused Mounted on
/dev/hd3           4.00      3.99    1%       47     1% /tmp
==============
adopt02:
Filesystem    GB blocks      Free %Used    Iused %Iused Mounted on
/dev/hd3           4.00      4.00    1%       44     1% /tmp
==============
adopt03:
Filesystem    GB blocks      Free %Used    Iused %Iused Mounted on
/dev/hd3           4.00      2.97   26%     1665     1% /tmp
==============
adopt04:
Filesystem    GB blocks      Free %Used    Iused %Iused Mounted on
/dev/hd3           4.00      3.40   16%      252     1% /tmp
==============
adopt05:
Filesystem    GB blocks      Free %Used    Iused %Iused Mounted on
/dev/hd3           4.00      3.99    1%      535     1% /tmp
==============
adopt06:
Filesystem    GB blocks      Free %Used    Iused %Iused Mounted on
/dev/hd3           4.00      0.00  100%      406     7% /tmp
==============
adopt07:
Filesystem    GB blocks      Free %Used    Iused %Iused Mounted on
/dev/hd3           4.00      0.00  100%      469     9% /tmp
==============
adopt08:
Filesystem    GB blocks      Free %Used    Iused %Iused Mounted on
/dev/hd3           4.00      3.99    1%      503     1% /tmp
==============
adopt10:
Filesystem    GB blocks      Free %Used    Iused %Iused Mounted on
/dev/hd3           4.00      3.99    1%      116     1% /tmp
==============
  • When I look at the affected systems - I wonder if the issue is caused, in part, by a lack of cleanup.
root@osunim:[/root]ssh adopt07 /usr/bin/du -sg /tmp/*.dat
2.00    /tmp/dst2848659901056789357.dat
1.99    /tmp/dst665158920248314980.dat
0.00    /tmp/src2464967498881328204.dat
root@osunim:[/root]ssh adopt06 /usr/bin/du -sg /tmp/*.dat
1.99    /tmp/dst16616608096706097137.dat
2.00    /tmp/dst6039054501113654116.dat
0.00    /tmp/src805118232336050098.dat
  • I cannot conclude anything based on the file timestamp (last modified) - but I do see that the src*.dat file is logically large, but physically small (i.e., sparse).
root@osunim:[/root]ssh adopt06 /usr/bin/ls -ls /tmp/*.dat
2087736 -rw-r--r--    1 jenkins  staff    2137837568 Aug 13 16:51 /tmp/dst16616608096706097137.dat
2097156 -rw-------    1 jenkins  staff    2147484671 Aug 13 16:51 /tmp/dst6039054501113654116.dat
   8 -rw-------    1 jenkins  staff    2147484671 Aug 13 16:51 /tmp/src805118232336050098.dat
root@osunim:[/root]ssh adopt07 /usr/bin/ls -ls /tmp/*.dat
2097160 -rw-------    1 jenkins  staff    2147484671 Aug 13 16:08 /tmp/dst2848659901056789357.dat
2089732 -rw-r--r--    1 jenkins  staff    2139873280 Aug 13 16:10 /tmp/dst665158920248314980.dat
   8 -rw-------    1 jenkins  staff    2147484671 Aug 13 16:08 /tmp/src2464967498881328204.dat
  • I'll increase /tmp to 5G to give breathing space, and clean up junk.
  • If this resolves the issue - then the playbook can be adjusted to give /tmp 5G rather than 4G.
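
A quick arithmetic check of the sizing, as a sketch only. It assumes the "GB blocks" reported by `df -g` are GiB (2**30 bytes) and that the test writes two dst files of 2147484671 bytes each, which is the size `ls` shows for the complete ones above. `headroom_bytes` is a name made up for illustration.

```python
# Assumption: df -g "GB blocks" are GiB (2**30 bytes).
GIB = 2**30

# Size that ls reports for the complete dst*.dat files: 2 GiB + 1023 bytes.
DST_FILE_BYTES = 2_147_484_671


def headroom_bytes(fs_gib, n_files=2, file_bytes=DST_FILE_BYTES):
    """Bytes left on an otherwise empty filesystem after writing n_files."""
    return fs_gib * GIB - n_files * file_bytes


for size_gib in (4, 5):
    left = headroom_bytes(size_gib)
    print(f"/tmp at {size_gib}G: {left:+d} bytes ({left / GIB:+.2f} GiB) after two dst files")
```

On those assumptions an empty 4G /tmp falls 2046 bytes short of holding both files, while 5G leaves roughly 1 GiB spare. That would also fit with one dst file per host ending up slightly smaller than 2147484671 bytes in the listings above.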

@aixtools
Contributor

  • Modified situation:
root@osunim:[/root]dsh-adopt "/usr/bin/df -g /tmp"
adopt01:
Filesystem    GB blocks      Free %Used    Iused %Iused Mounted on
/dev/hd3           5.00      4.99    1%       47     1% /tmp
==============
adopt02:
Filesystem    GB blocks      Free %Used    Iused %Iused Mounted on
/dev/hd3           5.00      4.99    1%       39     1% /tmp
==============
adopt03:
Filesystem    GB blocks      Free %Used    Iused %Iused Mounted on
/dev/hd3           5.00      3.97   21%     1388     1% /tmp
==============
adopt04:
Filesystem    GB blocks      Free %Used    Iused %Iused Mounted on
/dev/hd3           5.00      4.40   13%      235     1% /tmp
==============
adopt05:
Filesystem    GB blocks      Free %Used    Iused %Iused Mounted on
/dev/hd3           5.00      4.99    1%      242     1% /tmp
==============
adopt06:
Filesystem    GB blocks      Free %Used    Iused %Iused Mounted on
/dev/hd3           5.00      4.99    1%      304     1% /tmp
==============
adopt07:
Filesystem    GB blocks      Free %Used    Iused %Iused Mounted on
/dev/hd3           5.00      4.99    1%      433     1% /tmp
==============
adopt08:
Filesystem    GB blocks      Free %Used    Iused %Iused Mounted on
/dev/hd3           5.00      4.99    1%      281     1% /tmp
==============
adopt10:
Filesystem    GB blocks      Free %Used    Iused %Iused Mounted on
/dev/hd3           5.00      4.99    1%      111     1% /tmp
==============
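
To spot full hosts without reading every stanza, the dsh output could be filtered with something like the sketch below. The parser is written only against the output format pasted in this issue. `parse_dsh_df`, `full_hosts` and the 90% threshold are illustrative assumptions, not part of any existing tooling. The sample text is taken from the earlier (4G) output.

```python
import re

# Sample stanzas copied from the pre-resize output earlier in this issue.
SAMPLE = """\
adopt05:
Filesystem    GB blocks      Free %Used    Iused %Iused Mounted on
/dev/hd3           4.00      3.99    1%      535     1% /tmp
==============
adopt06:
Filesystem    GB blocks      Free %Used    Iused %Iused Mounted on
/dev/hd3           4.00      0.00  100%      406     7% /tmp
==============
adopt07:
Filesystem    GB blocks      Free %Used    Iused %Iused Mounted on
/dev/hd3           4.00      0.00  100%      469     9% /tmp
==============
"""


def parse_dsh_df(text):
    """Return {host: percent_used} from per-host `df -g` stanzas."""
    usage = {}
    host = None
    for raw in text.splitlines():
        line = raw.strip()
        match = re.fullmatch(r"(\S+):", line)
        if match:
            host = match.group(1)  # a "hostname:" line starts a new stanza
            continue
        fields = line.split()
        # Data rows have 7 fields, e.g. /dev/hd3 4.00 0.00 100% 406 7% /tmp
        if host and len(fields) == 7 and fields[3].endswith("%"):
            usage[host] = int(fields[3].rstrip("%"))
    return usage


def full_hosts(usage, threshold=90):
    """Hosts whose %Used is at or above the (assumed) threshold."""
    return sorted(h for h, pct in usage.items() if pct >= threshold)


print(full_hosts(parse_dsh_df(SAMPLE)))
```

With the sample above this flags adopt06 and adopt07, matching the two 100% rows.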

@aixtools
Contributor

Looks like there may still be an artifact:

adopt07:
Filesystem    GB blocks      Free %Used    Iused %Iused Mounted on
/dev/hd3           5.00      0.99   81%      632     1% /tmp

  • Looks like it is not cleaning up properly?
root@adopt07:[/root]ls -ltr /tmp/*.dat
-rw------- 1 jenkins staff 2147484671 Aug 20 17:01 /tmp/src4056949266194807527.dat
-rw------- 1 jenkins staff 2147484671 Aug 20 17:02 /tmp/dst9520860984591320509.dat
-rw-r--r-- 1 jenkins staff 2147484671 Aug 20 17:03 /tmp/dst6282469117645280818.dat
root@adopt07:[/root]date
Wed Aug 23 12:06:51 UTC 2023
  • removed the files, and clearing jenkins.
  • Is the last console output still available?
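
The files above were roughly three days old when listed, so a periodic check for leftovers might help until the test itself is fixed. Below is a minimal sketch, assuming the leftovers always match `src*.dat` / `dst*.dat` as in the listings here. The function name and the 24-hour threshold are hypothetical. It only reports files and deletes nothing.

```python
import fnmatch
import os
import time

# Assumption: leftover test files follow the naming seen in this issue.
PATTERNS = ("src*.dat", "dst*.dat")


def stale_temp_files(directory, max_age_hours=24.0, now=None):
    """Return sorted (path, size_bytes, age_hours) for old matching files."""
    now = time.time() if now is None else now
    found = []
    with os.scandir(directory) as entries:
        for entry in entries:
            # Only regular files; do not follow symlinks.
            if not entry.is_file(follow_symlinks=False):
                continue
            if not any(fnmatch.fnmatch(entry.name, p) for p in PATTERNS):
                continue
            st = entry.stat(follow_symlinks=False)
            age_hours = (now - st.st_mtime) / 3600
            if age_hours > max_age_hours:
                found.append((entry.path, st.st_size, age_hours))
    return sorted(found)
```

Calling `stale_temp_files("/tmp")` returns the candidates for review. Whether to remove them automatically is a separate decision, since a running test could still own a file that merely looks old.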

@aixtools
Contributor

Just wondering if this is a problem with the test.

      4 -rw-r--r-- 1 jenkins staff          40 Aug 27 16:22 blah4255219114647392657.tmp
      4 -rw-r--r-- 1 jenkins staff         151 Aug 26 17:33 unsigned.jar1450541346237646654jar
      4 -rw-r--r-- 1 jenkins staff         305 Aug 26 15:27 test1723908656910621468.test
      4 -rw-r--r-- 1 jenkins staff         383 Aug 27 17:39 test10517561616964218431.test
      4 -rw-r--r-- 1 jenkins staff         403 Aug 27 17:39 test15750855502312122436.test
      4 -rw-r--r-- 1 jenkins staff         403 Aug 27 17:39 test16807323090330678638.test
      4 -rw-r--r-- 1 jenkins staff        1862 Aug 26 17:33 signed.jar8571074910892324627jar
      4 -rw-r--r-- 1 jenkins staff        1974 Aug 26 17:33 signed2.jar1180279166009667648jar
      4 -rw-r--r-- 1 jenkins staff       32007 Aug 27 16:22 source245824410068849651.tmp
      4 -rw-r--r-- 1 jenkins staff  6442450960 Aug 27 16:23 source1323321727058409565.tmp
      4 -rw-r--r-- 1 root    system          6 Jun 20 10:57 rc.net.out
      4 -rw-r--r-- 1 root    system         24 Jun 20 11:08 NIM_instp_updt_list
      4 -rw-r--r-- 1 root    system         77 Jun 20 10:57 KrsctPHA.saved
      4 -rw-r--r-- 1 root    system       2124 Jun 20 10:57 ctrmc_MDdr.dbg
      4 -rw-rw-r-- 1 root    system         53 Jun 20 10:54 uncfgct.dbg
      4 -rw-rw-r-- 1 root    system        676 Jun 20 10:55 rsct_cfgct_history.log
      8 -rw------- 1 jenkins staff  2147484671 Aug 27 16:22 src6628553441366702378.dat
    136 -rw-r--r-- 1 jenkins staff      138481 Aug 27 13:50 hs_err_pid19726826.log
    200 -rw-rw-r-- 1 root    system     204800 Aug 29 15:00 lvmt.log
   1024 -rw-r--r-- 1 jenkins staff     1048576 Aug 27 16:22 blah17126132827369752914.tmp
2097156 -rw------- 1 jenkins staff  2147484671 Aug 27 16:22 dst1075097809748986483.dat
2097160 -rw-r--r-- 1 jenkins staff  2147484671 Aug 27 16:23 dst2207767882312241704.dat
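
In the listing above, source1323321727058409565.tmp has a logical size of 6442450960 bytes (6 GiB + 16 bytes) but only 4 blocks allocated, so it is sparse, like the src*.dat file. A small sketch for comparing logical and allocated size follows. It assumes `st_blocks` counts 512-byte units, as the Python documentation describes; that may not hold on every platform. The helper names are made up for illustration.

```python
import os


def allocation(path):
    """Return (logical_bytes, allocated_bytes) for path."""
    st = os.stat(path)
    # Assumption: st_blocks is in 512-byte units, independent of the ls -s unit.
    return st.st_size, st.st_blocks * 512


def looks_sparse(path):
    """Heuristic: fewer bytes allocated than the logical size."""
    logical, allocated = allocation(path)
    return allocated < logical
```

This is only a heuristic: some filesystems store very small files inline and report zero blocks for them, so it is most useful for large files such as the ones listed here.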

@sxa
Member

sxa commented Nov 5, 2024

Needs to be examined further in a clean environment to determine the cause, and ideally to narrow down which tests in the external suites are causing the problem.

OpenJDK have discussed test cases not always cleaning up after themselves.

@sxa sxa added this to the 2024-11 (November) milestone Nov 5, 2024