build-marist-rhel77-s390x-[234] crashing daily #1154

Closed
sxa opened this issue Feb 20, 2020 · 9 comments

sxa commented Feb 20, 2020

The first build machine seems ok, but the second one (148.100.86.218) is repeatedly falling over:

linux1   pts/0        195.212.29.74    Thu Feb 20 06:39   still logged in   
reboot   system boot  3.10.0-1062.12.1 Wed Feb 19 15:50 - 06:41  (14:50)    
reboot   system boot  3.10.0-1062.12.1 Tue Feb 18 15:49 - 06:41 (1+14:51)   
root     pts/0        195.212.29.94    Tue Feb 18 13:02 - 13:21  (00:19)    
root     pts/1        195.212.29.94    Tue Feb 18 10:25 - 12:40  (02:15)    
linux1   pts/0        195.212.29.94    Tue Feb 18 09:55 - 12:14  (02:18)    
reboot   system boot  3.10.0-1062.12.1 Mon Feb 17 06:09 - 06:41 (3+00:31)   
reboot   system boot  3.10.0-1062.12.1 Sat Feb 15 19:11 - 06:41 (4+11:29)   
reboot   system boot  3.10.0-1062.12.1 Sat Feb 15 14:14 - 06:41 (4+16:26)   
reboot   system boot  3.10.0-1062.12.1 Thu Feb 13 14:34 - 06:41 (6+16:06)   
reboot   system boot  3.10.0-1062.12.1 Wed Feb 12 14:14 - 06:41 (7+16:26)   
root     pts/0        195.212.29.66    Wed Feb 12 10:21 - 12:34  (02:12)    
reboot   system boot  3.10.0-1062.12.1 Mon Feb 10 19:12 - 06:41 (9+11:28)   
linux1   pts/0        195.212.29.86    Mon Feb 10 10:01 - 10:07  (00:06)    
linux1   pts/0        195.212.29.86    Mon Feb 10 07:30 - 07:30  (00:00)    
linux1   pts/0        195.212.29.86    Fri Feb  7 11:09 - 12:06  (00:56)    
linux1   pts/0        195.212.29.86    Fri Feb  7 10:44 - 10:46  (00:01)    
linux1   pts/0        195.212.29.86    Thu Feb  6 14:16 - 14:16  (00:00)    
reboot   system boot  3.10.0-1062.12.1 Thu Feb  6 07:51 - 06:41 (13+22:49)  
reboot   system boot  3.10.0-1062.9.1. Tue Feb  4 14:13 - 06:41 (15+16:27)  
linux1   pts/2        195.212.29.82    Tue Feb  4 10:18 - 10:19  (00:00)    
linux1   pts/1        195.212.29.82    Tue Feb  4 10:10 - 12:32  (02:21)    
linux1   pts/0        195.212.29.82    Tue Feb  4 08:57 - 11:23  (02:25)    
reboot   system boot  3.10.0-1062.9.1. Mon Feb  3 07:40 - 06:41 (16+23:00)  
reboot   system boot  3.10.0-1062.9.1. Sun Feb  2 07:43 - 06:41 (17+22:57)  
reboot   system boot  3.10.0-1062.9.1. Sat Feb  1 23:40 - 06:41 (18+07:00)  
reboot   system boot  3.10.0-1062.9.1. Sat Feb  1 14:17 - 06:41 (18+16:23)  
reboot   system boot  3.10.0-1062.9.1. Sat Feb  1 00:41 - 06:41 (19+05:59)  
reboot   system boot  3.10.0-1062.9.1. Thu Jan 30 14:18 - 06:41 (20+16:22)  
reboot   system boot  3.10.0-1062.9.1. Wed Jan 29 14:32 - 06:41 (21+16:09)  
reboot   system boot  3.10.0-1062.9.1. Tue Jan 28 07:23 - 06:41 (22+23:17)  
reboot   system boot  3.10.0-1062.9.1. Fri Jan 24 00:44 - 06:41 (27+05:56)  
reboot   system boot  3.10.0-1062.9.1. Wed Jan 22 10:54 - 06:41 (28+19:46)  
reboot   system boot  3.10.0-1062.9.1. Wed Jan 22 09:15 - 06:41 (28+21:25)  
reboot   system boot  3.10.0-1062.9.1. Tue Jan 21 14:13 - 06:41 (29+16:27)  
reboot   system boot  3.10.0-1062.9.1. Sun Jan 19 13:47 - 06:41 (31+16:53)  
reboot   system boot  3.10.0-1062.9.1. Fri Jan 17 23:50 - 06:41 (33+06:50)  
reboot   system boot  3.10.0-1062.9.1. Fri Jan 17 05:51 - 06:41 (34+00:49)  
reboot   system boot  3.10.0-1062.9.1. Thu Jan 16 00:30 - 06:41 (35+06:10)  
reboot   system boot  3.10.0-1062.9.1. Tue Jan 14 14:25 - 06:41 (36+16:15)  
reboot   system boot  3.10.0-1062.9.1. Mon Jan 13 15:11 - 06:41 (37+15:29)  
reboot   system boot  3.10.0-1062.9.1. Sun Jan 12 01:08 - 06:41 (39+05:32)  
reboot   system boot  3.10.0-1062.9.1. Fri Jan 10 14:28 - 06:41 (40+16:12)  
reboot   system boot  3.10.0-1062.9.1. Wed Jan  8 10:42 - 06:41 (42+19:58)  
reboot   system boot  3.10.0-1062.9.1. Sun Jan  5 05:37 - 06:41 (46+01:03)  
reboot   system boot  3.10.0-1062.9.1. Sat Jan  4 00:37 - 06:41 (47+06:03)  
reboot   system boot  3.10.0-1062.9.1. Thu Jan  2 14:13 - 06:41 (48+16:27)  
linux1   pts/0        195.212.29.70    Thu Jan  2 10:05 - 12:32  (02:27)    
reboot   system boot  3.10.0-1062.9.1. Wed Jan  1 01:42 - 06:41 (50+04:58)  
reboot   system boot  3.10.0-1062.9.1. Tue Dec 31 13:09 - 06:41 (50+17:31)  
reboot   system boot  3.10.0-1062.9.1. Sun Dec 29 23:36 - 06:41 (52+07:04)  
reboot   system boot  3.10.0-1062.9.1. Sun Dec 29 00:19 - 06:41 (53+06:21)  
reboot   system boot  3.10.0-1062.9.1. Fri Dec 27 04:03 - 06:41 (55+02:37)  
reboot   system boot  3.10.0-1062.9.1. Thu Dec 26 14:13 - 06:41 (55+16:27)  
reboot   system boot  3.10.0-1062.9.1. Mon Dec 23 23:37 - 06:41 (58+07:03)  
reboot   system boot  3.10.0-1062.9.1. Sat Dec 21 14:23 - 06:41 (60+16:17)  
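
To get a rough tally from `last` output like the listing above, the reboot records can be counted per kernel release. This is a minimal Python sketch rather than anything used in this issue: the function name `count_reboots` and the `SAMPLE` excerpt are illustrative, and it assumes the column layout shown above. Note that `last` truncates the kernel field to 16 characters (hence the trailing dot in `3.10.0-1062.9.1.`), and that a reboot record does not by itself prove a crash.

```python
from collections import Counter

# Illustrative excerpt in the same layout as the `last` listing above.
SAMPLE = """\
linux1   pts/0        195.212.29.74    Thu Feb 20 06:39   still logged in
reboot   system boot  3.10.0-1062.12.1 Wed Feb 19 15:50 - 06:41  (14:50)
reboot   system boot  3.10.0-1062.12.1 Tue Feb 18 15:49 - 06:41 (1+14:51)
reboot   system boot  3.10.0-1062.9.1. Tue Feb  4 14:13 - 06:41 (15+16:27)
"""


def count_reboots(last_output: str) -> Counter:
    """Count 'reboot' pseudo-user records per (possibly truncated) kernel string."""
    counts = Counter()
    for line in last_output.splitlines():
        fields = line.split()
        # Reboot records start with: reboot system boot <kernel> ...
        if len(fields) >= 4 and fields[:3] == ["reboot", "system", "boot"]:
            counts[fields[3]] += 1
    return counts


if __name__ == "__main__":
    for kernel, n in count_reboots(SAMPLE).items():
        print(f"{kernel}: {n} reboot record(s)")
```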

Looks like a kernel crash as follows:

[86417.737380] Unable to handle kernel pointer dereference at virtual kernel address 10180004a7f40000
[86417.737423] Oops: 0038 [#1] SMP 
[86417.737427] Modules linked in: isofs xt_pkttype ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 xt_conntrack ebtable_nat ebtable_broute bridge stp llc ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_security ip6table_raw iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat iptable_mangle iptable_security iptable_raw nf_conntrack libcrc32c ip_set nfnetlink ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter loop af_iucv qeth_l2 vmur ip_tables ext4 mbcache jbd2 dasd_fba_mod dasd_mod qeth ccwgroup qdio prng sha512_s390 ghash_s390 des_s390 des_generic aes_s390
[86417.737479] CPU: 2 PID: 38890 Comm: cc1plus Kdump: loaded Not tainted 3.10.0-1062.12.1.el7.s390x #1
[86417.737483] task: 0000000001e35d80 ti: 000000007dd9c000 task.ti: 000000007dd9c000
[86417.737486] Krnl PSW : 0704e00180000000 00000000004886e0 (__radix_tree_lookup+0x50/0x118)
[86417.737498]            R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:2 PM:0 EA:3
Krnl GPRS: 0000000000279080 10180004a7f40089 0000000000000001 10180004a7f40088
[86417.737500]            0000000000000000 000000007dd9fcd8 000002001e070000 0000000000000001
[86417.737501]            0000000000000000 000000007dd9fcd8 0000000000000040 000000000000a2f8
[86417.737502]            18000000003b55d0 000000000000a2f8 000000007dd9fc38 000000007dd9fbe8
[86417.737510] Krnl Code: 00000000004886ce: ec213ebf0055	risbg	%r2,%r1,62,191,0
	   00000000004886d4: ec260063017c	cgij	%r2,1,6,48879a
	  #00000000004886da: ec3100be0055	risbg	%r3,%r1,0,190,0
	  >00000000004886e0: e33030000094	llc	%r3,0(%r3)
	   00000000004886e6: eb3a3000000d	sllg	%r3,%r10,0(%r3)
	   00000000004886ec: a73bffff		aghi	%r3,-1
	   00000000004886f0: ecc300492065	clgrj	%r12,%r3,2,488782
	   00000000004886f6: ec26004a017c	cgij	%r2,1,6,48878a
[86417.737521] Call Trace:
[86417.737522] ([<0000000000b97700>] contig_page_data+0x700/0x1600)
[86417.737526]  [<00000000004887d4>] radix_tree_lookup_slot+0x2c/0x50
[86417.737528]  [<00000000002790b4>] __find_get_page+0x4c/0xd0
[86417.737531]  [<0000000000279174>] find_get_page+0x3c/0x58
[86417.737532]  [<00000000002ca24a>] lookup_swap_cache+0x7a/0x178
[86417.737534]  [<00000000002caa7c>] swap_readahead_detect+0xac/0x318
[86417.737535]  [<00000000002b4d00>] __handle_mm_fault+0x238/0x1028
[86417.737537]  [<00000000002b5bd6>] handle_mm_fault+0xe6/0x188
[86417.737538]  [<000000000075d5f4>] do_dat_exception+0x194/0x308
[86417.737542]  [<000000000075b728>] pgm_check_handler+0x168/0x16c
[86417.737543]  [<0000000080a65a2e>] 0x80a65a2e
[86417.737545] Last Breaking-Event-Address:
[86417.737546]  [<00000000004887ce>] radix_tree_lookup_slot+0x26/0x50
[86417.737547]  
[86417.737548] Kernel panic - not syncing: Fatal exception: panic_on_oops
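
When comparing crashes across machines, it can help to reduce each oops to the faulting symbol and the call chain. The following is a hedged Python sketch that assumes the s390x log layout shown above (a `Krnl PSW` line followed by a `Call Trace:` section); `summarize_oops` and the regular expressions are illustrative names, not part of any kernel tooling.

```python
import re

# Symbol in parentheses at the end of the "Krnl PSW" line, e.g. (__radix_tree_lookup+0x50/0x118)
PSW_RE = re.compile(r"Krnl PSW\s*:.*\((?P<sym>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+\)")
# Call trace frame, e.g. [<00000000004887d4>] radix_tree_lookup_slot+0x2c/0x50
FRAME_RE = re.compile(r"\[<[0-9a-f]+>\]\s+(?P<sym>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+")


def summarize_oops(log: str):
    """Return (faulting_symbol, [call trace symbols]) from an oops log."""
    faulting = None
    frames = []
    in_trace = False
    for line in log.splitlines():
        psw = PSW_RE.search(line)
        if psw:
            faulting = psw.group("sym")
        if "Call Trace:" in line:
            in_trace = True
            continue
        if "Last Breaking-Event-Address" in line:
            in_trace = False  # frames after this line are not part of the trace
        if in_trace:
            frame = FRAME_RE.search(line)
            if frame:  # raw addresses without a symbol are skipped
                frames.append(frame.group("sym"))
    return faulting, frames
```

Applied to the log above, this would report `__radix_tree_lookup` as the faulting symbol, with the swap cache lookup path (`lookup_swap_cache`, `swap_readahead_detect`) in the trace.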

sxa added this to the February 2020 milestone on Feb 20, 2020

sxa commented Feb 25, 2020

Two new machines have been added, build-marist-rhel77-s390x-3 and build-marist-rhel77-s390x-4, with the same kernel level as the failing machine.
It's also worth noting that the failing machine hasn't crashed since it was disabled in Jenkins. I've re-enabled it to see if it falls over tonight.

sxa commented Feb 25, 2020

New machine 148.100.245.197 (-4) has just crashed during a build: https://ci.adoptopenjdk.net/job/build-scripts/job/jobs/job/jdk14/job/jdk14-linux-s390x-openj9lastFailedBuild/console

M-Davies commented Mar 3, 2020

sxa commented Mar 3, 2020

The swapfile is now set to be disabled from the next reboot on machines -2 to -4. I've rebooted -2 so the change takes effect there immediately. We'll see if that makes any difference tonight.
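
One way to confirm after a reboot that no swap is active is to read `/proc/swaps`, which lists active swap areas under a single header line. This is a minimal Python sketch under that assumption; the helper names and the `/swapfile` path in the usage note are hypothetical, and the thread does not say how the swapfile was actually disabled.

```python
from typing import List


def active_swap_areas(proc_swaps_text: str) -> List[str]:
    """Return the names of active swap areas from /proc/swaps content."""
    lines = proc_swaps_text.splitlines()
    # The first line is the header (Filename, Type, Size, Used, Priority).
    return [line.split()[0] for line in lines[1:] if line.strip()]


def read_proc_swaps(path: str = "/proc/swaps") -> str:
    """Read the kernel's swap table (Linux only)."""
    with open(path) as handle:
        return handle.read()


if __name__ == "__main__":
    areas = active_swap_areas(read_proc_swaps())
    print("swap disabled" if not areas else "active swap: " + ", ".join(areas))
```

With only the header line present the function returns an empty list; with an entry such as `/swapfile` (a hypothetical path) it returns that name.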

sxa changed the title from "build-marist-rhel77-s390x-2 crashing daily" to "build-marist-rhel77-s390x-[234] crashing daily" on Mar 18, 2020

sxa commented Mar 18, 2020

-3 has been upgraded to 16GB of RAM, so I've re-enabled it alongside -1 and we'll see if it's any more stable.

sxa commented Mar 19, 2020

-3 worked OK yesterday, although it only seems to have run one of the jobs: https://ci.adoptopenjdk.net/job/build-scripts/job/jobs/job/jdk8u/job/jdk8u-linux-s390x-openj9-linuxXL/103/consoleFull

sxa commented Mar 19, 2020

Rerunning another pipeline and the following are running on -3:

I think I'll mark -1 offline tonight to force everything onto -3 and see what happens.

sxa commented Mar 24, 2020

The 16GB system failed again today. I'm going to start logging the failures:
-2 https://ci.adoptopenjdk.net/view/Failing%20Builds/job/build-scripts/job/jobs/job/jdk11u/job/jdk11u-linux-s390x-openj9/536/consoleFull

sxa commented Mar 31, 2020

The latest kernel update from Red Hat appears to have resolved this on all machines regardless of memory/swap setup. It was installed on 19th March and none of the machines have crashed in the last week. OpenJ9 (via @jdekonin) is reporting the same success, so I'm going to close this :-)

[root@adoptopenjdk01 ~]# rpm -qi kernel-3.10.0-1062.18.1.el7.s390x
Name        : kernel
Version     : 3.10.0
Release     : 1062.18.1.el7
Architecture: s390x
Install Date: Thu 19 Mar 2020 01:00:29 EDT
Group       : System Environment/Kernel

For completeness, the original machine was on this kernel:

[linux1@localhost ~]$ uname -a
Linux localhost.adoptopenjdk.net 3.10.0-957.21.3.el7.s390x #1 SMP Fri Jun 14 02:52:25 EDT 2019 s390x s390x s390x GNU/Linux

The failing ones were on

[linux1@adoptopenjdk01 ~]$ uname -a
Linux adoptopenjdk01.novalocal 3.10.0-1062.12.1.el7.s390x #1 SMP Thu Dec 12 06:45:30 EST 2019 s390x s390x s390x GNU/Linux

And the new ones are:

[root@adoptopenjdk03 ~]# uname -a
Linux adoptopenjdk03.novalocal 3.10.0-1062.18.1.el7.s390x #1 SMP Wed Feb 12 09:11:02 EST 2020 s390x s390x s390x GNU/Linux
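
To check whether a machine is at or beyond the kernel level that appears to have fixed this, the release strings from `uname -r` can be compared numerically. This is a rough Python sketch and a simplification of real RPM version comparison: `kernel_key`, `is_at_least` and `FIXED_RELEASE` are illustrative names, and only the leading numeric parts (before `.el7`) are compared. Older releases are not necessarily affected; the original machine on 3.10.0-957.21.3 was reported as OK.

```python
import re
from typing import Tuple

# Release that appeared to resolve the crashes in this issue.
FIXED_RELEASE = "3.10.0-1062.18.1.el7.s390x"

_RELEASE_RE = re.compile(r"^(\d+)\.(\d+)\.(\d+)-(\d+(?:\.\d+)*)")


def kernel_key(release: str) -> Tuple[int, ...]:
    """Turn '3.10.0-1062.18.1.el7.s390x' into (3, 10, 0, 1062, 18, 1)."""
    match = _RELEASE_RE.match(release)
    if match is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    version = [int(part) for part in match.group(1, 2, 3)]
    build = [int(part) for part in match.group(4).split(".")]
    return tuple(version + build)


def is_at_least(release: str, minimum: str = FIXED_RELEASE) -> bool:
    """True if `release` sorts at or after `minimum` (numeric parts only)."""
    return kernel_key(release) >= kernel_key(minimum)


if __name__ == "__main__":
    import platform
    print(platform.release(), is_at_least(platform.release()))
```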
