Hardlockup during a VM boot on zz with D2.0 with latest devel branch #25

sathnaga · 2017-12-08T13:47:22Z

<cde:info> Mirrored with LTC bug https://bugzilla.linux.ibm.com/show_bug.cgi?id=162352 </cde:info>

Hit a hard lockup on the host with the latest hostos devel branch while booting a VM (a hostos guest with a yum update test run inside it). I suspect the guest boot itself caused it, irrespective of which test was run.

kernel: 4.14.0-3.dev.git68b4afb.el7.centos.ppc64le

Machine used:
zz - DD2.0

# lscpu
Architecture:          ppc64le
Byte Order:            Little Endian
CPU(s):                128
On-line CPU(s) list:   0-127
Thread(s) per core:    4
Core(s) per socket:    16
Socket(s):             2
NUMA node(s):          2
Model:                 2.0 (pvr 004e 1200)
Model name:            POWER9 (raw), altivec supported
CPU max MHz:           3800.0000
CPU min MHz:           2283.0000
Hypervisor vendor:     (null)
Virtualization type:   full
L1d cache:             32K
L1i cache:             32K
L2 cache:              512K
L3 cache:              10240K
NUMA node0 CPU(s):     0-63
NUMA node8 CPU(s):     64-127

FW: fips910/b1117d_1748.910

JOB LOG : /home/workspace/runAvocadoFVTTest/avocado-fvt-wrapper/results/job-2017-12-08T08.18-fe6f991/job.log
(1/5) guest_import.qemu.qcow2.virtio_scsi.smp2.virtio_net.HostOS.ppc64le.powerkvm-qemu.unattended_install.import.import.default_install.aio_native: PASS (47.01 s)
(2/5) guest_update.yum.qemu.qcow2.virtio_scsi.smp2.virtio_net.HostOS.ppc64le.powerkvm-qemu.guest_test.isa_serial_operations: PASS (58.74 s)
...

[ 7475.715366] virbr0: topology change detected, propagating
[ 7553.518889] virbr0: port 2(vnet0) entered disabled state
[ 7553.525649] device vnet0 left promiscuous mode
[ 7553.525718] virbr0: port 2(vnet0) entered disabled state
[ 7569.408657] Watchdog CPU:32 detected Hard LOCKUP other CPUS:117

Message from syslogd@ltczzj2 at Dec  8 08:21:19 ...
 kernel:Watchdog CPU:32 detected Hard LOCKUP other CPUS:117
[ 7584.519034] Watchdog CPU:94 detected Hard LOCKUP other CPUS:44
[ 7584.519142] Watchdog CPU:44 Hard LOCKUP
[ 7584.519206] Modules linked in: vhost_net vhost tap binfmt_misc xt_CHECKSUM ipt_MASQUERADE nf_nat_masquerade_ipv4 ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 xt_conntrack ip_set nfnetlink rpcrdma ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod ib_srp scsi_transport_srp ebtable_nat ebtable_broute ib_ipoib rdma_ucm ib_ucm ib_uverbs bridge ib_umad stp llc rdma_cm ib_cm iw_cm ip6table_mangle ip6table_security ip6table_raw iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat mlx5_ib ib_core nf_conntrack iptable_mangle iptable_security iptable_raw ebtable_filter ebtables ip6table_filter ip6_tables ses enclosure scsi_transport_sas ipmi_powernv ipmi_devintf powernv_op_panel ipmi_msghandler opal_prd nfsd auth_rpcgss oid_registry
[ 7584.519872]  nfs_acl lockd grace sunrpc kvm_hv kvm lpfc mlx5_core bnx2x scsi_transport_fc mdio libcrc32c ptp pps_core
[ 7584.519983] CPU: 44 PID: 35342 Comm: CPU 30/KVM Not tainted 4.14.0-3.dev.git68b4afb.el7.centos.ppc64le #1
[ 7584.520072] task: c000003f7da1ce00 task.stack: c000003f7dad4000
[ 7584.520135] NIP:  c0000000001b9464 LR: c0000000001b9408 CTR: c0000000000420e0
[ 7584.520209] REGS: c000003f7dad74c0 TRAP: 0e81   Not tainted  (4.14.0-3.dev.git68b4afb.el7.centos.ppc64le)
[ 7584.520296] MSR:  9000000000009033 <SF,HV,EE,ME,IR,DR,RI,LE>  CR: 44224484  XER: 20040000
[ 7584.520378] CFAR: c0000000001b9470 SOFTE: 1
[ 7584.520378] GPR00: c0000000001b9408 c000003f7dad7740 c000000001403900 0000000000000075
[ 7584.520378] GPR04: 0000000000000400 0000000000000040 0000000000000000 c000000001439d78
[ 7584.520378] GPR08: 0000000000000001 0000000000000003 c000203994b48da0 0000000000000010
[ 7584.520378] GPR12: c0000000000420e0 c00000000fd7ce00
[ 7584.520689] NIP [c0000000001b9464] smp_call_function_many+0x384/0x420
[ 7584.520753] LR [c0000000001b9408] smp_call_function_many+0x328/0x420
[ 7584.520816] Call Trace:
[ 7584.520844] [c000003f7dad7740] [c0000000001b9408] smp_call_function_many+0x328/0x420 (unreliable)
[ 7584.520934] [c000003f7dad77b0] [c0000000000711c8] serialize_against_pte_lookup+0x38/0x50
[ 7584.521011] [c000003f7dad77d0] [c000000000073360] radix__pmdp_huge_get_and_clear+0x60/0x80
[ 7584.521088] [c000003f7dad7800] [c00000000033ae2c] zap_huge_pmd+0x8c/0x510
[ 7584.521153] [c000003f7dad78a0] [c0000000002d47f0] unmap_page_range+0xe80/0x10a0
[ 7584.521229] [c000003f7dad79e0] [c0000000002d4f04] unmap_vmas+0x84/0xf0
[ 7584.521293] [c000003f7dad7a30] [c0000000002e3508] exit_mmap+0xe8/0x1f0
[ 7584.521376] [c000003f7dad7af0] [c0000000000fa9f8] mmput+0xb8/0x1f0
[ 7584.521494] [c000003f7dad7b20] [c000000000104cf8] do_exit+0x358/0xcd0
[ 7584.521609] [c000003f7dad7be0] [c000000000105744] do_group_exit+0x64/0x100
[ 7584.521725] [c000003f7dad7c20] [c0000000001155d0] get_signal+0x210/0x700
[ 7584.521841] [c000003f7dad7d10] [c00000000001c65c] do_signal+0x6c/0x2d0
[ 7584.521957] [c000003f7dad7e00] [c00000000001ca74] do_notify_resume+0xd4/0x100
[ 7584.522098] [c000003f7dad7e30] [c00000000000bec4] ret_from_except_lite+0x70/0x74
[ 7584.522238] Instruction dump:
[ 7584.522310] 7d495214 812a0018 792807e1 41820034 4800001c 60000000 60000000 60000000
[ 7584.522451] 60000000 60000000 60420000 7c210b78 <7c421378> 812a0018 792807e1 4082fff0

Message from syslogd@ltczzj2 at Dec  8 08:21:33 ...
 kernel:Watchdog CPU:94 detected Hard LOCKUP other CPUS:44

Message from syslogd@ltczzj2 at Dec  8 08:21:33 ...
 kernel:Watchdog CPU:44 Hard LOCKUP
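
For triage, the watchdog lines in the dmesg excerpt above can be reduced to pairs of detecting CPU and stuck CPU. The following is only a minimal sketch, assuming the message format seen in this log; `parse_watchdog_lines` is a hypothetical helper name and not part of any kernel tooling.

```python
import re

# Matches lines such as:
#   [ 7569.408657] Watchdog CPU:32 detected Hard LOCKUP other CPUS:117
# The format is taken from the log in this report; other kernels may differ.
DETECTED_RE = re.compile(
    r"Watchdog CPU:(\d+) detected Hard LOCKUP other CPUS:([\d,\-]+)"
)
# Matches the self-report line, e.g. "Watchdog CPU:44 Hard LOCKUP".
SELF_RE = re.compile(r"Watchdog CPU:(\d+) Hard LOCKUP")


def parse_watchdog_lines(text):
    """Return (detections, self_reports) found in a dmesg excerpt.

    detections   -- list of (detecting_cpu, stuck_cpus_string) tuples
    self_reports -- list of CPU numbers that reported their own lockup
    """
    detections = []
    self_reports = []
    for line in text.splitlines():
        m = DETECTED_RE.search(line)
        if m:
            # The stuck-CPU field is kept as a string because it may be a list.
            detections.append((int(m.group(1)), m.group(2)))
            continue
        m = SELF_RE.search(line)
        if m:
            self_reports.append(int(m.group(1)))
    return detections, self_reports


sample = """\
[ 7569.408657] Watchdog CPU:32 detected Hard LOCKUP other CPUS:117
[ 7584.519034] Watchdog CPU:94 detected Hard LOCKUP other CPUS:44
[ 7584.519142] Watchdog CPU:44 Hard LOCKUP
"""
detections, self_reports = parse_watchdog_lines(sample)
print(detections)    # detecting CPU paired with the CPUs it found stuck
print(self_reports)  # CPUs that reported their own lockup
```

Pairing the two lists shows which stuck CPU also produced its own backtrace (CPU 44 in the excerpt), which is the one worth reading first.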


cdeadmin · 2017-12-11T04:05:28Z

------- Comment From chavez@us.ibm.com 2017-12-08 12:48:35 EDT-------
Which firmware version is installed?


cdeadmin · 2018-02-15T05:46:26Z

------- Comment From seg@us.ibm.com 2018-02-15 00:39:50 EDT-------
OK the firmware is there. Please test.

sathnaga · 2018-02-19T08:51:09Z

Used the latest FW and am still able to reproduce the issue.

ENV:

FW: fips910/b0215b_1808.910

Host Kernel: 4.15.0-5.dev.git33f711f.el7.centos.ppc64le
# lscpu
Architecture:          ppc64le
Byte Order:            Little Endian
CPU(s):                128
On-line CPU(s) list:   0-127
Thread(s) per core:    4
Core(s) per socket:    16
Socket(s):             2
NUMA node(s):          2
Model:                 2.0 (pvr 004e 1200)
Model name:            POWER9 (raw), altivec supported
CPU max MHz:           3800.0000
CPU min MHz:           2300.0000
Hypervisor vendor:     (null)
Virtualization type:   full
L1d cache:             32K
L1i cache:             32K
L2 cache:              512K
L3 cache:              10240K
NUMA node0 CPU(s):     0-63
NUMA node8 CPU(s):     64-127

# cat /proc/cmdline 
root=/dev/mapper/host_os_ltczzj2-root ro console=tty0 rd.lvm.lv=host_os_ltczzj2/root rd.lvm.lv=host_os_ltczzj2/swap LANG=en_US.UTF-8

Steps to reproduce: Just run any avocado guest tests.

python avocado-setup.py  --run-suite guest_cpu --guest-os HostOS.ppc64le --no-download
00:22:56 INFO    : Check for environment
00:22:56 INFO    : Creating temporary mux dir
00:23:06 INFO    : 
00:23:06 INFO    : Running Guest Tests Suite cpu
00:23:06 INFO    : Running: /usr/bin/avocado run --vt-type libvirt --vt-config /home/sath/avocado-fvt-wrapper/data/avocado-vt/backends/libvirt/cfg/cpu.cfg                 --force-job-id d40eec16791f17cf5a0b3f0de03d2ab04da7a2d3  --vt-only-filter                                             "HostOS.ppc64le" --job-results-dir /home/sath/avocado-fvt-wrapper/results
JOB ID     : d40eec16791f17cf5a0b3f0de03d2ab04da7a2d3
JOB LOG    : /home/sath/avocado-fvt-wrapper/results/job-2018-02-19T00.23-d40eec1/job.log
 (001/853) guest_import.qemu.qcow2.virtio_scsi.smp2.virtio_net.Guest.HostOS.ppc64le.powerkvm-qemu.unattended_install.import.import.default_install.aio_native: -
Message from syslogd@ltczzj2 at Feb 19 00:24:14 ...
 kernel:Watchdog CPU:97 detected Hard LOCKUP other CPUS:38

dmesg:

[  186.241993] virbr0: port 2(vnet0) entered disabled state
[  200.167913] Watchdog CPU:97 detected Hard LOCKUP other CPUS:38
[  246.439590] INFO: rcu_sched self-detected stall on CPU
[  246.439710] 	74-....: (6000 ticks this GP) idle=6d2/140000000000001/0 softirq=786/786 fqs=2889 
[  246.439795] 	 (t=6000 jiffies g=2031 c=2030 q=30958)
[  246.439852] NMI backtrace for cpu 74
[  246.439891] CPU: 74 PID: 6855 Comm: worker Not tainted 4.15.0-5.dev.git33f711f.el7.centos.ppc64le #1
[  246.439976] Call Trace:
[  246.440007] [c000203835602ef0] [c000000000b0c19c] dump_stack+0xb0/0xf4 (unreliable)
[  246.440082] [c000203835602f30] [c000000000b153c4] nmi_cpu_backtrace+0x1a4/0x210
[  246.440156] [c000203835602fc0] [c000000000b1560c] nmi_trigger_cpumask_backtrace+0x1dc/0x220
[  246.440232] [c000203835603060] [c00000000002f448] arch_trigger_cpumask_backtrace+0x28/0x40
[  246.440307] [c000203835603080] [c000000000197a6c] rcu_dump_cpu_stacks+0xfc/0x160
[  246.440380] [c0002038356030d0] [c000000000196c90] rcu_check_callbacks+0x8e0/0xaf0
[  246.440454] [c000203835603200] [c0000000001a3874] update_process_times+0x44/0x90
[  246.440529] [c000203835603230] [c0000000001bac8c] tick_sched_handle.isra.13+0x4c/0x80
[  246.440602] [c000203835603250] [c0000000001bad20] tick_sched_timer+0x60/0xc0
[  246.440675] [c000203835603290] [c0000000001a44a8] __hrtimer_run_queues+0xf8/0x330
[  246.440749] [c000203835603310] [c0000000001a524c] hrtimer_interrupt+0xec/0x290
[  246.440823] [c0002038356033d0] [c000000000024438] __timer_interrupt+0x98/0x280
[  246.440897] [c000203835603420] [c000000000024b58] timer_interrupt+0xa8/0xe0
[  246.440959] [c000203835603450] [c0000000000092a8] decrementer_common+0x158/0x160
[  246.441034] --- interrupt: 901 at smp_call_function_many+0x388/0x420
    LR = smp_call_function_many+0x328/0x420
[  246.441145] [c0002038356037b0] [c000000000072328] serialize_against_pte_lookup+0x38/0x50
[  246.441219] [c0002038356037d0] [c0000000000744d0] radix__pmdp_huge_get_and_clear+0x60/0x80
[  246.441294] [c000203835603800] [c00000000034598c] zap_huge_pmd+0x8c/0x540
[  246.441356] [c0002038356038a0] [c0000000002df120] unmap_page_range+0xe80/0x10a0
[  246.441430] [c0002038356039e0] [c0000000002df834] unmap_vmas+0x84/0x100
[  246.441491] [c000203835603a30] [c0000000002edf1c] exit_mmap+0x9c/0x1c0
[  246.441553] [c000203835603af0] [c0000000000ffd28] mmput+0xb8/0x1f0
[  246.441614] [c000203835603b20] [c00000000010a1c8] do_exit+0x358/0xcd0
[  246.441676] [c000203835603be0] [c00000000010ac14] do_group_exit+0x64/0x100
[  246.441738] [c000203835603c20] [c00000000011aba0] get_signal+0x210/0x700
[  246.441799] [c000203835603d10] [c00000000001cb00] do_signal+0x80/0x2e0
[  246.441861] [c000203835603e00] [c00000000001cf14] do_notify_resume+0xd4/0x100
[  246.441935] [c000203835603e30] [c00000000000bf44] ret_from_except_lite+0x70/0x74
[  426.487634] INFO: rcu_sched self-detected stall on CPU
[  426.487722] 	74-....: (24004 ticks this GP) idle=6d2/140000000000001/0 softirq=786/786 fqs=11606 
[  426.487733] 	 (t=24004 jiffies g=2031 c=2030 q=119140)
[  426.487791] NMI backtrace for cpu 74
[  426.487830] CPU: 74 PID: 6855 Comm: worker Not tainted 4.15.0-5.dev.git33f711f.el7.centos.ppc64le #1
[  426.487915] Call Trace:

P.S.: With powersave=off on the kernel command line, no issues are seen.
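
Since the behaviour depends on whether powersave=off is set, it can help to record that alongside each reproduction attempt. This is a minimal sketch, assuming a Linux host that exposes /proc/cmdline; `has_kernel_param` and `read_cmdline` are hypothetical helper names used only for illustration.

```python
def has_kernel_param(cmdline, param):
    """Return True if `param` appears as a whole token on the kernel command line."""
    return param in cmdline.split()


def read_cmdline(path="/proc/cmdline"):
    """Read the running kernel's command line (Linux only)."""
    with open(path) as f:
        return f.read().strip()


# Command line quoted earlier in this comment; powersave=off is absent here.
reported = (
    "root=/dev/mapper/host_os_ltczzj2-root ro console=tty0 "
    "rd.lvm.lv=host_os_ltczzj2/root rd.lvm.lv=host_os_ltczzj2/swap "
    "LANG=en_US.UTF-8"
)
print(has_kernel_param(reported, "powersave=off"))
print(has_kernel_param(reported + " powersave=off", "powersave=off"))
```

On the test host itself, `has_kernel_param(read_cmdline(), "powersave=off")` gives the same check against the running kernel.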

cdeadmin · 2018-02-19T13:56:11Z

------- Comment From seg@us.ibm.com 2018-02-19 08:52:54 EDT-------
OPAL team, please have a look.

cdeadmin · 2018-02-19T19:36:04Z

==== State: Rejected by: dcrowell on 19 February 2018 13:18:55 ====

I have tried to read the pointed-at bug and I can't make any sense of it so I have no idea what the fix is that is being referred to. It looks like there have been several iterations of different problems described. Please summarize what the expected fix is and reopen. Or better yet, just dupe this directly.

cdeadmin · 2018-02-27T15:06:45Z

------- Comment From satheera@in.ibm.com 2018-02-27 09:57:06 EDT-------
I am able to reproduce a similar issue on DD2.2 as well, with the latest FW and powersave enabled. With powersave disabled (i.e. powersave=off on the kernel cmdline), the issue is not seen.

Host Env
$cupdcmd -f
...

Current Side Driver:.....fips910/b0223b_1810.910
Non-Current Side Driver:.fips910/b0215b_1808.910

Model: 2.2 (pvr 004e 1202)
Model name: POWER9 (raw), altivec supported

Host kernel: 4.15.0-5.dev.git33f711f.el7.centos.ppc64le

kernel cmdline: root=/dev/mapper/host_os_9-root ro console=tty0 rd.lvm.lv=host_os_9/root rd.lvm.lv=host_os_9/swap LANG=en_US.UTF-8

Steps to reproduce:

Boot a KVM guest and run a simple test (copy a file from the host to the guest, then do a yum update).
=======> The host suddenly freezes and reboots.

The FSP event/error log then has this entry:
5302D258 2018-02-27 14:17:04 CEC Hardware Subsystem Unrecoverable Error BB82C013 ?

Platform Event Log - 5302D258
Created at : 02/27/2018 14:17:03
Subsystem : CEC Hardware Subsystem
Event Severity : Unrecoverable Error
Action Flags : Report Externally
Action Status : Reported to Opr Sys

Primary System Reference Code

Reference Code : BB82C013
Hex Words 2 - 5 : 00000080 00000000 00000000 00000000
Hex Words 6 - 9 : 00000000 00000000 00000000 00000000

Log Hex Dump
00000000  50480030 01004F43 20180227 14170300  PH.0..OC ..'....
00000010  20180227 14170430 4B000106 00000000   ..'...0K.......
00000020  00000000 00000000 B0000002 5302D258  ............S..X
00000030  55480018 01004F43 50004000 00000000  UH....OCP.@.....
00000040  00002000 01004500 50530050 01004F43  .. ...E.PS.P..OC
00000050  02000008 00000048 00000080 00000000  .......H........
00000060  00000000 00000000 00000000 00000000  ................
00000070  00000000 00000000 42423832 43303133  ........BB82C013
00000080  20202020 20202020 20202020 20202020
00000090  20202020 20202020 4548004C 01004F43          EH.L..OC
000000A0  39303038 2D32324C 31333642 45444100  9008-22L136BEDA.
000000B0  00000000 00000000 00000000 00000000  ................
000000C0  00000000 00000000 00000000 00000000  ................
000000D0  00000000 00000000 20180227 14170300  ........ ..'....
000000E0  00000000 4D54001C 01004F43 39303038  ....MT....OC9008
000000F0  2D32324C 31333642 45444100 00000000  -22L136BEDA.....
00000100  5544004D 01004F43 44455343 00450000  UD.M..OCDESC.E..
00000110  4F43433A 20496E69 7469616C 697A6174  OCC: Initializat
00000120  696F6E20 6F6E2061 6C6C2063 68697073  ion on all chips
00000130  20646964 206E6F74 20636F6D 706C6574   did not complet
00000140  65287469 6D656420 6F757429 0A        e(timed out).

Regards,
-Satheesh

cdeadmin · 2018-02-27T15:16:19Z

------- Comment From satheera@in.ibm.com 2018-02-27 10:08:25 EDT-------
I have just now seen the issue again after a few test runs on the host, even with powersave disabled.
The host was idle when the issue occurred, though, not running any tests or load.

FSP event/error log history:
Log ID    Time                 Failing subsystem           Severity                                   SRC       Reported
5302D602  2018-02-27 15:17:40  CEC Hardware Subsystem      Unrecoverable Error                        BB82C013  ?  <- powersave=off
5002D4A0  2018-02-27 15:11:42  Service Processor Firmware  Unrecoverable Error                        B1813450  ?
5002D295  2018-02-27 14:21:32  OCC                         Unrecoverable Error, Degraded Performance  B18B2616  ?
5302D258  2018-02-27 14:17:04  CEC Hardware Subsystem      Unrecoverable Error                        BB82C013  ?
5302D218  2018-02-27 14:12:48  CEC Hardware Subsystem      Unrecoverable Error                        BB82C013  ?
5002D0A4  2018-02-27 14:07:06  Service Processor Firmware  Unrecoverable Error                        B1813450  ?
5002CE92  2018-02-27 13:16:56  OCC                         Unrecoverable Error, Degraded Performance  B18B2616  ?

Regards,
-Satheesh

cdeadmin · 2018-02-28T00:35:57Z

------- Comment From npiggin@au1.ibm.com 2018-02-27 19:27:03 EDT-------
Are you getting the same kernel logs with powersave=off? Or is this a new bug?

If the kernel messages include "smp_call_function_many", then we need to get traces from all CPUs. echo 1 > /proc/sys/kernel/hardlockup_panic should crash the system when you get a lockup and give traces.
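
The hardlockup_panic setting suggested above can also be applied from a test harness before the guest tests start. This is a minimal sketch, assuming the sysctl path given in this comment and root privileges on the host; `set_hardlockup_panic` and `get_hardlockup_panic` are hypothetical helper names, and the `path` parameter exists only so the logic can be exercised against an ordinary file.

```python
# Sysctl file named in the comment above; writing to it requires root.
HARDLOCKUP_PANIC = "/proc/sys/kernel/hardlockup_panic"


def set_hardlockup_panic(enabled=True, path=HARDLOCKUP_PANIC):
    """Write 1 (panic on hard lockup) or 0 (do not panic) to the sysctl file."""
    with open(path, "w") as f:
        f.write("1\n" if enabled else "0\n")


def get_hardlockup_panic(path=HARDLOCKUP_PANIC):
    """Return True if the sysctl file currently reads 1."""
    with open(path) as f:
        return f.read().strip() == "1"
```

Calling `set_hardlockup_panic()` as root before starting the run is equivalent to the `echo 1 >` command quoted above; reading the value back with `get_hardlockup_panic()` confirms it took effect.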

cdeadmin · 2018-02-28T11:16:27Z

------- Comment From satheera@in.ibm.com 2018-02-28 06:11:34 EDT-------
(In reply to comment #16)
> Are you getting the same kernel logs with powersave=off? Or is this a new
> bug?
>
> If the kernel messages include "smp_call_function_many", then we need to get
> traces from all CPUs. echo 1 > /proc/sys/kernel/hardlockup_panic should
> crash the system when you get a lockup and give traces.

This looks like a different issue, though I am not completely sure. The host does not show any traces at all; it dies suddenly, and the FSP error log reports the entry below, which looks like a checkstop(?). I just wanted to reuse the FW bug that was already mirrored.

The FSP event/error log then has this entry:
5302D258 2018-02-27 14:17:04 CEC Hardware Subsystem Unrecoverable Error BB82C013 ?

Platform Event Log - 5302D258
Created at : 02/27/2018 14:17:03
Subsystem : CEC Hardware Subsystem
Event Severity : Unrecoverable Error
Action Flags : Report Externally
Action Status : Reported to Opr Sys

Regards,
-Satheesh.

cdeadmin · 2018-03-06T07:36:03Z

------- Comment From bssrikanth@in.ibm.com 2018-03-06 02:27:12 EDT-------
This bug is gating us from using the ZZ small core DD 2.2 for KVM testing. Please help resolve asap.

cdeadmin · 2018-03-08T08:57:03Z

------- Comment From satheera@in.ibm.com 2018-03-08 03:54:01 EDT-------
I am able to reproduce the same.
(In reply to comment #16)
> Are you getting the same kernel logs with powersave=off? Or is this a new
> bug?
>
> If the kernel messages include "smp_call_function_many", then we need to get
> traces from all CPUs. echo 1 > /proc/sys/kernel/hardlockup_panic should
> crash the system when you get a lockup and give traces.

Yes, this looks like a new bug. I will close this issue and open a new one for it; it reproduces even with the latest FW.

Regards,
-Satheesh

------- Comment From satheera@in.ibm.com 2018-03-08 03:55:46 EDT-------
The initial issue reported in this bz is no longer seen.


cdeadmin closed this as completed Dec 21, 2017

cdeadmin reopened this Jan 18, 2018

cdeadmin closed this as completed Feb 19, 2018

cdeadmin reopened this Feb 27, 2018

cdeadmin closed this as completed Mar 8, 2018
