Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Hardlockup during a VM boot on zz with D2.0 with latest devel branch #25

Closed
sathnaga opened this issue Dec 8, 2017 · 11 comments
Closed

Comments

@sathnaga
Copy link
Member

sathnaga commented Dec 8, 2017

cde:info Mirrored with LTC bug https://bugzilla.linux.ibm.com/show_bug.cgi?id=162352 </cde:info>

Hitting hardlockup with latest hostos devel branch while it was booting VM(hostos guest and ran yum update test inside the guest, I guess test irrespective of test, guest boot caused it.)

kernel: 4.14.0-3.dev.git68b4afb.el7.centos.ppc64le

Machine used:
zz - DD2.0

# lscpu
Architecture:          ppc64le
Byte Order:            Little Endian
CPU(s):                128
On-line CPU(s) list:   0-127
Thread(s) per core:    4
Core(s) per socket:    16
Socket(s):             2
NUMA node(s):          2
Model:                 2.0 (pvr 004e 1200)
Model name:            POWER9 (raw), altivec supported
CPU max MHz:           3800.0000
CPU min MHz:           2283.0000
Hypervisor vendor:     (null)
Virtualization type:   full
L1d cache:             32K
L1i cache:             32K
L2 cache:              512K
L3 cache:              10240K
NUMA node0 CPU(s):     0-63
NUMA node8 CPU(s):     64-127

FW: fips910/b1117d_1748.910

JOB LOG : /home/workspace/runAvocadoFVTTest/avocado-fvt-wrapper/results/job-2017-12-08T08.18-fe6f991/job.log
(1/5) guest_import.qemu.qcow2.virtio_scsi.smp2.virtio_net.HostOS.ppc64le.powerkvm-qemu.unattended_install.import.import.default_install.aio_native: PASS (47.01 s)
(2/5) guest_update.yum.qemu.qcow2.virtio_scsi.smp2.virtio_net.HostOS.ppc64le.powerkvm-qemu.guest_test.isa_serial_operations: PASS (58.74 s)
...

[ 7475.715366] virbr0: topology change detected, propagating
[ 7553.518889] virbr0: port 2(vnet0) entered disabled state
[ 7553.525649] device vnet0 left promiscuous mode
[ 7553.525718] virbr0: port 2(vnet0) entered disabled state
[ 7569.408657] Watchdog CPU:32 detected Hard LOCKUP other CPUS:117

Message from syslogd@ltczzj2 at Dec  8 08:21:19 ...
 kernel:Watchdog CPU:32 detected Hard LOCKUP other CPUS:117
[ 7584.519034] Watchdog CPU:94 detected Hard LOCKUP other CPUS:44
[ 7584.519142] Watchdog CPU:44 Hard LOCKUP
[ 7584.519206] Modules linked in: vhost_net vhost tap binfmt_misc xt_CHECKSUM ipt_MASQUERADE nf_nat_masquerade_ipv4 ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 xt_conntrack ip_set nfnetlink rpcrdma ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod ib_srp scsi_transport_srp ebtable_nat ebtable_broute ib_ipoib rdma_ucm ib_ucm ib_uverbs bridge ib_umad stp llc rdma_cm ib_cm iw_cm ip6table_mangle ip6table_security ip6table_raw iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat mlx5_ib ib_core nf_conntrack iptable_mangle iptable_security iptable_raw ebtable_filter ebtables ip6table_filter ip6_tables ses enclosure scsi_transport_sas ipmi_powernv ipmi_devintf powernv_op_panel ipmi_msghandler opal_prd nfsd auth_rpcgss oid_registry
[ 7584.519872]  nfs_acl lockd grace sunrpc kvm_hv kvm lpfc mlx5_core bnx2x scsi_transport_fc mdio libcrc32c ptp pps_core
[ 7584.519983] CPU: 44 PID: 35342 Comm: CPU 30/KVM Not tainted 4.14.0-3.dev.git68b4afb.el7.centos.ppc64le #1
[ 7584.520072] task: c000003f7da1ce00 task.stack: c000003f7dad4000
[ 7584.520135] NIP:  c0000000001b9464 LR: c0000000001b9408 CTR: c0000000000420e0
[ 7584.520209] REGS: c000003f7dad74c0 TRAP: 0e81   Not tainted  (4.14.0-3.dev.git68b4afb.el7.centos.ppc64le)
[ 7584.520296] MSR:  9000000000009033 <SF,HV,EE,ME,IR,DR,RI,LE>  CR: 44224484  XER: 20040000
[ 7584.520378] CFAR: c0000000001b9470 SOFTE: 1
[ 7584.520378] GPR00: c0000000001b9408 c000003f7dad7740 c000000001403900 0000000000000075
[ 7584.520378] GPR04: 0000000000000400 0000000000000040 0000000000000000 c000000001439d78
[ 7584.520378] GPR08: 0000000000000001 0000000000000003 c000203994b48da0 0000000000000010
[ 7584.520378] GPR12: c0000000000420e0 c00000000fd7ce00
[ 7584.520689] NIP [c0000000001b9464] smp_call_function_many+0x384/0x420
[ 7584.520753] LR [c0000000001b9408] smp_call_function_many+0x328/0x420
[ 7584.520816] Call Trace:
[ 7584.520844] [c000003f7dad7740] [c0000000001b9408] smp_call_function_many+0x328/0x420 (unreliable)
[ 7584.520934] [c000003f7dad77b0] [c0000000000711c8] serialize_against_pte_lookup+0x38/0x50
[ 7584.521011] [c000003f7dad77d0] [c000000000073360] radix__pmdp_huge_get_and_clear+0x60/0x80
[ 7584.521088] [c000003f7dad7800] [c00000000033ae2c] zap_huge_pmd+0x8c/0x510
[ 7584.521153] [c000003f7dad78a0] [c0000000002d47f0] unmap_page_range+0xe80/0x10a0
[ 7584.521229] [c000003f7dad79e0] [c0000000002d4f04] unmap_vmas+0x84/0xf0
[ 7584.521293] [c000003f7dad7a30] [c0000000002e3508] exit_mmap+0xe8/0x1f0
[ 7584.521376] [c000003f7dad7af0] [c0000000000fa9f8] mmput+0xb8/0x1f0
[ 7584.521494] [c000003f7dad7b20] [c000000000104cf8] do_exit+0x358/0xcd0
[ 7584.521609] [c000003f7dad7be0] [c000000000105744] do_group_exit+0x64/0x100
[ 7584.521725] [c000003f7dad7c20] [c0000000001155d0] get_signal+0x210/0x700
[ 7584.521841] [c000003f7dad7d10] [c00000000001c65c] do_signal+0x6c/0x2d0
[ 7584.521957] [c000003f7dad7e00] [c00000000001ca74] do_notify_resume+0xd4/0x100
[ 7584.522098] [c000003f7dad7e30] [c00000000000bec4] ret_from_except_lite+0x70/0x74
[ 7584.522238] Instruction dump:
[ 7584.522310] 7d495214 812a0018 792807e1 41820034 4800001c 60000000 60000000 60000000
[ 7584.522451] 60000000 60000000 60420000 7c210b78 <7c421378> 812a0018 792807e1 4082fff0

Message from syslogd@ltczzj2 at Dec  8 08:21:33 ...
 kernel:Watchdog CPU:94 detected Hard LOCKUP other CPUS:44

Message from syslogd@ltczzj2 at Dec  8 08:21:33 ...
 kernel:Watchdog CPU:44 Hard LOCKUP

@cdeadmin
Copy link

------- Comment From chavez@us.ibm.com 2017-12-08 12:48:35 EDT-------
Which firmware version is installed?

@cdeadmin cdeadmin reopened this Jan 18, 2018
malcolmcrossley pushed a commit to malcolmcrossley/linux that referenced this issue Jan 24, 2018
commit af3ff80 upstream.

Because the HMAC template didn't check that its underlying hash
algorithm is unkeyed, trying to use "hmac(hmac(sha3-512-generic))"
through AF_ALG or through KEYCTL_DH_COMPUTE resulted in the inner HMAC
being used without having been keyed, resulting in sha3_update() being
called without sha3_init(), causing a stack buffer overflow.

This is a very old bug, but it seems to have only started causing real
problems when SHA-3 support was added (requires CONFIG_CRYPTO_SHA3)
because the innermost hash's state is ->import()ed from a zeroed buffer,
and it just so happens that other hash algorithms are fine with that,
but SHA-3 is not.  However, there could be arch or hardware-dependent
hash algorithms also affected; I couldn't test everything.

Fix the bug by introducing a function crypto_shash_alg_has_setkey()
which tests whether a shash algorithm is keyed.  Then update the HMAC
template to require that its underlying hash algorithm is unkeyed.

Here is a reproducer:

    #include <linux/if_alg.h>
    #include <sys/socket.h>

    int main()
    {
        int algfd;
        struct sockaddr_alg addr = {
            .salg_type = "hash",
            .salg_name = "hmac(hmac(sha3-512-generic))",
        };
        char key[4096] = { 0 };

        algfd = socket(AF_ALG, SOCK_SEQPACKET, 0);
        bind(algfd, (const struct sockaddr *)&addr, sizeof(addr));
        setsockopt(algfd, SOL_ALG, ALG_SET_KEY, key, sizeof(key));
    }

Here was the KASAN report from syzbot:

    BUG: KASAN: stack-out-of-bounds in memcpy include/linux/string.h:341  [inline]
    BUG: KASAN: stack-out-of-bounds in sha3_update+0xdf/0x2e0  crypto/sha3_generic.c:161
    Write of size 4096 at addr ffff8801cca07c40 by task syzkaller076574/3044

    CPU: 1 PID: 3044 Comm: syzkaller076574 Not tainted 4.14.0-mm1+ open-power-host-os#25
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS  Google 01/01/2011
    Call Trace:
      __dump_stack lib/dump_stack.c:17 [inline]
      dump_stack+0x194/0x257 lib/dump_stack.c:53
      print_address_description+0x73/0x250 mm/kasan/report.c:252
      kasan_report_error mm/kasan/report.c:351 [inline]
      kasan_report+0x25b/0x340 mm/kasan/report.c:409
      check_memory_region_inline mm/kasan/kasan.c:260 [inline]
      check_memory_region+0x137/0x190 mm/kasan/kasan.c:267
      memcpy+0x37/0x50 mm/kasan/kasan.c:303
      memcpy include/linux/string.h:341 [inline]
      sha3_update+0xdf/0x2e0 crypto/sha3_generic.c:161
      crypto_shash_update+0xcb/0x220 crypto/shash.c:109
      shash_finup_unaligned+0x2a/0x60 crypto/shash.c:151
      crypto_shash_finup+0xc4/0x120 crypto/shash.c:165
      hmac_finup+0x182/0x330 crypto/hmac.c:152
      crypto_shash_finup+0xc4/0x120 crypto/shash.c:165
      shash_digest_unaligned+0x9e/0xd0 crypto/shash.c:172
      crypto_shash_digest+0xc4/0x120 crypto/shash.c:186
      hmac_setkey+0x36a/0x690 crypto/hmac.c:66
      crypto_shash_setkey+0xad/0x190 crypto/shash.c:64
      shash_async_setkey+0x47/0x60 crypto/shash.c:207
      crypto_ahash_setkey+0xaf/0x180 crypto/ahash.c:200
      hash_setkey+0x40/0x90 crypto/algif_hash.c:446
      alg_setkey crypto/af_alg.c:221 [inline]
      alg_setsockopt+0x2a1/0x350 crypto/af_alg.c:254
      SYSC_setsockopt net/socket.c:1851 [inline]
      SyS_setsockopt+0x189/0x360 net/socket.c:1830
      entry_SYSCALL_64_fastpath+0x1f/0x96

Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
@cdeadmin
Copy link

------- Comment From seg@us.ibm.com 2018-02-15 00:39:50 EDT-------
OK the firmware is there. Please test.

@sathnaga
Copy link
Member Author

sathnaga commented Feb 19, 2018

Used latest FW and able to reproduce the issue

ENV:

FW: fips910/b0215b_1808.910

Host Kernel: 4.15.0-5.dev.git33f711f.el7.centos.ppc64le
lscpu
Architecture:          ppc64le
Byte Order:            Little Endian
CPU(s):                128
On-line CPU(s) list:   0-127
Thread(s) per core:    4
Core(s) per socket:    16
Socket(s):             2
NUMA node(s):          2
Model:                 2.0 (pvr 004e 1200)
Model name:            POWER9 (raw), altivec supported
CPU max MHz:           3800.0000
CPU min MHz:           2300.0000
Hypervisor vendor:     (null)
Virtualization type:   full
L1d cache:             32K
L1i cache:             32K
L2 cache:              512K
L3 cache:              10240K
NUMA node0 CPU(s):     0-63
NUMA node8 CPU(s):     64-127

# cat /proc/cmdline 
root=/dev/mapper/host_os_ltczzj2-root ro console=tty0 rd.lvm.lv=host_os_ltczzj2/root rd.lvm.lv=host_os_ltczzj2/swap LANG=en_US.UTF-8

Steps to reproduce: Just run any avocado guest tests.

python avocado-setup.py  --run-suite guest_cpu --guest-os HostOS.ppc64le --no-download
00:22:56 INFO    : Check for environment
00:22:56 INFO    : Creating temporary mux dir
00:23:06 INFO    : 
00:23:06 INFO    : Running Guest Tests Suite cpu
00:23:06 INFO    : Running: /usr/bin/avocado run --vt-type libvirt --vt-config /home/sath/avocado-fvt-wrapper/data/avocado-vt/backends/libvirt/cfg/cpu.cfg                 --force-job-id d40eec16791f17cf5a0b3f0de03d2ab04da7a2d3  --vt-only-filter                                             "HostOS.ppc64le" --job-results-dir /home/sath/avocado-fvt-wrapper/results
JOB ID     : d40eec16791f17cf5a0b3f0de03d2ab04da7a2d3
JOB LOG    : /home/sath/avocado-fvt-wrapper/results/job-2018-02-19T00.23-d40eec1/job.log
 (001/853) guest_import.qemu.qcow2.virtio_scsi.smp2.virtio_net.Guest.HostOS.ppc64le.powerkvm-qemu.unattended_install.import.import.default_install.aio_native: -
Message from syslogd@ltczzj2 at Feb 19 00:24:14 ...
 kernel:Watchdog CPU:97 detected Hard LOCKUP other CPUS:38

dmesg:

[  186.241993] virbr0: port 2(vnet0) entered disabled state
[  200.167913] Watchdog CPU:97 detected Hard LOCKUP other CPUS:38
[  246.439590] INFO: rcu_sched self-detected stall on CPU
[  246.439710] 	74-....: (6000 ticks this GP) idle=6d2/140000000000001/0 softirq=786/786 fqs=2889 
[  246.439795] 	 (t=6000 jiffies g=2031 c=2030 q=30958)
[  246.439852] NMI backtrace for cpu 74
[  246.439891] CPU: 74 PID: 6855 Comm: worker Not tainted 4.15.0-5.dev.git33f711f.el7.centos.ppc64le #1
[  246.439976] Call Trace:
[  246.440007] [c000203835602ef0] [c000000000b0c19c] dump_stack+0xb0/0xf4 (unreliable)
[  246.440082] [c000203835602f30] [c000000000b153c4] nmi_cpu_backtrace+0x1a4/0x210
[  246.440156] [c000203835602fc0] [c000000000b1560c] nmi_trigger_cpumask_backtrace+0x1dc/0x220
[  246.440232] [c000203835603060] [c00000000002f448] arch_trigger_cpumask_backtrace+0x28/0x40
[  246.440307] [c000203835603080] [c000000000197a6c] rcu_dump_cpu_stacks+0xfc/0x160
[  246.440380] [c0002038356030d0] [c000000000196c90] rcu_check_callbacks+0x8e0/0xaf0
[  246.440454] [c000203835603200] [c0000000001a3874] update_process_times+0x44/0x90
[  246.440529] [c000203835603230] [c0000000001bac8c] tick_sched_handle.isra.13+0x4c/0x80
[  246.440602] [c000203835603250] [c0000000001bad20] tick_sched_timer+0x60/0xc0
[  246.440675] [c000203835603290] [c0000000001a44a8] __hrtimer_run_queues+0xf8/0x330
[  246.440749] [c000203835603310] [c0000000001a524c] hrtimer_interrupt+0xec/0x290
[  246.440823] [c0002038356033d0] [c000000000024438] __timer_interrupt+0x98/0x280
[  246.440897] [c000203835603420] [c000000000024b58] timer_interrupt+0xa8/0xe0
[  246.440959] [c000203835603450] [c0000000000092a8] decrementer_common+0x158/0x160
[  246.441034] --- interrupt: 901 at smp_call_function_many+0x388/0x420
    LR = smp_call_function_many+0x328/0x420
[  246.441145] [c0002038356037b0] [c000000000072328] serialize_against_pte_lookup+0x38/0x50
[  246.441219] [c0002038356037d0] [c0000000000744d0] radix__pmdp_huge_get_and_clear+0x60/0x80
[  246.441294] [c000203835603800] [c00000000034598c] zap_huge_pmd+0x8c/0x540
[  246.441356] [c0002038356038a0] [c0000000002df120] unmap_page_range+0xe80/0x10a0
[  246.441430] [c0002038356039e0] [c0000000002df834] unmap_vmas+0x84/0x100
[  246.441491] [c000203835603a30] [c0000000002edf1c] exit_mmap+0x9c/0x1c0
[  246.441553] [c000203835603af0] [c0000000000ffd28] mmput+0xb8/0x1f0
[  246.441614] [c000203835603b20] [c00000000010a1c8] do_exit+0x358/0xcd0
[  246.441676] [c000203835603be0] [c00000000010ac14] do_group_exit+0x64/0x100
[  246.441738] [c000203835603c20] [c00000000011aba0] get_signal+0x210/0x700
[  246.441799] [c000203835603d10] [c00000000001cb00] do_signal+0x80/0x2e0
[  246.441861] [c000203835603e00] [c00000000001cf14] do_notify_resume+0xd4/0x100
[  246.441935] [c000203835603e30] [c00000000000bf44] ret_from_except_lite+0x70/0x74
[  426.487634] INFO: rcu_sched self-detected stall on CPU
[  426.487722] 	74-....: (24004 ticks this GP) idle=6d2/140000000000001/0 softirq=786/786 fqs=11606 
[  426.487733] 	 (t=24004 jiffies g=2031 c=2030 q=119140)
[  426.487791] NMI backtrace for cpu 74
[  426.487830] CPU: 74 PID: 6855 Comm: worker Not tainted 4.15.0-5.dev.git33f711f.el7.centos.ppc64le #1
[  426.487915] Call Trace:

P:S: with powersave=off kernel commandline no issues seen.

@cdeadmin
Copy link

------- Comment From seg@us.ibm.com 2018-02-19 08:52:54 EDT-------
OPAL team, please have a look.

@cdeadmin
Copy link

==== State: Rejected by: dcrowell on 19 February 2018 13:18:55 ====

I have tried to read the pointed-at bug and I can't make any sense of it so I have no idea what the fix is that is being referred to. It looks like there have been several iterations of different problems described. Please summarize what the expected fix is and reopen. Or better yet, just dupe this directly.

@cdeadmin cdeadmin reopened this Feb 27, 2018
@cdeadmin
Copy link

------- Comment From satheera@in.ibm.com 2018-02-27 09:57:06 EDT-------
I am able to reproduce the similar issue with DD2.2 aswell with the latest FW with powersave enabled and with powersave disabled(i.e powersave=off kernel cmdline) this issue is not seen.

Host Env
$cupdcmd -f
...

Current Side Driver:.....fips910/b0223b_1810.910
Non-Current Side Driver:.fips910/b0215b_1808.910

Model: 2.2 (pvr 004e 1202)
Model name: POWER9 (raw), altivec supported

Host kernel: 4.15.0-5.dev.git33f711f.el7.centos.ppc64le

kernel cmdline: root=/dev/mapper/host_os_9-root ro console=tty0 rd.lvm.lv=host_os_9/root rd.lvm.lv=host_os_9/swap LANG=en_US.UTF-8

Steps to reproduce.

  1. Boot a kvm Guest and do a simple test(copy a file from host to guest , do an yum update)
    =======> Host suddenly freezes and reboots

FSP Event/error logs has this entry then,
5302D258 2018-02-27 14:17:04 CEC Hardware Subsystem Unrecoverable Error BB82C013 ?

Platform Event Log - 5302D258
Created at : 02/27/2018 14:17:03
Subsystem : CEC Hardware Subsystem
Event Severity : Unrecoverable Error
Action Flags : Report Externally
Action Status : Reported to Opr Sys

Primary System Reference Code

Reference Code : BB82C013
Hex Words 2 - 5 : 00000080 00000000 00000000 00000000
Hex Words 6 - 9 : 00000000 00000000 00000000 00000000

Log Hex Dump
00000000 50480030 01004F43 20180227 14170300 PH.0..OC ..'....
00000010 20180227 14170430 4B000106 00000000 ..'...0K.......
00000020 00000000 00000000 B0000002 5302D258 ............S..X
00000030 55480018 01004F43 50004000 00000000 UH....OCP.@.....
00000040 00002000 01004500 50530050 01004F43 .. ...E.PS.P..OC
00000050 02000008 00000048 00000080 00000000 .......H........
00000060 00000000 00000000 00000000 00000000 ................
00000070 00000000 00000000 42423832 43303133 ........BB82C013
00000080 20202020 20202020 20202020 20202020
00000090 20202020 20202020 4548004C 01004F43 EH.L..OC
000000A0 39303038 2D32324C 31333642 45444100 9008-22L136BEDA.
000000B0 00000000 00000000 00000000 00000000 ................
000000C0 00000000 00000000 00000000 00000000 ................
000000D0 00000000 00000000 20180227 14170300 ........ ..'....
000000E0 00000000 4D54001C 01004F43 39303038 ....MT....OC9008
000000F0 2D32324C 31333642 45444100 00000000 -22L136BEDA.....
00000100 5544004D 01004F43 44455343 00450000 UD.M..OCDESC.E..
00000110 4F43433A 20496E69 7469616C 697A6174 OCC: Initializat
00000120 696F6E20 6F6E2061 6C6C2063 68697073 ion on all chips
00000130 20646964 206E6F74 20636F6D 706C6574 did not complet
00000140 65287469 6D656420 6F757429 0A e(timed out).

Regards,
-Satheesh

@cdeadmin
Copy link

------- Comment From satheera@in.ibm.com 2018-02-27 10:08:25 EDT-------
I just now noticed the issue after a few test runs inside host even with powersave disabled aswell.
During the issue occurred, host was idle though, not running any tests/load.

FSP log event/errors histroy:
Log ID Time Failing subsystem Severity SRC Reported
5302D602 2018-02-27 15:17:40 CEC Hardware Subsystem Unrecoverable Error BB82C013 ? ------------------------------------------------------------------------------------------------------------------->powersave=off
5002D4A0 2018-02-27 15:11:42 Service Processor Firmware Unrecoverable Error B1813450 ?
5002D295 2018-02-27 14:21:32 OCC Unrecoverable Error, Degraded Performance B18B2616 ?
5302D258 2018-02-27 14:17:04 CEC Hardware Subsystem Unrecoverable Error BB82C013 ?
5302D218 2018-02-27 14:12:48 CEC Hardware Subsystem Unrecoverable Error BB82C013 ?
5002D0A4 2018-02-27 14:07:06 Service Processor Firmware Unrecoverable Error B1813450 ?
5002CE92 2018-02-27 13:16:56 OCC Unrecoverable Error, Degraded Performance B18B2616 ?

Regards,
-Satheesh

@cdeadmin
Copy link

------- Comment From npiggin@au1.ibm.com 2018-02-27 19:27:03 EDT-------
Are you getting the same kernel logs with powersave=off? Or is this a new bug?

If the kernel messages include "smp_call_function_many", then we need to get traces from all CPUs. echo 1 &gt; /proc/sys/kernel/hardlockup_panic should crash the system when you get a lockup and give traces.

@cdeadmin
Copy link

------- Comment From satheera@in.ibm.com 2018-02-28 06:11:34 EDT-------
(In reply to comment #16)
> Are you getting the same kernel logs with powersave=off? Or is this a new
> bug?
>
> If the kernel messages include "smp_call_function_many", then we need to get
> traces from all CPUs. echo 1 &gt; /proc/sys/kernel/hardlockup_panic should
> crash the system when you get a lockup and give traces.

Looks like this is different issue, am not completely sure though, host does not have any traces at all, it dies suddenly and FSP error log has the below issue reported, looked to checkstop?? , I just wanted reuse the FW bug which got mirrored already,

FSP Event/error logs has this entry then,
5302D258 2018-02-27 14:17:04 CEC Hardware Subsystem Unrecoverable Error BB82C013 ?

Platform Event Log - 5302D258
Created at : 02/27/2018 14:17:03
Subsystem : CEC Hardware Subsystem
Event Severity : Unrecoverable Error
Action Flags : Report Externally
Action Status : Reported to Opr Sys

Regards,
-Satheesh.

@cdeadmin
Copy link

cdeadmin commented Mar 6, 2018

------- Comment From bssrikanth@in.ibm.com 2018-03-06 02:27:12 EDT-------
This bug is gating us from using the ZZ small core DD 2.2 for KVM testing. Please help resolve asap.

@cdeadmin
Copy link

cdeadmin commented Mar 8, 2018

------- Comment From satheera@in.ibm.com 2018-03-08 03:54:01 EDT-------
Am able to reproduce the same(In reply to comment #16)
> Are you getting the same kernel logs with powersave=off? Or is this a new
> bug?
>
> If the kernel messages include "smp_call_function_many", then we need to get
> traces from all CPUs. echo 1 &gt; /proc/sys/kernel/hardlockup_panic should
> crash the system when you get a lockup and give traces.

yes, looks like new bug, I will close this issue and open a new one for the issue, even the latest FW reproduces it.

Regards,
-Satheesh

------- Comment From satheera@in.ibm.com 2018-03-08 03:55:46 EDT-------
Initial issue reported in the bz is not seen.

@cdeadmin cdeadmin closed this as completed Mar 8, 2018
liyi-ibm pushed a commit to liyi-ibm/linux that referenced this issue Dec 6, 2018
Chen Yu reported a divide-by-zero error when accessing the 'size'
resctrl file when a MBA resource is enabled.

divide error: 0000 [#1] SMP PTI
CPU: 93 PID: 1929 Comm: cat Not tainted 4.19.0-rc2-debug-rdt+ open-power-host-os#25
RIP: 0010:rdtgroup_cbm_to_size+0x7e/0xa0
Call Trace:
rdtgroup_size_show+0x11a/0x1d0
seq_read+0xd8/0x3b0

Quoting Chen Yu's report: This is because for MB resource, the
r->cache.cbm_len is zero, thus calculating size in rdtgroup_cbm_to_size()
will trigger the exception.

Fix this issue in the 'size' file by getting correct memory bandwidth value
which is in MBps when MBA software controller is enabled or in percentage
when MBA software controller is disabled.

Fixes: d9b48c8 ("x86/intel_rdt: Display resource groups' allocations in bytes")
Reported-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Chen Yu <yu.c.chen@intel.com>
Cc: "H Peter Anvin" <hpa@zytor.com>
Cc: "Tony Luck" <tony.luck@intel.com>
Cc: "Xiaochen Shen" <xiaochen.shen@intel.com>
Link: https://lkml.kernel.org/r/20180904174614.26682-1-yu.c.chen@intel.com
Link: https://lkml.kernel.org/r/1537048707-76280-3-git-send-email-fenghua.yu@intel.com
liyi-ibm pushed a commit to liyi-ibm/linux that referenced this issue Dec 6, 2018
Add missing prepare/unprepare operations for fbi->clk,
this fixes following kernel warning:

  ------------[ cut here ]------------
  WARNING: CPU: 0 PID: 1 at drivers/clk/clk.c:874 clk_core_enable+0x2c/0x1b0
  Enabling unprepared disp0_clk
  Modules linked in:
  CPU: 0 PID: 1 Comm: swapper Not tainted 4.18.0-rc8-00032-g02b43ddd4f21-dirty open-power-host-os#25
  Hardware name: Marvell MMP2 (Device Tree Support)
  [<c010f7cc>] (unwind_backtrace) from [<c010cc6c>] (show_stack+0x10/0x14)
  [<c010cc6c>] (show_stack) from [<c011dab4>] (__warn+0xd8/0xf0)
  [<c011dab4>] (__warn) from [<c011db10>] (warn_slowpath_fmt+0x44/0x6c)
  [<c011db10>] (warn_slowpath_fmt) from [<c043898c>] (clk_core_enable+0x2c/0x1b0)
  [<c043898c>] (clk_core_enable) from [<c0439ec8>] (clk_core_enable_lock+0x18/0x2c)
  [<c0439ec8>] (clk_core_enable_lock) from [<c0436698>] (pxa168fb_probe+0x464/0x6ac)
  [<c0436698>] (pxa168fb_probe) from [<c04779a0>] (platform_drv_probe+0x48/0x94)
  [<c04779a0>] (platform_drv_probe) from [<c0475bec>] (driver_probe_device+0x328/0x470)
  [<c0475bec>] (driver_probe_device) from [<c0475de4>] (__driver_attach+0xb0/0x124)
  [<c0475de4>] (__driver_attach) from [<c0473c38>] (bus_for_each_dev+0x64/0xa0)
  [<c0473c38>] (bus_for_each_dev) from [<c0474ee0>] (bus_add_driver+0x1b8/0x230)
  [<c0474ee0>] (bus_add_driver) from [<c0476a20>] (driver_register+0xac/0xf0)
  [<c0476a20>] (driver_register) from [<c0102dd4>] (do_one_initcall+0xb8/0x1f0)
  [<c0102dd4>] (do_one_initcall) from [<c0b010a0>] (kernel_init_freeable+0x294/0x2e0)
  [<c0b010a0>] (kernel_init_freeable) from [<c07e9eb8>] (kernel_init+0x8/0x10c)
  [<c07e9eb8>] (kernel_init) from [<c01010e8>] (ret_from_fork+0x14/0x2c)
  Exception stack(0xd008bfb0 to 0xd008bff8)
  bfa0:                                     00000000 00000000 00000000 00000000
  bfc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
  bfe0: 00000000 00000000 00000000 00000000 00000013 00000000
  ---[ end trace c0af40f9e2ed7cb4 ]---

Signed-off-by: Lubomir Rintel <lkundrak@v3.sk>
[b.zolnierkie: enhance patch description a bit]
Signed-off-by: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
liyi-ibm pushed a commit to liyi-ibm/linux that referenced this issue Dec 28, 2018
commit d50d82f upstream.

In kernel 4.17 I removed some code from dm-bufio that did slab cache
merging (commit 21bb132: "dm bufio: remove code that merges slab
caches") - both slab and slub support merging caches with identical
attributes, so dm-bufio now just calls kmem_cache_create and relies on
implicit merging.

This uncovered a bug in the slub subsystem - if we delete a cache and
immediatelly create another cache with the same attributes, it fails
because of duplicate filename in /sys/kernel/slab/.  The slub subsystem
offloads freeing the cache to a workqueue - and if we create the new
cache before the workqueue runs, it complains because of duplicate
filename in sysfs.

This patch fixes the bug by moving the call of kobject_del from
sysfs_slab_remove_workfn to shutdown_cache.  kobject_del must be called
while we hold slab_mutex - so that the sysfs entry is deleted before a
cache with the same attributes could be created.

Running device-mapper-test-suite with:

  dmtest run --suite thin-provisioning -n /commit_failure_causes_fallback/

triggered:

  Buffer I/O error on dev dm-0, logical block 1572848, async page read
  device-mapper: thin: 253:1: metadata operation 'dm_pool_alloc_data_block' failed: error = -5
  device-mapper: thin: 253:1: aborting current metadata transaction
  sysfs: cannot create duplicate filename '/kernel/slab/:a-0000144'
  CPU: 2 PID: 1037 Comm: kworker/u48:1 Not tainted 4.17.0.snitm+ open-power-host-os#25
  Hardware name: Supermicro SYS-1029P-WTR/X11DDW-L, BIOS 2.0a 12/06/2017
  Workqueue: dm-thin do_worker [dm_thin_pool]
  Call Trace:
   dump_stack+0x5a/0x73
   sysfs_warn_dup+0x58/0x70
   sysfs_create_dir_ns+0x77/0x80
   kobject_add_internal+0xba/0x2e0
   kobject_init_and_add+0x70/0xb0
   sysfs_slab_add+0xb1/0x250
   __kmem_cache_create+0x116/0x150
   create_cache+0xd9/0x1f0
   kmem_cache_create_usercopy+0x1c1/0x250
   kmem_cache_create+0x18/0x20
   dm_bufio_client_create+0x1ae/0x410 [dm_bufio]
   dm_block_manager_create+0x5e/0x90 [dm_persistent_data]
   __create_persistent_data_objects+0x38/0x940 [dm_thin_pool]
   dm_pool_abort_metadata+0x64/0x90 [dm_thin_pool]
   metadata_operation_failed+0x59/0x100 [dm_thin_pool]
   alloc_data_block.isra.53+0x86/0x180 [dm_thin_pool]
   process_cell+0x2a3/0x550 [dm_thin_pool]
   do_worker+0x28d/0x8f0 [dm_thin_pool]
   process_one_work+0x171/0x370
   worker_thread+0x49/0x3f0
   kthread+0xf8/0x130
   ret_from_fork+0x35/0x40
  kobject_add_internal failed for :a-0000144 with -EEXIST, don't try to register things with the same name in the same directory.
  kmem_cache_create(dm_bufio_buffer-16) failed with error -17

Link: http://lkml.kernel.org/r/alpine.LRH.2.02.1806151817130.6333@file01.intranet.prod.int.rdu2.redhat.com
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Reported-by: Mike Snitzer <snitzer@redhat.com>
Tested-by: Mike Snitzer <snitzer@redhat.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants