Power8: Hit with "Oops: Kernel access of bad area, sig: 11" on latest nightly #14

Closed
sathnaga opened this issue Aug 31, 2017 · 2 comments

Comments

@sathnaga
Member

sathnaga commented Aug 31, 2017

Kernel Version: 4.13.0-3.rc3.dev.gitec0d270.el7.centos.ppc64le
Hit a few minutes after a fresh boot, just after starting a run of the avocado tests.
Most commands (sosreport, service restart, etc.) get stuck after the crash.

[  909.585268] list_del corruption. prev->next should be c000000f23120760, but was c000000f23121760
[  909.585448] ------------[ cut here ]------------
[  909.585547] WARNING: CPU: 64 PID: 14123 at lib/list_debug.c:53 __list_del_entry_valid+0xd0/0x100
[  909.585705] Modules linked in: vhost_net vhost tap act_police cls_u32 sch_ingress cls_fw sch_sfq sch_htb xt_CHECKSUM ipt_MASQUERADE nf_nat_masquerade_ipv4 ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 xt_conntrack ip_set nfnetlink ebtable_nat ebtable_broute bridge stp llc ip6table_mangle ip6table_security ip6table_raw iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_security iptable_raw ebtable_filter ebtables ip6table_filter ip6_tables ses enclosure scsi_transport_sas i2c_opal i2c_core powernv_op_panel ipmi_powernv ipmi_devintf ipmi_msghandler nfsd auth_rpcgss oid_registry nfs_acl lockd grace sunrpc kvm_hv kvm_pr kvm xfs libcrc32c tg3 ptp pps_core
[  909.586812] CPU: 64 PID: 14123 Comm: qemu-system-ppc Not tainted 4.13.0-3.rc3.dev.gitec0d270.el7.centos.ppc64le #1
[  909.586963] task: c000000f0c9cc600 task.stack: c000000f061a8000
[  909.587026] NIP: c0000000005a0770 LR: c0000000005a076c CTR: 00000000300304d0
[  909.587100] REGS: c000000f061ab6c0 TRAP: 0700   Not tainted  (4.13.0-3.rc3.dev.gitec0d270.el7.centos.ppc64le)
[  909.587197] MSR: 9000000000029033 <SF,HV,EE,ME,IR,DR,RI,LE>
[  909.587205]   CR: 42024422  XER: 20000000
[  909.587291] CFAR: c00000000016e9c8 SOFTE: 1 
[  909.587291] GPR00: c0000000005a076c c000000f061ab940 c000000001397a00 0000000000000054 
[  909.587291] GPR04: 0000000000000000 c000000000098244 9000000000009033 0000000000000000 
[  909.587291] GPR08: 0000000000000001 0000000000000007 0000000000000006 9000000000001003 
[  909.587291] GPR12: 0000000000004400 c00000000fda8000 0000000000000000 0000000000000000 
[  909.587291] GPR16: 0000000000000000 0000000124cb8058 0000000124cb8038 00000001250ed8b8 
[  909.587291] GPR20: 00000001250ed8b0 00000001250ed8d0 c00000000138d820 c000000000d9c238 
[  909.587291] GPR24: 0000000000000001 5deadbeef0000100 c000000f061abb80 c000000000f24840 
[  909.587291] GPR28: c0000000013cbe50 0000000000000001 c000000f231215e0 c000000f23120750 
[  909.587927] NIP [c0000000005a0770] __list_del_entry_valid+0xd0/0x100
[  909.587990] LR [c0000000005a076c] __list_del_entry_valid+0xcc/0x100
[  909.588052] Call Trace:
[  909.588079] [c000000f061ab940] [c0000000005a076c] __list_del_entry_valid+0xcc/0x100 (unreliable)
[  909.588167] [c000000f061ab9a0] [c000000000988bbc] tcf_chain_destroy+0x2c/0xa0
[  909.588243] [c000000f061ab9d0] [c000000000988c84] tcf_block_put+0x54/0x90
[  909.588308] [c000000f061aba00] [d000000014d3178c] htb_destroy_class.isra.11+0x5c/0x80 [sch_htb]
[  909.588401] [c000000f061aba30] [d000000014d318a8] htb_destroy+0xf8/0x1b0 [sch_htb]
[  909.588476] [c000000f061abab0] [c0000000009818a4] qdisc_destroy+0xe4/0x170
[  909.588539] [c000000f061abae0] [c00000000098332c] dev_shutdown+0xbc/0x100
[  909.588604] [c000000f061abb20] [c00000000093f248] rollback_registered_many+0x2f8/0x560
[  909.588679] [c000000f061abbf0] [c00000000093f520] rollback_registered+0x70/0xb0
[  909.588755] [c000000f061abc40] [c000000000941908] unregister_netdevice_queue+0x128/0x180
[  909.588832] [c000000f061abcc0] [c00000000077a6cc] __tun_detach+0x22c/0x460
[  909.588895] [c000000f061abd20] [c00000000077a938] tun_chr_close+0x38/0x60
[  909.588959] [c000000f061abd50] [c00000000035abf8] __fput+0xd8/0x280
[  909.589024] [c000000f061abdb0] [c000000000120f20] task_work_run+0x140/0x1a0
[  909.589089] [c000000f061abe00] [c00000000001d810] do_notify_resume+0xf0/0x100
[  909.589164] [c000000f061abe30] [c00000000000bf44] ret_from_except_lite+0x70/0x74
[  909.589238] Instruction dump:
[  909.589295] 4bffffd4 3c62ff9b 3863f6d0 4bbce235 60000000 0fe00000 38600000 4bffffb8 
[  909.589435] 3c62ff9b 3863f690 4bbce219 60000000 <0fe00000> 38600000 4bffff9c 3c62ff9b 
[  909.589577] ---[ end trace c2b424e83e247e4b ]---
[  909.589685] Unable to handle kernel paging request for data at address 0x00000000
[  909.589823] Faulting instruction address: 0xc000000000988b48
[  909.589939] Oops: Kernel access of bad area, sig: 11 [#1]
[  909.590030] SMP NR_CPUS=1024 
[  909.590030] NUMA 
[  909.590101] PowerNV
[  909.590197] Modules linked in: vhost_net vhost tap act_police cls_u32 sch_ingress cls_fw sch_sfq sch_htb xt_CHECKSUM ipt_MASQUERADE nf_nat_masquerade_ipv4 ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 xt_conntrack ip_set nfnetlink ebtable_nat ebtable_broute bridge stp llc ip6table_mangle ip6table_security ip6table_raw iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_security iptable_raw ebtable_filter ebtables ip6table_filter ip6_tables ses enclosure scsi_transport_sas i2c_opal i2c_core powernv_op_panel ipmi_powernv ipmi_devintf ipmi_msghandler nfsd auth_rpcgss oid_registry nfs_acl lockd grace sunrpc kvm_hv kvm_pr kvm xfs libcrc32c tg3 ptp pps_core
[  909.591279] CPU: 64 PID: 14123 Comm: qemu-system-ppc Tainted: G        W       4.13.0-3.rc3.dev.gitec0d270.el7.centos.ppc64le #1
[  909.591481] task: c000000f0c9cc600 task.stack: c000000f061a8000
[  909.591596] NIP: c000000000988b48 LR: c000000000988c04 CTR: 00000000300304d0
[  909.591733] REGS: c000000f061ab6f0 TRAP: 0300   Tainted: G        W        (4.13.0-3.rc3.dev.gitec0d270.el7.centos.ppc64le)
[  909.591913] MSR: 9000000000009033 <SF,HV,EE,ME,IR,DR,RI,LE>
[  909.591919]   CR: 42024422  XER: 20000000
[  909.592080] CFAR: c0000000000087d8 DAR: 0000000000000000 DSISR: 40000000 SOFTE: 1 
[  909.592080] GPR00: c000000000988c04 c000000f061ab970 c000000001397a00 c000000f23120750 
[  909.592080] GPR04: 0000000000000000 c000000000098244 9000000000009033 0000000000000000 
[  909.592080] GPR08: 0000000000000001 0000000000000000 5deadbeef0000100 9000000000001003 
[  909.592080] GPR12: 0000000000004400 c00000000fda8000 0000000000000000 0000000000000000 
[  909.592080] GPR16: 0000000000000000 0000000124cb8058 0000000124cb8038 00000001250ed8b8 
[  909.592080] GPR20: 00000001250ed8b0 00000001250ed8d0 c00000000138d820 c000000000d9c238 
[  909.592080] GPR24: 0000000000000001 5deadbeef0000100 c000000f061abb80 c000000000f24840 
[  909.592080] GPR28: c0000000013cbe50 0000000000000001 c000000f231215e0 c000000f23120750 
[  909.593263] NIP [c000000000988b48] tcf_chain_flush+0x28/0x70
[  909.593377] LR [c000000000988c04] tcf_chain_destroy+0x74/0xa0
[  909.593491] Call Trace:
[  909.593540] [c000000f061ab970] [0000000000000001] 0x1 (unreliable)
[  909.593654] [c000000f061ab9a0] [c000000000988c04] tcf_chain_destroy+0x74/0xa0
[  909.593783] [c000000f061ab9d0] [c000000000988c84] tcf_block_put+0x54/0x90
[  909.593847] [c000000f061aba00] [d000000014d3178c] htb_destroy_class.isra.11+0x5c/0x80 [sch_htb]
[  909.593935] [c000000f061aba30] [d000000014d318a8] htb_destroy+0xf8/0x1b0 [sch_htb]
[  909.594013] [c000000f061abab0] [c0000000009818a4] qdisc_destroy+0xe4/0x170
[  909.594076] [c000000f061abae0] [c00000000098332c] dev_shutdown+0xbc/0x100
[  909.594140] [c000000f061abb20] [c00000000093f248] rollback_registered_many+0x2f8/0x560
[  909.594217] [c000000f061abbf0] [c00000000093f520] rollback_registered+0x70/0xb0
[  909.594292] [c000000f061abc40] [c000000000941908] unregister_netdevice_queue+0x128/0x180
[  909.594369] [c000000f061abcc0] [c00000000077a6cc] __tun_detach+0x22c/0x460
[  909.594433] [c000000f061abd20] [c00000000077a938] tun_chr_close+0x38/0x60
[  909.594496] [c000000f061abd50] [c00000000035abf8] __fput+0xd8/0x280
[  909.594563] [c000000f061abdb0] [c000000000120f20] task_work_run+0x140/0x1a0
[  909.594628] [c000000f061abe00] [c00000000001d810] do_notify_resume+0xf0/0x100
[  909.594704] [c000000f061abe30] [c00000000000bf44] ret_from_except_lite+0x70/0x74
[  909.594778] Instruction dump:
[  909.594816] 7c0803a6 4e800020 3c4c00a1 3842eee0 7c0802a6 60000000 7c0802a6 fbe1fff8 
[  909.594895] f8010010 f821ffd1 7c7f1b78 e9230008 <e9490000> 2faa0000 419e001c 39400000 
[  909.594975] ---[ end trace c2b424e83e247e4c ]---
[  909.601138] 

<cde:info> Mirrored with LTC bug #158177 </cde:info>

@cdeadmin

------- Comment From viparash@in.ibm.com 2017-08-31 09:26:40 EDT-------
(In reply to comment #1)

I see two issues here

Issue 1

>
Subsequently it crashes again in tcf_chain_flush() with a segmentation fault.
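
For readers unfamiliar with the first warning in the log: the "list_del corruption" message is printed by __list_del_entry_valid() in lib/list_debug.c when an entry's neighbours do not point back at it. The block below is a minimal user-space sketch of that kind of check, not the kernel's code. The names `list_node`, `list_del_entry_valid` and `list_del_checked` are made up for illustration, and the kernel's additional poison-pointer checks are left out.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Minimal doubly linked list node, modelled loosely on struct list_head. */
struct list_node {
    struct list_node *next;
    struct list_node *prev;
};

/* Make head an empty circular list. */
static void list_init(struct list_node *head)
{
    head->next = head;
    head->prev = head;
}

/* Append entry just before head, i.e. at the tail of the list. */
static void list_add_tail(struct list_node *entry, struct list_node *head)
{
    entry->prev = head->prev;
    entry->next = head;
    head->prev->next = entry;
    head->prev = entry;
}

/* Hypothetical helper: true only if both neighbours point back at entry. */
static bool list_del_entry_valid(const struct list_node *entry)
{
    if (entry->prev->next != entry) {
        fprintf(stderr,
                "list_del corruption. prev->next should be %p, but was %p\n",
                (const void *)entry, (const void *)entry->prev->next);
        return false;
    }
    if (entry->next->prev != entry) {
        fprintf(stderr,
                "list_del corruption. next->prev should be %p, but was %p\n",
                (const void *)entry, (const void *)entry->next->prev);
        return false;
    }
    return true;
}

/* Unlink entry only when the list around it is consistent. */
static bool list_del_checked(struct list_node *entry)
{
    if (!list_del_entry_valid(entry))
        return false;
    entry->next->prev = entry->prev;
    entry->prev->next = entry->next;
    entry->next = NULL;
    entry->prev = NULL;
    return true;
}
```

In this sketch a failed check leaves the list untouched and reports the mismatch. The log shows the same kind of mismatch (prev->next expected c000000f23120760, found c000000f23121760) just before the oops.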

@cdeadmin

------- Comment From satheera@in.ibm.com 2017-09-26 06:44:04 EDT-------
I am not hitting the issue with the latest nightly devel kernel, 4.13.0-4.dev.git49564cb.el7.centos.ppc64le

------- Comment From satheera@in.ibm.com 2017-09-26 06:45:15 EDT-------
Closing as per previous comment

malcolmcrossley pushed a commit to malcolmcrossley/linux that referenced this issue Jan 24, 2018
... before the first use of kaiser_enabled as otherwise funky
things happen:

  about to get started...
  (XEN) d0v0 Unhandled page fault fault/trap [#14, ec=0000]
  (XEN) Pagetable walk from ffff88022a449090:
  (XEN)  L4[0x110] = 0000000229e0e067 0000000000001e0e
  (XEN)  L3[0x008] = 0000000000000000 ffffffffffffffff
  (XEN) domain_crash_sync called from entry.S: fault at ffff82d08033fd08
  entry.o#create_bounce_frame+0x135/0x14d
  (XEN) Domain 0 (vcpu#0) crashed on cpu#0:
  (XEN) ----[ Xen-4.9.1_02-3.21  x86_64  debug=n   Not tainted ]----
  (XEN) CPU:    0
  (XEN) RIP:    e033:[<ffffffff81007460>]
  (XEN) RFLAGS: 0000000000000286   EM: 1   CONTEXT: pv guest (d0v0)

Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
paulusmack pushed a commit that referenced this issue Feb 14, 2018
for_each_set_bit() only accepts a pointer to a variable of type unsigned long,
and we cannot cast a pointer to a smaller type to it.
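
A minimal user-space sketch of why the cast is unsafe and one way to avoid it, assuming the bitmap lives in a 16-bit field. A bit-search helper reads whole unsigned long words, so pointing it at a smaller object reads past that object, which matches the 8-byte read in the KASAN report. `next_set_bit` and `count_set_bits_u16` are hypothetical stand-ins, not the kernel helpers, and this is not necessarily how the commit itself fixes the problem.

```c
#include <limits.h>
#include <stddef.h>

#define BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

/*
 * Simplified stand-in for a find_next_bit()-style helper: it always
 * loads full unsigned long words from addr.
 */
static unsigned long next_set_bit(const unsigned long *addr,
                                  unsigned long size, unsigned long start)
{
    for (unsigned long i = start; i < size; i++) {
        if (addr[i / BITS_PER_LONG] & (1UL << (i % BITS_PER_LONG)))
            return i;
    }
    return size; /* no further bit set */
}

/*
 * Count the set bits of a 16-bit mask. The value is first widened into
 * an unsigned long, so the helper never reads beyond the object it is
 * given. Casting &mask to (unsigned long *) instead would read
 * sizeof(unsigned long) bytes from a 2-byte object.
 */
static unsigned int count_set_bits_u16(unsigned short mask)
{
    unsigned long tmp = mask; /* widen by value, not by pointer cast */
    unsigned int count = 0;

    for (unsigned long bit = next_set_bit(&tmp, 16, 0); bit < 16;
         bit = next_set_bit(&tmp, 16, bit + 1))
        count++;
    return count;
}
```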

[   16.499365] ==================================================================
[   16.506655] BUG: KASAN: stack-out-of-bounds in find_first_bit+0x1d/0x70
[   16.513313] Read of size 8 at addr ffff8803616cf510 by task systemd-udevd/180
[   16.521998] CPU: 0 PID: 180 Comm: systemd-udevd Tainted: G     U     O     4.15.0-rc3+ #14
[   16.530317] Hardware name: Dell Inc. OptiPlex 7040/0Y7WYT, BIOS 1.2.8 01/26/2016
[   16.537760] Call Trace:
[   16.540230]  dump_stack+0x7c/0xbb
[   16.543569]  print_address_description+0x6b/0x290
[   16.548306]  kasan_report+0x28a/0x370
[   16.551993]  ? find_first_bit+0x1d/0x70
[   16.555858]  find_first_bit+0x1d/0x70
[   16.559625]  intel_gvt_init_cmd_parser+0x127/0x3c0 [i915]
[   16.565060]  ? __lock_is_held+0x8f/0xf0
[   16.568990]  ? intel_gvt_clean_cmd_parser+0x10/0x10 [i915]
[   16.574514]  ? __hrtimer_init+0x5d/0xb0
[   16.578445]  intel_gvt_init_device+0x2c3/0x690 [i915]
[   16.583537]  ? unregister_module_notifier+0x20/0x20
[   16.588515]  intel_gvt_init+0x89/0x100 [i915]
[   16.592962]  i915_driver_load+0x1992/0x1c70 [i915]
[   16.597846]  ? __i915_printk+0x210/0x210 [i915]
[   16.602410]  ? wait_for_completion+0x280/0x280
[   16.606883]  ? lock_downgrade+0x2c0/0x2c0
[   16.610923]  ? __pm_runtime_resume+0x46/0x90
[   16.615238]  ? acpi_dev_found+0x76/0x80
[   16.619162]  ? i915_pci_remove+0x30/0x30 [i915]
[   16.623733]  local_pci_probe+0x74/0xe0
[   16.627518]  pci_device_probe+0x208/0x310
[   16.631561]  ? pci_device_remove+0x100/0x100
[   16.635871]  ? __list_add_valid+0x29/0xa0
[   16.639919]  driver_probe_device+0x40b/0x6b0
[   16.644223]  ? driver_probe_device+0x6b0/0x6b0
[   16.648696]  __driver_attach+0x11d/0x130
[   16.652649]  bus_for_each_dev+0xe7/0x160
[   16.656600]  ? subsys_dev_iter_exit+0x10/0x10
[   16.660987]  ? __list_add_valid+0x29/0xa0
[   16.665028]  bus_add_driver+0x31d/0x3a0
[   16.668893]  driver_register+0xc6/0x170
[   16.672758]  ? 0xffffffffc0ad8000
[   16.676108]  do_one_initcall+0x9c/0x206
[   16.679984]  ? initcall_blacklisted+0x150/0x150
[   16.684545]  ? do_init_module+0x35/0x33b
[   16.688494]  ? kasan_unpoison_shadow+0x31/0x40
[   16.692968]  ? kasan_kmalloc+0xa6/0xd0
[   16.696743]  ? do_init_module+0x35/0x33b
[   16.700694]  ? kasan_unpoison_shadow+0x31/0x40
[   16.705168]  ? __asan_register_globals+0x82/0xa0
[   16.709819]  do_init_module+0xe7/0x33b
[   16.713597]  load_module+0x4481/0x4ce0
[   16.717397]  ? module_frob_arch_sections+0x20/0x20
[   16.722228]  ? vfs_read+0x13b/0x190
[   16.725742]  ? kernel_read+0x74/0xa0
[   16.729351]  ? get_user_arg_ptr.isra.17+0x70/0x70
[   16.734099]  ? SYSC_finit_module+0x175/0x1b0
[   16.738399]  SYSC_finit_module+0x175/0x1b0
[   16.742524]  ? SYSC_init_module+0x1e0/0x1e0
[   16.746741]  ? __fget+0x157/0x240
[   16.750090]  ? trace_hardirqs_on_thunk+0x1a/0x1c
[   16.754747]  entry_SYSCALL_64_fastpath+0x23/0x9a
[   16.759397] RIP: 0033:0x7f8fbc837499
[   16.762996] RSP: 002b:00007ffead76c138 EFLAGS: 00000246 ORIG_RAX: 0000000000000139
[   16.770618] RAX: ffffffffffffffda RBX: 0000000000000012 RCX: 00007f8fbc837499
[   16.777800] RDX: 0000000000000000 RSI: 000056484e67b080 RDI: 0000000000000012
[   16.784979] RBP: 00007ffead76b140 R08: 0000000000000000 R09: 0000000000000021
[   16.792164] R10: 0000000000000012 R11: 0000000000000246 R12: 000056484e67b460
[   16.799345] R13: 00007ffead76b120 R14: 0000000000000005 R15: 0000000000000000
[   16.808052] The buggy address belongs to the page:
[   16.812876] page:00000000dc4b8c1e count:0 mapcount:0 mapping:          (null) index:0x0
[   16.820934] flags: 0x17ffffc0000000()
[   16.824621] raw: 0017ffffc0000000 0000000000000000 0000000000000000 00000000ffffffff
[   16.832416] raw: ffffea000d85b3e0 ffffea000d85b3e0 0000000000000000 0000000000000000
[   16.840208] page dumped because: kasan: bad access detected
[   16.847318] Memory state around the buggy address:
[   16.852143]  ffff8803616cf400: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[   16.859427]  ffff8803616cf480: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 f1 f1
[   16.866708] >ffff8803616cf500: f1 f1 04 f4 f4 f4 f3 f3 f3 f3 00 00 00 00 00 00
[   16.873988]                          ^
[   16.877770]  ffff8803616cf580: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[   16.885042]  ffff8803616cf600: 00 00 00 00 00 00 00 00 00 00 00 00 f1 f1 f1 f1
[   16.892312] ==================================================================

Signed-off-by: Changbin Du <changbin.du@intel.com>
Signed-off-by: Zhenyu Wang <zhenyuw@linux.intel.com>
liyi-ibm referenced this issue in liyi-ibm/linux Dec 6, 2018
When booting a kernel with the LOCKDEP option enabled, the warning below was found:

WARNING: possible recursive locking detected
4.19.0-rc7+ #14 Not tainted
--------------------------------------------
swapper/0/1 is trying to acquire lock:
00000000dcfc0fc8 (&(&list->lock)->rlock#4){+...}, at: spin_lock_bh
include/linux/spinlock.h:334 [inline]
00000000dcfc0fc8 (&(&list->lock)->rlock#4){+...}, at:
tipc_link_reset+0x125/0xdf0 net/tipc/link.c:850

but task is already holding lock:
00000000cbb9b036 (&(&list->lock)->rlock#4){+...}, at: spin_lock_bh
include/linux/spinlock.h:334 [inline]
00000000cbb9b036 (&(&list->lock)->rlock#4){+...}, at:
tipc_link_reset+0xfa/0xdf0 net/tipc/link.c:849

other info that might help us debug this:
 Possible unsafe locking scenario:

       CPU0
       ----
  lock(&(&list->lock)->rlock#4);
  lock(&(&list->lock)->rlock#4);

 *** DEADLOCK ***

 May be due to missing lock nesting notation

2 locks held by swapper/0/1:
 #0: 00000000f7539d34 (pernet_ops_rwsem){+.+.}, at:
register_pernet_subsys+0x19/0x40 net/core/net_namespace.c:1051
 #1: 00000000cbb9b036 (&(&list->lock)->rlock#4){+...}, at:
spin_lock_bh include/linux/spinlock.h:334 [inline]
 #1: 00000000cbb9b036 (&(&list->lock)->rlock#4){+...}, at:
tipc_link_reset+0xfa/0xdf0 net/tipc/link.c:849

stack backtrace:
CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.19.0-rc7+ #14
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
Call Trace:
 __dump_stack lib/dump_stack.c:77 [inline]
 dump_stack+0x1af/0x295 lib/dump_stack.c:113
 print_deadlock_bug kernel/locking/lockdep.c:1759 [inline]
 check_deadlock kernel/locking/lockdep.c:1803 [inline]
 validate_chain kernel/locking/lockdep.c:2399 [inline]
 __lock_acquire+0xf1e/0x3c60 kernel/locking/lockdep.c:3411
 lock_acquire+0x1db/0x520 kernel/locking/lockdep.c:3900
 __raw_spin_lock_bh include/linux/spinlock_api_smp.h:135 [inline]
 _raw_spin_lock_bh+0x31/0x40 kernel/locking/spinlock.c:168
 spin_lock_bh include/linux/spinlock.h:334 [inline]
 tipc_link_reset+0x125/0xdf0 net/tipc/link.c:850
 tipc_link_bc_create+0xb5/0x1f0 net/tipc/link.c:526
 tipc_bcast_init+0x59b/0xab0 net/tipc/bcast.c:521
 tipc_init_net+0x472/0x610 net/tipc/core.c:82
 ops_init+0xf7/0x520 net/core/net_namespace.c:129
 __register_pernet_operations net/core/net_namespace.c:940 [inline]
 register_pernet_operations+0x453/0xac0 net/core/net_namespace.c:1011
 register_pernet_subsys+0x28/0x40 net/core/net_namespace.c:1052
 tipc_init+0x83/0x104 net/tipc/core.c:140
 do_one_initcall+0x109/0x70a init/main.c:885
 do_initcall_level init/main.c:953 [inline]
 do_initcalls init/main.c:961 [inline]
 do_basic_setup init/main.c:979 [inline]
 kernel_init_freeable+0x4bd/0x57f init/main.c:1144
 kernel_init+0x13/0x180 init/main.c:1063
 ret_from_fork+0x3a/0x50 arch/x86/entry/entry_64.S:413

LOCKDEP reports the warning above because tipc_link_reset() takes
l->wakeupq.lock and l->inputq->lock nested, one inside the other. In
fact it is unnecessary to hold both locks at the same time while moving
skb buffers from the l->wakeupq queue to the l->inputq queue. Instead,
we can first move the skb buffers from l->wakeupq to a temporary list
and then move them from the temporary list to l->inputq, which is also
safe.
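
The two-step move described above can be sketched in user space as follows. This is an illustration under assumptions, not the TIPC patch: `struct buf`, `struct queue`, `queue_init`, `queue_push` and `queue_move_all` are hypothetical names, and pthread mutexes stand in for the kernel spinlocks.

```c
#include <pthread.h>
#include <stddef.h>

/* A queued buffer; stands in for an skb in this sketch. */
struct buf {
    struct buf *next;
    int id;
};

/* A singly linked FIFO protected by its own lock. */
struct queue {
    pthread_mutex_t lock;
    struct buf *head;
    struct buf *tail;
    size_t len;
};

static void queue_init(struct queue *q)
{
    pthread_mutex_init(&q->lock, NULL);
    q->head = NULL;
    q->tail = NULL;
    q->len = 0;
}

static void queue_push(struct queue *q, struct buf *b)
{
    pthread_mutex_lock(&q->lock);
    b->next = NULL;
    if (q->tail)
        q->tail->next = b;
    else
        q->head = b;
    q->tail = b;
    q->len++;
    pthread_mutex_unlock(&q->lock);
}

/* Move every buffer from src to dst without ever holding both locks. */
static void queue_move_all(struct queue *src, struct queue *dst)
{
    struct buf *tmp_head, *tmp_tail;
    size_t tmp_len;

    /* Step 1: detach src's contents onto a private temporary list. */
    pthread_mutex_lock(&src->lock);
    tmp_head = src->head;
    tmp_tail = src->tail;
    tmp_len = src->len;
    src->head = NULL;
    src->tail = NULL;
    src->len = 0;
    pthread_mutex_unlock(&src->lock);

    if (!tmp_head)
        return;

    /* Step 2: splice the temporary list onto the tail of dst. */
    pthread_mutex_lock(&dst->lock);
    if (dst->tail)
        dst->tail->next = tmp_head;
    else
        dst->head = tmp_head;
    dst->tail = tmp_tail;
    dst->len += tmp_len;
    pthread_mutex_unlock(&dst->lock);
}
```

Because only one lock is held at any point, no nesting order between the two locks has to be declared. The trade-off is that the move is no longer atomic with respect to both queues, which the commit message states is acceptable here.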

Fixes: 3f32d0b ("tipc: lock wakeup & inputq at tipc_link_reset()")
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>

Signed-off-by: David S. Miller <davem@davemloft.net>