Skip to content

[CI run] selftests/bpf: Don't assign outer source IP to host #5

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 14 commits into from

Conversation

pchaigno
Copy link
Member

No description provided.

Florian Westphal and others added 11 commits July 7, 2022 20:55
Kajetan Puchalski reports crash on ARM, with backtrace of:

__nf_ct_delete_from_lists
nf_ct_delete
early_drop
__nf_conntrack_alloc

Unlike atomic_inc_not_zero, refcount_inc_not_zero is not a full barrier.
conntrack uses SLAB_TYPESAFE_BY_RCU, i.e. it is possible that a 'newly'
allocated object is still in use on another CPU:

CPU1						CPU2
						encounter 'ct' during hlist walk
 delete_from_lists
 refcount drops to 0
 kmem_cache_free(ct);
 __nf_conntrack_alloc() // returns same object
						refcount_inc_not_zero(ct); /* might fail */

						/* If set, ct is public/in the hash table */
						test_bit(IPS_CONFIRMED_BIT, &ct->status);

In case CPU1 already set refcount back to 1, refcount_inc_not_zero()
will succeed.

The expected possibilities for a CPU that obtained the object 'ct'
(but no reference so far) are:

1. refcount_inc_not_zero() fails.  CPU2 ignores the object and moves to
   the next entry in the list.  This happens for objects that are about
   to be free'd, that have been free'd, or that have been reallocated
   by __nf_conntrack_alloc(), but where the refcount has not been
   increased back to 1 yet.

2. refcount_inc_not_zero() succeeds. CPU2 checks the CONFIRMED bit
   in ct->status.  If set, the object is public/in the table.

   If not, the object must be skipped; CPU2 calls nf_ct_put() to
   un-do the refcount increment and moves to the next object.

Parallel deletion from the hlists is prevented by a
'test_and_set_bit(IPS_DYING_BIT, &ct->status);' check, i.e. only one
cpu will do the unlink, the other one will only drop its reference count.

Because refcount_inc_not_zero is not a full barrier, CPU2 may try to
delete an object that is not on any list:

1. refcount_inc_not_zero() successful (refcount inited to 1 on other CPU)
2. CONFIRMED test also successful (load was reordered or zeroing
   of ct->status not yet visible)
3. delete_from_lists unlinks entry not on the hlist, because
   IPS_DYING_BIT is 0 (already cleared).

2) is already wrong: CPU2 will handle a partially initited object
that is supposed to be private to CPU1.

Add needed barriers when refcount_inc_not_zero() is successful.

It also inserts a smp_wmb() before the refcount is set to 1 during
allocation.

Because other CPU might still see the object, refcount_set(1)
"resurrects" it, so we need to make sure that other CPUs will also observe
the right content.  In particular, the CONFIRMED bit test must only pass
once the object is fully initialised and either in the hash or about to be
inserted (with locks held to delay possible unlink from early_drop or
gc worker).

I did not change flow_offload_alloc(), as far as I can see it should call
refcount_inc(), not refcount_inc_not_zero(): the ct object is attached to
the skb so its refcount should be >= 1 in all cases.

v2: prefer smp_acquire__after_ctrl_dep to smp_rmb (Will Deacon).
v3: keep smp_acquire__after_ctrl_dep close to refcount_inc_not_zero call
    add comment in nf_conntrack_netlink, no control dependency there
    due to locks.

Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/all/Yr7WTfd6AVTQkLjI@e126311.manchester.arm.com/
Reported-by: Kajetan Puchalski <kajetan.puchalski@arm.com>
Diagnosed-by: Will Deacon <will@kernel.org>
Fixes: 7197743 ("netfilter: conntrack: convert to refcount_t api")
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Will Deacon <will@kernel.org>
NFPROTO_ARP is expecting to find the ARP header at the network offset.

In the particular case of ARP, HTYPE= field shows the initial bytes of
the ethernet header destination MAC address.

 netdev out: IN= OUT=bridge0 MACSRC=c2:76:e5:71:e1:de MACDST=36:b0:4a:e2:72:ea MACPROTO=0806 ARP HTYPE=14000 PTYPE=0x4ae2 OPCODE=49782

NFPROTO_NETDEV egress hook is also expecting to find the IP headers at
the network offset.

Fixes: 35b9395 ("netfilter: add generic ARP packet logger")
Reported-by: Tom Yan <tom.ty89@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
TCP packets will be dropped if the segments number in the tx skb
exceeds limitation when sending iperf3 traffic with --zerocopy option.

we make the following changes:

Get nr_frags in nfp_nfdk_tx_maybe_close_block instead of passing from
outside because it will be changed after skb_linearize operation.

Fill maximum dma_len in first tx descriptor to make sure the whole
head is included in the first descriptor.

Fixes: c10d12e ("nfp: add support for NFDK data path")
Signed-off-by: Baowen Zheng <baowen.zheng@corigine.com>
Reviewed-by: Louis Peens <louis.peens@corigine.com>
Signed-off-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Blamed commit added back a bug I fixed in commit 9bbd917
("vlan: fix memory leak in vlan_dev_set_egress_priority")

If a memory allocation fails in vlan_changelink() after other allocations
succeeded, we need to call vlan_dev_free_egress_priority()
to free all allocated memory because after a failed ->newlink()
we do not call any methods like ndo_uninit() or dev->priv_destructor().

In following example, if the allocation for last element 2000:2001 fails,
we need to free eight prior allocations:

ip link add link dummy0 dummy0.100 type vlan id 100 \
	egress-qos-map 1:2 2:3 3:4 4:5 5:6 6:7 7:8 8:9 2000:2001

syzbot report was:

BUG: memory leak
unreferenced object 0xffff888117bd1060 (size 32):
comm "syz-executor408", pid 3759, jiffies 4294956555 (age 34.090s)
hex dump (first 32 bytes):
09 00 00 00 00 a0 00 00 00 00 00 00 00 00 00 00 ................
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
backtrace:
[<ffffffff83fc60ad>] kmalloc include/linux/slab.h:600 [inline]
[<ffffffff83fc60ad>] vlan_dev_set_egress_priority+0xed/0x170 net/8021q/vlan_dev.c:193
[<ffffffff83fc6628>] vlan_changelink+0x178/0x1d0 net/8021q/vlan_netlink.c:128
[<ffffffff83fc67c8>] vlan_newlink+0x148/0x260 net/8021q/vlan_netlink.c:185
[<ffffffff838b1278>] rtnl_newlink_create net/core/rtnetlink.c:3363 [inline]
[<ffffffff838b1278>] __rtnl_newlink+0xa58/0xdc0 net/core/rtnetlink.c:3580
[<ffffffff838b1629>] rtnl_newlink+0x49/0x70 net/core/rtnetlink.c:3593
[<ffffffff838ac66c>] rtnetlink_rcv_msg+0x21c/0x5c0 net/core/rtnetlink.c:6089
[<ffffffff839f9c37>] netlink_rcv_skb+0x87/0x1d0 net/netlink/af_netlink.c:2501
[<ffffffff839f8da7>] netlink_unicast_kernel net/netlink/af_netlink.c:1319 [inline]
[<ffffffff839f8da7>] netlink_unicast+0x397/0x4c0 net/netlink/af_netlink.c:1345
[<ffffffff839f9266>] netlink_sendmsg+0x396/0x710 net/netlink/af_netlink.c:1921
[<ffffffff8384dbf6>] sock_sendmsg_nosec net/socket.c:714 [inline]
[<ffffffff8384dbf6>] sock_sendmsg+0x56/0x80 net/socket.c:734
[<ffffffff8384e15c>] ____sys_sendmsg+0x36c/0x390 net/socket.c:2488
[<ffffffff838523cb>] ___sys_sendmsg+0x8b/0xd0 net/socket.c:2542
[<ffffffff838525b8>] __sys_sendmsg net/socket.c:2571 [inline]
[<ffffffff838525b8>] __do_sys_sendmsg net/socket.c:2580 [inline]
[<ffffffff838525b8>] __se_sys_sendmsg net/socket.c:2578 [inline]
[<ffffffff838525b8>] __x64_sys_sendmsg+0x78/0xf0 net/socket.c:2578
[<ffffffff845ad8d5>] do_syscall_x64 arch/x86/entry/common.c:50 [inline]
[<ffffffff845ad8d5>] do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
[<ffffffff8460006a>] entry_SYSCALL_64_after_hwframe+0x46/0xb0

Fixes: 37aa50c ("vlan: introduce vlan_dev_free_egress_priority")
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Xin Long <lucien.xin@gmail.com>
Reviewed-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
BUG_ON can be triggered from userspace with an element with a large
userdata area. Replace it by length check and return EINVAL instead.
Over time extensions have been growing in size.

Pick a sufficiently old Fixes: tag to propagate this fix.

Fixes: 7d74026 ("netfilter: nf_tables: variable sized set element keys / data")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
At disconnect time the MPTCP protocol traverse the subflows
list closing each of them. In some circumstances - MPJ subflow,
passive MPTCP socket, the latter operation can remove the
subflow from the list, invalidating the current iterator.

Address the issue using the safe list traversing helper
variant.

Reported-by: van fantasy <g1042620637@gmail.com>
Fixes: b29fcfb ("mptcp: full disconnect implementation")
Tested-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The new script was not listed in the programs to test.

By consequence, some CIs running MPTCP selftests were not validating
these new tests. Note that MPTCP CI was validating it as it executes all
.sh scripts from 'tools/testing/selftests/net/mptcp' directory.

Fixes: 259a834 ("selftests: mptcp: functional tests for the userspace PM type")
Reported-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Mat Martineau says:

====================
mptcp: Disconnect and selftest fixes

Patch 1 switches to a safe list iterator in the MPTCP disconnect code.

Patch 2 adds the userspace_pm.sh selftest script to the MPTCP selftest
Makefile, resolving the netdev/check_selftest CI failure.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:

====================
Netfilter fixes for net

The following patchset contains Netfilter fixes for net:

1) refcount_inc_not_zero() is not semantically equivalent to
   atomic_int_not_zero(), from Florian Westphal. My understanding was
   that refcount_*() API provides a wrapper to easier debugging of
   reference count leaks, however, there are semantic differences
   between these two APIs, where refcount_inc_not_zero() needs a barrier.
   Reason for this subtle difference to me is unknown.

2) packet logging is not correct for ARP and IP packets, from the
   ARP family and netdev/egress respectively. Use skb_network_offset()
   to reach the headers accordingly.

3) set element extension length have been growing over time, replace
   a BUG_ON by EINVAL which might be triggerable from userspace.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
Add missing .gitignore entry.

Fixes: 839b92f ("selftest: tun: add test for NAPI dismantle")
Link: https://lore.kernel.org/r/20220709024141.321683-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Add unregister_fib_notifier as rollback of register_fib_notifier.

Fixes: 4394fbc ("net: marvell: prestera: handle fib notifications")
Signed-off-by: Yevhen Orlov <yevhen.orlov@plvision.eu>
Link: https://lore.kernel.org/r/20220710122021.7642-1-yevhen.orlov@plvision.eu
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
@pchaigno pchaigno force-pushed the ci-test/test-test-extension branch from 2347d12 to cc3e820 Compare July 13, 2022 10:46
@pchaigno pchaigno closed this Jul 15, 2022
@pchaigno pchaigno deleted the ci-test/test-test-extension branch July 15, 2022 17:47
@pchaigno pchaigno restored the ci-test/test-test-extension branch July 16, 2022 08:46
@pchaigno pchaigno reopened this Jul 16, 2022
@pchaigno pchaigno force-pushed the ci-test/test-test-extension branch from cc3e820 to e6da5c3 Compare July 16, 2022 08:48
@pchaigno pchaigno closed this Jul 16, 2022
@pchaigno pchaigno deleted the ci-test/test-test-extension branch July 16, 2022 08:51
pchaigno pushed a commit that referenced this pull request Jul 25, 2022
…tion

Each cset (css_set) is pinned by its tasks. When we're moving tasks around
across csets for a migration, we need to hold the source and destination
csets to ensure that they don't go away while we're moving tasks about. This
is done by linking cset->mg_preload_node on either the
mgctx->preloaded_src_csets or mgctx->preloaded_dst_csets list. Using the
same cset->mg_preload_node for both the src and dst lists was deemed okay as
a cset can't be both the source and destination at the same time.

Unfortunately, this overloading becomes problematic when multiple tasks are
involved in a migration and some of them are identity noop migrations while
others are actually moving across cgroups. For example, this can happen with
the following sequence on cgroup1:

 #1> mkdir -p /sys/fs/cgroup/misc/a/b
 #2> echo $$ > /sys/fs/cgroup/misc/a/cgroup.procs
 #3> RUN_A_COMMAND_WHICH_CREATES_MULTIPLE_THREADS &
 #4> PID=$!
 #5> echo $PID > /sys/fs/cgroup/misc/a/b/tasks
 #6> echo $PID > /sys/fs/cgroup/misc/a/cgroup.procs

the process including the group leader back into a. In this final migration,
non-leader threads would be doing identity migration while the group leader
is doing an actual one.

After #3, let's say the whole process was in cset A, and that after #4, the
leader moves to cset B. Then, during #6, the following happens:

 1. cgroup_migrate_add_src() is called on B for the leader.

 2. cgroup_migrate_add_src() is called on A for the other threads.

 3. cgroup_migrate_prepare_dst() is called. It scans the src list.

 4. It notices that B wants to migrate to A, so it tries to A to the dst
    list but realizes that its ->mg_preload_node is already busy.

 5. and then it notices A wants to migrate to A as it's an identity
    migration, it culls it by list_del_init()'ing its ->mg_preload_node and
    putting references accordingly.

 6. The rest of migration takes place with B on the src list but nothing on
    the dst list.

This means that A isn't held while migration is in progress. If all tasks
leave A before the migration finishes and the incoming task pins it, the
cset will be destroyed leading to use-after-free.

This is caused by overloading cset->mg_preload_node for both src and dst
preload lists. We wanted to exclude the cset from the src list but ended up
inadvertently excluding it from the dst list too.

This patch fixes the issue by separating out cset->mg_preload_node into
->mg_src_preload_node and ->mg_dst_preload_node, so that the src and dst
preloadings don't interfere with each other.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Mukesh Ojha <quic_mojha@quicinc.com>
Reported-by: shisiyuan <shisiyuan19870131@gmail.com>
Link: http://lkml.kernel.org/r/1654187688-27411-1-git-send-email-shisiyuan@xiaomi.com
Link: https://www.spinics.net/lists/cgroups/msg33313.html
Fixes: f817de9 ("cgroup: prepare migration path for unified hierarchy")
Cc: stable@vger.kernel.org # v3.16+
pchaigno pushed a commit that referenced this pull request Jul 25, 2022
On powerpc, 'perf trace' is crashing with a SIGSEGV when trying to
process a perf.data file created with 'perf trace record -p':

  #0  0x00000001225b8988 in syscall_arg__scnprintf_augmented_string <snip> at builtin-trace.c:1492
  #1  syscall_arg__scnprintf_filename <snip> at builtin-trace.c:1492
  #2  syscall_arg__scnprintf_filename <snip> at builtin-trace.c:1486
  #3  0x00000001225bdd9c in syscall_arg_fmt__scnprintf_val <snip> at builtin-trace.c:1973
  #4  syscall__scnprintf_args <snip> at builtin-trace.c:2041
  #5  0x00000001225bff04 in trace__sys_enter <snip> at builtin-trace.c:2319

That points to the below code in tools/perf/builtin-trace.c:
	/*
	 * If this is raw_syscalls.sys_enter, then it always comes with the 6 possible
	 * arguments, even if the syscall being handled, say "openat", uses only 4 arguments
	 * this breaks syscall__augmented_args() check for augmented args, as we calculate
	 * syscall->args_size using each syscalls:sys_enter_NAME tracefs format file,
	 * so when handling, say the openat syscall, we end up getting 6 args for the
	 * raw_syscalls:sys_enter event, when we expected just 4, we end up mistakenly
	 * thinking that the extra 2 u64 args are the augmented filename, so just check
	 * here and avoid using augmented syscalls when the evsel is the raw_syscalls one.
	 */
	if (evsel != trace->syscalls.events.sys_enter)
		augmented_args = syscall__augmented_args(sc, sample, &augmented_args_size, trace->raw_augmented_syscalls_args_size);

As the comment points out, we should not be trying to augment the args
for raw_syscalls. However, when processing a perf.data file, we are not
initializing those properly. Fix the same.

Reported-by: Claudio Carvalho <cclaudio@linux.ibm.com>
Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Link: http://lore.kernel.org/lkml/20220707090900.572584-1-naveen.n.rao@linux.vnet.ibm.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
pchaigno pushed a commit that referenced this pull request Jul 25, 2022
Add a lock_class_key per devlink instance to avoid DEADLOCK warning by
lockdep, while locking more than one devlink instance in driver code,
for example in opening VFs flow.

Kernel log:
[  101.433802] ============================================
[  101.433803] WARNING: possible recursive locking detected
[  101.433810] 5.19.0-rc1+ torvalds#35 Not tainted
[  101.433812] --------------------------------------------
[  101.433813] bash/892 is trying to acquire lock:
[  101.433815] ffff888127bfc2f8 (&devlink->lock){+.+.}-{3:3}, at: probe_one+0x3c/0x690 [mlx5_core]
[  101.433909]
               but task is already holding lock:
[  101.433910] ffff888118f4c2f8 (&devlink->lock){+.+.}-{3:3}, at: mlx5_core_sriov_configure+0x62/0x280 [mlx5_core]
[  101.433989]
               other info that might help us debug this:
[  101.433990]  Possible unsafe locking scenario:

[  101.433991]        CPU0
[  101.433991]        ----
[  101.433992]   lock(&devlink->lock);
[  101.433993]   lock(&devlink->lock);
[  101.433995]
                *** DEADLOCK ***

[  101.433996]  May be due to missing lock nesting notation

[  101.433996] 6 locks held by bash/892:
[  101.433998]  #0: ffff88810eb50448 (sb_writers#3){.+.+}-{0:0}, at: ksys_write+0xf3/0x1d0
[  101.434009]  #1: ffff888114777c88 (&of->mutex){+.+.}-{3:3}, at: kernfs_fop_write_iter+0x20d/0x520
[  101.434017]  #2: ffff888102b58660 (kn->active#231){.+.+}-{0:0}, at: kernfs_fop_write_iter+0x230/0x520
[  101.434023]  #3: ffff888102d70198 (&dev->mutex){....}-{3:3}, at: sriov_numvfs_store+0x132/0x310
[  101.434031]  #4: ffff888118f4c2f8 (&devlink->lock){+.+.}-{3:3}, at: mlx5_core_sriov_configure+0x62/0x280 [mlx5_core]
[  101.434108]  #5: ffff88812adce198 (&dev->mutex){....}-{3:3}, at: __device_attach+0x76/0x430
[  101.434116]
               stack backtrace:
[  101.434118] CPU: 5 PID: 892 Comm: bash Not tainted 5.19.0-rc1+ torvalds#35
[  101.434120] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
[  101.434130] Call Trace:
[  101.434133]  <TASK>
[  101.434135]  dump_stack_lvl+0x57/0x7d
[  101.434145]  __lock_acquire.cold+0x1df/0x3e7
[  101.434151]  ? register_lock_class+0x1880/0x1880
[  101.434157]  lock_acquire+0x1c1/0x550
[  101.434160]  ? probe_one+0x3c/0x690 [mlx5_core]
[  101.434229]  ? lockdep_hardirqs_on_prepare+0x400/0x400
[  101.434232]  ? __xa_alloc+0x1ed/0x2d0
[  101.434236]  ? ksys_write+0xf3/0x1d0
[  101.434239]  __mutex_lock+0x12c/0x14b0
[  101.434243]  ? probe_one+0x3c/0x690 [mlx5_core]
[  101.434312]  ? probe_one+0x3c/0x690 [mlx5_core]
[  101.434380]  ? devlink_alloc_ns+0x11b/0x910
[  101.434385]  ? mutex_lock_io_nested+0x1320/0x1320
[  101.434388]  ? lockdep_init_map_type+0x21a/0x7d0
[  101.434391]  ? lockdep_init_map_type+0x21a/0x7d0
[  101.434393]  ? __init_swait_queue_head+0x70/0xd0
[  101.434397]  probe_one+0x3c/0x690 [mlx5_core]
[  101.434467]  pci_device_probe+0x1b4/0x480
[  101.434471]  really_probe+0x1e0/0xaa0
[  101.434474]  __driver_probe_device+0x219/0x480
[  101.434478]  driver_probe_device+0x49/0x130
[  101.434481]  __device_attach_driver+0x1b8/0x280
[  101.434484]  ? driver_allows_async_probing+0x140/0x140
[  101.434487]  bus_for_each_drv+0x123/0x1a0
[  101.434489]  ? bus_for_each_dev+0x1a0/0x1a0
[  101.434491]  ? lockdep_hardirqs_on_prepare+0x286/0x400
[  101.434494]  ? trace_hardirqs_on+0x2d/0x100
[  101.434498]  __device_attach+0x1a3/0x430
[  101.434501]  ? device_driver_attach+0x1e0/0x1e0
[  101.434503]  ? pci_bridge_d3_possible+0x1e0/0x1e0
[  101.434506]  ? pci_create_resource_files+0xeb/0x190
[  101.434511]  pci_bus_add_device+0x6c/0xa0
[  101.434514]  pci_iov_add_virtfn+0x9e4/0xe00
[  101.434517]  ? trace_hardirqs_on+0x2d/0x100
[  101.434521]  sriov_enable+0x64a/0xca0
[  101.434524]  ? pcibios_sriov_disable+0x10/0x10
[  101.434528]  mlx5_core_sriov_configure+0xab/0x280 [mlx5_core]
[  101.434602]  sriov_numvfs_store+0x20a/0x310
[  101.434605]  ? sriov_totalvfs_show+0xc0/0xc0
[  101.434608]  ? sysfs_file_ops+0x170/0x170
[  101.434611]  ? sysfs_file_ops+0x117/0x170
[  101.434614]  ? sysfs_file_ops+0x170/0x170
[  101.434616]  kernfs_fop_write_iter+0x348/0x520
[  101.434619]  new_sync_write+0x2e5/0x520
[  101.434621]  ? new_sync_read+0x520/0x520
[  101.434624]  ? lock_acquire+0x1c1/0x550
[  101.434626]  ? lockdep_hardirqs_on_prepare+0x400/0x400
[  101.434630]  vfs_write+0x5cb/0x8d0
[  101.434633]  ksys_write+0xf3/0x1d0
[  101.434635]  ? __x64_sys_read+0xb0/0xb0
[  101.434638]  ? lockdep_hardirqs_on_prepare+0x286/0x400
[  101.434640]  ? syscall_enter_from_user_mode+0x1d/0x50
[  101.434643]  do_syscall_64+0x3d/0x90
[  101.434647]  entry_SYSCALL_64_after_hwframe+0x46/0xb0
[  101.434650] RIP: 0033:0x7f5ff536b2f7
[  101.434658] Code: 0d 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f
1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f
05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24
[  101.434661] RSP: 002b:00007ffd9ea85d58 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[  101.434664] RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007f5ff536b2f7
[  101.434666] RDX: 0000000000000002 RSI: 000055c4c279e230 RDI: 0000000000000001
[  101.434668] RBP: 000055c4c279e230 R08: 000000000000000a R09: 0000000000000001
[  101.434669] R10: 000055c4c283cbf0 R11: 0000000000000246 R12: 0000000000000002
[  101.434670] R13: 00007f5ff543d500 R14: 0000000000000002 R15: 00007f5ff543d700
[  101.434673]  </TASK>

Signed-off-by: Moshe Shemesh <moshe@nvidia.com>
Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
borkmann pushed a commit that referenced this pull request Apr 30, 2023
Andrii Nakryiko says:

====================

This patch set moves bpf_for(), bpf_for_each(), and bpf_repeat() macros from
selftests-internal bpf_misc.h header to libbpf-provided bpf_helpers.h header.
To do this in a way to allow users to feature-detect and guard such
bpf_for()/bpf_for_each() uses on old kernels we also extend libbpf to improve
unresolved kfunc calls handling and reporting. This lets us mark
bpf_iter_num_{new,next,destroy}() declarations as __weak, and thus not fail
program loading outright if such kfuncs are missing on the host kernel.

Patches #1 and #2 do some simple clean ups and logging improvements. Patch #3
adds kfunc call poisoning and log fixup logic and is the hear of this patch
set, effectively. Patch #4 adds selftest for this logic. Patches #4 and #5
move bpf_for()/bpf_for_each()/bpf_repeat() into bpf_helpers.h header and mark
kfuncs as __weak to allow users to feature-detect and guard their uses.
====================

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
borkmann pushed a commit that referenced this pull request Apr 30, 2023
Ido Schimmel says:

====================
bridge: Add per-{Port, VLAN} neighbor suppression

Background
==========

In order to minimize the flooding of ARP and ND messages in the VXLAN
network, EVPN includes provisions [1] that allow participating VTEPs to
suppress such messages in case they know the MAC-IP binding and can
reply on behalf of the remote host. In Linux, the above is implemented
in the bridge driver using a per-port option called "neigh_suppress"
that was added in kernel version 4.15 [2].

Motivation
==========

Some applications use ARP messages as keepalives between the application
nodes in the network. This works perfectly well when two nodes are
connected to the same VTEP. When a node goes down it will stop
responding to ARP requests and the other node will notice it
immediately.

However, when the two nodes are connected to different VTEPs and
neighbor suppression is enabled, the local VTEP will reply to ARP
requests even after the remote node went down, until certain timers
expire and the EVPN control plane decides to withdraw the MAC/IP
Advertisement route for the address. Therefore, some users would like to
be able to disable neighbor suppression on VLANs where such applications
reside and keep it enabled on the rest.

Implementation
==============

The proposed solution is to allow user space to control neighbor
suppression on a per-{Port, VLAN} basis, in a similar fashion to other
per-port options that gained per-{Port, VLAN} counterparts such as
"mcast_router". This allows users to benefit from the operational
simplicity and scalability associated with shared VXLAN devices (i.e.,
external / collect-metadata mode), while still allowing for per-VLAN/VNI
neighbor suppression control.

The user interface is extended with a new "neigh_vlan_suppress" bridge
port option that allows user space to enable per-{Port, VLAN} neighbor
suppression on the bridge port. When enabled, the existing
"neigh_suppress" option has no effect and neighbor suppression is
controlled using a new "neigh_suppress" VLAN option. Example usage:

 # bridge link set dev vxlan0 neigh_vlan_suppress on
 # bridge vlan add vid 10 dev vxlan0
 # bridge vlan set vid 10 dev vxlan0 neigh_suppress on

Testing
=======

Tested using existing bridge selftests. Added a dedicated selftest in
the last patch.

Patchset overview
=================

Patches #1-#5 are preparations.

Patch #6 adds per-{Port, VLAN} neighbor suppression support to the
bridge's data path.

Patches #7-#8 add the required netlink attributes to enable the feature.

Patch #9 adds a selftest.

iproute2 patches can be found here [3].

Changelog
=========

Since RFC [4]:

No changes.

[1] https://www.rfc-editor.org/rfc/rfc7432#section-10
[2] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=a42317785c898c0ed46db45a33b0cc71b671bf29
[3] https://github.com/idosch/iproute2/tree/submit/neigh_suppress_v1
[4] https://lore.kernel.org/netdev/20230413095830.2182382-1-idosch@nvidia.com/
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
lmb pushed a commit that referenced this pull request May 23, 2023
In the function ieee80211_tx_dequeue() there is a particular locking
sequence:

begin:
	spin_lock(&local->queue_stop_reason_lock);
	q_stopped = local->queue_stop_reasons[q];
	spin_unlock(&local->queue_stop_reason_lock);

However small the chance (increased by ftracetest), an asynchronous
interrupt can occur in between of spin_lock() and spin_unlock(),
and the interrupt routine will attempt to lock the same
&local->queue_stop_reason_lock again.

This will cause a costly reset of the CPU and the wifi device or an
altogether hang in the single CPU and single core scenario.

The only remaining spin_lock(&local->queue_stop_reason_lock) that
did not disable interrupts was patched, which should prevent any
deadlocks on the same CPU/core and the same wifi device.

This is the probable trace of the deadlock:

kernel: ================================
kernel: WARNING: inconsistent lock state
kernel: 6.3.0-rc6-mt-20230401-00001-gf86822a1170f #4 Tainted: G        W
kernel: --------------------------------
kernel: inconsistent {IN-SOFTIRQ-W} -> {SOFTIRQ-ON-W} usage.
kernel: kworker/5:0/25656 [HC0[0]:SC0[0]:HE1:SE1] takes:
kernel: ffff9d6190779478 (&local->queue_stop_reason_lock){+.?.}-{2:2}, at: return_to_handler+0x0/0x40
kernel: {IN-SOFTIRQ-W} state was registered at:
kernel:   lock_acquire+0xc7/0x2d0
kernel:   _raw_spin_lock+0x36/0x50
kernel:   ieee80211_tx_dequeue+0xb4/0x1330 [mac80211]
kernel:   iwl_mvm_mac_itxq_xmit+0xae/0x210 [iwlmvm]
kernel:   iwl_mvm_mac_wake_tx_queue+0x2d/0xd0 [iwlmvm]
kernel:   ieee80211_queue_skb+0x450/0x730 [mac80211]
kernel:   __ieee80211_xmit_fast.constprop.66+0x834/0xa50 [mac80211]
kernel:   __ieee80211_subif_start_xmit+0x217/0x530 [mac80211]
kernel:   ieee80211_subif_start_xmit+0x60/0x580 [mac80211]
kernel:   dev_hard_start_xmit+0xb5/0x260
kernel:   __dev_queue_xmit+0xdbe/0x1200
kernel:   neigh_resolve_output+0x166/0x260
kernel:   ip_finish_output2+0x216/0xb80
kernel:   __ip_finish_output+0x2a4/0x4d0
kernel:   ip_finish_output+0x2d/0xd0
kernel:   ip_output+0x82/0x2b0
kernel:   ip_local_out+0xec/0x110
kernel:   igmpv3_sendpack+0x5c/0x90
kernel:   igmp_ifc_timer_expire+0x26e/0x4e0
kernel:   call_timer_fn+0xa5/0x230
kernel:   run_timer_softirq+0x27f/0x550
kernel:   __do_softirq+0xb4/0x3a4
kernel:   irq_exit_rcu+0x9b/0xc0
kernel:   sysvec_apic_timer_interrupt+0x80/0xa0
kernel:   asm_sysvec_apic_timer_interrupt+0x1f/0x30
kernel:   _raw_spin_unlock_irqrestore+0x3f/0x70
kernel:   free_to_partial_list+0x3d6/0x590
kernel:   __slab_free+0x1b7/0x310
kernel:   kmem_cache_free+0x52d/0x550
kernel:   putname+0x5d/0x70
kernel:   do_sys_openat2+0x1d7/0x310
kernel:   do_sys_open+0x51/0x80
kernel:   __x64_sys_openat+0x24/0x30
kernel:   do_syscall_64+0x5c/0x90
kernel:   entry_SYSCALL_64_after_hwframe+0x72/0xdc
kernel: irq event stamp: 5120729
kernel: hardirqs last  enabled at (5120729): [<ffffffff9d149936>] trace_graph_return+0xd6/0x120
kernel: hardirqs last disabled at (5120728): [<ffffffff9d149950>] trace_graph_return+0xf0/0x120
kernel: softirqs last  enabled at (5069900): [<ffffffff9cf65b60>] return_to_handler+0x0/0x40
kernel: softirqs last disabled at (5067555): [<ffffffff9cf65b60>] return_to_handler+0x0/0x40
kernel:
        other info that might help us debug this:
kernel:  Possible unsafe locking scenario:
kernel:        CPU0
kernel:        ----
kernel:   lock(&local->queue_stop_reason_lock);
kernel:   <Interrupt>
kernel:     lock(&local->queue_stop_reason_lock);
kernel:
         *** DEADLOCK ***
kernel: 8 locks held by kworker/5:0/25656:
kernel:  #0: ffff9d618009d138 ((wq_completion)events_freezable){+.+.}-{0:0}, at: process_one_work+0x1ca/0x530
kernel:  #1: ffffb1ef4637fe68 ((work_completion)(&local->restart_work)){+.+.}-{0:0}, at: process_one_work+0x1ce/0x530
kernel:  #2: ffffffff9f166548 (rtnl_mutex){+.+.}-{3:3}, at: return_to_handler+0x0/0x40
kernel:  #3: ffff9d6190778728 (&rdev->wiphy.mtx){+.+.}-{3:3}, at: return_to_handler+0x0/0x40
kernel:  #4: ffff9d619077b480 (&mvm->mutex){+.+.}-{3:3}, at: return_to_handler+0x0/0x40
kernel:  #5: ffff9d61907bacd8 (&trans_pcie->mutex){+.+.}-{3:3}, at: return_to_handler+0x0/0x40
kernel:  #6: ffffffff9ef9cda0 (rcu_read_lock){....}-{1:2}, at: iwl_mvm_queue_state_change+0x59/0x3a0 [iwlmvm]
kernel:  #7: ffffffff9ef9cda0 (rcu_read_lock){....}-{1:2}, at: iwl_mvm_mac_itxq_xmit+0x42/0x210 [iwlmvm]
kernel:
        stack backtrace:
kernel: CPU: 5 PID: 25656 Comm: kworker/5:0 Tainted: G        W          6.3.0-rc6-mt-20230401-00001-gf86822a1170f #4
kernel: Hardware name: LENOVO 82H8/LNVNB161216, BIOS GGCN51WW 11/16/2022
kernel: Workqueue: events_freezable ieee80211_restart_work [mac80211]
kernel: Call Trace:
kernel:  <TASK>
kernel:  ? ftrace_regs_caller_end+0x66/0x66
kernel:  dump_stack_lvl+0x5f/0xa0
kernel:  dump_stack+0x14/0x20
kernel:  print_usage_bug.part.46+0x208/0x2a0
kernel:  mark_lock.part.47+0x605/0x630
kernel:  ? sched_clock+0xd/0x20
kernel:  ? trace_clock_local+0x14/0x30
kernel:  ? __rb_reserve_next+0x5f/0x490
kernel:  ? _raw_spin_lock+0x1b/0x50
kernel:  __lock_acquire+0x464/0x1990
kernel:  ? mark_held_locks+0x4e/0x80
kernel:  lock_acquire+0xc7/0x2d0
kernel:  ? ftrace_regs_caller_end+0x66/0x66
kernel:  ? ftrace_return_to_handler+0x8b/0x100
kernel:  ? preempt_count_add+0x4/0x70
kernel:  _raw_spin_lock+0x36/0x50
kernel:  ? ftrace_regs_caller_end+0x66/0x66
kernel:  ? ftrace_regs_caller_end+0x66/0x66
kernel:  ieee80211_tx_dequeue+0xb4/0x1330 [mac80211]
kernel:  ? prepare_ftrace_return+0xc5/0x190
kernel:  ? ftrace_graph_func+0x16/0x20
kernel:  ? 0xffffffffc02ab0b1
kernel:  ? lock_acquire+0xc7/0x2d0
kernel:  ? iwl_mvm_mac_itxq_xmit+0x42/0x210 [iwlmvm]
kernel:  ? ieee80211_tx_dequeue+0x9/0x1330 [mac80211]
kernel:  ? __rcu_read_lock+0x4/0x40
kernel:  ? ftrace_regs_caller_end+0x66/0x66
kernel:  iwl_mvm_mac_itxq_xmit+0xae/0x210 [iwlmvm]
kernel:  ? ftrace_regs_caller_end+0x66/0x66
kernel:  iwl_mvm_queue_state_change+0x311/0x3a0 [iwlmvm]
kernel:  ? ftrace_regs_caller_end+0x66/0x66
kernel:  iwl_mvm_wake_sw_queue+0x17/0x20 [iwlmvm]
kernel:  ? ftrace_regs_caller_end+0x66/0x66
kernel:  iwl_txq_gen2_unmap+0x1c9/0x1f0 [iwlwifi]
kernel:  ? ftrace_regs_caller_end+0x66/0x66
kernel:  iwl_txq_gen2_free+0x55/0x130 [iwlwifi]
kernel:  ? ftrace_regs_caller_end+0x66/0x66
kernel:  iwl_txq_gen2_tx_free+0x63/0x80 [iwlwifi]
kernel:  ? ftrace_regs_caller_end+0x66/0x66
kernel:  _iwl_trans_pcie_gen2_stop_device+0x3f3/0x5b0 [iwlwifi]
kernel:  ? _iwl_trans_pcie_gen2_stop_device+0x9/0x5b0 [iwlwifi]
kernel:  ? mutex_lock_nested+0x4/0x30
kernel:  ? ftrace_regs_caller_end+0x66/0x66
kernel:  iwl_trans_pcie_gen2_stop_device+0x5f/0x90 [iwlwifi]
kernel:  ? ftrace_regs_caller_end+0x66/0x66
kernel:  iwl_mvm_stop_device+0x78/0xd0 [iwlmvm]
kernel:  ? ftrace_regs_caller_end+0x66/0x66
kernel:  __iwl_mvm_mac_start+0x114/0x210 [iwlmvm]
kernel:  ? ftrace_regs_caller_end+0x66/0x66
kernel:  iwl_mvm_mac_start+0x76/0x150 [iwlmvm]
kernel:  ? ftrace_regs_caller_end+0x66/0x66
kernel:  drv_start+0x79/0x180 [mac80211]
kernel:  ? ftrace_regs_caller_end+0x66/0x66
kernel:  ieee80211_reconfig+0x1523/0x1ce0 [mac80211]
kernel:  ? synchronize_net+0x4/0x50
kernel:  ? ftrace_regs_caller_end+0x66/0x66
kernel:  ieee80211_restart_work+0x108/0x170 [mac80211]
kernel:  ? ftrace_regs_caller_end+0x66/0x66
kernel:  process_one_work+0x250/0x530
kernel:  ? ftrace_regs_caller_end+0x66/0x66
kernel:  worker_thread+0x48/0x3a0
kernel:  ? __pfx_worker_thread+0x10/0x10
kernel:  kthread+0x10f/0x140
kernel:  ? __pfx_kthread+0x10/0x10
kernel:  ret_from_fork+0x29/0x50
kernel:  </TASK>

Fixes: 4444bc2 ("wifi: mac80211: Proper mark iTXQs for resumption")
Link: https://lore.kernel.org/all/1f58a0d1-d2b9-d851-73c3-93fcc607501c@alu.unizg.hr/
Reported-by: Mirsad Goran Todorovac <mirsad.todorovac@alu.unizg.hr>
Cc: Gregory Greenman <gregory.greenman@intel.com>
Cc: Johannes Berg <johannes.berg@intel.com>
Link: https://lore.kernel.org/all/cdc80531-f25f-6f9d-b15f-25e16130b53a@alu.unizg.hr/
Cc: David S. Miller <davem@davemloft.net>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: Leon Romanovsky <leon@kernel.org>
Cc: Alexander Wetzel <alexander@wetzel-home.de>
Signed-off-by: Mirsad Goran Todorovac <mirsad.todorovac@alu.unizg.hr>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Reviewed-by: tag, or it goes automatically?
Link: https://lore.kernel.org/r/20230425164005.25272-1-mirsad.todorovac@alu.unizg.hr
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
lmb pushed a commit that referenced this pull request Jun 20, 2023
The cited commit adds a compeletion to remove dependency on rtnl
lock. But it causes a deadlock for multiple encapsulations:

 crash> bt ffff8aece8a64000
 PID: 1514557  TASK: ffff8aece8a64000  CPU: 3    COMMAND: "tc"
  #0 [ffffa6d14183f368] __schedule at ffffffffb8ba7f45
  #1 [ffffa6d14183f3f8] schedule at ffffffffb8ba8418
  #2 [ffffa6d14183f418] schedule_preempt_disabled at ffffffffb8ba8898
  #3 [ffffa6d14183f428] __mutex_lock at ffffffffb8baa7f8
  #4 [ffffa6d14183f4d0] mutex_lock_nested at ffffffffb8baabeb
  #5 [ffffa6d14183f4e0] mlx5e_attach_encap at ffffffffc0f48c17 [mlx5_core]
  #6 [ffffa6d14183f628] mlx5e_tc_add_fdb_flow at ffffffffc0f39680 [mlx5_core]
  #7 [ffffa6d14183f688] __mlx5e_add_fdb_flow at ffffffffc0f3b636 [mlx5_core]
  #8 [ffffa6d14183f6f0] mlx5e_tc_add_flow at ffffffffc0f3bcdf [mlx5_core]
  #9 [ffffa6d14183f728] mlx5e_configure_flower at ffffffffc0f3c1d1 [mlx5_core]
 #10 [ffffa6d14183f790] mlx5e_rep_setup_tc_cls_flower at ffffffffc0f3d529 [mlx5_core]
 #11 [ffffa6d14183f7a0] mlx5e_rep_setup_tc_cb at ffffffffc0f3d714 [mlx5_core]
 #12 [ffffa6d14183f7b0] tc_setup_cb_add at ffffffffb8931bb8
 torvalds#13 [ffffa6d14183f810] fl_hw_replace_filter at ffffffffc0dae901 [cls_flower]
 torvalds#14 [ffffa6d14183f8d8] fl_change at ffffffffc0db5c57 [cls_flower]
 torvalds#15 [ffffa6d14183f970] tc_new_tfilter at ffffffffb8936047
 torvalds#16 [ffffa6d14183fac8] rtnetlink_rcv_msg at ffffffffb88c7c31
 torvalds#17 [ffffa6d14183fb50] netlink_rcv_skb at ffffffffb8942853
 torvalds#18 [ffffa6d14183fbc0] rtnetlink_rcv at ffffffffb88c1835
 torvalds#19 [ffffa6d14183fbd0] netlink_unicast at ffffffffb8941f27
 torvalds#20 [ffffa6d14183fc18] netlink_sendmsg at ffffffffb8942245
 torvalds#21 [ffffa6d14183fc98] sock_sendmsg at ffffffffb887d482
 torvalds#22 [ffffa6d14183fcb8] ____sys_sendmsg at ffffffffb887d81a
 torvalds#23 [ffffa6d14183fd38] ___sys_sendmsg at ffffffffb88806e2
 torvalds#24 [ffffa6d14183fe90] __sys_sendmsg at ffffffffb88807a2
 torvalds#25 [ffffa6d14183ff28] __x64_sys_sendmsg at ffffffffb888080f
 torvalds#26 [ffffa6d14183ff38] do_syscall_64 at ffffffffb8b9b6a8
 torvalds#27 [ffffa6d14183ff50] entry_SYSCALL_64_after_hwframe at ffffffffb8c0007c
 crash> bt 0xffff8aeb07544000
 PID: 1110766  TASK: ffff8aeb07544000  CPU: 0    COMMAND: "kworker/u20:9"
  #0 [ffffa6d14e6b7bd8] __schedule at ffffffffb8ba7f45
  #1 [ffffa6d14e6b7c68] schedule at ffffffffb8ba8418
  #2 [ffffa6d14e6b7c88] schedule_timeout at ffffffffb8baef88
  #3 [ffffa6d14e6b7d10] wait_for_completion at ffffffffb8ba968b
  #4 [ffffa6d14e6b7d60] mlx5e_take_all_encap_flows at ffffffffc0f47ec4 [mlx5_core]
  #5 [ffffa6d14e6b7da0] mlx5e_rep_update_flows at ffffffffc0f3e734 [mlx5_core]
  #6 [ffffa6d14e6b7df8] mlx5e_rep_neigh_update at ffffffffc0f400bb [mlx5_core]
  #7 [ffffa6d14e6b7e50] process_one_work at ffffffffb80acc9c
  #8 [ffffa6d14e6b7ed0] worker_thread at ffffffffb80ad012
  #9 [ffffa6d14e6b7f10] kthread at ffffffffb80b615d
 #10 [ffffa6d14e6b7f50] ret_from_fork at ffffffffb8001b2f

After the first encap is attached, flow will be added to encap
entry's flows list. If neigh update is running at this time, the
following encaps of the flow can't hold the encap_tbl_lock and
sleep. If neigh update thread is waiting for that flow's init_done,
deadlock happens.

Fix it by holding lock outside of the for loop. If neigh update is
running, prevent encap flows from offloading. Since the lock is held
outside of the for loop, concurrent creation of encap entries is not
allowed. So remove unnecessary wait_for_completion call for res_ready.

Fixes: 95435ad ("net/mlx5e: Only access fully initialized flows in neigh update")
Signed-off-by: Chris Mi <cmi@nvidia.com>
Reviewed-by: Roi Dayan <roid@nvidia.com>
Reviewed-by: Vlad Buslov <vladbu@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
lmb pushed a commit that referenced this pull request Jun 20, 2023
Currently, the per cpu upcall counters are allocated after the vport is
created and inserted into the system. This could lead to the datapath
accessing the counters before they are allocated resulting in a kernel
Oops.

Here is an example:

  PID: 59693    TASK: ffff0005f4f51500  CPU: 0    COMMAND: "ovs-vswitchd"
   #0 [ffff80000a39b5b0] __switch_to at ffffb70f0629f2f4
   #1 [ffff80000a39b5d0] __schedule at ffffb70f0629f5cc
   #2 [ffff80000a39b650] preempt_schedule_common at ffffb70f0629fa60
   #3 [ffff80000a39b670] dynamic_might_resched at ffffb70f0629fb58
   #4 [ffff80000a39b680] mutex_lock_killable at ffffb70f062a1388
   #5 [ffff80000a39b6a0] pcpu_alloc at ffffb70f0594460c
   #6 [ffff80000a39b750] __alloc_percpu_gfp at ffffb70f05944e68
   #7 [ffff80000a39b760] ovs_vport_cmd_new at ffffb70ee6961b90 [openvswitch]
   ...

  PID: 58682    TASK: ffff0005b2f0bf00  CPU: 0    COMMAND: "kworker/0:3"
   #0 [ffff80000a5d2f40] machine_kexec at ffffb70f056a0758
   #1 [ffff80000a5d2f70] __crash_kexec at ffffb70f057e2994
   #2 [ffff80000a5d3100] crash_kexec at ffffb70f057e2ad8
   #3 [ffff80000a5d3120] die at ffffb70f0628234c
   #4 [ffff80000a5d31e0] die_kernel_fault at ffffb70f062828a8
   #5 [ffff80000a5d3210] __do_kernel_fault at ffffb70f056a31f4
   #6 [ffff80000a5d3240] do_bad_area at ffffb70f056a32a4
   #7 [ffff80000a5d3260] do_translation_fault at ffffb70f062a9710
   #8 [ffff80000a5d3270] do_mem_abort at ffffb70f056a2f74
   #9 [ffff80000a5d32a0] el1_abort at ffffb70f06297dac
  #10 [ffff80000a5d32d0] el1h_64_sync_handler at ffffb70f06299b24
  #11 [ffff80000a5d3410] el1h_64_sync at ffffb70f056812dc
  #12 [ffff80000a5d3430] ovs_dp_upcall at ffffb70ee6963c84 [openvswitch]
  torvalds#13 [ffff80000a5d3470] ovs_dp_process_packet at ffffb70ee6963fdc [openvswitch]
  torvalds#14 [ffff80000a5d34f0] ovs_vport_receive at ffffb70ee6972c78 [openvswitch]
  torvalds#15 [ffff80000a5d36f0] netdev_port_receive at ffffb70ee6973948 [openvswitch]
  torvalds#16 [ffff80000a5d3720] netdev_frame_hook at ffffb70ee6973a28 [openvswitch]
  torvalds#17 [ffff80000a5d3730] __netif_receive_skb_core.constprop.0 at ffffb70f06079f90

We moved the per cpu upcall counter allocation to the existing vport
alloc and free functions to solve this.

Fixes: 95637d9 ("net: openvswitch: release vport resources on failure")
Fixes: 1933ea3 ("net: openvswitch: Add support to count upcall packets")
Signed-off-by: Eelco Chaudron <echaudro@redhat.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Acked-by: Aaron Conole <aconole@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
borkmann pushed a commit that referenced this pull request Jul 19, 2023
Petr Machata says:

====================
mlxsw: Manage RIF across PVID changes

The mlxsw driver currently makes the assumption that the user applies
configuration in a bottom-up manner. Thus netdevices need to be added to
the bridge before IP addresses are configured on that bridge or SVI added
on top of it. Enslaving a netdevice to another netdevice that already has
uppers is in fact forbidden by mlxsw for this reason. Despite this safety,
it is rather easy to get into situations where the offloaded configuration
is just plain wrong.

As an example, take a front panel port, configure an IP address: it gets a
RIF. Now enslave the port to the bridge, and the RIF is gone. Remove the
port from the bridge again, but the RIF never comes back. There is a number
of similar situations, where changing the configuration there and back
utterly breaks the offload.

The situation is going to be made better by implementing a range of replays
and post-hoc offloads.

In this patch set, address the ordering issues related to creation of
bridge RIFs. Currently, mlxsw has several shortcomings with regards to RIF
handling due to PVID changes:

- In order to cause RIF for a bridge device to be created, the user is
  expected first to set PVID, then to add an IP address. The reverse
  ordering is disallowed, which is not very user-friendly.

- When such bridge gets a VLAN upper whose VID was the same as the existing
  PVID, and this VLAN netdevice gets an IP address, a RIF is created for
  this netdevice. The new RIF is then assigned to the 802.1Q FID for the
  given VID. This results in a working configuration. However, then, when
  the VLAN netdevice is removed again, the RIF for the bridge itself is
  never reassociated to the PVID.

- PVID cannot be changed once the bridge has uppers. Presumably this is
  because the driver does not manage RIFs properly in face of PVID changes.
  However, as the previous point shows, it is still possible to get into
  invalid configurations.

This patch set addresses these issues and relaxes some of the ordering
requirements that mlxsw had. The patch set proceeds as follows:

- In patch #1, pass extack to mlxsw_sp_br_ban_rif_pvid_change()

- To relax ordering between setting PVID and adding an IP address to a
  bridge, mlxsw must be able to request that a RIF is created with a given
  VLAN ID, instead of trying to deduce it from the current netdevice
  settings, which do not reflect the user-requested values yet. This is
  done in patches #2 and #3.

- Similarly, mlxsw_sp_inetaddr_bridge_event() will need to make decisions
  based on the user-requested value of PVID, not the current value. Thus in
  patches #4 and #5, add a new argument which carries the requested PVID
  value.

- Finally in patch #6 relax the ban on PVID changes when a bridge has
  uppers. Instead, add the logic necessary for creation of a RIF as a
  result of PVID change.

- Relevant selftests are presented afterwards. In patch #7 a preparatory
  helper is added to lib.sh. Patches #8, #9, #10 and #11 include selftests
  themselves.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
borkmann pushed a commit that referenced this pull request Jul 28, 2023
Petr Machata says:

====================
mlxsw: Permit enslavement to netdevices with uppers

The mlxsw driver currently makes the assumption that the user applies
configuration in a bottom-up manner. Thus netdevices need to be added to
the bridge before IP addresses are configured on that bridge or SVI added
on top of it. Enslaving a netdevice to another netdevice that already has
uppers is in fact forbidden by mlxsw for this reason. Despite this safety,
it is rather easy to get into situations where the offloaded configuration
is just plain wrong.

As an example, take a front panel port, configure an IP address: it gets a
RIF. Now enslave the port to the bridge, and the RIF is gone. Remove the
port from the bridge again, but the RIF never comes back. There is a number
of similar situations, where changing the configuration there and back
utterly breaks the offload.

Similarly, detaching a front panel port from a configured topology means
unoffloading of this whole topology -- VLAN uppers, next hops, etc.
Attaching the port back is then not permitted at all. If it were, it would
not result in a working configuration, because much of mlxsw is written to
react to changes in immediate configuration. There is nothing that would go
visit netdevices in the attached-to topology and offload existing routes
and VLAN memberships, for example.

In this patchset, introduce a number of replays to be invoked so that this
sort of post-hoc offload is supported. Then remove the vetoes that
disallowed enslavement of front panel ports to other netdevices with
uppers.

The patchset progresses as follows:

- In patch #1, fix an issue in the bridge driver. To my knowledge, the
  issue could not have resulted in a buggy behavior previously, and thus is
  packaged with this patchset instead of being sent separately to net.

- In patch #2, add a new helper to the switchdev code.

- In patch #3, drop mlxsw selftests that will not be relevant after this
  patchset anymore.

- Patches #4, #5, #6, #7 and #8 prepare the codebase for smoother
  introduction of the rest of the code.

- Patches #9, #10, #11, #12, torvalds#13 and torvalds#14 replay various aspects of upper
  configuration when a front panel port is introduced into a topology.
  Individual patches take care of bridge and LAG RIF memberships, switchdev
  replay, nexthop and neighbors replay, and MACVLAN offload.

- Patches torvalds#15 and torvalds#16 introduce RIFs for newly-relevant netdevices when a
  front panel port is enslaved (in which case all uppers are newly
  relevant), or, respectively, deslaved (in which case the newly-relevant
  netdevice is the one being deslaved).

- Up until this point, the introduced scaffolding was not really used,
  because mlxsw still forbids enslavement of mlxsw netdevices to uppers
  with uppers. In patch torvalds#17, this condition is finally relaxed.

A sizable selftest suite is available to test all this new code. That will
be sent in a separate patchset.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
borkmann pushed a commit that referenced this pull request Jul 28, 2023
Petr Machata says:

====================
mlxsw: Speed up transceiver module EEPROM dump

Ido Schimmel writes:

Old firmware versions could only read up to 48 bytes from a transceiver
module's EEPROM in one go. Newer versions can read up to 128 bytes,
resulting in fewer transactions.

Query support for the new capability during driver initialization and if
supported, read up to 128 bytes in one go.

This is going to be especially useful for upcoming transceiver module
firmware flashing support.

Before:

 # perf stat -e devlink:devlink_hwmsg -- ethtool -m swp11 page 0x1 offset 128 length 128 i2c 0x50
 [...]
  Performance counter stats for 'ethtool -m swp11 page 0x1 offset 128 length 128 i2c 0x50':

                  3      devlink:devlink_hwmsg

After:

 # perf stat -e devlink:devlink_hwmsg -- ethtool -m swp11 page 0x1 offset 128 length 128 i2c 0x50
 [...]
  Performance counter stats for 'ethtool -m swp11 page 0x1 offset 128 length 128 i2c 0x50':

                  1      devlink:devlink_hwmsg

Patches #1-#4 are preparations / cleanups.

Patch #5 adds support for the new read size.
====================

Link: https://lore.kernel.org/r/cover.1690281940.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
borkmann pushed a commit that referenced this pull request Aug 4, 2023
The cited commit holds encap tbl lock unconditionally when setting
up dests. But it may cause the following deadlock:

 PID: 1063722  TASK: ffffa062ca5d0000  CPU: 13   COMMAND: "handler8"
  #0 [ffffb14de05b7368] __schedule at ffffffffa1d5aa91
  #1 [ffffb14de05b7410] schedule at ffffffffa1d5afdb
  #2 [ffffb14de05b7430] schedule_preempt_disabled at ffffffffa1d5b528
  #3 [ffffb14de05b7440] __mutex_lock at ffffffffa1d5d6cb
  #4 [ffffb14de05b74e8] mutex_lock_nested at ffffffffa1d5ddeb
  #5 [ffffb14de05b74f8] mlx5e_tc_tun_encap_dests_set at ffffffffc12f2096 [mlx5_core]
  #6 [ffffb14de05b7568] post_process_attr at ffffffffc12d9fc5 [mlx5_core]
  #7 [ffffb14de05b75a0] mlx5e_tc_add_fdb_flow at ffffffffc12de877 [mlx5_core]
  #8 [ffffb14de05b75f0] __mlx5e_add_fdb_flow at ffffffffc12e0eef [mlx5_core]
  #9 [ffffb14de05b7660] mlx5e_tc_add_flow at ffffffffc12e12f7 [mlx5_core]
 #10 [ffffb14de05b76b8] mlx5e_configure_flower at ffffffffc12e1686 [mlx5_core]
 #11 [ffffb14de05b7720] mlx5e_rep_indr_offload at ffffffffc12e3817 [mlx5_core]
 #12 [ffffb14de05b7730] mlx5e_rep_indr_setup_tc_cb at ffffffffc12e388a [mlx5_core]
 torvalds#13 [ffffb14de05b7740] tc_setup_cb_add at ffffffffa1ab2ba8
 torvalds#14 [ffffb14de05b77a0] fl_hw_replace_filter at ffffffffc0bdec2f [cls_flower]
 torvalds#15 [ffffb14de05b7868] fl_change at ffffffffc0be6caa [cls_flower]
 torvalds#16 [ffffb14de05b7908] tc_new_tfilter at ffffffffa1ab71f0

[1031218.028143]  wait_for_completion+0x24/0x30
[1031218.028589]  mlx5e_update_route_decap_flows+0x9a/0x1e0 [mlx5_core]
[1031218.029256]  mlx5e_tc_fib_event_work+0x1ad/0x300 [mlx5_core]
[1031218.029885]  process_one_work+0x24e/0x510

Actually no need to hold encap tbl lock if there is no encap action.
Fix it by checking if encap action exists or not before holding
encap tbl lock.

Fixes: 37c3b9f ("net/mlx5e: Prevent encap offload when neigh update is running")
Signed-off-by: Chris Mi <cmi@nvidia.com>
Reviewed-by: Vlad Buslov <vladbu@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
borkmann pushed a commit that referenced this pull request Aug 4, 2023
Petr Machata says:

====================
selftests: New selftests for out-of-order-operations patches in mlxsw

In the past, the mlxsw driver made the assumption that the user applies
configuration in a bottom-up manner. Thus netdevices needed to be added to
the bridge before IP addresses were configured on that bridge or SVI added
on top of it, because whatever happened before a netdevice was mlxsw upper
was generally ignored by mlxsw. Recently, several patch series were pushed
to introduce the bookkeeping and replays necessary to offload the full
state, not just the immediate configuration step.

In this patchset, introduce new selftests that directly exercise the out of
order code paths in mlxsw.

- Patch #1 adds new tests into the existing selftest router_bridge.sh.
- Patches #2-#5 add new generic selftests.
- Patches #6-#8 add new mlxsw-specific selftests.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
borkmann pushed a commit that referenced this pull request Aug 14, 2023
Alexander Lobakin says:

====================
page_pool: a couple of assorted optimizations

That initially was a spin-off of the IAVF PP series[0], but has grown
(and shrunk) since then a bunch. In fact, it consists of three
semi-independent blocks:

* #1-2: Compile-time optimization. Split page_pool.h into 2 headers to
  not overbloat the consumers not needing complex inline helpers and
  then stop including it in skbuff.h at all. The first patch is also
  prereq for the whole series.
* #3: Improve cacheline locality for users of the Page Pool frag API.
* #4-6: Use direct cache recycling more aggressively, when it is safe
  obviously. In addition, make sure nobody wants to use Page Pool API
  with disabled interrupts.

Patches #1 and #5 are authored by Yunsheng and Jakub respectively, with
small modifications from my side as per ML discussions.
For the perf numbers for #3-6, please see individual commit messages.

Also available on my GH with many more Page Pool goodies[1].

[0] https://lore.kernel.org/netdev/20230530150035.1943669-1-aleksander.lobakin@intel.com
[1] https://github.com/alobakin/linux/commits/iavf-pp-frag
====================

Link: https://lore.kernel.org/r/20230804180529.2483231-1-aleksander.lobakin@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
borkmann pushed a commit that referenced this pull request Aug 24, 2023
…l/git/netfilter/nf

Florisn Westphal says:

====================
These are netfilter fixes for the *net* tree.

First patch resolves a false-positive lockdep splat:
rcu_dereference is used outside of rcu read lock.  Let lockdep
validate that the transaction mutex is locked.

Second patch fixes a kdoc warning added in previous PR.

Third patch fixes a memory leak:
The catchall element isn't disabled correctly, this allows
userspace to deactivate the element again. This results in refcount
underflow which in turn prevents memory release. This was always
broken since the feature was added in 5.13.

Patch 4 fixes an incorrect change in the previous pull request:
Adding a duplicate key to a set should work if the duplicate key
has expired, restore this behaviour. All from myself.

Patch #5 resolves an old historic artifact in sctp conntrack:
a 300ms timeout for shutdown_ack. Increase this to 3s.  From Xin Long.

Patch #6 fixes a sysctl data race in ipvs, two threads can clobber the
sysctl value, from Sishuai Gong. This is a day-0 bug that predates git
history.

Patches 7, 8 and 9, from Pablo Neira Ayuso, are also followups
for the previous GC rework in nf_tables: The netlink notifier and the
netns exit path must both increment the gc worker seqcount, else worker
may encounter stale (free'd) pointers.
================

Signed-off-by: David S. Miller <davem@davemloft.net>
borkmann pushed a commit that referenced this pull request Sep 24, 2023
Noticed with:

  make EXTRA_CFLAGS="-fsanitize=address" BUILD_BPF_SKEL=1 CORESIGHT=1 O=/tmp/build/perf-tools-next -C tools/perf install-bin

Direct leak of 45 byte(s) in 1 object(s) allocated from:
    #0 0x7f213f87243b in strdup (/lib64/libasan.so.8+0x7243b)
    #1 0x63d15f in evsel__set_filter util/evsel.c:1371
    #2 0x63d15f in evsel__append_filter util/evsel.c:1387
    #3 0x63d15f in evsel__append_tp_filter util/evsel.c:1400
    #4 0x62cd52 in evlist__append_tp_filter util/evlist.c:1145
    #5 0x62cd52 in evlist__append_tp_filter_pids util/evlist.c:1196
    #6 0x541e49 in trace__set_filter_loop_pids /home/acme/git/perf-tools/tools/perf/builtin-trace.c:3646
    #7 0x541e49 in trace__set_filter_pids /home/acme/git/perf-tools/tools/perf/builtin-trace.c:3670
    #8 0x541e49 in trace__run /home/acme/git/perf-tools/tools/perf/builtin-trace.c:3970
    #9 0x541e49 in cmd_trace /home/acme/git/perf-tools/tools/perf/builtin-trace.c:5141
    #10 0x5ef1a2 in run_builtin /home/acme/git/perf-tools/tools/perf/perf.c:323
    #11 0x4196da in handle_internal_command /home/acme/git/perf-tools/tools/perf/perf.c:377
    #12 0x4196da in run_argv /home/acme/git/perf-tools/tools/perf/perf.c:421
    torvalds#13 0x4196da in main /home/acme/git/perf-tools/tools/perf/perf.c:537
    torvalds#14 0x7f213e84a50f in __libc_start_call_main (/lib64/libc.so.6+0x2750f)

Free it on evsel__exit().

Acked-by: Ian Rogers <irogers@google.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Link: https://lore.kernel.org/lkml/20230719202951.534582-2-acme@kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
borkmann pushed a commit that referenced this pull request Sep 24, 2023
To plug these leaks detected with:

  $ make EXTRA_CFLAGS="-fsanitize=address" BUILD_BPF_SKEL=1 CORESIGHT=1 O=/tmp/build/perf-tools-next -C tools/perf install-bin

  =================================================================
  ==473890==ERROR: LeakSanitizer: detected memory leaks

  Direct leak of 112 byte(s) in 1 object(s) allocated from:
    #0 0x7fdf19aba097 in calloc (/lib64/libasan.so.8+0xba097)
    #1 0x987836 in zalloc (/home/acme/bin/perf+0x987836)
    #2 0x5367ae in thread_trace__new /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:1289
    #3 0x5367ae in thread__trace /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:1307
    #4 0x5367ae in trace__sys_exit /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:2468
    #5 0x52bf34 in trace__handle_event /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:3177
    #6 0x52bf34 in __trace__deliver_event /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:3685
    #7 0x542927 in trace__deliver_event /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:3712
    #8 0x542927 in trace__run /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:4055
    #9 0x542927 in cmd_trace /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:5141
    #10 0x5ef1a2 in run_builtin /home/acme/git/perf-tools-next/tools/perf/perf.c:323
    #11 0x4196da in handle_internal_command /home/acme/git/perf-tools-next/tools/perf/perf.c:377
    #12 0x4196da in run_argv /home/acme/git/perf-tools-next/tools/perf/perf.c:421
    torvalds#13 0x4196da in main /home/acme/git/perf-tools-next/tools/perf/perf.c:537
    torvalds#14 0x7fdf18a4a50f in __libc_start_call_main (/lib64/libc.so.6+0x2750f)

  Direct leak of 2048 byte(s) in 1 object(s) allocated from:
    #0 0x7f788fcba6af in __interceptor_malloc (/lib64/libasan.so.8+0xba6af)
    #1 0x5337c0 in trace__sys_enter /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:2342
    #2 0x52bfb4 in trace__handle_event /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:3191
    #3 0x52bfb4 in __trace__deliver_event /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:3699
    #4 0x542883 in trace__deliver_event /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:3726
    #5 0x542883 in trace__run /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:4069
    #6 0x542883 in cmd_trace /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:5155
    #7 0x5ef232 in run_builtin /home/acme/git/perf-tools-next/tools/perf/perf.c:323
    #8 0x4196da in handle_internal_command /home/acme/git/perf-tools-next/tools/perf/perf.c:377
    #9 0x4196da in run_argv /home/acme/git/perf-tools-next/tools/perf/perf.c:421
    #10 0x4196da in main /home/acme/git/perf-tools-next/tools/perf/perf.c:537
    #11 0x7f788ec4a50f in __libc_start_call_main (/lib64/libc.so.6+0x2750f)

  Indirect leak of 48 byte(s) in 1 object(s) allocated from:
    #0 0x7fdf19aba6af in __interceptor_malloc (/lib64/libasan.so.8+0xba6af)
    #1 0x77b335 in intlist__new util/intlist.c:116
    #2 0x5367fd in thread_trace__new /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:1293
    #3 0x5367fd in thread__trace /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:1307
    #4 0x5367fd in trace__sys_exit /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:2468
    #5 0x52bf34 in trace__handle_event /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:3177
    #6 0x52bf34 in __trace__deliver_event /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:3685
    #7 0x542927 in trace__deliver_event /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:3712
    #8 0x542927 in trace__run /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:4055
    #9 0x542927 in cmd_trace /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:5141
    #10 0x5ef1a2 in run_builtin /home/acme/git/perf-tools-next/tools/perf/perf.c:323
    #11 0x4196da in handle_internal_command /home/acme/git/perf-tools-next/tools/perf/perf.c:377
    #12 0x4196da in run_argv /home/acme/git/perf-tools-next/tools/perf/perf.c:421
    torvalds#13 0x4196da in main /home/acme/git/perf-tools-next/tools/perf/perf.c:537
    torvalds#14 0x7fdf18a4a50f in __libc_start_call_main (/lib64/libc.so.6+0x2750f)

Acked-by: Ian Rogers <irogers@google.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Link: https://lore.kernel.org/lkml/20230719202951.534582-4-acme@kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
borkmann pushed a commit that referenced this pull request Sep 24, 2023
In 3cb4d5e ("perf trace: Free syscall tp fields in
evsel->priv") it only was freeing if strcmp(evsel->tp_format->system,
"syscalls") returned zero, while the corresponding initialization of
evsel->priv was being performed if it was _not_ zero, i.e. if the tp
system wasn't 'syscalls'.

Just stop looking for that and free it if evsel->priv was set, which
should be equivalent.

Also use the pre-existing evsel_trace__delete() function.

This resolves these leaks, detected with:

  $ make EXTRA_CFLAGS="-fsanitize=address" BUILD_BPF_SKEL=1 CORESIGHT=1 O=/tmp/build/perf-tools-next -C tools/perf install-bin

  =================================================================
  ==481565==ERROR: LeakSanitizer: detected memory leaks

  Direct leak of 40 byte(s) in 1 object(s) allocated from:
      #0 0x7f7343cba097 in calloc (/lib64/libasan.so.8+0xba097)
      #1 0x987966 in zalloc (/home/acme/bin/perf+0x987966)
      #2 0x52f9b9 in evsel_trace__new /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:307
      #3 0x52f9b9 in evsel__syscall_tp /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:333
      #4 0x52f9b9 in evsel__init_raw_syscall_tp /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:458
      #5 0x52f9b9 in perf_evsel__raw_syscall_newtp /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:480
      #6 0x540e8b in trace__add_syscall_newtp /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:3212
      #7 0x540e8b in trace__run /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:3891
      #8 0x540e8b in cmd_trace /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:5156
      #9 0x5ef262 in run_builtin /home/acme/git/perf-tools-next/tools/perf/perf.c:323
      #10 0x4196da in handle_internal_command /home/acme/git/perf-tools-next/tools/perf/perf.c:377
      #11 0x4196da in run_argv /home/acme/git/perf-tools-next/tools/perf/perf.c:421
      #12 0x4196da in main /home/acme/git/perf-tools-next/tools/perf/perf.c:537
      torvalds#13 0x7f7342c4a50f in __libc_start_call_main (/lib64/libc.so.6+0x2750f)

  Direct leak of 40 byte(s) in 1 object(s) allocated from:
      #0 0x7f7343cba097 in calloc (/lib64/libasan.so.8+0xba097)
      #1 0x987966 in zalloc (/home/acme/bin/perf+0x987966)
      #2 0x52f9b9 in evsel_trace__new /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:307
      #3 0x52f9b9 in evsel__syscall_tp /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:333
      #4 0x52f9b9 in evsel__init_raw_syscall_tp /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:458
      #5 0x52f9b9 in perf_evsel__raw_syscall_newtp /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:480
      #6 0x540dd1 in trace__add_syscall_newtp /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:3205
      #7 0x540dd1 in trace__run /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:3891
      #8 0x540dd1 in cmd_trace /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:5156
      #9 0x5ef262 in run_builtin /home/acme/git/perf-tools-next/tools/perf/perf.c:323
      #10 0x4196da in handle_internal_command /home/acme/git/perf-tools-next/tools/perf/perf.c:377
      #11 0x4196da in run_argv /home/acme/git/perf-tools-next/tools/perf/perf.c:421
      #12 0x4196da in main /home/acme/git/perf-tools-next/tools/perf/perf.c:537
      torvalds#13 0x7f7342c4a50f in __libc_start_call_main (/lib64/libc.so.6+0x2750f)

  SUMMARY: AddressSanitizer: 80 byte(s) leaked in 2 allocation(s).
  [root@quaco ~]#

With this we plug all leaks with "perf trace sleep 1".

Fixes: 3cb4d5e ("perf trace: Free syscall tp fields in evsel->priv")
Acked-by: Ian Rogers <irogers@google.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Riccardo Mancini <rickyman7@gmail.com>
Link: https://lore.kernel.org/lkml/20230719202951.534582-5-acme@kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
borkmann pushed a commit that referenced this pull request Sep 24, 2023
…failure to add a probe

Building perf with EXTRA_CFLAGS="-fsanitize=address" a leak is detect
when trying to add a probe to a non-existent function:

  # perf probe -x ~/bin/perf dso__neW
  Probe point 'dso__neW' not found.
    Error: Failed to add events.

  =================================================================
  ==296634==ERROR: LeakSanitizer: detected memory leaks

  Direct leak of 128 byte(s) in 1 object(s) allocated from:
      #0 0x7f67642ba097 in calloc (/lib64/libasan.so.8+0xba097)
      #1 0x7f67641a76f1 in allocate_cfi (/lib64/libdw.so.1+0x3f6f1)

  Direct leak of 65 byte(s) in 1 object(s) allocated from:
      #0 0x7f67642b95b5 in __interceptor_realloc.part.0 (/lib64/libasan.so.8+0xb95b5)
      #1 0x6cac75 in strbuf_grow util/strbuf.c:64
      #2 0x6ca934 in strbuf_init util/strbuf.c:25
      #3 0x9337d2 in synthesize_perf_probe_point util/probe-event.c:2018
      #4 0x92be51 in try_to_find_probe_trace_events util/probe-event.c:964
      #5 0x93d5c6 in convert_to_probe_trace_events util/probe-event.c:3512
      #6 0x93d6d5 in convert_perf_probe_events util/probe-event.c:3529
      #7 0x56f37f in perf_add_probe_events /var/home/acme/git/perf-tools-next/tools/perf/builtin-probe.c:354
      #8 0x572fbc in __cmd_probe /var/home/acme/git/perf-tools-next/tools/perf/builtin-probe.c:738
      #9 0x5730f2 in cmd_probe /var/home/acme/git/perf-tools-next/tools/perf/builtin-probe.c:766
      #10 0x635d81 in run_builtin /var/home/acme/git/perf-tools-next/tools/perf/perf.c:323
      #11 0x6362c1 in handle_internal_command /var/home/acme/git/perf-tools-next/tools/perf/perf.c:377
      #12 0x63667a in run_argv /var/home/acme/git/perf-tools-next/tools/perf/perf.c:421
      torvalds#13 0x636b8d in main /var/home/acme/git/perf-tools-next/tools/perf/perf.c:537
      torvalds#14 0x7f676302950f in __libc_start_call_main (/lib64/libc.so.6+0x2950f)

  SUMMARY: AddressSanitizer: 193 byte(s) leaked in 2 allocation(s).
  #

synthesize_perf_probe_point() returns a "detachec" strbuf, i.e. a
malloc'ed string that needs to be free'd.

An audit will be performed to find other such cases.

Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Ian Rogers <irogers@google.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Link: https://lore.kernel.org/lkml/ZM0l1Oxamr4SVjfY@kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
borkmann pushed a commit that referenced this pull request Sep 24, 2023
While debugging a segfault on 'perf lock contention' without an
available perf.data file I noticed that it was basically calling:

	perf_session__delete(ERR_PTR(-1))

Resulting in:

  (gdb) run lock contention
  Starting program: /root/bin/perf lock contention
  [Thread debugging using libthread_db enabled]
  Using host libthread_db library "/lib64/libthread_db.so.1".
  failed to open perf.data: No such file or directory  (try 'perf record' first)
  Initializing perf session failed

  Program received signal SIGSEGV, Segmentation fault.
  0x00000000005e7515 in auxtrace__free (session=0xffffffffffffffff) at util/auxtrace.c:2858
  2858		if (!session->auxtrace)
  (gdb) p session
  $1 = (struct perf_session *) 0xffffffffffffffff
  (gdb) bt
  #0  0x00000000005e7515 in auxtrace__free (session=0xffffffffffffffff) at util/auxtrace.c:2858
  #1  0x000000000057bb4d in perf_session__delete (session=0xffffffffffffffff) at util/session.c:300
  #2  0x000000000047c421 in __cmd_contention (argc=0, argv=0x7fffffffe200) at builtin-lock.c:2161
  #3  0x000000000047dc95 in cmd_lock (argc=0, argv=0x7fffffffe200) at builtin-lock.c:2604
  #4  0x0000000000501466 in run_builtin (p=0xe597a8 <commands+552>, argc=2, argv=0x7fffffffe200) at perf.c:322
  #5  0x00000000005016d5 in handle_internal_command (argc=2, argv=0x7fffffffe200) at perf.c:375
  #6  0x0000000000501824 in run_argv (argcp=0x7fffffffe02c, argv=0x7fffffffe020) at perf.c:419
  #7  0x0000000000501b11 in main (argc=2, argv=0x7fffffffe200) at perf.c:535
  (gdb)

So just set it to NULL after using PTR_ERR(session) to decode the error
as perf_session__delete(NULL) is supported.

The same problem was found in 'perf top' after an audit of all
perf_session__new() failure handling.

Fixes: 6ef81c5 ("perf session: Return error code for perf_session__new() function on failure")
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Alexey Budankov <alexey.budankov@linux.intel.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Jeremie Galarneau <jeremie.galarneau@efficios.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Kate Stewart <kstewart@linuxfoundation.org>
Cc: Mamatha Inamdar <mamatha4@linux.vnet.ibm.com>
Cc: Mukesh Ojha <mojha@codeaurora.org>
Cc: Nageswara R Sastry <rnsastry@linux.vnet.ibm.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ravi Bangoria <ravi.bangoria@linux.ibm.com>
Cc: Shawn Landden <shawn@git.icu>
Cc: Song Liu <songliubraving@fb.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tzvetomir Stoyanov <tstoyanov@vmware.com>
Link: https://lore.kernel.org/lkml/ZN4Q2rxxsL08A8rd@kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
borkmann pushed a commit that referenced this pull request Sep 24, 2023
While debugging a segfault on 'perf lock contention' without an
available perf.data file I noticed that it was basically calling:

	perf_session__delete(ERR_PTR(-1))

Resulting in:

  (gdb) run lock contention
  Starting program: /root/bin/perf lock contention
  [Thread debugging using libthread_db enabled]
  Using host libthread_db library "/lib64/libthread_db.so.1".
  failed to open perf.data: No such file or directory  (try 'perf record' first)
  Initializing perf session failed

  Program received signal SIGSEGV, Segmentation fault.
  0x00000000005e7515 in auxtrace__free (session=0xffffffffffffffff) at util/auxtrace.c:2858
  2858		if (!session->auxtrace)
  (gdb) p session
  $1 = (struct perf_session *) 0xffffffffffffffff
  (gdb) bt
  #0  0x00000000005e7515 in auxtrace__free (session=0xffffffffffffffff) at util/auxtrace.c:2858
  #1  0x000000000057bb4d in perf_session__delete (session=0xffffffffffffffff) at util/session.c:300
  #2  0x000000000047c421 in __cmd_contention (argc=0, argv=0x7fffffffe200) at builtin-lock.c:2161
  #3  0x000000000047dc95 in cmd_lock (argc=0, argv=0x7fffffffe200) at builtin-lock.c:2604
  #4  0x0000000000501466 in run_builtin (p=0xe597a8 <commands+552>, argc=2, argv=0x7fffffffe200) at perf.c:322
  #5  0x00000000005016d5 in handle_internal_command (argc=2, argv=0x7fffffffe200) at perf.c:375
  #6  0x0000000000501824 in run_argv (argcp=0x7fffffffe02c, argv=0x7fffffffe020) at perf.c:419
  #7  0x0000000000501b11 in main (argc=2, argv=0x7fffffffe200) at perf.c:535
  (gdb)

So just set it to NULL after using PTR_ERR(session) to decode the error
as perf_session__delete(NULL) is supported.

Fixes: eef4fee ("perf lock: Dynamically allocate lockhash_table")
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Ian Rogers <irogers@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: K Prateek Nayak <kprateek.nayak@amd.com>
Cc: Kan Liang <kan.liang@linux.intel.com>
Cc: Leo Yan <leo.yan@linaro.org>
Cc: Mamatha Inamdar <mamatha4@linux.vnet.ibm.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ravi Bangoria <ravi.bangoria@amd.com>
Cc: Ross Zwisler <zwisler@chromium.org>
Cc: Sean Christopherson <seanjc@google.com>
Cc: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cc: Tiezhu Yang <yangtiezhu@loongson.cn>
Cc: Yang Jihong <yangjihong1@huawei.com>
Link: https://lore.kernel.org/lkml/ZN4R1AYfsD2J8lRs@kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
borkmann pushed a commit that referenced this pull request Sep 24, 2023
Jiri Pirko says:

====================
expose devlink instances relationships

From: Jiri Pirko <jiri@nvidia.com>

Currently, the user can instantiate new SF using "devlink port add"
command. That creates an E-switch representor devlink port.

When user activates this SF, there is an auxiliary device created and
probed for it which leads to SF devlink instance creation.

There is 1:1 relationship between E-switch representor devlink port and
the SF auxiliary device devlink instance.

Also, for example in mlx5, one devlink instance is created for
PCI device and one is created for an auxiliary device that represents
the uplink port. The relation between these is invisible to the user.

Patches #1-#3 and #5 are small preparations.

Patch #4 adds netnsid attribute for nested devlink if that in a
different namespace.

Patch #5 is the main one in this set, introduces the relationship
tracking infrastructure later on used to track SFs, linecards and
devlink instance relationships with nested devlink instances.

Expose the relation to the user by introducing new netlink attribute
DEVLINK_PORT_FN_ATTR_DEVLINK which contains the devlink instance related
to devlink port function. This is done by patch #8.
Patch #9 implements this in mlx5 driver.

Patch #10 converts the linecard nested devlink handling to the newly
introduced rel infrastructure.

Patch #11 benefits from the rel infra and introduces possiblitily to
have relation between devlink instances.
Patch #12 implements this in mlx5 driver.

Examples:
$ devlink dev
pci/0000:08:00.0: nested_devlink auxiliary/mlx5_core.eth.0
pci/0000:08:00.1: nested_devlink auxiliary/mlx5_core.eth.1
auxiliary/mlx5_core.eth.1
auxiliary/mlx5_core.eth.0

$ devlink port add pci/0000:08:00.0 flavour pcisf pfnum 0 sfnum 106
pci/0000:08:00.0/32768: type eth netdev eth4 flavour pcisf controller 0 pfnum 0 sfnum 106 splittable false
  function:
    hw_addr 00:00:00:00:00:00 state inactive opstate detached roce enable
$ devlink port function set pci/0000:08:00.0/32768 state active
$ devlink port show pci/0000:08:00.0/32768
pci/0000:08:00.0/32768: type eth netdev eth4 flavour pcisf controller 0 pfnum 0 sfnum 106 splittable false
  function:
    hw_addr 00:00:00:00:00:00 state active opstate attached roce enable nested_devlink auxiliary/mlx5_core.sf.2

$ devlink port show pci/0000:08:00.0/32768
pci/0000:08:00.0/32768: type eth netdev eth4 flavour pcisf controller 0 pfnum 0 sfnum 106 splittable false
  function:
    hw_addr 00:00:00:00:00:00 state active opstate attached roce enable nested_devlink auxiliary/mlx5_core.sf.2 nested_devlink_netns ns1
====================

Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
borkmann pushed a commit that referenced this pull request Sep 24, 2023
Sebastian Andrzej Siewior says:

====================
net: hsr: Properly parse HSRv1 supervisor frames.

this is a follow-up to
	https://lore.kernel.org/all/20230825153111.228768-1-lukma@denx.de/
replacing
	https://lore.kernel.org/all/20230914124731.1654059-1-lukma@denx.de/

by grabing/ adding tags and reposting with a commit message plus a
missing __packed to a struct (#2) plus extending the testsuite to sover
HSRv1 which is what broke here (#3-#5).

HSRv0 is (was) not affected.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
borkmann pushed a commit that referenced this pull request Oct 6, 2023
Fix an error detected by memory sanitizer:
```
==4033==WARNING: MemorySanitizer: use-of-uninitialized-value
    #0 0x55fb0fbedfc7 in read_alias_info tools/perf/util/pmu.c:457:6
    #1 0x55fb0fbea339 in check_info_data tools/perf/util/pmu.c:1434:2
    #2 0x55fb0fbea339 in perf_pmu__check_alias tools/perf/util/pmu.c:1504:9
    #3 0x55fb0fbdca85 in parse_events_add_pmu tools/perf/util/parse-events.c:1429:32
    #4 0x55fb0f965230 in parse_events_parse tools/perf/util/parse-events.y:299:6
    #5 0x55fb0fbdf6b2 in parse_events__scanner tools/perf/util/parse-events.c:1822:8
    #6 0x55fb0fbdf8c1 in __parse_events tools/perf/util/parse-events.c:2094:8
    #7 0x55fb0fa8ffa9 in parse_events tools/perf/util/parse-events.h:41:9
    #8 0x55fb0fa8ffa9 in test_event tools/perf/tests/parse-events.c:2393:8
    #9 0x55fb0fa8f458 in test__pmu_events tools/perf/tests/parse-events.c:2551:15
    #10 0x55fb0fa6d93f in run_test tools/perf/tests/builtin-test.c:242:9
    #11 0x55fb0fa6d93f in test_and_print tools/perf/tests/builtin-test.c:271:8
    #12 0x55fb0fa6d082 in __cmd_test tools/perf/tests/builtin-test.c:442:5
    torvalds#13 0x55fb0fa6d082 in cmd_test tools/perf/tests/builtin-test.c:564:9
    torvalds#14 0x55fb0f942720 in run_builtin tools/perf/perf.c:322:11
    torvalds#15 0x55fb0f942486 in handle_internal_command tools/perf/perf.c:375:8
    torvalds#16 0x55fb0f941dab in run_argv tools/perf/perf.c:419:2
    torvalds#17 0x55fb0f941dab in main tools/perf/perf.c:535:3
```

Fixes: 7b723db ("perf pmu: Be lazy about loading event info files from sysfs")
Signed-off-by: Ian Rogers <irogers@google.com>
Cc: James Clark <james.clark@arm.com>
Cc: Kan Liang <kan.liang@linux.intel.com>
Link: https://lore.kernel.org/r/20230914022425.1489035-1-irogers@google.com
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
borkmann pushed a commit that referenced this pull request Oct 6, 2023
The following call trace shows a deadlock issue due to recursive locking of
mutex "device_mutex". First lock acquire is in target_for_each_device() and
second in target_free_device().

 PID: 148266   TASK: ffff8be21ffb5d00  CPU: 10   COMMAND: "iscsi_ttx"
  #0 [ffffa2bfc9ec3b18] __schedule at ffffffffa8060e7f
  #1 [ffffa2bfc9ec3ba0] schedule at ffffffffa8061224
  #2 [ffffa2bfc9ec3bb8] schedule_preempt_disabled at ffffffffa80615ee
  #3 [ffffa2bfc9ec3bc8] __mutex_lock at ffffffffa8062fd7
  #4 [ffffa2bfc9ec3c40] __mutex_lock_slowpath at ffffffffa80631d3
  #5 [ffffa2bfc9ec3c50] mutex_lock at ffffffffa806320c
  #6 [ffffa2bfc9ec3c68] target_free_device at ffffffffc0935998 [target_core_mod]
  #7 [ffffa2bfc9ec3c90] target_core_dev_release at ffffffffc092f975 [target_core_mod]
  #8 [ffffa2bfc9ec3ca0] config_item_put at ffffffffa79d250f
  #9 [ffffa2bfc9ec3cd0] config_item_put at ffffffffa79d2583
 #10 [ffffa2bfc9ec3ce0] target_devices_idr_iter at ffffffffc0933f3a [target_core_mod]
 #11 [ffffa2bfc9ec3d00] idr_for_each at ffffffffa803f6fc
 #12 [ffffa2bfc9ec3d60] target_for_each_device at ffffffffc0935670 [target_core_mod]
 torvalds#13 [ffffa2bfc9ec3d98] transport_deregister_session at ffffffffc0946408 [target_core_mod]
 torvalds#14 [ffffa2bfc9ec3dc8] iscsit_close_session at ffffffffc09a44a6 [iscsi_target_mod]
 torvalds#15 [ffffa2bfc9ec3df0] iscsit_close_connection at ffffffffc09a4a88 [iscsi_target_mod]
 torvalds#16 [ffffa2bfc9ec3df8] finish_task_switch at ffffffffa76e5d07
 torvalds#17 [ffffa2bfc9ec3e78] iscsit_take_action_for_connection_exit at ffffffffc0991c23 [iscsi_target_mod]
 torvalds#18 [ffffa2bfc9ec3ea0] iscsi_target_tx_thread at ffffffffc09a403b [iscsi_target_mod]
 torvalds#19 [ffffa2bfc9ec3f08] kthread at ffffffffa76d8080
 torvalds#20 [ffffa2bfc9ec3f50] ret_from_fork at ffffffffa8200364

Fixes: 36d4cb4 ("scsi: target: Avoid that EXTENDED COPY commands trigger lock inversion")
Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
Link: https://lore.kernel.org/r/20230918225848.66463-1-junxiao.bi@oracle.com
Reviewed-by: Mike Christie <michael.christie@oracle.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
borkmann pushed a commit that referenced this pull request Oct 17, 2023
Fix the deadlock by refactoring the MR cache cleanup flow to flush the
workqueue without holding the rb_lock.
This adds a race between cache cleanup and creation of new entries which
we solve by denied creation of new entries after cache cleanup started.

Lockdep:
WARNING: possible circular locking dependency detected
 [ 2785.326074 ] 6.2.0-rc6_for_upstream_debug_2023_01_31_14_02 #1 Not tainted
 [ 2785.339778 ] ------------------------------------------------------
 [ 2785.340848 ] devlink/53872 is trying to acquire lock:
 [ 2785.341701 ] ffff888124f8c0c8 ((work_completion)(&(&ent->dwork)->work)){+.+.}-{0:0}, at: __flush_work+0xc8/0x900
 [ 2785.343403 ]
 [ 2785.343403 ] but task is already holding lock:
 [ 2785.344464 ] ffff88817e8f1260 (&dev->cache.rb_lock){+.+.}-{3:3}, at: mlx5_mkey_cache_cleanup+0x77/0x250 [mlx5_ib]
 [ 2785.346273 ]
 [ 2785.346273 ] which lock already depends on the new lock.
 [ 2785.346273 ]
 [ 2785.347720 ]
 [ 2785.347720 ] the existing dependency chain (in reverse order) is:
 [ 2785.349003 ]
 [ 2785.349003 ] -> #1 (&dev->cache.rb_lock){+.+.}-{3:3}:
 [ 2785.350160 ]        __mutex_lock+0x14c/0x15c0
 [ 2785.350962 ]        delayed_cache_work_func+0x2d1/0x610 [mlx5_ib]
 [ 2785.352044 ]        process_one_work+0x7c2/0x1310
 [ 2785.352879 ]        worker_thread+0x59d/0xec0
 [ 2785.353636 ]        kthread+0x28f/0x330
 [ 2785.354370 ]        ret_from_fork+0x1f/0x30
 [ 2785.355135 ]
 [ 2785.355135 ] -> #0 ((work_completion)(&(&ent->dwork)->work)){+.+.}-{0:0}:
 [ 2785.356515 ]        __lock_acquire+0x2d8a/0x5fe0
 [ 2785.357349 ]        lock_acquire+0x1c1/0x540
 [ 2785.358121 ]        __flush_work+0xe8/0x900
 [ 2785.358852 ]        __cancel_work_timer+0x2c7/0x3f0
 [ 2785.359711 ]        mlx5_mkey_cache_cleanup+0xfb/0x250 [mlx5_ib]
 [ 2785.360781 ]        mlx5_ib_stage_pre_ib_reg_umr_cleanup+0x16/0x30 [mlx5_ib]
 [ 2785.361969 ]        __mlx5_ib_remove+0x68/0x120 [mlx5_ib]
 [ 2785.362960 ]        mlx5r_remove+0x63/0x80 [mlx5_ib]
 [ 2785.363870 ]        auxiliary_bus_remove+0x52/0x70
 [ 2785.364715 ]        device_release_driver_internal+0x3c1/0x600
 [ 2785.365695 ]        bus_remove_device+0x2a5/0x560
 [ 2785.366525 ]        device_del+0x492/0xb80
 [ 2785.367276 ]        mlx5_detach_device+0x1a9/0x360 [mlx5_core]
 [ 2785.368615 ]        mlx5_unload_one_devl_locked+0x5a/0x110 [mlx5_core]
 [ 2785.369934 ]        mlx5_devlink_reload_down+0x292/0x580 [mlx5_core]
 [ 2785.371292 ]        devlink_reload+0x439/0x590
 [ 2785.372075 ]        devlink_nl_cmd_reload+0xaef/0xff0
 [ 2785.372973 ]        genl_family_rcv_msg_doit.isra.0+0x1bd/0x290
 [ 2785.374011 ]        genl_rcv_msg+0x3ca/0x6c0
 [ 2785.374798 ]        netlink_rcv_skb+0x12c/0x360
 [ 2785.375612 ]        genl_rcv+0x24/0x40
 [ 2785.376295 ]        netlink_unicast+0x438/0x710
 [ 2785.377121 ]        netlink_sendmsg+0x7a1/0xca0
 [ 2785.377926 ]        sock_sendmsg+0xc5/0x190
 [ 2785.378668 ]        __sys_sendto+0x1bc/0x290
 [ 2785.379440 ]        __x64_sys_sendto+0xdc/0x1b0
 [ 2785.380255 ]        do_syscall_64+0x3d/0x90
 [ 2785.381031 ]        entry_SYSCALL_64_after_hwframe+0x46/0xb0
 [ 2785.381967 ]
 [ 2785.381967 ] other info that might help us debug this:
 [ 2785.381967 ]
 [ 2785.383448 ]  Possible unsafe locking scenario:
 [ 2785.383448 ]
 [ 2785.384544 ]        CPU0                    CPU1
 [ 2785.385383 ]        ----                    ----
 [ 2785.386193 ]   lock(&dev->cache.rb_lock);
 [ 2785.386940 ]				lock((work_completion)(&(&ent->dwork)->work));
 [ 2785.388327 ]				lock(&dev->cache.rb_lock);
 [ 2785.389425 ]   lock((work_completion)(&(&ent->dwork)->work));
 [ 2785.390414 ]
 [ 2785.390414 ]  *** DEADLOCK ***
 [ 2785.390414 ]
 [ 2785.391579 ] 6 locks held by devlink/53872:
 [ 2785.392341 ]  #0: ffffffff84c17a50 (cb_lock){++++}-{3:3}, at: genl_rcv+0x15/0x40
 [ 2785.393630 ]  #1: ffff888142280218 (&devlink->lock_key){+.+.}-{3:3}, at: devlink_get_from_attrs_lock+0x12d/0x2d0
 [ 2785.395324 ]  #2: ffff8881422d3c38 (&dev->lock_key){+.+.}-{3:3}, at: mlx5_unload_one_devl_locked+0x4a/0x110 [mlx5_core]
 [ 2785.397322 ]  #3: ffffffffa0e59068 (mlx5_intf_mutex){+.+.}-{3:3}, at: mlx5_detach_device+0x60/0x360 [mlx5_core]
 [ 2785.399231 ]  #4: ffff88810e3cb0e8 (&dev->mutex){....}-{3:3}, at: device_release_driver_internal+0x8d/0x600
 [ 2785.400864 ]  #5: ffff88817e8f1260 (&dev->cache.rb_lock){+.+.}-{3:3}, at: mlx5_mkey_cache_cleanup+0x77/0x250 [mlx5_ib]

Fixes: b958451 ("RDMA/mlx5: Change the cache structure to an RB-tree")
Signed-off-by: Shay Drory <shayd@nvidia.com>
Signed-off-by: Michael Guralnik <michaelgur@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
borkmann pushed a commit that referenced this pull request Oct 17, 2023
Petr Machata says:

====================
mlxsw: Control the order of blocks in ACL region

Amit Cohen writes:

For 12 key blocks in the A-TCAM, rules are split into two records, which
constitute two lookups. The two records are linked using a
"large entry key ID".

Due to a Spectrum-4 hardware issue, KVD entries that correspond to key
blocks 0 to 5 of 12 key blocks will be placed in the same KVD pipe if they
only differ in their "large entry key ID", as it is ignored. This results
in a reduced scale, we can insert less than 20k filters and get an error:

    $ tc -b flower.batch
    RTNETLINK answers: Input/output error
    We have an error talking to the kernel

To reduce the probability of this issue, we can place key blocks with
high entropy in blocks 0 to 5. The idea is to place blocks that are often
changed in blocks 0 to 5, for example, key blocks that match on IPv4
addresses or the LSBs of IPv6 addresses. Such placement will reduce the
probability of these blocks to be same.

Mark several blocks with 'high_entropy' flag and place them in blocks 0
to 5. Note that the list of the blocks is just a suggestion, I will verify
it with architects.

Currently, there is a one loop that chooses which blocks should be used
for a given list of elements and fills the blocks - when a block is
chosen, it fills it in the region. To be able to control the order of
the blocks, separate between searching blocks and filling them. Several
pre-changes are required.

Patch set overview:
Patch #1 marks several blocks with 'high_entropy' flag.
Patches #2-#4 prepare the code for filling blocks at the end of the search.
Patch #5 changes the loop to just choose the blocks and fill the blocks at
the end.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
borkmann pushed a commit that referenced this pull request Oct 17, 2023
Amit Cohen says:

====================
Extend VXLAN driver to support FDB flushing

The merge commit 9271686 ("Merge branch 'br-flush-filtering'") added
support for FDB flushing in bridge driver. Extend VXLAN driver to support
FDB flushing also. Add support for filtering by fields which are relevant
for VXLAN FDBs:
* Source VNI
* Nexthop ID
* 'router' flag
* Destination VNI
* Destination Port
* Destination IP

Without this set, flush for VXLAN device fails:
$ bridge fdb flush dev vx10
RTNETLINK answers: Operation not supported

With this set, such flush works with the relevant arguments, for example:
$ bridge fdb flush dev vx10 vni 5000 dst 193.2.2.1
< flush all vx10 entries with VNI 5000 and destination IP 193.2.2.1>

Some preparations are required, handle them before adding flushing support
in VXLAN driver. See more details in commit messages.

Patch set overview:
Patch #1 prepares flush policy to be used by VXLAN driver
Patches #2-#3 are preparations in VXLAN driver
Patch #4 adds an initial support for flushing in VXLAN driver
Patches #5-#9 add support for filtering by several attributes
Patch #10 adds a test for FDB flush with VXLAN
Patch #11 extends the test to check FDB flush with bridge
====================

Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
borkmann added a commit that referenced this pull request Oct 17, 2023
Andrii Nakryiko says:

====================
This patch set fixes ambiguity in BPF verifier log output of SCALAR register
in the parts that emit umin/umax, smin/smax, etc ranges. See patch #4 for
details.

Also, patch #5 fixes an issue with verifier log missing instruction context
(state) output for conditionals that trigger precision marking. See details in
the patch.

First two patches are just improvements to two selftests that are very flaky
locally when run in parallel mode.

Patch #3 changes 'align' selftest to be less strict about exact verifier log
output (which patch #4 changes, breaking lots of align tests as written). Now
test does more of a register substate checks, mostly around expected var_off()
values. This 'align' selftests is one of the more brittle ones and requires
constant adjustment when verifier log output changes, without really catching
any new issues. So hopefully these changes can minimize future support efforts
for this specific set of tests.
====================

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
borkmann pushed a commit that referenced this pull request Oct 20, 2023
Chuyi Zhou says:

====================
Add Open-coded task, css_task and css iters

This is version 6 of task, css_task and css iters support.

--- Changelog ---

v5 -> v6:

Patch #3:
 * In bpf_iter_task_next, return pos rather than goto out. (Andrii)
Patch #2, #3, #4:
 * Add the missing __diag_ignore_all to avoid kernel build warning
Patch #5, #6, #7:
 * Add Andrii's ack

Patch #8:
 * In BPF prog iter_css_task_for_each, return -EPERM rather than 0, and
   ensure stack_mprotect() in iters.c not success. If not, it would cause
   the subsequent 'test_lsm' fail, since the 'is_stack' check in
   test_int_hook(lsm.c) would not be guaranteed.
   (https://github.com/kernel-patches/bpf/actions/runs/6489662214/job/17624665086?pr=5790)

v4 -> v5:https://lore.kernel.org/lkml/20231007124522.34834-1-zhouchuyi@bytedance.com/

Patch 3~4:
 * Relax the BUILD_BUG_ON check in bpf_iter_task_new and bpf_iter_css_new to avoid
   netdev/build_32bit CI error.
   (https://netdev.bots.linux.dev/static/nipa/790929/13412333/build_32bit/stderr)
Patch 8:
 * Initialize skel pointer to fix the LLVM-16 build CI error
   (https://github.com/kernel-patches/bpf/actions/runs/6462875618/job/17545170863)

v3 -> v4:https://lore.kernel.org/all/20230925105552.817513-1-zhouchuyi@bytedance.com/

* Address all the comments from Andrii in patch-3 ~ patch-6
* Collect Tejun's ack
* Add a extra patch to rename bpf_iter_task.c to bpf_iter_tasks.c
* Seperate three BPF program files for selftests (iters_task.c iters_css_task.c iters_css.c)

v2 -> v3:https://lore.kernel.org/lkml/20230912070149.969939-1-zhouchuyi@bytedance.com/

Patch 1 (cgroup: Prepare for using css_task_iter_*() in BPF)
  * Add tj's ack and Alexei's suggest-by.
Patch 2 (bpf: Introduce css_task open-coded iterator kfuncs)
  * Use bpf_mem_alloc/bpf_mem_free rather than kzalloc()
  * Add KF_TRUSTED_ARGS for bpf_iter_css_task_new (Alexei)
  * Move bpf_iter_css_task's definition from uapi/linux/bpf.h to
    kernel/bpf/task_iter.c and we can use it from vmlinux.h
  * Move bpf_iter_css_task_XXX's declaration from bpf_helpers.h to
    bpf_experimental.h
Patch 3 (Introduce task open coded iterator kfuncs)
  * Change th API design keep consistent with SEC("iter/task"), support
    iterating all threads(BPF_TASK_ITERATE_ALL) and threads of a
    specific task (BPF_TASK_ITERATE_THREAD).(Andrii)
  * Move bpf_iter_task's definition from uapi/linux/bpf.h to
    kernel/bpf/task_iter.c and we can use it from vmlinux.h
  * Move bpf_iter_task_XXX's declaration from bpf_helpers.h to
    bpf_experimental.h
Patch 4 (Introduce css open-coded iterator kfuncs)
  * Change th API design keep consistent with cgroup_iters, reuse
    BPF_CGROUP_ITER_DESCENDANTS_PRE/BPF_CGROUP_ITER_DESCENDANTS_POST
    /BPF_CGROUP_ITER_ANCESTORS_UP(Andrii)
  * Add KF_TRUSTED_ARGS for bpf_iter_css_new
  * Move bpf_iter_css's definition from uapi/linux/bpf.h to
    kernel/bpf/task_iter.c and we can use it from vmlinux.h
  * Move bpf_iter_css_XXX's declaration from bpf_helpers.h to
    bpf_experimental.h
Patch 5 (teach the verifier to enforce css_iter and task_iter in RCU CS)
  * Add KF flag KF_RCU_PROTECTED to maintain kfuncs which need RCU CS.(Andrii)
  * Consider STACK_ITER when using bpf_for_each_spilled_reg.
Patch 6 (Let bpf_iter_task_new accept null task ptr)
  * Add this extra patch to let bpf_iter_task_new accept a 'nullable'
  * task pointer(Andrii)
Patch 7 (selftests/bpf: Add tests for open-coded task and css iter)
  * Add failure testcase(Alexei)

Changes from v1(https://lore.kernel.org/lkml/20230827072057.1591929-1-zhouchuyi@bytedance.com/):
- Add a pre-patch to make some preparations before supporting css_task
  iters.(Alexei)
- Add an allowlist for css_task iters(Alexei)
- Let bpf progs do explicit bpf_rcu_read_lock() when using process
  iters and css_descendant iters.(Alexei)
---------------------

In some BPF usage scenarios, it will be useful to iterate the process and
css directly in the BPF program. One of the expected scenarios is
customizable OOM victim selection via BPF[1].

Inspired by Dave's task_vma iter[2], this patchset adds three types of
open-coded iterator kfuncs:

1. bpf_task_iters. It can be used to
1) iterate all process in the system, like for_each_forcess() in kernel.
2) iterate all threads in the system.
3) iterate all threads of a specific task

2. bpf_css_iters. It works like css_task_iter_{start, next, end} and would
be used to iterating tasks/threads under a css.

3. css_iters. It works like css_next_descendant_{pre, post} to iterating all
descendant css.

BPF programs can use these kfuncs directly or through bpf_for_each macro.

link[1]: https://lore.kernel.org/lkml/20230810081319.65668-1-zhouchuyi@bytedance.com/
link[2]: https://lore.kernel.org/all/20230810183513.684836-1-davemarchevsky@fb.com/
====================

Link: https://lore.kernel.org/r/20231018061746.111364-1-zhouchuyi@bytedance.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
borkmann pushed a commit that referenced this pull request Oct 24, 2023
Eduard Zingerman says:

====================
exact states comparison for iterator convergence checks

Iterator convergence logic in is_state_visited() uses state_equals()
for states with branches counter > 0 to check if iterator based loop
converges. This is not fully correct because state_equals() relies on
presence of read and precision marks on registers. These marks are not
guaranteed to be finalized while state has branches.
Commit message for patch #3 describes a program that exhibits such
behavior.

This patch-set aims to fix iterator convergence logic by adding notion
of exact states comparison. Exact comparison does not rely on presence
of read or precision marks and thus is more strict.
As explained in commit message for patch #3 exact comparisons require
addition of speculative register bounds widening. The end result for
BPF verifier users could be summarized as follows:

(!) After this update verifier would reject programs that conjure an
    imprecise value on the first loop iteration and use it as precise
    on the second (for iterator based loops).

I urge people to at least skim over the commit message for patch #3.

Patches are organized as follows:
- patches #1,2: moving/extracting utility functions;
- patch #3: introduces exact mode for states comparison and adds
  widening heuristic;
- patch #4: adds test-cases that demonstrate why the series is
  necessary;
- patch #5: extends patch #3 with a notion of state loop entries,
  these entries have to be tracked to correctly identify that
  different verifier states belong to the same states loop;
- patch #6: adds a test-case that demonstrates a program
  which requires loop entry tracking for correct verification;
- patch #7: just adds a few debug prints.

The following actions are planned as a followup for this patch-set:
- implementation has to be adapted for callbacks handling logic as a
  part of a fix for [1];
- it is necessary to explore ways to improve widening heuristic to
  handle iters_task_vma test w/o need to insert barrier_var() calls;
- explored states eviction logic on cache miss has to be extended
  to either:
  - allow eviction of checkpoint states -or-
  - be sped up in case if there are many active checkpoints associated
    with the same instruction.

The patch-set is a followup for mailing list discussion [1].

Changelog:
- V2 [3] -> V3:
  - correct check for stack spills in widen_imprecise_scalars(),
    added test case progs/iters.c:widen_spill to check the behavior
    (suggested by Andrii);
  - allow eviction of checkpoint states in is_state_visited() to avoid
    pathological verifier performance when iterator based loop does not
    converge (discussion with Alexei).
- V1 [2] -> V2, applied changes suggested by Alexei offlist:
  - __explored_state() function removed;
  - same_callsites() function is now used in clean_live_states();
  - patches #1,2 are added as preparatory code movement;
  - in process_iter_next_call() a safeguard is added to verify that
    cur_st->parent exists and has expected insn index / call sites.

[1] https://lore.kernel.org/bpf/97a90da09404c65c8e810cf83c94ac703705dc0e.camel@gmail.com/
[2] https://lore.kernel.org/bpf/20231021005939.1041-1-eddyz87@gmail.com/
[3] https://lore.kernel.org/bpf/20231022010812.9201-1-eddyz87@gmail.com/
====================

Link: https://lore.kernel.org/r/20231024000917.12153-1-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
borkmann pushed a commit that referenced this pull request Oct 27, 2023
Commit b035f5a ("mm: slab: reduce the kmalloc() minimum alignment
if DMA bouncing possible") allows architectures with non-coherent DMA to
define a small ARCH_KMALLOC_MINALIGN (e.g. sizeof(unsigned long long))
and this has been enabled on arm64. With KASAN_HW_TAGS enabled, however,
ARCH_SLAB_MINALIGN becomes 16 on arm64 (arch_slab_minalign() dynamically
selects it since commit d949a81 ("mm: make minimum slab alignment a
runtime property")). This can lead to a situation where kmalloc-8 caches
are attempted to be created with a kmem_caches.size aligned to 16. When
the cache is mergeable, it can lead to kernel warnings like:

sysfs: cannot create duplicate filename '/kernel/slab/:d-0000016'
CPU: 0 PID: 1 Comm: swapper/0 Not tainted 6.6.0-rc1-00001-gda98843cd306-dirty #5
Hardware name: QEMU QEMU Virtual Machine, BIOS 0.0.0 02/06/2015
Call trace:
 dump_backtrace+0x90/0xe8
 show_stack+0x18/0x24
 dump_stack_lvl+0x48/0x60
 dump_stack+0x18/0x24
 sysfs_warn_dup+0x64/0x80
 sysfs_create_dir_ns+0xe8/0x108
 kobject_add_internal+0x98/0x264
 kobject_init_and_add+0x8c/0xd8
 sysfs_slab_add+0x12c/0x248
 slab_sysfs_init+0x98/0x14c
 do_one_initcall+0x6c/0x1b0
 kernel_init_freeable+0x1c0/0x288
 kernel_init+0x24/0x1e0
 ret_from_fork+0x10/0x20
kobject: kobject_add_internal failed for :d-0000016 with -EEXIST, don't try to register things with the same name in the same directory.
SLUB: Unable to add boot slab dma-kmalloc-8 to sysfs

Limit the __kmalloc_minalign() return value (used to create the
kmalloc-* caches) to arch_slab_minalign() so that kmalloc-8 caches are
skipped when KASAN_HW_TAGS is enabled (both config and runtime).

Reported-by: Mark Rutland <mark.rutland@arm.com>
Fixes: b035f5a ("mm: slab: reduce the kmalloc() minimum alignment if DMA bouncing possible")
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: Peter Collingbourne <pcc@google.com>
Cc: stable@vger.kernel.org # 6.5.x
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
borkmann pushed a commit that referenced this pull request Oct 27, 2023
Jiri Pirko says:

====================
devlink: fix a deadlock when taking devlink instance lock while holding RTNL lock

devlink_port_fill() may be called sometimes with RTNL lock held.
When putting the nested port function devlink instance attrs,
current code takes nested devlink instance lock. In that case lock
ordering is wrong.

Patch #1 is a dependency of patch #2.
Patch #2 converts the peernet2id_alloc() call to rely in RCU so it could
         called without devlink instance lock.
Patch #3 takes device reference for devlink instance making sure that
         device does not disappear before devlink_release() is called.
Patch #4 benefits from the preparations done in patches #2 and #3 and
         removes the problematic nested devlink lock aquisition.
Patched #5-#7 improve documentation to reflect this issue so it is
              avoided in the future.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
borkmann pushed a commit that referenced this pull request Oct 27, 2023
Petr Machata says:

====================
mlxsw: Move allocation of LAG table to the driver

PGT is an in-HW table that maps addresses to sets of ports. Then when some
HW process needs a set of ports as an argument, instead of embedding the
actual set in the dynamic configuration, what gets configured is the
address referencing the set. The HW then works with the appropriate PGT
entry.

Within the PGT is placed a LAG table. That is a contiguous block of PGT
memory where each entry describes which ports are members of the
corresponding LAG port.

The PGT is split to two parts: one managed by the FW, and one managed by
the driver. Historically, the FW part included also the LAG table, referred
to as FW LAG mode. Giving the responsibility for placement of the LAG table
to the driver, referred to as SW LAG mode, makes the whole system more
flexible. The FW currently supports both FW and SW LAG modes. To shed
complexity, the FW should in the future only support SW LAG mode.

Hence this patchset, where support for placement of LAG is added to mlxsw.

There are FW versions out there that do not support SW LAG mode, and on
Spectrum-1 in particular, there is no plan to support it at all. mlxsw will
therefore have to support both modes of operation.

Another aspect is that at least on Spectrum-1, there are FW versions out
there that claim to support driver-placed LAG table, but then reject or
ignore configurations enabling the same. The driver thus has to have a say
in whether an attempt to configure SW LAG mode should even be done.

The feature is therefore expressed in terms of "does the driver prefer SW
LAG mode?", and "what LAG mode the PCI module managed to configure the FW
with". This is unlike current flood mode configuration, where the driver
can give a strict value, and that's what gets configured. But it gives a
chance to the driver to determine whether LAG mode should be enabled at
all.

The "does the driver prefer SW LAG mode?" bit is expressed as a boolean
lag_mode_prefer_sw. The reason for this is largely another feature that
will be introduced in a follow-up patchset: support for CFF flood mode. The
driver currently requires that the FW be configured with what is called
controlled flood mode. But on capable systems, CFF would be preferred. So
there are two values in flight: the preferred flood mode, and the fallback.
This could be expressed with an array of flood modes ordered by preference,
but that looks like an overkill in comparison. This flag/value model is
then reused for LAG mode as well, except the fallback value is absent and
implied to be FW, because there are no other values to choose from.

The patchset progresses as follows:

- Patches #1 to #5 adjust reg.h and cmd.h with new register fields,
  constants and remarks.

- Patches #6 and #7 add the ability to request SW LAG mode and to query the
  LAG mode that was actually negotiated. This is where the abovementioned
  lag_mode_prefer_sw flag is added.

- Patches #7 to #9 generalize PGT allocations to make it possible to
  allocate the LAG table, which is done in patch #10.

- In patch #11, toggle lag_mode_prefer_sw on Spectrum-2 and above, which
  makes the newly-added code live.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants