Skip to content

bpf: switch to memcg-based memory accounting #339

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 31 commits into from

Conversation

kernel-patches-bot
Copy link

Pull request for series with
subject: bpf: switch to memcg-based memory accounting
version: 5
url: https://patchwork.kernel.org/project/netdevbpf/list/?series=383159

@kernel-patches-bot
Copy link
Author

Master branch: c365387
series: https://patchwork.kernel.org/project/netdevbpf/list/?series=383159
version: 5

Pull request is NOT updated. Failed to apply https://patchwork.kernel.org/project/netdevbpf/list/?series=383159
error message:

Cmd('git') failed due to: exit code(128)
  cmdline: git am -3
  stdout: 'Applying: bpf: memcg-based memory accounting for bpf progs
Applying: bpf: prepare for memcg-based memory accounting for bpf maps
Applying: bpf: memcg-based memory accounting for bpf maps
Applying: bpf: refine memcg-based memory accounting for arraymap maps
Applying: bpf: refine memcg-based memory accounting for cpumap maps
Applying: bpf: memcg-based memory accounting for cgroup storage maps
Applying: bpf: eliminate rlimit-based memory accounting for bpf local storage maps
Patch failed at 0007 bpf: eliminate rlimit-based memory accounting for bpf local storage maps
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".'
  stderr: 'error: sha1 information is lacking or useless (kernel/bpf/bpf_local_storage.c).
error: could not build fake ancestor
hint: Use 'git am --show-current-patch' to see the failed patch'

conflict:


@kernel-patches-bot
Copy link
Author

Master branch: c365387
series: https://patchwork.kernel.org/project/netdevbpf/list/?series=383159
version: 5

@kernel-patches-bot
Copy link
Author

Master branch: 0a58a65
series: https://patchwork.kernel.org/project/netdevbpf/list/?series=383159
version: 5

@kernel-patches-bot
Copy link
Author

Master branch: 904709f
series: https://patchwork.kernel.org/project/netdevbpf/list/?series=383159
version: 5

kernel-patches-bot and others added 18 commits November 13, 2020 23:48
Include memory used by bpf programs into the memcg-based accounting.
This includes the memory used by programs itself, auxiliary data,
statistics and bpf line info. A memory cgroup containing the
process which loads the program is getting charged.

Signed-off-by: Roman Gushchin <guro@fb.com>
Acked-by: Song Liu <songliubraving@fb.com>
In the absolute majority of cases if a process is making a kernel
allocation, it's memory cgroup is getting charged.

Bpf maps can be updated from an interrupt context and in such
case there is no process which can be charged. It makes the memory
accounting of bpf maps non-trivial.

Fortunately, after commits 4127c65 ("mm: kmem: enable kernel
memcg accounting from interrupt contexts") and b87d8ce
("mm, memcg: rework remote charging API to support nesting")
it's finally possible.

To do it, a pointer to the memory cgroup of the process which created
the map is saved, and this cgroup is getting charged for all
allocations made from an interrupt context.

Allocations made from a process context will be accounted in a usual way.

Signed-off-by: Roman Gushchin <guro@fb.com>
Acked-by: Song Liu <songliubraving@fb.com>
This patch enables memcg-based memory accounting for memory allocated
by __bpf_map_area_alloc(), which is used by many types of bpf maps for
large memory allocations.

Following patches in the series will refine the accounting for
some of the map types.

Signed-off-by: Roman Gushchin <guro@fb.com>
Acked-by: Song Liu <songliubraving@fb.com>
Include percpu arrays and auxiliary data into the memcg-based memory
accounting.

Signed-off-by: Roman Gushchin <guro@fb.com>
Acked-by: Song Liu <songliubraving@fb.com>
Include metadata and percpu data into the memcg-based memory accounting.

Signed-off-by: Roman Gushchin <guro@fb.com>
Acked-by: Song Liu <songliubraving@fb.com>
Account memory used by cgroup storage maps including metadata
structures.

Account the percpu memory for the percpu flavor of cgroup storage.

Signed-off-by: Roman Gushchin <guro@fb.com>
Acked-by: Song Liu <songliubraving@fb.com>
Include map metadata and the node size (struct bpf_dtab_netdev)
into the accounting.

Signed-off-by: Roman Gushchin <guro@fb.com>
Acked-by: Song Liu <songliubraving@fb.com>
Include percpu objects and the size of map metadata into the
accounting.

Signed-off-by: Roman Gushchin <guro@fb.com>
Acked-by: Song Liu <songliubraving@fb.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Include lpm trie and lpm trie node objects into the memcg-based memory
accounting.

Signed-off-by: Roman Gushchin <guro@fb.com>
Acked-by: Song Liu <songliubraving@fb.com>
Enable the memcg-based memory accounting for the memory used by
the bpf ringbuffer.

Signed-off-by: Roman Gushchin <guro@fb.com>
Acked-by: Song Liu <songliubraving@fb.com>
Account memory used by bpf local storage maps:
per-socket and per-inode storages.

Signed-off-by: Roman Gushchin <guro@fb.com>
Acked-by: Song Liu <songliubraving@fb.com>
Include internal metadata into the memcg-based memory accounting.
Also include the memory allocated on updating an element.

Signed-off-by: Roman Gushchin <guro@fb.com>
Acked-by: Song Liu <songliubraving@fb.com>
Extend xskmap memory accounting to include the memory taken by
the xsk_map_node structure.

Signed-off-by: Roman Gushchin <guro@fb.com>
Acked-by: Song Liu <songliubraving@fb.com>
Do not use rlimit-based memory accounting for arraymap maps.
It has been replaced with the memcg-based memory accounting.

Signed-off-by: Roman Gushchin <guro@fb.com>
Acked-by: Song Liu <songliubraving@fb.com>
Do not use rlimit-based memory accounting for bpf_struct_ops maps.
It has been replaced with the memcg-based memory accounting.

Signed-off-by: Roman Gushchin <guro@fb.com>
Acked-by: Song Liu <songliubraving@fb.com>
Do not use rlimit-based memory accounting for cpumap maps.
It has been replaced with the memcg-based memory accounting.

Signed-off-by: Roman Gushchin <guro@fb.com>
Acked-by: Song Liu <songliubraving@fb.com>
Do not use rlimit-based memory accounting for cgroup storage maps.
It has been replaced with the memcg-based memory accounting.

Signed-off-by: Roman Gushchin <guro@fb.com>
Acked-by: Song Liu <songliubraving@fb.com>
Do not use rlimit-based memory accounting for devmap maps.
It has been replaced with the memcg-based memory accounting.

Signed-off-by: Roman Gushchin <guro@fb.com>
Acked-by: Song Liu <songliubraving@fb.com>
Do not use rlimit-based memory accounting for hashtab maps.
It has been replaced with the memcg-based memory accounting.

Signed-off-by: Roman Gushchin <guro@fb.com>
Acked-by: Song Liu <songliubraving@fb.com>
Do not use rlimit-based memory accounting for lpm_trie maps.
It has been replaced with the memcg-based memory accounting.

Signed-off-by: Roman Gushchin <guro@fb.com>
Acked-by: Song Liu <songliubraving@fb.com>
Do not use rlimit-based memory accounting for queue_stack maps.
It has been replaced with the memcg-based memory accounting.

Signed-off-by: Roman Gushchin <guro@fb.com>
Acked-by: Song Liu <songliubraving@fb.com>
Do not use rlimit-based memory accounting for reuseport_array maps.
It has been replaced with the memcg-based memory accounting.

Signed-off-by: Roman Gushchin <guro@fb.com>
Acked-by: Song Liu <songliubraving@fb.com>
Do not use rlimit-based memory accounting for bpf ringbuffer.
It has been replaced with the memcg-based memory accounting.

bpf_ringbuf_alloc() can't return anything except ERR_PTR(-ENOMEM)
and a valid pointer, so to simplify the code make it return NULL
in the first case. This allows to drop a couple of lines in
ringbuf_map_alloc() and also makes it look similar to other memory
allocating function like kmalloc().

Signed-off-by: Roman Gushchin <guro@fb.com>
Acked-by: Song Liu <songliubraving@fb.com>
Acked-by: Andrii Nakryiko <andriin@fb.com>
…h maps

Do not use rlimit-based memory accounting for sockmap and sockhash maps.
It has been replaced with the memcg-based memory accounting.

Signed-off-by: Roman Gushchin <guro@fb.com>
Acked-by: Song Liu <songliubraving@fb.com>
Do not use rlimit-based memory accounting for stackmap maps.
It has been replaced with the memcg-based memory accounting.

Signed-off-by: Roman Gushchin <guro@fb.com>
Acked-by: Song Liu <songliubraving@fb.com>
Do not use rlimit-based memory accounting for xskmap maps.
It has been replaced with the memcg-based memory accounting.

Signed-off-by: Roman Gushchin <guro@fb.com>
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Roman Gushchin <guro@fb.com>
Do not use rlimit-based memory accounting for bpf local storage maps.
It has been replaced with the memcg-based memory accounting.

Signed-off-by: Roman Gushchin <guro@fb.com>
Acked-by: Song Liu <songliubraving@fb.com>
Remove rlimit-based accounting infrastructure code, which is not used
anymore.

Signed-off-by: Roman Gushchin <guro@fb.com>
Acked-by: Song Liu <songliubraving@fb.com>
Do not use rlimit-based memory accounting for bpf progs. It has been
replaced with memcg-based memory accounting.

Signed-off-by: Roman Gushchin <guro@fb.com>
Acked-by: Song Liu <songliubraving@fb.com>
Since bpf is not using rlimit memlock for the memory accounting
and control, do not change the limit in sample applications.

Signed-off-by: Roman Gushchin <guro@fb.com>
Acked-by: Song Liu <songliubraving@fb.com>
@kernel-patches-bot
Copy link
Author

Master branch: c14d61f
series: https://patchwork.kernel.org/project/netdevbpf/list/?series=383159
version: 5

@kernel-patches-bot
Copy link
Author

Master branch: 024cd2c
series: https://patchwork.kernel.org/project/netdevbpf/list/?series=383159
version: 5

Pull request is NOT updated. Failed to apply https://patchwork.kernel.org/project/netdevbpf/list/?series=383159
error message:

Cmd('git') failed due to: exit code(128)
  cmdline: git am -3
  stdout: 'Applying: bpf: memcg-based memory accounting for bpf progs
Applying: bpf: prepare for memcg-based memory accounting for bpf maps
Applying: bpf: memcg-based memory accounting for bpf maps
Applying: bpf: refine memcg-based memory accounting for arraymap maps
Applying: bpf: refine memcg-based memory accounting for cpumap maps
Applying: bpf: memcg-based memory accounting for cgroup storage maps
Applying: bpf: refine memcg-based memory accounting for devmap maps
Applying: bpf: refine memcg-based memory accounting for hashtab maps
Applying: bpf: memcg-based memory accounting for lpm_trie maps
Applying: bpf: memcg-based memory accounting for bpf ringbuffer
Applying: bpf: memcg-based memory accounting for bpf local storage maps
Applying: bpf: refine memcg-based memory accounting for sockmap and sockhash maps
Applying: bpf: refine memcg-based memory accounting for xskmap maps
Applying: bpf: eliminate rlimit-based memory accounting for arraymap maps
Applying: bpf: eliminate rlimit-based memory accounting for bpf_struct_ops maps
Applying: bpf: eliminate rlimit-based memory accounting for cpumap maps
Applying: bpf: eliminate rlimit-based memory accounting for cgroup storage maps
Applying: bpf: eliminate rlimit-based memory accounting for devmap maps
Applying: bpf: eliminate rlimit-based memory accounting for hashtab maps
Applying: bpf: eliminate rlimit-based memory accounting for lpm_trie maps
Applying: bpf: eliminate rlimit-based memory accounting for queue_stack_maps maps
Applying: bpf: eliminate rlimit-based memory accounting for reuseport_array maps
Applying: bpf: eliminate rlimit-based memory accounting for bpf ringbuffer
Applying: bpf: eliminate rlimit-based memory accounting for sockmap and sockhash maps
Applying: bpf: eliminate rlimit-based memory accounting for stackmap maps
Applying: bpf: eliminate rlimit-based memory accounting for xskmap maps
Applying: bpf: eliminate rlimit-based memory accounting for bpf local storage maps
Applying: bpf: eliminate rlimit-based memory accounting infra for bpf maps
Applying: bpf: eliminate rlimit-based memory accounting for bpf progs
Applying: bpf: samples: do not touch RLIMIT_MEMLOCK
Using index info to reconstruct a base tree...
M	samples/bpf/task_fd_query_user.c
M	samples/bpf/tracex2_user.c
M	samples/bpf/tracex3_user.c
M	samples/bpf/xdp_redirect_cpu_user.c
M	samples/bpf/xdp_rxq_info_user.c
Falling back to patching base and 3-way merge...
Auto-merging samples/bpf/xdp_rxq_info_user.c
CONFLICT (content): Merge conflict in samples/bpf/xdp_rxq_info_user.c
Auto-merging samples/bpf/xdp_redirect_cpu_user.c
CONFLICT (content): Merge conflict in samples/bpf/xdp_redirect_cpu_user.c
Auto-merging samples/bpf/tracex3_user.c
CONFLICT (content): Merge conflict in samples/bpf/tracex3_user.c
Auto-merging samples/bpf/tracex2_user.c
CONFLICT (content): Merge conflict in samples/bpf/tracex2_user.c
Auto-merging samples/bpf/task_fd_query_user.c
CONFLICT (content): Merge conflict in samples/bpf/task_fd_query_user.c
Patch failed at 0030 bpf: samples: do not touch RLIMIT_MEMLOCK
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".'
  stderr: 'error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch' to see the failed patch'

conflict:

diff --cc samples/bpf/task_fd_query_user.c
index b68bd2f8fdc9,0f2050ff54f9..000000000000
--- a/samples/bpf/task_fd_query_user.c
+++ b/samples/bpf/task_fd_query_user.c
@@@ -290,7 -290,6 +290,10 @@@ static int test_debug_fs_uprobe(char *b
  
  int main(int argc, char **argv)
  {
++<<<<<<< HEAD
 +	struct rlimit r = {RLIM_INFINITY, RLIM_INFINITY};
++=======
++>>>>>>> bpf: samples: do not touch RLIMIT_MEMLOCK
  	extern char __executable_start;
  	char filename[256], buf[256];
  	__u64 uprobe_file_offset;
diff --cc samples/bpf/tracex2_user.c
index 3d6eab711d23,1626d51dfffd..000000000000
--- a/samples/bpf/tracex2_user.c
+++ b/samples/bpf/tracex2_user.c
@@@ -116,7 -116,6 +116,10 @@@ static void int_exit(int sig
  
  int main(int ac, char **argv)
  {
++<<<<<<< HEAD
 +	struct rlimit r = {RLIM_INFINITY, RLIM_INFINITY};
++=======
++>>>>>>> bpf: samples: do not touch RLIMIT_MEMLOCK
  	long key, next_key, value;
  	struct bpf_link *links[2];
  	struct bpf_program *prog;
diff --cc samples/bpf/tracex3_user.c
index 83e0fecbb01a,33e16ba39f25..000000000000
--- a/samples/bpf/tracex3_user.c
+++ b/samples/bpf/tracex3_user.c
@@@ -107,7 -107,6 +107,10 @@@ static void print_hist(int fd
  
  int main(int ac, char **argv)
  {
++<<<<<<< HEAD
 +	struct rlimit r = {RLIM_INFINITY, RLIM_INFINITY};
++=======
++>>>>>>> bpf: samples: do not touch RLIMIT_MEMLOCK
  	struct bpf_link *links[2];
  	struct bpf_program *prog;
  	struct bpf_object *obj;
diff --cc samples/bpf/xdp_redirect_cpu_user.c
index f78cb18319aa,576411612523..000000000000
--- a/samples/bpf/xdp_redirect_cpu_user.c
+++ b/samples/bpf/xdp_redirect_cpu_user.c
@@@ -765,7 -765,6 +765,10 @@@ static int load_cpumap_prog(char *file_
  
  int main(int argc, char **argv)
  {
++<<<<<<< HEAD
 +	struct rlimit r = {RLIM_INFINITY, RLIM_INFINITY};
++=======
++>>>>>>> bpf: samples: do not touch RLIMIT_MEMLOCK
  	char *prog_name = "xdp_cpu_map5_lb_hash_ip_pairs";
  	char *mprog_filename = "xdp_redirect_kern.o";
  	char *redir_interface = NULL, *redir_map = NULL;
diff --cc samples/bpf/xdp_rxq_info_user.c
index 93fa1bc54f13,74a2926eba08..000000000000
--- a/samples/bpf/xdp_rxq_info_user.c
+++ b/samples/bpf/xdp_rxq_info_user.c
@@@ -450,7 -450,6 +450,10 @@@ static void stats_poll(int interval, in
  int main(int argc, char **argv)
  {
  	__u32 cfg_options= NO_TOUCH ; /* Default: Don't touch packet memory */
++<<<<<<< HEAD
 +	struct rlimit r = {RLIM_INFINITY, RLIM_INFINITY};
++=======
++>>>>>>> bpf: samples: do not touch RLIMIT_MEMLOCK
  	struct bpf_prog_load_attr prog_load_attr = {
  		.prog_type	= BPF_PROG_TYPE_XDP,
  	};

@kernel-patches-bot
Copy link
Author

At least one diff in series https://patchwork.kernel.org/project/linux-mm/list/?series=385629 irrelevant now for [{'archived': False, 'project': 399, 'delegate': 121173}]

@kernel-patches-bot kernel-patches-bot deleted the series/336629=>bpf-next branch November 19, 2020 11:09
kernel-patches-daemon-bpf bot pushed a commit that referenced this pull request Jul 8, 2024
Add a test case which replaces an active ingress qdisc while keeping the
miniq in-tact during the transition period to the new clsact qdisc.

  # ./vmtest.sh -- ./test_progs -t tc_link
  [...]
  ./test_progs -t tc_link
  [    3.412871] bpf_testmod: loading out-of-tree module taints kernel.
  [    3.413343] bpf_testmod: module verification failed: signature and/or required key missing - tainting kernel
  #332     tc_links_after:OK
  #333     tc_links_append:OK
  #334     tc_links_basic:OK
  #335     tc_links_before:OK
  #336     tc_links_chain_classic:OK
  #337     tc_links_chain_mixed:OK
  #338     tc_links_dev_chain0:OK
  #339     tc_links_dev_cleanup:OK
  #340     tc_links_dev_mixed:OK
  #341     tc_links_ingress:OK
  #342     tc_links_invalid:OK
  #343     tc_links_prepend:OK
  #344     tc_links_replace:OK
  #345     tc_links_revision:OK
  Summary: 14/0 PASSED, 0 SKIPPED, 0 FAILED

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Martin KaFai Lau <martin.lau@kernel.org>
kernel-patches-daemon-bpf bot pushed a commit that referenced this pull request Jul 8, 2024
Add a test case which replaces an active ingress qdisc while keeping the
miniq in-tact during the transition period to the new clsact qdisc.

  # ./vmtest.sh -- ./test_progs -t tc_link
  [...]
  ./test_progs -t tc_link
  [    3.412871] bpf_testmod: loading out-of-tree module taints kernel.
  [    3.413343] bpf_testmod: module verification failed: signature and/or required key missing - tainting kernel
  #332     tc_links_after:OK
  #333     tc_links_append:OK
  #334     tc_links_basic:OK
  #335     tc_links_before:OK
  #336     tc_links_chain_classic:OK
  #337     tc_links_chain_mixed:OK
  #338     tc_links_dev_chain0:OK
  #339     tc_links_dev_cleanup:OK
  #340     tc_links_dev_mixed:OK
  #341     tc_links_ingress:OK
  #342     tc_links_invalid:OK
  #343     tc_links_prepend:OK
  #344     tc_links_replace:OK
  #345     tc_links_revision:OK
  Summary: 14/0 PASSED, 0 SKIPPED, 0 FAILED

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Martin KaFai Lau <martin.lau@kernel.org>
kernel-patches-daemon-bpf bot pushed a commit that referenced this pull request Jul 8, 2024
Add a test case which replaces an active ingress qdisc while keeping the
miniq in-tact during the transition period to the new clsact qdisc.

  # ./vmtest.sh -- ./test_progs -t tc_link
  [...]
  ./test_progs -t tc_link
  [    3.412871] bpf_testmod: loading out-of-tree module taints kernel.
  [    3.413343] bpf_testmod: module verification failed: signature and/or required key missing - tainting kernel
  #332     tc_links_after:OK
  #333     tc_links_append:OK
  #334     tc_links_basic:OK
  #335     tc_links_before:OK
  #336     tc_links_chain_classic:OK
  #337     tc_links_chain_mixed:OK
  #338     tc_links_dev_chain0:OK
  #339     tc_links_dev_cleanup:OK
  #340     tc_links_dev_mixed:OK
  #341     tc_links_ingress:OK
  #342     tc_links_invalid:OK
  #343     tc_links_prepend:OK
  #344     tc_links_replace:OK
  #345     tc_links_revision:OK
  Summary: 14/0 PASSED, 0 SKIPPED, 0 FAILED

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://lore.kernel.org/r/20240708133130.11609-2-daniel@iogearbox.net
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
kernel-patches-daemon-bpf bot pushed a commit that referenced this pull request Jan 6, 2025
Using mutex lock in IO hot path causes the kernel BUG sleeping while
atomic. Shinichiro[1], first encountered this issue while running blktest
nvme/052 shown below:

BUG: sleeping function called from invalid context at kernel/locking/mutex.c:585
in_atomic(): 0, irqs_disabled(): 0, non_block: 0, pid: 996, name: (udev-worker)
preempt_count: 0, expected: 0
RCU nest depth: 1, expected: 0
2 locks held by (udev-worker)/996:
 #0: ffff8881004570c8 (mapping.invalidate_lock){.+.+}-{3:3}, at: page_cache_ra_unbounded+0x155/0x5c0
 #1: ffffffff8607eaa0 (rcu_read_lock){....}-{1:2}, at: blk_mq_flush_plug_list+0xa75/0x1950
CPU: 2 UID: 0 PID: 996 Comm: (udev-worker) Not tainted 6.12.0-rc3+ #339
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-2.fc40 04/01/2014
Call Trace:
 <TASK>
 dump_stack_lvl+0x6a/0x90
 __might_resched.cold+0x1f7/0x23d
 ? __pfx___might_resched+0x10/0x10
 ? vsnprintf+0xdeb/0x18f0
 __mutex_lock+0xf4/0x1220
 ? nvmet_subsys_nsid_exists+0xb9/0x150 [nvmet]
 ? __pfx_vsnprintf+0x10/0x10
 ? __pfx___mutex_lock+0x10/0x10
 ? snprintf+0xa5/0xe0
 ? xas_load+0x1ce/0x3f0
 ? nvmet_subsys_nsid_exists+0xb9/0x150 [nvmet]
 nvmet_subsys_nsid_exists+0xb9/0x150 [nvmet]
 ? __pfx_nvmet_subsys_nsid_exists+0x10/0x10 [nvmet]
 nvmet_req_find_ns+0x24e/0x300 [nvmet]
 nvmet_req_init+0x694/0xd40 [nvmet]
 ? blk_mq_start_request+0x11c/0x750
 ? nvme_setup_cmd+0x369/0x990 [nvme_core]
 nvme_loop_queue_rq+0x2a7/0x7a0 [nvme_loop]
 ? __pfx___lock_acquire+0x10/0x10
 ? __pfx_nvme_loop_queue_rq+0x10/0x10 [nvme_loop]
 __blk_mq_issue_directly+0xe2/0x1d0
 ? __pfx___blk_mq_issue_directly+0x10/0x10
 ? blk_mq_request_issue_directly+0xc2/0x140
 blk_mq_plug_issue_direct+0x13f/0x630
 ? lock_acquire+0x2d/0xc0
 ? blk_mq_flush_plug_list+0xa75/0x1950
 blk_mq_flush_plug_list+0xa9d/0x1950
 ? __pfx_blk_mq_flush_plug_list+0x10/0x10
 ? __pfx_mpage_readahead+0x10/0x10
 __blk_flush_plug+0x278/0x4d0
 ? __pfx___blk_flush_plug+0x10/0x10
 ? lock_release+0x460/0x7a0
 blk_finish_plug+0x4e/0x90
 read_pages+0x51b/0xbc0
 ? __pfx_read_pages+0x10/0x10
 ? lock_release+0x460/0x7a0
 page_cache_ra_unbounded+0x326/0x5c0
 force_page_cache_ra+0x1ea/0x2f0
 filemap_get_pages+0x59e/0x17b0
 ? __pfx_filemap_get_pages+0x10/0x10
 ? lock_is_held_type+0xd5/0x130
 ? __pfx___might_resched+0x10/0x10
 ? find_held_lock+0x2d/0x110
 filemap_read+0x317/0xb70
 ? up_write+0x1ba/0x510
 ? __pfx_filemap_read+0x10/0x10
 ? inode_security+0x54/0xf0
 ? selinux_file_permission+0x36d/0x420
 blkdev_read_iter+0x143/0x3b0
 vfs_read+0x6ac/0xa20
 ? __pfx_vfs_read+0x10/0x10
 ? __pfx_vm_mmap_pgoff+0x10/0x10
 ? __pfx___seccomp_filter+0x10/0x10
 ksys_read+0xf7/0x1d0
 ? __pfx_ksys_read+0x10/0x10
 do_syscall_64+0x93/0x180
 ? lockdep_hardirqs_on_prepare+0x16d/0x400
 ? do_syscall_64+0x9f/0x180
 ? lockdep_hardirqs_on+0x78/0x100
 ? do_syscall_64+0x9f/0x180
 ? lockdep_hardirqs_on_prepare+0x16d/0x400
 entry_SYSCALL_64_after_hwframe+0x76/0x7e
RIP: 0033:0x7f565bd1ce11
Code: 00 48 8b 15 09 90 0d 00 f7 d8 64 89 02 b8 ff ff ff ff eb bd e8 d0 ad 01 00 f3 0f 1e fa 80 3d 35 12 0e 00 00 74 13 31 c0 0f 05 <48> 3d 00 f0 ff ff 77 4f c3 66 0f 1f 44 00 00 55 48 89 e5 48 83 ec
RSP: 002b:00007ffd6e7a20c8 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
RAX: ffffffffffffffda RBX: 0000000000001000 RCX: 00007f565bd1ce11
RDX: 0000000000001000 RSI: 00007f565babb000 RDI: 0000000000000014
RBP: 00007ffd6e7a2130 R08: 00000000ffffffff R09: 0000000000000000
R10: 0000556000bfa610 R11: 0000000000000246 R12: 000000003ffff000
R13: 0000556000bfa5b0 R14: 0000000000000e00 R15: 0000556000c07328
 </TASK>

Apparently, the above issue is caused due to using mutex lock while
we're in IO hot path. It's a regression caused with commit 5053639
("nvmet: fix nvme status code when namespace is disabled"). The mutex
->su_mutex is used to find whether a disabled nsid exists in the config
group or not. This is to differentiate between a nsid that is disabled
vs non-existent.

To mitigate the above issue, we've worked upon a fix[2] where we now
insert nsid in subsys Xarray as soon as it's created under config group
and later when that nsid is enabled, we add an Xarray mark on it and set
ns->enabled to true. The Xarray mark is useful while we need to loop
through all enabled namepsaces under a subsystem using xa_for_each_marked()
API. If later a nsid is disabled then we clear Xarray mark from it and also
set ns->enabled to false. It's only when nsid is deleted from the config
group we delete it from the Xarray.

So with this change, now we could easily differentiate a nsid is disabled
(i.e. Xarray entry for ns exists but ns->enabled is set to false) vs non-
existent (i.e.Xarray entry for ns doesn't exist).

Link: https://lore.kernel.org/linux-nvme/20241022070252.GA11389@lst.de/ [2]
Reported-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Closes: https://lore.kernel.org/linux-nvme/tqcy3sveity7p56v7ywp7ssyviwcb3w4623cnxj3knoobfcanq@yxgt2mjkbkam/ [1]
Fixes: 5053639 ("nvmet: fix nvme status code when namespace is disabled")
Fix-suggested-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants