forked from torvalds/linux
-
Notifications
You must be signed in to change notification settings - Fork 15
Btrfs device related enhancements #3
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Closed
Closed
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
This will return EIO when __bread() fails to read SB, instead of EINVAL. Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
…INTK defined error handling logic behaves differently with or without CONFIG_PRINTK defined, since there are two copies of the same function which a bit of different logic One, when CONFIG_PRINTK is defined, code is __btrfs_std_error(..) { :: save_error_info(fs_info); if (sb->s_flags & MS_BORN) btrfs_handle_error(fs_info); } and two when CONFIG_PRINTK is not defined, the code is __btrfs_std_error(..) { :: if (sb->s_flags & MS_BORN) { save_error_info(fs_info); btrfs_handle_error(fs_info); } } I doubt if this was intentional ? and appear to have caused since we maintain two copies of the same function and they got diverged with commits. Now to decide which logic is correct reviewed changes as below, 533574c Commit added two copies of this function cf79ffb Commit made change to only one copy of the function and to the copy when CONFIG_PRINTK is defined. To fix this, instead of maintaining two copies of same function approach, maintain single function, and just put the extra portion of the code under CONFIG_PRINTK define. This patch just does that. And keeps code of with CONFIG_PRINTK defined. Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_error() and btrfs_std_error() does the same thing and calls _btrfs_std_error(), so consolidate them together. And the main motivation is that btrfs_error() is closely named with btrfs_err(), one handles error action the other is to log the error, so don't closely name them. Signed-off-by: Anand Jain <anand.jain@oracle.com> Suggested-by: David Sterba <dsterba@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
…ot found Use btrfs specific error code BTRFS_ERROR_DEV_MISSING_NOT_FOUND instead of -ENOENT. Next this removes the logging when user specifies "missing" and we don't find it in the kernel device list. Logging are for system events not for user input errors. Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
This uses a chunk of code from btrfs_read_dev_super() and creates a function called btrfs_read_dev_one_super() so that next patch can use it for scratch superblock. Signed-off-by: Anand Jain <anand.jain@oracle.com> [renamed bufhead to bh] Signed-off-by: David Sterba <dsterba@suse.com>
This patch updates and renames btrfs_scratch_superblocks, (which is used by the replace device thread), with those fixes from the scratch superblock code section of btrfs_rm_device(). The fixes are: Scratch all copies of superblock Notify kobject that superblock has been changed Update time on the device So that btrfs_rm_device() can use the function btrfs_scratch_superblocks() instead of its own scratch code. And further replace deivce code which similarly releases device back to the system, will have the fixes from the btrfs device delete. Signed-off-by: Anand Jain <anand.jain@oracle.com> [renamed to btrfs_scratch_superblock] Signed-off-by: David Sterba <dsterba@suse.com>
By general rule of thumb there shouldn't be any way that user land could trigger a kernel operation just by sending wrong arguments. Here do commit cleanups after user input has been verified. Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
Originally the message was not in a helper but ended up there. We should print error messages from callers instead. Signed-off-by: Anand Jain <anand.jain@oracle.com> [reworded subject and changelog] Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com> [reworded subject and changelog] Signed-off-by: David Sterba <dsterba@suse.com>
To avoid deadlock described in commit 084b6e7 ("btrfs: Fix a lockdep warning when running xfstest."), we should move kobj stuff out of dev_replace lock range. "It is because the btrfs_kobj_{add/rm}_device() will call memory allocation with GFP_KERNEL, which may flush fs page cache to free space, waiting for it self to do the commit, causing the deadlock. To solve the problem, move btrfs_kobj_{add/rm}_device() out of the dev_replace lock range, also involing split the btrfs_rm_dev_replace_srcdev() function into remove and free parts. Now only btrfs_rm_dev_replace_remove_srcdev() is called in dev_replace lock range, and kobj_{add/rm} and btrfs_rm_dev_replace_free_srcdev() are called out of the lock range." Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Anand Jain <anand.jain@oracle.com> [added lockup description] Signed-off-by: David Sterba <dsterba@suse.com>
This patch will log return value of add/del_qgroup_relation() and pass the err code of btrfs_run_qgroups to the btrfs_std_error(). Signed-off-by: Anand Jain <anand.jain@oracle.com>
A part of code from btrfs_scan_one_device() is moved to a new function btrfs_read_disk_super(), so that former function looks cleaner and moves the code to ensure null terminating label to it as well. Further there is opportunity to merge various duplicate code on read disk super. Earlier attempt on this was highlighted that there was some issues for which there are multiple versions, however it was not clear what was issue. So until its worked out we can keep it in a separate function. Signed-off-by: Anand Jain <anand.jain@oracle.com>
Optional Label may or may not be set, or it might be set at some time later. However while debugging to search through the kernel logs the scripts would need the logs to be consistent, so logs search key words shouldn't depend on the optional variables, instead fsid is better. Signed-off-by: Anand Jain <anand.jain@oracle.com>
From the issue diagnosable point of view, log if the device path is changed. Signed-off-by: Anand Jain <anand.jain@oracle.com>
Looks like oversight, call brelse() when checksum fails. Further down the code, in the non error path, we do call brelse() and so we don't see brelse() in the goto error paths. Signed-off-by: Anand Jain <anand.jain@oracle.com>
This adds an enhancement to show the seed fsid and its devices on the btrfs sysfs. The way sprouting handles fs_devices: clone seed fs_devices and add to the fs_uuids mem copy seed fs_devices and assign to fs_devices->seed (move dev_list) evacuate seed fs_devices contents to hold sprout fs devices contents So to be inline with this fs_devices changes during seeding, represent seed fsid under the sprout fsid, this is achieved by using the kobject_move() The end result will be, /sys/fs/btrfs/sprout-fsid/seed/level-1-seed-fsid/seed/(if)level-2-seed-fsid Signed-off-by: Anand Jain <anand.jain@oracle.com>
We need fsid kobject to hold pool attributes however its created only when fs is mounted. So, this patch changes the life cycle of the fsid and devices kobjects /sys/fs/btrfs/<fsid> and /sys/fs/btrfs/<fsid>/devices, from created and destroyed by mount and unmount event to created and destroyed by scanned and module-unload events respectively. However this does not alter life cycle of fs attributes as such. Signed-off-by: Anand Jain <anand.jain@oracle.com>
move a section of btrfs_rm_device() code to check for min number of the devices into the function __check_raid_min_devices() v2: commit update and title renamed from Btrfs: move check for min number of devices to a function Signed-off-by: Anand Jain <anand.jain@oracle.com>
__check_raid_min_device() which was pealed from btrfs_rm_device() maintianed its original code to show the block move. This patch cleans up __check_raid_min_device(). Signed-off-by: Anand Jain <anand.jain@oracle.com>
The patch renames btrfs_dev_replace_find_srcdev() to btrfs_find_device_by_user_input() and moves it to volumes.c. so that delete device can use it. v2: changed title from 'Btrfs: create rename btrfs_dev_replace_find_srcdev()' and commit update Signed-off-by: Anand Jain <anand.jain@oracle.com>
btrfs_rm_device() has a section of the code which can be replaced btrfs_find_device_by_user_input() Signed-off-by: Anand Jain <anand.jain@oracle.com>
The operation of device replace and device delete follows same steps upto some depth with in btrfs kernel, however they don't share codes. This enhancement will help replace and delete to share codes. Btrfs: enhance check device_path in btrfs_find_device_by_user_input() Signed-off-by: Anand Jain <anand.jain@oracle.com>
With the previous patches now the btrfs_scratch_superblocks() is ready to be used in btrfs_rm_device() so use it. Signed-off-by: Anand Jain <anand.jain@oracle.com>
This patch makes btrfs_fs_devices and btrfs_device information readable from sysfs. This uses the sysfs group visible entry point to mark certain attributes visible/hidden depending the FS state. The new extended layout is as shown below. /sys/fs/btrfs/ ./7b047f4d-c2ce-4f22-94a3-68c09057f1bf* fsid* missing_devices num_devices* open_devices opened* rotating rw_devices seeding total_devices* total_rw_bytes ./e6701882-220a-4416-98ac-a99f095bddcc* active_pending bdev bytes_used can_discard devid* dev_root_fsid devstats_valid dev_totalbytes generation* in_fs_metadata io_align io_width missing name* nobarriers replace_tgtdev sector_size total_bytes type uuid* writeable (* indicates that attribute will be visible even when device is unmounted but registered with btrfs kernel) v2: use btrfs_error() not btrfs_err() reword subject form : Btrfs: add sysfs layout to show btrfs_fs_devices and btrfs_device attributes Signed-off-by: Anand Jain <anand.jain@oracle.com>
kdave
pushed a commit
that referenced
this pull request
Feb 5, 2025
This fixes the following hard lockup in isolate_lru_folios() during memory reclaim. If the LRU mostly contains ineligible folios this may trigger watchdog. watchdog: Watchdog detected hard LOCKUP on cpu 173 RIP: 0010:native_queued_spin_lock_slowpath+0x255/0x2a0 Call Trace: _raw_spin_lock_irqsave+0x31/0x40 folio_lruvec_lock_irqsave+0x5f/0x90 folio_batch_move_lru+0x91/0x150 lru_add_drain_per_cpu+0x1c/0x40 process_one_work+0x17d/0x350 worker_thread+0x27b/0x3a0 kthread+0xe8/0x120 ret_from_fork+0x34/0x50 ret_from_fork_asm+0x1b/0x30 lruvec->lru_lock owner: PID: 2865 TASK: ffff888139214d40 CPU: 40 COMMAND: "kswapd0" #0 [fffffe0000945e60] crash_nmi_callback at ffffffffa567a555 #1 [fffffe0000945e68] nmi_handle at ffffffffa563b171 #2 [fffffe0000945eb0] default_do_nmi at ffffffffa6575920 #3 [fffffe0000945ed0] exc_nmi at ffffffffa6575af4 #4 [fffffe0000945ef0] end_repeat_nmi at ffffffffa6601dde [exception RIP: isolate_lru_folios+403] RIP: ffffffffa597df53 RSP: ffffc90006fb7c28 RFLAGS: 00000002 RAX: 0000000000000001 RBX: ffffc90006fb7c60 RCX: ffffea04a2196f88 RDX: ffffc90006fb7c60 RSI: ffffc90006fb7c60 RDI: ffffea04a2197048 RBP: ffff88812cbd3010 R8: ffffea04a2197008 R9: 0000000000000001 R10: 0000000000000000 R11: 0000000000000001 R12: ffffea04a2197008 R13: ffffea04a2197048 R14: ffffc90006fb7de8 R15: 0000000003e3e937 ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018 <NMI exception stack> #5 [ffffc90006fb7c28] isolate_lru_folios at ffffffffa597df53 #6 [ffffc90006fb7cf8] shrink_active_list at ffffffffa597f788 #7 [ffffc90006fb7da8] balance_pgdat at ffffffffa5986db0 #8 [ffffc90006fb7ec0] kswapd at ffffffffa5987354 #9 [ffffc90006fb7ef8] kthread at ffffffffa5748238 crash> Scenario: User processe are requesting a large amount of memory and keep page active. Then a module continuously requests memory from ZONE_DMA32 area. Memory reclaim will be triggered due to ZONE_DMA32 watermark alarm reached. However pages in the LRU(active_anon) list are mostly from the ZONE_NORMAL area. Reproduce: Terminal 1: Construct to continuously increase pages active(anon). mkdir /tmp/memory mount -t tmpfs -o size=1024000M tmpfs /tmp/memory dd if=/dev/zero of=/tmp/memory/block bs=4M tail /tmp/memory/block Terminal 2: vmstat -a 1 active will increase. procs ---memory--- ---swap-- ---io---- -system-- ---cpu--- ... r b swpd free inact active si so bi bo 1 0 0 1445623076 45898836 83646008 0 0 0 1 0 0 1445623076 43450228 86094616 0 0 0 1 0 0 1445623076 41003480 88541364 0 0 0 1 0 0 1445623076 38557088 90987756 0 0 0 1 0 0 1445623076 36109688 93435156 0 0 0 1 0 0 1445619552 33663256 95881632 0 0 0 1 0 0 1445619804 31217140 98327792 0 0 0 1 0 0 1445619804 28769988 100774944 0 0 0 1 0 0 1445619804 26322348 103222584 0 0 0 1 0 0 1445619804 23875592 105669340 0 0 0 cat /proc/meminfo | head Active(anon) increase. MemTotal: 1579941036 kB MemFree: 1445618500 kB MemAvailable: 1453013224 kB Buffers: 6516 kB Cached: 128653956 kB SwapCached: 0 kB Active: 118110812 kB Inactive: 11436620 kB Active(anon): 115345744 kB Inactive(anon): 945292 kB When the Active(anon) is 115345744 kB, insmod module triggers the ZONE_DMA32 watermark. perf record -e vmscan:mm_vmscan_lru_isolate -aR perf script isolate_mode=0 classzone=1 order=1 nr_requested=32 nr_scanned=2 nr_skipped=2 nr_taken=0 lru=active_anon isolate_mode=0 classzone=1 order=1 nr_requested=32 nr_scanned=0 nr_skipped=0 nr_taken=0 lru=active_anon isolate_mode=0 classzone=1 order=0 nr_requested=32 nr_scanned=28835844 nr_skipped=28835844 nr_taken=0 lru=active_anon isolate_mode=0 classzone=1 order=1 nr_requested=32 nr_scanned=28835844 nr_skipped=28835844 nr_taken=0 lru=active_anon isolate_mode=0 classzone=1 order=0 nr_requested=32 nr_scanned=29 nr_skipped=29 nr_taken=0 lru=active_anon isolate_mode=0 classzone=1 order=0 nr_requested=32 nr_scanned=0 nr_skipped=0 nr_taken=0 lru=active_anon See nr_scanned=28835844. 28835844 * 4k = 115343376KB approximately equal to 115345744 kB. If increase Active(anon) to 1000G then insmod module triggers the ZONE_DMA32 watermark. hard lockup will occur. In my device nr_scanned = 0000000003e3e937 when hard lockup. Convert to memory size 0x0000000003e3e937 * 4KB = 261072092 KB. [ffffc90006fb7c28] isolate_lru_folios at ffffffffa597df53 ffffc90006fb7c30: 0000000000000020 0000000000000000 ffffc90006fb7c40: ffffc90006fb7d40 ffff88812cbd3000 ffffc90006fb7c50: ffffc90006fb7d30 0000000106fb7de8 ffffc90006fb7c60: ffffea04a2197008 ffffea0006ed4a48 ffffc90006fb7c70: 0000000000000000 0000000000000000 ffffc90006fb7c80: 0000000000000000 0000000000000000 ffffc90006fb7c90: 0000000000000000 0000000000000000 ffffc90006fb7ca0: 0000000000000000 0000000003e3e937 ffffc90006fb7cb0: 0000000000000000 0000000000000000 ffffc90006fb7cc0: 8d7c0b56b7874b00 ffff88812cbd3000 About the Fixes: Why did it take eight years to be discovered? The problem requires the following conditions to occur: 1. The device memory should be large enough. 2. Pages in the LRU(active_anon) list are mostly from the ZONE_NORMAL area. 3. The memory in ZONE_DMA32 needs to reach the watermark. If the memory is not large enough, or if the usage design of ZONE_DMA32 area memory is reasonable, this problem is difficult to detect. notes: The problem is most likely to occur in ZONE_DMA32 and ZONE_NORMAL, but other suitable scenarios may also trigger the problem. Link: https://lkml.kernel.org/r/20241119060842.274072-1-liuye@kylinos.cn Fixes: b2e1875 ("mm, vmscan: begin reclaiming pages on a per-node basis") Signed-off-by: liuye <liuye@kylinos.cn> Cc: Hugh Dickins <hughd@google.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Yang Shi <yang@os.amperecomputing.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
kdave
pushed a commit
that referenced
this pull request
Feb 17, 2025
We have several places across the kernel where we want to access another task's syscall arguments, such as ptrace(2), seccomp(2), etc., by making a call to syscall_get_arguments(). This works for register arguments right away by accessing the task's `regs' member of `struct pt_regs', however for stack arguments seen with 32-bit/o32 kernels things are more complicated. Technically they ought to be obtained from the user stack with calls to an access_remote_vm(), but we have an easier way available already. So as to be able to access syscall stack arguments as regular function arguments following the MIPS calling convention we copy them over from the user stack to the kernel stack in arch/mips/kernel/scall32-o32.S, in handle_sys(), to the current stack frame's outgoing argument space at the top of the stack, which is where the handler called expects to see its incoming arguments. This area is also pointed at by the `pt_regs' pointer obtained by task_pt_regs(). Make the o32 stack argument space a proper member of `struct pt_regs' then, by renaming the existing member from `pad0' to `args' and using generated offsets to access the space. No functional change though. With the change in place the o32 kernel stack frame layout at the entry to a syscall handler invoked by handle_sys() is therefore as follows: $sp + 68 -> | ... | <- pt_regs.regs[9] +---------------------+ $sp + 64 -> | $t0 | <- pt_regs.regs[8] +---------------------+ $sp + 60 -> | $a3/argument #4 | <- pt_regs.regs[7] +---------------------+ $sp + 56 -> | $a2/argument #3 | <- pt_regs.regs[6] +---------------------+ $sp + 52 -> | $a1/argument #2 | <- pt_regs.regs[5] +---------------------+ $sp + 48 -> | $a0/argument #1 | <- pt_regs.regs[4] +---------------------+ $sp + 44 -> | $v1 | <- pt_regs.regs[3] +---------------------+ $sp + 40 -> | $v0 | <- pt_regs.regs[2] +---------------------+ $sp + 36 -> | $at | <- pt_regs.regs[1] +---------------------+ $sp + 32 -> | $zero | <- pt_regs.regs[0] +---------------------+ $sp + 28 -> | stack argument #8 | <- pt_regs.args[7] +---------------------+ $sp + 24 -> | stack argument #7 | <- pt_regs.args[6] +---------------------+ $sp + 20 -> | stack argument #6 | <- pt_regs.args[5] +---------------------+ $sp + 16 -> | stack argument #5 | <- pt_regs.args[4] +---------------------+ $sp + 12 -> | psABI space for $a3 | <- pt_regs.args[3] +---------------------+ $sp + 8 -> | psABI space for $a2 | <- pt_regs.args[2] +---------------------+ $sp + 4 -> | psABI space for $a1 | <- pt_regs.args[1] +---------------------+ $sp + 0 -> | psABI space for $a0 | <- pt_regs.args[0] +---------------------+ holding user data received and with the first 4 frame slots reserved by the psABI for the compiler to spill the incoming arguments from $a0-$a3 registers (which it sometimes does according to its needs) and the next 4 frame slots designated by the psABI for any stack function arguments that follow. This data is also available for other tasks to peek/poke at as reqired and where permitted. Signed-off-by: Maciej W. Rozycki <macro@orcam.me.uk> Signed-off-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
kdave
pushed a commit
that referenced
this pull request
Feb 17, 2025
…rary mm Erhard reports the following KASAN hit on Talos II (power9) with kernel 6.13: [ 12.028126] ================================================================== [ 12.028198] BUG: KASAN: user-memory-access in copy_to_kernel_nofault+0x8c/0x1a0 [ 12.028260] Write of size 8 at addr 0000187e458f2000 by task systemd/1 [ 12.028346] CPU: 87 UID: 0 PID: 1 Comm: systemd Tainted: G T 6.13.0-P9-dirty #3 [ 12.028408] Tainted: [T]=RANDSTRUCT [ 12.028446] Hardware name: T2P9D01 REV 1.01 POWER9 0x4e1202 opal:skiboot-bc106a0 PowerNV [ 12.028500] Call Trace: [ 12.028536] [c000000008dbf3b0] [c000000001656a48] dump_stack_lvl+0xbc/0x110 (unreliable) [ 12.028609] [c000000008dbf3f0] [c0000000006e2fc8] print_report+0x6b0/0x708 [ 12.028666] [c000000008dbf4e0] [c0000000006e2454] kasan_report+0x164/0x300 [ 12.028725] [c000000008dbf600] [c0000000006e54d4] kasan_check_range+0x314/0x370 [ 12.028784] [c000000008dbf640] [c0000000006e6310] __kasan_check_write+0x20/0x40 [ 12.028842] [c000000008dbf660] [c000000000578e8c] copy_to_kernel_nofault+0x8c/0x1a0 [ 12.028902] [c000000008dbf6a0] [c0000000000acfe4] __patch_instructions+0x194/0x210 [ 12.028965] [c000000008dbf6e0] [c0000000000ade80] patch_instructions+0x150/0x590 [ 12.029026] [c000000008dbf7c0] [c0000000001159bc] bpf_arch_text_copy+0x6c/0xe0 [ 12.029085] [c000000008dbf800] [c000000000424250] bpf_jit_binary_pack_finalize+0x40/0xc0 [ 12.029147] [c000000008dbf830] [c000000000115dec] bpf_int_jit_compile+0x3bc/0x930 [ 12.029206] [c000000008dbf990] [c000000000423720] bpf_prog_select_runtime+0x1f0/0x280 [ 12.029266] [c000000008dbfa00] [c000000000434b18] bpf_prog_load+0xbb8/0x1370 [ 12.029324] [c000000008dbfb70] [c000000000436ebc] __sys_bpf+0x5ac/0x2e00 [ 12.029379] [c000000008dbfd00] [c00000000043a228] sys_bpf+0x28/0x40 [ 12.029435] [c000000008dbfd20] [c000000000038eb4] system_call_exception+0x334/0x610 [ 12.029497] [c000000008dbfe50] [c00000000000c270] system_call_vectored_common+0xf0/0x280 [ 12.029561] --- interrupt: 3000 at 0x3fff82f5cfa8 [ 12.029608] NIP: 00003fff82f5cfa8 LR: 00003fff82f5cfa8 CTR: 0000000000000000 [ 12.029660] REGS: c000000008dbfe80 TRAP: 3000 Tainted: G T (6.13.0-P9-dirty) [ 12.029735] MSR: 900000000280f032 <SF,HV,VEC,VSX,EE,PR,FP,ME,IR,DR,RI> CR: 42004848 XER: 00000000 [ 12.029855] IRQMASK: 0 GPR00: 0000000000000169 00003fffdcf789a0 00003fff83067100 0000000000000005 GPR04: 00003fffdcf78a98 0000000000000090 0000000000000000 0000000000000008 GPR08: 0000000000000000 0000000000000000 0000000000000000 0000000000000000 GPR12: 0000000000000000 00003fff836ff7e0 c000000000010678 0000000000000000 GPR16: 0000000000000000 0000000000000000 00003fffdcf78f28 00003fffdcf78f90 GPR20: 0000000000000000 0000000000000000 0000000000000000 00003fffdcf78f80 GPR24: 00003fffdcf78f70 00003fffdcf78d10 00003fff835c7239 00003fffdcf78bd8 GPR28: 00003fffdcf78a98 0000000000000000 0000000000000000 000000011f547580 [ 12.030316] NIP [00003fff82f5cfa8] 0x3fff82f5cfa8 [ 12.030361] LR [00003fff82f5cfa8] 0x3fff82f5cfa8 [ 12.030405] --- interrupt: 3000 [ 12.030444] ================================================================== Commit c28c15b ("powerpc/code-patching: Use temporary mm for Radix MMU") is inspired from x86 but unlike x86 is doesn't disable KASAN reports during patching. This wasn't a problem at the begining because __patch_mem() is not instrumented. Commit 465cabc ("powerpc/code-patching: introduce patch_instructions()") use copy_to_kernel_nofault() to copy several instructions at once. But when using temporary mm the destination is not regular kernel memory but a kind of kernel-like memory located in user address space. Because it is not in kernel address space it is not covered by KASAN shadow memory. Since commit e4137f0 ("mm, kasan, kmsan: instrument copy_from/to_kernel_nofault") KASAN reports bad accesses from copy_to_kernel_nofault(). Here a bad access to user memory is reported because KASAN detects the lack of shadow memory and the address is below TASK_SIZE. Do like x86 in commit b3fd8e8 ("x86/alternatives: Use temporary mm for text poking") and disable KASAN reports during patching when using temporary mm. Reported-by: Erhard Furtner <erhard_f@mailbox.org> Close: https://lore.kernel.org/all/20250201151435.48400261@yea/ Fixes: 465cabc ("powerpc/code-patching: introduce patch_instructions()") Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu> Acked-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Madhavan Srinivasan <maddy@linux.ibm.com> Link: https://patch.msgid.link/1c05b2a1b02ad75b981cfc45927e0b4a90441046.1738577687.git.christophe.leroy@csgroup.eu
kdave
pushed a commit
that referenced
this pull request
Feb 21, 2025
Since commit 6037802 ("power: supply: core: implement extension API") there is the following ABBA deadlock (simplified) between the LED trigger code and the power-supply code: 1) When registering a power-supply class device, power_supply_register() calls led_trigger_register() from power_supply_create_triggers() in a scoped_guard(rwsem_read, &psy->extensions_sem) context. led_trigger_register() then in turn takes a LED subsystem lock. So here we have the following locking order: * Read-lock extensions_sem * Lock LED subsystem lock(s) 2) When registering a LED class device, with its default trigger set to a power-supply LED trigger (which has already been registered) The LED class code calls power_supply_led_trigger_activate() when setting up the default trigger. power_supply_led_trigger_activate() calls power_supply_get_property() to determine the initial value of to assign to the LED and that read-locks extensions_sem. So now we have the following locking order: * Lock LED subsystem lock(s) * Read-lock extensions_sem Fixing this is easy, there is no need to hold the extensions_sem when calling power_supply_create_triggers() since all triggers are always created rather then checking for the presence of certain attributes as power_supply_add_hwmon_sysfs() does. Move power_supply_create_triggers() out of the guard block to fix this. Here is the lockdep report fixed by this change: [ 31.249343] ====================================================== [ 31.249378] WARNING: possible circular locking dependency detected [ 31.249413] 6.13.0-rc6+ torvalds#251 Tainted: G C E [ 31.249440] ------------------------------------------------------ [ 31.249471] (udev-worker)/553 is trying to acquire lock: [ 31.249501] ffff892adbcaf660 (&psy->extensions_sem){.+.+}-{4:4}, at: power_supply_get_property.part.0+0x22/0x150 [ 31.249574] but task is already holding lock: [ 31.249603] ffff892adbc0bad0 (&led_cdev->trigger_lock){+.+.}-{4:4}, at: led_trigger_set_default+0x34/0xe0 [ 31.249657] which lock already depends on the new lock. [ 31.249696] the existing dependency chain (in reverse order) is: [ 31.249735] -> #2 (&led_cdev->trigger_lock){+.+.}-{4:4}: [ 31.249778] down_write+0x3b/0xd0 [ 31.249803] led_trigger_set_default+0x34/0xe0 [ 31.249833] led_classdev_register_ext+0x311/0x3a0 [ 31.249863] input_leds_connect+0x1dc/0x2a0 [ 31.249889] input_attach_handler.isra.0+0x75/0x90 [ 31.249921] input_register_device.cold+0xa1/0x150 [ 31.249955] hidinput_connect+0x8a2/0xb80 [ 31.249982] hid_connect+0x582/0x5c0 [ 31.250007] hid_hw_start+0x3f/0x60 [ 31.250030] hid_device_probe+0x122/0x1f0 [ 31.250053] really_probe+0xde/0x340 [ 31.250080] __driver_probe_device+0x78/0x110 [ 31.250105] driver_probe_device+0x1f/0xa0 [ 31.250132] __device_attach_driver+0x85/0x110 [ 31.250160] bus_for_each_drv+0x78/0xc0 [ 31.250184] __device_attach+0xb0/0x1b0 [ 31.250207] bus_probe_device+0x94/0xb0 [ 31.250230] device_add+0x64a/0x860 [ 31.250252] hid_add_device+0xe5/0x240 [ 31.250279] usbhid_probe+0x4dc/0x620 [ 31.250303] usb_probe_interface+0xe4/0x2a0 [ 31.250329] really_probe+0xde/0x340 [ 31.250353] __driver_probe_device+0x78/0x110 [ 31.250377] driver_probe_device+0x1f/0xa0 [ 31.250404] __device_attach_driver+0x85/0x110 [ 31.250431] bus_for_each_drv+0x78/0xc0 [ 31.250455] __device_attach+0xb0/0x1b0 [ 31.250478] bus_probe_device+0x94/0xb0 [ 31.250501] device_add+0x64a/0x860 [ 31.250523] usb_set_configuration+0x606/0x8a0 [ 31.250552] usb_generic_driver_probe+0x3e/0x60 [ 31.250579] usb_probe_device+0x3d/0x120 [ 31.250605] really_probe+0xde/0x340 [ 31.250629] __driver_probe_device+0x78/0x110 [ 31.250653] driver_probe_device+0x1f/0xa0 [ 31.250680] __device_attach_driver+0x85/0x110 [ 31.250707] bus_for_each_drv+0x78/0xc0 [ 31.250731] __device_attach+0xb0/0x1b0 [ 31.250753] bus_probe_device+0x94/0xb0 [ 31.250776] device_add+0x64a/0x860 [ 31.250798] usb_new_device.cold+0x141/0x38f [ 31.250828] hub_event+0x1166/0x1980 [ 31.250854] process_one_work+0x20f/0x580 [ 31.250879] worker_thread+0x1d1/0x3b0 [ 31.250904] kthread+0xee/0x120 [ 31.250926] ret_from_fork+0x30/0x50 [ 31.250954] ret_from_fork_asm+0x1a/0x30 [ 31.250982] -> #1 (triggers_list_lock){++++}-{4:4}: [ 31.251022] down_write+0x3b/0xd0 [ 31.251045] led_trigger_register+0x40/0x1b0 [ 31.251074] power_supply_register_led_trigger+0x88/0x150 [ 31.251107] power_supply_create_triggers+0x55/0xe0 [ 31.251135] __power_supply_register.part.0+0x34e/0x4a0 [ 31.251164] devm_power_supply_register+0x70/0xc0 [ 31.251190] bq27xxx_battery_setup+0x1a1/0x6d0 [bq27xxx_battery] [ 31.251235] bq27xxx_battery_i2c_probe+0xe5/0x17f [bq27xxx_battery_i2c] [ 31.251272] i2c_device_probe+0x125/0x2b0 [ 31.251299] really_probe+0xde/0x340 [ 31.251324] __driver_probe_device+0x78/0x110 [ 31.251348] driver_probe_device+0x1f/0xa0 [ 31.251375] __driver_attach+0xba/0x1c0 [ 31.251398] bus_for_each_dev+0x6b/0xb0 [ 31.251421] bus_add_driver+0x111/0x1f0 [ 31.251445] driver_register+0x6e/0xc0 [ 31.251470] i2c_register_driver+0x41/0xb0 [ 31.251498] do_one_initcall+0x5e/0x3a0 [ 31.251522] do_init_module+0x60/0x220 [ 31.251550] __do_sys_init_module+0x15f/0x190 [ 31.251575] do_syscall_64+0x93/0x180 [ 31.251598] entry_SYSCALL_64_after_hwframe+0x76/0x7e [ 31.251629] -> #0 (&psy->extensions_sem){.+.+}-{4:4}: [ 31.251668] __lock_acquire+0x13ce/0x21c0 [ 31.251694] lock_acquire+0xcf/0x2e0 [ 31.251719] down_read+0x3e/0x170 [ 31.251741] power_supply_get_property.part.0+0x22/0x150 [ 31.251774] power_supply_update_leds+0x8d/0x230 [ 31.251804] power_supply_led_trigger_activate+0x18/0x20 [ 31.251837] led_trigger_set+0x1fc/0x300 [ 31.251863] led_trigger_set_default+0x90/0xe0 [ 31.251892] led_classdev_register_ext+0x311/0x3a0 [ 31.251921] devm_led_classdev_multicolor_register_ext+0x6e/0xb80 [led_class_multicolor] [ 31.251969] ktd202x_probe+0x464/0x5c0 [leds_ktd202x] [ 31.252002] i2c_device_probe+0x125/0x2b0 [ 31.252027] really_probe+0xde/0x340 [ 31.252052] __driver_probe_device+0x78/0x110 [ 31.252076] driver_probe_device+0x1f/0xa0 [ 31.252103] __driver_attach+0xba/0x1c0 [ 31.252125] bus_for_each_dev+0x6b/0xb0 [ 31.252148] bus_add_driver+0x111/0x1f0 [ 31.252172] driver_register+0x6e/0xc0 [ 31.252197] i2c_register_driver+0x41/0xb0 [ 31.252225] do_one_initcall+0x5e/0x3a0 [ 31.252248] do_init_module+0x60/0x220 [ 31.252274] __do_sys_init_module+0x15f/0x190 [ 31.253986] do_syscall_64+0x93/0x180 [ 31.255826] entry_SYSCALL_64_after_hwframe+0x76/0x7e [ 31.257614] other info that might help us debug this: [ 31.257619] Chain exists of: &psy->extensions_sem --> triggers_list_lock --> &led_cdev->trigger_lock [ 31.257630] Possible unsafe locking scenario: [ 31.257632] CPU0 CPU1 [ 31.257633] ---- ---- [ 31.257634] lock(&led_cdev->trigger_lock); [ 31.257637] lock(triggers_list_lock); [ 31.257640] lock(&led_cdev->trigger_lock); [ 31.257643] rlock(&psy->extensions_sem); [ 31.257646] *** DEADLOCK *** [ 31.289433] 4 locks held by (udev-worker)/553: [ 31.289443] #0: ffff892ad9658108 (&dev->mutex){....}-{4:4}, at: __driver_attach+0xaf/0x1c0 [ 31.289463] #1: ffff892adbc0bbc8 (&led_cdev->led_access){+.+.}-{4:4}, at: led_classdev_register_ext+0x1c7/0x3a0 [ 31.289476] #2: ffffffffad0e30b0 (triggers_list_lock){++++}-{4:4}, at: led_trigger_set_default+0x2c/0xe0 [ 31.289487] #3: ffff892adbc0bad0 (&led_cdev->trigger_lock){+.+.}-{4:4}, at: led_trigger_set_default+0x34/0xe0 Fixes: 6037802 ("power: supply: core: implement extension API") Cc: Thomas Weißschuh <linux@weissschuh.net> Cc: Armin Wolf <W_Armin@gmx.de> Signed-off-by: Hans de Goede <hdegoede@redhat.com> Reviewed-by: Thomas Weißschuh <linux@weissschuh.net> Link: https://lore.kernel.org/r/20250130140035.20636-1-hdegoede@redhat.com Signed-off-by: Sebastian Reichel <sebastian.reichel@collabora.com>
kdave
pushed a commit
that referenced
this pull request
Mar 3, 2025
…/kernel/git/kvmarm/kvmarm into HEAD KVM/arm64 fixes for 6.14, take #3 - Fix TCR_EL2 configuration to not use the ASID in TTBR1_EL2 and not mess-up T1SZ/PS by using the HCR_EL2.E2H==0 layout. - Bring back the VMID allocation to the vcpu_load phase, ensuring that we only setup VTTBR_EL2 once on VHE. This cures an ugly race that would lead to running with an unallocated VMID.
kdave
pushed a commit
that referenced
this pull request
Mar 10, 2025
Use raw_spinlock in order to fix spurious messages about invalid context when spinlock debugging is enabled. The lock is only used to serialize register access. [ 4.239592] ============================= [ 4.239595] [ BUG: Invalid wait context ] [ 4.239599] 6.13.0-rc7-arm64-renesas-05496-gd088502a519f torvalds#35 Not tainted [ 4.239603] ----------------------------- [ 4.239606] kworker/u8:5/76 is trying to lock: [ 4.239609] ffff0000091898a0 (&p->lock){....}-{3:3}, at: gpio_rcar_config_interrupt_input_mode+0x34/0x164 [ 4.239641] other info that might help us debug this: [ 4.239643] context-{5:5} [ 4.239646] 5 locks held by kworker/u8:5/76: [ 4.239651] #0: ffff0000080fb148 ((wq_completion)async){+.+.}-{0:0}, at: process_one_work+0x190/0x62c [ 4.250180] OF: /soc/sound@ec500000/ports/port@0/endpoint: Read of boolean property 'frame-master' with a value. [ 4.254094] #1: ffff80008299bd80 ((work_completion)(&entry->work)){+.+.}-{0:0}, at: process_one_work+0x1b8/0x62c [ 4.254109] #2: ffff00000920c8f8 [ 4.258345] OF: /soc/sound@ec500000/ports/port@1/endpoint: Read of boolean property 'bitclock-master' with a value. [ 4.264803] (&dev->mutex){....}-{4:4}, at: __device_attach_async_helper+0x3c/0xdc [ 4.264820] #3: ffff00000a50ca40 (request_class#2){+.+.}-{4:4}, at: __setup_irq+0xa0/0x690 [ 4.264840] #4: [ 4.268872] OF: /soc/sound@ec500000/ports/port@1/endpoint: Read of boolean property 'frame-master' with a value. [ 4.273275] ffff00000a50c8c8 (lock_class){....}-{2:2}, at: __setup_irq+0xc4/0x690 [ 4.296130] renesas_sdhi_internal_dmac ee100000.mmc: mmc1 base at 0x00000000ee100000, max clock rate 200 MHz [ 4.304082] stack backtrace: [ 4.304086] CPU: 1 UID: 0 PID: 76 Comm: kworker/u8:5 Not tainted 6.13.0-rc7-arm64-renesas-05496-gd088502a519f torvalds#35 [ 4.304092] Hardware name: Renesas Salvator-X 2nd version board based on r8a77965 (DT) [ 4.304097] Workqueue: async async_run_entry_fn [ 4.304106] Call trace: [ 4.304110] show_stack+0x14/0x20 (C) [ 4.304122] dump_stack_lvl+0x6c/0x90 [ 4.304131] dump_stack+0x14/0x1c [ 4.304138] __lock_acquire+0xdfc/0x1584 [ 4.426274] lock_acquire+0x1c4/0x33c [ 4.429942] _raw_spin_lock_irqsave+0x5c/0x80 [ 4.434307] gpio_rcar_config_interrupt_input_mode+0x34/0x164 [ 4.440061] gpio_rcar_irq_set_type+0xd4/0xd8 [ 4.444422] __irq_set_trigger+0x5c/0x178 [ 4.448435] __setup_irq+0x2e4/0x690 [ 4.452012] request_threaded_irq+0xc4/0x190 [ 4.456285] devm_request_threaded_irq+0x7c/0xf4 [ 4.459398] ata1: link resume succeeded after 1 retries [ 4.460902] mmc_gpiod_request_cd_irq+0x68/0xe0 [ 4.470660] mmc_start_host+0x50/0xac [ 4.474327] mmc_add_host+0x80/0xe4 [ 4.477817] tmio_mmc_host_probe+0x2b0/0x440 [ 4.482094] renesas_sdhi_probe+0x488/0x6f4 [ 4.486281] renesas_sdhi_internal_dmac_probe+0x60/0x78 [ 4.491509] platform_probe+0x64/0xd8 [ 4.495178] really_probe+0xb8/0x2a8 [ 4.498756] __driver_probe_device+0x74/0x118 [ 4.503116] driver_probe_device+0x3c/0x154 [ 4.507303] __device_attach_driver+0xd4/0x160 [ 4.511750] bus_for_each_drv+0x84/0xe0 [ 4.515588] __device_attach_async_helper+0xb0/0xdc [ 4.520470] async_run_entry_fn+0x30/0xd8 [ 4.524481] process_one_work+0x210/0x62c [ 4.528494] worker_thread+0x1ac/0x340 [ 4.532245] kthread+0x10c/0x110 [ 4.535476] ret_from_fork+0x10/0x20 Signed-off-by: Niklas Söderlund <niklas.soderlund+renesas@ragnatech.se> Reviewed-by: Geert Uytterhoeven <geert+renesas@glider.be> Tested-by: Geert Uytterhoeven <geert+renesas@glider.be> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20250121135833.3769310-1-niklas.soderlund+renesas@ragnatech.se Signed-off-by: Bartosz Golaszewski <bartosz.golaszewski@linaro.org>
kdave
pushed a commit
that referenced
this pull request
Mar 14, 2025
When mb-xdp is set and return is XDP_PASS, packet is converted from xdp_buff to sk_buff with xdp_update_skb_shared_info() in bnxt_xdp_build_skb(). bnxt_xdp_build_skb() passes incorrect truesize argument to xdp_update_skb_shared_info(). The truesize is calculated as BNXT_RX_PAGE_SIZE * sinfo->nr_frags but the skb_shared_info was wiped by napi_build_skb() before. So it stores sinfo->nr_frags before bnxt_xdp_build_skb() and use it instead of getting skb_shared_info from xdp_get_shared_info_from_buff(). Splat looks like: ------------[ cut here ]------------ WARNING: CPU: 2 PID: 0 at net/core/skbuff.c:6072 skb_try_coalesce+0x504/0x590 Modules linked in: xt_nat xt_tcpudp veth af_packet xt_conntrack nft_chain_nat xt_MASQUERADE nf_conntrack_netlink xfrm_user xt_addrtype nft_coms CPU: 2 UID: 0 PID: 0 Comm: swapper/2 Not tainted 6.14.0-rc2+ #3 RIP: 0010:skb_try_coalesce+0x504/0x590 Code: 4b fd ff ff 49 8b 34 24 40 80 e6 40 0f 84 3d fd ff ff 49 8b 74 24 48 40 f6 c6 01 0f 84 2e fd ff ff 48 8d 4e ff e9 25 fd ff ff <0f> 0b e99 RSP: 0018:ffffb62c4120caa8 EFLAGS: 00010287 RAX: 0000000000000003 RBX: ffffb62c4120cb14 RCX: 0000000000000ec0 RDX: 0000000000001000 RSI: ffffa06e5d7dc000 RDI: 0000000000000003 RBP: ffffa06e5d7ddec0 R08: ffffa06e6120a800 R09: ffffa06e7a119900 R10: 0000000000002310 R11: ffffa06e5d7dcec0 R12: ffffe4360575f740 R13: ffffe43600000000 R14: 0000000000000002 R15: 0000000000000002 FS: 0000000000000000(0000) GS:ffffa0755f700000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f147b76b0f8 CR3: 00000001615d4000 CR4: 00000000007506f0 PKRU: 55555554 Call Trace: <IRQ> ? __warn+0x84/0x130 ? skb_try_coalesce+0x504/0x590 ? report_bug+0x18a/0x1a0 ? handle_bug+0x53/0x90 ? exc_invalid_op+0x14/0x70 ? asm_exc_invalid_op+0x16/0x20 ? skb_try_coalesce+0x504/0x590 inet_frag_reasm_finish+0x11f/0x2e0 ip_defrag+0x37a/0x900 ip_local_deliver+0x51/0x120 ip_sublist_rcv_finish+0x64/0x70 ip_sublist_rcv+0x179/0x210 ip_list_rcv+0xf9/0x130 How to reproduce: <Node A> ip link set $interface1 xdp obj xdp_pass.o ip link set $interface1 mtu 9000 up ip a a 10.0.0.1/24 dev $interface1 <Node B> ip link set $interfac2 mtu 9000 up ip a a 10.0.0.2/24 dev $interface2 ping 10.0.0.1 -s 65000 Following ping.py patch adds xdp-mb-pass case. so ping.py is going to be able to reproduce this issue. Fixes: 1dc4c55 ("bnxt: adding bnxt_xdp_build_skb to build skb from multibuffer xdp_buff") Signed-off-by: Taehee Yoo <ap420073@gmail.com> Link: https://patch.msgid.link/20250309134219.91670-2-ap420073@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
kdave
pushed a commit
that referenced
this pull request
Mar 14, 2025
…cal section A circular lock dependency splat has been seen involving down_trylock(): ====================================================== WARNING: possible circular locking dependency detected 6.12.0-41.el10.s390x+debug ------------------------------------------------------ dd/32479 is trying to acquire lock: 0015a20accd0d4f8 ((console_sem).lock){-.-.}-{2:2}, at: down_trylock+0x26/0x90 but task is already holding lock: 000000017e461698 (&zone->lock){-.-.}-{2:2}, at: rmqueue_bulk+0xac/0x8f0 the existing dependency chain (in reverse order) is: -> #4 (&zone->lock){-.-.}-{2:2}: -> #3 (hrtimer_bases.lock){-.-.}-{2:2}: -> #2 (&rq->__lock){-.-.}-{2:2}: -> #1 (&p->pi_lock){-.-.}-{2:2}: -> #0 ((console_sem).lock){-.-.}-{2:2}: The console_sem -> pi_lock dependency is due to calling try_to_wake_up() while holding the console_sem raw_spinlock. This dependency can be broken by using wake_q to do the wakeup instead of calling try_to_wake_up() under the console_sem lock. This will also make the semaphore's raw_spinlock become a terminal lock without taking any further locks underneath it. The hrtimer_bases.lock is a raw_spinlock while zone->lock is a spinlock. The hrtimer_bases.lock -> zone->lock dependency happens via the debug_objects_fill_pool() helper function in the debugobjects code. -> #4 (&zone->lock){-.-.}-{2:2}: __lock_acquire+0xe86/0x1cc0 lock_acquire.part.0+0x258/0x630 lock_acquire+0xb8/0xe0 _raw_spin_lock_irqsave+0xb4/0x120 rmqueue_bulk+0xac/0x8f0 __rmqueue_pcplist+0x580/0x830 rmqueue_pcplist+0xfc/0x470 rmqueue.isra.0+0xdec/0x11b0 get_page_from_freelist+0x2ee/0xeb0 __alloc_pages_noprof+0x2c2/0x520 alloc_pages_mpol_noprof+0x1fc/0x4d0 alloc_pages_noprof+0x8c/0xe0 allocate_slab+0x320/0x460 ___slab_alloc+0xa58/0x12b0 __slab_alloc.isra.0+0x42/0x60 kmem_cache_alloc_noprof+0x304/0x350 fill_pool+0xf6/0x450 debug_object_activate+0xfe/0x360 enqueue_hrtimer+0x34/0x190 __run_hrtimer+0x3c8/0x4c0 __hrtimer_run_queues+0x1b2/0x260 hrtimer_interrupt+0x316/0x760 do_IRQ+0x9a/0xe0 do_irq_async+0xf6/0x160 Normally a raw_spinlock to spinlock dependency is not legitimate and will be warned if CONFIG_PROVE_RAW_LOCK_NESTING is enabled, but debug_objects_fill_pool() is an exception as it explicitly allows this dependency for non-PREEMPT_RT kernel without causing PROVE_RAW_LOCK_NESTING lockdep splat. As a result, this dependency is legitimate and not a bug. Anyway, semaphore is the only locking primitive left that is still using try_to_wake_up() to do wakeup inside critical section, all the other locking primitives had been migrated to use wake_q to do wakeup outside of the critical section. It is also possible that there are other circular locking dependencies involving printk/console_sem or other existing/new semaphores lurking somewhere which may show up in the future. Let just do the migration now to wake_q to avoid headache like this. Reported-by: yzbot+ed801a886dfdbfe7136d@syzkaller.appspotmail.com Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Boqun Feng <boqun.feng@gmail.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lore.kernel.org/r/20250307232717.1759087-3-boqun.feng@gmail.com
kdave
pushed a commit
that referenced
this pull request
Mar 25, 2025
Non-hybrid CPU variants that share the same Family/Model could be differentiated by their cpu-type. x86_match_cpu() currently does not use cpu-type for CPU matching. Dave Hansen suggested to use below conditions to match CPU-type: 1. If CPU_TYPE_ANY (the wildcard), then matched 2. If hybrid, then matched 3. If !hybrid, look at the boot CPU and compare the cpu-type to determine if it is a match. This special case for hybrid systems allows more compact vulnerability list. Imagine that "Haswell" CPUs might or might not be hybrid and that only Atom cores are vulnerable to Meltdown. That means there are three possibilities: 1. P-core only 2. Atom only 3. Atom + P-core (aka. hybrid) One might be tempted to code up the vulnerability list like this: MATCH( HASWELL, X86_FEATURE_HYBRID, MELTDOWN) MATCH_TYPE(HASWELL, ATOM, MELTDOWN) Logically, this matches #2 and #3. But that's a little silly. You would only ask for the "ATOM" match in cases where there *WERE* hybrid cores in play. You shouldn't have to _also_ ask for hybrid cores explicitly. In short, assume that processors that enumerate Hybrid==1 have a vulnerable core type. Update x86_match_cpu() to also match cpu-type. Also treat hybrid systems as special, and match them to any cpu-type. Suggested-by: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://lore.kernel.org/r/20250311-add-cpu-type-v8-4-e8514dcaaff2@linux.intel.com
kdave
pushed a commit
that referenced
this pull request
Mar 25, 2025
syzbot reported a deadlock in lock_system_sleep() (see below). The write operation to "/sys/module/hibernate/parameters/compressor" conflicts with the registration of ieee80211 device, resulting in a deadlock when attempting to acquire system_transition_mutex under param_lock. To avoid this deadlock, change hibernate_compressor_param_set() to use mutex_trylock() for attempting to acquire system_transition_mutex and return -EBUSY when it fails. Task flags need not be saved or adjusted before calling mutex_trylock(&system_transition_mutex) because the caller is not going to end up waiting for this mutex and if it runs concurrently with system suspend in progress, it will be frozen properly when it returns to user space. syzbot report: syz-executor895/5833 is trying to acquire lock: ffffffff8e0828c8 (system_transition_mutex){+.+.}-{4:4}, at: lock_system_sleep+0x87/0xa0 kernel/power/main.c:56 but task is already holding lock: ffffffff8e07dc68 (param_lock){+.+.}-{4:4}, at: kernel_param_lock kernel/params.c:607 [inline] ffffffff8e07dc68 (param_lock){+.+.}-{4:4}, at: param_attr_store+0xe6/0x300 kernel/params.c:586 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #3 (param_lock){+.+.}-{4:4}: __mutex_lock_common kernel/locking/mutex.c:585 [inline] __mutex_lock+0x19b/0xb10 kernel/locking/mutex.c:730 ieee80211_rate_control_ops_get net/mac80211/rate.c:220 [inline] rate_control_alloc net/mac80211/rate.c:266 [inline] ieee80211_init_rate_ctrl_alg+0x18d/0x6b0 net/mac80211/rate.c:1015 ieee80211_register_hw+0x20cd/0x4060 net/mac80211/main.c:1531 mac80211_hwsim_new_radio+0x304e/0x54e0 drivers/net/wireless/virtual/mac80211_hwsim.c:5558 init_mac80211_hwsim+0x432/0x8c0 drivers/net/wireless/virtual/mac80211_hwsim.c:6910 do_one_initcall+0x128/0x700 init/main.c:1257 do_initcall_level init/main.c:1319 [inline] do_initcalls init/main.c:1335 [inline] do_basic_setup init/main.c:1354 [inline] kernel_init_freeable+0x5c7/0x900 init/main.c:1568 kernel_init+0x1c/0x2b0 init/main.c:1457 ret_from_fork+0x45/0x80 arch/x86/kernel/process.c:148 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:244 -> #2 (rtnl_mutex){+.+.}-{4:4}: __mutex_lock_common kernel/locking/mutex.c:585 [inline] __mutex_lock+0x19b/0xb10 kernel/locking/mutex.c:730 wg_pm_notification drivers/net/wireguard/device.c:80 [inline] wg_pm_notification+0x49/0x180 drivers/net/wireguard/device.c:64 notifier_call_chain+0xb7/0x410 kernel/notifier.c:85 notifier_call_chain_robust kernel/notifier.c:120 [inline] blocking_notifier_call_chain_robust kernel/notifier.c:345 [inline] blocking_notifier_call_chain_robust+0xc9/0x170 kernel/notifier.c:333 pm_notifier_call_chain_robust+0x27/0x60 kernel/power/main.c:102 snapshot_open+0x189/0x2b0 kernel/power/user.c:77 misc_open+0x35a/0x420 drivers/char/misc.c:179 chrdev_open+0x237/0x6a0 fs/char_dev.c:414 do_dentry_open+0x735/0x1c40 fs/open.c:956 vfs_open+0x82/0x3f0 fs/open.c:1086 do_open fs/namei.c:3830 [inline] path_openat+0x1e88/0x2d80 fs/namei.c:3989 do_filp_open+0x20c/0x470 fs/namei.c:4016 do_sys_openat2+0x17a/0x1e0 fs/open.c:1428 do_sys_open fs/open.c:1443 [inline] __do_sys_openat fs/open.c:1459 [inline] __se_sys_openat fs/open.c:1454 [inline] __x64_sys_openat+0x175/0x210 fs/open.c:1454 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0xcd/0x250 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x77/0x7f -> #1 ((pm_chain_head).rwsem){++++}-{4:4}: down_read+0x9a/0x330 kernel/locking/rwsem.c:1524 blocking_notifier_call_chain_robust kernel/notifier.c:344 [inline] blocking_notifier_call_chain_robust+0xa9/0x170 kernel/notifier.c:333 pm_notifier_call_chain_robust+0x27/0x60 kernel/power/main.c:102 snapshot_open+0x189/0x2b0 kernel/power/user.c:77 misc_open+0x35a/0x420 drivers/char/misc.c:179 chrdev_open+0x237/0x6a0 fs/char_dev.c:414 do_dentry_open+0x735/0x1c40 fs/open.c:956 vfs_open+0x82/0x3f0 fs/open.c:1086 do_open fs/namei.c:3830 [inline] path_openat+0x1e88/0x2d80 fs/namei.c:3989 do_filp_open+0x20c/0x470 fs/namei.c:4016 do_sys_openat2+0x17a/0x1e0 fs/open.c:1428 do_sys_open fs/open.c:1443 [inline] __do_sys_openat fs/open.c:1459 [inline] __se_sys_openat fs/open.c:1454 [inline] __x64_sys_openat+0x175/0x210 fs/open.c:1454 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0xcd/0x250 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x77/0x7f -> #0 (system_transition_mutex){+.+.}-{4:4}: check_prev_add kernel/locking/lockdep.c:3163 [inline] check_prevs_add kernel/locking/lockdep.c:3282 [inline] validate_chain kernel/locking/lockdep.c:3906 [inline] __lock_acquire+0x249e/0x3c40 kernel/locking/lockdep.c:5228 lock_acquire.part.0+0x11b/0x380 kernel/locking/lockdep.c:5851 __mutex_lock_common kernel/locking/mutex.c:585 [inline] __mutex_lock+0x19b/0xb10 kernel/locking/mutex.c:730 lock_system_sleep+0x87/0xa0 kernel/power/main.c:56 hibernate_compressor_param_set+0x1c/0x210 kernel/power/hibernate.c:1452 param_attr_store+0x18f/0x300 kernel/params.c:588 module_attr_store+0x55/0x80 kernel/params.c:924 sysfs_kf_write+0x117/0x170 fs/sysfs/file.c:139 kernfs_fop_write_iter+0x33d/0x500 fs/kernfs/file.c:334 new_sync_write fs/read_write.c:586 [inline] vfs_write+0x5ae/0x1150 fs/read_write.c:679 ksys_write+0x12b/0x250 fs/read_write.c:731 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0xcd/0x250 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x77/0x7f other info that might help us debug this: Chain exists of: system_transition_mutex --> rtnl_mutex --> param_lock Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(param_lock); lock(rtnl_mutex); lock(param_lock); lock(system_transition_mutex); *** DEADLOCK *** Reported-by: syzbot+ace60642828c074eb913@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=ace60642828c074eb913 Signed-off-by: Lizhi Xu <lizhi.xu@windriver.com> Link: https://patch.msgid.link/20250224013139.3994500-1-lizhi.xu@windriver.com [ rjw: New subject matching the code changes, changelog edits ] Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
kdave
pushed a commit
that referenced
this pull request
Mar 31, 2025
perf test 11 hwmon fails on s390 with this error # ./perf test -Fv 11 --- start --- ---- end ---- 11.1: Basic parsing test : Ok --- start --- Testing 'temp_test_hwmon_event1' Using CPUID IBM,3931,704,A01,3.7,002f temp_test_hwmon_event1 -> hwmon_a_test_hwmon_pmu/temp_test_hwmon_event1/ FAILED tests/hwmon_pmu.c:189 Unexpected config for 'temp_test_hwmon_event1', 292470092988416 != 655361 ---- end ---- 11.2: Parsing without PMU name : FAILED! --- start --- Testing 'hwmon_a_test_hwmon_pmu/temp_test_hwmon_event1/' FAILED tests/hwmon_pmu.c:189 Unexpected config for 'hwmon_a_test_hwmon_pmu/temp_test_hwmon_event1/', 292470092988416 != 655361 ---- end ---- 11.3: Parsing with PMU name : FAILED! # The root cause is in member test_event::config which is initialized to 0xA0001 or 655361. During event parsing a long list event parsing functions are called and end up with this gdb call stack: #0 hwmon_pmu__config_term (hwm=0x168dfd0, attr=0x3ffffff5ee8, term=0x168db60, err=0x3ffffff81c8) at util/hwmon_pmu.c:623 #1 hwmon_pmu__config_terms (pmu=0x168dfd0, attr=0x3ffffff5ee8, terms=0x3ffffff5ea8, err=0x3ffffff81c8) at util/hwmon_pmu.c:662 #2 0x00000000012f870c in perf_pmu__config_terms (pmu=0x168dfd0, attr=0x3ffffff5ee8, terms=0x3ffffff5ea8, zero=false, apply_hardcoded=false, err=0x3ffffff81c8) at util/pmu.c:1519 #3 0x00000000012f88a4 in perf_pmu__config (pmu=0x168dfd0, attr=0x3ffffff5ee8, head_terms=0x3ffffff5ea8, apply_hardcoded=false, err=0x3ffffff81c8) at util/pmu.c:1545 #4 0x00000000012680c4 in parse_events_add_pmu (parse_state=0x3ffffff7fb8, list=0x168dc00, pmu=0x168dfd0, const_parsed_terms=0x3ffffff6090, auto_merge_stats=true, alternate_hw_config=10) at util/parse-events.c:1508 #5 0x00000000012684c6 in parse_events_multi_pmu_add (parse_state=0x3ffffff7fb8, event_name=0x168ec10 "temp_test_hwmon_event1", hw_config=10, const_parsed_terms=0x0, listp=0x3ffffff6230, loc_=0x3ffffff70e0) at util/parse-events.c:1592 #6 0x00000000012f0e4e in parse_events_parse (_parse_state=0x3ffffff7fb8, scanner=0x16878c0) at util/parse-events.y:293 #7 0x00000000012695a0 in parse_events__scanner (str=0x3ffffff81d8 "temp_test_hwmon_event1", input=0x0, parse_state=0x3ffffff7fb8) at util/parse-events.c:1867 #8 0x000000000126a1e8 in __parse_events (evlist=0x168b580, str=0x3ffffff81d8 "temp_test_hwmon_event1", pmu_filter=0x0, err=0x3ffffff81c8, fake_pmu=false, warn_if_reordered=true, fake_tp=false) at util/parse-events.c:2136 #9 0x00000000011e36aa in parse_events (evlist=0x168b580, str=0x3ffffff81d8 "temp_test_hwmon_event1", err=0x3ffffff81c8) at /root/linux/tools/perf/util/parse-events.h:41 #10 0x00000000011e3e64 in do_test (i=0, with_pmu=false, with_alias=false) at tests/hwmon_pmu.c:164 torvalds#11 0x00000000011e422c in test__hwmon_pmu (with_pmu=false) at tests/hwmon_pmu.c:219 torvalds#12 0x00000000011e431c in test__hwmon_pmu_without_pmu (test=0x1610368 <suite.hwmon_pmu>, subtest=1) at tests/hwmon_pmu.c:23 where the attr::config is set to value 292470092988416 or 0x10a0000000000 in line 625 of file ./util/hwmon_pmu.c: attr->config = key.type_and_num; However member key::type_and_num is defined as union and bit field: union hwmon_pmu_event_key { long type_and_num; struct { int num :16; enum hwmon_type type :8; }; }; s390 is big endian and Intel is little endian architecture. The events for the hwmon dummy pmu have num = 1 or num = 2 and type is set to HWMON_TYPE_TEMP (which is 10). On s390 this assignes member key::type_and_num the value of 0x10a0000000000 (which is 292470092988416) as shown in above trace output. Fix this and export the structure/union hwmon_pmu_event_key so the test shares the same implementation as the event parsing functions for union and bit fields. This should avoid endianess issues on all platforms. Output after: # ./perf test -F 11 11.1: Basic parsing test : Ok 11.2: Parsing without PMU name : Ok 11.3: Parsing with PMU name : Ok # Fixes: 531ee0f ("perf test: Add hwmon "PMU" test") Signed-off-by: Thomas Richter <tmricht@linux.ibm.com> Reviewed-by: Ian Rogers <irogers@google.com> Link: https://lore.kernel.org/r/20250131112400.568975-1-tmricht@linux.ibm.com Signed-off-by: Namhyung Kim <namhyung@kernel.org>
kdave
pushed a commit
that referenced
this pull request
Mar 31, 2025
Ian told me that there are many memory leaks in the hierarchy mode. I can easily reproduce it with the follwing command. $ make DEBUG=1 EXTRA_CFLAGS=-fsanitize=leak $ perf record --latency -g -- ./perf test -w thloop $ perf report -H --stdio ... Indirect leak of 168 byte(s) in 21 object(s) allocated from: #0 0x7f3414c16c65 in malloc ../../../../src/libsanitizer/lsan/lsan_interceptors.cpp:75 #1 0x55ed3602346e in map__get util/map.h:189 #2 0x55ed36024cc4 in hist_entry__init util/hist.c:476 #3 0x55ed36025208 in hist_entry__new util/hist.c:588 #4 0x55ed36027c05 in hierarchy_insert_entry util/hist.c:1587 #5 0x55ed36027e2e in hists__hierarchy_insert_entry util/hist.c:1638 #6 0x55ed36027fa4 in hists__collapse_insert_entry util/hist.c:1685 #7 0x55ed360283e8 in hists__collapse_resort util/hist.c:1776 #8 0x55ed35de0323 in report__collapse_hists /home/namhyung/project/linux/tools/perf/builtin-report.c:735 #9 0x55ed35de15b4 in __cmd_report /home/namhyung/project/linux/tools/perf/builtin-report.c:1119 #10 0x55ed35de43dc in cmd_report /home/namhyung/project/linux/tools/perf/builtin-report.c:1867 torvalds#11 0x55ed35e66767 in run_builtin /home/namhyung/project/linux/tools/perf/perf.c:351 torvalds#12 0x55ed35e66a0e in handle_internal_command /home/namhyung/project/linux/tools/perf/perf.c:404 torvalds#13 0x55ed35e66b67 in run_argv /home/namhyung/project/linux/tools/perf/perf.c:448 torvalds#14 0x55ed35e66eb0 in main /home/namhyung/project/linux/tools/perf/perf.c:556 torvalds#15 0x7f340ac33d67 in __libc_start_call_main ../sysdeps/nptl/libc_start_call_main.h:58 ... $ perf report -H --stdio 2>&1 | grep -c '^Indirect leak' 93 I found that hist_entry__delete() missed to release child entries in the hierarchy tree (hroot_{in,out}). It needs to iterate the child entries and call hist_entry__delete() recursively. After this change: $ perf report -H --stdio 2>&1 | grep -c '^Indirect leak' 0 Reported-by: Ian Rogers <irogers@google.com> Tested-by Thomas Falcon <thomas.falcon@intel.com> Reviewed-by: Ian Rogers <irogers@google.com> Link: https://lore.kernel.org/r/20250307061250.320849-2-namhyung@kernel.org Signed-off-by: Namhyung Kim <namhyung@kernel.org>
kdave
pushed a commit
that referenced
this pull request
Mar 31, 2025
The env.pmu_mapping can be leaked when it reads data from a pipe on AMD. For a pipe data, it reads the header data including pmu_mapping from PERF_RECORD_HEADER_FEATURE runtime. But it's already set in: perf_session__new() __perf_session__new() evlist__init_trace_event_sample_raw() evlist__has_amd_ibs() perf_env__nr_pmu_mappings() Then it'll overwrite that when it processes the HEADER_FEATURE record. Here's a report from address sanitizer. Direct leak of 2689 byte(s) in 1 object(s) allocated from: #0 0x7fed8f814596 in realloc ../../../../src/libsanitizer/lsan/lsan_interceptors.cpp:98 #1 0x5595a7d416b1 in strbuf_grow util/strbuf.c:64 #2 0x5595a7d414ef in strbuf_init util/strbuf.c:25 #3 0x5595a7d0f4b7 in perf_env__read_pmu_mappings util/env.c:362 #4 0x5595a7d12ab7 in perf_env__nr_pmu_mappings util/env.c:517 #5 0x5595a7d89d2f in evlist__has_amd_ibs util/amd-sample-raw.c:315 #6 0x5595a7d87fb2 in evlist__init_trace_event_sample_raw util/sample-raw.c:23 #7 0x5595a7d7f893 in __perf_session__new util/session.c:179 #8 0x5595a7b79572 in perf_session__new util/session.h:115 #9 0x5595a7b7e9dc in cmd_report builtin-report.c:1603 #10 0x5595a7c019eb in run_builtin perf.c:351 torvalds#11 0x5595a7c01c92 in handle_internal_command perf.c:404 torvalds#12 0x5595a7c01deb in run_argv perf.c:448 torvalds#13 0x5595a7c02134 in main perf.c:556 torvalds#14 0x7fed85833d67 in __libc_start_call_main ../sysdeps/nptl/libc_start_call_main.h:58 Let's free the existing pmu_mapping data if any. Cc: Ravi Bangoria <ravi.bangoria@amd.com> Link: https://lore.kernel.org/r/20250311000416.817631-1-namhyung@kernel.org Signed-off-by: Namhyung Kim <namhyung@kernel.org>
kdave
pushed a commit
that referenced
this pull request
Apr 1, 2025
Patch series "Fix calculations in trace_balance_dirty_pages() for cgwb", v2. In my experiment, I found that the output of trace_balance_dirty_pages() in the cgroup writeback scenario was strange because trace_balance_dirty_pages() always uses global_wb_domain.dirty_limit for related calculations instead of the dirty_limit of the corresponding memcg's wb_domain. The basic idea of the fix is to store the hard dirty limit value computed in wb_position_ratio() into struct dirty_throttle_control and use it for calculations in trace_balance_dirty_pages(). This patch (of 3): Currently, trace_balance_dirty_pages() already has 12 parameters. In the patch #3, I initially attempted to introduce an additional parameter. However, in include/linux/trace_events.h, bpf_trace_run12() only supports up to 12 parameters and bpf_trace_run13() does not exist. To reduce the number of parameters in trace_balance_dirty_pages(), we can make it accept a pointer to struct dirty_throttle_control as a parameter. To achieve this, we need to move the definition of struct dirty_throttle_control from mm/page-writeback.c to include/linux/writeback.h. Link: https://lkml.kernel.org/r/20250304110318.159567-1-yizhou.tang@shopee.com Link: https://lkml.kernel.org/r/20250304110318.159567-2-yizhou.tang@shopee.com Signed-off-by: Tang Yizhou <yizhou.tang@shopee.com> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Christian Brauner <brauner@kernel.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Jan Kara <jack@suse.cz> Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org> Cc: Matthew Wilcow (Oracle) <willy@infradead.org> Cc: Tang Yizhou <yizhou.tang@shopee.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
kdave
pushed a commit
that referenced
this pull request
Apr 1, 2025
Patch series "mm: page_ext: Introduce new iteration API", v3. Introduction ============ [ Thanks to David Hildenbrand for identifying the root cause of this issue and proving guidance on how to fix it. The new API idea, bugs and misconceptions are all mine though ] Currently, trying to reserve 1G pages with page_owner=on and sparsemem causes a crash. The reproducer is very simple: 1. Build the kernel with CONFIG_SPARSEMEM=y and the table extensions 2. Pass 'default_hugepagesz=1 page_owner=on' in the kernel command-line 3. Reserve one 1G page at run-time, this should crash (see patch 1 for backtrace) [ A crash with page_table_check is also possible, but harder to trigger ] Apparently, starting with commit cf54f31 ("mm/hugetlb: use __GFP_COMP for gigantic folios") we now pass the full allocation order to page extension clients and the page extension implementation assumes that all PFNs of an allocation range will be stored in the same memory section (which is not true for 1G pages). To fix this, this series introduces a new iteration API for page extension objects. The API checks if the next page extension object can be retrieved from the current section or if it needs to look up for it in another section. Please, find all details in patch 1. I tested this series on arm64 and x86 by reserving 1G pages at run-time and doing kernel builds (always with page_owner=on and page_table_check=on). This patch (of 3): The page extension implementation assumes that all page extensions of a given page order are stored in the same memory section. The function page_ext_next() relies on this assumption by adding an offset to the current object to return the next adjacent page extension. This behavior works as expected for flatmem but fails for sparsemem when using 1G pages. The commit cf54f31 ("mm/hugetlb: use __GFP_COMP for gigantic folios") exposes this issue, making it possible for a crash when using page_owner or page_table_check page extensions. The problem is that for 1G pages, the page extensions may span memory section boundaries and be stored in different memory sections. This issue was not visible before commit cf54f31 ("mm/hugetlb: use __GFP_COMP for gigantic folios") because alloc_contig_pages() never passed more than MAX_PAGE_ORDER to post_alloc_hook(). However, the series introducing mentioned commit changed this behavior allowing the full 1G page order to be passed. Reproducer: 1. Build the kernel with CONFIG_SPARSEMEM=y and table extensions support 2. Pass 'default_hugepagesz=1 page_owner=on' in the kernel command-line 3. Reserve one 1G page at run-time, this should crash (backtrace below) To address this issue, this commit introduces a new API for iterating through page extensions. The main iteration macro is for_each_page_ext() and it must be called with the RCU read lock taken. Here's an usage example: """ struct page_ext_iter iter; struct page_ext *page_ext; ... rcu_read_lock(); for_each_page_ext(page, 1 << order, page_ext, iter) { struct my_page_ext *obj = get_my_page_ext_obj(page_ext); ... } rcu_read_unlock(); """ The loop construct uses page_ext_iter_next() which checks to see if we have crossed sections in the iteration. In this case, page_ext_iter_next() retrieves the next page_ext object from another section. Thanks to David Hildenbrand for helping identify the root cause and providing suggestions on how to fix and optmize the solution (final implementation and bugs are all mine through). Lastly, here's the backtrace, without kasan you can get random crashes: [ 76.052526] BUG: KASAN: slab-out-of-bounds in __update_page_owner_handle+0x238/0x298 [ 76.060283] Write of size 4 at addr ffff07ff96240038 by task tee/3598 [ 76.066714] [ 76.068203] CPU: 88 UID: 0 PID: 3598 Comm: tee Kdump: loaded Not tainted 6.13.0-rep1 #3 [ 76.076202] Hardware name: WIWYNN Mt.Jade Server System B81.030Z1.0007/Mt.Jade Motherboard, BIOS 2.10.20220810 (SCP: 2.10.20220810) 2022/08/10 [ 76.088972] Call trace: [ 76.091411] show_stack+0x20/0x38 (C) [ 76.095073] dump_stack_lvl+0x80/0xf8 [ 76.098733] print_address_description.constprop.0+0x88/0x398 [ 76.104476] print_report+0xa8/0x278 [ 76.108041] kasan_report+0xa8/0xf8 [ 76.111520] __asan_report_store4_noabort+0x20/0x30 [ 76.116391] __update_page_owner_handle+0x238/0x298 [ 76.121259] __set_page_owner+0xdc/0x140 [ 76.125173] post_alloc_hook+0x190/0x1d8 [ 76.129090] alloc_contig_range_noprof+0x54c/0x890 [ 76.133874] alloc_contig_pages_noprof+0x35c/0x4a8 [ 76.138656] alloc_gigantic_folio.isra.0+0x2c0/0x368 [ 76.143616] only_alloc_fresh_hugetlb_folio.isra.0+0x24/0x150 [ 76.149353] alloc_pool_huge_folio+0x11c/0x1f8 [ 76.153787] set_max_huge_pages+0x364/0xca8 [ 76.157961] __nr_hugepages_store_common+0xb0/0x1a0 [ 76.162829] nr_hugepages_store+0x108/0x118 [ 76.167003] kobj_attr_store+0x3c/0x70 [ 76.170745] sysfs_kf_write+0xfc/0x188 [ 76.174492] kernfs_fop_write_iter+0x274/0x3e0 [ 76.178927] vfs_write+0x64c/0x8e0 [ 76.182323] ksys_write+0xf8/0x1f0 [ 76.185716] __arm64_sys_write+0x74/0xb0 [ 76.189630] invoke_syscall.constprop.0+0xd8/0x1e0 [ 76.194412] do_el0_svc+0x164/0x1e0 [ 76.197891] el0_svc+0x40/0xe0 [ 76.200939] el0t_64_sync_handler+0x144/0x168 [ 76.205287] el0t_64_sync+0x1ac/0x1b0 Link: https://lkml.kernel.org/r/cover.1741301089.git.luizcap@redhat.com Link: https://lkml.kernel.org/r/a45893880b7e1601082d39d2c5c8b50bcc096305.1741301089.git.luizcap@redhat.com Fixes: cf54f31 ("mm/hugetlb: use __GFP_COMP for gigantic folios") Signed-off-by: Luiz Capitulino <luizcap@redhat.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Luiz Capitulino <luizcap@redhat.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
kdave
pushed a commit
that referenced
this pull request
Apr 1, 2025
Syzkaller reports a bug as follows: Injecting memory failure for pfn 0x18b00e at process virtual address 0x20ffd000 Memory failure: 0x18b00e: dirty swapcache page still referenced by 2 users Memory failure: 0x18b00e: recovery action for dirty swapcache page: Failed page: refcount:2 mapcount:0 mapping:0000000000000000 index:0x20ffd pfn:0x18b00e memcg:ffff0000dd6d9000 anon flags: 0x5ffffe00482011(locked|dirty|arch_1|swapbacked|hwpoison|node=0|zone=2|lastcpupid=0xfffff) raw: 005ffffe00482011 dead000000000100 dead000000000122 ffff0000e232a7c9 raw: 0000000000020ffd 0000000000000000 00000002ffffffff ffff0000dd6d9000 page dumped because: VM_BUG_ON_FOLIO(!folio_test_uptodate(folio)) ------------[ cut here ]------------ kernel BUG at mm/swap_state.c:184! Internal error: Oops - BUG: 00000000f2000800 [#1] SMP Modules linked in: CPU: 0 PID: 60 Comm: kswapd0 Not tainted 6.6.0-gcb097e7de84e #3 Hardware name: linux,dummy-virt (DT) pstate: 80400005 (Nzcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--) pc : add_to_swap+0xbc/0x158 lr : add_to_swap+0xbc/0x158 sp : ffff800087f37340 x29: ffff800087f37340 x28: fffffc00052c0380 x27: ffff800087f37780 x26: ffff800087f37490 x25: ffff800087f37c78 x24: ffff800087f377a0 x23: ffff800087f37c50 x22: 0000000000000000 x21: fffffc00052c03b4 x20: 0000000000000000 x19: fffffc00052c0380 x18: 0000000000000000 x17: 296f696c6f662865 x16: 7461646f7470755f x15: 747365745f6f696c x14: 6f6621284f494c4f x13: 0000000000000001 x12: ffff600036d8b97b x11: 1fffe00036d8b97a x10: ffff600036d8b97a x9 : dfff800000000000 x8 : 00009fffc9274686 x7 : ffff0001b6c5cbd3 x6 : 0000000000000001 x5 : ffff0000c25896c0 x4 : 0000000000000000 x3 : 0000000000000000 x2 : 0000000000000000 x1 : ffff0000c25896c0 x0 : 0000000000000000 Call trace: add_to_swap+0xbc/0x158 shrink_folio_list+0x12ac/0x2648 shrink_inactive_list+0x318/0x948 shrink_lruvec+0x450/0x720 shrink_node_memcgs+0x280/0x4a8 shrink_node+0x128/0x978 balance_pgdat+0x4f0/0xb20 kswapd+0x228/0x438 kthread+0x214/0x230 ret_from_fork+0x10/0x20 I can reproduce this issue with the following steps: 1) When a dirty swapcache page is isolated by reclaim process and the page isn't locked, inject memory failure for the page. me_swapcache_dirty() clears uptodate flag and tries to delete from lru, but fails. Reclaim process will put the hwpoisoned page back to lru. 2) The process that maps the hwpoisoned page exits, the page is deleted the page will never be freed and will be in the lru forever. 3) If we trigger a reclaim again and tries to reclaim the page, add_to_swap() will trigger VM_BUG_ON_FOLIO due to the uptodate flag is cleared. To fix it, skip the hwpoisoned page in shrink_folio_list(). Besides, the hwpoison folio may not be unmapped by hwpoison_user_mappings() yet, unmap it in shrink_folio_list(), otherwise the folio will fail to be unmaped by hwpoison_user_mappings() since the folio isn't in lru list. Link: https://lkml.kernel.org/r/20250318083939.987651-3-tujinjiang@huawei.com Signed-off-by: Jinjiang Tu <tujinjiang@huawei.com> Acked-by: Miaohe Lin <linmiaohe@huawei.com> Cc: David Hildenbrand <david@redhat.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Nanyong Sun <sunnanyong@huawei.com> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: <stable@vger,kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
kdave
pushed a commit
that referenced
this pull request
Apr 3, 2025
When a bio with REQ_PREFLUSH is submitted to dm, __send_empty_flush() generates a flush_bio with REQ_OP_WRITE | REQ_PREFLUSH | REQ_SYNC, which causes the flush_bio to be throttled by wbt_wait(). An example from v5.4, similar problem also exists in upstream: crash> bt 2091206 PID: 2091206 TASK: ffff2050df92a300 CPU: 109 COMMAND: "kworker/u260:0" #0 [ffff800084a2f7f0] __switch_to at ffff80004008aeb8 #1 [ffff800084a2f820] __schedule at ffff800040bfa0c4 #2 [ffff800084a2f880] schedule at ffff800040bfa4b4 #3 [ffff800084a2f8a0] io_schedule at ffff800040bfa9c4 #4 [ffff800084a2f8c0] rq_qos_wait at ffff8000405925bc #5 [ffff800084a2f940] wbt_wait at ffff8000405bb3a0 #6 [ffff800084a2f9a0] __rq_qos_throttle at ffff800040592254 #7 [ffff800084a2f9c0] blk_mq_make_request at ffff80004057cf38 #8 [ffff800084a2fa60] generic_make_request at ffff800040570138 #9 [ffff800084a2fae0] submit_bio at ffff8000405703b4 #10 [ffff800084a2fb50] xlog_write_iclog at ffff800001280834 [xfs] torvalds#11 [ffff800084a2fbb0] xlog_sync at ffff800001280c3c [xfs] torvalds#12 [ffff800084a2fbf0] xlog_state_release_iclog at ffff800001280df4 [xfs] torvalds#13 [ffff800084a2fc10] xlog_write at ffff80000128203c [xfs] torvalds#14 [ffff800084a2fcd0] xlog_cil_push at ffff8000012846dc [xfs] torvalds#15 [ffff800084a2fda0] xlog_cil_push_work at ffff800001284a2c [xfs] torvalds#16 [ffff800084a2fdb0] process_one_work at ffff800040111d08 torvalds#17 [ffff800084a2fe00] worker_thread at ffff8000401121cc torvalds#18 [ffff800084a2fe70] kthread at ffff800040118de4 After commit 2def284 ("xfs: don't allow log IO to be throttled"), the metadata submitted by xlog_write_iclog() should not be throttled. But due to the existence of the dm layer, throttling flush_bio indirectly causes the metadata bio to be throttled. Fix this by conditionally adding REQ_IDLE to flush_bio.bi_opf, which makes wbt_should_throttle() return false to avoid wbt_wait(). Signed-off-by: Jinliang Zheng <alexjlzheng@tencent.com> Reviewed-by: Tianxiang Peng <txpeng@tencent.com> Reviewed-by: Hao Peng <flyingpeng@tencent.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
kdave
pushed a commit
that referenced
this pull request
Apr 3, 2025
A cache device failing to resume due to mapping errors should not be retried, as the failure leaves a partially initialized policy object. Repeating the resume operation risks triggering BUG_ON when reloading cache mappings into the incomplete policy object. Reproduce steps: 1. create a cache metadata consisting of 512 or more cache blocks, with some mappings stored in the first array block of the mapping array. Here we use cache_restore v1.0 to build the metadata. cat <<EOF >> cmeta.xml <superblock uuid="" block_size="128" nr_cache_blocks="512" \ policy="smq" hint_width="4"> <mappings> <mapping cache_block="0" origin_block="0" dirty="false"/> </mappings> </superblock> EOF dmsetup create cmeta --table "0 8192 linear /dev/sdc 0" cache_restore -i cmeta.xml -o /dev/mapper/cmeta --metadata-version=2 dmsetup remove cmeta 2. wipe the second array block of the mapping array to simulate data degradations. mapping_root=$(dd if=/dev/sdc bs=1c count=8 skip=192 \ 2>/dev/null | hexdump -e '1/8 "%u\n"') ablock=$(dd if=/dev/sdc bs=1c count=8 skip=$((4096*mapping_root+2056)) \ 2>/dev/null | hexdump -e '1/8 "%u\n"') dd if=/dev/zero of=/dev/sdc bs=4k count=1 seek=$ablock 3. try bringing up the cache device. The resume is expected to fail due to the broken array block. dmsetup create cmeta --table "0 8192 linear /dev/sdc 0" dmsetup create cdata --table "0 65536 linear /dev/sdc 8192" dmsetup create corig --table "0 524288 linear /dev/sdc 262144" dmsetup create cache --notable dmsetup load cache --table "0 524288 cache /dev/mapper/cmeta \ /dev/mapper/cdata /dev/mapper/corig 128 2 metadata2 writethrough smq 0" dmsetup resume cache 4. try resuming the cache again. An unexpected BUG_ON is triggered while loading cache mappings. dmsetup resume cache Kernel logs: (snip) ------------[ cut here ]------------ kernel BUG at drivers/md/dm-cache-policy-smq.c:752! Oops: invalid opcode: 0000 [#1] PREEMPT SMP KASAN NOPTI CPU: 0 UID: 0 PID: 332 Comm: dmsetup Not tainted 6.13.4 #3 RIP: 0010:smq_load_mapping+0x3e5/0x570 Fix by disallowing resume operations for devices that failed the initial attempt. Signed-off-by: Ming-Hung Tsai <mtsai@redhat.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
kdave
pushed a commit
that referenced
this pull request
Apr 3, 2025
There is a potential deadlock if we do report zones in an IO context, detailed in below lockdep report. When one process do a report zones and another process freezes the block device, the report zones side cannot allocate a tag because the freeze is already started. This can thus result in new block group creation to hang forever, blocking the write path. Thankfully, a new block group should be created on empty zones. So, reporting the zones is not necessary and we can set the write pointer = 0 and load the zone capacity from the block layer using bdev_zone_capacity() helper. ====================================================== WARNING: possible circular locking dependency detected 6.14.0-rc1 torvalds#252 Not tainted ------------------------------------------------------ modprobe/1110 is trying to acquire lock: ffff888100ac83e0 ((work_completion)(&(&wb->dwork)->work)){+.+.}-{0:0}, at: __flush_work+0x38f/0xb60 but task is already holding lock: ffff8881205b6f20 (&q->q_usage_counter(queue)torvalds#16){++++}-{0:0}, at: sd_remove+0x85/0x130 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #3 (&q->q_usage_counter(queue)torvalds#16){++++}-{0:0}: blk_queue_enter+0x3d9/0x500 blk_mq_alloc_request+0x47d/0x8e0 scsi_execute_cmd+0x14f/0xb80 sd_zbc_do_report_zones+0x1c1/0x470 sd_zbc_report_zones+0x362/0xd60 blkdev_report_zones+0x1b1/0x2e0 btrfs_get_dev_zones+0x215/0x7e0 [btrfs] btrfs_load_block_group_zone_info+0x6d2/0x2c10 [btrfs] btrfs_make_block_group+0x36b/0x870 [btrfs] btrfs_create_chunk+0x147d/0x2320 [btrfs] btrfs_chunk_alloc+0x2ce/0xcf0 [btrfs] start_transaction+0xce6/0x1620 [btrfs] btrfs_uuid_scan_kthread+0x4ee/0x5b0 [btrfs] kthread+0x39d/0x750 ret_from_fork+0x30/0x70 ret_from_fork_asm+0x1a/0x30 -> #2 (&fs_info->dev_replace.rwsem){++++}-{4:4}: down_read+0x9b/0x470 btrfs_map_block+0x2ce/0x2ce0 [btrfs] btrfs_submit_chunk+0x2d4/0x16c0 [btrfs] btrfs_submit_bbio+0x16/0x30 [btrfs] btree_write_cache_pages+0xb5a/0xf90 [btrfs] do_writepages+0x17f/0x7b0 __writeback_single_inode+0x114/0xb00 writeback_sb_inodes+0x52b/0xe00 wb_writeback+0x1a7/0x800 wb_workfn+0x12a/0xbd0 process_one_work+0x85a/0x1460 worker_thread+0x5e2/0xfc0 kthread+0x39d/0x750 ret_from_fork+0x30/0x70 ret_from_fork_asm+0x1a/0x30 -> #1 (&fs_info->zoned_meta_io_lock){+.+.}-{4:4}: __mutex_lock+0x1aa/0x1360 btree_write_cache_pages+0x252/0xf90 [btrfs] do_writepages+0x17f/0x7b0 __writeback_single_inode+0x114/0xb00 writeback_sb_inodes+0x52b/0xe00 wb_writeback+0x1a7/0x800 wb_workfn+0x12a/0xbd0 process_one_work+0x85a/0x1460 worker_thread+0x5e2/0xfc0 kthread+0x39d/0x750 ret_from_fork+0x30/0x70 ret_from_fork_asm+0x1a/0x30 -> #0 ((work_completion)(&(&wb->dwork)->work)){+.+.}-{0:0}: __lock_acquire+0x2f52/0x5ea0 lock_acquire+0x1b1/0x540 __flush_work+0x3ac/0xb60 wb_shutdown+0x15b/0x1f0 bdi_unregister+0x172/0x5b0 del_gendisk+0x841/0xa20 sd_remove+0x85/0x130 device_release_driver_internal+0x368/0x520 bus_remove_device+0x1f1/0x3f0 device_del+0x3bd/0x9c0 __scsi_remove_device+0x272/0x340 scsi_forget_host+0xf7/0x170 scsi_remove_host+0xd2/0x2a0 sdebug_driver_remove+0x52/0x2f0 [scsi_debug] device_release_driver_internal+0x368/0x520 bus_remove_device+0x1f1/0x3f0 device_del+0x3bd/0x9c0 device_unregister+0x13/0xa0 sdebug_do_remove_host+0x1fb/0x290 [scsi_debug] scsi_debug_exit+0x17/0x70 [scsi_debug] __do_sys_delete_module.isra.0+0x321/0x520 do_syscall_64+0x93/0x180 entry_SYSCALL_64_after_hwframe+0x76/0x7e other info that might help us debug this: Chain exists of: (work_completion)(&(&wb->dwork)->work) --> &fs_info->dev_replace.rwsem --> &q->q_usage_counter(queue)torvalds#16 Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(&q->q_usage_counter(queue)torvalds#16); lock(&fs_info->dev_replace.rwsem); lock(&q->q_usage_counter(queue)torvalds#16); lock((work_completion)(&(&wb->dwork)->work)); *** DEADLOCK *** 5 locks held by modprobe/1110: #0: ffff88811f7bc108 (&dev->mutex){....}-{4:4}, at: device_release_driver_internal+0x8f/0x520 #1: ffff8881022ee0e0 (&shost->scan_mutex){+.+.}-{4:4}, at: scsi_remove_host+0x20/0x2a0 #2: ffff88811b4c4378 (&dev->mutex){....}-{4:4}, at: device_release_driver_internal+0x8f/0x520 #3: ffff8881205b6f20 (&q->q_usage_counter(queue)torvalds#16){++++}-{0:0}, at: sd_remove+0x85/0x130 #4: ffffffffa3284360 (rcu_read_lock){....}-{1:3}, at: __flush_work+0xda/0xb60 stack backtrace: CPU: 0 UID: 0 PID: 1110 Comm: modprobe Not tainted 6.14.0-rc1 torvalds#252 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-3.fc41 04/01/2014 Call Trace: <TASK> dump_stack_lvl+0x6a/0x90 print_circular_bug.cold+0x1e0/0x274 check_noncircular+0x306/0x3f0 ? __pfx_check_noncircular+0x10/0x10 ? mark_lock+0xf5/0x1650 ? __pfx_check_irq_usage+0x10/0x10 ? lockdep_lock+0xca/0x1c0 ? __pfx_lockdep_lock+0x10/0x10 __lock_acquire+0x2f52/0x5ea0 ? __pfx___lock_acquire+0x10/0x10 ? __pfx_mark_lock+0x10/0x10 lock_acquire+0x1b1/0x540 ? __flush_work+0x38f/0xb60 ? __pfx_lock_acquire+0x10/0x10 ? __pfx_lock_release+0x10/0x10 ? mark_held_locks+0x94/0xe0 ? __flush_work+0x38f/0xb60 __flush_work+0x3ac/0xb60 ? __flush_work+0x38f/0xb60 ? __pfx_mark_lock+0x10/0x10 ? __pfx___flush_work+0x10/0x10 ? __pfx_wq_barrier_func+0x10/0x10 ? __pfx___might_resched+0x10/0x10 ? mark_held_locks+0x94/0xe0 wb_shutdown+0x15b/0x1f0 bdi_unregister+0x172/0x5b0 ? __pfx_bdi_unregister+0x10/0x10 ? up_write+0x1ba/0x510 del_gendisk+0x841/0xa20 ? __pfx_del_gendisk+0x10/0x10 ? _raw_spin_unlock_irqrestore+0x35/0x60 ? __pm_runtime_resume+0x79/0x110 sd_remove+0x85/0x130 device_release_driver_internal+0x368/0x520 ? kobject_put+0x5d/0x4a0 bus_remove_device+0x1f1/0x3f0 device_del+0x3bd/0x9c0 ? __pfx_device_del+0x10/0x10 __scsi_remove_device+0x272/0x340 scsi_forget_host+0xf7/0x170 scsi_remove_host+0xd2/0x2a0 sdebug_driver_remove+0x52/0x2f0 [scsi_debug] ? kernfs_remove_by_name_ns+0xc0/0xf0 device_release_driver_internal+0x368/0x520 ? kobject_put+0x5d/0x4a0 bus_remove_device+0x1f1/0x3f0 device_del+0x3bd/0x9c0 ? __pfx_device_del+0x10/0x10 ? __pfx___mutex_unlock_slowpath+0x10/0x10 device_unregister+0x13/0xa0 sdebug_do_remove_host+0x1fb/0x290 [scsi_debug] scsi_debug_exit+0x17/0x70 [scsi_debug] __do_sys_delete_module.isra.0+0x321/0x520 ? __pfx___do_sys_delete_module.isra.0+0x10/0x10 ? __pfx_slab_free_after_rcu_debug+0x10/0x10 ? kasan_save_stack+0x2c/0x50 ? kasan_record_aux_stack+0xa3/0xb0 ? __call_rcu_common.constprop.0+0xc4/0xfb0 ? kmem_cache_free+0x3a0/0x590 ? __x64_sys_close+0x78/0xd0 do_syscall_64+0x93/0x180 ? lock_is_held_type+0xd5/0x130 ? __call_rcu_common.constprop.0+0x3c0/0xfb0 ? lockdep_hardirqs_on+0x78/0x100 ? __call_rcu_common.constprop.0+0x3c0/0xfb0 ? __pfx___call_rcu_common.constprop.0+0x10/0x10 ? kmem_cache_free+0x3a0/0x590 ? lockdep_hardirqs_on_prepare+0x16d/0x400 ? do_syscall_64+0x9f/0x180 ? lockdep_hardirqs_on+0x78/0x100 ? do_syscall_64+0x9f/0x180 ? __pfx___x64_sys_openat+0x10/0x10 ? lockdep_hardirqs_on_prepare+0x16d/0x400 ? do_syscall_64+0x9f/0x180 ? lockdep_hardirqs_on+0x78/0x100 ? do_syscall_64+0x9f/0x180 entry_SYSCALL_64_after_hwframe+0x76/0x7e RIP: 0033:0x7f436712b68b RSP: 002b:00007ffe9f1a8658 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0 RAX: ffffffffffffffda RBX: 00005559b367fd80 RCX: 00007f436712b68b RDX: 0000000000000000 RSI: 0000000000000800 RDI: 00005559b367fde8 RBP: 00007ffe9f1a8680 R08: 1999999999999999 R09: 0000000000000000 R10: 00007f43671a5fe0 R11: 0000000000000206 R12: 0000000000000000 R13: 00007ffe9f1a86b0 R14: 0000000000000000 R15: 0000000000000000 </TASK> Reported-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com> CC: <stable@vger.kernel.org> # 6.13+ Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
kdave
pushed a commit
that referenced
this pull request
Apr 4, 2025
Two fixes from the recent logging changes: bch2_inconsistent(), bch2_fs_inconsistent() be called from interrupt context, or with rcu_read_lock() held. The one syzbot found is in bch2_bkey_pick_read_device bch2_dev_rcu bch2_fs_inconsistent We're starting to switch to lift the printbufs up to higher levels so we can emit better log messages and print them all in one go (avoid garbling), so that conversion will help with spotting these in the future; when we declare a printbuf it must be flagged if we're in an atomic context. Secondly, in btree_node_write_endio: 00085 BUG: sleeping function called from invalid context at include/linux/sched/mm.h:321 00085 in_atomic(): 1, irqs_disabled(): 1, non_block: 0, pid: 618, name: bch-reclaim/fa6 00085 preempt_count: 10001, expected: 0 00085 RCU nest depth: 0, expected: 0 00085 4 locks held by bch-reclaim/fa6/618: 00085 #0: ffffff80d7ccad68 (&j->reclaim_lock){+.+.}-{4:4}, at: bch2_journal_reclaim_thread+0x84/0x198 00085 #1: ffffff80d7c84218 (&c->btree_trans_barrier){.+.+}-{0:0}, at: __bch2_trans_get+0x1c0/0x440 00085 #2: ffffff80cd3f8140 (bcachefs_btree){+.+.}-{0:0}, at: __bch2_trans_get+0x22c/0x440 00085 #3: ffffff80c3823c20 (&vblk->vqs[i].lock){-.-.}-{3:3}, at: virtblk_done+0x58/0x130 00085 irq event stamp: 328 00085 hardirqs last enabled at (327): [<ffffffc080073a14>] finish_task_switch.isra.0+0xbc/0x2a0 00085 hardirqs last disabled at (328): [<ffffffc080971a10>] el1_interrupt+0x20/0x60 00085 softirqs last enabled at (0): [<ffffffc08002f920>] copy_process+0x7c8/0x2118 00085 softirqs last disabled at (0): [<0000000000000000>] 0x0 00085 Preemption disabled at: 00085 [<ffffffc08003ada0>] irq_enter_rcu+0x18/0x90 00085 CPU: 8 UID: 0 PID: 618 Comm: bch-reclaim/fa6 Not tainted 6.14.0-rc6-ktest-g04630bde23e8 #18798 00085 Hardware name: linux,dummy-virt (DT) 00085 Call trace: 00085 show_stack+0x1c/0x30 (C) 00085 dump_stack_lvl+0x84/0xc0 00085 dump_stack+0x14/0x20 00085 __might_resched+0x180/0x288 00085 __might_sleep+0x4c/0x88 00085 __kmalloc_node_track_caller_noprof+0x34c/0x3e0 00085 krealloc_noprof+0x1a0/0x2d8 00085 bch2_printbuf_make_room+0x9c/0x120 00085 bch2_prt_printf+0x60/0x1b8 00085 btree_node_write_endio+0x1b0/0x2d8 00085 bio_endio+0x138/0x1f0 00085 btree_node_write_endio+0xe8/0x2d8 00085 bio_endio+0x138/0x1f0 00085 blk_update_request+0x220/0x4c0 00085 blk_mq_end_request+0x28/0x148 00085 virtblk_request_done+0x64/0xe8 00085 blk_mq_complete_request+0x34/0x40 00085 virtblk_done+0x78/0x130 00085 vring_interrupt+0x6c/0xb0 00085 __handle_irq_event_percpu+0x8c/0x2e0 00085 handle_irq_event+0x50/0xb0 00085 handle_fasteoi_irq+0xc4/0x250 00085 handle_irq_desc+0x44/0x60 00085 generic_handle_domain_irq+0x20/0x30 00085 gic_handle_irq+0x54/0xc8 00085 call_on_irq_stack+0x24/0x40 Reported-by: syzbot+c82cd2906e2f192410bb@syzkaller.appspotmail.com Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
kdave
pushed a commit
that referenced
this pull request
Apr 4, 2025
There is a potential deadlock if we do report zones in an IO context, detailed in below lockdep report. When one process do a report zones and another process freezes the block device, the report zones side cannot allocate a tag because the freeze is already started. This can thus result in new block group creation to hang forever, blocking the write path. Thankfully, a new block group should be created on empty zones. So, reporting the zones is not necessary and we can set the write pointer = 0 and load the zone capacity from the block layer using bdev_zone_capacity() helper. ====================================================== WARNING: possible circular locking dependency detected 6.14.0-rc1 torvalds#252 Not tainted ------------------------------------------------------ modprobe/1110 is trying to acquire lock: ffff888100ac83e0 ((work_completion)(&(&wb->dwork)->work)){+.+.}-{0:0}, at: __flush_work+0x38f/0xb60 but task is already holding lock: ffff8881205b6f20 (&q->q_usage_counter(queue)torvalds#16){++++}-{0:0}, at: sd_remove+0x85/0x130 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #3 (&q->q_usage_counter(queue)torvalds#16){++++}-{0:0}: blk_queue_enter+0x3d9/0x500 blk_mq_alloc_request+0x47d/0x8e0 scsi_execute_cmd+0x14f/0xb80 sd_zbc_do_report_zones+0x1c1/0x470 sd_zbc_report_zones+0x362/0xd60 blkdev_report_zones+0x1b1/0x2e0 btrfs_get_dev_zones+0x215/0x7e0 [btrfs] btrfs_load_block_group_zone_info+0x6d2/0x2c10 [btrfs] btrfs_make_block_group+0x36b/0x870 [btrfs] btrfs_create_chunk+0x147d/0x2320 [btrfs] btrfs_chunk_alloc+0x2ce/0xcf0 [btrfs] start_transaction+0xce6/0x1620 [btrfs] btrfs_uuid_scan_kthread+0x4ee/0x5b0 [btrfs] kthread+0x39d/0x750 ret_from_fork+0x30/0x70 ret_from_fork_asm+0x1a/0x30 -> #2 (&fs_info->dev_replace.rwsem){++++}-{4:4}: down_read+0x9b/0x470 btrfs_map_block+0x2ce/0x2ce0 [btrfs] btrfs_submit_chunk+0x2d4/0x16c0 [btrfs] btrfs_submit_bbio+0x16/0x30 [btrfs] btree_write_cache_pages+0xb5a/0xf90 [btrfs] do_writepages+0x17f/0x7b0 __writeback_single_inode+0x114/0xb00 writeback_sb_inodes+0x52b/0xe00 wb_writeback+0x1a7/0x800 wb_workfn+0x12a/0xbd0 process_one_work+0x85a/0x1460 worker_thread+0x5e2/0xfc0 kthread+0x39d/0x750 ret_from_fork+0x30/0x70 ret_from_fork_asm+0x1a/0x30 -> #1 (&fs_info->zoned_meta_io_lock){+.+.}-{4:4}: __mutex_lock+0x1aa/0x1360 btree_write_cache_pages+0x252/0xf90 [btrfs] do_writepages+0x17f/0x7b0 __writeback_single_inode+0x114/0xb00 writeback_sb_inodes+0x52b/0xe00 wb_writeback+0x1a7/0x800 wb_workfn+0x12a/0xbd0 process_one_work+0x85a/0x1460 worker_thread+0x5e2/0xfc0 kthread+0x39d/0x750 ret_from_fork+0x30/0x70 ret_from_fork_asm+0x1a/0x30 -> #0 ((work_completion)(&(&wb->dwork)->work)){+.+.}-{0:0}: __lock_acquire+0x2f52/0x5ea0 lock_acquire+0x1b1/0x540 __flush_work+0x3ac/0xb60 wb_shutdown+0x15b/0x1f0 bdi_unregister+0x172/0x5b0 del_gendisk+0x841/0xa20 sd_remove+0x85/0x130 device_release_driver_internal+0x368/0x520 bus_remove_device+0x1f1/0x3f0 device_del+0x3bd/0x9c0 __scsi_remove_device+0x272/0x340 scsi_forget_host+0xf7/0x170 scsi_remove_host+0xd2/0x2a0 sdebug_driver_remove+0x52/0x2f0 [scsi_debug] device_release_driver_internal+0x368/0x520 bus_remove_device+0x1f1/0x3f0 device_del+0x3bd/0x9c0 device_unregister+0x13/0xa0 sdebug_do_remove_host+0x1fb/0x290 [scsi_debug] scsi_debug_exit+0x17/0x70 [scsi_debug] __do_sys_delete_module.isra.0+0x321/0x520 do_syscall_64+0x93/0x180 entry_SYSCALL_64_after_hwframe+0x76/0x7e other info that might help us debug this: Chain exists of: (work_completion)(&(&wb->dwork)->work) --> &fs_info->dev_replace.rwsem --> &q->q_usage_counter(queue)torvalds#16 Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(&q->q_usage_counter(queue)torvalds#16); lock(&fs_info->dev_replace.rwsem); lock(&q->q_usage_counter(queue)torvalds#16); lock((work_completion)(&(&wb->dwork)->work)); *** DEADLOCK *** 5 locks held by modprobe/1110: #0: ffff88811f7bc108 (&dev->mutex){....}-{4:4}, at: device_release_driver_internal+0x8f/0x520 #1: ffff8881022ee0e0 (&shost->scan_mutex){+.+.}-{4:4}, at: scsi_remove_host+0x20/0x2a0 #2: ffff88811b4c4378 (&dev->mutex){....}-{4:4}, at: device_release_driver_internal+0x8f/0x520 #3: ffff8881205b6f20 (&q->q_usage_counter(queue)torvalds#16){++++}-{0:0}, at: sd_remove+0x85/0x130 #4: ffffffffa3284360 (rcu_read_lock){....}-{1:3}, at: __flush_work+0xda/0xb60 stack backtrace: CPU: 0 UID: 0 PID: 1110 Comm: modprobe Not tainted 6.14.0-rc1 torvalds#252 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-3.fc41 04/01/2014 Call Trace: <TASK> dump_stack_lvl+0x6a/0x90 print_circular_bug.cold+0x1e0/0x274 check_noncircular+0x306/0x3f0 ? __pfx_check_noncircular+0x10/0x10 ? mark_lock+0xf5/0x1650 ? __pfx_check_irq_usage+0x10/0x10 ? lockdep_lock+0xca/0x1c0 ? __pfx_lockdep_lock+0x10/0x10 __lock_acquire+0x2f52/0x5ea0 ? __pfx___lock_acquire+0x10/0x10 ? __pfx_mark_lock+0x10/0x10 lock_acquire+0x1b1/0x540 ? __flush_work+0x38f/0xb60 ? __pfx_lock_acquire+0x10/0x10 ? __pfx_lock_release+0x10/0x10 ? mark_held_locks+0x94/0xe0 ? __flush_work+0x38f/0xb60 __flush_work+0x3ac/0xb60 ? __flush_work+0x38f/0xb60 ? __pfx_mark_lock+0x10/0x10 ? __pfx___flush_work+0x10/0x10 ? __pfx_wq_barrier_func+0x10/0x10 ? __pfx___might_resched+0x10/0x10 ? mark_held_locks+0x94/0xe0 wb_shutdown+0x15b/0x1f0 bdi_unregister+0x172/0x5b0 ? __pfx_bdi_unregister+0x10/0x10 ? up_write+0x1ba/0x510 del_gendisk+0x841/0xa20 ? __pfx_del_gendisk+0x10/0x10 ? _raw_spin_unlock_irqrestore+0x35/0x60 ? __pm_runtime_resume+0x79/0x110 sd_remove+0x85/0x130 device_release_driver_internal+0x368/0x520 ? kobject_put+0x5d/0x4a0 bus_remove_device+0x1f1/0x3f0 device_del+0x3bd/0x9c0 ? __pfx_device_del+0x10/0x10 __scsi_remove_device+0x272/0x340 scsi_forget_host+0xf7/0x170 scsi_remove_host+0xd2/0x2a0 sdebug_driver_remove+0x52/0x2f0 [scsi_debug] ? kernfs_remove_by_name_ns+0xc0/0xf0 device_release_driver_internal+0x368/0x520 ? kobject_put+0x5d/0x4a0 bus_remove_device+0x1f1/0x3f0 device_del+0x3bd/0x9c0 ? __pfx_device_del+0x10/0x10 ? __pfx___mutex_unlock_slowpath+0x10/0x10 device_unregister+0x13/0xa0 sdebug_do_remove_host+0x1fb/0x290 [scsi_debug] scsi_debug_exit+0x17/0x70 [scsi_debug] __do_sys_delete_module.isra.0+0x321/0x520 ? __pfx___do_sys_delete_module.isra.0+0x10/0x10 ? __pfx_slab_free_after_rcu_debug+0x10/0x10 ? kasan_save_stack+0x2c/0x50 ? kasan_record_aux_stack+0xa3/0xb0 ? __call_rcu_common.constprop.0+0xc4/0xfb0 ? kmem_cache_free+0x3a0/0x590 ? __x64_sys_close+0x78/0xd0 do_syscall_64+0x93/0x180 ? lock_is_held_type+0xd5/0x130 ? __call_rcu_common.constprop.0+0x3c0/0xfb0 ? lockdep_hardirqs_on+0x78/0x100 ? __call_rcu_common.constprop.0+0x3c0/0xfb0 ? __pfx___call_rcu_common.constprop.0+0x10/0x10 ? kmem_cache_free+0x3a0/0x590 ? lockdep_hardirqs_on_prepare+0x16d/0x400 ? do_syscall_64+0x9f/0x180 ? lockdep_hardirqs_on+0x78/0x100 ? do_syscall_64+0x9f/0x180 ? __pfx___x64_sys_openat+0x10/0x10 ? lockdep_hardirqs_on_prepare+0x16d/0x400 ? do_syscall_64+0x9f/0x180 ? lockdep_hardirqs_on+0x78/0x100 ? do_syscall_64+0x9f/0x180 entry_SYSCALL_64_after_hwframe+0x76/0x7e RIP: 0033:0x7f436712b68b RSP: 002b:00007ffe9f1a8658 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0 RAX: ffffffffffffffda RBX: 00005559b367fd80 RCX: 00007f436712b68b RDX: 0000000000000000 RSI: 0000000000000800 RDI: 00005559b367fde8 RBP: 00007ffe9f1a8680 R08: 1999999999999999 R09: 0000000000000000 R10: 00007f43671a5fe0 R11: 0000000000000206 R12: 0000000000000000 R13: 00007ffe9f1a86b0 R14: 0000000000000000 R15: 0000000000000000 </TASK> Reported-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com> CC: <stable@vger.kernel.org> # 6.13+ Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
kdave
pushed a commit
that referenced
this pull request
Apr 7, 2025
v2: - Created a single error handling unlock and exit in veth_pool_store - Greatly expanded commit message with previous explanatory-only text Summary: Use rtnl_mutex to synchronize veth_pool_store with itself, ibmveth_close and ibmveth_open, preventing multiple calls in a row to napi_disable. Background: Two (or more) threads could call veth_pool_store through writing to /sys/devices/vio/30000002/pool*/*. You can do this easily with a little shell script. This causes a hang. I configured LOCKDEP, compiled ibmveth.c with DEBUG, and built a new kernel. I ran this test again and saw: Setting pool0/active to 0 Setting pool1/active to 1 [ 73.911067][ T4365] ibmveth 30000002 eth0: close starting Setting pool1/active to 1 Setting pool1/active to 0 [ 73.911367][ T4366] ibmveth 30000002 eth0: close starting [ 73.916056][ T4365] ibmveth 30000002 eth0: close complete [ 73.916064][ T4365] ibmveth 30000002 eth0: open starting [ 110.808564][ T712] systemd-journald[712]: Sent WATCHDOG=1 notification. [ 230.808495][ T712] systemd-journald[712]: Sent WATCHDOG=1 notification. [ 243.683786][ T123] INFO: task stress.sh:4365 blocked for more than 122 seconds. [ 243.683827][ T123] Not tainted 6.14.0-01103-g2df0c02dab82-dirty #8 [ 243.683833][ T123] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 243.683838][ T123] task:stress.sh state:D stack:28096 pid:4365 tgid:4365 ppid:4364 task_flags:0x400040 flags:0x00042000 [ 243.683852][ T123] Call Trace: [ 243.683857][ T123] [c00000000c38f690] [0000000000000001] 0x1 (unreliable) [ 243.683868][ T123] [c00000000c38f840] [c00000000001f908] __switch_to+0x318/0x4e0 [ 243.683878][ T123] [c00000000c38f8a0] [c000000001549a70] __schedule+0x500/0x12a0 [ 243.683888][ T123] [c00000000c38f9a0] [c00000000154a878] schedule+0x68/0x210 [ 243.683896][ T123] [c00000000c38f9d0] [c00000000154ac80] schedule_preempt_disabled+0x30/0x50 [ 243.683904][ T123] [c00000000c38fa00] [c00000000154dbb0] __mutex_lock+0x730/0x10f0 [ 243.683913][ T123] [c00000000c38fb10] [c000000001154d40] napi_enable+0x30/0x60 [ 243.683921][ T123] [c00000000c38fb40] [c000000000f4ae94] ibmveth_open+0x68/0x5dc [ 243.683928][ T123] [c00000000c38fbe0] [c000000000f4aa20] veth_pool_store+0x220/0x270 [ 243.683936][ T123] [c00000000c38fc70] [c000000000826278] sysfs_kf_write+0x68/0xb0 [ 243.683944][ T123] [c00000000c38fcb0] [c0000000008240b8] kernfs_fop_write_iter+0x198/0x2d0 [ 243.683951][ T123] [c00000000c38fd00] [c00000000071b9ac] vfs_write+0x34c/0x650 [ 243.683958][ T123] [c00000000c38fdc0] [c00000000071bea8] ksys_write+0x88/0x150 [ 243.683966][ T123] [c00000000c38fe10] [c0000000000317f4] system_call_exception+0x124/0x340 [ 243.683973][ T123] [c00000000c38fe50] [c00000000000d05c] system_call_vectored_common+0x15c/0x2ec ... [ 243.684087][ T123] Showing all locks held in the system: [ 243.684095][ T123] 1 lock held by khungtaskd/123: [ 243.684099][ T123] #0: c00000000278e370 (rcu_read_lock){....}-{1:2}, at: debug_show_all_locks+0x50/0x248 [ 243.684114][ T123] 4 locks held by stress.sh/4365: [ 243.684119][ T123] #0: c00000003a4cd3f8 (sb_writers#3){.+.+}-{0:0}, at: ksys_write+0x88/0x150 [ 243.684132][ T123] #1: c000000041aea888 (&of->mutex#2){+.+.}-{3:3}, at: kernfs_fop_write_iter+0x154/0x2d0 [ 243.684143][ T123] #2: c0000000366fb9a8 (kn->active#64){.+.+}-{0:0}, at: kernfs_fop_write_iter+0x160/0x2d0 [ 243.684155][ T123] #3: c000000035ff4cb8 (&dev->lock){+.+.}-{3:3}, at: napi_enable+0x30/0x60 [ 243.684166][ T123] 5 locks held by stress.sh/4366: [ 243.684170][ T123] #0: c00000003a4cd3f8 (sb_writers#3){.+.+}-{0:0}, at: ksys_write+0x88/0x150 [ 243.684183][ T123] #1: c00000000aee2288 (&of->mutex#2){+.+.}-{3:3}, at: kernfs_fop_write_iter+0x154/0x2d0 [ 243.684194][ T123] #2: c0000000366f4ba8 (kn->active#64){.+.+}-{0:0}, at: kernfs_fop_write_iter+0x160/0x2d0 [ 243.684205][ T123] #3: c000000035ff4cb8 (&dev->lock){+.+.}-{3:3}, at: napi_disable+0x30/0x60 [ 243.684216][ T123] #4: c0000003ff9bbf18 (&rq->__lock){-.-.}-{2:2}, at: __schedule+0x138/0x12a0 From the ibmveth debug, two threads are calling veth_pool_store, which calls ibmveth_close and ibmveth_open. Here's the sequence: T4365 T4366 ----------------- ----------------- --------- veth_pool_store veth_pool_store ibmveth_close ibmveth_close napi_disable napi_disable ibmveth_open napi_enable <- HANG ibmveth_close calls napi_disable at the top and ibmveth_open calls napi_enable at the top. https://docs.kernel.org/networking/napi.html]] says The control APIs are not idempotent. Control API calls are safe against concurrent use of datapath APIs but an incorrect sequence of control API calls may result in crashes, deadlocks, or race conditions. For example, calling napi_disable() multiple times in a row will deadlock. In the normal open and close paths, rtnl_mutex is acquired to prevent other callers. This is missing from veth_pool_store. Use rtnl_mutex in veth_pool_store fixes these hangs. Signed-off-by: Dave Marquardt <davemarq@linux.ibm.com> Fixes: 860f242 ("[PATCH] ibmveth change buffer pools dynamically") Reviewed-by: Nick Child <nnac123@linux.ibm.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250402154403.386744-1-davemarq@linux.ibm.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
kdave
pushed a commit
that referenced
this pull request
Apr 9, 2025
…cesses Acquire a lock on kvm->srcu when userspace is getting MP state to handle a rather extreme edge case where "accepting" APIC events, i.e. processing pending INIT or SIPI, can trigger accesses to guest memory. If the vCPU is in L2 with INIT *and* a TRIPLE_FAULT request pending, then getting MP state will trigger a nested VM-Exit by way of ->check_nested_events(), and emuating the nested VM-Exit can access guest memory. The splat was originally hit by syzkaller on a Google-internal kernel, and reproduced on an upstream kernel by hacking the triple_fault_event_test selftest to stuff a pending INIT, store an MSR on VM-Exit (to generate a memory access on VMX), and do vcpu_mp_state_get() to trigger the scenario. ============================= WARNING: suspicious RCU usage 6.14.0-rc3-b112d356288b-vmx/pi_lockdep_false_pos-lock #3 Not tainted ----------------------------- include/linux/kvm_host.h:1058 suspicious rcu_dereference_check() usage! other info that might help us debug this: rcu_scheduler_active = 2, debug_locks = 1 1 lock held by triple_fault_ev/1256: #0: ffff88810df5a330 (&vcpu->mutex){+.+.}-{4:4}, at: kvm_vcpu_ioctl+0x8b/0x9a0 [kvm] stack backtrace: CPU: 11 UID: 1000 PID: 1256 Comm: triple_fault_ev Not tainted 6.14.0-rc3-b112d356288b-vmx #3 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015 Call Trace: <TASK> dump_stack_lvl+0x7f/0x90 lockdep_rcu_suspicious+0x144/0x190 kvm_vcpu_gfn_to_memslot+0x156/0x180 [kvm] kvm_vcpu_read_guest+0x3e/0x90 [kvm] read_and_check_msr_entry+0x2e/0x180 [kvm_intel] __nested_vmx_vmexit+0x550/0xde0 [kvm_intel] kvm_check_nested_events+0x1b/0x30 [kvm] kvm_apic_accept_events+0x33/0x100 [kvm] kvm_arch_vcpu_ioctl_get_mpstate+0x30/0x1d0 [kvm] kvm_vcpu_ioctl+0x33e/0x9a0 [kvm] __x64_sys_ioctl+0x8b/0xb0 do_syscall_64+0x6c/0x170 entry_SYSCALL_64_after_hwframe+0x4b/0x53 </TASK> Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com> Message-ID: <20250401150504.829812-1-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
kdave
pushed a commit
that referenced
this pull request
Apr 14, 2025
There is a potential deadlock if we do report zones in an IO context, detailed in below lockdep report. When one process do a report zones and another process freezes the block device, the report zones side cannot allocate a tag because the freeze is already started. This can thus result in new block group creation to hang forever, blocking the write path. Thankfully, a new block group should be created on empty zones. So, reporting the zones is not necessary and we can set the write pointer = 0 and load the zone capacity from the block layer using bdev_zone_capacity() helper. ====================================================== WARNING: possible circular locking dependency detected 6.14.0-rc1 torvalds#252 Not tainted ------------------------------------------------------ modprobe/1110 is trying to acquire lock: ffff888100ac83e0 ((work_completion)(&(&wb->dwork)->work)){+.+.}-{0:0}, at: __flush_work+0x38f/0xb60 but task is already holding lock: ffff8881205b6f20 (&q->q_usage_counter(queue)torvalds#16){++++}-{0:0}, at: sd_remove+0x85/0x130 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #3 (&q->q_usage_counter(queue)torvalds#16){++++}-{0:0}: blk_queue_enter+0x3d9/0x500 blk_mq_alloc_request+0x47d/0x8e0 scsi_execute_cmd+0x14f/0xb80 sd_zbc_do_report_zones+0x1c1/0x470 sd_zbc_report_zones+0x362/0xd60 blkdev_report_zones+0x1b1/0x2e0 btrfs_get_dev_zones+0x215/0x7e0 [btrfs] btrfs_load_block_group_zone_info+0x6d2/0x2c10 [btrfs] btrfs_make_block_group+0x36b/0x870 [btrfs] btrfs_create_chunk+0x147d/0x2320 [btrfs] btrfs_chunk_alloc+0x2ce/0xcf0 [btrfs] start_transaction+0xce6/0x1620 [btrfs] btrfs_uuid_scan_kthread+0x4ee/0x5b0 [btrfs] kthread+0x39d/0x750 ret_from_fork+0x30/0x70 ret_from_fork_asm+0x1a/0x30 -> #2 (&fs_info->dev_replace.rwsem){++++}-{4:4}: down_read+0x9b/0x470 btrfs_map_block+0x2ce/0x2ce0 [btrfs] btrfs_submit_chunk+0x2d4/0x16c0 [btrfs] btrfs_submit_bbio+0x16/0x30 [btrfs] btree_write_cache_pages+0xb5a/0xf90 [btrfs] do_writepages+0x17f/0x7b0 __writeback_single_inode+0x114/0xb00 writeback_sb_inodes+0x52b/0xe00 wb_writeback+0x1a7/0x800 wb_workfn+0x12a/0xbd0 process_one_work+0x85a/0x1460 worker_thread+0x5e2/0xfc0 kthread+0x39d/0x750 ret_from_fork+0x30/0x70 ret_from_fork_asm+0x1a/0x30 -> #1 (&fs_info->zoned_meta_io_lock){+.+.}-{4:4}: __mutex_lock+0x1aa/0x1360 btree_write_cache_pages+0x252/0xf90 [btrfs] do_writepages+0x17f/0x7b0 __writeback_single_inode+0x114/0xb00 writeback_sb_inodes+0x52b/0xe00 wb_writeback+0x1a7/0x800 wb_workfn+0x12a/0xbd0 process_one_work+0x85a/0x1460 worker_thread+0x5e2/0xfc0 kthread+0x39d/0x750 ret_from_fork+0x30/0x70 ret_from_fork_asm+0x1a/0x30 -> #0 ((work_completion)(&(&wb->dwork)->work)){+.+.}-{0:0}: __lock_acquire+0x2f52/0x5ea0 lock_acquire+0x1b1/0x540 __flush_work+0x3ac/0xb60 wb_shutdown+0x15b/0x1f0 bdi_unregister+0x172/0x5b0 del_gendisk+0x841/0xa20 sd_remove+0x85/0x130 device_release_driver_internal+0x368/0x520 bus_remove_device+0x1f1/0x3f0 device_del+0x3bd/0x9c0 __scsi_remove_device+0x272/0x340 scsi_forget_host+0xf7/0x170 scsi_remove_host+0xd2/0x2a0 sdebug_driver_remove+0x52/0x2f0 [scsi_debug] device_release_driver_internal+0x368/0x520 bus_remove_device+0x1f1/0x3f0 device_del+0x3bd/0x9c0 device_unregister+0x13/0xa0 sdebug_do_remove_host+0x1fb/0x290 [scsi_debug] scsi_debug_exit+0x17/0x70 [scsi_debug] __do_sys_delete_module.isra.0+0x321/0x520 do_syscall_64+0x93/0x180 entry_SYSCALL_64_after_hwframe+0x76/0x7e other info that might help us debug this: Chain exists of: (work_completion)(&(&wb->dwork)->work) --> &fs_info->dev_replace.rwsem --> &q->q_usage_counter(queue)torvalds#16 Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(&q->q_usage_counter(queue)torvalds#16); lock(&fs_info->dev_replace.rwsem); lock(&q->q_usage_counter(queue)torvalds#16); lock((work_completion)(&(&wb->dwork)->work)); *** DEADLOCK *** 5 locks held by modprobe/1110: #0: ffff88811f7bc108 (&dev->mutex){....}-{4:4}, at: device_release_driver_internal+0x8f/0x520 #1: ffff8881022ee0e0 (&shost->scan_mutex){+.+.}-{4:4}, at: scsi_remove_host+0x20/0x2a0 #2: ffff88811b4c4378 (&dev->mutex){....}-{4:4}, at: device_release_driver_internal+0x8f/0x520 #3: ffff8881205b6f20 (&q->q_usage_counter(queue)torvalds#16){++++}-{0:0}, at: sd_remove+0x85/0x130 #4: ffffffffa3284360 (rcu_read_lock){....}-{1:3}, at: __flush_work+0xda/0xb60 stack backtrace: CPU: 0 UID: 0 PID: 1110 Comm: modprobe Not tainted 6.14.0-rc1 torvalds#252 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-3.fc41 04/01/2014 Call Trace: <TASK> dump_stack_lvl+0x6a/0x90 print_circular_bug.cold+0x1e0/0x274 check_noncircular+0x306/0x3f0 ? __pfx_check_noncircular+0x10/0x10 ? mark_lock+0xf5/0x1650 ? __pfx_check_irq_usage+0x10/0x10 ? lockdep_lock+0xca/0x1c0 ? __pfx_lockdep_lock+0x10/0x10 __lock_acquire+0x2f52/0x5ea0 ? __pfx___lock_acquire+0x10/0x10 ? __pfx_mark_lock+0x10/0x10 lock_acquire+0x1b1/0x540 ? __flush_work+0x38f/0xb60 ? __pfx_lock_acquire+0x10/0x10 ? __pfx_lock_release+0x10/0x10 ? mark_held_locks+0x94/0xe0 ? __flush_work+0x38f/0xb60 __flush_work+0x3ac/0xb60 ? __flush_work+0x38f/0xb60 ? __pfx_mark_lock+0x10/0x10 ? __pfx___flush_work+0x10/0x10 ? __pfx_wq_barrier_func+0x10/0x10 ? __pfx___might_resched+0x10/0x10 ? mark_held_locks+0x94/0xe0 wb_shutdown+0x15b/0x1f0 bdi_unregister+0x172/0x5b0 ? __pfx_bdi_unregister+0x10/0x10 ? up_write+0x1ba/0x510 del_gendisk+0x841/0xa20 ? __pfx_del_gendisk+0x10/0x10 ? _raw_spin_unlock_irqrestore+0x35/0x60 ? __pm_runtime_resume+0x79/0x110 sd_remove+0x85/0x130 device_release_driver_internal+0x368/0x520 ? kobject_put+0x5d/0x4a0 bus_remove_device+0x1f1/0x3f0 device_del+0x3bd/0x9c0 ? __pfx_device_del+0x10/0x10 __scsi_remove_device+0x272/0x340 scsi_forget_host+0xf7/0x170 scsi_remove_host+0xd2/0x2a0 sdebug_driver_remove+0x52/0x2f0 [scsi_debug] ? kernfs_remove_by_name_ns+0xc0/0xf0 device_release_driver_internal+0x368/0x520 ? kobject_put+0x5d/0x4a0 bus_remove_device+0x1f1/0x3f0 device_del+0x3bd/0x9c0 ? __pfx_device_del+0x10/0x10 ? __pfx___mutex_unlock_slowpath+0x10/0x10 device_unregister+0x13/0xa0 sdebug_do_remove_host+0x1fb/0x290 [scsi_debug] scsi_debug_exit+0x17/0x70 [scsi_debug] __do_sys_delete_module.isra.0+0x321/0x520 ? __pfx___do_sys_delete_module.isra.0+0x10/0x10 ? __pfx_slab_free_after_rcu_debug+0x10/0x10 ? kasan_save_stack+0x2c/0x50 ? kasan_record_aux_stack+0xa3/0xb0 ? __call_rcu_common.constprop.0+0xc4/0xfb0 ? kmem_cache_free+0x3a0/0x590 ? __x64_sys_close+0x78/0xd0 do_syscall_64+0x93/0x180 ? lock_is_held_type+0xd5/0x130 ? __call_rcu_common.constprop.0+0x3c0/0xfb0 ? lockdep_hardirqs_on+0x78/0x100 ? __call_rcu_common.constprop.0+0x3c0/0xfb0 ? __pfx___call_rcu_common.constprop.0+0x10/0x10 ? kmem_cache_free+0x3a0/0x590 ? lockdep_hardirqs_on_prepare+0x16d/0x400 ? do_syscall_64+0x9f/0x180 ? lockdep_hardirqs_on+0x78/0x100 ? do_syscall_64+0x9f/0x180 ? __pfx___x64_sys_openat+0x10/0x10 ? lockdep_hardirqs_on_prepare+0x16d/0x400 ? do_syscall_64+0x9f/0x180 ? lockdep_hardirqs_on+0x78/0x100 ? do_syscall_64+0x9f/0x180 entry_SYSCALL_64_after_hwframe+0x76/0x7e RIP: 0033:0x7f436712b68b RSP: 002b:00007ffe9f1a8658 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0 RAX: ffffffffffffffda RBX: 00005559b367fd80 RCX: 00007f436712b68b RDX: 0000000000000000 RSI: 0000000000000800 RDI: 00005559b367fde8 RBP: 00007ffe9f1a8680 R08: 1999999999999999 R09: 0000000000000000 R10: 00007f43671a5fe0 R11: 0000000000000206 R12: 0000000000000000 R13: 00007ffe9f1a86b0 R14: 0000000000000000 R15: 0000000000000000 </TASK> Reported-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com> CC: <stable@vger.kernel.org> # 6.13+ Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
kdave
pushed a commit
that referenced
this pull request
Apr 14, 2025
There is a potential deadlock if we do report zones in an IO context, detailed in below lockdep report. When one process do a report zones and another process freezes the block device, the report zones side cannot allocate a tag because the freeze is already started. This can thus result in new block group creation to hang forever, blocking the write path. Thankfully, a new block group should be created on empty zones. So, reporting the zones is not necessary and we can set the write pointer = 0 and load the zone capacity from the block layer using bdev_zone_capacity() helper. ====================================================== WARNING: possible circular locking dependency detected 6.14.0-rc1 torvalds#252 Not tainted ------------------------------------------------------ modprobe/1110 is trying to acquire lock: ffff888100ac83e0 ((work_completion)(&(&wb->dwork)->work)){+.+.}-{0:0}, at: __flush_work+0x38f/0xb60 but task is already holding lock: ffff8881205b6f20 (&q->q_usage_counter(queue)torvalds#16){++++}-{0:0}, at: sd_remove+0x85/0x130 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #3 (&q->q_usage_counter(queue)torvalds#16){++++}-{0:0}: blk_queue_enter+0x3d9/0x500 blk_mq_alloc_request+0x47d/0x8e0 scsi_execute_cmd+0x14f/0xb80 sd_zbc_do_report_zones+0x1c1/0x470 sd_zbc_report_zones+0x362/0xd60 blkdev_report_zones+0x1b1/0x2e0 btrfs_get_dev_zones+0x215/0x7e0 [btrfs] btrfs_load_block_group_zone_info+0x6d2/0x2c10 [btrfs] btrfs_make_block_group+0x36b/0x870 [btrfs] btrfs_create_chunk+0x147d/0x2320 [btrfs] btrfs_chunk_alloc+0x2ce/0xcf0 [btrfs] start_transaction+0xce6/0x1620 [btrfs] btrfs_uuid_scan_kthread+0x4ee/0x5b0 [btrfs] kthread+0x39d/0x750 ret_from_fork+0x30/0x70 ret_from_fork_asm+0x1a/0x30 -> #2 (&fs_info->dev_replace.rwsem){++++}-{4:4}: down_read+0x9b/0x470 btrfs_map_block+0x2ce/0x2ce0 [btrfs] btrfs_submit_chunk+0x2d4/0x16c0 [btrfs] btrfs_submit_bbio+0x16/0x30 [btrfs] btree_write_cache_pages+0xb5a/0xf90 [btrfs] do_writepages+0x17f/0x7b0 __writeback_single_inode+0x114/0xb00 writeback_sb_inodes+0x52b/0xe00 wb_writeback+0x1a7/0x800 wb_workfn+0x12a/0xbd0 process_one_work+0x85a/0x1460 worker_thread+0x5e2/0xfc0 kthread+0x39d/0x750 ret_from_fork+0x30/0x70 ret_from_fork_asm+0x1a/0x30 -> #1 (&fs_info->zoned_meta_io_lock){+.+.}-{4:4}: __mutex_lock+0x1aa/0x1360 btree_write_cache_pages+0x252/0xf90 [btrfs] do_writepages+0x17f/0x7b0 __writeback_single_inode+0x114/0xb00 writeback_sb_inodes+0x52b/0xe00 wb_writeback+0x1a7/0x800 wb_workfn+0x12a/0xbd0 process_one_work+0x85a/0x1460 worker_thread+0x5e2/0xfc0 kthread+0x39d/0x750 ret_from_fork+0x30/0x70 ret_from_fork_asm+0x1a/0x30 -> #0 ((work_completion)(&(&wb->dwork)->work)){+.+.}-{0:0}: __lock_acquire+0x2f52/0x5ea0 lock_acquire+0x1b1/0x540 __flush_work+0x3ac/0xb60 wb_shutdown+0x15b/0x1f0 bdi_unregister+0x172/0x5b0 del_gendisk+0x841/0xa20 sd_remove+0x85/0x130 device_release_driver_internal+0x368/0x520 bus_remove_device+0x1f1/0x3f0 device_del+0x3bd/0x9c0 __scsi_remove_device+0x272/0x340 scsi_forget_host+0xf7/0x170 scsi_remove_host+0xd2/0x2a0 sdebug_driver_remove+0x52/0x2f0 [scsi_debug] device_release_driver_internal+0x368/0x520 bus_remove_device+0x1f1/0x3f0 device_del+0x3bd/0x9c0 device_unregister+0x13/0xa0 sdebug_do_remove_host+0x1fb/0x290 [scsi_debug] scsi_debug_exit+0x17/0x70 [scsi_debug] __do_sys_delete_module.isra.0+0x321/0x520 do_syscall_64+0x93/0x180 entry_SYSCALL_64_after_hwframe+0x76/0x7e other info that might help us debug this: Chain exists of: (work_completion)(&(&wb->dwork)->work) --> &fs_info->dev_replace.rwsem --> &q->q_usage_counter(queue)torvalds#16 Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(&q->q_usage_counter(queue)torvalds#16); lock(&fs_info->dev_replace.rwsem); lock(&q->q_usage_counter(queue)torvalds#16); lock((work_completion)(&(&wb->dwork)->work)); *** DEADLOCK *** 5 locks held by modprobe/1110: #0: ffff88811f7bc108 (&dev->mutex){....}-{4:4}, at: device_release_driver_internal+0x8f/0x520 #1: ffff8881022ee0e0 (&shost->scan_mutex){+.+.}-{4:4}, at: scsi_remove_host+0x20/0x2a0 #2: ffff88811b4c4378 (&dev->mutex){....}-{4:4}, at: device_release_driver_internal+0x8f/0x520 #3: ffff8881205b6f20 (&q->q_usage_counter(queue)torvalds#16){++++}-{0:0}, at: sd_remove+0x85/0x130 #4: ffffffffa3284360 (rcu_read_lock){....}-{1:3}, at: __flush_work+0xda/0xb60 stack backtrace: CPU: 0 UID: 0 PID: 1110 Comm: modprobe Not tainted 6.14.0-rc1 torvalds#252 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-3.fc41 04/01/2014 Call Trace: <TASK> dump_stack_lvl+0x6a/0x90 print_circular_bug.cold+0x1e0/0x274 check_noncircular+0x306/0x3f0 ? __pfx_check_noncircular+0x10/0x10 ? mark_lock+0xf5/0x1650 ? __pfx_check_irq_usage+0x10/0x10 ? lockdep_lock+0xca/0x1c0 ? __pfx_lockdep_lock+0x10/0x10 __lock_acquire+0x2f52/0x5ea0 ? __pfx___lock_acquire+0x10/0x10 ? __pfx_mark_lock+0x10/0x10 lock_acquire+0x1b1/0x540 ? __flush_work+0x38f/0xb60 ? __pfx_lock_acquire+0x10/0x10 ? __pfx_lock_release+0x10/0x10 ? mark_held_locks+0x94/0xe0 ? __flush_work+0x38f/0xb60 __flush_work+0x3ac/0xb60 ? __flush_work+0x38f/0xb60 ? __pfx_mark_lock+0x10/0x10 ? __pfx___flush_work+0x10/0x10 ? __pfx_wq_barrier_func+0x10/0x10 ? __pfx___might_resched+0x10/0x10 ? mark_held_locks+0x94/0xe0 wb_shutdown+0x15b/0x1f0 bdi_unregister+0x172/0x5b0 ? __pfx_bdi_unregister+0x10/0x10 ? up_write+0x1ba/0x510 del_gendisk+0x841/0xa20 ? __pfx_del_gendisk+0x10/0x10 ? _raw_spin_unlock_irqrestore+0x35/0x60 ? __pm_runtime_resume+0x79/0x110 sd_remove+0x85/0x130 device_release_driver_internal+0x368/0x520 ? kobject_put+0x5d/0x4a0 bus_remove_device+0x1f1/0x3f0 device_del+0x3bd/0x9c0 ? __pfx_device_del+0x10/0x10 __scsi_remove_device+0x272/0x340 scsi_forget_host+0xf7/0x170 scsi_remove_host+0xd2/0x2a0 sdebug_driver_remove+0x52/0x2f0 [scsi_debug] ? kernfs_remove_by_name_ns+0xc0/0xf0 device_release_driver_internal+0x368/0x520 ? kobject_put+0x5d/0x4a0 bus_remove_device+0x1f1/0x3f0 device_del+0x3bd/0x9c0 ? __pfx_device_del+0x10/0x10 ? __pfx___mutex_unlock_slowpath+0x10/0x10 device_unregister+0x13/0xa0 sdebug_do_remove_host+0x1fb/0x290 [scsi_debug] scsi_debug_exit+0x17/0x70 [scsi_debug] __do_sys_delete_module.isra.0+0x321/0x520 ? __pfx___do_sys_delete_module.isra.0+0x10/0x10 ? __pfx_slab_free_after_rcu_debug+0x10/0x10 ? kasan_save_stack+0x2c/0x50 ? kasan_record_aux_stack+0xa3/0xb0 ? __call_rcu_common.constprop.0+0xc4/0xfb0 ? kmem_cache_free+0x3a0/0x590 ? __x64_sys_close+0x78/0xd0 do_syscall_64+0x93/0x180 ? lock_is_held_type+0xd5/0x130 ? __call_rcu_common.constprop.0+0x3c0/0xfb0 ? lockdep_hardirqs_on+0x78/0x100 ? __call_rcu_common.constprop.0+0x3c0/0xfb0 ? __pfx___call_rcu_common.constprop.0+0x10/0x10 ? kmem_cache_free+0x3a0/0x590 ? lockdep_hardirqs_on_prepare+0x16d/0x400 ? do_syscall_64+0x9f/0x180 ? lockdep_hardirqs_on+0x78/0x100 ? do_syscall_64+0x9f/0x180 ? __pfx___x64_sys_openat+0x10/0x10 ? lockdep_hardirqs_on_prepare+0x16d/0x400 ? do_syscall_64+0x9f/0x180 ? lockdep_hardirqs_on+0x78/0x100 ? do_syscall_64+0x9f/0x180 entry_SYSCALL_64_after_hwframe+0x76/0x7e RIP: 0033:0x7f436712b68b RSP: 002b:00007ffe9f1a8658 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0 RAX: ffffffffffffffda RBX: 00005559b367fd80 RCX: 00007f436712b68b RDX: 0000000000000000 RSI: 0000000000000800 RDI: 00005559b367fde8 RBP: 00007ffe9f1a8680 R08: 1999999999999999 R09: 0000000000000000 R10: 00007f43671a5fe0 R11: 0000000000000206 R12: 0000000000000000 R13: 00007ffe9f1a86b0 R14: 0000000000000000 R15: 0000000000000000 </TASK> Reported-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com> CC: <stable@vger.kernel.org> # 6.13+ Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
kdave
pushed a commit
that referenced
this pull request
Apr 15, 2025
There is a potential deadlock if we do report zones in an IO context, detailed in below lockdep report. When one process do a report zones and another process freezes the block device, the report zones side cannot allocate a tag because the freeze is already started. This can thus result in new block group creation to hang forever, blocking the write path. Thankfully, a new block group should be created on empty zones. So, reporting the zones is not necessary and we can set the write pointer = 0 and load the zone capacity from the block layer using bdev_zone_capacity() helper. ====================================================== WARNING: possible circular locking dependency detected 6.14.0-rc1 torvalds#252 Not tainted ------------------------------------------------------ modprobe/1110 is trying to acquire lock: ffff888100ac83e0 ((work_completion)(&(&wb->dwork)->work)){+.+.}-{0:0}, at: __flush_work+0x38f/0xb60 but task is already holding lock: ffff8881205b6f20 (&q->q_usage_counter(queue)torvalds#16){++++}-{0:0}, at: sd_remove+0x85/0x130 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #3 (&q->q_usage_counter(queue)torvalds#16){++++}-{0:0}: blk_queue_enter+0x3d9/0x500 blk_mq_alloc_request+0x47d/0x8e0 scsi_execute_cmd+0x14f/0xb80 sd_zbc_do_report_zones+0x1c1/0x470 sd_zbc_report_zones+0x362/0xd60 blkdev_report_zones+0x1b1/0x2e0 btrfs_get_dev_zones+0x215/0x7e0 [btrfs] btrfs_load_block_group_zone_info+0x6d2/0x2c10 [btrfs] btrfs_make_block_group+0x36b/0x870 [btrfs] btrfs_create_chunk+0x147d/0x2320 [btrfs] btrfs_chunk_alloc+0x2ce/0xcf0 [btrfs] start_transaction+0xce6/0x1620 [btrfs] btrfs_uuid_scan_kthread+0x4ee/0x5b0 [btrfs] kthread+0x39d/0x750 ret_from_fork+0x30/0x70 ret_from_fork_asm+0x1a/0x30 -> #2 (&fs_info->dev_replace.rwsem){++++}-{4:4}: down_read+0x9b/0x470 btrfs_map_block+0x2ce/0x2ce0 [btrfs] btrfs_submit_chunk+0x2d4/0x16c0 [btrfs] btrfs_submit_bbio+0x16/0x30 [btrfs] btree_write_cache_pages+0xb5a/0xf90 [btrfs] do_writepages+0x17f/0x7b0 __writeback_single_inode+0x114/0xb00 writeback_sb_inodes+0x52b/0xe00 wb_writeback+0x1a7/0x800 wb_workfn+0x12a/0xbd0 process_one_work+0x85a/0x1460 worker_thread+0x5e2/0xfc0 kthread+0x39d/0x750 ret_from_fork+0x30/0x70 ret_from_fork_asm+0x1a/0x30 -> #1 (&fs_info->zoned_meta_io_lock){+.+.}-{4:4}: __mutex_lock+0x1aa/0x1360 btree_write_cache_pages+0x252/0xf90 [btrfs] do_writepages+0x17f/0x7b0 __writeback_single_inode+0x114/0xb00 writeback_sb_inodes+0x52b/0xe00 wb_writeback+0x1a7/0x800 wb_workfn+0x12a/0xbd0 process_one_work+0x85a/0x1460 worker_thread+0x5e2/0xfc0 kthread+0x39d/0x750 ret_from_fork+0x30/0x70 ret_from_fork_asm+0x1a/0x30 -> #0 ((work_completion)(&(&wb->dwork)->work)){+.+.}-{0:0}: __lock_acquire+0x2f52/0x5ea0 lock_acquire+0x1b1/0x540 __flush_work+0x3ac/0xb60 wb_shutdown+0x15b/0x1f0 bdi_unregister+0x172/0x5b0 del_gendisk+0x841/0xa20 sd_remove+0x85/0x130 device_release_driver_internal+0x368/0x520 bus_remove_device+0x1f1/0x3f0 device_del+0x3bd/0x9c0 __scsi_remove_device+0x272/0x340 scsi_forget_host+0xf7/0x170 scsi_remove_host+0xd2/0x2a0 sdebug_driver_remove+0x52/0x2f0 [scsi_debug] device_release_driver_internal+0x368/0x520 bus_remove_device+0x1f1/0x3f0 device_del+0x3bd/0x9c0 device_unregister+0x13/0xa0 sdebug_do_remove_host+0x1fb/0x290 [scsi_debug] scsi_debug_exit+0x17/0x70 [scsi_debug] __do_sys_delete_module.isra.0+0x321/0x520 do_syscall_64+0x93/0x180 entry_SYSCALL_64_after_hwframe+0x76/0x7e other info that might help us debug this: Chain exists of: (work_completion)(&(&wb->dwork)->work) --> &fs_info->dev_replace.rwsem --> &q->q_usage_counter(queue)torvalds#16 Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(&q->q_usage_counter(queue)torvalds#16); lock(&fs_info->dev_replace.rwsem); lock(&q->q_usage_counter(queue)torvalds#16); lock((work_completion)(&(&wb->dwork)->work)); *** DEADLOCK *** 5 locks held by modprobe/1110: #0: ffff88811f7bc108 (&dev->mutex){....}-{4:4}, at: device_release_driver_internal+0x8f/0x520 #1: ffff8881022ee0e0 (&shost->scan_mutex){+.+.}-{4:4}, at: scsi_remove_host+0x20/0x2a0 #2: ffff88811b4c4378 (&dev->mutex){....}-{4:4}, at: device_release_driver_internal+0x8f/0x520 #3: ffff8881205b6f20 (&q->q_usage_counter(queue)torvalds#16){++++}-{0:0}, at: sd_remove+0x85/0x130 #4: ffffffffa3284360 (rcu_read_lock){....}-{1:3}, at: __flush_work+0xda/0xb60 stack backtrace: CPU: 0 UID: 0 PID: 1110 Comm: modprobe Not tainted 6.14.0-rc1 torvalds#252 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-3.fc41 04/01/2014 Call Trace: <TASK> dump_stack_lvl+0x6a/0x90 print_circular_bug.cold+0x1e0/0x274 check_noncircular+0x306/0x3f0 ? __pfx_check_noncircular+0x10/0x10 ? mark_lock+0xf5/0x1650 ? __pfx_check_irq_usage+0x10/0x10 ? lockdep_lock+0xca/0x1c0 ? __pfx_lockdep_lock+0x10/0x10 __lock_acquire+0x2f52/0x5ea0 ? __pfx___lock_acquire+0x10/0x10 ? __pfx_mark_lock+0x10/0x10 lock_acquire+0x1b1/0x540 ? __flush_work+0x38f/0xb60 ? __pfx_lock_acquire+0x10/0x10 ? __pfx_lock_release+0x10/0x10 ? mark_held_locks+0x94/0xe0 ? __flush_work+0x38f/0xb60 __flush_work+0x3ac/0xb60 ? __flush_work+0x38f/0xb60 ? __pfx_mark_lock+0x10/0x10 ? __pfx___flush_work+0x10/0x10 ? __pfx_wq_barrier_func+0x10/0x10 ? __pfx___might_resched+0x10/0x10 ? mark_held_locks+0x94/0xe0 wb_shutdown+0x15b/0x1f0 bdi_unregister+0x172/0x5b0 ? __pfx_bdi_unregister+0x10/0x10 ? up_write+0x1ba/0x510 del_gendisk+0x841/0xa20 ? __pfx_del_gendisk+0x10/0x10 ? _raw_spin_unlock_irqrestore+0x35/0x60 ? __pm_runtime_resume+0x79/0x110 sd_remove+0x85/0x130 device_release_driver_internal+0x368/0x520 ? kobject_put+0x5d/0x4a0 bus_remove_device+0x1f1/0x3f0 device_del+0x3bd/0x9c0 ? __pfx_device_del+0x10/0x10 __scsi_remove_device+0x272/0x340 scsi_forget_host+0xf7/0x170 scsi_remove_host+0xd2/0x2a0 sdebug_driver_remove+0x52/0x2f0 [scsi_debug] ? kernfs_remove_by_name_ns+0xc0/0xf0 device_release_driver_internal+0x368/0x520 ? kobject_put+0x5d/0x4a0 bus_remove_device+0x1f1/0x3f0 device_del+0x3bd/0x9c0 ? __pfx_device_del+0x10/0x10 ? __pfx___mutex_unlock_slowpath+0x10/0x10 device_unregister+0x13/0xa0 sdebug_do_remove_host+0x1fb/0x290 [scsi_debug] scsi_debug_exit+0x17/0x70 [scsi_debug] __do_sys_delete_module.isra.0+0x321/0x520 ? __pfx___do_sys_delete_module.isra.0+0x10/0x10 ? __pfx_slab_free_after_rcu_debug+0x10/0x10 ? kasan_save_stack+0x2c/0x50 ? kasan_record_aux_stack+0xa3/0xb0 ? __call_rcu_common.constprop.0+0xc4/0xfb0 ? kmem_cache_free+0x3a0/0x590 ? __x64_sys_close+0x78/0xd0 do_syscall_64+0x93/0x180 ? lock_is_held_type+0xd5/0x130 ? __call_rcu_common.constprop.0+0x3c0/0xfb0 ? lockdep_hardirqs_on+0x78/0x100 ? __call_rcu_common.constprop.0+0x3c0/0xfb0 ? __pfx___call_rcu_common.constprop.0+0x10/0x10 ? kmem_cache_free+0x3a0/0x590 ? lockdep_hardirqs_on_prepare+0x16d/0x400 ? do_syscall_64+0x9f/0x180 ? lockdep_hardirqs_on+0x78/0x100 ? do_syscall_64+0x9f/0x180 ? __pfx___x64_sys_openat+0x10/0x10 ? lockdep_hardirqs_on_prepare+0x16d/0x400 ? do_syscall_64+0x9f/0x180 ? lockdep_hardirqs_on+0x78/0x100 ? do_syscall_64+0x9f/0x180 entry_SYSCALL_64_after_hwframe+0x76/0x7e RIP: 0033:0x7f436712b68b RSP: 002b:00007ffe9f1a8658 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0 RAX: ffffffffffffffda RBX: 00005559b367fd80 RCX: 00007f436712b68b RDX: 0000000000000000 RSI: 0000000000000800 RDI: 00005559b367fde8 RBP: 00007ffe9f1a8680 R08: 1999999999999999 R09: 0000000000000000 R10: 00007f43671a5fe0 R11: 0000000000000206 R12: 0000000000000000 R13: 00007ffe9f1a86b0 R14: 0000000000000000 R15: 0000000000000000 </TASK> Reported-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com> CC: <stable@vger.kernel.org> # 6.13+ Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
kdave
pushed a commit
that referenced
this pull request
Apr 15, 2025
There is a potential deadlock if we do report zones in an IO context, detailed in below lockdep report. When one process do a report zones and another process freezes the block device, the report zones side cannot allocate a tag because the freeze is already started. This can thus result in new block group creation to hang forever, blocking the write path. Thankfully, a new block group should be created on empty zones. So, reporting the zones is not necessary and we can set the write pointer = 0 and load the zone capacity from the block layer using bdev_zone_capacity() helper. ====================================================== WARNING: possible circular locking dependency detected 6.14.0-rc1 torvalds#252 Not tainted ------------------------------------------------------ modprobe/1110 is trying to acquire lock: ffff888100ac83e0 ((work_completion)(&(&wb->dwork)->work)){+.+.}-{0:0}, at: __flush_work+0x38f/0xb60 but task is already holding lock: ffff8881205b6f20 (&q->q_usage_counter(queue)torvalds#16){++++}-{0:0}, at: sd_remove+0x85/0x130 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #3 (&q->q_usage_counter(queue)torvalds#16){++++}-{0:0}: blk_queue_enter+0x3d9/0x500 blk_mq_alloc_request+0x47d/0x8e0 scsi_execute_cmd+0x14f/0xb80 sd_zbc_do_report_zones+0x1c1/0x470 sd_zbc_report_zones+0x362/0xd60 blkdev_report_zones+0x1b1/0x2e0 btrfs_get_dev_zones+0x215/0x7e0 [btrfs] btrfs_load_block_group_zone_info+0x6d2/0x2c10 [btrfs] btrfs_make_block_group+0x36b/0x870 [btrfs] btrfs_create_chunk+0x147d/0x2320 [btrfs] btrfs_chunk_alloc+0x2ce/0xcf0 [btrfs] start_transaction+0xce6/0x1620 [btrfs] btrfs_uuid_scan_kthread+0x4ee/0x5b0 [btrfs] kthread+0x39d/0x750 ret_from_fork+0x30/0x70 ret_from_fork_asm+0x1a/0x30 -> #2 (&fs_info->dev_replace.rwsem){++++}-{4:4}: down_read+0x9b/0x470 btrfs_map_block+0x2ce/0x2ce0 [btrfs] btrfs_submit_chunk+0x2d4/0x16c0 [btrfs] btrfs_submit_bbio+0x16/0x30 [btrfs] btree_write_cache_pages+0xb5a/0xf90 [btrfs] do_writepages+0x17f/0x7b0 __writeback_single_inode+0x114/0xb00 writeback_sb_inodes+0x52b/0xe00 wb_writeback+0x1a7/0x800 wb_workfn+0x12a/0xbd0 process_one_work+0x85a/0x1460 worker_thread+0x5e2/0xfc0 kthread+0x39d/0x750 ret_from_fork+0x30/0x70 ret_from_fork_asm+0x1a/0x30 -> #1 (&fs_info->zoned_meta_io_lock){+.+.}-{4:4}: __mutex_lock+0x1aa/0x1360 btree_write_cache_pages+0x252/0xf90 [btrfs] do_writepages+0x17f/0x7b0 __writeback_single_inode+0x114/0xb00 writeback_sb_inodes+0x52b/0xe00 wb_writeback+0x1a7/0x800 wb_workfn+0x12a/0xbd0 process_one_work+0x85a/0x1460 worker_thread+0x5e2/0xfc0 kthread+0x39d/0x750 ret_from_fork+0x30/0x70 ret_from_fork_asm+0x1a/0x30 -> #0 ((work_completion)(&(&wb->dwork)->work)){+.+.}-{0:0}: __lock_acquire+0x2f52/0x5ea0 lock_acquire+0x1b1/0x540 __flush_work+0x3ac/0xb60 wb_shutdown+0x15b/0x1f0 bdi_unregister+0x172/0x5b0 del_gendisk+0x841/0xa20 sd_remove+0x85/0x130 device_release_driver_internal+0x368/0x520 bus_remove_device+0x1f1/0x3f0 device_del+0x3bd/0x9c0 __scsi_remove_device+0x272/0x340 scsi_forget_host+0xf7/0x170 scsi_remove_host+0xd2/0x2a0 sdebug_driver_remove+0x52/0x2f0 [scsi_debug] device_release_driver_internal+0x368/0x520 bus_remove_device+0x1f1/0x3f0 device_del+0x3bd/0x9c0 device_unregister+0x13/0xa0 sdebug_do_remove_host+0x1fb/0x290 [scsi_debug] scsi_debug_exit+0x17/0x70 [scsi_debug] __do_sys_delete_module.isra.0+0x321/0x520 do_syscall_64+0x93/0x180 entry_SYSCALL_64_after_hwframe+0x76/0x7e other info that might help us debug this: Chain exists of: (work_completion)(&(&wb->dwork)->work) --> &fs_info->dev_replace.rwsem --> &q->q_usage_counter(queue)torvalds#16 Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(&q->q_usage_counter(queue)torvalds#16); lock(&fs_info->dev_replace.rwsem); lock(&q->q_usage_counter(queue)torvalds#16); lock((work_completion)(&(&wb->dwork)->work)); *** DEADLOCK *** 5 locks held by modprobe/1110: #0: ffff88811f7bc108 (&dev->mutex){....}-{4:4}, at: device_release_driver_internal+0x8f/0x520 #1: ffff8881022ee0e0 (&shost->scan_mutex){+.+.}-{4:4}, at: scsi_remove_host+0x20/0x2a0 #2: ffff88811b4c4378 (&dev->mutex){....}-{4:4}, at: device_release_driver_internal+0x8f/0x520 #3: ffff8881205b6f20 (&q->q_usage_counter(queue)torvalds#16){++++}-{0:0}, at: sd_remove+0x85/0x130 #4: ffffffffa3284360 (rcu_read_lock){....}-{1:3}, at: __flush_work+0xda/0xb60 stack backtrace: CPU: 0 UID: 0 PID: 1110 Comm: modprobe Not tainted 6.14.0-rc1 torvalds#252 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-3.fc41 04/01/2014 Call Trace: <TASK> dump_stack_lvl+0x6a/0x90 print_circular_bug.cold+0x1e0/0x274 check_noncircular+0x306/0x3f0 ? __pfx_check_noncircular+0x10/0x10 ? mark_lock+0xf5/0x1650 ? __pfx_check_irq_usage+0x10/0x10 ? lockdep_lock+0xca/0x1c0 ? __pfx_lockdep_lock+0x10/0x10 __lock_acquire+0x2f52/0x5ea0 ? __pfx___lock_acquire+0x10/0x10 ? __pfx_mark_lock+0x10/0x10 lock_acquire+0x1b1/0x540 ? __flush_work+0x38f/0xb60 ? __pfx_lock_acquire+0x10/0x10 ? __pfx_lock_release+0x10/0x10 ? mark_held_locks+0x94/0xe0 ? __flush_work+0x38f/0xb60 __flush_work+0x3ac/0xb60 ? __flush_work+0x38f/0xb60 ? __pfx_mark_lock+0x10/0x10 ? __pfx___flush_work+0x10/0x10 ? __pfx_wq_barrier_func+0x10/0x10 ? __pfx___might_resched+0x10/0x10 ? mark_held_locks+0x94/0xe0 wb_shutdown+0x15b/0x1f0 bdi_unregister+0x172/0x5b0 ? __pfx_bdi_unregister+0x10/0x10 ? up_write+0x1ba/0x510 del_gendisk+0x841/0xa20 ? __pfx_del_gendisk+0x10/0x10 ? _raw_spin_unlock_irqrestore+0x35/0x60 ? __pm_runtime_resume+0x79/0x110 sd_remove+0x85/0x130 device_release_driver_internal+0x368/0x520 ? kobject_put+0x5d/0x4a0 bus_remove_device+0x1f1/0x3f0 device_del+0x3bd/0x9c0 ? __pfx_device_del+0x10/0x10 __scsi_remove_device+0x272/0x340 scsi_forget_host+0xf7/0x170 scsi_remove_host+0xd2/0x2a0 sdebug_driver_remove+0x52/0x2f0 [scsi_debug] ? kernfs_remove_by_name_ns+0xc0/0xf0 device_release_driver_internal+0x368/0x520 ? kobject_put+0x5d/0x4a0 bus_remove_device+0x1f1/0x3f0 device_del+0x3bd/0x9c0 ? __pfx_device_del+0x10/0x10 ? __pfx___mutex_unlock_slowpath+0x10/0x10 device_unregister+0x13/0xa0 sdebug_do_remove_host+0x1fb/0x290 [scsi_debug] scsi_debug_exit+0x17/0x70 [scsi_debug] __do_sys_delete_module.isra.0+0x321/0x520 ? __pfx___do_sys_delete_module.isra.0+0x10/0x10 ? __pfx_slab_free_after_rcu_debug+0x10/0x10 ? kasan_save_stack+0x2c/0x50 ? kasan_record_aux_stack+0xa3/0xb0 ? __call_rcu_common.constprop.0+0xc4/0xfb0 ? kmem_cache_free+0x3a0/0x590 ? __x64_sys_close+0x78/0xd0 do_syscall_64+0x93/0x180 ? lock_is_held_type+0xd5/0x130 ? __call_rcu_common.constprop.0+0x3c0/0xfb0 ? lockdep_hardirqs_on+0x78/0x100 ? __call_rcu_common.constprop.0+0x3c0/0xfb0 ? __pfx___call_rcu_common.constprop.0+0x10/0x10 ? kmem_cache_free+0x3a0/0x590 ? lockdep_hardirqs_on_prepare+0x16d/0x400 ? do_syscall_64+0x9f/0x180 ? lockdep_hardirqs_on+0x78/0x100 ? do_syscall_64+0x9f/0x180 ? __pfx___x64_sys_openat+0x10/0x10 ? lockdep_hardirqs_on_prepare+0x16d/0x400 ? do_syscall_64+0x9f/0x180 ? lockdep_hardirqs_on+0x78/0x100 ? do_syscall_64+0x9f/0x180 entry_SYSCALL_64_after_hwframe+0x76/0x7e RIP: 0033:0x7f436712b68b RSP: 002b:00007ffe9f1a8658 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0 RAX: ffffffffffffffda RBX: 00005559b367fd80 RCX: 00007f436712b68b RDX: 0000000000000000 RSI: 0000000000000800 RDI: 00005559b367fde8 RBP: 00007ffe9f1a8680 R08: 1999999999999999 R09: 0000000000000000 R10: 00007f43671a5fe0 R11: 0000000000000206 R12: 0000000000000000 R13: 00007ffe9f1a86b0 R14: 0000000000000000 R15: 0000000000000000 </TASK> Reported-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com> CC: <stable@vger.kernel.org> # 6.13+ Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
kdave
pushed a commit
that referenced
this pull request
Apr 17, 2025
There is a potential deadlock if we do report zones in an IO context, detailed in below lockdep report. When one process do a report zones and another process freezes the block device, the report zones side cannot allocate a tag because the freeze is already started. This can thus result in new block group creation to hang forever, blocking the write path. Thankfully, a new block group should be created on empty zones. So, reporting the zones is not necessary and we can set the write pointer = 0 and load the zone capacity from the block layer using bdev_zone_capacity() helper. ====================================================== WARNING: possible circular locking dependency detected 6.14.0-rc1 torvalds#252 Not tainted ------------------------------------------------------ modprobe/1110 is trying to acquire lock: ffff888100ac83e0 ((work_completion)(&(&wb->dwork)->work)){+.+.}-{0:0}, at: __flush_work+0x38f/0xb60 but task is already holding lock: ffff8881205b6f20 (&q->q_usage_counter(queue)torvalds#16){++++}-{0:0}, at: sd_remove+0x85/0x130 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #3 (&q->q_usage_counter(queue)torvalds#16){++++}-{0:0}: blk_queue_enter+0x3d9/0x500 blk_mq_alloc_request+0x47d/0x8e0 scsi_execute_cmd+0x14f/0xb80 sd_zbc_do_report_zones+0x1c1/0x470 sd_zbc_report_zones+0x362/0xd60 blkdev_report_zones+0x1b1/0x2e0 btrfs_get_dev_zones+0x215/0x7e0 [btrfs] btrfs_load_block_group_zone_info+0x6d2/0x2c10 [btrfs] btrfs_make_block_group+0x36b/0x870 [btrfs] btrfs_create_chunk+0x147d/0x2320 [btrfs] btrfs_chunk_alloc+0x2ce/0xcf0 [btrfs] start_transaction+0xce6/0x1620 [btrfs] btrfs_uuid_scan_kthread+0x4ee/0x5b0 [btrfs] kthread+0x39d/0x750 ret_from_fork+0x30/0x70 ret_from_fork_asm+0x1a/0x30 -> #2 (&fs_info->dev_replace.rwsem){++++}-{4:4}: down_read+0x9b/0x470 btrfs_map_block+0x2ce/0x2ce0 [btrfs] btrfs_submit_chunk+0x2d4/0x16c0 [btrfs] btrfs_submit_bbio+0x16/0x30 [btrfs] btree_write_cache_pages+0xb5a/0xf90 [btrfs] do_writepages+0x17f/0x7b0 __writeback_single_inode+0x114/0xb00 writeback_sb_inodes+0x52b/0xe00 wb_writeback+0x1a7/0x800 wb_workfn+0x12a/0xbd0 process_one_work+0x85a/0x1460 worker_thread+0x5e2/0xfc0 kthread+0x39d/0x750 ret_from_fork+0x30/0x70 ret_from_fork_asm+0x1a/0x30 -> #1 (&fs_info->zoned_meta_io_lock){+.+.}-{4:4}: __mutex_lock+0x1aa/0x1360 btree_write_cache_pages+0x252/0xf90 [btrfs] do_writepages+0x17f/0x7b0 __writeback_single_inode+0x114/0xb00 writeback_sb_inodes+0x52b/0xe00 wb_writeback+0x1a7/0x800 wb_workfn+0x12a/0xbd0 process_one_work+0x85a/0x1460 worker_thread+0x5e2/0xfc0 kthread+0x39d/0x750 ret_from_fork+0x30/0x70 ret_from_fork_asm+0x1a/0x30 -> #0 ((work_completion)(&(&wb->dwork)->work)){+.+.}-{0:0}: __lock_acquire+0x2f52/0x5ea0 lock_acquire+0x1b1/0x540 __flush_work+0x3ac/0xb60 wb_shutdown+0x15b/0x1f0 bdi_unregister+0x172/0x5b0 del_gendisk+0x841/0xa20 sd_remove+0x85/0x130 device_release_driver_internal+0x368/0x520 bus_remove_device+0x1f1/0x3f0 device_del+0x3bd/0x9c0 __scsi_remove_device+0x272/0x340 scsi_forget_host+0xf7/0x170 scsi_remove_host+0xd2/0x2a0 sdebug_driver_remove+0x52/0x2f0 [scsi_debug] device_release_driver_internal+0x368/0x520 bus_remove_device+0x1f1/0x3f0 device_del+0x3bd/0x9c0 device_unregister+0x13/0xa0 sdebug_do_remove_host+0x1fb/0x290 [scsi_debug] scsi_debug_exit+0x17/0x70 [scsi_debug] __do_sys_delete_module.isra.0+0x321/0x520 do_syscall_64+0x93/0x180 entry_SYSCALL_64_after_hwframe+0x76/0x7e other info that might help us debug this: Chain exists of: (work_completion)(&(&wb->dwork)->work) --> &fs_info->dev_replace.rwsem --> &q->q_usage_counter(queue)torvalds#16 Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(&q->q_usage_counter(queue)torvalds#16); lock(&fs_info->dev_replace.rwsem); lock(&q->q_usage_counter(queue)torvalds#16); lock((work_completion)(&(&wb->dwork)->work)); *** DEADLOCK *** 5 locks held by modprobe/1110: #0: ffff88811f7bc108 (&dev->mutex){....}-{4:4}, at: device_release_driver_internal+0x8f/0x520 #1: ffff8881022ee0e0 (&shost->scan_mutex){+.+.}-{4:4}, at: scsi_remove_host+0x20/0x2a0 #2: ffff88811b4c4378 (&dev->mutex){....}-{4:4}, at: device_release_driver_internal+0x8f/0x520 #3: ffff8881205b6f20 (&q->q_usage_counter(queue)torvalds#16){++++}-{0:0}, at: sd_remove+0x85/0x130 #4: ffffffffa3284360 (rcu_read_lock){....}-{1:3}, at: __flush_work+0xda/0xb60 stack backtrace: CPU: 0 UID: 0 PID: 1110 Comm: modprobe Not tainted 6.14.0-rc1 torvalds#252 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-3.fc41 04/01/2014 Call Trace: <TASK> dump_stack_lvl+0x6a/0x90 print_circular_bug.cold+0x1e0/0x274 check_noncircular+0x306/0x3f0 ? __pfx_check_noncircular+0x10/0x10 ? mark_lock+0xf5/0x1650 ? __pfx_check_irq_usage+0x10/0x10 ? lockdep_lock+0xca/0x1c0 ? __pfx_lockdep_lock+0x10/0x10 __lock_acquire+0x2f52/0x5ea0 ? __pfx___lock_acquire+0x10/0x10 ? __pfx_mark_lock+0x10/0x10 lock_acquire+0x1b1/0x540 ? __flush_work+0x38f/0xb60 ? __pfx_lock_acquire+0x10/0x10 ? __pfx_lock_release+0x10/0x10 ? mark_held_locks+0x94/0xe0 ? __flush_work+0x38f/0xb60 __flush_work+0x3ac/0xb60 ? __flush_work+0x38f/0xb60 ? __pfx_mark_lock+0x10/0x10 ? __pfx___flush_work+0x10/0x10 ? __pfx_wq_barrier_func+0x10/0x10 ? __pfx___might_resched+0x10/0x10 ? mark_held_locks+0x94/0xe0 wb_shutdown+0x15b/0x1f0 bdi_unregister+0x172/0x5b0 ? __pfx_bdi_unregister+0x10/0x10 ? up_write+0x1ba/0x510 del_gendisk+0x841/0xa20 ? __pfx_del_gendisk+0x10/0x10 ? _raw_spin_unlock_irqrestore+0x35/0x60 ? __pm_runtime_resume+0x79/0x110 sd_remove+0x85/0x130 device_release_driver_internal+0x368/0x520 ? kobject_put+0x5d/0x4a0 bus_remove_device+0x1f1/0x3f0 device_del+0x3bd/0x9c0 ? __pfx_device_del+0x10/0x10 __scsi_remove_device+0x272/0x340 scsi_forget_host+0xf7/0x170 scsi_remove_host+0xd2/0x2a0 sdebug_driver_remove+0x52/0x2f0 [scsi_debug] ? kernfs_remove_by_name_ns+0xc0/0xf0 device_release_driver_internal+0x368/0x520 ? kobject_put+0x5d/0x4a0 bus_remove_device+0x1f1/0x3f0 device_del+0x3bd/0x9c0 ? __pfx_device_del+0x10/0x10 ? __pfx___mutex_unlock_slowpath+0x10/0x10 device_unregister+0x13/0xa0 sdebug_do_remove_host+0x1fb/0x290 [scsi_debug] scsi_debug_exit+0x17/0x70 [scsi_debug] __do_sys_delete_module.isra.0+0x321/0x520 ? __pfx___do_sys_delete_module.isra.0+0x10/0x10 ? __pfx_slab_free_after_rcu_debug+0x10/0x10 ? kasan_save_stack+0x2c/0x50 ? kasan_record_aux_stack+0xa3/0xb0 ? __call_rcu_common.constprop.0+0xc4/0xfb0 ? kmem_cache_free+0x3a0/0x590 ? __x64_sys_close+0x78/0xd0 do_syscall_64+0x93/0x180 ? lock_is_held_type+0xd5/0x130 ? __call_rcu_common.constprop.0+0x3c0/0xfb0 ? lockdep_hardirqs_on+0x78/0x100 ? __call_rcu_common.constprop.0+0x3c0/0xfb0 ? __pfx___call_rcu_common.constprop.0+0x10/0x10 ? kmem_cache_free+0x3a0/0x590 ? lockdep_hardirqs_on_prepare+0x16d/0x400 ? do_syscall_64+0x9f/0x180 ? lockdep_hardirqs_on+0x78/0x100 ? do_syscall_64+0x9f/0x180 ? __pfx___x64_sys_openat+0x10/0x10 ? lockdep_hardirqs_on_prepare+0x16d/0x400 ? do_syscall_64+0x9f/0x180 ? lockdep_hardirqs_on+0x78/0x100 ? do_syscall_64+0x9f/0x180 entry_SYSCALL_64_after_hwframe+0x76/0x7e RIP: 0033:0x7f436712b68b RSP: 002b:00007ffe9f1a8658 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0 RAX: ffffffffffffffda RBX: 00005559b367fd80 RCX: 00007f436712b68b RDX: 0000000000000000 RSI: 0000000000000800 RDI: 00005559b367fde8 RBP: 00007ffe9f1a8680 R08: 1999999999999999 R09: 0000000000000000 R10: 00007f43671a5fe0 R11: 0000000000000206 R12: 0000000000000000 R13: 00007ffe9f1a86b0 R14: 0000000000000000 R15: 0000000000000000 </TASK> Reported-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com> CC: <stable@vger.kernel.org> # 6.13+ Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
kdave
pushed a commit
that referenced
this pull request
Apr 22, 2025
There is a potential deadlock if we do report zones in an IO context, detailed in below lockdep report. When one process do a report zones and another process freezes the block device, the report zones side cannot allocate a tag because the freeze is already started. This can thus result in new block group creation to hang forever, blocking the write path. Thankfully, a new block group should be created on empty zones. So, reporting the zones is not necessary and we can set the write pointer = 0 and load the zone capacity from the block layer using bdev_zone_capacity() helper. ====================================================== WARNING: possible circular locking dependency detected 6.14.0-rc1 torvalds#252 Not tainted ------------------------------------------------------ modprobe/1110 is trying to acquire lock: ffff888100ac83e0 ((work_completion)(&(&wb->dwork)->work)){+.+.}-{0:0}, at: __flush_work+0x38f/0xb60 but task is already holding lock: ffff8881205b6f20 (&q->q_usage_counter(queue)torvalds#16){++++}-{0:0}, at: sd_remove+0x85/0x130 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #3 (&q->q_usage_counter(queue)torvalds#16){++++}-{0:0}: blk_queue_enter+0x3d9/0x500 blk_mq_alloc_request+0x47d/0x8e0 scsi_execute_cmd+0x14f/0xb80 sd_zbc_do_report_zones+0x1c1/0x470 sd_zbc_report_zones+0x362/0xd60 blkdev_report_zones+0x1b1/0x2e0 btrfs_get_dev_zones+0x215/0x7e0 [btrfs] btrfs_load_block_group_zone_info+0x6d2/0x2c10 [btrfs] btrfs_make_block_group+0x36b/0x870 [btrfs] btrfs_create_chunk+0x147d/0x2320 [btrfs] btrfs_chunk_alloc+0x2ce/0xcf0 [btrfs] start_transaction+0xce6/0x1620 [btrfs] btrfs_uuid_scan_kthread+0x4ee/0x5b0 [btrfs] kthread+0x39d/0x750 ret_from_fork+0x30/0x70 ret_from_fork_asm+0x1a/0x30 -> #2 (&fs_info->dev_replace.rwsem){++++}-{4:4}: down_read+0x9b/0x470 btrfs_map_block+0x2ce/0x2ce0 [btrfs] btrfs_submit_chunk+0x2d4/0x16c0 [btrfs] btrfs_submit_bbio+0x16/0x30 [btrfs] btree_write_cache_pages+0xb5a/0xf90 [btrfs] do_writepages+0x17f/0x7b0 __writeback_single_inode+0x114/0xb00 writeback_sb_inodes+0x52b/0xe00 wb_writeback+0x1a7/0x800 wb_workfn+0x12a/0xbd0 process_one_work+0x85a/0x1460 worker_thread+0x5e2/0xfc0 kthread+0x39d/0x750 ret_from_fork+0x30/0x70 ret_from_fork_asm+0x1a/0x30 -> #1 (&fs_info->zoned_meta_io_lock){+.+.}-{4:4}: __mutex_lock+0x1aa/0x1360 btree_write_cache_pages+0x252/0xf90 [btrfs] do_writepages+0x17f/0x7b0 __writeback_single_inode+0x114/0xb00 writeback_sb_inodes+0x52b/0xe00 wb_writeback+0x1a7/0x800 wb_workfn+0x12a/0xbd0 process_one_work+0x85a/0x1460 worker_thread+0x5e2/0xfc0 kthread+0x39d/0x750 ret_from_fork+0x30/0x70 ret_from_fork_asm+0x1a/0x30 -> #0 ((work_completion)(&(&wb->dwork)->work)){+.+.}-{0:0}: __lock_acquire+0x2f52/0x5ea0 lock_acquire+0x1b1/0x540 __flush_work+0x3ac/0xb60 wb_shutdown+0x15b/0x1f0 bdi_unregister+0x172/0x5b0 del_gendisk+0x841/0xa20 sd_remove+0x85/0x130 device_release_driver_internal+0x368/0x520 bus_remove_device+0x1f1/0x3f0 device_del+0x3bd/0x9c0 __scsi_remove_device+0x272/0x340 scsi_forget_host+0xf7/0x170 scsi_remove_host+0xd2/0x2a0 sdebug_driver_remove+0x52/0x2f0 [scsi_debug] device_release_driver_internal+0x368/0x520 bus_remove_device+0x1f1/0x3f0 device_del+0x3bd/0x9c0 device_unregister+0x13/0xa0 sdebug_do_remove_host+0x1fb/0x290 [scsi_debug] scsi_debug_exit+0x17/0x70 [scsi_debug] __do_sys_delete_module.isra.0+0x321/0x520 do_syscall_64+0x93/0x180 entry_SYSCALL_64_after_hwframe+0x76/0x7e other info that might help us debug this: Chain exists of: (work_completion)(&(&wb->dwork)->work) --> &fs_info->dev_replace.rwsem --> &q->q_usage_counter(queue)torvalds#16 Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(&q->q_usage_counter(queue)torvalds#16); lock(&fs_info->dev_replace.rwsem); lock(&q->q_usage_counter(queue)torvalds#16); lock((work_completion)(&(&wb->dwork)->work)); *** DEADLOCK *** 5 locks held by modprobe/1110: #0: ffff88811f7bc108 (&dev->mutex){....}-{4:4}, at: device_release_driver_internal+0x8f/0x520 #1: ffff8881022ee0e0 (&shost->scan_mutex){+.+.}-{4:4}, at: scsi_remove_host+0x20/0x2a0 #2: ffff88811b4c4378 (&dev->mutex){....}-{4:4}, at: device_release_driver_internal+0x8f/0x520 #3: ffff8881205b6f20 (&q->q_usage_counter(queue)torvalds#16){++++}-{0:0}, at: sd_remove+0x85/0x130 #4: ffffffffa3284360 (rcu_read_lock){....}-{1:3}, at: __flush_work+0xda/0xb60 stack backtrace: CPU: 0 UID: 0 PID: 1110 Comm: modprobe Not tainted 6.14.0-rc1 torvalds#252 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-3.fc41 04/01/2014 Call Trace: <TASK> dump_stack_lvl+0x6a/0x90 print_circular_bug.cold+0x1e0/0x274 check_noncircular+0x306/0x3f0 ? __pfx_check_noncircular+0x10/0x10 ? mark_lock+0xf5/0x1650 ? __pfx_check_irq_usage+0x10/0x10 ? lockdep_lock+0xca/0x1c0 ? __pfx_lockdep_lock+0x10/0x10 __lock_acquire+0x2f52/0x5ea0 ? __pfx___lock_acquire+0x10/0x10 ? __pfx_mark_lock+0x10/0x10 lock_acquire+0x1b1/0x540 ? __flush_work+0x38f/0xb60 ? __pfx_lock_acquire+0x10/0x10 ? __pfx_lock_release+0x10/0x10 ? mark_held_locks+0x94/0xe0 ? __flush_work+0x38f/0xb60 __flush_work+0x3ac/0xb60 ? __flush_work+0x38f/0xb60 ? __pfx_mark_lock+0x10/0x10 ? __pfx___flush_work+0x10/0x10 ? __pfx_wq_barrier_func+0x10/0x10 ? __pfx___might_resched+0x10/0x10 ? mark_held_locks+0x94/0xe0 wb_shutdown+0x15b/0x1f0 bdi_unregister+0x172/0x5b0 ? __pfx_bdi_unregister+0x10/0x10 ? up_write+0x1ba/0x510 del_gendisk+0x841/0xa20 ? __pfx_del_gendisk+0x10/0x10 ? _raw_spin_unlock_irqrestore+0x35/0x60 ? __pm_runtime_resume+0x79/0x110 sd_remove+0x85/0x130 device_release_driver_internal+0x368/0x520 ? kobject_put+0x5d/0x4a0 bus_remove_device+0x1f1/0x3f0 device_del+0x3bd/0x9c0 ? __pfx_device_del+0x10/0x10 __scsi_remove_device+0x272/0x340 scsi_forget_host+0xf7/0x170 scsi_remove_host+0xd2/0x2a0 sdebug_driver_remove+0x52/0x2f0 [scsi_debug] ? kernfs_remove_by_name_ns+0xc0/0xf0 device_release_driver_internal+0x368/0x520 ? kobject_put+0x5d/0x4a0 bus_remove_device+0x1f1/0x3f0 device_del+0x3bd/0x9c0 ? __pfx_device_del+0x10/0x10 ? __pfx___mutex_unlock_slowpath+0x10/0x10 device_unregister+0x13/0xa0 sdebug_do_remove_host+0x1fb/0x290 [scsi_debug] scsi_debug_exit+0x17/0x70 [scsi_debug] __do_sys_delete_module.isra.0+0x321/0x520 ? __pfx___do_sys_delete_module.isra.0+0x10/0x10 ? __pfx_slab_free_after_rcu_debug+0x10/0x10 ? kasan_save_stack+0x2c/0x50 ? kasan_record_aux_stack+0xa3/0xb0 ? __call_rcu_common.constprop.0+0xc4/0xfb0 ? kmem_cache_free+0x3a0/0x590 ? __x64_sys_close+0x78/0xd0 do_syscall_64+0x93/0x180 ? lock_is_held_type+0xd5/0x130 ? __call_rcu_common.constprop.0+0x3c0/0xfb0 ? lockdep_hardirqs_on+0x78/0x100 ? __call_rcu_common.constprop.0+0x3c0/0xfb0 ? __pfx___call_rcu_common.constprop.0+0x10/0x10 ? kmem_cache_free+0x3a0/0x590 ? lockdep_hardirqs_on_prepare+0x16d/0x400 ? do_syscall_64+0x9f/0x180 ? lockdep_hardirqs_on+0x78/0x100 ? do_syscall_64+0x9f/0x180 ? __pfx___x64_sys_openat+0x10/0x10 ? lockdep_hardirqs_on_prepare+0x16d/0x400 ? do_syscall_64+0x9f/0x180 ? lockdep_hardirqs_on+0x78/0x100 ? do_syscall_64+0x9f/0x180 entry_SYSCALL_64_after_hwframe+0x76/0x7e RIP: 0033:0x7f436712b68b RSP: 002b:00007ffe9f1a8658 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0 RAX: ffffffffffffffda RBX: 00005559b367fd80 RCX: 00007f436712b68b RDX: 0000000000000000 RSI: 0000000000000800 RDI: 00005559b367fde8 RBP: 00007ffe9f1a8680 R08: 1999999999999999 R09: 0000000000000000 R10: 00007f43671a5fe0 R11: 0000000000000206 R12: 0000000000000000 R13: 00007ffe9f1a86b0 R14: 0000000000000000 R15: 0000000000000000 </TASK> Reported-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com> CC: <stable@vger.kernel.org> # 6.13+ Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
kdave
pushed a commit
that referenced
this pull request
Apr 23, 2025
There is a potential deadlock if we do report zones in an IO context, detailed in below lockdep report. When one process do a report zones and another process freezes the block device, the report zones side cannot allocate a tag because the freeze is already started. This can thus result in new block group creation to hang forever, blocking the write path. Thankfully, a new block group should be created on empty zones. So, reporting the zones is not necessary and we can set the write pointer = 0 and load the zone capacity from the block layer using bdev_zone_capacity() helper. ====================================================== WARNING: possible circular locking dependency detected 6.14.0-rc1 torvalds#252 Not tainted ------------------------------------------------------ modprobe/1110 is trying to acquire lock: ffff888100ac83e0 ((work_completion)(&(&wb->dwork)->work)){+.+.}-{0:0}, at: __flush_work+0x38f/0xb60 but task is already holding lock: ffff8881205b6f20 (&q->q_usage_counter(queue)torvalds#16){++++}-{0:0}, at: sd_remove+0x85/0x130 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #3 (&q->q_usage_counter(queue)torvalds#16){++++}-{0:0}: blk_queue_enter+0x3d9/0x500 blk_mq_alloc_request+0x47d/0x8e0 scsi_execute_cmd+0x14f/0xb80 sd_zbc_do_report_zones+0x1c1/0x470 sd_zbc_report_zones+0x362/0xd60 blkdev_report_zones+0x1b1/0x2e0 btrfs_get_dev_zones+0x215/0x7e0 [btrfs] btrfs_load_block_group_zone_info+0x6d2/0x2c10 [btrfs] btrfs_make_block_group+0x36b/0x870 [btrfs] btrfs_create_chunk+0x147d/0x2320 [btrfs] btrfs_chunk_alloc+0x2ce/0xcf0 [btrfs] start_transaction+0xce6/0x1620 [btrfs] btrfs_uuid_scan_kthread+0x4ee/0x5b0 [btrfs] kthread+0x39d/0x750 ret_from_fork+0x30/0x70 ret_from_fork_asm+0x1a/0x30 -> #2 (&fs_info->dev_replace.rwsem){++++}-{4:4}: down_read+0x9b/0x470 btrfs_map_block+0x2ce/0x2ce0 [btrfs] btrfs_submit_chunk+0x2d4/0x16c0 [btrfs] btrfs_submit_bbio+0x16/0x30 [btrfs] btree_write_cache_pages+0xb5a/0xf90 [btrfs] do_writepages+0x17f/0x7b0 __writeback_single_inode+0x114/0xb00 writeback_sb_inodes+0x52b/0xe00 wb_writeback+0x1a7/0x800 wb_workfn+0x12a/0xbd0 process_one_work+0x85a/0x1460 worker_thread+0x5e2/0xfc0 kthread+0x39d/0x750 ret_from_fork+0x30/0x70 ret_from_fork_asm+0x1a/0x30 -> #1 (&fs_info->zoned_meta_io_lock){+.+.}-{4:4}: __mutex_lock+0x1aa/0x1360 btree_write_cache_pages+0x252/0xf90 [btrfs] do_writepages+0x17f/0x7b0 __writeback_single_inode+0x114/0xb00 writeback_sb_inodes+0x52b/0xe00 wb_writeback+0x1a7/0x800 wb_workfn+0x12a/0xbd0 process_one_work+0x85a/0x1460 worker_thread+0x5e2/0xfc0 kthread+0x39d/0x750 ret_from_fork+0x30/0x70 ret_from_fork_asm+0x1a/0x30 -> #0 ((work_completion)(&(&wb->dwork)->work)){+.+.}-{0:0}: __lock_acquire+0x2f52/0x5ea0 lock_acquire+0x1b1/0x540 __flush_work+0x3ac/0xb60 wb_shutdown+0x15b/0x1f0 bdi_unregister+0x172/0x5b0 del_gendisk+0x841/0xa20 sd_remove+0x85/0x130 device_release_driver_internal+0x368/0x520 bus_remove_device+0x1f1/0x3f0 device_del+0x3bd/0x9c0 __scsi_remove_device+0x272/0x340 scsi_forget_host+0xf7/0x170 scsi_remove_host+0xd2/0x2a0 sdebug_driver_remove+0x52/0x2f0 [scsi_debug] device_release_driver_internal+0x368/0x520 bus_remove_device+0x1f1/0x3f0 device_del+0x3bd/0x9c0 device_unregister+0x13/0xa0 sdebug_do_remove_host+0x1fb/0x290 [scsi_debug] scsi_debug_exit+0x17/0x70 [scsi_debug] __do_sys_delete_module.isra.0+0x321/0x520 do_syscall_64+0x93/0x180 entry_SYSCALL_64_after_hwframe+0x76/0x7e other info that might help us debug this: Chain exists of: (work_completion)(&(&wb->dwork)->work) --> &fs_info->dev_replace.rwsem --> &q->q_usage_counter(queue)torvalds#16 Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(&q->q_usage_counter(queue)torvalds#16); lock(&fs_info->dev_replace.rwsem); lock(&q->q_usage_counter(queue)torvalds#16); lock((work_completion)(&(&wb->dwork)->work)); *** DEADLOCK *** 5 locks held by modprobe/1110: #0: ffff88811f7bc108 (&dev->mutex){....}-{4:4}, at: device_release_driver_internal+0x8f/0x520 #1: ffff8881022ee0e0 (&shost->scan_mutex){+.+.}-{4:4}, at: scsi_remove_host+0x20/0x2a0 #2: ffff88811b4c4378 (&dev->mutex){....}-{4:4}, at: device_release_driver_internal+0x8f/0x520 #3: ffff8881205b6f20 (&q->q_usage_counter(queue)torvalds#16){++++}-{0:0}, at: sd_remove+0x85/0x130 #4: ffffffffa3284360 (rcu_read_lock){....}-{1:3}, at: __flush_work+0xda/0xb60 stack backtrace: CPU: 0 UID: 0 PID: 1110 Comm: modprobe Not tainted 6.14.0-rc1 torvalds#252 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-3.fc41 04/01/2014 Call Trace: <TASK> dump_stack_lvl+0x6a/0x90 print_circular_bug.cold+0x1e0/0x274 check_noncircular+0x306/0x3f0 ? __pfx_check_noncircular+0x10/0x10 ? mark_lock+0xf5/0x1650 ? __pfx_check_irq_usage+0x10/0x10 ? lockdep_lock+0xca/0x1c0 ? __pfx_lockdep_lock+0x10/0x10 __lock_acquire+0x2f52/0x5ea0 ? __pfx___lock_acquire+0x10/0x10 ? __pfx_mark_lock+0x10/0x10 lock_acquire+0x1b1/0x540 ? __flush_work+0x38f/0xb60 ? __pfx_lock_acquire+0x10/0x10 ? __pfx_lock_release+0x10/0x10 ? mark_held_locks+0x94/0xe0 ? __flush_work+0x38f/0xb60 __flush_work+0x3ac/0xb60 ? __flush_work+0x38f/0xb60 ? __pfx_mark_lock+0x10/0x10 ? __pfx___flush_work+0x10/0x10 ? __pfx_wq_barrier_func+0x10/0x10 ? __pfx___might_resched+0x10/0x10 ? mark_held_locks+0x94/0xe0 wb_shutdown+0x15b/0x1f0 bdi_unregister+0x172/0x5b0 ? __pfx_bdi_unregister+0x10/0x10 ? up_write+0x1ba/0x510 del_gendisk+0x841/0xa20 ? __pfx_del_gendisk+0x10/0x10 ? _raw_spin_unlock_irqrestore+0x35/0x60 ? __pm_runtime_resume+0x79/0x110 sd_remove+0x85/0x130 device_release_driver_internal+0x368/0x520 ? kobject_put+0x5d/0x4a0 bus_remove_device+0x1f1/0x3f0 device_del+0x3bd/0x9c0 ? __pfx_device_del+0x10/0x10 __scsi_remove_device+0x272/0x340 scsi_forget_host+0xf7/0x170 scsi_remove_host+0xd2/0x2a0 sdebug_driver_remove+0x52/0x2f0 [scsi_debug] ? kernfs_remove_by_name_ns+0xc0/0xf0 device_release_driver_internal+0x368/0x520 ? kobject_put+0x5d/0x4a0 bus_remove_device+0x1f1/0x3f0 device_del+0x3bd/0x9c0 ? __pfx_device_del+0x10/0x10 ? __pfx___mutex_unlock_slowpath+0x10/0x10 device_unregister+0x13/0xa0 sdebug_do_remove_host+0x1fb/0x290 [scsi_debug] scsi_debug_exit+0x17/0x70 [scsi_debug] __do_sys_delete_module.isra.0+0x321/0x520 ? __pfx___do_sys_delete_module.isra.0+0x10/0x10 ? __pfx_slab_free_after_rcu_debug+0x10/0x10 ? kasan_save_stack+0x2c/0x50 ? kasan_record_aux_stack+0xa3/0xb0 ? __call_rcu_common.constprop.0+0xc4/0xfb0 ? kmem_cache_free+0x3a0/0x590 ? __x64_sys_close+0x78/0xd0 do_syscall_64+0x93/0x180 ? lock_is_held_type+0xd5/0x130 ? __call_rcu_common.constprop.0+0x3c0/0xfb0 ? lockdep_hardirqs_on+0x78/0x100 ? __call_rcu_common.constprop.0+0x3c0/0xfb0 ? __pfx___call_rcu_common.constprop.0+0x10/0x10 ? kmem_cache_free+0x3a0/0x590 ? lockdep_hardirqs_on_prepare+0x16d/0x400 ? do_syscall_64+0x9f/0x180 ? lockdep_hardirqs_on+0x78/0x100 ? do_syscall_64+0x9f/0x180 ? __pfx___x64_sys_openat+0x10/0x10 ? lockdep_hardirqs_on_prepare+0x16d/0x400 ? do_syscall_64+0x9f/0x180 ? lockdep_hardirqs_on+0x78/0x100 ? do_syscall_64+0x9f/0x180 entry_SYSCALL_64_after_hwframe+0x76/0x7e RIP: 0033:0x7f436712b68b RSP: 002b:00007ffe9f1a8658 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0 RAX: ffffffffffffffda RBX: 00005559b367fd80 RCX: 00007f436712b68b RDX: 0000000000000000 RSI: 0000000000000800 RDI: 00005559b367fde8 RBP: 00007ffe9f1a8680 R08: 1999999999999999 R09: 0000000000000000 R10: 00007f43671a5fe0 R11: 0000000000000206 R12: 0000000000000000 R13: 00007ffe9f1a86b0 R14: 0000000000000000 R15: 0000000000000000 </TASK> Reported-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com> CC: <stable@vger.kernel.org> # 6.13+ Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Please consider these fixes, in the area of handling devices and sysfs. Mainly they are bug fixes and framework changes. And towards the end of this patch set, I have two patches which are introducing two new features, device delete by devid and sysfs attributes for btrfs pool. These patches were sent to mailing list before.
Kindly note few of the subject are changed for good and to backtrack the old subject are maintained in the changelog. Also the review comment changes that some of the patches went through are also in the changelog, which probably should be deleted when merged.
Thanks, Anand