forked from torvalds/linux
-
Notifications
You must be signed in to change notification settings - Fork 15
Pull request #1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Closed
Closed
Pull request #1
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
This will return EIO when __bread() fails to read SB, instead of EINVAL. Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
…INTK defined error handling logic behaves differently with or without CONFIG_PRINTK defined, since there are two copies of the same function which a bit of different logic One, when CONFIG_PRINTK is defined, code is __btrfs_std_error(..) { :: save_error_info(fs_info); if (sb->s_flags & MS_BORN) btrfs_handle_error(fs_info); } and two when CONFIG_PRINTK is not defined, the code is __btrfs_std_error(..) { :: if (sb->s_flags & MS_BORN) { save_error_info(fs_info); btrfs_handle_error(fs_info); } } I doubt if this was intentional ? and appear to have caused since we maintain two copies of the same function and they got diverged with commits. Now to decide which logic is correct reviewed changes as below, 533574c Commit added two copies of this function cf79ffb Commit made change to only one copy of the function and to the copy when CONFIG_PRINTK is defined. To fix this, instead of maintaining two copies of same function approach, maintain single function, and just put the extra portion of the code under CONFIG_PRINTK define. This patch just does that. And keeps code of with CONFIG_PRINTK defined. Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_error() and btrfs_std_error() does the same thing and calls _btrfs_std_error(), so consolidate them together. And the main motivation is that btrfs_error() is closely named with btrfs_err(), one handles error action the other is to log the error, so don't closely name them. Signed-off-by: Anand Jain <anand.jain@oracle.com> Suggested-by: David Sterba <dsterba@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
…ot found Use btrfs specific error code BTRFS_ERROR_DEV_MISSING_NOT_FOUND instead of -ENOENT. Next this removes the logging when user specifies "missing" and we don't find it in the kernel device list. Logging are for system events not for user input errors. Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
This uses a chunk of code from btrfs_read_dev_super() and creates a function called btrfs_read_dev_one_super() so that next patch can use it for scratch superblock. Signed-off-by: Anand Jain <anand.jain@oracle.com> [renamed bufhead to bh] Signed-off-by: David Sterba <dsterba@suse.com>
This patch updates and renames btrfs_scratch_superblocks, (which is used by the replace device thread), with those fixes from the scratch superblock code section of btrfs_rm_device(). The fixes are: Scratch all copies of superblock Notify kobject that superblock has been changed Update time on the device So that btrfs_rm_device() can use the function btrfs_scratch_superblocks() instead of its own scratch code. And further replace deivce code which similarly releases device back to the system, will have the fixes from the btrfs device delete. Signed-off-by: Anand Jain <anand.jain@oracle.com> [renamed to btrfs_scratch_superblock] Signed-off-by: David Sterba <dsterba@suse.com>
By general rule of thumb there shouldn't be any way that user land could trigger a kernel operation just by sending wrong arguments. Here do commit cleanups after user input has been verified. Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
Originally the message was not in a helper but ended up there. We should print error messages from callers instead. Signed-off-by: Anand Jain <anand.jain@oracle.com> [reworded subject and changelog] Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com> [reworded subject and changelog] Signed-off-by: David Sterba <dsterba@suse.com>
To avoid deadlock described in commit 084b6e7 ("btrfs: Fix a lockdep warning when running xfstest."), we should move kobj stuff out of dev_replace lock range. "It is because the btrfs_kobj_{add/rm}_device() will call memory allocation with GFP_KERNEL, which may flush fs page cache to free space, waiting for it self to do the commit, causing the deadlock. To solve the problem, move btrfs_kobj_{add/rm}_device() out of the dev_replace lock range, also involing split the btrfs_rm_dev_replace_srcdev() function into remove and free parts. Now only btrfs_rm_dev_replace_remove_srcdev() is called in dev_replace lock range, and kobj_{add/rm} and btrfs_rm_dev_replace_free_srcdev() are called out of the lock range." Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Anand Jain <anand.jain@oracle.com> [added lockup description] Signed-off-by: David Sterba <dsterba@suse.com>
This patch will log return value of add/del_qgroup_relation() and pass the err code of btrfs_run_qgroups to the btrfs_std_error(). Signed-off-by: Anand Jain <anand.jain@oracle.com>
A part of code from btrfs_scan_one_device() is moved to a new function btrfs_read_disk_super(), so that former function looks cleaner and moves the code to ensure null terminating label to it as well. Further there is opportunity to merge various duplicate code on read disk super. Earlier attempt on this was highlighted that there was some issues for which there are multiple versions, however it was not clear what was issue. So until its worked out we can keep it in a separate function. Signed-off-by: Anand Jain <anand.jain@oracle.com>
Optional Label may or may not be set, or it might be set at some time later. However while debugging to search through the kernel logs the scripts would need the logs to be consistent, so logs search key words shouldn't depend on the optional variables, instead fsid is better. Signed-off-by: Anand Jain <anand.jain@oracle.com>
From the issue diagnosable point of view, log if the device path is changed. Signed-off-by: Anand Jain <anand.jain@oracle.com>
looks like oversight, call brelse() when checksum fails. further down the code in the non error path we do call brelse() and so we don't see brelse() in the goto error.. paths. Signed-off-by: Anand Jain <anand.jain@oracle.com>
optimize check for stale device to only be checked when there is device added or changed. If there is no update to the device, there is no need to call btrfs_free_stale_device(). Signed-off-by: Anand Jain <anand.jain@oracle.com>
This adds an enhancement to show the seed fsid and its devices on the btrfs sysfs. The way sprouting handles fs_devices: clone seed fs_devices and add to the fs_uuids mem copy seed fs_devices and assign to fs_devices->seed (move dev_list) evacuate seed fs_devices contents to hold sprout fs devices contents So to be inline with this fs_devices changes during seeding, represent seed fsid under the sprout fsid, this is achieved by using the kobject_move() The end result will be, /sys/fs/btrfs/sprout-fsid/seed/level-1-seed-fsid/seed/(if)level-2-seed-fsid Signed-off-by: Anand Jain <anand.jain@oracle.com>
We need fsid kobject to hold pool attributes however its created only when fs is mounted. So, this patch changes the life cycle of the fsid and devices kobjects /sys/fs/btrfs/<fsid> and /sys/fs/btrfs/<fsid>/devices, from created and destroyed by mount and unmount event to created and destroyed by scanned and module-unload events respectively. However this does not alter life cycle of fs attributes as such. Signed-off-by: Anand Jain <anand.jain@oracle.com>
move a section of btrfs_rm_device() code to check for min number of the devices into the function __check_raid_min_devices() v2: commit update and title renamed from Btrfs: move check for min number of devices to a function Signed-off-by: Anand Jain <anand.jain@oracle.com>
__check_raid_min_device() which was pealed from btrfs_rm_device() maintianed its original code to show the block move. This patch cleans up __check_raid_min_device(). Signed-off-by: Anand Jain <anand.jain@oracle.com>
The patch renames btrfs_dev_replace_find_srcdev() to btrfs_find_device_by_user_input() and moves it to volumes.c. so that delete device can use it. v2: changed title from 'Btrfs: create rename btrfs_dev_replace_find_srcdev()' and commit update Signed-off-by: Anand Jain <anand.jain@oracle.com>
btrfs_rm_device() has a section of the code which can be replaced btrfs_find_device_by_user_input() Signed-off-by: Anand Jain <anand.jain@oracle.com>
The operation of device replace and device delete follows same steps upto some depth with in btrfs kernel, however they don't share codes. This enhancement will help replace and delete to share codes. Btrfs: enhance check device_path in btrfs_find_device_by_user_input() Signed-off-by: Anand Jain <anand.jain@oracle.com>
With the previous patches now the btrfs_scratch_superblocks() is ready to be used in btrfs_rm_device() so use it. Signed-off-by: Anand Jain <anand.jain@oracle.com>
This introduces new ioctl BTRFS_IOC_RM_DEV_V2, which uses enhanced struct btrfs_ioctl_vol_args_v2 to carry devid as an user argument. The patch won't delete the old ioctl interface and remains backward compatible with user land progs. Test case/script: echo "0 $(blockdev --getsz /dev/sdf) linear /dev/sdf 0" | dmsetup create bad_disk mkfs.btrfs -f -d raid1 -m raid1 /dev/sdd /dev/sde /dev/mapper/bad_disk mount /dev/sdd /btrfs dmsetup suspend bad_disk echo "0 $(blockdev --getsz /dev/sdf) error /dev/sdf 0" | dmsetup load bad_disk dmsetup resume bad_disk echo "bad disk failed. now deleting/replacing" btrfs dev del 3 /btrfs echo $? btrfs fi show /btrfs umount /btrfs btrfs-show-super /dev/sdd | egrep num_device dmsetup remove bad_disk wipefs -a /dev/sdf Signed-off-by: Anand Jain <anand.jain@oracle.com> Reported-by: Martin <m_btrfs@ml1.co.uk>
kdave
pushed a commit
that referenced
this pull request
Jun 2, 2025
Patch series "hung_task: extend blocking task stacktrace dump to semaphore", v5. Inspired by mutex blocker tracking[1], this patch series extend the feature to not only dump the blocker task holding a mutex but also to support semaphores. Unlike mutexes, semaphores lack explicit ownership tracking, making it challenging to identify the root cause of hangs. To address this, we introduce a last_holder field to the semaphore structure, which is updated when a task successfully calls down() and cleared during up(). The assumption is that if a task is blocked on a semaphore, the holders must not have released it. While this does not guarantee that the last holder is one of the current blockers, it likely provides a practical hint for diagnosing semaphore-related stalls. With this change, the hung task detector can now show blocker task's info like below: [Tue Apr 8 12:19:07 2025] INFO: task cat:945 blocked for more than 120 seconds. [Tue Apr 8 12:19:07 2025] Tainted: G E 6.14.0-rc6+ #1 [Tue Apr 8 12:19:07 2025] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [Tue Apr 8 12:19:07 2025] task:cat state:D stack:0 pid:945 tgid:945 ppid:828 task_flags:0x400000 flags:0x00000000 [Tue Apr 8 12:19:07 2025] Call Trace: [Tue Apr 8 12:19:07 2025] <TASK> [Tue Apr 8 12:19:07 2025] __schedule+0x491/0xbd0 [Tue Apr 8 12:19:07 2025] schedule+0x27/0xf0 [Tue Apr 8 12:19:07 2025] schedule_timeout+0xe3/0xf0 [Tue Apr 8 12:19:07 2025] ? __folio_mod_stat+0x2a/0x80 [Tue Apr 8 12:19:07 2025] ? set_ptes.constprop.0+0x27/0x90 [Tue Apr 8 12:19:07 2025] __down_common+0x155/0x280 [Tue Apr 8 12:19:07 2025] down+0x53/0x70 [Tue Apr 8 12:19:07 2025] read_dummy_semaphore+0x23/0x60 [Tue Apr 8 12:19:07 2025] full_proxy_read+0x5f/0xa0 [Tue Apr 8 12:19:07 2025] vfs_read+0xbc/0x350 [Tue Apr 8 12:19:07 2025] ? __count_memcg_events+0xa5/0x140 [Tue Apr 8 12:19:07 2025] ? count_memcg_events.constprop.0+0x1a/0x30 [Tue Apr 8 12:19:07 2025] ? handle_mm_fault+0x180/0x260 [Tue Apr 8 12:19:07 2025] ksys_read+0x66/0xe0 [Tue Apr 8 12:19:07 2025] do_syscall_64+0x51/0x120 [Tue Apr 8 12:19:07 2025] entry_SYSCALL_64_after_hwframe+0x76/0x7e [Tue Apr 8 12:19:07 2025] RIP: 0033:0x7f419478f46e [Tue Apr 8 12:19:07 2025] RSP: 002b:00007fff1c4d2668 EFLAGS: 00000246 ORIG_RAX: 0000000000000000 [Tue Apr 8 12:19:07 2025] RAX: ffffffffffffffda RBX: 0000000000020000 RCX: 00007f419478f46e [Tue Apr 8 12:19:07 2025] RDX: 0000000000020000 RSI: 00007f4194683000 RDI: 0000000000000003 [Tue Apr 8 12:19:07 2025] RBP: 00007f4194683000 R08: 00007f4194682010 R09: 0000000000000000 [Tue Apr 8 12:19:07 2025] R10: fffffffffffffbc5 R11: 0000000000000246 R12: 0000000000000000 [Tue Apr 8 12:19:07 2025] R13: 0000000000000003 R14: 0000000000020000 R15: 0000000000020000 [Tue Apr 8 12:19:07 2025] </TASK> [Tue Apr 8 12:19:07 2025] INFO: task cat:945 blocked on a semaphore likely last held by task cat:938 [Tue Apr 8 12:19:07 2025] task:cat state:S stack:0 pid:938 tgid:938 ppid:584 task_flags:0x400000 flags:0x00000000 [Tue Apr 8 12:19:07 2025] Call Trace: [Tue Apr 8 12:19:07 2025] <TASK> [Tue Apr 8 12:19:07 2025] __schedule+0x491/0xbd0 [Tue Apr 8 12:19:07 2025] ? _raw_spin_unlock_irqrestore+0xe/0x40 [Tue Apr 8 12:19:07 2025] schedule+0x27/0xf0 [Tue Apr 8 12:19:07 2025] schedule_timeout+0x77/0xf0 [Tue Apr 8 12:19:07 2025] ? __pfx_process_timeout+0x10/0x10 [Tue Apr 8 12:19:07 2025] msleep_interruptible+0x49/0x60 [Tue Apr 8 12:19:07 2025] read_dummy_semaphore+0x2d/0x60 [Tue Apr 8 12:19:07 2025] full_proxy_read+0x5f/0xa0 [Tue Apr 8 12:19:07 2025] vfs_read+0xbc/0x350 [Tue Apr 8 12:19:07 2025] ? __count_memcg_events+0xa5/0x140 [Tue Apr 8 12:19:07 2025] ? count_memcg_events.constprop.0+0x1a/0x30 [Tue Apr 8 12:19:07 2025] ? handle_mm_fault+0x180/0x260 [Tue Apr 8 12:19:07 2025] ksys_read+0x66/0xe0 [Tue Apr 8 12:19:07 2025] do_syscall_64+0x51/0x120 [Tue Apr 8 12:19:07 2025] entry_SYSCALL_64_after_hwframe+0x76/0x7e [Tue Apr 8 12:19:07 2025] RIP: 0033:0x7f7c584a646e [Tue Apr 8 12:19:07 2025] RSP: 002b:00007ffdba8ce158 EFLAGS: 00000246 ORIG_RAX: 0000000000000000 [Tue Apr 8 12:19:07 2025] RAX: ffffffffffffffda RBX: 0000000000020000 RCX: 00007f7c584a646e [Tue Apr 8 12:19:07 2025] RDX: 0000000000020000 RSI: 00007f7c5839a000 RDI: 0000000000000003 [Tue Apr 8 12:19:07 2025] RBP: 00007f7c5839a000 R08: 00007f7c58399010 R09: 0000000000000000 [Tue Apr 8 12:19:07 2025] R10: fffffffffffffbc5 R11: 0000000000000246 R12: 0000000000000000 [Tue Apr 8 12:19:07 2025] R13: 0000000000000003 R14: 0000000000020000 R15: 0000000000020000 [Tue Apr 8 12:19:07 2025] </TASK> This patch (of 3): This patch replaces 'struct mutex *blocker_mutex' with 'unsigned long blocker', as only one blocker is active at a time. The blocker filed can store both the lock addrees and the lock type, with LSB used to encode the type as Masami suggested, making it easier to extend the feature to cover other types of locks. Also, once the lock type is determined, we can directly extract the address and cast it to a lock pointer ;) Link: https://lkml.kernel.org/r/20250414145945.84916-1-ioworker0@gmail.com Link: https://lore.kernel.org/all/174046694331.2194069.15472952050240807469.stgit@mhiramat.tok.corp.google.com [1] Link: https://lkml.kernel.org/r/20250414145945.84916-2-ioworker0@gmail.com Signed-off-by: Mingzhe Yang <mingzhe.yang@ly.com> Signed-off-by: Lance Yang <ioworker0@gmail.com> Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Suggested-by: Andrew Morton <akpm@linux-foundation.org> Suggested-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Cc: Anna Schumaker <anna.schumaker@oracle.com> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Joel Granados <joel.granados@kernel.org> Cc: John Stultz <jstultz@google.com> Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Tomasz Figa <tfiga@chromium.org> Cc: Waiman Long <longman@redhat.com> Cc: Will Deacon <will@kernel.org> Cc: Yongliang Gao <leonylgao@tencent.com> Cc: Zi Li <amaindex@outlook.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
kdave
pushed a commit
that referenced
this pull request
Jun 2, 2025
Inspired by mutex blocker tracking[1], this patch makes a trade-off to balance the overhead and utility of the hung task detector. Unlike mutexes, semaphores lack explicit ownership tracking, making it challenging to identify the root cause of hangs. To address this, we introduce a last_holder field to the semaphore structure, which is updated when a task successfully calls down() and cleared during up(). The assumption is that if a task is blocked on a semaphore, the holders must not have released it. While this does not guarantee that the last holder is one of the current blockers, it likely provides a practical hint for diagnosing semaphore-related stalls. With this change, the hung task detector can now show blocker task's info like below: [Tue Apr 8 12:19:07 2025] INFO: task cat:945 blocked for more than 120 seconds. [Tue Apr 8 12:19:07 2025] Tainted: G E 6.14.0-rc6+ #1 [Tue Apr 8 12:19:07 2025] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [Tue Apr 8 12:19:07 2025] task:cat state:D stack:0 pid:945 tgid:945 ppid:828 task_flags:0x400000 flags:0x00000000 [Tue Apr 8 12:19:07 2025] Call Trace: [Tue Apr 8 12:19:07 2025] <TASK> [Tue Apr 8 12:19:07 2025] __schedule+0x491/0xbd0 [Tue Apr 8 12:19:07 2025] schedule+0x27/0xf0 [Tue Apr 8 12:19:07 2025] schedule_timeout+0xe3/0xf0 [Tue Apr 8 12:19:07 2025] ? __folio_mod_stat+0x2a/0x80 [Tue Apr 8 12:19:07 2025] ? set_ptes.constprop.0+0x27/0x90 [Tue Apr 8 12:19:07 2025] __down_common+0x155/0x280 [Tue Apr 8 12:19:07 2025] down+0x53/0x70 [Tue Apr 8 12:19:07 2025] read_dummy_semaphore+0x23/0x60 [Tue Apr 8 12:19:07 2025] full_proxy_read+0x5f/0xa0 [Tue Apr 8 12:19:07 2025] vfs_read+0xbc/0x350 [Tue Apr 8 12:19:07 2025] ? __count_memcg_events+0xa5/0x140 [Tue Apr 8 12:19:07 2025] ? count_memcg_events.constprop.0+0x1a/0x30 [Tue Apr 8 12:19:07 2025] ? handle_mm_fault+0x180/0x260 [Tue Apr 8 12:19:07 2025] ksys_read+0x66/0xe0 [Tue Apr 8 12:19:07 2025] do_syscall_64+0x51/0x120 [Tue Apr 8 12:19:07 2025] entry_SYSCALL_64_after_hwframe+0x76/0x7e [Tue Apr 8 12:19:07 2025] RIP: 0033:0x7f419478f46e [Tue Apr 8 12:19:07 2025] RSP: 002b:00007fff1c4d2668 EFLAGS: 00000246 ORIG_RAX: 0000000000000000 [Tue Apr 8 12:19:07 2025] RAX: ffffffffffffffda RBX: 0000000000020000 RCX: 00007f419478f46e [Tue Apr 8 12:19:07 2025] RDX: 0000000000020000 RSI: 00007f4194683000 RDI: 0000000000000003 [Tue Apr 8 12:19:07 2025] RBP: 00007f4194683000 R08: 00007f4194682010 R09: 0000000000000000 [Tue Apr 8 12:19:07 2025] R10: fffffffffffffbc5 R11: 0000000000000246 R12: 0000000000000000 [Tue Apr 8 12:19:07 2025] R13: 0000000000000003 R14: 0000000000020000 R15: 0000000000020000 [Tue Apr 8 12:19:07 2025] </TASK> [Tue Apr 8 12:19:07 2025] INFO: task cat:945 blocked on a semaphore likely last held by task cat:938 [Tue Apr 8 12:19:07 2025] task:cat state:S stack:0 pid:938 tgid:938 ppid:584 task_flags:0x400000 flags:0x00000000 [Tue Apr 8 12:19:07 2025] Call Trace: [Tue Apr 8 12:19:07 2025] <TASK> [Tue Apr 8 12:19:07 2025] __schedule+0x491/0xbd0 [Tue Apr 8 12:19:07 2025] ? _raw_spin_unlock_irqrestore+0xe/0x40 [Tue Apr 8 12:19:07 2025] schedule+0x27/0xf0 [Tue Apr 8 12:19:07 2025] schedule_timeout+0x77/0xf0 [Tue Apr 8 12:19:07 2025] ? __pfx_process_timeout+0x10/0x10 [Tue Apr 8 12:19:07 2025] msleep_interruptible+0x49/0x60 [Tue Apr 8 12:19:07 2025] read_dummy_semaphore+0x2d/0x60 [Tue Apr 8 12:19:07 2025] full_proxy_read+0x5f/0xa0 [Tue Apr 8 12:19:07 2025] vfs_read+0xbc/0x350 [Tue Apr 8 12:19:07 2025] ? __count_memcg_events+0xa5/0x140 [Tue Apr 8 12:19:07 2025] ? count_memcg_events.constprop.0+0x1a/0x30 [Tue Apr 8 12:19:07 2025] ? handle_mm_fault+0x180/0x260 [Tue Apr 8 12:19:07 2025] ksys_read+0x66/0xe0 [Tue Apr 8 12:19:07 2025] do_syscall_64+0x51/0x120 [Tue Apr 8 12:19:07 2025] entry_SYSCALL_64_after_hwframe+0x76/0x7e [Tue Apr 8 12:19:07 2025] RIP: 0033:0x7f7c584a646e [Tue Apr 8 12:19:07 2025] RSP: 002b:00007ffdba8ce158 EFLAGS: 00000246 ORIG_RAX: 0000000000000000 [Tue Apr 8 12:19:07 2025] RAX: ffffffffffffffda RBX: 0000000000020000 RCX: 00007f7c584a646e [Tue Apr 8 12:19:07 2025] RDX: 0000000000020000 RSI: 00007f7c5839a000 RDI: 0000000000000003 [Tue Apr 8 12:19:07 2025] RBP: 00007f7c5839a000 R08: 00007f7c58399010 R09: 0000000000000000 [Tue Apr 8 12:19:07 2025] R10: fffffffffffffbc5 R11: 0000000000000246 R12: 0000000000000000 [Tue Apr 8 12:19:07 2025] R13: 0000000000000003 R14: 0000000000020000 R15: 0000000000020000 [Tue Apr 8 12:19:07 2025] </TASK> [1] https://lore.kernel.org/all/174046694331.2194069.15472952050240807469.stgit@mhiramat.tok.corp.google.com Link: https://lkml.kernel.org/r/20250414145945.84916-3-ioworker0@gmail.com Signed-off-by: Mingzhe Yang <mingzhe.yang@ly.com> Signed-off-by: Lance Yang <ioworker0@gmail.com> Suggested-by: Andrew Morton <akpm@linux-foundation.org> Suggested-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Cc: Anna Schumaker <anna.schumaker@oracle.com> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Joel Granados <joel.granados@kernel.org> Cc: John Stultz <jstultz@google.com> Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Tomasz Figa <tfiga@chromium.org> Cc: Waiman Long <longman@redhat.com> Cc: Will Deacon <will@kernel.org> Cc: Yongliang Gao <leonylgao@tencent.com> Cc: Zi Li <amaindex@outlook.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
kdave
pushed a commit
that referenced
this pull request
Jun 2, 2025
A configfs /sys/kernel/config/crash_dm_crypt_keys is provided for user space to make the dm crypt keys persist for the kdump kernel. Take the case of dumping to a LUKS-encrypted target as an example, here is the life cycle of the kdump copies of LUKS volume keys, 1. After the 1st kernel loads the initramfs during boot, systemd uses an user-input passphrase to de-crypt the LUKS volume keys or simply TPM-sealed volume keys and then save the volume keys to specified keyring (using the --link-vk-to-keyring API) and the keys will expire within specified time. 2. A user space tool (kdump initramfs loader like kdump-utils) create key items inside /sys/kernel/config/crash_dm_crypt_keys to inform the 1st kernel which keys are needed. 3. When the kdump initramfs is loaded by the kexec_file_load syscall, the 1st kernel will iterate created key items, save the keys to kdump reserved memory. 4. When the 1st kernel crashes and the kdump initramfs is booted, the kdump initramfs asks the kdump kernel to create a user key using the key stored in kdump reserved memory by writing yes to /sys/kernel/crash_dm_crypt_keys/restore. Then the LUKS encrypted device is unlocked with libcryptsetup's --volume-key-keyring API. 5. The system gets rebooted to the 1st kernel after dumping vmcore to the LUKS encrypted device is finished Eventually the keys have to stay in the kdump reserved memory for the kdump kernel to unlock encrypted volumes. During this process, some measures like letting the keys expire within specified time are desirable to reduce security risk. This patch assumes, 1) there are 128 LUKS devices at maximum to be unlocked thus MAX_KEY_NUM=128. 2) a key description won't exceed 128 bytes thus KEY_DESC_MAX_LEN=128. And here is a demo on how to interact with /sys/kernel/config/crash_dm_crypt_keys, # Add key #1 mkdir /sys/kernel/config/crash_dm_crypt_keys/7d26b7b4-e342-4d2d-b660-7426b0996720 # Add key #1's description echo cryptsetup:7d26b7b4-e342-4d2d-b660-7426b0996720 > /sys/kernel/config/crash_dm_crypt_keys/description # how many keys do we have now? cat /sys/kernel/config/crash_dm_crypt_keys/count 1 # Add key# 2 in the same way # how many keys do we have now? cat /sys/kernel/config/crash_dm_crypt_keys/count 2 # the tree structure of /crash_dm_crypt_keys configfs tree /sys/kernel/config/crash_dm_crypt_keys/ /sys/kernel/config/crash_dm_crypt_keys/ ├── 7d26b7b4-e342-4d2d-b660-7426b0996720 │ └── description ├── count ├── fce2cd38-4d59-4317-8ce2-1fd24d52c46a │ └── description Link: https://lkml.kernel.org/r/20250502011246.99238-3-coxu@redhat.com Signed-off-by: Coiby Xu <coxu@redhat.com> Acked-by: Baoquan He <bhe@redhat.com> Cc: "Daniel P. Berrange" <berrange@redhat.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Dave Young <dyoung@redhat.com> Cc: Jan Pazdziora <jpazdziora@redhat.com> Cc: Liu Pingfan <kernelfans@gmail.com> Cc: Milan Broz <gmazyland@gmail.com> Cc: Ondrej Kozina <okozina@redhat.com> Cc: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
kdave
pushed a commit
that referenced
this pull request
Jun 2, 2025
…deomode_to_var If fb_add_videomode() in do_register_framebuffer() fails to allocate memory for fb_videomode, it will later lead to a null-ptr dereference in fb_videomode_to_var(), as the fb_info is registered while not having the mode in modelist that is expected to be there, i.e. the one that is described in fb_info->var. ================================================================ general protection fault, probably for non-canonical address 0xdffffc0000000001: 0000 [#1] PREEMPT SMP KASAN NOPTI KASAN: null-ptr-deref in range [0x0000000000000008-0x000000000000000f] CPU: 1 PID: 30371 Comm: syz-executor.1 Not tainted 5.10.226-syzkaller #0 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014 RIP: 0010:fb_videomode_to_var+0x24/0x610 drivers/video/fbdev/core/modedb.c:901 Call Trace: display_to_var+0x3a/0x7c0 drivers/video/fbdev/core/fbcon.c:929 fbcon_resize+0x3e2/0x8f0 drivers/video/fbdev/core/fbcon.c:2071 resize_screen drivers/tty/vt/vt.c:1176 [inline] vc_do_resize+0x53a/0x1170 drivers/tty/vt/vt.c:1263 fbcon_modechanged+0x3ac/0x6e0 drivers/video/fbdev/core/fbcon.c:2720 fbcon_update_vcs+0x43/0x60 drivers/video/fbdev/core/fbcon.c:2776 do_fb_ioctl+0x6d2/0x740 drivers/video/fbdev/core/fbmem.c:1128 fb_ioctl+0xe7/0x150 drivers/video/fbdev/core/fbmem.c:1203 vfs_ioctl fs/ioctl.c:48 [inline] __do_sys_ioctl fs/ioctl.c:753 [inline] __se_sys_ioctl fs/ioctl.c:739 [inline] __x64_sys_ioctl+0x19a/0x210 fs/ioctl.c:739 do_syscall_64+0x33/0x40 arch/x86/entry/common.c:46 entry_SYSCALL_64_after_hwframe+0x67/0xd1 ================================================================ Even though fbcon_init() checks beforehand if fb_match_mode() in var_to_display() fails, it can not prevent the panic because fbcon_init() does not return error code. Considering this and the comment in the code about fb_match_mode() returning NULL - "This should not happen" - it is better to prevent registering the fb_info if its mode was not set successfully. Also move fb_add_videomode() closer to the beginning of do_register_framebuffer() to avoid having to do the cleanup on fail. Found by Linux Verification Center (linuxtesting.org) with Syzkaller. Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Cc: stable@vger.kernel.org Signed-off-by: Murad Masimov <m.masimov@mt-integration.ru> Signed-off-by: Helge Deller <deller@gmx.de>
kdave
pushed a commit
that referenced
this pull request
Jun 2, 2025
If fb_add_videomode() in fb_set_var() fails to allocate memory for fb_videomode, later it may lead to a null-ptr dereference in fb_videomode_to_var(), as the fb_info is registered while not having the mode in modelist that is expected to be there, i.e. the one that is described in fb_info->var. ================================================================ general protection fault, probably for non-canonical address 0xdffffc0000000001: 0000 [#1] PREEMPT SMP KASAN NOPTI KASAN: null-ptr-deref in range [0x0000000000000008-0x000000000000000f] CPU: 1 PID: 30371 Comm: syz-executor.1 Not tainted 5.10.226-syzkaller #0 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014 RIP: 0010:fb_videomode_to_var+0x24/0x610 drivers/video/fbdev/core/modedb.c:901 Call Trace: display_to_var+0x3a/0x7c0 drivers/video/fbdev/core/fbcon.c:929 fbcon_resize+0x3e2/0x8f0 drivers/video/fbdev/core/fbcon.c:2071 resize_screen drivers/tty/vt/vt.c:1176 [inline] vc_do_resize+0x53a/0x1170 drivers/tty/vt/vt.c:1263 fbcon_modechanged+0x3ac/0x6e0 drivers/video/fbdev/core/fbcon.c:2720 fbcon_update_vcs+0x43/0x60 drivers/video/fbdev/core/fbcon.c:2776 do_fb_ioctl+0x6d2/0x740 drivers/video/fbdev/core/fbmem.c:1128 fb_ioctl+0xe7/0x150 drivers/video/fbdev/core/fbmem.c:1203 vfs_ioctl fs/ioctl.c:48 [inline] __do_sys_ioctl fs/ioctl.c:753 [inline] __se_sys_ioctl fs/ioctl.c:739 [inline] __x64_sys_ioctl+0x19a/0x210 fs/ioctl.c:739 do_syscall_64+0x33/0x40 arch/x86/entry/common.c:46 entry_SYSCALL_64_after_hwframe+0x67/0xd1 ================================================================ The reason is that fb_info->var is being modified in fb_set_var(), and then fb_videomode_to_var() is called. If it fails to add the mode to fb_info->modelist, fb_set_var() returns error, but does not restore the old value of fb_info->var. Restore fb_info->var on failure the same way it is done earlier in the function. Found by Linux Verification Center (linuxtesting.org) with Syzkaller. Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Cc: stable@vger.kernel.org Signed-off-by: Murad Masimov <m.masimov@mt-integration.ru> Signed-off-by: Helge Deller <deller@gmx.de>
kdave
pushed a commit
that referenced
this pull request
Jun 3, 2025
Despite the fact that several lockdep-related checks are skipped when calling trylock* versions of the locking primitives, for example mutex_trylock, each time the mutex is acquired, a held_lock is still placed onto the lockdep stack by __lock_acquire() which is called regardless of whether the trylock* or regular locking API was used. This means that if the caller successfully acquires more than MAX_LOCK_DEPTH locks of the same class, even when using mutex_trylock, lockdep will still complain that the maximum depth of the held lock stack has been reached and disable itself. For example, the following error currently occurs in the ARM version of KVM, once the code tries to lock all vCPUs of a VM configured with more than MAX_LOCK_DEPTH vCPUs, a situation that can easily happen on modern systems, where having more than 48 CPUs is common, and it's also common to run VMs that have vCPU counts approaching that number: [ 328.171264] BUG: MAX_LOCK_DEPTH too low! [ 328.175227] turning off the locking correctness validator. [ 328.180726] Please attach the output of /proc/lock_stat to the bug report [ 328.187531] depth: 48 max: 48! [ 328.190678] 48 locks held by qemu-kvm/11664: [ 328.194957] #0: ffff800086de5ba0 (&kvm->lock){+.+.}-{3:3}, at: kvm_ioctl_create_device+0x174/0x5b0 [ 328.204048] #1: ffff0800e78800b8 (&vcpu->mutex){+.+.}-{3:3}, at: lock_all_vcpus+0x16c/0x2a0 [ 328.212521] #2: ffff07ffeee51e98 (&vcpu->mutex){+.+.}-{3:3}, at: lock_all_vcpus+0x16c/0x2a0 [ 328.220991] #3: ffff0800dc7d80b8 (&vcpu->mutex){+.+.}-{3:3}, at: lock_all_vcpus+0x16c/0x2a0 [ 328.229463] #4: ffff07ffe0c980b8 (&vcpu->mutex){+.+.}-{3:3}, at: lock_all_vcpus+0x16c/0x2a0 [ 328.237934] #5: ffff0800a3883c78 (&vcpu->mutex){+.+.}-{3:3}, at: lock_all_vcpus+0x16c/0x2a0 [ 328.246405] #6: ffff07fffbe480b8 (&vcpu->mutex){+.+.}-{3:3}, at: lock_all_vcpus+0x16c/0x2a0 Luckily, in all instances that require locking all vCPUs, the 'kvm->lock' is taken a priori, and that fact makes it possible to use the little known feature of lockdep, called a 'nest_lock', to avoid this warning and subsequent lockdep self-disablement. The action of 'nested lock' being provided to lockdep's lock_acquire(), causes the lockdep to detect that the top of the held lock stack contains a lock of the same class and then increment its reference counter instead of pushing a new held_lock item onto that stack. See __lock_acquire for more information. Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Message-ID: <20250512180407.659015-2-mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
kdave
pushed a commit
that referenced
this pull request
Jun 3, 2025
Use kvm_trylock_all_vcpus instead of a custom implementation when locking all vCPUs of a VM, to avoid triggering a lockdep warning, in the case in which the VM is configured to have more than MAX_LOCK_DEPTH vCPUs. This fixes the following false lockdep warning: [ 328.171264] BUG: MAX_LOCK_DEPTH too low! [ 328.175227] turning off the locking correctness validator. [ 328.180726] Please attach the output of /proc/lock_stat to the bug report [ 328.187531] depth: 48 max: 48! [ 328.190678] 48 locks held by qemu-kvm/11664: [ 328.194957] #0: ffff800086de5ba0 (&kvm->lock){+.+.}-{3:3}, at: kvm_ioctl_create_device+0x174/0x5b0 [ 328.204048] #1: ffff0800e78800b8 (&vcpu->mutex){+.+.}-{3:3}, at: lock_all_vcpus+0x16c/0x2a0 [ 328.212521] #2: ffff07ffeee51e98 (&vcpu->mutex){+.+.}-{3:3}, at: lock_all_vcpus+0x16c/0x2a0 [ 328.220991] #3: ffff0800dc7d80b8 (&vcpu->mutex){+.+.}-{3:3}, at: lock_all_vcpus+0x16c/0x2a0 [ 328.229463] #4: ffff07ffe0c980b8 (&vcpu->mutex){+.+.}-{3:3}, at: lock_all_vcpus+0x16c/0x2a0 [ 328.237934] #5: ffff0800a3883c78 (&vcpu->mutex){+.+.}-{3:3}, at: lock_all_vcpus+0x16c/0x2a0 [ 328.246405] #6: ffff07fffbe480b8 (&vcpu->mutex){+.+.}-{3:3}, at: lock_all_vcpus+0x16c/0x2a0 Suggested-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Acked-by: Marc Zyngier <maz@kernel.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Message-ID: <20250512180407.659015-6-mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
kdave
pushed a commit
that referenced
this pull request
Jun 3, 2025
syzkaller has found another ugly race in the VGIC, this time dealing with VGIC creation. Since kvm_vgic_create() doesn't sufficiently protect against in-flight vCPU creations, it is possible to get a vCPU into the kernel w/ an in-kernel VGIC but no allocation of private IRQs: Unable to handle kernel NULL pointer dereference at virtual address 0000000000000d20 Mem abort info: ESR = 0x0000000096000046 EC = 0x25: DABT (current EL), IL = 32 bits SET = 0, FnV = 0 EA = 0, S1PTW = 0 FSC = 0x06: level 2 translation fault Data abort info: ISV = 0, ISS = 0x00000046, ISS2 = 0x00000000 CM = 0, WnR = 1, TnD = 0, TagAccess = 0 GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0 user pgtable: 4k pages, 48-bit VAs, pgdp=0000000103e4f000 [0000000000000d20] pgd=0800000102e1c403, p4d=0800000102e1c403, pud=0800000101146403, pmd=0000000000000000 Internal error: Oops: 0000000096000046 [#1] PREEMPT SMP CPU: 9 UID: 0 PID: 246 Comm: test Not tainted 6.14.0-rc6-00097-g0c90821f5db8 torvalds#16 Hardware name: linux,dummy-virt (DT) pstate: 814020c5 (Nzcv daIF +PAN -UAO -TCO +DIT -SSBS BTYPE=--) pc : _raw_spin_lock_irqsave+0x34/0x8c lr : kvm_vgic_set_owner+0x54/0xa4 sp : ffff80008086ba20 x29: ffff80008086ba20 x28: ffff0000c19b5640 x27: 0000000000000000 x26: 0000000000000000 x25: ffff0000c4879bd0 x24: 000000000000001e x23: 0000000000000000 x22: 0000000000000000 x21: ffff0000c487af80 x20: ffff0000c487af18 x19: 0000000000000000 x18: 0000001afadd5a8b x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000001 x14: ffff0000c19b56c0 x13: 0030c9adf9d9889e x12: ffffc263710e1908 x11: 0000001afb0d74f2 x10: e0966b840b373664 x9 : ec806bf7d6a57cd5 x8 : ffff80008086b980 x7 : 0000000000000001 x6 : 0000000000000001 x5 : 0000000080800054 x4 : 4ec4ec4ec4ec4ec5 x3 : 0000000000000000 x2 : 0000000000000001 x1 : 0000000000000000 x0 : 0000000000000d20 Call trace: _raw_spin_lock_irqsave+0x34/0x8c (P) kvm_vgic_set_owner+0x54/0xa4 kvm_timer_enable+0xf4/0x274 kvm_arch_vcpu_run_pid_change+0xe0/0x380 kvm_vcpu_ioctl+0x93c/0x9e0 __arm64_sys_ioctl+0xb4/0xec invoke_syscall+0x48/0x110 el0_svc_common.constprop.0+0x40/0xe0 do_el0_svc+0x1c/0x28 el0_svc+0x30/0xd0 el0t_64_sync_handler+0x10c/0x138 el0t_64_sync+0x198/0x19c Code: b9000841 d503201f 52800001 52800022 (88e17c02) ---[ end trace 0000000000000000 ]--- Plug the race by explicitly checking for an in-progress vCPU creation and failing kvm_vgic_create() when that's the case. Add some comments to document all the things kvm_vgic_create() is trying to guard against too. Reported-by: Alexander Potapenko <glider@google.com> Signed-off-by: Oliver Upton <oliver.upton@linux.dev> Tested-by: Alexander Potapenko <glider@google.com> Link: https://lore.kernel.org/r/20250523194722.4066715-6-oliver.upton@linux.dev Signed-off-by: Marc Zyngier <maz@kernel.org>
kdave
pushed a commit
that referenced
this pull request
Jun 3, 2025
…ux/kernel/git/kvmarm/kvmarm into HEAD KVM/arm64 fixes for 6.16, take #1 - Make the irqbypass hooks resilient to changes in the GSI<->MSI routing, avoiding behind stale vLPI mappings being left behind. The fix is to resolve the VGIC IRQ using the host IRQ (which is stable) and nuking the vLPI mapping upon a routing change. - Close another VGIC race where vCPU creation races with VGIC creation, leading to in-flight vCPUs entering the kernel w/o private IRQs allocated. - Fix a build issue triggered by the recently added workaround for Ampere's AC04_CPU_23 erratum. - Correctly sign-extend the VA when emulating a TLBI instruction potentially targeting a VNCR mapping. - Avoid dereferencing a NULL pointer in the VGIC debug code, which can happen if the device doesn't have any mapping yet.
kdave
pushed a commit
that referenced
this pull request
Jun 4, 2025
I started seeing this in recent Fedora 42 kernels: # uname -a Linux number 6.14.3-300.fc42.x86_64 #1 SMP PREEMPT_DYNAMIC Sun Apr 20 16:08:39 UTC 2025 x86_64 GNU/Linux # # perf test vmlinux 1: vmlinux symtab matches kallsyms : FAILED! # Where we have Rust enabled: # grep CONFIG_RUST /boot/config-6.14.3-300.fc42.x86_64 CONFIG_RUSTC_VERSION=108600 CONFIG_RUST_IS_AVAILABLE=y CONFIG_RUSTC_LLVM_VERSION=200101 CONFIG_RUSTC_HAS_COERCE_POINTEE=y CONFIG_RUST=y CONFIG_RUSTC_VERSION_TEXT="rustc 1.86.0 (05f9846f8 2025-03-31) (Fedora 1.86.0-1.fc42)" CONFIG_RUST_FW_LOADER_ABSTRACTIONS=y CONFIG_RUST_PHYLIB_ABSTRACTIONS=y # CONFIG_RUST_DEBUG_ASSERTIONS is not set CONFIG_RUST_OVERFLOW_CHECKS=y # CONFIG_RUST_BUILD_ASSERT_ALLOW is not set # Looking at the reason for the failure: # perf test -v vmlinux |& grep ^ERR ERR : 0xffffffff99efc7d0: __pfx__RNCINvNtNtNtCsf5tcb0XGUW4_4core4iter8adapters3map12map_try_foldjNtCsagR6JbSOIa9_12drm_panic_qr7VersionuINtNtNtBa_3ops12control_flow11ControlFlowB10_ENcB10_0NCINvNvNtNtNtB8_6traits8iterator8Iterator4find5checkB10_NCNvMB12_B10_13from_segments0E0E0B12_ not on kallsyms ERR : 0xffffffff99efc7e0: _RNCINvNtNtNtCsf5tcb0XGUW4_4core4iter8adapters3map12map_try_foldjNtCsagR6JbSOIa9_12drm_panic_qr7VersionuINtNtNtBa_3ops12control_flow11ControlFlowB10_ENcB10_0NCINvNvNtNtNtB8_6traits8iterator8Iterator4find5checkB10_NCNvMB12_B10_13from_segments0E0E0B12_ not on kallsyms # But: # grep -w u /proc/kallsyms ffffffff99efc7d0 u __pfx__RNCINvNtNtNtCsf5tcb0XGUW4_4core4iter8adapters3map12map_try_foldjNtCsagR6JbSOIa9_12drm_panic_qr7VersionuINtNtNtBa_3ops12control_flow11ControlFlowB10_ENcB10_0NCINvNvNtNtNtB8_6traits8iterator8Iterator4find5checkB10_NCNvMB12_B10_13from_segments0E0E0B12_ ffffffff99efc7e0 u _RNCINvNtNtNtCsf5tcb0XGUW4_4core4iter8adapters3map12map_try_foldjNtCsagR6JbSOIa9_12drm_panic_qr7VersionuINtNtNtBa_3ops12control_flow11ControlFlowB10_ENcB10_0NCINvNvNtNtNtB8_6traits8iterator8Iterator4find5checkB10_NCNvMB12_B10_13from_segments0E0E0B12_ # The test checks that "vmlinux symtab matches kallsyms", so it finds those two symbols in vmlinux: # pahole --running_kernel_vmlinux /usr/lib/debug/lib/modules/6.14.3-300.fc42.x86_64/vmlinux # # readelf -sW /usr/lib/debug/lib/modules/6.14.3-300.fc42.x86_64/vmlinux | grep -Ew '(__pfx__RNCINvNtNtNtCsf5tcb0XGUW4_4core4iter8adapters3map12map_try_foldjNtCsagR6JbSOIa9_12drm_panic_qr7VersionuINtNtNtBa_3ops12control_flow11ControlFlowB10_ENcB10_0NCINvNvNtNtNtB8_6traits8iterator8Iterator4find5checkB10_NCNvMB12_B10_13from_segments0E0E0B12_|_RNCINvNtNtNtCsf5tcb0XGUW4_4core4iter8adapters3map12map_try_foldjNtCsagR6JbSOIa9_12drm_panic_qr7VersionuINtNtNtBa_3ops12control_flow11ControlFlowB10_ENcB10_0NCINvNvNtNtNtB8_6traits8iterator8Iterator4find5checkB10_NCNvMB12_B10_13from_segments0E0E0B12_)' 81844: ffffffff81efc7e0 524 FUNC LOCAL DEFAULT 1 _RNCINvNtNtNtCsf5tcb0XGUW4_4core4iter8adapters3map12map_try_foldjNtCsagR6JbSOIa9_12drm_panic_qr7VersionuINtNtNtBa_3ops12control_flow11ControlFlowB10_ENcB10_0NCINvNvNtNtNtB8_6traits8iterator8Iterator4find5checkB10_NCNvMB12_B10_13from_segments0E0E0B12_ 144259: ffffffff81efc7d0 16 FUNC LOCAL DEFAULT 1 __pfx__RNCINvNtNtNtCsf5tcb0XGUW4_4core4iter8adapters3map12map_try_foldjNtCsagR6JbSOIa9_12drm_panic_qr7VersionuINtNtNtBa_3ops12control_flow11ControlFlowB10_ENcB10_0NCINvNvNtNtNtB8_6traits8iterator8Iterator4find5checkB10_NCNvMB12_B10_13from_segments0E0E0B12_ # It is there. From the nm documentation we can see that: "U" The symbol is undefined. "u" The symbol is a unique global symbol. This is a GNU extension to the standard set of ELF symbol bindings. For such a symbol the dynamic linker will make sure that in the entire process there is just one symbol with this name and type in use. So lets consider 'u' symbols in /proc/kallsyms when loading it to cover this case. Fedora:40 shows this as a 'l' symbol, so consider that as well. With this patch 'perf test 1' is happy again: # perf test vmlinux 1: vmlinux symtab matches kallsyms : Ok # Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Ian Rogers <irogers@google.com> Cc: Masahiro Yamada <masahiroy@kernel.org> Cc: Miguel Ojeda <ojeda@kernel.org> Cc: James Clark <james.clark@linaro.org> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Kan Liang <kan.liang@linux.intel.com> Cc: Namhyung Kim <namhyung@kernel.org> Link: https://lore.kernel.org/r/aBE_n0PGl3g6h-cS@x1 Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
kdave
pushed a commit
that referenced
this pull request
Jun 4, 2025
Running the "perf script task-analyzer tests" with address sanitizer showed a double free: ``` FAIL: "test_csv_extended_times" Error message: "Failed to find required string:'Out-Out;'." ================================================================= ==19190==ERROR: AddressSanitizer: attempting double-free on 0x50b000017b10 in thread T0: #0 0x55da9601c78a in free (perf+0x26078a) (BuildId: e7ef50e08970f017a96fde6101c5e2491acc674a) #1 0x55da96640c63 in filename__read_build_id tools/perf/util/symbol-minimal.c:221:2 0x50b000017b10 is located 0 bytes inside of 112-byte region [0x50b000017b10,0x50b000017b80) freed by thread T0 here: #0 0x55da9601ce40 in realloc (perf+0x260e40) (BuildId: e7ef50e08970f017a96fde6101c5e2491acc674a) #1 0x55da96640ad6 in filename__read_build_id tools/perf/util/symbol-minimal.c:204:10 previously allocated by thread T0 here: #0 0x55da9601ca23 in malloc (perf+0x260a23) (BuildId: e7ef50e08970f017a96fde6101c5e2491acc674a) #1 0x55da966407e7 in filename__read_build_id tools/perf/util/symbol-minimal.c:181:9 SUMMARY: AddressSanitizer: double-free (perf+0x26078a) (BuildId: e7ef50e08970f017a96fde6101c5e2491acc674a) in free ==19190==ABORTING FAIL: "invocation of perf script report task-analyzer --csv-summary csvsummary --summary-extended command failed" Error message: "" FAIL: "test_csvsummary_extended" Error message: "Failed to find required string:'Out-Out;'." ---- end(-1) ---- 132: perf script task-analyzer tests : FAILED! ``` The buf_size if always set to phdr->p_filesz, but that may be 0 causing a free and realloc to return NULL. This is treated in filename__read_build_id like a failure and the buffer is freed again. To avoid this problem only grow buf, meaning the buf_size will never be 0. This also reduces the number of memory (re)allocations. Fixes: b691f64 ("perf symbols: Implement poor man's ELF parser") Signed-off-by: Ian Rogers <irogers@google.com> Acked-by: Namhyung Kim <namhyung@kernel.org> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Kan Liang <kan.liang@linux.intel.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Namhyung Kim <namhyung.kim@lge.com> Cc: Peter Zijlstra <peterz@infradead.org> Link: https://lore.kernel.org/r/20250501070003.22251-1-irogers@google.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
kdave
pushed a commit
that referenced
this pull request
Jun 4, 2025
Specify the threshold for dumping offcpu samples with --off-cpu-thresh, the unit is milliseconds. Default value is 500ms. Example: perf record --off-cpu --off-cpu-thresh 824 The example above collects direct off-cpu samples where the off-cpu time is longer than 824ms. Committer testing: After commenting out the end off-cpu dump to have just the ones that are added right after the task is scheduled back, and using a threshould of 1000ms, we see some periods (the 5th column, just before "offcpu-time" in the 'perf script' output) that are over 1000.000.000 nanoseconds: root@number:~# perf record --off-cpu --off-cpu-thresh 10000 ^C[ perf record: Woken up 1 times to write data ] [ perf record: Captured and wrote 3.902 MB perf.data (34335 samples) ] root@number:~# perf script <SNIP> Isolated Web Co 59932 [028] 63839.594437: 1000049427 offcpu-time: 7fe63c7976c2 __syscall_cancel_arch_end+0x0 (/usr/lib64/libc.so.6) 7fe63c78c04c __futex_abstimed_wait_common+0x7c (/usr/lib64/libc.so.6) 7fe63c78e928 pthread_cond_timedwait@@GLIBC_2.3.2+0x178 (/usr/lib64/libc.so.6) 5599974a9fe7 mozilla::detail::ConditionVariableImpl::wait_for(mozilla::detail::MutexImpl&, mozilla::BaseTimeDuration<mozilla::TimeDurationValueCalculator> const&)+0xe7 (/usr/lib64/fir> 100000000 [unknown] ([unknown]) swapper 0 [025] 63839.594459: 195724 cycles:P: ffffffffac328270 read_tsc+0x0 ([kernel.kallsyms]) Isolated Web Co 59932 [010] 63839.594466: 1000055278 offcpu-time: 7fe63c7976c2 __syscall_cancel_arch_end+0x0 (/usr/lib64/libc.so.6) 7fe63c78ba24 __syscall_cancel+0x14 (/usr/lib64/libc.so.6) 7fe63c804c4e __poll+0x1e (/usr/lib64/libc.so.6) 7fe633b0d1b8 PollWrapper(_GPollFD*, unsigned int, int) [clone .lto_priv.0]+0xf8 (/usr/lib64/firefox/libxul.so) 10000002c [unknown] ([unknown]) swapper 0 [027] 63839.594475: 134433 cycles:P: ffffffffad4c45d9 irqentry_enter+0x19 ([kernel.kallsyms]) swapper 0 [028] 63839.594499: 215838 cycles:P: ffffffffac39199a switch_mm_irqs_off+0x10a ([kernel.kallsyms]) MediaPD~oder #1 1407676 [027] 63839.594514: 134433 cycles:P: 7f982ef5e69f dct_IV(int*, int, int*)+0x24f (/usr/lib64/libfdk-aac.so.2.0.0) swapper 0 [024] 63839.594524: 267411 cycles:P: ffffffffad4c6ee6 poll_idle+0x56 ([kernel.kallsyms]) MediaSu~sor torvalds#75 1093827 [026] 63839.594555: 332652 cycles:P: 55be753ad030 moz_xmalloc+0x200 (/usr/lib64/firefox/firefox) swapper 0 [027] 63839.594616: 160548 cycles:P: ffffffffad144840 menu_select+0x570 ([kernel.kallsyms]) Isolated Web Co 14019 [027] 63839.595120: 1000050178 offcpu-time: 7fc9537cc6c2 __syscall_cancel_arch_end+0x0 (/usr/lib64/libc.so.6) 7fc9537c104c __futex_abstimed_wait_common+0x7c (/usr/lib64/libc.so.6) 7fc9537c3928 pthread_cond_timedwait@@GLIBC_2.3.2+0x178 (/usr/lib64/libc.so.6) 7fc95372a3c8 pt_TimedWait+0xb8 (/usr/lib64/libnspr4.so) 7fc95372a8d8 PR_WaitCondVar+0x68 (/usr/lib64/libnspr4.so) 7fc94afb1f7c WatchdogMain(void*)+0xac (/usr/lib64/firefox/libxul.so) 7fc947498660 [unknown] ([unknown]) 7fc9535fce88 [unknown] ([unknown]) 7fc94b620e60 WatchdogManager::~WatchdogManager()+0x0 (/usr/lib64/firefox/libxul.so) fff8548387f8b48 [unknown] ([unknown]) swapper 0 [003] 63839.595712: 212948 cycles:P: ffffffffacd5b865 acpi_os_read_port+0x55 ([kernel.kallsyms]) <SNIP> Suggested-by: Arnaldo Carvalho de Melo <acme@redhat.com> Suggested-by: Ian Rogers <irogers@google.com> Suggested-by: Namhyung Kim <namhyung@kernel.org> Reviewed-by: Ian Rogers <irogers@google.com> Signed-off-by: Howard Chu <howardchu95@gmail.com> Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com> Tested-by: Gautam Menghani <gautam@linux.ibm.com> Tested-by: Ian Rogers <irogers@google.com> Acked-by: Namhyung Kim <namhyung@kernel.org> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: James Clark <james.clark@linaro.org> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Kan Liang <kan.liang@linux.intel.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Peter Zijlstra <peterz@infradead.org> Link: https://lore.kernel.org/r/20241108204137.2444151-2-howardchu95@gmail.com Link: https://lore.kernel.org/r/20250501022809.449767-10-howardchu95@gmail.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
kdave
pushed a commit
that referenced
this pull request
Jun 4, 2025
The same buf is used for the program headers and reading notes. As the notes memory may be reallocated then this corrupts the memory pointed to by the phdr. Using the same buffer is in any case a logic error. Rather than deal with the duplicated code, introduce an elf32 boolean and a union for either the elf32 or elf64 headers that are in use. Let the program headers have their own memory and grow the buffer for notes as necessary. Before `perf list -j` compiled with asan would crash with: ``` ==4176189==ERROR: AddressSanitizer: heap-use-after-free on address 0x5160000070b8 at pc 0x555d3b15075b bp 0x7ffebb5a8090 sp 0x7ffebb5a8088 READ of size 8 at 0x5160000070b8 thread T0 #0 0x555d3b15075a in filename__read_build_id tools/perf/util/symbol-minimal.c:212:25 #1 0x555d3ae43aff in filename__sprintf_build_id tools/perf/util/build-id.c:110:8 ... 0x5160000070b8 is located 312 bytes inside of 560-byte region [0x516000006f80,0x5160000071b0) freed by thread T0 here: #0 0x555d3ab21840 in realloc (perf+0x264840) (BuildId: 12dff2f6629f738e5012abdf0e90055518e70b5e) #1 0x555d3b1506e7 in filename__read_build_id tools/perf/util/symbol-minimal.c:206:11 ... previously allocated by thread T0 here: #0 0x555d3ab21423 in malloc (perf+0x264423) (BuildId: 12dff2f6629f738e5012abdf0e90055518e70b5e) #1 0x555d3b1503a2 in filename__read_build_id tools/perf/util/symbol-minimal.c:182:9 ... ``` Note: this bug is long standing and not introduced by the other asan fix in commit fa9c497 ("perf symbol-minimal: Fix double free in filename__read_build_id"). Fixes: b691f64 ("perf symbols: Implement poor man's ELF parser") Signed-off-by: Ian Rogers <irogers@google.com> Link: https://lore.kernel.org/r/20250528032637.198960-2-irogers@google.com Cc: Mark Rutland <mark.rutland@arm.com> Cc: Gary Guo <gary@garyguo.net> Cc: Alex Gaynor <alex.gaynor@gmail.com> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Howard Chu <howardchu95@gmail.com> Cc: Alice Ryhl <aliceryhl@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Weilin Wang <weilin.wang@intel.com> Cc: Andreas Hindborg <a.hindborg@kernel.org> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Danilo Krummrich <dakr@kernel.org> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Miguel Ojeda <ojeda@kernel.org> Cc: James Clark <james.clark@linaro.org> Cc: Jiapeng Chong <jiapeng.chong@linux.alibaba.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Kan Liang <kan.liang@linux.intel.com> Cc: Stephen Brennan <stephen.s.brennan@oracle.com> Cc: Benno Lossin <benno.lossin@proton.me> Cc: Björn Roy Baron <bjorn3_gh@protonmail.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Trevor Gross <tmgross@umich.edu> Cc: linux-kernel@vger.kernel.org Cc: linux-perf-users@vger.kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
kdave
pushed a commit
that referenced
this pull request
Jun 4, 2025
The following issue happens with a buggy module: BUG: unable to handle page fault for address: ffffffffc05d0218 PGD 1bd66f067 P4D 1bd66f067 PUD 1bd671067 PMD 101808067 PTE 0 Oops: Oops: 0000 [#1] SMP KASAN PTI Tainted: [O]=OOT_MODULE, [E]=UNSIGNED_MODULE Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS RIP: 0010:sized_strscpy+0x81/0x2f0 RSP: 0018:ffff88812d76fa08 EFLAGS: 00010246 RAX: 0000000000000000 RBX: ffffffffc0601010 RCX: dffffc0000000000 RDX: 0000000000000038 RSI: dffffc0000000000 RDI: ffff88812608da2d RBP: 8080808080808080 R08: ffff88812608da2d R09: ffff88812608da68 R10: ffff88812608d82d R11: ffff88812608d810 R12: 0000000000000038 R13: ffff88812608da2d R14: ffffffffc05d0218 R15: fefefefefefefeff FS: 00007fef552de740(0000) GS:ffff8884251c7000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: ffffffffc05d0218 CR3: 00000001146f0000 CR4: 00000000000006f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <TASK> ftrace_mod_get_kallsym+0x1ac/0x590 update_iter_mod+0x239/0x5b0 s_next+0x5b/0xa0 seq_read_iter+0x8c9/0x1070 seq_read+0x249/0x3b0 proc_reg_read+0x1b0/0x280 vfs_read+0x17f/0x920 ksys_read+0xf3/0x1c0 do_syscall_64+0x5f/0x2e0 entry_SYSCALL_64_after_hwframe+0x76/0x7e The above issue may happen as follows: (1) Add kprobe tracepoint; (2) insmod test.ko; (3) Module triggers ftrace disabled; (4) rmmod test.ko; (5) cat /proc/kallsyms; --> Will trigger UAF as test.ko already removed; ftrace_mod_get_kallsym() ... strscpy(module_name, mod_map->mod->name, MODULE_NAME_LEN); ... The problem is when a module triggers an issue with ftrace and sets ftrace_disable. The ftrace_disable is set when an anomaly is discovered and to prevent any more damage, ftrace stops all text modification. The issue that happened was that the ftrace_disable stops more than just the text modification. When a module is loaded, its init functions can also be traced. Because kallsyms deletes the init functions after a module has loaded, ftrace saves them when the module is loaded and function tracing is enabled. This allows the output of the function trace to show the init function names instead of just their raw memory addresses. When a module is removed, ftrace_release_mod() is called, and if ftrace_disable is set, it just returns without doing anything more. The problem here is that it leaves the mod_list still around and if kallsyms is called, it will call into this code and access the module memory that has already been freed as it will return: strscpy(module_name, mod_map->mod->name, MODULE_NAME_LEN); Where the "mod" no longer exists and triggers a UAF bug. Link: https://lore.kernel.org/all/20250523135452.626d8dcd@gandalf.local.home/ Cc: stable@vger.kernel.org Fixes: aba4b5c ("ftrace: Save module init functions kallsyms symbols for tracing") Link: https://lore.kernel.org/20250529111955.2349189-2-yebin@huaweicloud.com Signed-off-by: Ye Bin <yebin10@huawei.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
kdave
pushed a commit
that referenced
this pull request
Jun 5, 2025
Add a compile-time check that `*$ptr` is of the type of `$type->$($f)*`. Rename those placeholders for clarity. Given the incorrect usage: > diff --git a/rust/kernel/rbtree.rs b/rust/kernel/rbtree.rs > index 8d978c8..6a7089149878 100644 > --- a/rust/kernel/rbtree.rs > +++ b/rust/kernel/rbtree.rs > @@ -329,7 +329,7 @@ fn raw_entry(&mut self, key: &K) -> RawEntry<'_, K, V> { > while !(*child_field_of_parent).is_null() { > let curr = *child_field_of_parent; > // SAFETY: All links fields we create are in a `Node<K, V>`. > - let node = unsafe { container_of!(curr, Node<K, V>, links) }; > + let node = unsafe { container_of!(curr, Node<K, V>, key) }; > > // SAFETY: `node` is a non-null node so it is valid by the type invariants. > match key.cmp(unsafe { &(*node).key }) { this patch produces the compilation error: > error[E0308]: mismatched types > --> rust/kernel/lib.rs:220:45 > | > 220 | $crate::assert_same_type(field_ptr, (&raw const (*container_ptr).$($fields)*).cast_mut()); > | ------------------------ --------- ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ expected `*mut rb_node`, found `*mut K` > | | | > | | expected all arguments to be this `*mut bindings::rb_node` type because they need to match the type of this parameter > | arguments to this function are incorrect > | > ::: rust/kernel/rbtree.rs:270:6 > | > 270 | impl<K, V> RBTree<K, V> > | - found this type parameter > ... > 332 | let node = unsafe { container_of!(curr, Node<K, V>, key) }; > | ------------------------------------ in this macro invocation > | > = note: expected raw pointer `*mut bindings::rb_node` > found raw pointer `*mut K` > note: function defined here > --> rust/kernel/lib.rs:227:8 > | > 227 | pub fn assert_same_type<T>(_: T, _: T) {} > | ^^^^^^^^^^^^^^^^ - ---- ---- this parameter needs to match the `*mut bindings::rb_node` type of parameter #1 > | | | > | | parameter #2 needs to match the `*mut bindings::rb_node` type of this parameter > | parameter #1 and parameter #2 both reference this parameter `T` > = note: this error originates in the macro `container_of` (in Nightly builds, run with -Z macro-backtrace for more info) [ We decided to go with a variation of v1 [1] that became v4, since it seems like the obvious approach, the error messages seem good enough and the debug performance should be fine, given the kernel is always built with -O2. In the future, we may want to make the helper non-hidden, with proper documentation, for others to use. [1] https://lore.kernel.org/rust-for-linux/CANiq72kQWNfSV0KK6qs6oJt+aGdgY=hXg=wJcmK3zYcokY1LNw@mail.gmail.com/ - Miguel ] Suggested-by: Alice Ryhl <aliceryhl@google.com> Link: https://lore.kernel.org/all/CAH5fLgh6gmqGBhPMi2SKn7mCmMWfOSiS0WP5wBuGPYh9ZTAiww@mail.gmail.com/ Signed-off-by: Tamir Duberstein <tamird@gmail.com> Reviewed-by: Benno Lossin <lossin@kernel.org> Link: https://lore.kernel.org/r/20250529-b4-container-of-type-check-v4-1-bf3a7ad73cec@gmail.com [ Added intra-doc link. - Miguel ] Signed-off-by: Miguel Ojeda <ojeda@kernel.org>
kdave
pushed a commit
that referenced
this pull request
Jun 5, 2025
When the XDP program is loaded, the XDP callback adds new Tx queues. This means that the callback must update the Tx scheduler with the new queue number. In the event of a Tx scheduler failure, the XDP callback should also fail and roll back any changes previously made for XDP preparation. The previous implementation had a bug that not all changes made by the XDP callback were rolled back. This caused the crash with the following call trace: [ +9.549584] ice 0000:ca:00.0: Failed VSI LAN queue config for XDP, error: -5 [ +0.382335] Oops: general protection fault, probably for non-canonical address 0x50a2250a90495525: 0000 [#1] SMP NOPTI [ +0.010710] CPU: 103 UID: 0 PID: 0 Comm: swapper/103 Not tainted 6.14.0-net-next-mar-31+ torvalds#14 PREEMPT(voluntary) [ +0.010175] Hardware name: Intel Corporation M50CYP2SBSTD/M50CYP2SBSTD, BIOS SE5C620.86B.01.01.0005.2202160810 02/16/2022 [ +0.010946] RIP: 0010:__ice_update_sample+0x39/0xe0 [ice] [...] [ +0.002715] Call Trace: [ +0.002452] <IRQ> [ +0.002021] ? __die_body.cold+0x19/0x29 [ +0.003922] ? die_addr+0x3c/0x60 [ +0.003319] ? exc_general_protection+0x17c/0x400 [ +0.004707] ? asm_exc_general_protection+0x26/0x30 [ +0.004879] ? __ice_update_sample+0x39/0xe0 [ice] [ +0.004835] ice_napi_poll+0x665/0x680 [ice] [ +0.004320] __napi_poll+0x28/0x190 [ +0.003500] net_rx_action+0x198/0x360 [ +0.003752] ? update_rq_clock+0x39/0x220 [ +0.004013] handle_softirqs+0xf1/0x340 [ +0.003840] ? sched_clock_cpu+0xf/0x1f0 [ +0.003925] __irq_exit_rcu+0xc2/0xe0 [ +0.003665] common_interrupt+0x85/0xa0 [ +0.003839] </IRQ> [ +0.002098] <TASK> [ +0.002106] asm_common_interrupt+0x26/0x40 [ +0.004184] RIP: 0010:cpuidle_enter_state+0xd3/0x690 Fix this by performing the missing unmapping of XDP queues from q_vectors and setting the XDP rings pointer back to NULL after all those queues are released. Also, add an immediate exit from the XDP callback in case of ring preparation failure. Fixes: efc2214 ("ice: Add support for XDP") Reviewed-by: Dawid Osuchowski <dawid.osuchowski@linux.intel.com> Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Signed-off-by: Michal Kubiak <michal.kubiak@intel.com> Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Tested-by: Jesse Brandeburg <jbrandeburg@cloudflare.com> Tested-by: Saritha Sanigani <sarithax.sanigani@intel.com> (A Contingent Worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
kdave
pushed a commit
that referenced
this pull request
Jun 5, 2025
Commit a1e40ac ("net: gso: fix udp gso fraglist segmentation after pull from frag_list") detected invalid geometry in frag_list skbs and redirects them from skb_segment_list to more robust skb_segment. But some packets with modified geometry can also hit bugs in that code. We don't know how many such cases exist. Addressing each one by one also requires touching the complex skb_segment code, which risks introducing bugs for other types of skbs. Instead, linearize all these packets that fail the basic invariants on gso fraglist skbs. That is more robust. If only part of the fraglist payload is pulled into head_skb, it will always cause exception when splitting skbs by skb_segment. For detailed call stack information, see below. Valid SKB_GSO_FRAGLIST skbs - consist of two or more segments - the head_skb holds the protocol headers plus first gso_size - one or more frag_list skbs hold exactly one segment - all but the last must be gso_size Optional datapath hooks such as NAT and BPF (bpf_skb_pull_data) can modify fraglist skbs, breaking these invariants. In extreme cases they pull one part of data into skb linear. For UDP, this causes three payloads with lengths of (11,11,10) bytes were pulled tail to become (12,10,10) bytes. The skbs no longer meets the above SKB_GSO_FRAGLIST conditions because payload was pulled into head_skb, it needs to be linearized before pass to regular skb_segment. skb_segment+0xcd0/0xd14 __udp_gso_segment+0x334/0x5f4 udp4_ufo_fragment+0x118/0x15c inet_gso_segment+0x164/0x338 skb_mac_gso_segment+0xc4/0x13c __skb_gso_segment+0xc4/0x124 validate_xmit_skb+0x9c/0x2c0 validate_xmit_skb_list+0x4c/0x80 sch_direct_xmit+0x70/0x404 __dev_queue_xmit+0x64c/0xe5c neigh_resolve_output+0x178/0x1c4 ip_finish_output2+0x37c/0x47c __ip_finish_output+0x194/0x240 ip_finish_output+0x20/0xf4 ip_output+0x100/0x1a0 NF_HOOK+0xc4/0x16c ip_forward+0x314/0x32c ip_rcv+0x90/0x118 __netif_receive_skb+0x74/0x124 process_backlog+0xe8/0x1a4 __napi_poll+0x5c/0x1f8 net_rx_action+0x154/0x314 handle_softirqs+0x154/0x4b8 [118.376811] [C201134] rxq0_pus: [name:bug&]kernel BUG at net/core/skbuff.c:4278! [118.376829] [C201134] rxq0_pus: [name:traps&]Internal error: Oops - BUG: 00000000f2000800 [#1] PREEMPT SMP [118.470774] [C201134] rxq0_pus: [name:mrdump&]Kernel Offset: 0x178cc00000 from 0xffffffc008000000 [118.470810] [C201134] rxq0_pus: [name:mrdump&]PHYS_OFFSET: 0x40000000 [118.470827] [C201134] rxq0_pus: [name:mrdump&]pstate: 60400005 (nZCv daif +PAN -UAO) [118.470848] [C201134] rxq0_pus: [name:mrdump&]pc : [0xffffffd79598aefc] skb_segment+0xcd0/0xd14 [118.470900] [C201134] rxq0_pus: [name:mrdump&]lr : [0xffffffd79598a5e8] skb_segment+0x3bc/0xd14 [118.470928] [C201134] rxq0_pus: [name:mrdump&]sp : ffffffc008013770 Fixes: a1e40ac ("gso: fix udp gso fraglist segmentation after pull from frag_list") Signed-off-by: Shiming Cheng <shiming.cheng@mediatek.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
kdave
pushed a commit
that referenced
this pull request
Jun 5, 2025
When driver handles the napi rx polling requests, the netdev might have been released by the dellink logic triggered by the disconnect operation on user plane. However, in the logic of processing skb in polling, an invalid netdev is still being used, which causes a panic. BUG: kernel NULL pointer dereference, address: 00000000000000f1 Oops: 0000 [#1] PREEMPT SMP NOPTI RIP: 0010:dev_gro_receive+0x3a/0x620 [...] Call Trace: <IRQ> ? __die_body+0x68/0xb0 ? page_fault_oops+0x379/0x3e0 ? exc_page_fault+0x4f/0xa0 ? asm_exc_page_fault+0x22/0x30 ? __pfx_t7xx_ccmni_recv_skb+0x10/0x10 [mtk_t7xx (HASH:1400 7)] ? dev_gro_receive+0x3a/0x620 napi_gro_receive+0xad/0x170 t7xx_ccmni_recv_skb+0x48/0x70 [mtk_t7xx (HASH:1400 7)] t7xx_dpmaif_napi_rx_poll+0x590/0x800 [mtk_t7xx (HASH:1400 7)] net_rx_action+0x103/0x470 irq_exit_rcu+0x13a/0x310 sysvec_apic_timer_interrupt+0x56/0x90 </IRQ> Fixes: 5545b7b ("net: wwan: t7xx: Add NAPI support") Signed-off-by: Jinjian Song <jinjian.song@fibocom.com> Link: https://patch.msgid.link/20250530031648.5592-1-jinjian.song@fibocom.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
kdave
pushed a commit
that referenced
this pull request
Jun 5, 2025
Removing a peer while userspace attempts to close its transport socket triggers a race condition resulting in the following crash: Oops: general protection fault, probably for non-canonical address 0xdffffc0000000077: 0000 [#1] SMP KASAN KASAN: null-ptr-deref in range [0x00000000000003b8-0x00000000000003bf] CPU: 12 UID: 0 PID: 162 Comm: kworker/12:1 Tainted: G O 6.15.0-rc2-00635-g521139ac3840 torvalds#272 PREEMPT(full) Tainted: [O]=OOT_MODULE Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-20240910_120124-localhost 04/01/2014 Workqueue: events ovpn_peer_keepalive_work [ovpn] RIP: 0010:ovpn_socket_release+0x23c/0x500 [ovpn] Code: ea 03 80 3c 02 00 0f 85 71 02 00 00 48 b8 00 00 00 00 00 fc ff df 4d 8b 64 24 18 49 8d bc 24 be 03 00 00 48 89 fa 48 c1 ea 03 <0f> b6 14 02 48 89 f8 83 e0 07 83 c0 01 38 d0 7c 08 84 d2 0f 85 30 RSP: 0018:ffffc90000c9fb18 EFLAGS: 00010217 RAX: dffffc0000000000 RBX: ffff8881148d7940 RCX: ffffffff817787bb RDX: 0000000000000077 RSI: 0000000000000008 RDI: 00000000000003be RBP: ffffc90000c9fb30 R08: 0000000000000000 R09: fffffbfff0d3e840 R10: ffffffff869f4207 R11: 0000000000000000 R12: 0000000000000000 R13: ffff888115eb9300 R14: ffffc90000c9fbc8 R15: 000000000000000c FS: 0000000000000000(0000) GS:ffff8882b0151000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f37266b6114 CR3: 00000000054a8000 CR4: 0000000000750ef0 PKRU: 55555554 Call Trace: <TASK> unlock_ovpn+0x8b/0xe0 [ovpn] ovpn_peer_keepalive_work+0xe3/0x540 [ovpn] ? ovpn_peers_free+0x780/0x780 [ovpn] ? lock_acquire+0x56/0x70 ? process_one_work+0x888/0x1740 process_one_work+0x933/0x1740 ? pwq_dec_nr_in_flight+0x10b0/0x10b0 ? move_linked_works+0x12d/0x2c0 ? assign_work+0x163/0x270 worker_thread+0x4d6/0xd90 ? preempt_count_sub+0x4c/0x70 ? process_one_work+0x1740/0x1740 kthread+0x36c/0x710 ? trace_preempt_on+0x8c/0x1e0 ? kthread_is_per_cpu+0xc0/0xc0 ? preempt_count_sub+0x4c/0x70 ? _raw_spin_unlock_irq+0x36/0x60 ? calculate_sigpending+0x7b/0xa0 ? kthread_is_per_cpu+0xc0/0xc0 ret_from_fork+0x3a/0x80 ? kthread_is_per_cpu+0xc0/0xc0 ret_from_fork_asm+0x11/0x20 </TASK> Modules linked in: ovpn(O) This happens because the peer deletion operation reaches ovpn_socket_release() while ovpn_sock->sock (struct socket *) and its sk member (struct sock *) are still both valid. Here synchronize_rcu() is invoked, after which ovpn_sock->sock->sk becomes NULL, due to the concurrent socket closing triggered from userspace. After having invoked synchronize_rcu(), ovpn_socket_release() will attempt dereferencing ovpn_sock->sock->sk, triggering the crash reported above. The reason for accessing sk is that we need to retrieve its protocol and continue the cleanup routine accordingly. This crash can be easily produced by running openvpn userspace in client mode with `--keepalive 10 20`, while entirely omitting this option on the server side. After 20 seconds ovpn will assume the peer (server) to be dead, will start removing it and will notify userspace. The latter will receive the notification and close the transport socket, thus triggering the crash. To fix the race condition for good, we need to refactor struct ovpn_socket. Since ovpn is always only interested in the sock->sk member (struct sock *) we can directly hold a reference to it, raher than accessing it via its struct socket container. This means changing "struct socket *ovpn_socket->sock" to "struct sock *ovpn_socket->sk". While acquiring a reference to sk, we can increase its refcounter without affecting the socket close()/destroy() notification (which we rely on when userspace closes a socket we are using). By increasing sk's refcounter we know we can dereference it in ovpn_socket_release() without incurring in any race condition anymore. ovpn_socket_release() will ultimately decrease the reference counter. Cc: Oleksandr Natalenko <oleksandr@natalenko.name> Fixes: 11851cb ("ovpn: implement TCP transport") Reported-by: Qingfang Deng <dqfext@gmail.com> Closes: OpenVPN#1 Tested-by: Gert Doering <gert@greenie.muc.de> Link: https://www.mail-archive.com/openvpn-devel@lists.sourceforge.net/msg31575.html Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Signed-off-by: Antonio Quartulli <antonio@openvpn.net>
kdave
pushed a commit
that referenced
this pull request
Jun 5, 2025
According to OpenMDK, bit 2 of the RGMII register has a different meaning for BCM53115 [1]: "DLL_IQQD 1: In the IDDQ mode, power is down0: Normal function mode" Configuring RGMII delay works without setting this bit, so let's keep it at the default. For other chips, we always set it, so not clearing it is not an issue. One would assume BCM53118 works the same, but OpenMDK is not quite sure what this bit actually means [2]: "BYPASS_IMP_2NS_DEL #1: In the IDDQ mode, power is down#0: Normal function mode1: Bypass dll65_2ns_del IP0: Use dll65_2ns_del IP" So lets keep setting it for now. [1] https://github.com/Broadcom-Network-Switching-Software/OpenMDK/blob/master/cdk/PKG/chip/bcm53115/bcm53115_a0_defs.h#L19871 [2] https://github.com/Broadcom-Network-Switching-Software/OpenMDK/blob/master/cdk/PKG/chip/bcm53118/bcm53118_a0_defs.h#L14392 Fixes: 967dd82 ("net: dsa: b53: Add support for Broadcom RoboSwitch") Signed-off-by: Jonas Gorski <jonas.gorski@gmail.com> Link: https://patch.msgid.link/20250602193953.1010487-6-jonas.gorski@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
kdave
pushed a commit
that referenced
this pull request
Jun 6, 2025
When building the free space tree with the block group tree feature enabled, we can hit an assertion failure like this: BTRFS info (device loop0 state M): rebuilding free space tree assertion failed: ret == 0, in fs/btrfs/free-space-tree.c:1102 ------------[ cut here ]------------ kernel BUG at fs/btrfs/free-space-tree.c:1102! Internal error: Oops - BUG: 00000000f2000800 [#1] SMP Modules linked in: CPU: 1 UID: 0 PID: 6592 Comm: syz-executor322 Not tainted 6.15.0-rc7-syzkaller-gd7fa1af5b33e #0 PREEMPT Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 05/07/2025 pstate: 60400005 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--) pc : populate_free_space_tree+0x514/0x518 fs/btrfs/free-space-tree.c:1102 lr : populate_free_space_tree+0x514/0x518 fs/btrfs/free-space-tree.c:1102 sp : ffff8000a4ce7600 x29: ffff8000a4ce76e0 x28: ffff0000c9bc6000 x27: ffff0000ddfff3d8 x26: ffff0000ddfff378 x25: dfff800000000000 x24: 0000000000000001 x23: ffff8000a4ce7660 x22: ffff70001499cecc x21: ffff0000e1d8c160 x20: ffff0000e1cb7800 x19: ffff0000e1d8c0b0 x18: 00000000ffffffff x17: ffff800092f39000 x16: ffff80008ad27e48 x15: ffff700011e740c0 x14: 1ffff00011e740c0 x13: 0000000000000004 x12: ffffffffffffffff x11: ffff700011e740c0 x10: 0000000000ff0100 x9 : 94ef24f55d2dbc00 x8 : 94ef24f55d2dbc00 x7 : 0000000000000001 x6 : 0000000000000001 x5 : ffff8000a4ce6f98 x4 : ffff80008f415ba0 x3 : ffff800080548ef0 x2 : 0000000000000000 x1 : 0000000100000000 x0 : 000000000000003e Call trace: populate_free_space_tree+0x514/0x518 fs/btrfs/free-space-tree.c:1102 (P) btrfs_rebuild_free_space_tree+0x14c/0x54c fs/btrfs/free-space-tree.c:1337 btrfs_start_pre_rw_mount+0xa78/0xe10 fs/btrfs/disk-io.c:3074 btrfs_remount_rw fs/btrfs/super.c:1319 [inline] btrfs_reconfigure+0x828/0x2418 fs/btrfs/super.c:1543 reconfigure_super+0x1d4/0x6f0 fs/super.c:1083 do_remount fs/namespace.c:3365 [inline] path_mount+0xb34/0xde0 fs/namespace.c:4200 do_mount fs/namespace.c:4221 [inline] __do_sys_mount fs/namespace.c:4432 [inline] __se_sys_mount fs/namespace.c:4409 [inline] __arm64_sys_mount+0x3e8/0x468 fs/namespace.c:4409 __invoke_syscall arch/arm64/kernel/syscall.c:35 [inline] invoke_syscall+0x98/0x2b8 arch/arm64/kernel/syscall.c:49 el0_svc_common+0x130/0x23c arch/arm64/kernel/syscall.c:132 do_el0_svc+0x48/0x58 arch/arm64/kernel/syscall.c:151 el0_svc+0x58/0x17c arch/arm64/kernel/entry-common.c:767 el0t_64_sync_handler+0x78/0x108 arch/arm64/kernel/entry-common.c:786 el0t_64_sync+0x198/0x19c arch/arm64/kernel/entry.S:600 Code: f0047182 91178042 528089c3 9771d47 (d4210000) ---[ end trace 0000000000000000 ]--- This happens because we are processing an empty block group, which has no extents allocated from it, there are no items for this block group, including the block group item since block group items are stored in a dedicated tree when using the block group tree feature. It also means this is the block group with the highest start offset, so there are no higher keys in the extent root, hence btrfs_search_slot_for_read() returns 1 (no higher key found). Fix this by asserting 'ret' is 0 only if the block group tree feature is not enabled, in which case we should find a block group item for the block group since it's stored in the extent root and block group item keys are greater than extent item keys (the value for BTRFS_BLOCK_GROUP_ITEM_KEY is 192 and for BTRFS_EXTENT_ITEM_KEY and BTRFS_METADATA_ITEM_KEY the values are 168 and 169 respectively). In case 'ret' is 1, we just need to add a record to the free space tree which spans the whole block group, and we can achieve this by making 'ret == 0' as the while loop's condition. Reported-by: syzbot+36fae25c35159a763a2a@syzkaller.appspotmail.com Link: https://lore.kernel.org/linux-btrfs/6841dca8.a00a0220.d4325.0020.GAE@google.com/ Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
kdave
pushed a commit
that referenced
this pull request
Jun 9, 2025
When threads/tasks are switched we need to ensure the old execution's SR_SUM state is saved and the new thread has the old SR_SUM state restored. The issue was seen under heavy load especially with the syz-stress tool running, with crashes as follows in schedule_tail: Unable to handle kernel access to user memory without uaccess routines at virtual address 000000002749f0d0 Oops [#1] Modules linked in: CPU: 1 PID: 4875 Comm: syz-executor.0 Not tainted 5.12.0-rc2-syzkaller-00467-g0d7588ab9ef9 #0 Hardware name: riscv-virtio,qemu (DT) epc : schedule_tail+0x72/0xb2 kernel/sched/core.c:4264 ra : task_pid_vnr include/linux/sched.h:1421 [inline] ra : schedule_tail+0x70/0xb2 kernel/sched/core.c:4264 epc : ffffffe00008c8b0 ra : ffffffe00008c8ae sp : ffffffe025d17ec0 gp : ffffffe005d25378 tp : ffffffe00f0d0000 t0 : 0000000000000000 t1 : 0000000000000001 t2 : 00000000000f4240 s0 : ffffffe025d17ee0 s1 : 000000002749f0d0 a0 : 000000000000002a a1 : 0000000000000003 a2 : 1ffffffc0cfac500 a3 : ffffffe0000c80cc a4 : 5ae9db91c19bbe00 a5 : 0000000000000000 a6 : 0000000000f00000 a7 : ffffffe000082eba s2 : 0000000000040000 s3 : ffffffe00eef96c0 s4 : ffffffe022c77fe0 s5 : 0000000000004000 s6 : ffffffe067d74e00 s7 : ffffffe067d74850 s8 : ffffffe067d73e18 s9 : ffffffe067d74e00 s10: ffffffe00eef96e8 s11: 000000ae6cdf8368 t3 : 5ae9db91c19bbe00 t4 : ffffffc4043cafb2 t5 : ffffffc4043cafba t6 : 0000000000040000 status: 0000000000000120 badaddr: 000000002749f0d0 cause: 000000000000000f Call Trace: [<ffffffe00008c8b0>] schedule_tail+0x72/0xb2 kernel/sched/core.c:4264 [<ffffffe000005570>] ret_from_exception+0x0/0x14 Dumping ftrace buffer: (ftrace buffer empty) ---[ end trace b5f8f9231dc87dda ]--- The issue comes from the put_user() in schedule_tail (kernel/sched/core.c) doing the following: asmlinkage __visible void schedule_tail(struct task_struct *prev) { ... if (current->set_child_tid) put_user(task_pid_vnr(current), current->set_child_tid); ... } the put_user() macro causes the code sequence to come out as follows: 1: __enable_user_access() 2: reg = task_pid_vnr(current); 3: *current->set_child_tid = reg; 4: __disable_user_access() The problem is that we may have a sleeping function as argument which could clear SR_SUM causing the panic above. This was fixed by evaluating the argument of the put_user() macro outside the user-enabled section in commit 285a76b ("riscv: evaluate put_user() arg before enabling user access")" In order for riscv to take advantage of unsafe_get/put_XXX() macros and to avoid the same issue we had with put_user() and sleeping functions we must ensure code flow can go through switch_to() from within a region of code with SR_SUM enabled and come back with SR_SUM still enabled. This patch addresses the problem allowing future work to enable full use of unsafe_get/put_XXX() macros without needing to take a CSR bit flip cost on every access. Make switch_to() save and restore SR_SUM. Reported-by: syzbot+e74b94fe601ab9552d69@syzkaller.appspotmail.com Signed-off-by: Ben Dooks <ben.dooks@codethink.co.uk> Signed-off-by: Cyril Bur <cyrilbur@tenstorrent.com> Reviewed-by: Alexandre Ghiti <alexghiti@rivosinc.com> Reviewed-by: Deepak Gupta <debug@rivosinc.com> Link: https://lore.kernel.org/r/20250410070526.3160847-2-cyrilbur@tenstorrent.com Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
kdave
pushed a commit
that referenced
this pull request
Jun 9, 2025
No device was set which caused serial_base_ctrl_add to crash. BUG: kernel NULL pointer dereference, address: 0000000000000050 Oops: Oops: 0000 [#1] PREEMPT SMP NOPTI CPU: 16 UID: 0 PID: 368 Comm: (udev-worker) Not tainted 6.12.25-amd64 #1 Debian 6.12.25-1 RIP: 0010:serial_base_ctrl_add+0x96/0x120 Call Trace: <TASK> serial_core_register_port+0x1a0/0x580 ? __setup_irq+0x39c/0x660 ? __kmalloc_cache_noprof+0x111/0x310 jsm_uart_port_init+0xe8/0x180 [jsm] jsm_probe_one+0x1f4/0x410 [jsm] local_pci_probe+0x42/0x90 pci_device_probe+0x22f/0x270 really_probe+0xdb/0x340 ? pm_runtime_barrier+0x54/0x90 ? __pfx___driver_attach+0x10/0x10 __driver_probe_device+0x78/0x110 driver_probe_device+0x1f/0xa0 __driver_attach+0xba/0x1c0 bus_for_each_dev+0x8c/0xe0 bus_add_driver+0x112/0x1f0 driver_register+0x72/0xd0 jsm_init_module+0x36/0xff0 [jsm] ? __pfx_jsm_init_module+0x10/0x10 [jsm] do_one_initcall+0x58/0x310 do_init_module+0x60/0x230 Tested with Digi Neo PCIe 8 port card. Fixes: 84a9582 ("serial: core: Start managing serial controllers to enable runtime PM") Cc: stable <stable@kernel.org> Signed-off-by: Dustin Lundquist <dustin@null-ptr.net> Link: https://lore.kernel.org/r/3f31d4f75863614655c4673027a208be78d022ec.camel@null-ptr.net Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
kdave
pushed a commit
that referenced
this pull request
Jun 9, 2025
Syzkaller detected a kernel bug in jffs2_link_node_ref, caused by fault injection in jffs2_prealloc_raw_node_refs. jffs2_sum_write_sumnode doesn't check return value of jffs2_prealloc_raw_node_refs and simply lets any error propagate into jffs2_sum_write_data, which eventually calls jffs2_link_node_ref in order to link the summary to an expectedly allocated node. kernel BUG at fs/jffs2/nodelist.c:592! invalid opcode: 0000 [#1] PREEMPT SMP KASAN NOPTI CPU: 1 PID: 31277 Comm: syz-executor.7 Not tainted 6.1.128-syzkaller-00139-ge10f83ca10a1 #0 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014 RIP: 0010:jffs2_link_node_ref+0x570/0x690 fs/jffs2/nodelist.c:592 Call Trace: <TASK> jffs2_sum_write_data fs/jffs2/summary.c:841 [inline] jffs2_sum_write_sumnode+0xd1a/0x1da0 fs/jffs2/summary.c:874 jffs2_do_reserve_space+0xa18/0xd60 fs/jffs2/nodemgmt.c:388 jffs2_reserve_space+0x55f/0xaa0 fs/jffs2/nodemgmt.c:197 jffs2_write_inode_range+0x246/0xb50 fs/jffs2/write.c:362 jffs2_write_end+0x726/0x15d0 fs/jffs2/file.c:301 generic_perform_write+0x314/0x5d0 mm/filemap.c:3856 __generic_file_write_iter+0x2ae/0x4d0 mm/filemap.c:3973 generic_file_write_iter+0xe3/0x350 mm/filemap.c:4005 call_write_iter include/linux/fs.h:2265 [inline] do_iter_readv_writev+0x20f/0x3c0 fs/read_write.c:735 do_iter_write+0x186/0x710 fs/read_write.c:861 vfs_iter_write+0x70/0xa0 fs/read_write.c:902 iter_file_splice_write+0x73b/0xc90 fs/splice.c:685 do_splice_from fs/splice.c:763 [inline] direct_splice_actor+0x10c/0x170 fs/splice.c:950 splice_direct_to_actor+0x337/0xa10 fs/splice.c:896 do_splice_direct+0x1a9/0x280 fs/splice.c:1002 do_sendfile+0xb13/0x12c0 fs/read_write.c:1255 __do_sys_sendfile64 fs/read_write.c:1323 [inline] __se_sys_sendfile64 fs/read_write.c:1309 [inline] __x64_sys_sendfile64+0x1cf/0x210 fs/read_write.c:1309 do_syscall_x64 arch/x86/entry/common.c:51 [inline] do_syscall_64+0x35/0x80 arch/x86/entry/common.c:81 entry_SYSCALL_64_after_hwframe+0x6e/0xd8 Fix this issue by checking return value of jffs2_prealloc_raw_node_refs before calling jffs2_sum_write_data. Found by Linux Verification Center (linuxtesting.org) with Syzkaller. Cc: stable@vger.kernel.org Fixes: 2f78540 ("[JFFS2] Reduce visibility of raw_node_ref to upper layers of JFFS2 code.") Signed-off-by: Artem Sadovnikov <a.sadovnikov@ispras.ru> Reviewed-by: Zhihao Cheng <chengzhihao1@huawei.com> Signed-off-by: Richard Weinberger <richard@nod.at>
kdave
pushed a commit
that referenced
this pull request
Jun 9, 2025
1. LINE#1794 - LINE#1887 is some codes about function of bch_cache_set_alloc(). 2. LINE#2078 - LINE#2142 is some codes about function of register_cache_set(). 3. register_cache_set() will call bch_cache_set_alloc() in LINE#2098. 1794 struct cache_set *bch_cache_set_alloc(struct cache_sb *sb) 1795 { ... 1860 if (!(c->devices = kcalloc(c->nr_uuids, sizeof(void *), GFP_KERNEL)) || 1861 mempool_init_slab_pool(&c->search, 32, bch_search_cache) || 1862 mempool_init_kmalloc_pool(&c->bio_meta, 2, 1863 sizeof(struct bbio) + sizeof(struct bio_vec) * 1864 bucket_pages(c)) || 1865 mempool_init_kmalloc_pool(&c->fill_iter, 1, iter_size) || 1866 bioset_init(&c->bio_split, 4, offsetof(struct bbio, bio), 1867 BIOSET_NEED_BVECS|BIOSET_NEED_RESCUER) || 1868 !(c->uuids = alloc_bucket_pages(GFP_KERNEL, c)) || 1869 !(c->moving_gc_wq = alloc_workqueue("bcache_gc", 1870 WQ_MEM_RECLAIM, 0)) || 1871 bch_journal_alloc(c) || 1872 bch_btree_cache_alloc(c) || 1873 bch_open_buckets_alloc(c) || 1874 bch_bset_sort_state_init(&c->sort, ilog2(c->btree_pages))) 1875 goto err; ^^^^^^^^ 1876 ... 1883 return c; 1884 err: 1885 bch_cache_set_unregister(c); ^^^^^^^^^^^^^^^^^^^^^^^^^^^ 1886 return NULL; 1887 } ... 2078 static const char *register_cache_set(struct cache *ca) 2079 { ... 2098 c = bch_cache_set_alloc(&ca->sb); 2099 if (!c) 2100 return err; ^^^^^^^^^^ ... 2128 ca->set = c; 2129 ca->set->cache[ca->sb.nr_this_dev] = ca; ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ... 2138 return NULL; 2139 err: 2140 bch_cache_set_unregister(c); 2141 return err; 2142 } (1) If LINE#1860 - LINE#1874 is true, then do 'goto err'(LINE#1875) and call bch_cache_set_unregister()(LINE#1885). (2) As (1) return NULL(LINE#1886), LINE#2098 - LINE#2100 would return. (3) As (2) has returned, LINE#2128 - LINE#2129 would do *not* give the value to c->cache[], it means that c->cache[] is NULL. LINE#1624 - LINE#1665 is some codes about function of cache_set_flush(). As (1), in LINE#1885 call bch_cache_set_unregister() ---> bch_cache_set_stop() ---> closure_queue() -.-> cache_set_flush() (as below LINE#1624) 1624 static void cache_set_flush(struct closure *cl) 1625 { ... 1654 for_each_cache(ca, c, i) 1655 if (ca->alloc_thread) ^^ 1656 kthread_stop(ca->alloc_thread); ... 1665 } (4) In LINE#1655 ca is NULL(see (3)) in cache_set_flush() then the kernel crash occurred as below: [ 846.712887] bcache: register_cache() error drbd6: cannot allocate memory [ 846.713242] bcache: register_bcache() error : failed to register device [ 846.713336] bcache: cache_set_free() Cache set 2f84bdc1-498a-4f2f-98a7-01946bf54287 unregistered [ 846.713768] BUG: unable to handle kernel NULL pointer dereference at 00000000000009f8 [ 846.714790] PGD 0 P4D 0 [ 846.715129] Oops: 0000 [#1] SMP PTI [ 846.715472] CPU: 19 PID: 5057 Comm: kworker/19:16 Kdump: loaded Tainted: G OE --------- - - 4.18.0-147.5.1.el8_1.5es.3.x86_64 #1 [ 846.716082] Hardware name: ESPAN GI-25212/X11DPL-i, BIOS 2.1 06/15/2018 [ 846.716451] Workqueue: events cache_set_flush [bcache] [ 846.716808] RIP: 0010:cache_set_flush+0xc9/0x1b0 [bcache] [ 846.717155] Code: 00 4c 89 a5 b0 03 00 00 48 8b 85 68 f6 ff ff a8 08 0f 84 88 00 00 00 31 db 66 83 bd 3c f7 ff ff 00 48 8b 85 48 ff ff ff 74 28 <48> 8b b8 f8 09 00 00 48 85 ff 74 05 e8 b6 58 a2 e1 0f b7 95 3c f7 [ 846.718026] RSP: 0018:ffffb56dcf85fe70 EFLAGS: 00010202 [ 846.718372] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000 [ 846.718725] RDX: 0000000000000001 RSI: 0000000040000001 RDI: 0000000000000000 [ 846.719076] RBP: ffffa0ccc0f20df8 R08: ffffa0ce1fedb118 R09: 000073746e657665 [ 846.719428] R10: 8080808080808080 R11: 0000000000000000 R12: ffffa0ce1fee8700 [ 846.719779] R13: ffffa0ccc0f211a8 R14: ffffa0cd1b902840 R15: ffffa0ccc0f20e00 [ 846.720132] FS: 0000000000000000(0000) GS:ffffa0ce1fec0000(0000) knlGS:0000000000000000 [ 846.720726] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 846.721073] CR2: 00000000000009f8 CR3: 00000008ba00a005 CR4: 00000000007606e0 [ 846.721426] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 846.721778] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 846.722131] PKRU: 55555554 [ 846.722467] Call Trace: [ 846.722814] process_one_work+0x1a7/0x3b0 [ 846.723157] worker_thread+0x30/0x390 [ 846.723501] ? create_worker+0x1a0/0x1a0 [ 846.723844] kthread+0x112/0x130 [ 846.724184] ? kthread_flush_work_fn+0x10/0x10 [ 846.724535] ret_from_fork+0x35/0x40 Now, check whether that ca is NULL in LINE#1655 to fix the issue. Signed-off-by: Linggang Zeng <linggang.zeng@easystack.cn> Signed-off-by: Mingzhe Zou <mingzhe.zou@easystack.cn> Signed-off-by: Coly Li <colyli@kernel.org> Link: https://lore.kernel.org/r/20250527051601.74407-2-colyli@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
kdave
pushed a commit
that referenced
this pull request
Jun 9, 2025
The generic/397 test hits a BUG_ON for the case of encrypted inode with unaligned file size (for example, 33K or 1K): [ 877.737811] run fstests generic/397 at 2025-01-03 12:34:40 [ 877.875761] libceph: mon0 (2)127.0.0.1:40674 session established [ 877.876130] libceph: client4614 fsid 19b90bca-f1ae-47a6-93dd-0b03ee637949 [ 877.991965] libceph: mon0 (2)127.0.0.1:40674 session established [ 877.992334] libceph: client4617 fsid 19b90bca-f1ae-47a6-93dd-0b03ee637949 [ 878.017234] libceph: mon0 (2)127.0.0.1:40674 session established [ 878.017594] libceph: client4620 fsid 19b90bca-f1ae-47a6-93dd-0b03ee637949 [ 878.031394] xfs_io (pid 18988) is setting deprecated v1 encryption policy; recommend upgrading to v2. [ 878.054528] libceph: mon0 (2)127.0.0.1:40674 session established [ 878.054892] libceph: client4623 fsid 19b90bca-f1ae-47a6-93dd-0b03ee637949 [ 878.070287] libceph: mon0 (2)127.0.0.1:40674 session established [ 878.070704] libceph: client4626 fsid 19b90bca-f1ae-47a6-93dd-0b03ee637949 [ 878.264586] libceph: mon0 (2)127.0.0.1:40674 session established [ 878.265258] libceph: client4629 fsid 19b90bca-f1ae-47a6-93dd-0b03ee637949 [ 878.374578] -----------[ cut here ]------------ [ 878.374586] kernel BUG at net/ceph/messenger.c:1070! [ 878.375150] Oops: invalid opcode: 0000 [#1] PREEMPT SMP NOPTI [ 878.378145] CPU: 2 UID: 0 PID: 4759 Comm: kworker/2:9 Not tainted 6.13.0-rc5+ #1 [ 878.378969] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014 [ 878.380167] Workqueue: ceph-msgr ceph_con_workfn [ 878.381639] RIP: 0010:ceph_msg_data_cursor_init+0x42/0x50 [ 878.382152] Code: 89 17 48 8b 46 70 55 48 89 47 08 c7 47 18 00 00 00 00 48 89 e5 e8 de cc ff ff 5d 31 c0 31 d2 31 f6 31 ff c3 cc cc cc cc 0f 0b <0f> 0b 0f 0b 66 2e 0f 1f 84 00 00 00 00 00 90 90 90 90 90 90 90 90 [ 878.383928] RSP: 0018:ffffb4ffc7cbbd28 EFLAGS: 00010287 [ 878.384447] RAX: ffffffff82bb9ac0 RBX: ffff981390c2f1f8 RCX: 0000000000000000 [ 878.385129] RDX: 0000000000009000 RSI: ffff981288232b58 RDI: ffff981390c2f378 [ 878.385839] RBP: ffffb4ffc7cbbe18 R08: 0000000000000000 R09: 0000000000000000 [ 878.386539] R10: 0000000000000000 R11: 0000000000000000 R12: ffff981390c2f030 [ 878.387203] R13: ffff981288232b58 R14: 0000000000000029 R15: 0000000000000001 [ 878.387877] FS: 0000000000000000(0000) GS:ffff9814b7900000(0000) knlGS:0000000000000000 [ 878.388663] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 878.389212] CR2: 00005e106a0554e0 CR3: 0000000112bf0001 CR4: 0000000000772ef0 [ 878.389921] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 878.390620] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 878.391307] PKRU: 55555554 [ 878.391567] Call Trace: [ 878.391807] <TASK> [ 878.392021] ? show_regs+0x71/0x90 [ 878.392391] ? die+0x38/0xa0 [ 878.392667] ? do_trap+0xdb/0x100 [ 878.392981] ? do_error_trap+0x75/0xb0 [ 878.393372] ? ceph_msg_data_cursor_init+0x42/0x50 [ 878.393842] ? exc_invalid_op+0x53/0x80 [ 878.394232] ? ceph_msg_data_cursor_init+0x42/0x50 [ 878.394694] ? asm_exc_invalid_op+0x1b/0x20 [ 878.395099] ? ceph_msg_data_cursor_init+0x42/0x50 [ 878.395583] ? ceph_con_v2_try_read+0xd16/0x2220 [ 878.396027] ? _raw_spin_unlock+0xe/0x40 [ 878.396428] ? raw_spin_rq_unlock+0x10/0x40 [ 878.396842] ? finish_task_switch.isra.0+0x97/0x310 [ 878.397338] ? __schedule+0x44b/0x16b0 [ 878.397738] ceph_con_workfn+0x326/0x750 [ 878.398121] process_one_work+0x188/0x3d0 [ 878.398522] ? __pfx_worker_thread+0x10/0x10 [ 878.398929] worker_thread+0x2b5/0x3c0 [ 878.399310] ? __pfx_worker_thread+0x10/0x10 [ 878.399727] kthread+0xe1/0x120 [ 878.400031] ? __pfx_kthread+0x10/0x10 [ 878.400431] ret_from_fork+0x43/0x70 [ 878.400771] ? __pfx_kthread+0x10/0x10 [ 878.401127] ret_from_fork_asm+0x1a/0x30 [ 878.401543] </TASK> [ 878.401760] Modules linked in: hctr2 nhpoly1305_avx2 nhpoly1305_sse2 nhpoly1305 chacha_generic chacha_x86_64 libchacha adiantum libpoly1305 essiv authenc mptcp_diag xsk_diag tcp_diag udp_diag raw_diag inet_diag unix_diag af_packet_diag netlink_diag intel_rapl_msr intel_rapl_common intel_uncore_frequency_common skx_edac_common nfit kvm_intel kvm crct10dif_pclmul crc32_pclmul polyval_clmulni polyval_generic ghash_clmulni_intel sha256_ssse3 sha1_ssse3 aesni_intel joydev crypto_simd cryptd rapl input_leds psmouse sch_fq_codel serio_raw bochs i2c_piix4 floppy qemu_fw_cfg i2c_smbus mac_hid pata_acpi msr parport_pc ppdev lp parport efi_pstore ip_tables x_tables [ 878.407319] ---[ end trace 0000000000000000 ]--- [ 878.407775] RIP: 0010:ceph_msg_data_cursor_init+0x42/0x50 [ 878.408317] Code: 89 17 48 8b 46 70 55 48 89 47 08 c7 47 18 00 00 00 00 48 89 e5 e8 de cc ff ff 5d 31 c0 31 d2 31 f6 31 ff c3 cc cc cc cc 0f 0b <0f> 0b 0f 0b 66 2e 0f 1f 84 00 00 00 00 00 90 90 90 90 90 90 90 90 [ 878.410087] RSP: 0018:ffffb4ffc7cbbd28 EFLAGS: 00010287 [ 878.410609] RAX: ffffffff82bb9ac0 RBX: ffff981390c2f1f8 RCX: 0000000000000000 [ 878.411318] RDX: 0000000000009000 RSI: ffff981288232b58 RDI: ffff981390c2f378 [ 878.412014] RBP: ffffb4ffc7cbbe18 R08: 0000000000000000 R09: 0000000000000000 [ 878.412735] R10: 0000000000000000 R11: 0000000000000000 R12: ffff981390c2f030 [ 878.413438] R13: ffff981288232b58 R14: 0000000000000029 R15: 0000000000000001 [ 878.414121] FS: 0000000000000000(0000) GS:ffff9814b7900000(0000) knlGS:0000000000000000 [ 878.414935] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 878.415516] CR2: 00005e106a0554e0 CR3: 0000000112bf0001 CR4: 0000000000772ef0 [ 878.416211] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 878.416907] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 878.417630] PKRU: 55555554 (gdb) l *ceph_msg_data_cursor_init+0x42 0xffffffff823b45a2 is in ceph_msg_data_cursor_init (net/ceph/messenger.c:1070). 1065 1066 void ceph_msg_data_cursor_init(struct ceph_msg_data_cursor *cursor, 1067 struct ceph_msg *msg, size_t length) 1068 { 1069 BUG_ON(!length); 1070 BUG_ON(length > msg->data_length); 1071 BUG_ON(!msg->num_data_items); 1072 1073 cursor->total_resid = length; 1074 cursor->data = msg->data; The issue takes place because of this: [ 202.628853] libceph: net/ceph/messenger_v2.c:2034 prepare_sparse_read_data(): msg->data_length 33792, msg->sparse_read_total 36864 1070 BUG_ON(length > msg->data_length); The generic/397 test (xfstests) executes such steps: (1) create encrypted files and directories; (2) access the created files and folders with encryption key; (3) access the created files and folders without encryption key. The issue takes place in this portion of code: if (IS_ENCRYPTED(inode)) { struct page **pages; size_t page_off; err = iov_iter_get_pages_alloc2(&subreq->io_iter, &pages, len, &page_off); if (err < 0) { doutc(cl, "%llx.%llx failed to allocate pages, %d\n", ceph_vinop(inode), err); goto out; } /* should always give us a page-aligned read */ WARN_ON_ONCE(page_off); len = err; err = 0; osd_req_op_extent_osd_data_pages(req, 0, pages, len, 0, false, false); The reason of the issue is that subreq->io_iter.count keeps unaligned value of length: [ 347.751182] lib/iov_iter.c:1185 __iov_iter_get_pages_alloc(): maxsize 36864, maxpages 4294967295, start 18446659367320516064 [ 347.752808] lib/iov_iter.c:1196 __iov_iter_get_pages_alloc(): maxsize 33792, maxpages 4294967295, start 18446659367320516064 [ 347.754394] lib/iov_iter.c:1015 iter_folioq_get_pages(): maxsize 33792, maxpages 4294967295, extracted 0, _start_offset 18446659367320516064 This patch simply assigns the aligned value to subreq->io_iter.count before calling iov_iter_get_pages_alloc2(). [ idryomov: tag the comment with FIXME to make it clear that it's only a workaround for netfslib not coexisting with fscrypt nicely (this is also noted in another pre-existing comment) ] Cc: David Howells <dhowells@redhat.com> Cc: stable@vger.kernel.org Fixes: ee4cdf7 ("netfs: Speed up buffered reading") Signed-off-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com> Reviewed-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
kdave
pushed a commit
that referenced
this pull request
Jun 9, 2025
Kernel user spaces accesses to not exported pages in atomic context incorrectly try to resolve the page fault. With debug options enabled call traces like this can be seen: BUG: sleeping function called from invalid context at kernel/locking/rwsem.c:1523 in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 419074, name: qemu-system-s39 preempt_count: 1, expected: 0 RCU nest depth: 0, expected: 0 INFO: lockdep is turned off. Preemption disabled at: [<00000383ea47cfa2>] copy_page_from_iter_atomic+0xa2/0x8a0 CPU: 12 UID: 0 PID: 419074 Comm: qemu-system-s39 Tainted: G W 6.16.0-20250531.rc0.git0.69b3a602feac.63.fc42.s390x+debug #1 PREEMPT Tainted: [W]=WARN Hardware name: IBM 3931 A01 703 (LPAR) Call Trace: [<00000383e990d282>] dump_stack_lvl+0xa2/0xe8 [<00000383e99bf152>] __might_resched+0x292/0x2d0 [<00000383eaa7c374>] down_read+0x34/0x2d0 [<00000383e99432f8>] do_secure_storage_access+0x108/0x360 [<00000383eaa724b0>] __do_pgm_check+0x130/0x220 [<00000383eaa842e4>] pgm_check_handler+0x114/0x160 [<00000383ea47d028>] copy_page_from_iter_atomic+0x128/0x8a0 ([<00000383ea47d016>] copy_page_from_iter_atomic+0x116/0x8a0) [<00000383e9c45eae>] generic_perform_write+0x16e/0x310 [<00000383e9eb87f4>] ext4_buffered_write_iter+0x84/0x160 [<00000383e9da0de4>] vfs_write+0x1c4/0x460 [<00000383e9da123c>] ksys_write+0x7c/0x100 [<00000383eaa7284e>] __do_syscall+0x15e/0x280 [<00000383eaa8417e>] system_call+0x6e/0x90 INFO: lockdep is turned off. It is not allowed to take the mmap_lock while in atomic context. Therefore handle such a secure storage access fault as if the accessed page is not mapped: the uaccess function will return -EFAULT, and the caller has to deal with this. Usually this means that the access is retried in process context, which allows to resolve the page fault (or in this case export the page). Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com> Acked-by: Alexander Gordeev <agordeev@linux.ibm.com> Acked-by: Christian Borntraeger <borntraeger@linux.ibm.com> Link: https://lore.kernel.org/r/20250603134936.1314139-1-hca@linux.ibm.com Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
kdave
pushed a commit
that referenced
this pull request
Jun 11, 2025
When building the free space tree with the block group tree feature enabled, we can hit an assertion failure like this: BTRFS info (device loop0 state M): rebuilding free space tree assertion failed: ret == 0, in fs/btrfs/free-space-tree.c:1102 ------------[ cut here ]------------ kernel BUG at fs/btrfs/free-space-tree.c:1102! Internal error: Oops - BUG: 00000000f2000800 [#1] SMP Modules linked in: CPU: 1 UID: 0 PID: 6592 Comm: syz-executor322 Not tainted 6.15.0-rc7-syzkaller-gd7fa1af5b33e #0 PREEMPT Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 05/07/2025 pstate: 60400005 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--) pc : populate_free_space_tree+0x514/0x518 fs/btrfs/free-space-tree.c:1102 lr : populate_free_space_tree+0x514/0x518 fs/btrfs/free-space-tree.c:1102 sp : ffff8000a4ce7600 x29: ffff8000a4ce76e0 x28: ffff0000c9bc6000 x27: ffff0000ddfff3d8 x26: ffff0000ddfff378 x25: dfff800000000000 x24: 0000000000000001 x23: ffff8000a4ce7660 x22: ffff70001499cecc x21: ffff0000e1d8c160 x20: ffff0000e1cb7800 x19: ffff0000e1d8c0b0 x18: 00000000ffffffff x17: ffff800092f39000 x16: ffff80008ad27e48 x15: ffff700011e740c0 x14: 1ffff00011e740c0 x13: 0000000000000004 x12: ffffffffffffffff x11: ffff700011e740c0 x10: 0000000000ff0100 x9 : 94ef24f55d2dbc00 x8 : 94ef24f55d2dbc00 x7 : 0000000000000001 x6 : 0000000000000001 x5 : ffff8000a4ce6f98 x4 : ffff80008f415ba0 x3 : ffff800080548ef0 x2 : 0000000000000000 x1 : 0000000100000000 x0 : 000000000000003e Call trace: populate_free_space_tree+0x514/0x518 fs/btrfs/free-space-tree.c:1102 (P) btrfs_rebuild_free_space_tree+0x14c/0x54c fs/btrfs/free-space-tree.c:1337 btrfs_start_pre_rw_mount+0xa78/0xe10 fs/btrfs/disk-io.c:3074 btrfs_remount_rw fs/btrfs/super.c:1319 [inline] btrfs_reconfigure+0x828/0x2418 fs/btrfs/super.c:1543 reconfigure_super+0x1d4/0x6f0 fs/super.c:1083 do_remount fs/namespace.c:3365 [inline] path_mount+0xb34/0xde0 fs/namespace.c:4200 do_mount fs/namespace.c:4221 [inline] __do_sys_mount fs/namespace.c:4432 [inline] __se_sys_mount fs/namespace.c:4409 [inline] __arm64_sys_mount+0x3e8/0x468 fs/namespace.c:4409 __invoke_syscall arch/arm64/kernel/syscall.c:35 [inline] invoke_syscall+0x98/0x2b8 arch/arm64/kernel/syscall.c:49 el0_svc_common+0x130/0x23c arch/arm64/kernel/syscall.c:132 do_el0_svc+0x48/0x58 arch/arm64/kernel/syscall.c:151 el0_svc+0x58/0x17c arch/arm64/kernel/entry-common.c:767 el0t_64_sync_handler+0x78/0x108 arch/arm64/kernel/entry-common.c:786 el0t_64_sync+0x198/0x19c arch/arm64/kernel/entry.S:600 Code: f0047182 91178042 528089c3 9771d47 (d4210000) ---[ end trace 0000000000000000 ]--- This happens because we are processing an empty block group, which has no extents allocated from it, there are no items for this block group, including the block group item since block group items are stored in a dedicated tree when using the block group tree feature. It also means this is the block group with the highest start offset, so there are no higher keys in the extent root, hence btrfs_search_slot_for_read() returns 1 (no higher key found). Fix this by asserting 'ret' is 0 only if the block group tree feature is not enabled, in which case we should find a block group item for the block group since it's stored in the extent root and block group item keys are greater than extent item keys (the value for BTRFS_BLOCK_GROUP_ITEM_KEY is 192 and for BTRFS_EXTENT_ITEM_KEY and BTRFS_METADATA_ITEM_KEY the values are 168 and 169 respectively). In case 'ret' is 1, we just need to add a record to the free space tree which spans the whole block group, and we can achieve this by making 'ret == 0' as the while loop's condition. Reported-by: syzbot+36fae25c35159a763a2a@syzkaller.appspotmail.com Link: https://lore.kernel.org/linux-btrfs/6841dca8.a00a0220.d4325.0020.GAE@google.com/ Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
kdave
pushed a commit
that referenced
this pull request
Jun 11, 2025
[BUG] There is syzbot based reproducer that can crash the kernel, with the following call trace: (With some debug output added) DEBUG: rescue=ibadroots parsed BTRFS: device fsid 14d642db-7b15-43e4-81e6-4b8fac6a25f8 devid 1 transid 8 /dev/loop0 (7:0) scanned by repro (1010) BTRFS info (device loop0): first mount of filesystem 14d642db-7b15-43e4-81e6-4b8fac6a25f8 BTRFS info (device loop0): using blake2b (blake2b-256-generic) checksum algorithm BTRFS info (device loop0): using free-space-tree BTRFS warning (device loop0): checksum verify failed on logical 5312512 mirror 1 wanted 0xb043382657aede36608fd3386d6b001692ff406164733d94e2d9a180412c6003 found 0x810ceb2bacb7f0f9eb2bf3b2b15c02af867cb35ad450898169f3b1f0bd818651 level 0 DEBUG: read tree root path failed for tree csum, ret=-5 BTRFS warning (device loop0): checksum verify failed on logical 5328896 mirror 1 wanted 0x51be4e8b303da58e6340226815b70e3a93592dac3f30dd510c7517454de8567a found 0x51be4e8b303da58e634022a315b70e3a93592dac3f30dd510c7517454de8567a level 0 BTRFS warning (device loop0): checksum verify failed on logical 5292032 mirror 1 wanted 0x1924ccd683be9efc2fa98582ef58760e3848e9043db8649ee382681e220cdee4 found 0x0cb6184f6e8799d9f8cb335dccd1d1832da1071d12290dab3b85b587ecacca6e level 0 process 'repro' launched './file2' with NULL argv: empty string added DEBUG: no csum root, idatacsums=0 ibadroots=134217728 Oops: general protection fault, probably for non-canonical address 0xdffffc0000000041: 0000 [#1] SMP KASAN NOPTI KASAN: null-ptr-deref in range [0x0000000000000208-0x000000000000020f] CPU: 5 UID: 0 PID: 1010 Comm: repro Tainted: G OE 6.15.0-custom+ torvalds#249 PREEMPT(full) Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS unknown 02/02/2022 RIP: 0010:btrfs_lookup_csum+0x93/0x3d0 [btrfs] Call Trace: <TASK> btrfs_lookup_bio_sums+0x47a/0xdf0 [btrfs] btrfs_submit_bbio+0x43e/0x1a80 [btrfs] submit_one_bio+0xde/0x160 [btrfs] btrfs_readahead+0x498/0x6a0 [btrfs] read_pages+0x1c3/0xb20 page_cache_ra_order+0x4b5/0xc20 filemap_get_pages+0x2d3/0x19e0 filemap_read+0x314/0xde0 __kernel_read+0x35b/0x900 bprm_execve+0x62e/0x1140 do_execveat_common.isra.0+0x3fc/0x520 __x64_sys_execveat+0xdc/0x130 do_syscall_64+0x54/0x1d0 entry_SYSCALL_64_after_hwframe+0x76/0x7e ---[ end trace 0000000000000000 ]--- [CAUSE] Firstly the fs has a corrupted csum tree root, thus to mount the fs we have to go "ro,rescue=ibadroots" mount option. Normally with that mount option, a bad csum tree root should set BTRFS_FS_STATE_NO_DATA_CSUMS flag, so that any future data read will ignore csum search. But in this particular case, we have the following call trace that caused NULL csum root, but not setting BTRFS_FS_STATE_NO_DATA_CSUMS: load_global_roots_objectid(): ret = btrfs_search_slot(); /* Succeeded */ btrfs_item_key_to_cpu() found = true; /* We found the root item for csum tree. */ root = read_tree_root_path(); if (IS_ERR(root)) { if (!btrfs_test_opt(fs_info, IGNOREBADROOTS)) /* * Since we have rescue=ibadroots mount option, * @ret is still 0. */ break; if (!found || ret) { /* @found is true, @ret is 0, error handling for csum * tree is skipped. */ } This means we completely skipped to set BTRFS_FS_STATE_NO_DATA_CSUMS if the csum tree is corrupted, which results unexpected later csum lookup. [FIX] If read_tree_root_path() failed, always populate @ret to the error number. As at the end of the function, we need @ret to determine if we need to do the extra error handling for csum tree. Fixes: abed4aa ("btrfs: track the csum, extent, and free space trees in a rb tree") Reported-by: Zhiyu Zhang <zhiyuzhang999@gmail.com> Reported-by: Longxing Li <coregee2000@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
kdave
pushed a commit
that referenced
this pull request
Jun 13, 2025
There is no disagreement that we should check both ptp->is_virtual_clock and ptp->n_vclocks to check if the ptp virtual clock is in use. However, when we acquire ptp->n_vclocks_mux to read ptp->n_vclocks in ptp_vclock_in_use(), we observe a recursive lock in the call trace starting from n_vclocks_store(). ============================================ WARNING: possible recursive locking detected 6.15.0-rc6 #1 Not tainted -------------------------------------------- syz.0.1540/13807 is trying to acquire lock: ffff888035a24868 (&ptp->n_vclocks_mux){+.+.}-{4:4}, at: ptp_vclock_in_use drivers/ptp/ptp_private.h:103 [inline] ffff888035a24868 (&ptp->n_vclocks_mux){+.+.}-{4:4}, at: ptp_clock_unregister+0x21/0x250 drivers/ptp/ptp_clock.c:415 but task is already holding lock: ffff888030704868 (&ptp->n_vclocks_mux){+.+.}-{4:4}, at: n_vclocks_store+0xf1/0x6d0 drivers/ptp/ptp_sysfs.c:215 other info that might help us debug this: Possible unsafe locking scenario: CPU0 ---- lock(&ptp->n_vclocks_mux); lock(&ptp->n_vclocks_mux); *** DEADLOCK *** .... ============================================ The best way to solve this is to remove the logic that checks ptp->n_vclocks in ptp_vclock_in_use(). The reason why this is appropriate is that any path that uses ptp->n_vclocks must unconditionally check if ptp->n_vclocks is greater than 0 before unregistering vclocks, and all functions are already written this way. And in the function that uses ptp->n_vclocks, we already get ptp->n_vclocks_mux before unregistering vclocks. Therefore, we need to remove the redundant check for ptp->n_vclocks in ptp_vclock_in_use() to prevent recursive locking. Fixes: 73f3706 ("ptp: support ptp physical/virtual clocks conversion") Signed-off-by: Jeongjun Park <aha310510@gmail.com> Acked-by: Richard Cochran <richardcochran@gmail.com> Link: https://patch.msgid.link/20250520160717.7350-1-aha310510@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Kindly consider to pull in these changes. These patch were individually submitted to the mailing list before. Thanks.