Skip to content

Conversation

asj
Copy link

@asj asj commented Oct 7, 2015

Kindly consider to pull in these changes. These patch were individually submitted to the mailing list before. Thanks.

asj and others added 30 commits September 29, 2015 16:29
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This will return EIO when __bread() fails to read SB,
instead of EINVAL.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
…INTK defined

error handling logic behaves differently with or without
CONFIG_PRINTK defined, since there are two copies of the same
function which a bit of different logic

One, when CONFIG_PRINTK is defined, code is

__btrfs_std_error(..)
{
::
       save_error_info(fs_info);
       if (sb->s_flags & MS_BORN)
               btrfs_handle_error(fs_info);
}

and two when CONFIG_PRINTK is not defined, the code is

__btrfs_std_error(..)
{
::
       if (sb->s_flags & MS_BORN) {
               save_error_info(fs_info);
               btrfs_handle_error(fs_info);
        }
}

I doubt if this was intentional ? and appear to have caused since
we maintain two copies of the same function and they got diverged
with commits.

Now to decide which logic is correct reviewed changes as below,

 533574c
Commit added two copies of this function

 cf79ffb
Commit made change to only one copy of the function and to the
copy when CONFIG_PRINTK is defined.

To fix this, instead of maintaining two copies of same function
approach, maintain single function, and just put the extra
portion of the code under CONFIG_PRINTK define.

This patch just does that. And keeps code of with CONFIG_PRINTK
defined.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_error() and btrfs_std_error() does the same thing
and calls _btrfs_std_error(), so consolidate them together.
And the main motivation is that btrfs_error() is closely
named with btrfs_err(), one handles error action the other
is to log the error, so don't closely name them.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Suggested-by: David Sterba <dsterba@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
…ot found

Use btrfs specific error code BTRFS_ERROR_DEV_MISSING_NOT_FOUND instead
of -ENOENT.  Next this removes the logging when user specifies "missing"
and we don't find it in the kernel device list. Logging are for system
events not for user input errors.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This uses a chunk of code from btrfs_read_dev_super() and creates
a function called btrfs_read_dev_one_super() so that next patch
can use it for scratch superblock.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
[renamed bufhead to bh]
Signed-off-by: David Sterba <dsterba@suse.com>
This patch updates and renames btrfs_scratch_superblocks, (which is used
by the replace device thread), with those fixes from the scratch
superblock code section of btrfs_rm_device(). The fixes are:
  Scratch all copies of superblock
  Notify kobject that superblock has been changed
  Update time on the device

So that btrfs_rm_device() can use the function
btrfs_scratch_superblocks() instead of its own scratch code. And further
replace deivce code which similarly releases device back to the system,
will have the fixes from the btrfs device delete.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
[renamed to btrfs_scratch_superblock]
Signed-off-by: David Sterba <dsterba@suse.com>
By general rule of thumb there shouldn't be any way that user land
could trigger a kernel operation just by sending wrong arguments.

Here do commit cleanups after user input has been verified.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Originally the message was not in a helper but ended up there. We should
print error messages from callers instead.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
[reworded subject and changelog]
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
[reworded subject and changelog]
Signed-off-by: David Sterba <dsterba@suse.com>
To avoid deadlock described in commit 084b6e7 ("btrfs: Fix a
lockdep warning when running xfstest."), we should move kobj stuff out
of dev_replace lock range.

  "It is because the btrfs_kobj_{add/rm}_device() will call memory
  allocation with GFP_KERNEL,
  which may flush fs page cache to free space, waiting for it self to do
  the commit, causing the deadlock.

  To solve the problem, move btrfs_kobj_{add/rm}_device() out of the
  dev_replace lock range, also involing split the
  btrfs_rm_dev_replace_srcdev() function into remove and free parts.

  Now only btrfs_rm_dev_replace_remove_srcdev() is called in dev_replace
  lock range, and kobj_{add/rm} and btrfs_rm_dev_replace_free_srcdev() are
  called out of the lock range."

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
[added lockup description]
Signed-off-by: David Sterba <dsterba@suse.com>
This patch will log return value of add/del_qgroup_relation()
and pass the err code of btrfs_run_qgroups to the btrfs_std_error().

Signed-off-by: Anand Jain <anand.jain@oracle.com>
A part of code from btrfs_scan_one_device() is moved to a new
function btrfs_read_disk_super(), so that former function looks
cleaner and moves the code to ensure null terminating label to it as well.
Further there is opportunity to merge various duplicate
code on read disk super. Earlier attempt on this was highlighted
that there was some issues for which there are multiple versions,
however it was not clear what was issue. So until its worked out
we can keep it in a separate function.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Optional Label may or may not be set, or it might be set at
some time later. However while debugging to search
through the kernel logs the scripts would need the logs to
be consistent, so logs search key words shouldn't depend on the
optional variables, instead fsid is better.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
From the issue diagnosable point of view, log if the device path is
changed.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
looks like oversight, call brelse() when checksum fails.
further down the code in the non error path we do call
brelse() and so we don't see brelse() in the goto error..
paths.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
optimize check for stale device to only be checked when there
is device added or changed. If there is no update to the
device, there is no need to call btrfs_free_stale_device().

Signed-off-by: Anand Jain <anand.jain@oracle.com>
This adds an enhancement to show the seed fsid and its devices
on the btrfs sysfs.

The way sprouting handles fs_devices:
      clone seed fs_devices and add to the fs_uuids
      mem copy seed fs_devices and assign to fs_devices->seed (move dev_list)
      evacuate seed fs_devices contents to hold sprout fs devices contents

  So to be inline with this fs_devices changes during seeding,
  represent seed fsid under the sprout fsid, this is achieved
  by using the kobject_move()
  The end result will be,
    /sys/fs/btrfs/sprout-fsid/seed/level-1-seed-fsid/seed/(if)level-2-seed-fsid

Signed-off-by: Anand Jain <anand.jain@oracle.com>
We need fsid kobject to hold pool attributes however
its created only when fs is mounted. So, this patch
changes the life cycle of the fsid and devices kobjects
/sys/fs/btrfs/<fsid> and /sys/fs/btrfs/<fsid>/devices,
from created and destroyed by mount and unmount event
to created and destroyed by scanned and module-unload
events respectively.

However this does not alter life cycle of fs attributes as such.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
move a section of btrfs_rm_device() code to check for min number
of the devices into the function __check_raid_min_devices()

v2: commit update and title renamed from
    Btrfs: move check for min number of devices to a function

Signed-off-by: Anand Jain <anand.jain@oracle.com>
__check_raid_min_device() which was pealed from btrfs_rm_device()
maintianed its original code to show the block move. This patch
cleans up __check_raid_min_device().

Signed-off-by: Anand Jain <anand.jain@oracle.com>
The patch renames btrfs_dev_replace_find_srcdev() to
btrfs_find_device_by_user_input() and moves it to volumes.c.
so that delete device can use it.

v2: changed title from
    'Btrfs: create rename btrfs_dev_replace_find_srcdev()'
    and commit update

Signed-off-by: Anand Jain <anand.jain@oracle.com>
btrfs_rm_device() has a section of the code which can be replaced
btrfs_find_device_by_user_input()

Signed-off-by: Anand Jain <anand.jain@oracle.com>
The operation of device replace and device delete follows same steps
upto some depth with in btrfs kernel, however they don't share codes.
This enhancement will help replace and delete to share codes.

Btrfs: enhance check device_path in btrfs_find_device_by_user_input()

Signed-off-by: Anand Jain <anand.jain@oracle.com>
With the previous patches now the btrfs_scratch_superblocks()
is ready to be used in btrfs_rm_device() so use it.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
This introduces new ioctl BTRFS_IOC_RM_DEV_V2, which uses
enhanced struct btrfs_ioctl_vol_args_v2 to carry devid as
an user argument.

The patch won't delete the old ioctl interface and remains
backward compatible with user land progs.

Test case/script:
echo "0 $(blockdev --getsz /dev/sdf) linear /dev/sdf 0" | dmsetup create bad_disk
mkfs.btrfs -f -d raid1 -m raid1 /dev/sdd /dev/sde /dev/mapper/bad_disk
mount /dev/sdd /btrfs
dmsetup suspend bad_disk
echo "0 $(blockdev --getsz /dev/sdf) error /dev/sdf 0" | dmsetup load bad_disk
dmsetup resume bad_disk
echo "bad disk failed. now deleting/replacing"
btrfs dev del  3  /btrfs
echo $?
btrfs fi show /btrfs
umount /btrfs
btrfs-show-super /dev/sdd | egrep num_device
dmsetup remove bad_disk
wipefs -a /dev/sdf

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reported-by: Martin <m_btrfs@ml1.co.uk>
kdave pushed a commit that referenced this pull request Sep 12, 2025
Avoid below overlapping mappings by using a contiguous
non-cacheable buffer.

[    4.077708] DMA-API: stm32_fmc2_nfc 48810000.nand-controller: cacheline tracking EEXIST,
overlapping mappings aren't supported
[    4.089103] WARNING: CPU: 1 PID: 44 at kernel/dma/debug.c:568 add_dma_entry+0x23c/0x300
[    4.097071] Modules linked in:
[    4.100101] CPU: 1 PID: 44 Comm: kworker/u4:2 Not tainted 6.1.82 #1
[    4.106346] Hardware name: STMicroelectronics STM32MP257F VALID1 SNOR / MB1704 (LPDDR4 Power discrete) + MB1703 + MB1708 (SNOR MB1730) (DT)
[    4.118824] Workqueue: events_unbound deferred_probe_work_func
[    4.124674] pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[    4.131624] pc : add_dma_entry+0x23c/0x300
[    4.135658] lr : add_dma_entry+0x23c/0x300
[    4.139792] sp : ffff800009dbb490
[    4.143016] x29: ffff800009dbb4a0 x28: 0000000004008022 x27: ffff8000098a6000
[    4.150174] x26: 0000000000000000 x25: ffff8000099e7000 x24: ffff8000099e7de8
[    4.157231] x23: 00000000ffffffff x22: 0000000000000000 x21: ffff8000098a6a20
[    4.164388] x20: ffff000080964180 x19: ffff800009819ba0 x18: 0000000000000006
[    4.171545] x17: 6361727420656e69 x16: 6c6568636163203a x15: 72656c6c6f72746e
[    4.178602] x14: 6f632d646e616e2e x13: ffff800009832f58 x12: 00000000000004ec
[    4.185759] x11: 00000000000001a4 x10: ffff80000988af58 x9 : ffff800009832f58
[    4.192916] x8 : 00000000ffffefff x7 : ffff80000988af58 x6 : 80000000fffff000
[    4.199972] x5 : 000000000000bff4 x4 : 0000000000000000 x3 : 0000000000000000
[    4.207128] x2 : 0000000000000000 x1 : 0000000000000000 x0 : ffff0000812d2c40
[    4.214185] Call trace:
[    4.216605]  add_dma_entry+0x23c/0x300
[    4.220338]  debug_dma_map_sg+0x198/0x350
[    4.224373]  __dma_map_sg_attrs+0xa0/0x110
[    4.228411]  dma_map_sg_attrs+0x10/0x2c
[    4.232247]  stm32_fmc2_nfc_xfer.isra.0+0x1c8/0x3fc
[    4.237088]  stm32_fmc2_nfc_seq_read_page+0xc8/0x174
[    4.242127]  nand_read_oob+0x1d4/0x8e0
[    4.245861]  mtd_read_oob_std+0x58/0x84
[    4.249596]  mtd_read_oob+0x90/0x150
[    4.253231]  mtd_read+0x68/0xac

Signed-off-by: Christophe Kerello <christophe.kerello@foss.st.com>
Cc: stable@vger.kernel.org
Fixes: 2cd457f ("mtd: rawnand: stm32_fmc2: add STM32 FMC2 NAND flash controller driver")
Signed-off-by: Miquel Raynal <miquel.raynal@bootlin.com>
kdave pushed a commit that referenced this pull request Sep 12, 2025
Problem description
===================

Lockdep reports a possible circular locking dependency (AB/BA) between
&pl->state_mutex and &phy->lock, as follows.

phylink_resolve() // acquires &pl->state_mutex
-> phylink_major_config()
   -> phy_config_inband() // acquires &pl->phydev->lock

whereas all the other call sites where &pl->state_mutex and
&pl->phydev->lock have the locking scheme reversed. Everywhere else,
&pl->phydev->lock is acquired at the top level, and &pl->state_mutex at
the lower level. A clear example is phylink_bringup_phy().

The outlier is the newly introduced phy_config_inband() and the existing
lock order is the correct one. To understand why it cannot be the other
way around, it is sufficient to consider phylink_phy_change(), phylink's
callback from the PHY device's phy->phy_link_change() virtual method,
invoked by the PHY state machine.

phy_link_up() and phy_link_down(), the (indirect) callers of
phylink_phy_change(), are called with &phydev->lock acquired.
Then phylink_phy_change() acquires its own &pl->state_mutex, to
serialize changes made to its pl->phy_state and pl->link_config.
So all other instances of &pl->state_mutex and &phydev->lock must be
consistent with this order.

Problem impact
==============

I think the kernel runs a serious deadlock risk if an existing
phylink_resolve() thread, which results in a phy_config_inband() call,
is concurrent with a phy_link_up() or phy_link_down() call, which will
deadlock on &pl->state_mutex in phylink_phy_change(). Practically
speaking, the impact may be limited by the slow speed of the medium
auto-negotiation protocol, which makes it unlikely for the current state
to still be unresolved when a new one is detected, but I think the
problem is there. Nonetheless, the problem was discovered using lockdep.

Proposed solution
=================

Practically speaking, the phy_config_inband() requirement of having
phydev->lock acquired must transfer to the caller (phylink is the only
caller). There, it must bubble up until immediately before
&pl->state_mutex is acquired, for the cases where that takes place.

Solution details, considerations, notes
=======================================

This is the phy_config_inband() call graph:

                          sfp_upstream_ops :: connect_phy()
                          |
                          v
                          phylink_sfp_connect_phy()
                          |
                          v
                          phylink_sfp_config_phy()
                          |
                          |   sfp_upstream_ops :: module_insert()
                          |   |
                          |   v
                          |   phylink_sfp_module_insert()
                          |   |
                          |   |   sfp_upstream_ops :: module_start()
                          |   |   |
                          |   |   v
                          |   |   phylink_sfp_module_start()
                          |   |   |
                          |   v   v
                          |   phylink_sfp_config_optical()
 phylink_start()          |   |
   |   phylink_resume()   v   v
   |   |  phylink_sfp_set_config()
   |   |  |
   v   v  v
 phylink_mac_initial_config()
   |   phylink_resolve()
   |   |  phylink_ethtool_ksettings_set()
   v   v  v
   phylink_major_config()
            |
            v
    phy_config_inband()

phylink_major_config() caller #1, phylink_mac_initial_config(), does not
acquire &pl->state_mutex nor do its callers. It must acquire
&pl->phydev->lock prior to calling phylink_major_config().

phylink_major_config() caller #2, phylink_resolve() acquires
&pl->state_mutex, thus also needs to acquire &pl->phydev->lock.

phylink_major_config() caller #3, phylink_ethtool_ksettings_set(), is
completely uninteresting, because it only calls phylink_major_config()
if pl->phydev is NULL (otherwise it calls phy_ethtool_ksettings_set()).
We need to change nothing there.

Other solutions
===============

The lock inversion between &pl->state_mutex and &pl->phydev->lock has
occurred at least once before, as seen in commit c718af2 ("net:
phylink: fix ethtool -A with attached PHYs"). The solution there was to
simply not call phy_set_asym_pause() under the &pl->state_mutex. That
cannot be extended to our case though, where the phy_config_inband()
call is much deeper inside the &pl->state_mutex section.

Fixes: 5fd0f1a ("net: phylink: add negotiation of in-band capabilities")
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/20250904125238.193990-2-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
kdave pushed a commit that referenced this pull request Sep 12, 2025
If request_irq() in i40e_vsi_request_irq_msix() fails in an iteration
later than the first, the error path wants to free the IRQs requested
so far. However, it uses the wrong dev_id argument for free_irq(), so
it does not free the IRQs correctly and instead triggers the warning:

 Trying to free already-free IRQ 173
 WARNING: CPU: 25 PID: 1091 at kernel/irq/manage.c:1829 __free_irq+0x192/0x2c0
 Modules linked in: i40e(+) [...]
 CPU: 25 UID: 0 PID: 1091 Comm: NetworkManager Not tainted 6.17.0-rc1+ #1 PREEMPT(lazy)
 Hardware name: [...]
 RIP: 0010:__free_irq+0x192/0x2c0
 [...]
 Call Trace:
  <TASK>
  free_irq+0x32/0x70
  i40e_vsi_request_irq_msix.cold+0x63/0x8b [i40e]
  i40e_vsi_request_irq+0x79/0x80 [i40e]
  i40e_vsi_open+0x21f/0x2f0 [i40e]
  i40e_open+0x63/0x130 [i40e]
  __dev_open+0xfc/0x210
  __dev_change_flags+0x1fc/0x240
  netif_change_flags+0x27/0x70
  do_setlink.isra.0+0x341/0xc70
  rtnl_newlink+0x468/0x860
  rtnetlink_rcv_msg+0x375/0x450
  netlink_rcv_skb+0x5c/0x110
  netlink_unicast+0x288/0x3c0
  netlink_sendmsg+0x20d/0x430
  ____sys_sendmsg+0x3a2/0x3d0
  ___sys_sendmsg+0x99/0xe0
  __sys_sendmsg+0x8a/0xf0
  do_syscall_64+0x82/0x2c0
  entry_SYSCALL_64_after_hwframe+0x76/0x7e
  [...]
  </TASK>
 ---[ end trace 0000000000000000 ]---

Use the same dev_id for free_irq() as for request_irq().

I tested this with inserting code to fail intentionally.

Fixes: 493fb30 ("i40e: Move q_vectors from pointer to array to array of pointers")
Signed-off-by: Michal Schmidt <mschmidt@redhat.com>
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Reviewed-by: Subbaraya Sundeep <sbhatta@marvell.com>
Tested-by: Rinitha S <sx.rinitha@intel.com> (A Contingent worker at Intel)
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
kdave pushed a commit that referenced this pull request Sep 12, 2025
Hangbin Liu says:

====================
hsr: fix lock warnings

hsr_for_each_port is called in many places without holding the RCU read
lock, this may trigger warnings on debug kernels like:

  [   40.457015] [  T201] WARNING: suspicious RCU usage
  [   40.457020] [  T201] 6.17.0-rc2-virtme #1 Not tainted
  [   40.457025] [  T201] -----------------------------
  [   40.457029] [  T201] net/hsr/hsr_main.c:137 RCU-list traversed in non-reader section!!
  [   40.457036] [  T201]
                          other info that might help us debug this:

  [   40.457040] [  T201]
                          rcu_scheduler_active = 2, debug_locks = 1
  [   40.457045] [  T201] 2 locks held by ip/201:
  [   40.457050] [  T201]  #0: ffffffff93040a40 (&ops->srcu){.+.+}-{0:0}, at: rtnl_link_ops_get+0xf2/0x280
  [   40.457080] [  T201]  #1: ffffffff92e7f968 (rtnl_mutex){+.+.}-{4:4}, at: rtnl_newlink+0x5e1/0xb20
  [   40.457102] [  T201]
                          stack backtrace:
  [   40.457108] [  T201] CPU: 2 UID: 0 PID: 201 Comm: ip Not tainted 6.17.0-rc2-virtme #1 PREEMPT(full)
  [   40.457114] [  T201] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
  [   40.457117] [  T201] Call Trace:
  [   40.457120] [  T201]  <TASK>
  [   40.457126] [  T201]  dump_stack_lvl+0x6f/0xb0
  [   40.457136] [  T201]  lockdep_rcu_suspicious.cold+0x4f/0xb1
  [   40.457148] [  T201]  hsr_port_get_hsr+0xfe/0x140
  [   40.457158] [  T201]  hsr_add_port+0x192/0x940
  [   40.457167] [  T201]  ? __pfx_hsr_add_port+0x10/0x10
  [   40.457176] [  T201]  ? lockdep_init_map_type+0x5c/0x270
  [   40.457189] [  T201]  hsr_dev_finalize+0x4bc/0xbf0
  [   40.457204] [  T201]  hsr_newlink+0x3c3/0x8f0
  [   40.457212] [  T201]  ? __pfx_hsr_newlink+0x10/0x10
  [   40.457222] [  T201]  ? rtnl_create_link+0x173/0xe40
  [   40.457233] [  T201]  rtnl_newlink_create+0x2cf/0x750
  [   40.457243] [  T201]  ? __pfx_rtnl_newlink_create+0x10/0x10
  [   40.457247] [  T201]  ? __dev_get_by_name+0x12/0x50
  [   40.457252] [  T201]  ? rtnl_dev_get+0xac/0x140
  [   40.457259] [  T201]  ? __pfx_rtnl_dev_get+0x10/0x10
  [   40.457285] [  T201]  __rtnl_newlink+0x22c/0xa50
  [   40.457305] [  T201]  rtnl_newlink+0x637/0xb20

Adding rcu_read_lock() for all hsr_for_each_port() looks confusing.

Introduce a new helper, hsr_for_each_port_rtnl(), that assumes the
RTNL lock is held. This allows callers in suitable contexts to iterate
ports safely without explicit RCU locking.

Other code paths that rely on RCU protection continue to use
hsr_for_each_port() with rcu_read_lock().
====================

Link: https://patch.msgid.link/20250905091533.377443-1-liuhangbin@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
kdave pushed a commit that referenced this pull request Sep 15, 2025
5da3d94 ("PCI: mvebu: Use for_each_of_range() iterator for parsing
"ranges"") simplified code by using the for_each_of_range() iterator, but
it broke PCI enumeration on Turris Omnia (and probably other mvebu
targets).

Issue #1:

To determine range.flags, of_pci_range_parser_one() uses bus->get_flags(),
which resolves to of_bus_pci_get_flags(), which already returns an
IORESOURCE bit field, and NOT the original flags from the "ranges"
resource.

Then mvebu_get_tgt_attr() attempts the very same conversion again.  Remove
the misinterpretation of range.flags in mvebu_get_tgt_attr(), to restore
the intended behavior.

Issue #2:

The driver needs target and attributes, which are encoded in the raw
address values of the "/soc/pcie/ranges" resource. According to
of_pci_range_parser_one(), the raw values are stored in range.bus_addr and
range.parent_bus_addr, respectively. range.cpu_addr is a translated version
of range.parent_bus_addr, and not relevant here.

Use the correct range structure member, to extract target and attributes.
This restores the intended behavior.

Fixes: 5da3d94 ("PCI: mvebu: Use for_each_of_range() iterator for parsing "ranges"")
Reported-by: Jan Palus <jpalus@fastmail.com>
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=220479
Signed-off-by: Klaus Kudielka <klaus.kudielka@gmail.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Tested-by: Tony Dinh <mibodhi@gmail.com>
Tested-by: Jan Palus <jpalus@fastmail.com>
Link: https://patch.msgid.link/20250907102303.29735-1-klaus.kudielka@gmail.com
kdave pushed a commit that referenced this pull request Sep 15, 2025
The function ceph_process_folio_batch() sets folio_batch entries to
NULL, which is an illegal state.  Before folio_batch_release() crashes
due to this API violation, the function ceph_shift_unused_folios_left()
is supposed to remove those NULLs from the array.

However, since commit ce80b76 ("ceph: introduce
ceph_process_folio_batch() method"), this shifting doesn't happen
anymore because the "for" loop got moved to ceph_process_folio_batch(),
and now the `i` variable that remains in ceph_writepages_start()
doesn't get incremented anymore, making the shifting effectively
unreachable much of the time.

Later, commit 1551ec6 ("ceph: introduce ceph_submit_write()
method") added more preconditions for doing the shift, replacing the
`i` check (with something that is still just as broken):

- if ceph_process_folio_batch() fails, shifting never happens

- if ceph_move_dirty_page_in_page_array() was never called (because
  ceph_process_folio_batch() has returned early for some of various
  reasons), shifting never happens

- if `processed_in_fbatch` is zero (because ceph_process_folio_batch()
  has returned early for some of the reasons mentioned above or
  because ceph_move_dirty_page_in_page_array() has failed), shifting
  never happens

Since those two commits, any problem in ceph_process_folio_batch()
could crash the kernel, e.g. this way:

 BUG: kernel NULL pointer dereference, address: 0000000000000034
 #PF: supervisor write access in kernel mode
 #PF: error_code(0x0002) - not-present page
 PGD 0 P4D 0
 Oops: Oops: 0002 [#1] SMP NOPTI
 CPU: 172 UID: 0 PID: 2342707 Comm: kworker/u778:8 Not tainted 6.15.10-cm4all1-es torvalds#714 NONE
 Hardware name: Dell Inc. PowerEdge R7615/0G9DHV, BIOS 1.6.10 12/08/2023
 Workqueue: writeback wb_workfn (flush-ceph-1)
 RIP: 0010:folios_put_refs+0x85/0x140
 Code: 83 c5 01 39 e8 7e 76 48 63 c5 49 8b 5c c4 08 b8 01 00 00 00 4d 85 ed 74 05 41 8b 44 ad 00 48 8b 15 b0 >
 RSP: 0018:ffffb880af8db778 EFLAGS: 00010207
 RAX: 0000000000000001 RBX: 0000000000000000 RCX: 0000000000000003
 RDX: ffffe377cc3b0000 RSI: 0000000000000000 RDI: ffffb880af8db8c0
 RBP: 0000000000000000 R08: 000000000000007d R09: 000000000102b86f
 R10: 0000000000000001 R11: 00000000000000ac R12: ffffb880af8db8c0
 R13: 0000000000000000 R14: 0000000000000000 R15: ffff9bd262c97000
 FS:  0000000000000000(0000) GS:ffff9c8efc303000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 0000000000000034 CR3: 0000000160958004 CR4: 0000000000770ef0
 PKRU: 55555554
 Call Trace:
  <TASK>
  ceph_writepages_start+0xeb9/0x1410

The crash can be reproduced easily by changing the
ceph_check_page_before_write() return value to `-E2BIG`.

(Interestingly, the crash happens only if `huge_zero_folio` has
already been allocated; without `huge_zero_folio`,
is_huge_zero_folio(NULL) returns true and folios_put_refs() skips NULL
entries instead of dereferencing them.  That makes reproducing the bug
somewhat unreliable.  See
https://lore.kernel.org/20250826231626.218675-1-max.kellermann@ionos.com
for a discussion of this detail.)

My suggestion is to move the ceph_shift_unused_folios_left() to right
after ceph_process_folio_batch() to ensure it always gets called to
fix up the illegal folio_batch state.

Cc: stable@vger.kernel.org
Fixes: ce80b76 ("ceph: introduce ceph_process_folio_batch() method")
Link: https://lore.kernel.org/ceph-devel/aK4v548CId5GIKG1@swift.blarg.de/
Signed-off-by: Max Kellermann <max.kellermann@ionos.com>
Reviewed-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
kdave pushed a commit that referenced this pull request Sep 18, 2025
…rnal()

A crash was observed with the following output:

Oops: general protection fault, probably for non-canonical address 0xdffffc0000000000: 0000 [#1] SMP KASAN PTI
KASAN: null-ptr-deref in range [0x0000000000000000-0x0000000000000007]
CPU: 1 UID: 0 PID: 2899 Comm: syz.2.399 Not tainted 6.17.0-rc5+ #5 PREEMPT(none)
RIP: 0010:trace_kprobe_create_internal+0x3fc/0x1440 kernel/trace/trace_kprobe.c:911
Call Trace:
 <TASK>
 trace_kprobe_create_cb+0xa2/0xf0 kernel/trace/trace_kprobe.c:1089
 trace_probe_create+0xf1/0x110 kernel/trace/trace_probe.c:2246
 dyn_event_create+0x45/0x70 kernel/trace/trace_dynevent.c:128
 create_or_delete_trace_kprobe+0x5e/0xc0 kernel/trace/trace_kprobe.c:1107
 trace_parse_run_command+0x1a5/0x330 kernel/trace/trace.c:10785
 vfs_write+0x2b6/0xd00 fs/read_write.c:684
 ksys_write+0x129/0x240 fs/read_write.c:738
 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
 do_syscall_64+0x5d/0x2d0 arch/x86/entry/syscall_64.c:94
 </TASK>

Function kmemdup() may return NULL in trace_kprobe_create_internal(), add
check for it's return value.

Link: https://lore.kernel.org/all/20250916075816.3181175-1-wangliang74@huawei.com/

Fixes: 33b4e38 ("tracing: kprobe-event: Allocate string buffers from heap")
Signed-off-by: Wang Liang <wangliang74@huawei.com>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
kdave pushed a commit that referenced this pull request Sep 19, 2025
…ostcopy

When you run a KVM guest with vhost-net and migrate that guest to
another host, and you immediately enable postcopy after starting the
migration, there is a big chance that the network connection of the
guest won't work anymore on the destination side after the migration.

With a debug kernel v6.16.0, there is also a call trace that looks
like this:

 FAULT_FLAG_ALLOW_RETRY missing 881
 CPU: 6 UID: 0 PID: 549 Comm: kworker/6:2 Kdump: loaded Not tainted 6.16.0 torvalds#56 NONE
 Hardware name: IBM 3931 LA1 400 (LPAR)
 Workqueue: events irqfd_inject [kvm]
 Call Trace:
  [<00003173cbecc634>] dump_stack_lvl+0x104/0x168
  [<00003173cca69588>] handle_userfault+0xde8/0x1310
  [<00003173cc756f0c>] handle_pte_fault+0x4fc/0x760
  [<00003173cc759212>] __handle_mm_fault+0x452/0xa00
  [<00003173cc7599ba>] handle_mm_fault+0x1fa/0x6a0
  [<00003173cc73409a>] __get_user_pages+0x4aa/0xba0
  [<00003173cc7349e8>] get_user_pages_remote+0x258/0x770
  [<000031734be6f052>] get_map_page+0xe2/0x190 [kvm]
  [<000031734be6f910>] adapter_indicators_set+0x50/0x4a0 [kvm]
  [<000031734be7f674>] set_adapter_int+0xc4/0x170 [kvm]
  [<000031734be2f268>] kvm_set_irq+0x228/0x3f0 [kvm]
  [<000031734be27000>] irqfd_inject+0xd0/0x150 [kvm]
  [<00003173cc00c9ec>] process_one_work+0x87c/0x1490
  [<00003173cc00dda6>] worker_thread+0x7a6/0x1010
  [<00003173cc02dc36>] kthread+0x3b6/0x710
  [<00003173cbed2f0c>] __ret_from_fork+0xdc/0x7f0
  [<00003173cdd737ca>] ret_from_fork+0xa/0x30
 3 locks held by kworker/6:2/549:
  #0: 00000000800bc958 ((wq_completion)events){+.+.}-{0:0}, at: process_one_work+0x7ee/0x1490
  #1: 000030f3d527fbd0 ((work_completion)(&irqfd->inject)){+.+.}-{0:0}, at: process_one_work+0x81c/0x1490
  #2: 00000000f99862b0 (&mm->mmap_lock){++++}-{3:3}, at: get_map_page+0xa8/0x190 [kvm]

The "FAULT_FLAG_ALLOW_RETRY missing" indicates that handle_userfaultfd()
saw a page fault request without ALLOW_RETRY flag set, hence userfaultfd
cannot remotely resolve it (because the caller was asking for an immediate
resolution, aka, FAULT_FLAG_NOWAIT, while remote faults can take time).
With that, get_map_page() failed and the irq was lost.

We should not be strictly in an atomic environment here and the worker
should be sleepable (the call is done during an ioctl from userspace),
so we can allow adapter_indicators_set() to just sleep waiting for the
remote fault instead.

Link: https://issues.redhat.com/browse/RHEL-42486
Signed-off-by: Peter Xu <peterx@redhat.com>
[thuth: Assembled patch description and fixed some cosmetical issues]
Signed-off-by: Thomas Huth <thuth@redhat.com>
Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Acked-by: Janosch Frank <frankja@linux.ibm.com>
Fixes: f654706 ("KVM: s390/interrupt: do not pin adapter interrupt pages")
[frankja: Added fixes tag]
Signed-off-by: Janosch Frank <frankja@linux.ibm.com>
kdave pushed a commit that referenced this pull request Sep 19, 2025
syzkaller has caught us red-handed once more, this time nesting regular
spinlocks behind raw spinlocks:

  =============================
  [ BUG: Invalid wait context ]
  6.16.0-rc3-syzkaller-g7b8346bd9fce #0 Not tainted
  -----------------------------
  syz.0.29/3743 is trying to lock:
  a3ff80008e2e9e18 (&xa->xa_lock#20){....}-{3:3}, at: vgic_put_irq+0xb4/0x190 arch/arm64/kvm/vgic/vgic.c:137
  other info that might help us debug this:
  context-{5:5}
  3 locks held by syz.0.29/3743:
   #0: a3ff80008e2e90a8 (&kvm->slots_lock){+.+.}-{4:4}, at: kvm_vgic_destroy+0x50/0x624 arch/arm64/kvm/vgic/vgic-init.c:499
   #1: a3ff80008e2e9fa0 (&kvm->arch.config_lock){+.+.}-{4:4}, at: kvm_vgic_destroy+0x5c/0x624 arch/arm64/kvm/vgic/vgic-init.c:500
   #2: 58f0000021be1428 (&vgic_cpu->ap_list_lock){....}-{2:2}, at: vgic_flush_pending_lpis+0x3c/0x31c arch/arm64/kvm/vgic/vgic.c:150
  stack backtrace:
  CPU: 0 UID: 0 PID: 3743 Comm: syz.0.29 Not tainted 6.16.0-rc3-syzkaller-g7b8346bd9fce #0 PREEMPT
  Hardware name: linux,dummy-virt (DT)
  Call trace:
   show_stack+0x2c/0x3c arch/arm64/kernel/stacktrace.c:466 (C)
   __dump_stack+0x30/0x40 lib/dump_stack.c:94
   dump_stack_lvl+0xd8/0x12c lib/dump_stack.c:120
   dump_stack+0x1c/0x28 lib/dump_stack.c:129
   print_lock_invalid_wait_context kernel/locking/lockdep.c:4833 [inline]
   check_wait_context kernel/locking/lockdep.c:4905 [inline]
   __lock_acquire+0x978/0x299c kernel/locking/lockdep.c:5190
   lock_acquire+0x14c/0x2e0 kernel/locking/lockdep.c:5871
   __raw_spin_lock_irqsave include/linux/spinlock_api_smp.h:110 [inline]
   _raw_spin_lock_irqsave+0x5c/0x7c kernel/locking/spinlock.c:162
   vgic_put_irq+0xb4/0x190 arch/arm64/kvm/vgic/vgic.c:137
   vgic_flush_pending_lpis+0x24c/0x31c arch/arm64/kvm/vgic/vgic.c:158
   __kvm_vgic_vcpu_destroy+0x44/0x500 arch/arm64/kvm/vgic/vgic-init.c:455
   kvm_vgic_destroy+0x100/0x624 arch/arm64/kvm/vgic/vgic-init.c:505
   kvm_arch_destroy_vm+0x80/0x138 arch/arm64/kvm/arm.c:244
   kvm_destroy_vm virt/kvm/kvm_main.c:1308 [inline]
   kvm_put_kvm+0x800/0xff8 virt/kvm/kvm_main.c:1344
   kvm_vm_release+0x58/0x78 virt/kvm/kvm_main.c:1367
   __fput+0x4ac/0x980 fs/file_table.c:465
   ____fput+0x20/0x58 fs/file_table.c:493
   task_work_run+0x1bc/0x254 kernel/task_work.c:227
   resume_user_mode_work include/linux/resume_user_mode.h:50 [inline]
   do_notify_resume+0x1b4/0x270 arch/arm64/kernel/entry-common.c:151
   exit_to_user_mode_prepare arch/arm64/kernel/entry-common.c:169 [inline]
   exit_to_user_mode arch/arm64/kernel/entry-common.c:178 [inline]
   el0_svc+0xb4/0x160 arch/arm64/kernel/entry-common.c:768
   el0t_64_sync_handler+0x78/0x108 arch/arm64/kernel/entry-common.c:786
   el0t_64_sync+0x198/0x19c arch/arm64/kernel/entry.S:600

This is of course no good, but is at odds with how LPI refcounts are
managed. Solve the locking mess by deferring the release of unreferenced
LPIs after the ap_list_lock is released. Mark these to-be-released LPIs
specially to avoid racing with vgic_put_irq() and causing a double-free.

Since references can only be taken on LPIs with a nonzero refcount,
extending the lifetime of freed LPIs is still safe.

Reviewed-by: Marc Zyngier <maz@kernel.org>
Reported-by: syzbot+cef594105ac7e60c6d93@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/kvmarm/68acd0d9.a00a0220.33401d.048b.GAE@google.com/
Link: https://lore.kernel.org/r/20250905100531.282980-5-oliver.upton@linux.dev
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
kdave pushed a commit that referenced this pull request Sep 19, 2025
…PAIR

A NULL pointer dereference can occur in tcp_ao_finish_connect() during a
connect() system call on a socket with a TCP-AO key added and TCP_REPAIR
enabled.

The function is called with skb being NULL and attempts to dereference it
on tcp_hdr(skb)->seq without a prior skb validation.

Fix this by checking if skb is NULL before dereferencing it.

The commentary is taken from bpf_skops_established(), which is also called
in the same flow. Unlike the function being patched,
bpf_skops_established() validates the skb before dereferencing it.

int main(void){
	struct sockaddr_in sockaddr;
	struct tcp_ao_add tcp_ao;
	int sk;
	int one = 1;

	memset(&sockaddr,'\0',sizeof(sockaddr));
	memset(&tcp_ao,'\0',sizeof(tcp_ao));

	sk = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP);

	sockaddr.sin_family = AF_INET;

	memcpy(tcp_ao.alg_name,"cmac(aes128)",12);
	memcpy(tcp_ao.key,"ABCDEFGHABCDEFGH",16);
	tcp_ao.keylen = 16;

	memcpy(&tcp_ao.addr,&sockaddr,sizeof(sockaddr));

	setsockopt(sk, IPPROTO_TCP, TCP_AO_ADD_KEY, &tcp_ao,
	sizeof(tcp_ao));
	setsockopt(sk, IPPROTO_TCP, TCP_REPAIR, &one, sizeof(one));

	sockaddr.sin_family = AF_INET;
	sockaddr.sin_port = htobe16(123);

	inet_aton("127.0.0.1", &sockaddr.sin_addr);

	connect(sk,(struct sockaddr *)&sockaddr,sizeof(sockaddr));

return 0;
}

$ gcc tcp-ao-nullptr.c -o tcp-ao-nullptr -Wall
$ unshare -Urn

BUG: kernel NULL pointer dereference, address: 00000000000000b6
PGD 1f648d067 P4D 1f648d067 PUD 1982e8067 PMD 0
Oops: Oops: 0000 [#1] SMP NOPTI
Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop
Reference Platform, BIOS 6.00 11/12/2020
RIP: 0010:tcp_ao_finish_connect (net/ipv4/tcp_ao.c:1182)

Fixes: 7c2ffaf ("net/tcp: Calculate TCP-AO traffic keys")
Signed-off-by: Anderson Nascimento <anderson@allelesecurity.com>
Reviewed-by: Dmitry Safonov <0x7f454c46@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250911230743.2551-3-anderson@allelesecurity.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
kdave pushed a commit that referenced this pull request Sep 19, 2025
When igc_led_setup() fails, igc_probe() fails and triggers kernel panic
in free_netdev() since unregister_netdev() is not called. [1]
This behavior can be tested using fault-injection framework, especially
the failslab feature. [2]

Since LED support is not mandatory, treat LED setup failures as
non-fatal and continue probe with a warning message, consequently
avoiding the kernel panic.

[1]
 kernel BUG at net/core/dev.c:12047!
 Oops: invalid opcode: 0000 [#1] SMP NOPTI
 CPU: 0 UID: 0 PID: 937 Comm: repro-igc-led-e Not tainted 6.17.0-rc4-enjuk-tnguy-00865-gc4940196ab02 torvalds#64 PREEMPT(voluntary)
 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
 RIP: 0010:free_netdev+0x278/0x2b0
 [...]
 Call Trace:
  <TASK>
  igc_probe+0x370/0x910
  local_pci_probe+0x3a/0x80
  pci_device_probe+0xd1/0x200
 [...]

[2]
 #!/bin/bash -ex

 FAILSLAB_PATH=/sys/kernel/debug/failslab/
 DEVICE=0000:00:05.0
 START_ADDR=$(grep " igc_led_setup" /proc/kallsyms \
         | awk '{printf("0x%s", $1)}')
 END_ADDR=$(printf "0x%x" $((START_ADDR + 0x100)))

 echo $START_ADDR > $FAILSLAB_PATH/require-start
 echo $END_ADDR > $FAILSLAB_PATH/require-end
 echo 1 > $FAILSLAB_PATH/times
 echo 100 > $FAILSLAB_PATH/probability
 echo N > $FAILSLAB_PATH/ignore-gfp-wait

 echo $DEVICE > /sys/bus/pci/drivers/igc/bind

Fixes: ea57870 ("igc: Add support for LEDs on i225/i226")
Signed-off-by: Kohei Enju <enjuk@amazon.com>
Reviewed-by: Paul Menzel <pmenzel@molgen.mpg.de>
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Reviewed-by: Vitaly Lifshits <vitaly.lifshits@intel.com>
Reviewed-by: Kurt Kanzenbach <kurt@linutronix.de>
Tested-by: Mor Bar-Gabay <morx.bar.gabay@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
kdave pushed a commit that referenced this pull request Sep 20, 2025
…own_locked()

When

  9770b42 ("crypto: ccp - Move dev_info/err messages for SEV/SNP init and shutdown")

moved the error messages dumping so that they don't need to be issued by
the callers, it missed the case where __sev_firmware_shutdown() calls
__sev_platform_shutdown_locked() with a NULL argument which leads to
a NULL ptr deref on the shutdown path, during suspend to disk:

  #PF: supervisor read access in kernel mode
  #PF: error_code(0x0000) - not-present page
  PGD 0 P4D 0
  Oops: Oops: 0000 [#1] SMP NOPTI
  CPU: 0 UID: 0 PID: 983 Comm: hib.sh Not tainted 6.17.0-rc4+ #1 PREEMPT(voluntary)
  Hardware name: Supermicro Super Server/H12SSL-i, BIOS 2.5 09/08/2022
  RIP: 0010:__sev_platform_shutdown_locked.cold+0x0/0x21 [ccp]

That rIP is:

  00000000000006fd <__sev_platform_shutdown_locked.cold>:
   6fd:   8b 13                   mov    (%rbx),%edx
   6ff:   48 8b 7d 00             mov    0x0(%rbp),%rdi
   703:   89 c1                   mov    %eax,%ecx

  Code: 74 05 31 ff 41 89 3f 49 8b 3e 89 ea 48 c7 c6 a0 8e 54 a0 41 bf 92 ff ff ff e8 e5 2e 09 e1 c6 05 2a d4 38 00 01 e9 26 af ff ff <8b> 13 48 8b 7d 00 89 c1 48 c7 c6 18 90 54 a0 89 44 24 04 e8 c1 2e
  RSP: 0018:ffffc90005467d00 EFLAGS: 00010282
  RAX: 00000000ffffff92 RBX: 0000000000000000 RCX: 0000000000000000
  			     ^^^^^^^^^^^^^^^^
and %rbx is nice and clean.

  Call Trace:
   <TASK>
   __sev_firmware_shutdown.isra.0
   sev_dev_destroy
   psp_dev_destroy
   sp_destroy
   pci_device_shutdown
   device_shutdown
   kernel_power_off
   hibernate.cold
   state_store
   kernfs_fop_write_iter
   vfs_write
   ksys_write
   do_syscall_64
   entry_SYSCALL_64_after_hwframe

Pass in a pointer to the function-local error var in the caller.

With that addressed, suspending the ccp shows the error properly at
least:

  ccp 0000:47:00.1: sev command 0x2 timed out, disabling PSP
  ccp 0000:47:00.1: SEV: failed to SHUTDOWN error 0x0, rc -110
  SEV-SNP: Leaking PFN range 0x146800-0x146a00
  SEV-SNP: PFN 0x146800 unassigned, dumping non-zero entries in 2M PFN region: [0x146800 - 0x146a00]
  ...
  ccp 0000:47:00.1: SEV-SNP firmware shutdown failed, rc -16, error 0x0
  ACPI: PM: Preparing to enter system sleep state S5
  kvm: exiting hardware virtualization
  reboot: Power down

Btw, this driver is crying to be cleaned up to pass in a proper I/O
struct which can be used to store information between the different
functions, otherwise stuff like that will happen in the future again.

Fixes: 9770b42 ("crypto: ccp - Move dev_info/err messages for SEV/SNP init and shutdown")
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Cc: <stable@kernel.org>
Reviewed-by: Ashish Kalra <ashish.kalra@amd.com>
Acked-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
kdave pushed a commit that referenced this pull request Sep 20, 2025
Check in snd_intel_dsp_check_soundwire() that the pointer returned by
ACPI_HANDLE() is not NULL, before passing it on to other functions.

The original code assumed a non-NULL return, but if it was unexpectedly
NULL it would end up passed to acpi_walk_namespace() as the start
point, and would result in

[    3.219028] BUG: kernel NULL pointer dereference, address:
0000000000000018
[    3.219029] #PF: supervisor read access in kernel mode
[    3.219030] #PF: error_code(0x0000) - not-present page
[    3.219031] PGD 0 P4D 0
[    3.219032] Oops: Oops: 0000 [#1] SMP NOPTI
[    3.219035] CPU: 2 UID: 0 PID: 476 Comm: (udev-worker) Tainted: G S
AW   E       6.17.0-rc5-test #1 PREEMPT(voluntary)
[    3.219038] Tainted: [S]=CPU_OUT_OF_SPEC, [A]=OVERRIDDEN_ACPI_TABLE,
[W]=WARN, [E]=UNSIGNED_MODULE
[    3.219040] RIP: 0010:acpi_ns_walk_namespace+0xb5/0x480

This problem was triggered by a bugged DSDT that the kernel couldn't parse.
But it shouldn't be possible to SEGFAULT the kernel just because of some
bugs in ACPI.

Fixes: 0650857 ("ALSA: hda: add autodetection for SoundWire")
Signed-off-by: Richard Fitzgerald <rf@opensource.cirrus.com>
Signed-off-by: Takashi Iwai <tiwai@suse.de>
kdave pushed a commit that referenced this pull request Sep 23, 2025
[BUG DURING BS > PS TEST]
When running the following script on a btrfs whose block size is larger
than page size, e.g. 8K block size and 4K page size, it will trigger a
kernel BUG:

  # mkfs.btrfs -s 8k $dev
  # mount $dev $mnt
  # mkdir $mnt/dir
  # ln -s dir $mnt/link
  # ls $mnt/link

The call trace looks like this:

  BTRFS warning (device dm-2): support for block size 8192 with page size 4096 is experimental, some features may be missing
  BTRFS info (device dm-2): checking UUID tree
  BTRFS info (device dm-2): enabling ssd optimizations
  BTRFS info (device dm-2): enabling free space tree
  ------------[ cut here ]------------
  kernel BUG at /home/adam/linux/include/linux/highmem.h:275!
  Oops: invalid opcode: 0000 [#1] SMP
  CPU: 8 UID: 0 PID: 667 Comm: ls Tainted: G           OE       6.17.0-rc4-custom+ torvalds#283 PREEMPT(full)
  Tainted: [O]=OOT_MODULE, [E]=UNSIGNED_MODULE
  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS unknown 02/02/2022
  RIP: 0010:zero_user_segments.constprop.0+0xdc/0xe0 [btrfs]
  Call Trace:
   <TASK>
   btrfs_get_extent.cold+0x85/0x101 [btrfs 7453c70c03e631c8d8bfdd4264fa62d3e238da6f]
   btrfs_do_readpage+0x244/0x750 [btrfs 7453c70c03e631c8d8bfdd4264fa62d3e238da6f]
   btrfs_read_folio+0x9c/0x100 [btrfs 7453c70c03e631c8d8bfdd4264fa62d3e238da6f]
   filemap_read_folio+0x37/0xe0
   do_read_cache_folio+0x94/0x3e0
   __page_get_link.isra.0+0x20/0x90
   page_get_link+0x16/0x40
   step_into+0x69b/0x830
   path_lookupat+0xa7/0x170
   filename_lookup+0xf7/0x200
   ? set_ptes.isra.0+0x36/0x70
   vfs_statx+0x7a/0x160
   do_statx+0x63/0xa0
   __x64_sys_statx+0x90/0xe0
   do_syscall_64+0x82/0xae0
   entry_SYSCALL_64_after_hwframe+0x4b/0x53
   </TASK>

Please note bs > ps support is still under development and the
enablement patch is not even in btrfs development branch.

[CAUSE]
Btrfs reuses its data folio read path to handle symbolic links, as the
symbolic link target is stored as an inline data extent.

But for newly created inodes, btrfs only set the minimal order if the
target inode is a regular file.

Thus for above newly created symbolic link, it doesn't properly respect
the minimal folio order, and triggered the above crash.

[FIX]
Call btrfs_set_inode_mapping_order() unconditionally inside
btrfs_create_new_inode().

For symbolic links this will fix the crash as now the folio will meet
the minimal order.

For regular files this brings no change.

For directory/bdev/char and all the other types of inodes, they won't
go through the data read path, thus no effect either.

Fixes: cc38d17 ("btrfs: enable large data folio support under CONFIG_BTRFS_EXPERIMENTAL")
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
kdave pushed a commit that referenced this pull request Sep 23, 2025
[BUG DURING BS > PS TEST]
When running the following script on a btrfs whose block size is larger
than page size, e.g. 8K block size and 4K page size, it will trigger a
kernel BUG:

  # mkfs.btrfs -s 8k $dev
  # mount $dev $mnt
  # mkdir $mnt/dir
  # ln -s dir $mnt/link
  # ls $mnt/link

The call trace looks like this:

  BTRFS warning (device dm-2): support for block size 8192 with page size 4096 is experimental, some features may be missing
  BTRFS info (device dm-2): checking UUID tree
  BTRFS info (device dm-2): enabling ssd optimizations
  BTRFS info (device dm-2): enabling free space tree
  ------------[ cut here ]------------
  kernel BUG at /home/adam/linux/include/linux/highmem.h:275!
  Oops: invalid opcode: 0000 [#1] SMP
  CPU: 8 UID: 0 PID: 667 Comm: ls Tainted: G           OE       6.17.0-rc4-custom+ torvalds#283 PREEMPT(full)
  Tainted: [O]=OOT_MODULE, [E]=UNSIGNED_MODULE
  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS unknown 02/02/2022
  RIP: 0010:zero_user_segments.constprop.0+0xdc/0xe0 [btrfs]
  Call Trace:
   <TASK>
   btrfs_get_extent.cold+0x85/0x101 [btrfs 7453c70c03e631c8d8bfdd4264fa62d3e238da6f]
   btrfs_do_readpage+0x244/0x750 [btrfs 7453c70c03e631c8d8bfdd4264fa62d3e238da6f]
   btrfs_read_folio+0x9c/0x100 [btrfs 7453c70c03e631c8d8bfdd4264fa62d3e238da6f]
   filemap_read_folio+0x37/0xe0
   do_read_cache_folio+0x94/0x3e0
   __page_get_link.isra.0+0x20/0x90
   page_get_link+0x16/0x40
   step_into+0x69b/0x830
   path_lookupat+0xa7/0x170
   filename_lookup+0xf7/0x200
   ? set_ptes.isra.0+0x36/0x70
   vfs_statx+0x7a/0x160
   do_statx+0x63/0xa0
   __x64_sys_statx+0x90/0xe0
   do_syscall_64+0x82/0xae0
   entry_SYSCALL_64_after_hwframe+0x4b/0x53
   </TASK>

Please note bs > ps support is still under development and the
enablement patch is not even in btrfs development branch.

[CAUSE]
Btrfs reuses its data folio read path to handle symbolic links, as the
symbolic link target is stored as an inline data extent.

But for newly created inodes, btrfs only set the minimal order if the
target inode is a regular file.

Thus for above newly created symbolic link, it doesn't properly respect
the minimal folio order, and triggered the above crash.

[FIX]
Call btrfs_set_inode_mapping_order() unconditionally inside
btrfs_create_new_inode().

For symbolic links this will fix the crash as now the folio will meet
the minimal order.

For regular files this brings no change.

For directory/bdev/char and all the other types of inodes, they won't
go through the data read path, thus no effect either.

Fixes: cc38d17 ("btrfs: enable large data folio support under CONFIG_BTRFS_EXPERIMENTAL")
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
kdave pushed a commit that referenced this pull request Sep 25, 2025
The kernel forbids the creation of non-FDB nexthop groups with FDB
nexthops:

 # ip nexthop add id 1 via 192.0.2.1 fdb
 # ip nexthop add id 2 group 1
 Error: Non FDB nexthop group cannot have fdb nexthops.

And vice versa:

 # ip nexthop add id 3 via 192.0.2.2 dev dummy1
 # ip nexthop add id 4 group 3 fdb
 Error: FDB nexthop group can only have fdb nexthops.

However, as long as no routes are pointing to a non-FDB nexthop group,
the kernel allows changing the type of a nexthop from FDB to non-FDB and
vice versa:

 # ip nexthop add id 5 via 192.0.2.2 dev dummy1
 # ip nexthop add id 6 group 5
 # ip nexthop replace id 5 via 192.0.2.2 fdb
 # echo $?
 0

This configuration is invalid and can result in a NPD [1] since FDB
nexthops are not associated with a nexthop device:

 # ip route add 198.51.100.1/32 nhid 6
 # ping 198.51.100.1

Fix by preventing nexthop FDB status change while the nexthop is in a
group:

 # ip nexthop add id 7 via 192.0.2.2 dev dummy1
 # ip nexthop add id 8 group 7
 # ip nexthop replace id 7 via 192.0.2.2 fdb
 Error: Cannot change nexthop FDB status while in a group.

[1]
BUG: kernel NULL pointer dereference, address: 00000000000003c0
[...]
Oops: Oops: 0000 [#1] SMP
CPU: 6 UID: 0 PID: 367 Comm: ping Not tainted 6.17.0-rc6-virtme-gb65678cacc03 #1 PREEMPT(voluntary)
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.17.0-4.fc41 04/01/2014
RIP: 0010:fib_lookup_good_nhc+0x1e/0x80
[...]
Call Trace:
 <TASK>
 fib_table_lookup+0x541/0x650
 ip_route_output_key_hash_rcu+0x2ea/0x970
 ip_route_output_key_hash+0x55/0x80
 __ip4_datagram_connect+0x250/0x330
 udp_connect+0x2b/0x60
 __sys_connect+0x9c/0xd0
 __x64_sys_connect+0x18/0x20
 do_syscall_64+0xa4/0x2a0
 entry_SYSCALL_64_after_hwframe+0x4b/0x53

Fixes: 38428d6 ("nexthop: support for fdb ecmp nexthops")
Reported-by: syzbot+6596516dd2b635ba2350@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/netdev/68c9a4d2.050a0220.3c6139.0e63.GAE@google.com/
Tested-by: syzbot+6596516dd2b635ba2350@syzkaller.appspotmail.com
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/20250921150824.149157-2-idosch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
kdave pushed a commit that referenced this pull request Sep 25, 2025
Ido Schimmel says:

====================
nexthop: Various fixes

Patch #1 fixes a NPD that was recently reported by syzbot.

Patch #2 fixes an issue in the existing FIB nexthop selftest.

Patch #3 extends the selftest with test cases for the bug that was fixed
in the first patch.
====================

Link: https://patch.msgid.link/20250921150824.149157-1-idosch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
kdave pushed a commit that referenced this pull request Sep 29, 2025
Add 0x29 as the accelerometer address for the Dell Latitude E6530 to
lis3lv02d_devices[].

The address was verified as below:

    $ cd /sys/bus/pci/drivers/i801_smbus/0000:00:1f.3
    $ ls -d i2c-*
    i2c-20
    $ sudo modprobe i2c-dev
    $ sudo i2cdetect 20
    WARNING! This program can confuse your I2C bus, cause data loss and worse!
    I will probe file /dev/i2c-20.
    I will probe address range 0x08-0x77.
    Continue? [Y/n] Y
         0  1  2  3  4  5  6  7  8  9  a  b  c  d  e  f
    00:                         08 -- -- -- -- -- -- --
    10: -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --
    20: -- -- -- -- -- -- -- -- -- UU -- 2b -- -- -- --
    30: -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --
    40: -- -- -- -- 44 -- -- -- -- -- -- -- -- -- -- --
    50: UU -- 52 -- -- -- -- -- -- -- -- -- -- -- -- --
    60: -- 61 -- -- -- -- -- -- -- -- -- -- -- -- -- --
    70: -- -- -- -- -- -- -- --
    $ cat /proc/cmdline
    BOOT_IMAGE=/vmlinuz-linux-cachyos-bore root=UUID=<redacted> rw loglevel=3 quiet dell_lis3lv02d.probe_i2c_addr=1
    $ sudo dmesg
    [    0.000000] Linux version 6.16.6-2-cachyos-bore (linux-cachyos-bore@cachyos) (gcc (GCC) 15.2.1 20250813, GNU ld (GNU Binutils) 2.45.0) #1 SMP PREEMPT_DYNAMIC Thu, 11 Sep 2025 16:01:12 +0000
    […]
    [    0.000000] DMI: Dell Inc. Latitude E6530/07Y85M, BIOS A22 11/30/2018
    […]
    [    5.166442] i2c i2c-20: Probing for lis3lv02d on address 0x29
    [    5.167854] i2c i2c-20: Detected lis3lv02d on address 0x29, please report this upstream to platform-driver-x86@vger.kernel.org so that a quirk can be added

Signed-off-by: Nickolay Goppen <setotau@mainlining.org>
Reviewed-by: Hans de Goede <hansg@kernel.org>
Link: https://patch.msgid.link/20250917-dell-lis3lv02d-latitude-e6530-v1-1-8a6dec4e51e9@mainlining.org
Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
kdave pushed a commit that referenced this pull request Sep 29, 2025
Running sha224_kunit on a KMSAN-enabled kernel results in a crash in
kmsan_internal_set_shadow_origin():

    BUG: unable to handle page fault for address: ffffbc3840291000
    #PF: supervisor read access in kernel mode
    #PF: error_code(0x0000) - not-present page
    PGD 1810067 P4D 1810067 PUD 192d067 PMD 3c17067 PTE 0
    Oops: 0000 [#1] SMP NOPTI
    CPU: 0 UID: 0 PID: 81 Comm: kunit_try_catch Tainted: G                 N  6.17.0-rc3 #10 PREEMPT(voluntary)
    Tainted: [N]=TEST
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.17.0-0-gb52ca86e094d-prebuilt.qemu.org 04/01/2014
    RIP: 0010:kmsan_internal_set_shadow_origin+0x91/0x100
    [...]
    Call Trace:
    <TASK>
    __msan_memset+0xee/0x1a0
    sha224_final+0x9e/0x350
    test_hash_buffer_overruns+0x46f/0x5f0
    ? kmsan_get_shadow_origin_ptr+0x46/0xa0
    ? __pfx_test_hash_buffer_overruns+0x10/0x10
    kunit_try_run_case+0x198/0xa00

This occurs when memset() is called on a buffer that is not 4-byte aligned
and extends to the end of a guard page, i.e.  the next page is unmapped.

The bug is that the loop at the end of kmsan_internal_set_shadow_origin()
accesses the wrong shadow memory bytes when the address is not 4-byte
aligned.  Since each 4 bytes are associated with an origin, it rounds the
address and size so that it can access all the origins that contain the
buffer.  However, when it checks the corresponding shadow bytes for a
particular origin, it incorrectly uses the original unrounded shadow
address.  This results in reads from shadow memory beyond the end of the
buffer's shadow memory, which crashes when that memory is not mapped.

To fix this, correctly align the shadow address before accessing the 4
shadow bytes corresponding to each origin.

Link: https://lkml.kernel.org/r/20250911195858.394235-1-ebiggers@kernel.org
Fixes: 2ef3cec ("kmsan: do not wipe out origin when doing partial unpoisoning")
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
Tested-by: Alexander Potapenko <glider@google.com>
Reviewed-by: Alexander Potapenko <glider@google.com>
Cc: Dmitriy Vyukov <dvyukov@google.com>
Cc: Marco Elver <elver@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
kdave pushed a commit that referenced this pull request Sep 29, 2025
When the PAGEMAP_SCAN ioctl is invoked with vec_len = 0 reaches
pagemap_scan_backout_range(), kernel panics with null-ptr-deref:

[   44.936808] Oops: general protection fault, probably for non-canonical address 0xdffffc0000000000: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN NOPTI
[   44.937797] KASAN: null-ptr-deref in range [0x0000000000000000-0x0000000000000007]
[   44.938391] CPU: 1 UID: 0 PID: 2480 Comm: reproducer Not tainted 6.17.0-rc6 torvalds#22 PREEMPT(none)
[   44.939062] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
[   44.939935] RIP: 0010:pagemap_scan_thp_entry.isra.0+0x741/0xa80

<snip registers, unreliable trace>

[   44.946828] Call Trace:
[   44.947030]  <TASK>
[   44.949219]  pagemap_scan_pmd_entry+0xec/0xfa0
[   44.952593]  walk_pmd_range.isra.0+0x302/0x910
[   44.954069]  walk_pud_range.isra.0+0x419/0x790
[   44.954427]  walk_p4d_range+0x41e/0x620
[   44.954743]  walk_pgd_range+0x31e/0x630
[   44.955057]  __walk_page_range+0x160/0x670
[   44.956883]  walk_page_range_mm+0x408/0x980
[   44.958677]  walk_page_range+0x66/0x90
[   44.958984]  do_pagemap_scan+0x28d/0x9c0
[   44.961833]  do_pagemap_cmd+0x59/0x80
[   44.962484]  __x64_sys_ioctl+0x18d/0x210
[   44.962804]  do_syscall_64+0x5b/0x290
[   44.963111]  entry_SYSCALL_64_after_hwframe+0x76/0x7e

vec_len = 0 in pagemap_scan_init_bounce_buffer() means no buffers are
allocated and p->vec_buf remains set to NULL.

This breaks an assumption made later in pagemap_scan_backout_range(), that
page_region is always allocated for p->vec_buf_index.

Fix it by explicitly checking p->vec_buf for NULL before dereferencing.

Other sites that might run into same deref-issue are already (directly or
transitively) protected by checking p->vec_buf.

Note:
From PAGEMAP_SCAN man page, it seems vec_len = 0 is valid when no output
is requested and it's only the side effects caller is interested in,
hence it passes check in pagemap_scan_get_args().

This issue was found by syzkaller.

Link: https://lkml.kernel.org/r/20250922082206.6889-1-acsjakub@amazon.de
Fixes: 52526ca ("fs/proc/task_mmu: implement IOCTL to get and optionally clear info about PTEs")
Signed-off-by: Jakub Acs <acsjakub@amazon.de>
Reviewed-by: Muhammad Usama Anjum <usama.anjum@collabora.com>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Jinjiang Tu <tujinjiang@huawei.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Penglei Jiang <superman.xpt@gmail.com>
Cc: Mark Brown <broonie@kernel.org>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Andrei Vagin <avagin@gmail.com>
Cc: "Michał Mirosław" <mirq-linux@rere.qmqm.pl>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
kdave pushed a commit that referenced this pull request Sep 30, 2025
The generic/736 xfstest fails for HFS case:

BEGIN TEST default (1 test): hfs Mon May 5 03:18:32 UTC 2025
DEVICE: /dev/vdb
HFS_MKFS_OPTIONS:
MOUNT_OPTIONS: MOUNT_OPTIONS
FSTYP -- hfs
PLATFORM -- Linux/x86_64 kvm-xfstests 6.15.0-rc4-xfstests-g00b827f0cffa #1 SMP PREEMPT_DYNAMIC Fri May 25
MKFS_OPTIONS -- /dev/vdc
MOUNT_OPTIONS -- /dev/vdc /vdc

generic/736 [03:18:33][ 3.510255] run fstests generic/736 at 2025-05-05 03:18:33
_check_generic_filesystem: filesystem on /dev/vdb is inconsistent
(see /results/hfs/results-default/generic/736.full for details)
Ran: generic/736
Failures: generic/736
Failed 1 of 1 tests

The HFS volume becomes corrupted after the test run:

sudo fsck.hfs -d /dev/loop50
** /dev/loop50
Using cacheBlockSize=32K cacheTotalBlock=1024 cacheSize=32768K.
Executing fsck_hfs (version 540.1-Linux).
** Checking HFS volume.
The volume name is untitled
** Checking extents overflow file.
** Checking catalog file.
** Checking catalog hierarchy.
** Checking volume bitmap.
** Checking volume information.
invalid MDB drNxtCNID
Master Directory Block needs minor repair
(1, 0)
Verify Status: VIStat = 0x8000, ABTStat = 0x0000 EBTStat = 0x0000
CBTStat = 0x0000 CatStat = 0x00000000
** Repairing volume.
** Rechecking volume.
** Checking HFS volume.
The volume name is untitled
** Checking extents overflow file.
** Checking catalog file.
** Checking catalog hierarchy.
** Checking volume bitmap.
** Checking volume information.
** The volume untitled was repaired successfully.

The main reason of the issue is the absence of logic that
corrects mdb->drNxtCNID/HFS_SB(sb)->next_id (next unused
CNID) after deleting a record in Catalog File. This patch
introduces a hfs_correct_next_unused_CNID() method that
implements the necessary logic. In the case of Catalog File's
record delete operation, the function logic checks that
(deleted_CNID + 1) == next_unused_CNID and it finds/sets the new
value of next_unused_CNID.

sudo ./check generic/736
FSTYP -- hfs
PLATFORM -- Linux/x86_64 hfsplus-testing-0001 6.15.0+ #6 SMP PREEMPT_DYNAMIC Tue Jun 10 15:02:48 PDT 2025
MKFS_OPTIONS -- /dev/loop51
MOUNT_OPTIONS -- /dev/loop51 /mnt/scratch

generic/736 33s
Ran: generic/736
Passed all 1 tests

sudo fsck.hfs -d /dev/loop50
** /dev/loop50
Using cacheBlockSize=32K cacheTotalBlock=1024 cacheSize=32768K.
Executing fsck_hfs (version 540.1-Linux).
** Checking HFS volume.
The volume name is untitled
** Checking extents overflow file.
** Checking catalog file.
** Checking catalog hierarchy.
** Checking volume bitmap.
** Checking volume information.
** The volume untitled appears to be OK

Signed-off-by: Viacheslav Dubeyko <slava@dubeyko.com>
Link: https://lore.kernel.org/r/20250610231609.551930-1-slava@dubeyko.com
Signed-off-by: Viacheslav Dubeyko <slava@dubeyko.com>
kdave pushed a commit that referenced this pull request Sep 30, 2025
BUG: KASAN: slab-out-of-bounds in hfsplus_uni2asc+0xa71/0xb90 fs/hfsplus/unicode.c:186
Read of size 2 at addr ffff8880289ef218 by task syz.6.248/14290

CPU: 0 UID: 0 PID: 14290 Comm: syz.6.248 Not tainted 6.16.4 #1 PREEMPT(full)
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
Call Trace:
 <TASK>
 __dump_stack lib/dump_stack.c:94 [inline]
 dump_stack_lvl+0x116/0x1b0 lib/dump_stack.c:120
 print_address_description mm/kasan/report.c:378 [inline]
 print_report+0xca/0x5f0 mm/kasan/report.c:482
 kasan_report+0xca/0x100 mm/kasan/report.c:595
 hfsplus_uni2asc+0xa71/0xb90 fs/hfsplus/unicode.c:186
 hfsplus_listxattr+0x5b6/0xbd0 fs/hfsplus/xattr.c:738
 vfs_listxattr+0xbe/0x140 fs/xattr.c:493
 listxattr+0xee/0x190 fs/xattr.c:924
 filename_listxattr fs/xattr.c:958 [inline]
 path_listxattrat+0x143/0x360 fs/xattr.c:988
 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
 do_syscall_64+0xcb/0x4c0 arch/x86/entry/syscall_64.c:94
 entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7fe0e9fae16d
Code: 02 b8 ff ff ff ff c3 66 0f 1f 44 00 00 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 a8 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007fe0eae67f98 EFLAGS: 00000246 ORIG_RAX: 00000000000000c3
RAX: ffffffffffffffda RBX: 00007fe0ea205fa0 RCX: 00007fe0e9fae16d
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000200000000000
RBP: 00007fe0ea0480f0 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
R13: 00007fe0ea206038 R14: 00007fe0ea205fa0 R15: 00007fe0eae48000
 </TASK>

Allocated by task 14290:
 kasan_save_stack+0x24/0x50 mm/kasan/common.c:47
 kasan_save_track+0x14/0x30 mm/kasan/common.c:68
 poison_kmalloc_redzone mm/kasan/common.c:377 [inline]
 __kasan_kmalloc+0xaa/0xb0 mm/kasan/common.c:394
 kasan_kmalloc include/linux/kasan.h:260 [inline]
 __do_kmalloc_node mm/slub.c:4333 [inline]
 __kmalloc_noprof+0x219/0x540 mm/slub.c:4345
 kmalloc_noprof include/linux/slab.h:909 [inline]
 hfsplus_find_init+0x95/0x1f0 fs/hfsplus/bfind.c:21
 hfsplus_listxattr+0x331/0xbd0 fs/hfsplus/xattr.c:697
 vfs_listxattr+0xbe/0x140 fs/xattr.c:493
 listxattr+0xee/0x190 fs/xattr.c:924
 filename_listxattr fs/xattr.c:958 [inline]
 path_listxattrat+0x143/0x360 fs/xattr.c:988
 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
 do_syscall_64+0xcb/0x4c0 arch/x86/entry/syscall_64.c:94
 entry_SYSCALL_64_after_hwframe+0x77/0x7f

When hfsplus_uni2asc is called from hfsplus_listxattr,
it actually passes in a struct hfsplus_attr_unistr*.
The size of the corresponding structure is different from that of hfsplus_unistr,
so the previous fix (9445878) is insufficient.
The pointer on the unicode buffer is still going beyond the allocated memory.

This patch introduces two warpper functions hfsplus_uni2asc_xattr_str and
hfsplus_uni2asc_str to process two unicode buffers,
struct hfsplus_attr_unistr* and struct hfsplus_unistr* respectively.
When ustrlen value is bigger than the allocated memory size,
the ustrlen value is limited to an safe size.

Fixes: 9445878 ("hfsplus: fix slab-out-of-bounds read in hfsplus_uni2asc()")
Signed-off-by: Kang Chen <k.chen@smail.nju.edu.cn>
Reviewed-by: Viacheslav Dubeyko <slava@dubeyko.com>
Signed-off-by: Viacheslav Dubeyko <slava@dubeyko.com>
Link: https://lore.kernel.org/r/20250909031316.1647094-1-k.chen@smail.nju.edu.cn
Signed-off-by: Viacheslav Dubeyko <slava@dubeyko.com>
kdave pushed a commit that referenced this pull request Sep 30, 2025
Our goal is to move towards enabling vmalloc-huge by default on arm64 so
as to reduce TLB pressure. Therefore, we need a way to analyze the portion
of block mappings in vmalloc space we can get on a production system; this
can be done through ptdump, but currently we disable vmalloc-huge if
CONFIG_PTDUMP_DEBUGFS is on. The reason is that lazy freeing of kernel
pagetables via vmap_try_huge_pxd() may race with ptdump, so ptdump
may dereference a bogus address.

To solve this, we need to synchronize ptdump_walk() and ptdump_check_wx()
with pud_free_pmd_page() and pmd_free_pte_page().

Since this race is very unlikely to happen in practice, we do not want to
penalize the vmalloc pagetable tearing path by taking the init_mm
mmap_lock. Therefore, we use static keys. ptdump_walk() and
ptdump_check_wx() are the pagetable walkers; they will enable the static
key - upon observing that, the vmalloc pagetable tearing path will get
patched in with an mmap_read_lock/unlock sequence. A combination of the
patched-in mmap_read_lock/unlock, the acquire semantics of
static_branch_inc(), and the barriers in __flush_tlb_kernel_pgtable()
ensures that ptdump will never get a hold on the address of a freed PMD
or PTE table.

We can verify the correctness of the algorithm via the following litmus
test (thanks to James Houghton and Will Deacon):

AArch64 ptdump
Variant=Ifetch
{
uint64_t pud=0xa110c;
uint64_t pmd;

0:X0=label:"P1:L0"; 0:X1=instr:"NOP"; 0:X2=lock; 0:X3=pud; 0:X4=pmd;
                    1:X1=0xdead;      1:X2=lock; 1:X3=pud; 1:X4=pmd;
}
 P0				| P1				;
 (* static_key_enable *)	| (* pud_free_pmd_page *)	;
 STR	W1, [X0]		| LDR	X9, [X3]		;
 DC	CVAU,X0			| STR	XZR, [X3]		;
 DSB	ISH			| DSB	ISH			;
 IC	IVAU,X0			| ISB				;
 DSB	ISH			|				;
 ISB				| (* static key *)		;
				| L0:				;
 (* mmap_lock *)		| B	out1			;
 Lwlock:			|				;
 MOV	W7, #1			| (* mmap_lock *)		;
 SWPA	W7, W8, [X2]		| Lrlock:			;
				| MOV	W7, #1			;
				| SWPA	W7, W8, [X2]		;
 (* walk pgtable *)		|				;
 LDR	X9, [X3]		| (* mmap_unlock *)		;
 CBZ	X9, out0		| STLR	WZR, [X2]		;
 EOR	X10, X9, X9		|				;
 LDR	X11, [X4, X10]		| out1:				;
				| EOR	X10, X9, X9		;
 out0:				| STR	X1, [X4, X10]		;

exists (0:X8=0 /\ 1:X8=0 /\	(* Lock acquisitions succeed *)
	0:X9=0xa110c /\		(* P0 sees the valid PUD ...*)
	0:X11=0xdead)		(* ... but the freed PMD *)

For an approximate written proof of why this algorithm works, please read
the code comment in [1], which is now removed for the sake of simplicity.

mm-selftests pass. No issues were observed while parallelly running
test_vmalloc.sh (which stresses the vmalloc subsystem),
and cat /sys/kernel/debug/{kernel_page_tables, check_wx_pages} in a loop.

Link: https://lore.kernel.org/all/20250723161827.15802-1-dev.jain@arm.com/ [1]
Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
Signed-off-by: Dev Jain <dev.jain@arm.com>
Signed-off-by: Will Deacon <will@kernel.org>
kdave pushed a commit that referenced this pull request Oct 1, 2025
The ns_bpf_qdisc selftest triggers a kernel panic:

    Unable to handle kernel paging request at virtual address ffffffffa38dbf58
    Current test_progs pgtable: 4K pagesize, 57-bit VAs, pgdp=0x00000001109cc000
    [ffffffffa38dbf58] pgd=000000011fffd801, p4d=000000011fffd401, pud=000000011fffd001, pmd=0000000000000000
    Oops [#1]
    Modules linked in: bpf_testmod(OE) xt_conntrack nls_iso8859_1 [...] [last unloaded: bpf_testmod(OE)]
    CPU: 1 UID: 0 PID: 23584 Comm: test_progs Tainted: G        W  OE       6.17.0-rc1-g2465bb83e0b4 #1 NONE
    Tainted: [W]=WARN, [O]=OOT_MODULE, [E]=UNSIGNED_MODULE
    Hardware name: Unknown Unknown Product/Unknown Product, BIOS 2024.01+dfsg-1ubuntu5.1 01/01/2024
    epc : __qdisc_run+0x82/0x6f0
     ra : __qdisc_run+0x6e/0x6f0
    epc : ffffffff80bd5c7a ra : ffffffff80bd5c66 sp : ff2000000eecb550
     gp : ffffffff82472098 tp : ff60000096895940 t0 : ffffffff8001f180
     t1 : ffffffff801e1664 t2 : 0000000000000000 s0 : ff2000000eecb5d0
     s1 : ff60000093a6a600 a0 : ffffffffa38dbee8 a1 : 0000000000000001
     a2 : ff2000000eecb510 a3 : 0000000000000001 a4 : 0000000000000000
     a5 : 0000000000000010 a6 : 0000000000000000 a7 : 0000000000735049
     s2 : ffffffffa38dbee8 s3 : 0000000000000040 s4 : ff6000008bcda000
     s5 : 0000000000000008 s6 : ff60000093a6a680 s7 : ff60000093a6a6f0
     s8 : ff60000093a6a6ac s9 : ff60000093140000 s10: 0000000000000000
     s11: ff2000000eecb9d0 t3 : 0000000000000000 t4 : 0000000000ff0000
     t5 : 0000000000000000 t6 : ff60000093a6a8b6
    status: 0000000200000120 badaddr: ffffffffa38dbf58 cause: 000000000000000d
    [<ffffffff80bd5c7a>] __qdisc_run+0x82/0x6f0
    [<ffffffff80b6fe58>] __dev_queue_xmit+0x4c0/0x1128
    [<ffffffff80b80ae0>] neigh_resolve_output+0xd0/0x170
    [<ffffffff80d2daf6>] ip6_finish_output2+0x226/0x6c8
    [<ffffffff80d31254>] ip6_finish_output+0x10c/0x2a0
    [<ffffffff80d31446>] ip6_output+0x5e/0x178
    [<ffffffff80d2e232>] ip6_xmit+0x29a/0x608
    [<ffffffff80d6f4c6>] inet6_csk_xmit+0xe6/0x140
    [<ffffffff80c985e4>] __tcp_transmit_skb+0x45c/0xaa8
    [<ffffffff80c995fe>] tcp_connect+0x9ce/0xd10
    [<ffffffff80d66524>] tcp_v6_connect+0x4ac/0x5e8
    [<ffffffff80cc19b8>] __inet_stream_connect+0xd8/0x318
    [<ffffffff80cc1c36>] inet_stream_connect+0x3e/0x68
    [<ffffffff80b42b20>] __sys_connect_file+0x50/0x88
    [<ffffffff80b42bee>] __sys_connect+0x96/0xc8
    [<ffffffff80b42c40>] __riscv_sys_connect+0x20/0x30
    [<ffffffff80e5bcae>] do_trap_ecall_u+0x256/0x378
    [<ffffffff80e69af2>] handle_exception+0x14a/0x156
    Code: 892a 0363 1205 489c 8bc1 c7e5 2d03 084a 2703 080a (2783) 0709
    ---[ end trace 0000000000000000 ]---

The bpf_fifo_dequeue prog returns a skb which is a pointer. The pointer
is treated as a 32bit value and sign extend to 64bit in epilogue. This
behavior is right for most bpf prog types but wrong for struct ops which
requires RISC-V ABI.

So let's sign extend struct ops return values according to the function
model and RISC-V ABI([0]).

  [0]: https://riscv.org/wp-content/uploads/2024/12/riscv-calling.pdf

Fixes: 25ad106 ("riscv, bpf: Adapt bpf trampoline to optimized riscv ftrace framework")
Signed-off-by: Hengqi Chen <hengqi.chen@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Tested-by: Pu Lehui <pulehui@huawei.com>
Reviewed-by: Pu Lehui <pulehui@huawei.com>
Link: https://lore.kernel.org/bpf/20250908012448.1695-1-hengqi.chen@gmail.com
kdave pushed a commit that referenced this pull request Oct 1, 2025
…tack-analysis'

Eduard Zingerman says:

====================
bpf: replace path-sensitive with path-insensitive live stack analysis

Consider the following program, assuming checkpoint is created for a
state at instruction (3):

  1: call bpf_get_prandom_u32()
  2: *(u64 *)(r10 - 8) = 42
  -- checkpoint #1 --
  3: if r0 != 0 goto +1
  4: exit;
  5: r0 = *(u64 *)(r10 - 8)
  6: exit

The verifier processes this program by exploring two paths:
 - 1 -> 2 -> 3 -> 4
 - 1 -> 2 -> 3 -> 5 -> 6

When instruction (5) is processed, the current liveness tracking
mechanism moves up the register parent links and records a "read" mark
for stack slot -8 at checkpoint #1, stopping because of the "write"
mark recorded at instruction (2).

This patch set replaces the existing liveness tracking mechanism with
a path-insensitive data flow analysis. The program above is processed
as follows:
 - a data structure representing live stack slots for
   instructions 1-6 in frame #0 is allocated;
 - when instruction (2) is processed, record that slot -8 is written at
   instruction (2) in frame #0;
 - when instruction (5) is processed, record that slot -8 is read at
   instruction (5) in frame #0;
 - when instruction (6) is processed, propagate read mark for slot -8
   up the control flow graph to instructions 3 and 2.

The key difference is that the new mechanism operates on a control
flow graph and associates read and write marks with pairs of (call
chain, instruction index). In contrast, the old mechanism operates on
verifier states and register parent links, associating read and write
marks with verifier states.

Motivation
==========

As it stands, this patch set makes liveness tracking slightly less
precise, as it no longer distinguishes individual program paths taken
by the verifier during symbolic execution.
See the "Impact on verification performance" section for details.

However, this change is intended as a stepping stone toward the
following goals:
 - Short term, integrate precision tracking into liveness analysis and
   remove the following code:
   - verifier backedge states accumulation in is_state_visited();
   - most of the logic for precision tracking;
   - jump history tracking.
 - Long term, help with more efficient loop verification handling.

 Why integrating precision tracking?
 -----------------------------------

In a sense, precision tracking is very similar to liveness tracking.
The data flow equations for liveness tracking look as follows:

  live_after =
    U [state[s].live_before for s in insn_successors(i)]

  state[i].live_before =
    (live_after / state[i].must_write) U state[i].may_read

While data flow equations for precision tracking look as follows:

  precise_after =
    U [state[s].precise_before for s in insn_successors(i)]

  // if some of the instruction outputs are precise,
  // assume its inputs to be precise
  induced_precise =
    ⎧ state[i].may_read   if (state[i].may_write ∩ precise_after) ≠ ∅
    ⎨
    ⎩ ∅                   otherwise

  state[i].precise_before =
    (precise_after / state[i].must_write) ∩ induced_precise

Where:
 - `may_read` set represents a union of all possibly read slots
   (any slot in `may_read` set might be by the instruction);
 - `must_write` set represents an intersection of all possibly written slots
   (any slot in `must_write` set is guaranteed to be written by the instruction).
 - `may_write` set represents a union of all possibly written slots
   (any slot in `may_write` set might be written by the instruction).

This means that precision tracking can be implemented as a logical
extension of liveness tracking:
 - track registers as well as stack slots;
 - add bit masks to represent `precise_before` and `may_write`;
 - add above equations for `precise_before` computation;
 - (linked registers require some additional consideration).

Such extension would allow removal of:
 - precision propagation logic in verifier.c:
   - backtrack_insn()
   - mark_chain_precision()
   - propagate_{precision,backedges}()
 - push_jmp_history() and related data structures, which are only used
   by precision tracking;
 - add_scc_backedge() and related backedge state accumulation in
   is_state_visited(), superseded by per-callchain function state
   accumulated by liveness analysis.

The hope here is that unifying liveness and precision tracking will
reduce overall amount of code and make it easier to reason about.

 How this helps with loops?
 --------------------------

As it stands, this patch set shares the same deficiency as the current
liveness tracking mechanism. Liveness marks on stack slots cannot be
used to prune states when processing iterator-based loops:
 - such states still have branches to be explored;
 - meaning that not all stack slot reads have been discovered.

For example:

  1: while(iter_next()) {
  2:   if (...)
  3:     r0 = *(u64 *)(r10 - 8)
  4:   if (...)
  5:     r0 = *(u64 *)(r10 - 16)
  6:   ...
  7: }

For any checkpoint state created at instruction (1), it is only
possible to rely on read marks for slots fp[-8] and fp[-16] once all
child states of (1) have been explored. Thus, when the verifier
transitions from (7) to (1), it cannot rely on read marks.

However, sacrificing path-sensitivity makes it possible to run
analysis defined in this patch set before main verification pass,
if estimates for value ranges are available.
E.g. for the following program:

  1: while(iter_next()) {
  2:   r0 = r10
  3:   r0 += r2
  4:   r0 = *(u64 *)(r2 + 0)
  5:   ...
  6: }

If an estimate for `r2` range is available before the main
verification pass, it can be used to populate read marks at
instruction (4) and run the liveness analysis. Thus making
conservative liveness information available during loops verification.

Such estimates can be provided by some form of value range analysis.
Value range analysis is also necessary to address loop verification
from another angle: computing boundaries for loop induction variables
and iteration counts.

The hope here is that the new liveness tracking mechanism will support
the broader goal of making loop verification more efficient.

Validation
==========

The change was tested on three program sets:
 - bpf selftests
 - sched_ext
 - Meta's internal set of programs

Commit [#8] enables a special mode where both the current and new
liveness analyses are enabled simultaneously. This mode signals an
error if the new algorithm considers a stack slot dead while the
current algorithm assumes it is alive. This mode was very useful for
debugging. At the time of posting, no such errors have been reported
for the above program sets.

[#8] "bpf: signal error if old liveness is more conservative than new"

Impact on memory consumption
============================

Debug patch [1] extends the kernel and veristat to count the amount of
memory allocated for storing analysis data. This patch is not included
in the submission. The maximal observed impact for the above program
sets is 2.6Mb.

Data below is shown in bytes.

For bpf selftests top 5 consumers look as follows:

  File                     Program           liveness mem
  -----------------------  ----------------  ------------
  pyperf180.bpf.o          on_event               2629740
  pyperf600.bpf.o          on_event               2287662
  pyperf100.bpf.o          on_event               1427022
  test_verif_scale3.bpf.o  balancer_ingress       1121283
  pyperf_subprogs.bpf.o    on_event                756900

For sched_ext top 5 consumers loog as follows:

  File       Program                          liveness mem
  ---------  -------------------------------  ------------
  bpf.bpf.o  lavd_enqueue                           164686
  bpf.bpf.o  lavd_select_cpu                        157393
  bpf.bpf.o  layered_enqueue                        154817
  bpf.bpf.o  lavd_init                              127865
  bpf.bpf.o  layered_dispatch                       110129

For Meta's internal set of programs top consumer is 1Mb.

[1] kernel-patches/bpf@085588e

Impact on verification performance
==================================

Veristat results below are reported using
`-f insns_pct>1 -f !insns<500` filter and -t option
(BPF_F_TEST_STATE_FREQ flag).

 master vs patch-set, selftests (out of ~4K programs)
 ----------------------------------------------------

  File                              Program                                 Insns (A)  Insns (B)  Insns    (DIFF)
  --------------------------------  --------------------------------------  ---------  ---------  ---------------
  cpumask_success.bpf.o             test_global_mask_nested_deep_array_rcu       1622       1655     +33 (+2.03%)
  strobemeta_bpf_loop.bpf.o         on_event                                     2163       2684   +521 (+24.09%)
  test_cls_redirect.bpf.o           cls_redirect                                36001      42515  +6514 (+18.09%)
  test_cls_redirect_dynptr.bpf.o    cls_redirect                                 2299       2339     +40 (+1.74%)
  test_cls_redirect_subprogs.bpf.o  cls_redirect                                69545      78497  +8952 (+12.87%)
  test_l4lb_noinline.bpf.o          balancer_ingress                             2993       3084     +91 (+3.04%)
  test_xdp_noinline.bpf.o           balancer_ingress_v4                          3539       3616     +77 (+2.18%)
  test_xdp_noinline.bpf.o           balancer_ingress_v6                          3608       3685     +77 (+2.13%)

 master vs patch-set, sched_ext (out of 148 programs)
 ----------------------------------------------------

  File       Program           Insns (A)  Insns (B)  Insns    (DIFF)
  ---------  ----------------  ---------  ---------  ---------------
  bpf.bpf.o  chaos_dispatch         2257       2287     +30 (+1.33%)
  bpf.bpf.o  lavd_enqueue          20735      22101   +1366 (+6.59%)
  bpf.bpf.o  lavd_select_cpu       22100      24409  +2309 (+10.45%)
  bpf.bpf.o  layered_dispatch      25051      25606    +555 (+2.22%)
  bpf.bpf.o  p2dq_dispatch           961        990     +29 (+3.02%)
  bpf.bpf.o  rusty_quiescent         526        534      +8 (+1.52%)
  bpf.bpf.o  rusty_runnable          541        547      +6 (+1.11%)

Perf report
===========

In relative terms, the analysis does not consume much CPU time.
For example, here is a perf report collected for pyperf180 selftest:

 # Children      Self  Command   Shared Object         Symbol
 # ........  ........  ........  ....................  ........................................
        ...
      1.22%     1.22%  veristat  [kernel.kallsyms]     [k] bpf_update_live_stack
        ...

Changelog
=========

v1: https://lore.kernel.org/bpf/20250911010437.2779173-1-eddyz87@gmail.com/T/
v1 -> v2:
 - compute_postorder() fixed to handle jumps with offset -1 (syzbot).
 - is_state_visited() in patch #9 fixed access to uninitialized `err`
   (kernel test robot, Dan Carpenter).
 - Selftests added.
 - Fixed bug with write marks propagation from callee to caller,
   see verifier_live_stack.c:caller_stack_write() test case.
 - Added a patch for __not_msg() annotation for test_loader based
   tests.

v2: https://lore.kernel.org/bpf/20250918-callchain-sensitive-liveness-v2-0-214ed2653eee@gmail.com/
v2 -> v3:
 - Added __diag_ignore_all("-Woverride-init", ...) in liveness.c for
   bpf_insn_successors() (suggested by Alexei).

Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
====================

Link: https://patch.msgid.link/20250918-callchain-sensitive-liveness-v3-0-c3cd27bacc60@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
kdave pushed a commit that referenced this pull request Oct 1, 2025
scx_bpf_cpu_curr() has been introduced to retrieve the current task of a
given runqueue, allowing schedulers to interact with that task.

The kfunc assumes that it is always called in an RCU context, but this
is not always guaranteed and some BPF schedulers can trigger the
following warning:

  WARNING: suspicious RCU usage
  sched_ext: BPF scheduler "cosmos_1.0.2_gd0e71ca_x86_64_unknown_linux_gnu_debug" enabled
  6.17.0-rc1 #1-NixOS Not tainted
  -----------------------------
  kernel/sched/ext.c:6415 suspicious rcu_dereference_check() usage!
  ...
 Call Trace:
  <IRQ>
  dump_stack_lvl+0x6f/0xb0
  lockdep_rcu_suspicious.cold+0x4e/0x96
  scx_bpf_cpu_curr+0x7e/0x80
  bpf_prog_c68b2b6b6b1b0ff8_sched_timerfn+0xce/0x1dc
  bpf_timer_cb+0x7b/0x130
  __hrtimer_run_queues+0x1ea/0x380
  hrtimer_run_softirq+0x8c/0xd0
  handle_softirqs+0xc9/0x3b0
  __irq_exit_rcu+0x96/0xc0
  irq_exit_rcu+0xe/0x20
  sysvec_apic_timer_interrupt+0x73/0x80
  </IRQ>
  <TASK>

To address this, mark the kfunc with KF_RCU_PROTECTED, so the verifier
can enforce its usage only inside RCU-protected sections.

Note: this also requires commit 1512231 ("bpf: Enforce RCU protection
for KF_RCU_PROTECTED"), currently in bpf-next, to enforce the proper
KF_RCU_PROTECTED.

Fixes: 20b1580 ("sched_ext: Introduce scx_bpf_cpu_curr()")
Cc: Christian Loehle <christian.loehle@arm.com>
Cc: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Andrea Righi <arighi@nvidia.com>
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
kdave pushed a commit that referenced this pull request Oct 1, 2025
Leon Hwang says:

====================
bpf: Allow union argument in trampoline based programs

While tracing 'release_pages' with bpfsnoop[0], the verifier reports:

The function release_pages arg0 type UNION is unsupported.

However, it should be acceptable to trace functions that have 'union'
arguments.

This patch set enables such support in the verifier by allowing 'union'
as a valid argument type.

Changes:
v3 -> v4:
* Address comments from Alexei:
  * Trim bpftrace output in patch #1 log.
  * Drop the referenced commit info and the test output in patch #2 log.

v2 -> v3:
* Address comments from Alexei:
  * Reuse the existing flag BTF_FMODEL_STRUCT_ARG.
  * Update the comment of the flag BTF_FMODEL_STRUCT_ARG.

v1 -> v2:
* Add 16B 'union' argument support in x86_64 trampoline.
* Update selftests using bpf_testmod.
* Add test case about 16-bytes 'union' argument.
* Address comments from Alexei:
  * Study the patch set about 'struct' argument support.
  * Update selftests to cover more cases.
v1: https://lore.kernel.org/bpf/20250905133226.84675-1-leon.hwang@linux.dev/

Links:
[0] https://github.com/bpfsnoop/bpfsnoop
====================

Link: https://patch.msgid.link/20250919044110.23729-1-leon.hwang@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
kdave pushed a commit that referenced this pull request Oct 1, 2025
…up_fast()"

This reverts commit c8191ee which triggers
the following suspicious RCU usage warning:

[    6.647598] =============================
[    6.647603] WARNING: suspicious RCU usage
[    6.647605] 6.17.0-rc7-virtme #1 Not tainted
[    6.647608] -----------------------------
[    6.647608] ./include/linux/rhashtable.h:602 suspicious rcu_dereference_check() usage!
[    6.647610]
[    6.647610] other info that might help us debug this:
[    6.647610]
[    6.647612]
[    6.647612] rcu_scheduler_active = 2, debug_locks = 1
[    6.647613] 1 lock held by swapper/10/0:
[    6.647614]  #0: ffff8b14bbb3cc98 (&rq->__lock){-.-.}-{2:2}, at:
+raw_spin_rq_lock_nested+0x20/0x90
[    6.647630]
[    6.647630] stack backtrace:
[    6.647633] CPU: 10 UID: 0 PID: 0 Comm: swapper/10 Not tainted 6.17.0-rc7-virtme #1
+PREEMPT(full)
[    6.647643] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[    6.647646] Sched_ext: beerland_1.0.2_g27d63fc3_x86_64_unknown_linux_gnu (enabled+all)
[    6.647648] Call Trace:
[    6.647652]  <IRQ>
[    6.647655]  dump_stack_lvl+0x78/0xe0
[    6.647665]  lockdep_rcu_suspicious+0x14a/0x1b0
[    6.647672]  __rhashtable_lookup.constprop.0+0x1d5/0x250
[    6.647680]  find_dsq_for_dispatch+0xbc/0x190
[    6.647684]  do_enqueue_task+0x25b/0x550
[    6.647689]  enqueue_task_scx+0x21d/0x360
[    6.647692]  ? trace_lock_acquire+0x22/0xb0
[    6.647695]  enqueue_task+0x2e/0xd0
[    6.647698]  ttwu_do_activate+0xa2/0x290
[    6.647703]  sched_ttwu_pending+0xfd/0x250
[    6.647706]  __flush_smp_call_function_queue+0x1cd/0x610
[    6.647714]  __sysvec_call_function_single+0x34/0x150
[    6.647720]  sysvec_call_function_single+0x6e/0x80
[    6.647726]  </IRQ>
[    6.647726]  <TASK>
[    6.647727]  asm_sysvec_call_function_single+0x1a/0x20

Reported-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
kdave pushed a commit that referenced this pull request Oct 1, 2025
[BUG]
Syzbot reported an ASSERT() triggered inside scrub:

  BTRFS info (device loop0): scrub: started on devid 1
  assertion failed: !folio_test_partial_kmap(folio) :: 0, in fs/btrfs/scrub.c:697
  ------------[ cut here ]------------
  kernel BUG at fs/btrfs/scrub.c:697!
  Oops: invalid opcode: 0000 [#1] SMP KASAN PTI
  CPU: 0 UID: 0 PID: 6077 Comm: syz.0.17 Not tainted syzkaller #0 PREEMPT(full)
  Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 08/18/2025
  RIP: 0010:scrub_stripe_get_kaddr+0x1bb/0x1c0 fs/btrfs/scrub.c:697
  Call Trace:
   <TASK>
   scrub_bio_add_sector fs/btrfs/scrub.c:932 [inline]
   scrub_submit_initial_read+0xf21/0x1120 fs/btrfs/scrub.c:1897
   submit_initial_group_read+0x423/0x5b0 fs/btrfs/scrub.c:1952
   flush_scrub_stripes+0x18f/0x1150 fs/btrfs/scrub.c:1973
   scrub_stripe+0xbea/0x2a30 fs/btrfs/scrub.c:2516
   scrub_chunk+0x2a3/0x430 fs/btrfs/scrub.c:2575
   scrub_enumerate_chunks+0xa70/0x1350 fs/btrfs/scrub.c:2839
   btrfs_scrub_dev+0x6e7/0x10e0 fs/btrfs/scrub.c:3153
   btrfs_ioctl_scrub+0x249/0x4b0 fs/btrfs/ioctl.c:3163
   vfs_ioctl fs/ioctl.c:51 [inline]
   __do_sys_ioctl fs/ioctl.c:597 [inline]
   __se_sys_ioctl+0xfc/0x170 fs/ioctl.c:583
   do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
   do_syscall_64+0xfa/0xfa0 arch/x86/entry/syscall_64.c:94
   entry_SYSCALL_64_after_hwframe+0x77/0x7f
   </TASK>
  ---[ end trace 0000000000000000 ]---

Which doesn't make much sense, as all the folios we allocated for scrub
should not be highmem.

[CAUSE]
Thankfully syzbot has a detailed kernel config file, showing that
CONFIG_DEBUG_KMAP_LOCAL_FORCE_MAP is set to y.

And that debug option will force all folio_test_partial_kmap() to return
true, to improve coverage on highmem tests.

But in our case we really just want to make sure the folios we allocated
are not highmem (and they are indeed not). Such incorrect result from
folio_test_partial_kmap() is just screwing up everything.

[FIX]
Replace folio_test_partial_kmap() to folio_test_highmem() so that we
won't bother those stupid highmem specific debug options.

Fixes: 5fbaae4 ("btrfs: prepare scrub to support bs > ps cases")
Reported-by: syzbot+bde59221318c592e6346@syzkaller.appspotmail.com
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
kdave pushed a commit that referenced this pull request Oct 2, 2025
…ce tree

Currently, when building a free space tree at populate_free_space_tree(),
if we are not using the block group tree feature, we always expect to find
block group items (either extent items or a block group item with key type
BTRFS_BLOCK_GROUP_ITEM_KEY) when we search the extent tree with
btrfs_search_slot_for_read(), so we assert that we found an item. However
this expectation is wrong since we can have a new block group created in
the current transaction which is still empty and for which we still have
not added the block group's item to the extent tree, in which case we do
not have any items in the extent tree associated to the block group.

The insertion of a new block group's block group item in the extent tree
happens at btrfs_create_pending_block_groups() when it calls the helper
insert_block_group_item(). This typically is done when a transaction
handle is released, committed or when running delayed refs (either as
part of a transaction commit or when serving tickets for space reservation
if we are low on free space).

So remove the assertion at populate_free_space_tree() even when the block
group tree feature is not enabled and update the comment to mention this
case.

Syzbot reported this with the following stack trace:

  BTRFS info (device loop3 state M): rebuilding free space tree
  assertion failed: ret == 0 :: 0, in fs/btrfs/free-space-tree.c:1115
  ------------[ cut here ]------------
  kernel BUG at fs/btrfs/free-space-tree.c:1115!
  Oops: invalid opcode: 0000 [#1] SMP KASAN PTI
  CPU: 1 UID: 0 PID: 6352 Comm: syz.3.25 Not tainted syzkaller #0 PREEMPT(full)
  Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 08/18/2025
  RIP: 0010:populate_free_space_tree+0x700/0x710 fs/btrfs/free-space-tree.c:1115
  Code: ff ff e8 d3 (...)
  RSP: 0018:ffffc9000430f780 EFLAGS: 00010246
  RAX: 0000000000000043 RBX: ffff88805b709630 RCX: fea61d0e2e79d000
  RDX: 0000000000000000 RSI: 0000000080000000 RDI: 0000000000000000
  RBP: ffffc9000430f8b0 R08: ffffc9000430f4a7 R09: 1ffff92000861e94
  R10: dffffc0000000000 R11: fffff52000861e95 R12: 0000000000000001
  R13: 1ffff92000861f00 R14: dffffc0000000000 R15: 0000000000000000
  FS:  00007f424d9fe6c0(0000) GS:ffff888125afc000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 00007fd78ad212c0 CR3: 0000000076d68000 CR4: 00000000003526f0
  Call Trace:
   <TASK>
   btrfs_rebuild_free_space_tree+0x1ba/0x6d0 fs/btrfs/free-space-tree.c:1364
   btrfs_start_pre_rw_mount+0x128f/0x1bf0 fs/btrfs/disk-io.c:3062
   btrfs_remount_rw fs/btrfs/super.c:1334 [inline]
   btrfs_reconfigure+0xaed/0x2160 fs/btrfs/super.c:1559
   reconfigure_super+0x227/0x890 fs/super.c:1076
   do_remount fs/namespace.c:3279 [inline]
   path_mount+0xd1a/0xfe0 fs/namespace.c:4027
   do_mount fs/namespace.c:4048 [inline]
   __do_sys_mount fs/namespace.c:4236 [inline]
   __se_sys_mount+0x313/0x410 fs/namespace.c:4213
   do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
   do_syscall_64+0xfa/0xfa0 arch/x86/entry/syscall_64.c:94
   entry_SYSCALL_64_after_hwframe+0x77/0x7f
   RIP: 0033:0x7f424e39066a
  Code: d8 64 89 02 (...)
  RSP: 002b:00007f424d9fde68 EFLAGS: 00000246 ORIG_RAX: 00000000000000a5
  RAX: ffffffffffffffda RBX: 00007f424d9fdef0 RCX: 00007f424e39066a
  RDX: 0000200000000180 RSI: 0000200000000380 RDI: 0000000000000000
  RBP: 0000200000000180 R08: 00007f424d9fdef0 R09: 0000000000000020
  R10: 0000000000000020 R11: 0000000000000246 R12: 0000200000000380
  R13: 00007f424d9fdeb0 R14: 0000000000000000 R15: 00002000000002c0
   </TASK>
  Modules linked in:
  ---[ end trace 0000000000000000 ]---

Reported-by: syzbot+884dc4621377ba579a6f@syzkaller.appspotmail.com
Link: https://lore.kernel.org/linux-btrfs/68dc3dab.a00a0220.102ee.004e.GAE@google.com/
Fixes: a5ed918 ("Btrfs: implement the free space B-tree")
CC: <stable@vger.kernel.org> # 6.1.x: 1961d20: btrfs: fix assertion when building free space tree
CC: <stable@vger.kernel.org> # 6.1.x
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant