Skip to content

Conversation

@pull
Copy link

@pull pull bot commented Jul 2, 2020

See Commits and Changes for more details.


Created by pull[bot]. Want to support this open source service? Please star it : )

Hou Tao and others added 7 commits June 29, 2020 07:45
Else there may be magic numbers in /sys/kernel/debug/block/*/state.

Signed-off-by: Hou Tao <houtao1@huawei.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Make blk_ksm_destroy() use the kvfree_sensitive() function (which was
introduced in v5.8-rc1) instead of open-coding it.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
So that the target task will exit the wait_event_interruptible-like
loop and call task_work_run() asap.

The patch turns "bool notify" into 0,TWA_RESUME,TWA_SIGNAL enum, the
new TWA_SIGNAL flag implies signal_wake_up().  However, it needs to
avoid the race with recalc_sigpending(), so the patch also adds the
new JOBCTL_TASK_WORK bit included in JOBCTL_PENDING_MASK.

TODO: once this patch is merged we need to change all current users
of task_work_add(notify = true) to use TWA_RESUME.

Cc: stable@vger.kernel.org # v5.7
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Since 5.7, we've been using task_work to trigger async running of
requests in the context of the original task. This generally works
great, but there's a case where if the task is currently blocked
in the kernel waiting on a condition to become true, it won't process
task_work. Even though the task is woken, it just checks whatever
condition it's waiting on, and goes back to sleep if it's still false.

This is a problem if that very condition only becomes true when that
task_work is run. An example of that is the task registering an eventfd
with io_uring, and it's now blocked waiting on an eventfd read. That
read could depend on a completion event, and that completion event
won't get trigged until task_work has been run.

Use the TWA_SIGNAL notification for task_work, so that we ensure that
the task always runs the work when queued.

Cc: stable@vger.kernel.org # v5.7
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Else there will be memory leak if alloc_disk() fails.

Fixes: 6a27b65 ("block: virtio-blk: support multi virt queues per virtio-blk device")
Signed-off-by: Hou Tao <houtao1@huawei.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pull io_uring fixes from Jens Axboe:
 "One fix in here, for a regression in 5.7 where a task is waiting in
  the kernel for a condition, but that condition won't become true until
  task_work is run. And the task_work can't be run exactly because the
  task is waiting in the kernel, so we'll never make any progress.

  One example of that is registering an eventfd and queueing io_uring
  work, and then the task goes and waits in eventfd read with the
  expectation that it'll get woken (and read an event) when the io_uring
  request completes. The io_uring request is finished through task_work,
  which won't get run while the task is looping in eventfd read"

* tag 'io_uring-5.8-2020-07-01' of git://git.kernel.dk/linux-block:
  io_uring: use signal based task_work running
  task_work: teach task_work_add() to do signal_wake_up()
Pull block fixes from Jens Axboe:

 - Use kvfree_sensitive() for the block keyslot free (Eric)

 - Sync blk-mq debugfs flags (Hou)

 - Memory leak fix in virtio-blk error path (Hou)

* tag 'block-5.8-2020-07-01' of git://git.kernel.dk/linux-block:
  virtio-blk: free vblk-vqs in error path of virtblk_probe()
  block/keyslot-manager: use kvfree_sensitive()
  blk-mq-debugfs: update blk_queue_flag_name[] accordingly for new flags
@pull pull bot added the ⤵️ pull label Jul 2, 2020
@pull pull bot merged commit 7cc2a8e into vchong:master Jul 2, 2020
pull bot pushed a commit that referenced this pull request Jul 3, 2020
GFP_KERNEL flag specifies a normal kernel allocation in which executing
in process context without any locks and can sleep.
mmio_diff takes sometime to finish all the diff compare and it has
locks, continue using GFP_KERNEL will output below trace if LOCKDEP
enabled.

Use GFP_ATOMIC instead.

V2: Rebase.

=====================================================
WARNING: SOFTIRQ-safe -> SOFTIRQ-unsafe lock order detected
5.7.0-rc2 #400 Not tainted
-----------------------------------------------------
is trying to acquire:
ffffffffb47bea20 (fs_reclaim){+.+.}-{0:0}, at: fs_reclaim_acquire.part.0+0x0/0x30

               and this task is already holding:
ffff88845b85cc90 (&gvt->scheduler.mmio_context_lock){+.-.}-{2:2}, at: vgpu_mmio_diff_show+0xcf/0x2e0
which would create a new lock dependency:
 (&gvt->scheduler.mmio_context_lock){+.-.}-{2:2} -> (fs_reclaim){+.+.}-{0:0}

               but this new dependency connects a SOFTIRQ-irq-safe lock:
 (&gvt->scheduler.mmio_context_lock){+.-.}-{2:2}

               ... which became SOFTIRQ-irq-safe at:
  lock_acquire+0x175/0x4e0
  _raw_spin_lock_irqsave+0x2b/0x40
  shadow_context_status_change+0xfe/0x2f0
  notifier_call_chain+0x6a/0xa0
  __atomic_notifier_call_chain+0x5f/0xf0
  execlists_schedule_out+0x42a/0x820
  process_csb+0xe7/0x3e0
  execlists_submission_tasklet+0x5c/0x1d0
  tasklet_action_common.isra.0+0xeb/0x260
  __do_softirq+0x11d/0x56f
  irq_exit+0xf6/0x100
  do_IRQ+0x7f/0x160
  ret_from_intr+0x0/0x2a
  cpuidle_enter_state+0xcd/0x5b0
  cpuidle_enter+0x37/0x60
  do_idle+0x337/0x3f0
  cpu_startup_entry+0x14/0x20
  start_kernel+0x58b/0x5c5
  secondary_startup_64+0xa4/0xb0

               to a SOFTIRQ-irq-unsafe lock:
 (fs_reclaim){+.+.}-{0:0}

               ... which became SOFTIRQ-irq-unsafe at:
...
  lock_acquire+0x175/0x4e0
  fs_reclaim_acquire.part.0+0x20/0x30
  kmem_cache_alloc_node_trace+0x2e/0x290
  alloc_worker+0x2b/0xb0
  init_rescuer.part.0+0x17/0xe0
  workqueue_init+0x293/0x3bb
  kernel_init_freeable+0x149/0x325
  kernel_init+0x8/0x116
  ret_from_fork+0x3a/0x50

               other info that might help us debug this:

 Possible interrupt unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock(fs_reclaim);
                               local_irq_disable();
                               lock(&gvt->scheduler.mmio_context_lock);
                               lock(fs_reclaim);
  <Interrupt>
    lock(&gvt->scheduler.mmio_context_lock);

                *** DEADLOCK ***

3 locks held by cat/1439:
 #0: ffff888444a23698 (&p->lock){+.+.}-{3:3}, at: seq_read+0x49/0x680
 #1: ffff88845b858068 (&gvt->lock){+.+.}-{3:3}, at: vgpu_mmio_diff_show+0xc7/0x2e0
 #2: ffff88845b85cc90 (&gvt->scheduler.mmio_context_lock){+.-.}-{2:2}, at: vgpu_mmio_diff_show+0xcf/0x2e0

               the dependencies between SOFTIRQ-irq-safe lock and the holding lock:
-> (&gvt->scheduler.mmio_context_lock){+.-.}-{2:2} ops: 31 {
   HARDIRQ-ON-W at:
                    lock_acquire+0x175/0x4e0
                    _raw_spin_lock_bh+0x2f/0x40
                    vgpu_mmio_diff_show+0xcf/0x2e0
                    seq_read+0x242/0x680
                    full_proxy_read+0x95/0xc0
                    vfs_read+0xc2/0x1b0
                    ksys_read+0xc4/0x160
                    do_syscall_64+0x63/0x290
                    entry_SYSCALL_64_after_hwframe+0x49/0xb3
   IN-SOFTIRQ-W at:
                    lock_acquire+0x175/0x4e0
                    _raw_spin_lock_irqsave+0x2b/0x40
                    shadow_context_status_change+0xfe/0x2f0
                    notifier_call_chain+0x6a/0xa0
                    __atomic_notifier_call_chain+0x5f/0xf0
                    execlists_schedule_out+0x42a/0x820
                    process_csb+0xe7/0x3e0
                    execlists_submission_tasklet+0x5c/0x1d0
                    tasklet_action_common.isra.0+0xeb/0x260
                    __do_softirq+0x11d/0x56f
                    irq_exit+0xf6/0x100
                    do_IRQ+0x7f/0x160
                    ret_from_intr+0x0/0x2a
                    cpuidle_enter_state+0xcd/0x5b0
                    cpuidle_enter+0x37/0x60
                    do_idle+0x337/0x3f0
                    cpu_startup_entry+0x14/0x20
                    start_kernel+0x58b/0x5c5
                    secondary_startup_64+0xa4/0xb0
   INITIAL USE at:
                   lock_acquire+0x175/0x4e0
                   _raw_spin_lock_irqsave+0x2b/0x40
                   shadow_context_status_change+0xfe/0x2f0
                   notifier_call_chain+0x6a/0xa0
                   __atomic_notifier_call_chain+0x5f/0xf0
                   execlists_schedule_in+0x2c8/0x690
                   __execlists_submission_tasklet+0x1303/0x1930
                   execlists_submit_request+0x1e7/0x230
                   submit_notify+0x105/0x2a4
                   __i915_sw_fence_complete+0xaa/0x380
                   __engine_park+0x313/0x5a0
                   ____intel_wakeref_put_last+0x3e/0x90
                   intel_gt_resume+0x41e/0x440
                   intel_gt_init+0x283/0xbc0
                   i915_gem_init+0x197/0x240
                   i915_driver_probe+0xc2d/0x12e0
                   i915_pci_probe+0xa2/0x1e0
                   local_pci_probe+0x6f/0xb0
                   pci_device_probe+0x171/0x230
                   really_probe+0x17a/0x380
                   driver_probe_device+0x70/0xf0
                   device_driver_attach+0x82/0x90
                   __driver_attach+0x60/0x100
                   bus_for_each_dev+0xe4/0x140
                   bus_add_driver+0x257/0x2a0
                   driver_register+0xd3/0x150
                   i915_init+0x6d/0x80
                   do_one_initcall+0xb8/0x3a0
                   kernel_init_freeable+0x2b4/0x325
                   kernel_init+0x8/0x116
                   ret_from_fork+0x3a/0x50
 }
__key.77812+0x0/0x40
 ... acquired at:
   lock_acquire+0x175/0x4e0
   fs_reclaim_acquire.part.0+0x20/0x30
   kmem_cache_alloc_trace+0x2e/0x260
   mmio_diff_handler+0xc0/0x150
   intel_gvt_for_each_tracked_mmio+0x7b/0x140
   vgpu_mmio_diff_show+0x111/0x2e0
   seq_read+0x242/0x680
   full_proxy_read+0x95/0xc0
   vfs_read+0xc2/0x1b0
   ksys_read+0xc4/0x160
   do_syscall_64+0x63/0x290
   entry_SYSCALL_64_after_hwframe+0x49/0xb3

               the dependencies between the lock to be acquired
 and SOFTIRQ-irq-unsafe lock:
-> (fs_reclaim){+.+.}-{0:0} ops: 1999031 {
   HARDIRQ-ON-W at:
                    lock_acquire+0x175/0x4e0
                    fs_reclaim_acquire.part.0+0x20/0x30
                    kmem_cache_alloc_node_trace+0x2e/0x290
                    alloc_worker+0x2b/0xb0
                    init_rescuer.part.0+0x17/0xe0
                    workqueue_init+0x293/0x3bb
                    kernel_init_freeable+0x149/0x325
                    kernel_init+0x8/0x116
                    ret_from_fork+0x3a/0x50
   SOFTIRQ-ON-W at:
                    lock_acquire+0x175/0x4e0
                    fs_reclaim_acquire.part.0+0x20/0x30
                    kmem_cache_alloc_node_trace+0x2e/0x290
                    alloc_worker+0x2b/0xb0
                    init_rescuer.part.0+0x17/0xe0
                    workqueue_init+0x293/0x3bb
                    kernel_init_freeable+0x149/0x325
                    kernel_init+0x8/0x116
                    ret_from_fork+0x3a/0x50
   INITIAL USE at:
                   lock_acquire+0x175/0x4e0
                   fs_reclaim_acquire.part.0+0x20/0x30
                   kmem_cache_alloc_node_trace+0x2e/0x290
                   alloc_worker+0x2b/0xb0
                   init_rescuer.part.0+0x17/0xe0
                   workqueue_init+0x293/0x3bb
                   kernel_init_freeable+0x149/0x325
                   kernel_init+0x8/0x116
                   ret_from_fork+0x3a/0x50
 }
__fs_reclaim_map+0x0/0x60
 ... acquired at:
   lock_acquire+0x175/0x4e0
   fs_reclaim_acquire.part.0+0x20/0x30
   kmem_cache_alloc_trace+0x2e/0x260
   mmio_diff_handler+0xc0/0x150
   intel_gvt_for_each_tracked_mmio+0x7b/0x140
   vgpu_mmio_diff_show+0x111/0x2e0
   seq_read+0x242/0x680
   full_proxy_read+0x95/0xc0
   vfs_read+0xc2/0x1b0
   ksys_read+0xc4/0x160
   do_syscall_64+0x63/0x290
   entry_SYSCALL_64_after_hwframe+0x49/0xb3

               stack backtrace:
CPU: 5 PID: 1439 Comm: cat Not tainted 5.7.0-rc2 #400
Hardware name: Intel(R) Client Systems NUC8i7BEH/NUC8BEB, BIOS BECFL357.86A.0056.2018.1128.1717 11/28/2018
Call Trace:
 dump_stack+0x97/0xe0
 check_irq_usage.cold+0x428/0x434
 ? check_usage_forwards+0x2c0/0x2c0
 ? class_equal+0x11/0x20
 ? __bfs+0xd2/0x2d0
 ? in_any_class_list+0xa0/0xa0
 ? check_path+0x22/0x40
 ? check_noncircular+0x150/0x2b0
 ? print_circular_bug.isra.0+0x1b0/0x1b0
 ? mark_lock+0x13d/0xc50
 ? __lock_acquire+0x1e32/0x39b0
 __lock_acquire+0x1e32/0x39b0
 ? timerqueue_add+0xc1/0x130
 ? register_lock_class+0xa60/0xa60
 ? mark_lock+0x13d/0xc50
 lock_acquire+0x175/0x4e0
 ? __zone_pcp_update+0x80/0x80
 ? check_flags.part.0+0x210/0x210
 ? mark_held_locks+0x65/0x90
 ? _raw_spin_unlock_irqrestore+0x32/0x40
 ? lockdep_hardirqs_on+0x190/0x290
 ? fwtable_read32+0x163/0x480
 ? mmio_diff_handler+0xc0/0x150
 fs_reclaim_acquire.part.0+0x20/0x30
 ? __zone_pcp_update+0x80/0x80
 kmem_cache_alloc_trace+0x2e/0x260
 mmio_diff_handler+0xc0/0x150
 ? vgpu_mmio_diff_open+0x30/0x30
 intel_gvt_for_each_tracked_mmio+0x7b/0x140
 vgpu_mmio_diff_show+0x111/0x2e0
 ? mmio_diff_handler+0x150/0x150
 ? rcu_read_lock_sched_held+0xa0/0xb0
 ? rcu_read_lock_bh_held+0xc0/0xc0
 ? kasan_unpoison_shadow+0x33/0x40
 ? __kasan_kmalloc.constprop.0+0xc2/0xd0
 seq_read+0x242/0x680
 ? debugfs_locked_down.isra.0+0x70/0x70
 full_proxy_read+0x95/0xc0
 vfs_read+0xc2/0x1b0
 ksys_read+0xc4/0x160
 ? kernel_write+0xb0/0xb0
 ? mark_held_locks+0x24/0x90
 do_syscall_64+0x63/0x290
 entry_SYSCALL_64_after_hwframe+0x49/0xb3
RIP: 0033:0x7ffbe3e6efb2
Code: c0 e9 c2 fe ff ff 50 48 8d 3d ca cb 0a 00 e8 f5 19 02 00 0f 1f 44 00 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 0f 05 <48> 3d 00 f0 ff ff 77 56 c3 0f 1f 44 00 00 48 83 ec 28 48 89 54 24
RSP: 002b:00007ffd021c08a8 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
RAX: ffffffffffffffda RBX: 0000000000020000 RCX: 00007ffbe3e6efb2
RDX: 0000000000020000 RSI: 00007ffbe34cd000 RDI: 0000000000000003
RBP: 00007ffbe34cd000 R08: 00007ffbe34cc010 R09: 0000000000000000
R10: 0000000000000022 R11: 0000000000000246 R12: 0000562b6f0a11f0
R13: 0000000000000003 R14: 0000000000020000 R15: 0000000000020000
------------[ cut here ]------------

Acked-by: Zhenyu Wang <zhenyuw@linux.intel.com>
Signed-off-by: Colin Xu <colin.xu@intel.com>
Signed-off-by: Zhenyu Wang <zhenyuw@linux.intel.com>
Link: http://patchwork.freedesktop.org/patch/msgid/20200601035556.19999-1-colin.xu@intel.com
pull bot pushed a commit that referenced this pull request Jul 7, 2020
…kernel/git/kvmarm/kvmarm into kvm-master

KVM/arm fixes for 5.8, take #2

- Make sure a vcpu becoming non-resident doesn't race against the doorbell delivery
- Only advertise pvtime if accounting is enabled
- Return the correct error code if reset fails with SVE
- Make sure that pseudo-NMI functions are annotated as __always_inline
pull bot pushed a commit that referenced this pull request Jul 11, 2020
Jakub Sitnicki says:

====================
This patch set prepares ground for link-based multi-prog attachment for
future netns attach types, with BPF_SK_LOOKUP attach type in mind [0].

Two changes are needed in order to attach and run a series of BPF programs:

  1) an bpf_prog_array of programs to run (patch #2), and
  2) a list of attached links to keep track of attachments (patch #3).

Nothing changes for BPF flow_dissector. Just as before only one program can
be attached to netns.

In v3 I've simplified patch #2 that introduces bpf_prog_array to take
advantage of the fact that it will hold at most one program for now.

In particular, I'm no longer using bpf_prog_array_copy. It turned out to be
less suitable for link operations than I thought as it fails to append the
same BPF program.

bpf_prog_array_replace_item is also gone, because we know we always want to
replace the first element in prog_array.

Naturally the code that handles bpf_prog_array will need change once
more when there is a program type that allows multi-prog attachment. But I
feel it will be better to do it gradually and present it together with
tests that actually exercise multi-prog code paths.

[0] https://lore.kernel.org/bpf/20200511185218.1422406-1-jakub@cloudflare.com/

v2 -> v3:
- Don't check if run_array is null in link update callback. (Martin)
- Allow updating the link with the same BPF program. (Andrii)
- Add patch #4 with a test for the above case.
- Kill bpf_prog_array_replace_item. Access the run_array directly.
- Switch from bpf_prog_array_copy() to bpf_prog_array_alloc(1, ...).
- Replace rcu_deref_protected & RCU_INIT_POINTER with rcu_replace_pointer.
- Drop Andrii's Ack from patch #2. Code changed.

v1 -> v2:

- Show with a (void) cast that bpf_prog_array_replace_item() return value
  is ignored on purpose. (Andrii)
- Explain why bpf-cgroup cannot replace programs in bpf_prog_array based
  on bpf_prog pointer comparison in patch #2 description. (Andrii)
====================

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
pull bot pushed a commit that referenced this pull request Jul 11, 2020
Ido Schimmel says:

====================
mlxsw: Various fixes

Fix two issues found by syzkaller.

Patch #1 removes inappropriate usage of WARN_ON() following memory
allocation failure. Constantly triggered when syzkaller injects faults.

Patch #2 fixes a use-after-free that can be triggered by 'devlink dev
info' following a failed devlink reload.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
pull bot pushed a commit that referenced this pull request Jul 12, 2020
In BRM_status_show(), if the condition "!ioc->is_warpdrive" tested on entry
to the function is true, a "goto out" is called. This results in unlocking
ioc->pci_access_mutex without this mutex lock being taken.  This generates
the following splat:

[ 1148.539883] mpt3sas_cm2: BRM_status_show: BRM attribute is only for warpdrive
[ 1148.547184]
[ 1148.548708] =====================================
[ 1148.553501] WARNING: bad unlock balance detected!
[ 1148.558277] 5.8.0-rc3+ #827 Not tainted
[ 1148.562183] -------------------------------------
[ 1148.566959] cat/5008 is trying to release lock (&ioc->pci_access_mutex) at:
[ 1148.574035] [<ffffffffc070b7a3>] BRM_status_show+0xd3/0x100 [mpt3sas]
[ 1148.580574] but there are no more locks to release!
[ 1148.585524]
[ 1148.585524] other info that might help us debug this:
[ 1148.599624] 3 locks held by cat/5008:
[ 1148.607085]  #0: ffff92aea3e392c0 (&p->lock){+.+.}-{3:3}, at: seq_read+0x34/0x480
[ 1148.618509]  #1: ffff922ef14c4888 (&of->mutex){+.+.}-{3:3}, at: kernfs_seq_start+0x2a/0xb0
[ 1148.630729]  #2: ffff92aedb5d7310 (kn->active#224){.+.+}-{0:0}, at: kernfs_seq_start+0x32/0xb0
[ 1148.643347]
[ 1148.643347] stack backtrace:
[ 1148.655259] CPU: 73 PID: 5008 Comm: cat Not tainted 5.8.0-rc3+ #827
[ 1148.665309] Hardware name: HGST H4060-S/S2600STB, BIOS SE5C620.86B.02.01.0008.031920191559 03/19/2019
[ 1148.678394] Call Trace:
[ 1148.684750]  dump_stack+0x78/0xa0
[ 1148.691802]  lock_release.cold+0x45/0x4a
[ 1148.699451]  __mutex_unlock_slowpath+0x35/0x270
[ 1148.707675]  BRM_status_show+0xd3/0x100 [mpt3sas]
[ 1148.716092]  dev_attr_show+0x19/0x40
[ 1148.723664]  sysfs_kf_seq_show+0x87/0x100
[ 1148.731193]  seq_read+0xbc/0x480
[ 1148.737882]  vfs_read+0xa0/0x160
[ 1148.744514]  ksys_read+0x58/0xd0
[ 1148.751129]  do_syscall_64+0x4c/0xa0
[ 1148.757941]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 1148.766240] RIP: 0033:0x7f1230566542
[ 1148.772957] Code: Bad RIP value.
[ 1148.779206] RSP: 002b:00007ffeac1bcac8 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
[ 1148.790063] RAX: ffffffffffffffda RBX: 0000000000020000 RCX: 00007f1230566542
[ 1148.800284] RDX: 0000000000020000 RSI: 00007f1223460000 RDI: 0000000000000003
[ 1148.810474] RBP: 00007f1223460000 R08: 00007f122345f010 R09: 0000000000000000
[ 1148.820641] R10: 0000000000000022 R11: 0000000000000246 R12: 0000000000000000
[ 1148.830728] R13: 0000000000000003 R14: 0000000000020000 R15: 0000000000020000

Fix this by returning immediately instead of jumping to the out label.

Link: https://lore.kernel.org/r/20200701085254.51740-1-damien.lemoal@wdc.com
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Acked-by: Sreekanth Reddy <sreekanth.reddy@broadcom.com>
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
pull bot pushed a commit that referenced this pull request Jul 14, 2020
In pci_disable_sriov(), i.e.,

 # echo 0 > /sys/class/net/enp11s0f1np1/device/sriov_numvfs

iommu_release_device
  iommu_group_remove_device
    arm_smmu_domain_free
      kfree(smmu_domain)

Later,

iommu_release_device
  arm_smmu_release_device
    arm_smmu_detach_dev
      spin_lock_irqsave(&smmu_domain->devices_lock,

would trigger an use-after-free. Fixed it by call
arm_smmu_release_device() first before iommu_group_remove_device().

 BUG: KASAN: use-after-free in __lock_acquire+0x3458/0x4440
  __lock_acquire at kernel/locking/lockdep.c:4250
 Read of size 8 at addr ffff0089df1a6f68 by task bash/3356

 CPU: 5 PID: 3356 Comm: bash Not tainted 5.8.0-rc3-next-20200630 #2
 Hardware name: HPE Apollo 70             /C01_APACHE_MB         , BIOS L50_5.13_1.11 06/18/2019
 Call trace:
  dump_backtrace+0x0/0x398
  show_stack+0x14/0x20
  dump_stack+0x140/0x1b8
  print_address_description.isra.12+0x54/0x4a8
  kasan_report+0x134/0x1b8
  __asan_report_load8_noabort+0x2c/0x50
  __lock_acquire+0x3458/0x4440
  lock_acquire+0x204/0xf10
  _raw_spin_lock_irqsave+0xf8/0x180
  arm_smmu_detach_dev+0xd8/0x4a0
  arm_smmu_detach_dev at drivers/iommu/arm-smmu-v3.c:2776
  arm_smmu_release_device+0xb4/0x1c8
  arm_smmu_disable_pasid at drivers/iommu/arm-smmu-v3.c:2754
  (inlined by) arm_smmu_release_device at drivers/iommu/arm-smmu-v3.c:3000
  iommu_release_device+0xc0/0x178
  iommu_release_device at drivers/iommu/iommu.c:302
  iommu_bus_notifier+0x118/0x160
  notifier_call_chain+0xa4/0x128
  __blocking_notifier_call_chain+0x70/0xa8
  blocking_notifier_call_chain+0x14/0x20
  device_del+0x618/0xa00
  pci_remove_bus_device+0x108/0x2d8
  pci_stop_and_remove_bus_device+0x1c/0x28
  pci_iov_remove_virtfn+0x228/0x368
  sriov_disable+0x8c/0x348
  pci_disable_sriov+0x5c/0x70
  mlx5_core_sriov_configure+0xd8/0x260 [mlx5_core]
  sriov_numvfs_store+0x240/0x318
  dev_attr_store+0x38/0x68
  sysfs_kf_write+0xdc/0x128
  kernfs_fop_write+0x23c/0x448
  __vfs_write+0x54/0xe8
  vfs_write+0x124/0x3f0
  ksys_write+0xe8/0x1b8
  __arm64_sys_write+0x68/0x98
  do_el0_svc+0x124/0x220
  el0_sync_handler+0x260/0x408
  el0_sync+0x140/0x180

 Allocated by task 3356:
  save_stack+0x24/0x50
  __kasan_kmalloc.isra.13+0xc4/0xe0
  kasan_kmalloc+0xc/0x18
  kmem_cache_alloc_trace+0x1ec/0x318
  arm_smmu_domain_alloc+0x54/0x148
  iommu_group_alloc_default_domain+0xc0/0x440
  iommu_probe_device+0x1c0/0x308
  iort_iommu_configure+0x434/0x518
  acpi_dma_configure+0xf0/0x128
  pci_dma_configure+0x114/0x160
  really_probe+0x124/0x6d8
  driver_probe_device+0xc4/0x180
  __device_attach_driver+0x184/0x1e8
  bus_for_each_drv+0x114/0x1a0
  __device_attach+0x19c/0x2a8
  device_attach+0x10/0x18
  pci_bus_add_device+0x70/0xf8
  pci_iov_add_virtfn+0x7b4/0xb40
  sriov_enable+0x5c8/0xc30
  pci_enable_sriov+0x64/0x80
  mlx5_core_sriov_configure+0x58/0x260 [mlx5_core]
  sriov_numvfs_store+0x1c0/0x318
  dev_attr_store+0x38/0x68
  sysfs_kf_write+0xdc/0x128
  kernfs_fop_write+0x23c/0x448
  __vfs_write+0x54/0xe8
  vfs_write+0x124/0x3f0
  ksys_write+0xe8/0x1b8
  __arm64_sys_write+0x68/0x98
  do_el0_svc+0x124/0x220
  el0_sync_handler+0x260/0x408
  el0_sync+0x140/0x180

 Freed by task 3356:
  save_stack+0x24/0x50
  __kasan_slab_free+0x124/0x198
  kasan_slab_free+0x10/0x18
  slab_free_freelist_hook+0x110/0x298
  kfree+0x128/0x668
  arm_smmu_domain_free+0xf4/0x1a0
  iommu_group_release+0xec/0x160
  kobject_put+0xf4/0x238
  kobject_del+0x110/0x190
  kobject_put+0x1e4/0x238
  iommu_group_remove_device+0x394/0x938
  iommu_release_device+0x9c/0x178
  iommu_release_device at drivers/iommu/iommu.c:300
  iommu_bus_notifier+0x118/0x160
  notifier_call_chain+0xa4/0x128
  __blocking_notifier_call_chain+0x70/0xa8
  blocking_notifier_call_chain+0x14/0x20
  device_del+0x618/0xa00
  pci_remove_bus_device+0x108/0x2d8
  pci_stop_and_remove_bus_device+0x1c/0x28
  pci_iov_remove_virtfn+0x228/0x368
  sriov_disable+0x8c/0x348
  pci_disable_sriov+0x5c/0x70
  mlx5_core_sriov_configure+0xd8/0x260 [mlx5_core]
  sriov_numvfs_store+0x240/0x318
  dev_attr_store+0x38/0x68
  sysfs_kf_write+0xdc/0x128
  kernfs_fop_write+0x23c/0x448
  __vfs_write+0x54/0xe8
  vfs_write+0x124/0x3f0
  ksys_write+0xe8/0x1b8
  __arm64_sys_write+0x68/0x98
  do_el0_svc+0x124/0x220
  el0_sync_handler+0x260/0x408
  el0_sync+0x140/0x180

 The buggy address belongs to the object at ffff0089df1a6e00
  which belongs to the cache kmalloc-512 of size 512
 The buggy address is located 360 bytes inside of
  512-byte region [ffff0089df1a6e00, ffff0089df1a7000)
 The buggy address belongs to the page:
 page:ffffffe02257c680 refcount:1 mapcount:0 mapping:0000000000000000 index:0xffff0089df1a1400
 flags: 0x7ffff800000200(slab)
 raw: 007ffff800000200 ffffffe02246b8c8 ffffffe02257ff88 ffff000000320680
 raw: ffff0089df1a1400 00000000002a000e 00000001ffffffff ffff0089df1a5001
 page dumped because: kasan: bad access detected
 page->mem_cgroup:ffff0089df1a5001

 Memory state around the buggy address:
  ffff0089df1a6e00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
  ffff0089df1a6e80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
 >ffff0089df1a6f00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
                                                           ^
  ffff0089df1a6f80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
  ffff0089df1a7000: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc

Fixes: a6a4c7e ("iommu: Add probe_device() and release_device() call-backs")
Signed-off-by: Qian Cai <cai@lca.pw>
Link: https://lore.kernel.org/r/20200704001003.2303-1-cai@lca.pw
Signed-off-by: Joerg Roedel <jroedel@suse.de>
pull bot pushed a commit that referenced this pull request Jul 16, 2020
devm_gpiod_get_index() doesn't return NULL but -ENOENT when the
requested GPIO doesn't exist,  leading to the following messages:

[    2.742468] gpiod_direction_input: invalid GPIO (errorpointer)
[    2.748147] can't set direction for gpio #2: -2
[    2.753081] gpiod_direction_input: invalid GPIO (errorpointer)
[    2.758724] can't set direction for gpio #3: -2
[    2.763666] gpiod_direction_output: invalid GPIO (errorpointer)
[    2.769394] can't set direction for gpio #4: -2
[    2.774341] gpiod_direction_input: invalid GPIO (errorpointer)
[    2.779981] can't set direction for gpio #5: -2
[    2.784545] ff000a20.serial: ttyCPM1 at MMIO 0xfff00a20 (irq = 39, base_baud = 8250000) is a CPM UART

Use devm_gpiod_get_index_optional() instead.

At the same time, handle the error case and properly exit
with an error.

Fixes: 97cbaf2 ("tty: serial: cpm_uart: Convert to use GPIO descriptors")
Cc: stable@vger.kernel.org
Cc: Linus Walleij <linus.walleij@linaro.org>
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Reviewed-by: Linus Walleij <linus.walleij@linaro.org>
Link: https://lore.kernel.org/r/694a25fdce548c5ee8b060ef6a4b02746b8f25c0.1591986307.git.christophe.leroy@csgroup.eu
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
pull bot pushed a commit that referenced this pull request Jul 31, 2020
This patch fixes a race condition that causes a use-after-free during
amdgpu_dm_atomic_commit_tail. This can occur when 2 non-blocking commits
are requested and the second one finishes before the first. Essentially,
this bug occurs when the following sequence of events happens:

1. Non-blocking commit #1 is requested w/ a new dm_state #1 and is
deferred to the workqueue.

2. Non-blocking commit #2 is requested w/ a new dm_state #2 and is
deferred to the workqueue.

3. Commit #2 starts before commit #1, dm_state #1 is used in the
commit_tail and commit #2 completes, freeing dm_state #1.

4. Commit #1 starts after commit #2 completes, uses the freed dm_state
1 and dereferences a freelist pointer while setting the context.

Since this bug has only been spotted with fast commits, this patch fixes
the bug by clearing the dm_state instead of using the old dc_state for
fast updates. In addition, since dm_state is only used for its dc_state
and amdgpu_dm_atomic_commit_tail will retain the dc_state if none is found,
removing the dm_state should not have any consequences in fast updates.

This use-after-free bug has existed for a while now, but only caused a
noticeable issue starting from 5.7-rc1 due to 3202fa6 ("slub: relocate
freelist pointer to middle of object") moving the freelist pointer from
dm_state->base (which was unused) to dm_state->context (which is
dereferenced).

Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=207383
Fixes: bd200d1 ("drm/amd/display: Don't replace the dc_state for fast updates")
Reported-by: Duncan <1i5t5.duncan@cox.net>
Signed-off-by: Mazin Rezk <mnrzk@protonmail.com>
Reviewed-by: Nicholas Kazlauskas <nicholas.kazlauskas@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Cc: stable@vger.kernel.org
pull bot pushed a commit that referenced this pull request Aug 2, 2020
Ido Schimmel says:

====================
mlxsw fixes

This patch set contains various fixes for mlxsw.

Patches #1-#2 fix two trap related issues introduced in previous cycle.

Patches #3-#5 fix rare use-after-frees discovered by syzkaller. After
over a week of fuzzing with the fixes, the bugs did not reproduce.

Patch #6 from Amit fixes an issue in the ethtool selftest that was
recently discovered after running the test on a new platform that
supports only 1Gbps and 10Gbps speeds.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
pull bot pushed a commit that referenced this pull request Aug 2, 2020
I compiled with AddressSanitizer and I had these memory leaks while I
was using the tep_parse_format function:

    Direct leak of 28 byte(s) in 4 object(s) allocated from:
        #0 0x7fb07db49ffe in __interceptor_realloc (/lib/x86_64-linux-gnu/libasan.so.5+0x10dffe)
        #1 0x7fb07a724228 in extend_token /home/pduplessis/repo/linux/tools/lib/traceevent/event-parse.c:985
        #2 0x7fb07a724c21 in __read_token /home/pduplessis/repo/linux/tools/lib/traceevent/event-parse.c:1140
        #3 0x7fb07a724f78 in read_token /home/pduplessis/repo/linux/tools/lib/traceevent/event-parse.c:1206
        #4 0x7fb07a725191 in __read_expect_type /home/pduplessis/repo/linux/tools/lib/traceevent/event-parse.c:1291
        #5 0x7fb07a7251df in read_expect_type /home/pduplessis/repo/linux/tools/lib/traceevent/event-parse.c:1299
        #6 0x7fb07a72e6c8 in process_dynamic_array_len /home/pduplessis/repo/linux/tools/lib/traceevent/event-parse.c:2849
        #7 0x7fb07a7304b8 in process_function /home/pduplessis/repo/linux/tools/lib/traceevent/event-parse.c:3161
        #8 0x7fb07a730900 in process_arg_token /home/pduplessis/repo/linux/tools/lib/traceevent/event-parse.c:3207
        #9 0x7fb07a727c0b in process_arg /home/pduplessis/repo/linux/tools/lib/traceevent/event-parse.c:1786
        #10 0x7fb07a731080 in event_read_print_args /home/pduplessis/repo/linux/tools/lib/traceevent/event-parse.c:3285
        #11 0x7fb07a731722 in event_read_print /home/pduplessis/repo/linux/tools/lib/traceevent/event-parse.c:3369
        #12 0x7fb07a740054 in __tep_parse_format /home/pduplessis/repo/linux/tools/lib/traceevent/event-parse.c:6335
        #13 0x7fb07a74047a in __parse_event /home/pduplessis/repo/linux/tools/lib/traceevent/event-parse.c:6389
        #14 0x7fb07a740536 in tep_parse_format /home/pduplessis/repo/linux/tools/lib/traceevent/event-parse.c:6431
        #15 0x7fb07a785acf in parse_event ../../../src/fs-src/fs.c:251
        #16 0x7fb07a785ccd in parse_systems ../../../src/fs-src/fs.c:284
        #17 0x7fb07a786fb3 in read_metadata ../../../src/fs-src/fs.c:593
        #18 0x7fb07a78760e in ftrace_fs_source_init ../../../src/fs-src/fs.c:727
        #19 0x7fb07d90c19c in add_component_with_init_method_data ../../../../src/lib/graph/graph.c:1048
        #20 0x7fb07d90c87b in add_source_component_with_initialize_method_data ../../../../src/lib/graph/graph.c:1127
        #21 0x7fb07d90c92a in bt_graph_add_source_component ../../../../src/lib/graph/graph.c:1152
        #22 0x55db11aa632e in cmd_run_ctx_create_components_from_config_components ../../../src/cli/babeltrace2.c:2252
        #23 0x55db11aa6fda in cmd_run_ctx_create_components ../../../src/cli/babeltrace2.c:2347
        #24 0x55db11aa780c in cmd_run ../../../src/cli/babeltrace2.c:2461
        #25 0x55db11aa8a7d in main ../../../src/cli/babeltrace2.c:2673
        #26 0x7fb07d5460b2 in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x270b2)

The token variable in the process_dynamic_array_len function is
allocated in the read_expect_type function, but is not freed before
calling the read_token function.

Free the token variable before calling read_token in order to plug the
leak.

Signed-off-by: Philippe Duplessis-Guindon <pduplessis@efficios.com>
Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Link: https://lore.kernel.org/linux-trace-devel/20200730150236.5392-1-pduplessis@efficios.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
pull bot pushed a commit that referenced this pull request Aug 3, 2020
Dave hit this splat during testing btrfs/078:

  ======================================================
  WARNING: possible circular locking dependency detected
  5.8.0-rc6-default+ #1191 Not tainted
  ------------------------------------------------------
  kswapd0/75 is trying to acquire lock:
  ffffa040e9d04ff8 (&delayed_node->mutex){+.+.}-{3:3}, at: __btrfs_release_delayed_node.part.0+0x3f/0x310 [btrfs]

  but task is already holding lock:
  ffffffff8b0c8040 (fs_reclaim){+.+.}-{0:0}, at: __fs_reclaim_acquire+0x5/0x30

  which lock already depends on the new lock.

  the existing dependency chain (in reverse order) is:

  -> #2 (fs_reclaim){+.+.}-{0:0}:
	 __lock_acquire+0x56f/0xaa0
	 lock_acquire+0xa3/0x440
	 fs_reclaim_acquire.part.0+0x25/0x30
	 __kmalloc_track_caller+0x49/0x330
	 kstrdup+0x2e/0x60
	 __kernfs_new_node.constprop.0+0x44/0x250
	 kernfs_new_node+0x25/0x50
	 kernfs_create_link+0x34/0xa0
	 sysfs_do_create_link_sd+0x5e/0xd0
	 btrfs_sysfs_add_devices_dir+0x65/0x100 [btrfs]
	 btrfs_init_new_device+0x44c/0x12b0 [btrfs]
	 btrfs_ioctl+0xc3c/0x25c0 [btrfs]
	 ksys_ioctl+0x68/0xa0
	 __x64_sys_ioctl+0x16/0x20
	 do_syscall_64+0x50/0xe0
	 entry_SYSCALL_64_after_hwframe+0x44/0xa9

  -> #1 (&fs_info->chunk_mutex){+.+.}-{3:3}:
	 __lock_acquire+0x56f/0xaa0
	 lock_acquire+0xa3/0x440
	 __mutex_lock+0xa0/0xaf0
	 btrfs_chunk_alloc+0x137/0x3e0 [btrfs]
	 find_free_extent+0xb44/0xfb0 [btrfs]
	 btrfs_reserve_extent+0x9b/0x180 [btrfs]
	 btrfs_alloc_tree_block+0xc1/0x350 [btrfs]
	 alloc_tree_block_no_bg_flush+0x4a/0x60 [btrfs]
	 __btrfs_cow_block+0x143/0x7a0 [btrfs]
	 btrfs_cow_block+0x15f/0x310 [btrfs]
	 push_leaf_right+0x150/0x240 [btrfs]
	 split_leaf+0x3cd/0x6d0 [btrfs]
	 btrfs_search_slot+0xd14/0xf70 [btrfs]
	 btrfs_insert_empty_items+0x64/0xc0 [btrfs]
	 __btrfs_commit_inode_delayed_items+0xb2/0x840 [btrfs]
	 btrfs_async_run_delayed_root+0x10e/0x1d0 [btrfs]
	 btrfs_work_helper+0x2f9/0x650 [btrfs]
	 process_one_work+0x22c/0x600
	 worker_thread+0x50/0x3b0
	 kthread+0x137/0x150
	 ret_from_fork+0x1f/0x30

  -> #0 (&delayed_node->mutex){+.+.}-{3:3}:
	 check_prev_add+0x98/0xa20
	 validate_chain+0xa8c/0x2a00
	 __lock_acquire+0x56f/0xaa0
	 lock_acquire+0xa3/0x440
	 __mutex_lock+0xa0/0xaf0
	 __btrfs_release_delayed_node.part.0+0x3f/0x310 [btrfs]
	 btrfs_evict_inode+0x3bf/0x560 [btrfs]
	 evict+0xd6/0x1c0
	 dispose_list+0x48/0x70
	 prune_icache_sb+0x54/0x80
	 super_cache_scan+0x121/0x1a0
	 do_shrink_slab+0x175/0x420
	 shrink_slab+0xb1/0x2e0
	 shrink_node+0x192/0x600
	 balance_pgdat+0x31f/0x750
	 kswapd+0x206/0x510
	 kthread+0x137/0x150
	 ret_from_fork+0x1f/0x30

  other info that might help us debug this:

  Chain exists of:
    &delayed_node->mutex --> &fs_info->chunk_mutex --> fs_reclaim

   Possible unsafe locking scenario:

	 CPU0                    CPU1
	 ----                    ----
    lock(fs_reclaim);
				 lock(&fs_info->chunk_mutex);
				 lock(fs_reclaim);
    lock(&delayed_node->mutex);

   *** DEADLOCK ***

  3 locks held by kswapd0/75:
   #0: ffffffff8b0c8040 (fs_reclaim){+.+.}-{0:0}, at: __fs_reclaim_acquire+0x5/0x30
   #1: ffffffff8b0b50b8 (shrinker_rwsem){++++}-{3:3}, at: shrink_slab+0x54/0x2e0
   #2: ffffa040e057c0e8 (&type->s_umount_key#26){++++}-{3:3}, at: trylock_super+0x16/0x50

  stack backtrace:
  CPU: 2 PID: 75 Comm: kswapd0 Not tainted 5.8.0-rc6-default+ #1191
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba527-rebuilt.opensuse.org 04/01/2014
  Call Trace:
   dump_stack+0x78/0xa0
   check_noncircular+0x16f/0x190
   check_prev_add+0x98/0xa20
   validate_chain+0xa8c/0x2a00
   __lock_acquire+0x56f/0xaa0
   lock_acquire+0xa3/0x440
   ? __btrfs_release_delayed_node.part.0+0x3f/0x310 [btrfs]
   __mutex_lock+0xa0/0xaf0
   ? __btrfs_release_delayed_node.part.0+0x3f/0x310 [btrfs]
   ? __lock_acquire+0x56f/0xaa0
   ? __btrfs_release_delayed_node.part.0+0x3f/0x310 [btrfs]
   ? lock_acquire+0xa3/0x440
   ? btrfs_evict_inode+0x138/0x560 [btrfs]
   ? btrfs_evict_inode+0x2fe/0x560 [btrfs]
   ? __btrfs_release_delayed_node.part.0+0x3f/0x310 [btrfs]
   __btrfs_release_delayed_node.part.0+0x3f/0x310 [btrfs]
   btrfs_evict_inode+0x3bf/0x560 [btrfs]
   evict+0xd6/0x1c0
   dispose_list+0x48/0x70
   prune_icache_sb+0x54/0x80
   super_cache_scan+0x121/0x1a0
   do_shrink_slab+0x175/0x420
   shrink_slab+0xb1/0x2e0
   shrink_node+0x192/0x600
   balance_pgdat+0x31f/0x750
   kswapd+0x206/0x510
   ? _raw_spin_unlock_irqrestore+0x3e/0x50
   ? finish_wait+0x90/0x90
   ? balance_pgdat+0x750/0x750
   kthread+0x137/0x150
   ? kthread_stop+0x2a0/0x2a0
   ret_from_fork+0x1f/0x30

This is because we're holding the chunk_mutex while adding this device
and adding its sysfs entries.  We actually hold different locks in
different places when calling this function, the dev_replace semaphore
for instance in dev replace, so instead of moving this call around
simply wrap it's operations in NOFS.

CC: stable@vger.kernel.org # 4.14+
Reported-by: David Sterba <dsterba@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
pull bot pushed a commit that referenced this pull request Aug 3, 2020
There's long existed a lockdep splat because we open our bdev's under
the ->device_list_mutex at mount time, which acquires the bd_mutex.
Usually this goes unnoticed, but if you do loopback devices at all
suddenly the bd_mutex comes with a whole host of other dependencies,
which results in the splat when you mount a btrfs file system.

======================================================
WARNING: possible circular locking dependency detected
5.8.0-0.rc3.1.fc33.x86_64+debug #1 Not tainted
------------------------------------------------------
systemd-journal/509 is trying to acquire lock:
ffff970831f84db0 (&fs_info->reloc_mutex){+.+.}-{3:3}, at: btrfs_record_root_in_trans+0x44/0x70 [btrfs]

but task is already holding lock:
ffff97083144d598 (sb_pagefaults){.+.+}-{0:0}, at: btrfs_page_mkwrite+0x59/0x560 [btrfs]

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

 -> #6 (sb_pagefaults){.+.+}-{0:0}:
       __sb_start_write+0x13e/0x220
       btrfs_page_mkwrite+0x59/0x560 [btrfs]
       do_page_mkwrite+0x4f/0x130
       do_wp_page+0x3b0/0x4f0
       handle_mm_fault+0xf47/0x1850
       do_user_addr_fault+0x1fc/0x4b0
       exc_page_fault+0x88/0x300
       asm_exc_page_fault+0x1e/0x30

 -> #5 (&mm->mmap_lock#2){++++}-{3:3}:
       __might_fault+0x60/0x80
       _copy_from_user+0x20/0xb0
       get_sg_io_hdr+0x9a/0xb0
       scsi_cmd_ioctl+0x1ea/0x2f0
       cdrom_ioctl+0x3c/0x12b4
       sr_block_ioctl+0xa4/0xd0
       block_ioctl+0x3f/0x50
       ksys_ioctl+0x82/0xc0
       __x64_sys_ioctl+0x16/0x20
       do_syscall_64+0x52/0xb0
       entry_SYSCALL_64_after_hwframe+0x44/0xa9

 -> #4 (&cd->lock){+.+.}-{3:3}:
       __mutex_lock+0x7b/0x820
       sr_block_open+0xa2/0x180
       __blkdev_get+0xdd/0x550
       blkdev_get+0x38/0x150
       do_dentry_open+0x16b/0x3e0
       path_openat+0x3c9/0xa00
       do_filp_open+0x75/0x100
       do_sys_openat2+0x8a/0x140
       __x64_sys_openat+0x46/0x70
       do_syscall_64+0x52/0xb0
       entry_SYSCALL_64_after_hwframe+0x44/0xa9

 -> #3 (&bdev->bd_mutex){+.+.}-{3:3}:
       __mutex_lock+0x7b/0x820
       __blkdev_get+0x6a/0x550
       blkdev_get+0x85/0x150
       blkdev_get_by_path+0x2c/0x70
       btrfs_get_bdev_and_sb+0x1b/0xb0 [btrfs]
       open_fs_devices+0x88/0x240 [btrfs]
       btrfs_open_devices+0x92/0xa0 [btrfs]
       btrfs_mount_root+0x250/0x490 [btrfs]
       legacy_get_tree+0x30/0x50
       vfs_get_tree+0x28/0xc0
       vfs_kern_mount.part.0+0x71/0xb0
       btrfs_mount+0x119/0x380 [btrfs]
       legacy_get_tree+0x30/0x50
       vfs_get_tree+0x28/0xc0
       do_mount+0x8c6/0xca0
       __x64_sys_mount+0x8e/0xd0
       do_syscall_64+0x52/0xb0
       entry_SYSCALL_64_after_hwframe+0x44/0xa9

 -> #2 (&fs_devs->device_list_mutex){+.+.}-{3:3}:
       __mutex_lock+0x7b/0x820
       btrfs_run_dev_stats+0x36/0x420 [btrfs]
       commit_cowonly_roots+0x91/0x2d0 [btrfs]
       btrfs_commit_transaction+0x4e6/0x9f0 [btrfs]
       btrfs_sync_file+0x38a/0x480 [btrfs]
       __x64_sys_fdatasync+0x47/0x80
       do_syscall_64+0x52/0xb0
       entry_SYSCALL_64_after_hwframe+0x44/0xa9

 -> #1 (&fs_info->tree_log_mutex){+.+.}-{3:3}:
       __mutex_lock+0x7b/0x820
       btrfs_commit_transaction+0x48e/0x9f0 [btrfs]
       btrfs_sync_file+0x38a/0x480 [btrfs]
       __x64_sys_fdatasync+0x47/0x80
       do_syscall_64+0x52/0xb0
       entry_SYSCALL_64_after_hwframe+0x44/0xa9

 -> #0 (&fs_info->reloc_mutex){+.+.}-{3:3}:
       __lock_acquire+0x1241/0x20c0
       lock_acquire+0xb0/0x400
       __mutex_lock+0x7b/0x820
       btrfs_record_root_in_trans+0x44/0x70 [btrfs]
       start_transaction+0xd2/0x500 [btrfs]
       btrfs_dirty_inode+0x44/0xd0 [btrfs]
       file_update_time+0xc6/0x120
       btrfs_page_mkwrite+0xda/0x560 [btrfs]
       do_page_mkwrite+0x4f/0x130
       do_wp_page+0x3b0/0x4f0
       handle_mm_fault+0xf47/0x1850
       do_user_addr_fault+0x1fc/0x4b0
       exc_page_fault+0x88/0x300
       asm_exc_page_fault+0x1e/0x30

other info that might help us debug this:

Chain exists of:
  &fs_info->reloc_mutex --> &mm->mmap_lock#2 --> sb_pagefaults

Possible unsafe locking scenario:

     CPU0                    CPU1
     ----                    ----
 lock(sb_pagefaults);
                             lock(&mm->mmap_lock#2);
                             lock(sb_pagefaults);
 lock(&fs_info->reloc_mutex);

 *** DEADLOCK ***

3 locks held by systemd-journal/509:
 #0: ffff97083bdec8b8 (&mm->mmap_lock#2){++++}-{3:3}, at: do_user_addr_fault+0x12e/0x4b0
 #1: ffff97083144d598 (sb_pagefaults){.+.+}-{0:0}, at: btrfs_page_mkwrite+0x59/0x560 [btrfs]
 #2: ffff97083144d6a8 (sb_internal){.+.+}-{0:0}, at: start_transaction+0x3f8/0x500 [btrfs]

stack backtrace:
CPU: 0 PID: 509 Comm: systemd-journal Not tainted 5.8.0-0.rc3.1.fc33.x86_64+debug #1
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
Call Trace:
 dump_stack+0x92/0xc8
 check_noncircular+0x134/0x150
 __lock_acquire+0x1241/0x20c0
 lock_acquire+0xb0/0x400
 ? btrfs_record_root_in_trans+0x44/0x70 [btrfs]
 ? lock_acquire+0xb0/0x400
 ? btrfs_record_root_in_trans+0x44/0x70 [btrfs]
 __mutex_lock+0x7b/0x820
 ? btrfs_record_root_in_trans+0x44/0x70 [btrfs]
 ? kvm_sched_clock_read+0x14/0x30
 ? sched_clock+0x5/0x10
 ? sched_clock_cpu+0xc/0xb0
 btrfs_record_root_in_trans+0x44/0x70 [btrfs]
 start_transaction+0xd2/0x500 [btrfs]
 btrfs_dirty_inode+0x44/0xd0 [btrfs]
 file_update_time+0xc6/0x120
 btrfs_page_mkwrite+0xda/0x560 [btrfs]
 ? sched_clock+0x5/0x10
 do_page_mkwrite+0x4f/0x130
 do_wp_page+0x3b0/0x4f0
 handle_mm_fault+0xf47/0x1850
 do_user_addr_fault+0x1fc/0x4b0
 exc_page_fault+0x88/0x300
 ? asm_exc_page_fault+0x8/0x30
 asm_exc_page_fault+0x1e/0x30
RIP: 0033:0x7fa3972fdbfe
Code: Bad RIP value.

Fix this by not holding the ->device_list_mutex at this point.  The
device_list_mutex exists to protect us from modifying the device list
while the file system is running.

However it can also be modified by doing a scan on a device.  But this
action is specifically protected by the uuid_mutex, which we are holding
here.  We cannot race with opening at this point because we have the
->s_mount lock held during the mount.  Not having the
->device_list_mutex here is perfectly safe as we're not going to change
the devices at this point.

CC: stable@vger.kernel.org # 4.19+
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add some comments ]
Signed-off-by: David Sterba <dsterba@suse.com>
pull bot pushed a commit that referenced this pull request Aug 3, 2020
We are currently getting this lockdep splat in btrfs/161:

  ======================================================
  WARNING: possible circular locking dependency detected
  5.8.0-rc5+ #20 Tainted: G            E
  ------------------------------------------------------
  mount/678048 is trying to acquire lock:
  ffff9b769f15b6e0 (&fs_devs->device_list_mutex){+.+.}-{3:3}, at: clone_fs_devices+0x4d/0x170 [btrfs]

  but task is already holding lock:
  ffff9b76abdb08d0 (&fs_info->chunk_mutex){+.+.}-{3:3}, at: btrfs_read_chunk_tree+0x6a/0x800 [btrfs]

  which lock already depends on the new lock.

  the existing dependency chain (in reverse order) is:

  -> #1 (&fs_info->chunk_mutex){+.+.}-{3:3}:
	 __mutex_lock+0x8b/0x8f0
	 btrfs_init_new_device+0x2d2/0x1240 [btrfs]
	 btrfs_ioctl+0x1de/0x2d20 [btrfs]
	 ksys_ioctl+0x87/0xc0
	 __x64_sys_ioctl+0x16/0x20
	 do_syscall_64+0x52/0xb0
	 entry_SYSCALL_64_after_hwframe+0x44/0xa9

  -> #0 (&fs_devs->device_list_mutex){+.+.}-{3:3}:
	 __lock_acquire+0x1240/0x2460
	 lock_acquire+0xab/0x360
	 __mutex_lock+0x8b/0x8f0
	 clone_fs_devices+0x4d/0x170 [btrfs]
	 btrfs_read_chunk_tree+0x330/0x800 [btrfs]
	 open_ctree+0xb7c/0x18ce [btrfs]
	 btrfs_mount_root.cold+0x13/0xfa [btrfs]
	 legacy_get_tree+0x30/0x50
	 vfs_get_tree+0x28/0xc0
	 fc_mount+0xe/0x40
	 vfs_kern_mount.part.0+0x71/0x90
	 btrfs_mount+0x13b/0x3e0 [btrfs]
	 legacy_get_tree+0x30/0x50
	 vfs_get_tree+0x28/0xc0
	 do_mount+0x7de/0xb30
	 __x64_sys_mount+0x8e/0xd0
	 do_syscall_64+0x52/0xb0
	 entry_SYSCALL_64_after_hwframe+0x44/0xa9

  other info that might help us debug this:

   Possible unsafe locking scenario:

	 CPU0                    CPU1
	 ----                    ----
    lock(&fs_info->chunk_mutex);
				 lock(&fs_devs->device_list_mutex);
				 lock(&fs_info->chunk_mutex);
    lock(&fs_devs->device_list_mutex);

   *** DEADLOCK ***

  3 locks held by mount/678048:
   #0: ffff9b75ff5fb0e0 (&type->s_umount_key#63/1){+.+.}-{3:3}, at: alloc_super+0xb5/0x380
   #1: ffffffffc0c2fbc8 (uuid_mutex){+.+.}-{3:3}, at: btrfs_read_chunk_tree+0x54/0x800 [btrfs]
   #2: ffff9b76abdb08d0 (&fs_info->chunk_mutex){+.+.}-{3:3}, at: btrfs_read_chunk_tree+0x6a/0x800 [btrfs]

  stack backtrace:
  CPU: 2 PID: 678048 Comm: mount Tainted: G            E     5.8.0-rc5+ #20
  Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./890FX Deluxe5, BIOS P1.40 05/03/2011
  Call Trace:
   dump_stack+0x96/0xd0
   check_noncircular+0x162/0x180
   __lock_acquire+0x1240/0x2460
   ? asm_sysvec_apic_timer_interrupt+0x12/0x20
   lock_acquire+0xab/0x360
   ? clone_fs_devices+0x4d/0x170 [btrfs]
   __mutex_lock+0x8b/0x8f0
   ? clone_fs_devices+0x4d/0x170 [btrfs]
   ? rcu_read_lock_sched_held+0x52/0x60
   ? cpumask_next+0x16/0x20
   ? module_assert_mutex_or_preempt+0x14/0x40
   ? __module_address+0x28/0xf0
   ? clone_fs_devices+0x4d/0x170 [btrfs]
   ? static_obj+0x4f/0x60
   ? lockdep_init_map_waits+0x43/0x200
   ? clone_fs_devices+0x4d/0x170 [btrfs]
   clone_fs_devices+0x4d/0x170 [btrfs]
   btrfs_read_chunk_tree+0x330/0x800 [btrfs]
   open_ctree+0xb7c/0x18ce [btrfs]
   ? super_setup_bdi_name+0x79/0xd0
   btrfs_mount_root.cold+0x13/0xfa [btrfs]
   ? vfs_parse_fs_string+0x84/0xb0
   ? rcu_read_lock_sched_held+0x52/0x60
   ? kfree+0x2b5/0x310
   legacy_get_tree+0x30/0x50
   vfs_get_tree+0x28/0xc0
   fc_mount+0xe/0x40
   vfs_kern_mount.part.0+0x71/0x90
   btrfs_mount+0x13b/0x3e0 [btrfs]
   ? cred_has_capability+0x7c/0x120
   ? rcu_read_lock_sched_held+0x52/0x60
   ? legacy_get_tree+0x30/0x50
   legacy_get_tree+0x30/0x50
   vfs_get_tree+0x28/0xc0
   do_mount+0x7de/0xb30
   ? memdup_user+0x4e/0x90
   __x64_sys_mount+0x8e/0xd0
   do_syscall_64+0x52/0xb0
   entry_SYSCALL_64_after_hwframe+0x44/0xa9

This is because btrfs_read_chunk_tree() can come upon DEV_EXTENT's and
then read the device, which takes the device_list_mutex.  The
device_list_mutex needs to be taken before the chunk_mutex, so this is a
problem.  We only really need the chunk mutex around adding the chunk,
so move the mutex around read_one_chunk.

An argument could be made that we don't even need the chunk_mutex here
as it's during mount, and we are protected by various other locks.
However we already have special rules for ->device_list_mutex, and I'd
rather not have another special case for ->chunk_mutex.

CC: stable@vger.kernel.org # 4.19+
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
pull bot pushed a commit that referenced this pull request Aug 3, 2020
When running with -o enospc_debug you can get the following splat if one
of the dump_space_info's trip

  ======================================================
  WARNING: possible circular locking dependency detected
  5.8.0-rc5+ #20 Tainted: G           OE
  ------------------------------------------------------
  dd/563090 is trying to acquire lock:
  ffff9e7dbf4f1e18 (&ctl->tree_lock){+.+.}-{2:2}, at: btrfs_dump_free_space+0x2b/0xa0 [btrfs]

  but task is already holding lock:
  ffff9e7e2284d428 (&cache->lock){+.+.}-{2:2}, at: btrfs_dump_space_info+0xaa/0x120 [btrfs]

  which lock already depends on the new lock.

  the existing dependency chain (in reverse order) is:

  -> #3 (&cache->lock){+.+.}-{2:2}:
	 _raw_spin_lock+0x25/0x30
	 btrfs_add_reserved_bytes+0x3c/0x3c0 [btrfs]
	 find_free_extent+0x7ef/0x13b0 [btrfs]
	 btrfs_reserve_extent+0x9b/0x180 [btrfs]
	 btrfs_alloc_tree_block+0xc1/0x340 [btrfs]
	 alloc_tree_block_no_bg_flush+0x4a/0x60 [btrfs]
	 __btrfs_cow_block+0x122/0x530 [btrfs]
	 btrfs_cow_block+0x106/0x210 [btrfs]
	 commit_cowonly_roots+0x55/0x300 [btrfs]
	 btrfs_commit_transaction+0x4ed/0xac0 [btrfs]
	 sync_filesystem+0x74/0x90
	 generic_shutdown_super+0x22/0x100
	 kill_anon_super+0x14/0x30
	 btrfs_kill_super+0x12/0x20 [btrfs]
	 deactivate_locked_super+0x36/0x70
	 cleanup_mnt+0x104/0x160
	 task_work_run+0x5f/0x90
	 __prepare_exit_to_usermode+0x1bd/0x1c0
	 do_syscall_64+0x5e/0xb0
	 entry_SYSCALL_64_after_hwframe+0x44/0xa9

  -> #2 (&space_info->lock){+.+.}-{2:2}:
	 _raw_spin_lock+0x25/0x30
	 btrfs_block_rsv_release+0x1a6/0x3f0 [btrfs]
	 btrfs_inode_rsv_release+0x4f/0x170 [btrfs]
	 btrfs_clear_delalloc_extent+0x155/0x480 [btrfs]
	 clear_state_bit+0x81/0x1a0 [btrfs]
	 __clear_extent_bit+0x25c/0x5d0 [btrfs]
	 clear_extent_bit+0x15/0x20 [btrfs]
	 btrfs_invalidatepage+0x2b7/0x3c0 [btrfs]
	 truncate_cleanup_page+0x47/0xe0
	 truncate_inode_pages_range+0x238/0x840
	 truncate_pagecache+0x44/0x60
	 btrfs_setattr+0x202/0x5e0 [btrfs]
	 notify_change+0x33b/0x490
	 do_truncate+0x76/0xd0
	 path_openat+0x687/0xa10
	 do_filp_open+0x91/0x100
	 do_sys_openat2+0x215/0x2d0
	 do_sys_open+0x44/0x80
	 do_syscall_64+0x52/0xb0
	 entry_SYSCALL_64_after_hwframe+0x44/0xa9

  -> #1 (&tree->lock#2){+.+.}-{2:2}:
	 _raw_spin_lock+0x25/0x30
	 find_first_extent_bit+0x32/0x150 [btrfs]
	 write_pinned_extent_entries.isra.0+0xc5/0x100 [btrfs]
	 __btrfs_write_out_cache+0x172/0x480 [btrfs]
	 btrfs_write_out_cache+0x7a/0xf0 [btrfs]
	 btrfs_write_dirty_block_groups+0x286/0x3b0 [btrfs]
	 commit_cowonly_roots+0x245/0x300 [btrfs]
	 btrfs_commit_transaction+0x4ed/0xac0 [btrfs]
	 close_ctree+0xf9/0x2f5 [btrfs]
	 generic_shutdown_super+0x6c/0x100
	 kill_anon_super+0x14/0x30
	 btrfs_kill_super+0x12/0x20 [btrfs]
	 deactivate_locked_super+0x36/0x70
	 cleanup_mnt+0x104/0x160
	 task_work_run+0x5f/0x90
	 __prepare_exit_to_usermode+0x1bd/0x1c0
	 do_syscall_64+0x5e/0xb0
	 entry_SYSCALL_64_after_hwframe+0x44/0xa9

  -> #0 (&ctl->tree_lock){+.+.}-{2:2}:
	 __lock_acquire+0x1240/0x2460
	 lock_acquire+0xab/0x360
	 _raw_spin_lock+0x25/0x30
	 btrfs_dump_free_space+0x2b/0xa0 [btrfs]
	 btrfs_dump_space_info+0xf4/0x120 [btrfs]
	 btrfs_reserve_extent+0x176/0x180 [btrfs]
	 __btrfs_prealloc_file_range+0x145/0x550 [btrfs]
	 cache_save_setup+0x28d/0x3b0 [btrfs]
	 btrfs_start_dirty_block_groups+0x1fc/0x4f0 [btrfs]
	 btrfs_commit_transaction+0xcc/0xac0 [btrfs]
	 btrfs_alloc_data_chunk_ondemand+0x162/0x4c0 [btrfs]
	 btrfs_check_data_free_space+0x4c/0xa0 [btrfs]
	 btrfs_buffered_write.isra.0+0x19b/0x740 [btrfs]
	 btrfs_file_write_iter+0x3cf/0x610 [btrfs]
	 new_sync_write+0x11e/0x1b0
	 vfs_write+0x1c9/0x200
	 ksys_write+0x68/0xe0
	 do_syscall_64+0x52/0xb0
	 entry_SYSCALL_64_after_hwframe+0x44/0xa9

  other info that might help us debug this:

  Chain exists of:
    &ctl->tree_lock --> &space_info->lock --> &cache->lock

   Possible unsafe locking scenario:

	 CPU0                    CPU1
	 ----                    ----
    lock(&cache->lock);
				 lock(&space_info->lock);
				 lock(&cache->lock);
    lock(&ctl->tree_lock);

   *** DEADLOCK ***

  6 locks held by dd/563090:
   #0: ffff9e7e21d18448 (sb_writers#14){.+.+}-{0:0}, at: vfs_write+0x195/0x200
   #1: ffff9e7dd0410ed8 (&sb->s_type->i_mutex_key#19){++++}-{3:3}, at: btrfs_file_write_iter+0x86/0x610 [btrfs]
   #2: ffff9e7e21d18638 (sb_internal#2){.+.+}-{0:0}, at: start_transaction+0x40b/0x5b0 [btrfs]
   #3: ffff9e7e1f05d688 (&cur_trans->cache_write_mutex){+.+.}-{3:3}, at: btrfs_start_dirty_block_groups+0x158/0x4f0 [btrfs]
   #4: ffff9e7e2284ddb8 (&space_info->groups_sem){++++}-{3:3}, at: btrfs_dump_space_info+0x69/0x120 [btrfs]
   #5: ffff9e7e2284d428 (&cache->lock){+.+.}-{2:2}, at: btrfs_dump_space_info+0xaa/0x120 [btrfs]

  stack backtrace:
  CPU: 3 PID: 563090 Comm: dd Tainted: G           OE     5.8.0-rc5+ #20
  Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./890FX Deluxe5, BIOS P1.40 05/03/2011
  Call Trace:
   dump_stack+0x96/0xd0
   check_noncircular+0x162/0x180
   __lock_acquire+0x1240/0x2460
   ? wake_up_klogd.part.0+0x30/0x40
   lock_acquire+0xab/0x360
   ? btrfs_dump_free_space+0x2b/0xa0 [btrfs]
   _raw_spin_lock+0x25/0x30
   ? btrfs_dump_free_space+0x2b/0xa0 [btrfs]
   btrfs_dump_free_space+0x2b/0xa0 [btrfs]
   btrfs_dump_space_info+0xf4/0x120 [btrfs]
   btrfs_reserve_extent+0x176/0x180 [btrfs]
   __btrfs_prealloc_file_range+0x145/0x550 [btrfs]
   ? btrfs_qgroup_reserve_data+0x1d/0x60 [btrfs]
   cache_save_setup+0x28d/0x3b0 [btrfs]
   btrfs_start_dirty_block_groups+0x1fc/0x4f0 [btrfs]
   btrfs_commit_transaction+0xcc/0xac0 [btrfs]
   ? start_transaction+0xe0/0x5b0 [btrfs]
   btrfs_alloc_data_chunk_ondemand+0x162/0x4c0 [btrfs]
   btrfs_check_data_free_space+0x4c/0xa0 [btrfs]
   btrfs_buffered_write.isra.0+0x19b/0x740 [btrfs]
   ? ktime_get_coarse_real_ts64+0xa8/0xd0
   ? trace_hardirqs_on+0x1c/0xe0
   btrfs_file_write_iter+0x3cf/0x610 [btrfs]
   new_sync_write+0x11e/0x1b0
   vfs_write+0x1c9/0x200
   ksys_write+0x68/0xe0
   do_syscall_64+0x52/0xb0
   entry_SYSCALL_64_after_hwframe+0x44/0xa9

This is because we're holding the block_group->lock while trying to dump
the free space cache.  However we don't need this lock, we just need it
to read the values for the printk, so move the free space cache dumping
outside of the block group lock.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
pull bot pushed a commit that referenced this pull request Aug 5, 2020
Like syscall entry all architectures have similar and pointlessly different
code to handle pending work before returning from a syscall to user space.

  1) One-time syscall exit work:
      - rseq syscall exit
      - audit
      - syscall tracing
      - tracehook (single stepping)

  2) Preparatory work
      - Exit to user mode loop (common TIF handling).
      - Architecture specific one time work arch_exit_to_user_mode_prepare()
      - Address limit and lockdep checks
     
  3) Final transition (lockdep, tracing, context tracking, RCU). Invokes
     arch_exit_to_user_mode() to handle e.g. speculation mitigations

Provide a generic version based on the x86 code which has all the RCU and
instrumentation protections right.

Provide a variant for interrupt return to user mode as well which shares
the above #2 and #3 work items.

After syscall_exit_to_user_mode() and irqentry_exit_to_user_mode() the
architecture code just has to return to user space. The code after
returning from these functions must not be instrumented.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Kees Cook <keescook@chromium.org>
Link: https://lkml.kernel.org/r/20200722220519.613977173@linutronix.de
pull bot pushed a commit that referenced this pull request Aug 5, 2020
The following deadlock was captured. The first process is holding 'kernfs_mutex'
and hung by io. The io was staging in 'r1conf.pending_bio_list' of raid1 device,
this pending bio list would be flushed by second process 'md127_raid1', but
it was hung by 'kernfs_mutex'. Using sysfs_notify_dirent_safe() to replace
sysfs_notify() can fix it. There were other sysfs_notify() invoked from io
path, removed all of them.

 PID: 40430  TASK: ffff8ee9c8c65c40  CPU: 29  COMMAND: "probe_file"
  #0 [ffffb87c4df37260] __schedule at ffffffff9a8678ec
  #1 [ffffb87c4df372f8] schedule at ffffffff9a867f06
  #2 [ffffb87c4df37310] io_schedule at ffffffff9a0c73e6
  #3 [ffffb87c4df37328] __dta___xfs_iunpin_wait_3443 at ffffffffc03a4057 [xfs]
  #4 [ffffb87c4df373a0] xfs_iunpin_wait at ffffffffc03a6c79 [xfs]
  #5 [ffffb87c4df373b0] __dta_xfs_reclaim_inode_3357 at ffffffffc039a46c [xfs]
  #6 [ffffb87c4df37400] xfs_reclaim_inodes_ag at ffffffffc039a8b6 [xfs]
  #7 [ffffb87c4df37590] xfs_reclaim_inodes_nr at ffffffffc039bb33 [xfs]
  #8 [ffffb87c4df375b0] xfs_fs_free_cached_objects at ffffffffc03af0e9 [xfs]
  #9 [ffffb87c4df375c0] super_cache_scan at ffffffff9a287ec7
 #10 [ffffb87c4df37618] shrink_slab at ffffffff9a1efd93
 #11 [ffffb87c4df37700] shrink_node at ffffffff9a1f5968
 #12 [ffffb87c4df37788] do_try_to_free_pages at ffffffff9a1f5ea2
 #13 [ffffb87c4df377f0] try_to_free_mem_cgroup_pages at ffffffff9a1f6445
 #14 [ffffb87c4df37880] try_charge at ffffffff9a26cc5f
 #15 [ffffb87c4df37920] memcg_kmem_charge_memcg at ffffffff9a270f6a
 #16 [ffffb87c4df37958] new_slab at ffffffff9a251430
 #17 [ffffb87c4df379c0] ___slab_alloc at ffffffff9a251c85
 #18 [ffffb87c4df37a80] __slab_alloc at ffffffff9a25635d
 #19 [ffffb87c4df37ac0] kmem_cache_alloc at ffffffff9a251f89
 #20 [ffffb87c4df37b00] alloc_inode at ffffffff9a2a2b10
 #21 [ffffb87c4df37b20] iget_locked at ffffffff9a2a4854
 #22 [ffffb87c4df37b60] kernfs_get_inode at ffffffff9a311377
 #23 [ffffb87c4df37b80] kernfs_iop_lookup at ffffffff9a311e2b
 #24 [ffffb87c4df37ba8] lookup_slow at ffffffff9a290118
 #25 [ffffb87c4df37c10] walk_component at ffffffff9a291e83
 #26 [ffffb87c4df37c78] path_lookupat at ffffffff9a293619
 #27 [ffffb87c4df37cd8] filename_lookup at ffffffff9a2953af
 #28 [ffffb87c4df37de8] user_path_at_empty at ffffffff9a295566
 #29 [ffffb87c4df37e10] vfs_statx at ffffffff9a289787
 #30 [ffffb87c4df37e70] SYSC_newlstat at ffffffff9a289d5d
 #31 [ffffb87c4df37f18] sys_newlstat at ffffffff9a28a60e
 #32 [ffffb87c4df37f28] do_syscall_64 at ffffffff9a003949
 #33 [ffffb87c4df37f50] entry_SYSCALL_64_after_hwframe at ffffffff9aa001ad
     RIP: 00007f617a5f2905  RSP: 00007f607334f838  RFLAGS: 00000246
     RAX: ffffffffffffffda  RBX: 00007f6064044b20  RCX: 00007f617a5f2905
     RDX: 00007f6064044b20  RSI: 00007f6064044b20  RDI: 00007f6064005890
     RBP: 00007f6064044aa0   R8: 0000000000000030   R9: 000000000000011c
     R10: 0000000000000013  R11: 0000000000000246  R12: 00007f606417e6d0
     R13: 00007f6064044aa0  R14: 00007f6064044b10  R15: 00000000ffffffff
     ORIG_RAX: 0000000000000006  CS: 0033  SS: 002b

 PID: 927    TASK: ffff8f15ac5dbd80  CPU: 42  COMMAND: "md127_raid1"
  #0 [ffffb87c4df07b28] __schedule at ffffffff9a8678ec
  #1 [ffffb87c4df07bc0] schedule at ffffffff9a867f06
  #2 [ffffb87c4df07bd8] schedule_preempt_disabled at ffffffff9a86825e
  #3 [ffffb87c4df07be8] __mutex_lock at ffffffff9a869bcc
  #4 [ffffb87c4df07ca0] __mutex_lock_slowpath at ffffffff9a86a013
  #5 [ffffb87c4df07cb0] mutex_lock at ffffffff9a86a04f
  #6 [ffffb87c4df07cc8] kernfs_find_and_get_ns at ffffffff9a311d83
  #7 [ffffb87c4df07cf0] sysfs_notify at ffffffff9a314b3a
  #8 [ffffb87c4df07d18] md_update_sb at ffffffff9a688696
  #9 [ffffb87c4df07d98] md_update_sb at ffffffff9a6886d5
 #10 [ffffb87c4df07da8] md_check_recovery at ffffffff9a68ad9c
 #11 [ffffb87c4df07dd0] raid1d at ffffffffc01f0375 [raid1]
 #12 [ffffb87c4df07ea0] md_thread at ffffffff9a680348
 #13 [ffffb87c4df07f08] kthread at ffffffff9a0b8005
 #14 [ffffb87c4df07f50] ret_from_fork at ffffffff9aa00344

Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
pull bot pushed a commit that referenced this pull request Aug 7, 2020
The SADC component in JZ47xx SoCs provides support for touchscreen
operations (pen position and pen down pressure) in single-ended and
differential modes.

The touchscreen component of SADC takes a significant time to stabilize
after first receiving the clock and a delay of 50ms has been empirically
proven to be a safe value before data sampling can begin.

Of the known hardware to use this controller, GCW Zero and Anbernic RG-350
utilize the touchscreen mode by having their joystick(s) attached to the
X/Y positive/negative input pins.

JZ4770 and later SoCs introduce a low-level command feature. With it, up
to 32 commands can be programmed, each one corresponding to a sampling
job. It allows to change the low-voltage reference, the high-voltage
reference, have them connected to VCC, GND, or one of the X-/X+ or Y-/Y+
pins.

This patch introduces support for 6 stream-capable channels:
- channel #0 samples X+/GND
- channel #1 samples Y+/GND
- channel #2 samples X-/GND
- channel #3 samples Y-/GND
- channel #4 samples X+/X-
- channel #5 samples Y+/Y-

Being able to sample X-/GND and Y-/GND is useful on some devices, where
one joystick is connected to the X+/Y+ pins, and a second joystick is
connected to the X-/Y- pins.

All the boards which probe this driver have the interrupt provided from
Device Tree, with no need to handle a case where the IRQ was not provided.

Co-developed-by: Paul Cercueil <paul@crapouillou.net>
Signed-off-by: Paul Cercueil <paul@crapouillou.net>
Signed-off-by: Artur Rojek <contact@artur-rojek.eu>
Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
pull bot pushed a commit that referenced this pull request Aug 11, 2020
https://bugzilla.kernel.org/show_bug.cgi?id=208565

PID: 257    TASK: ecdd0000  CPU: 0   COMMAND: "init"
  #0 [<c0b420ec>] (__schedule) from [<c0b423c8>]
  #1 [<c0b423c8>] (schedule) from [<c0b459d4>]
  #2 [<c0b459d4>] (rwsem_down_read_failed) from [<c0b44fa0>]
  #3 [<c0b44fa0>] (down_read) from [<c044233c>]
  #4 [<c044233c>] (f2fs_truncate_blocks) from [<c0442890>]
  #5 [<c0442890>] (f2fs_truncate) from [<c044d408>]
  #6 [<c044d408>] (f2fs_evict_inode) from [<c030be18>]
  #7 [<c030be18>] (evict) from [<c030a558>]
  #8 [<c030a558>] (iput) from [<c047c600>]
  #9 [<c047c600>] (f2fs_sync_node_pages) from [<c0465414>]
 #10 [<c0465414>] (f2fs_write_checkpoint) from [<c04575f4>]
 #11 [<c04575f4>] (f2fs_sync_fs) from [<c0441918>]
 #12 [<c0441918>] (f2fs_do_sync_file) from [<c0441098>]
 #13 [<c0441098>] (f2fs_sync_file) from [<c0323fa0>]
 #14 [<c0323fa0>] (vfs_fsync_range) from [<c0324294>]
 #15 [<c0324294>] (do_fsync) from [<c0324014>]
 #16 [<c0324014>] (sys_fsync) from [<c0108bc0>]

This can be caused by flush_dirty_inode() in f2fs_sync_node_pages() where
iput() requires f2fs_lock_op() again resulting in livelock.

Reported-by: Zhiguo Niu <Zhiguo.Niu@unisoc.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
pull bot pushed a commit that referenced this pull request Aug 11, 2020
… set

We received an error report that perf-record caused 'Segmentation fault'
on a newly system (e.g. on the new installed ubuntu).

  (gdb) backtrace
  #0  __read_once_size (size=4, res=<synthetic pointer>, p=0x14) at /root/0-jinyao/acme/tools/include/linux/compiler.h:139
  #1  atomic_read (v=0x14) at /root/0-jinyao/acme/tools/include/asm/../../arch/x86/include/asm/atomic.h:28
  #2  refcount_read (r=0x14) at /root/0-jinyao/acme/tools/include/linux/refcount.h:65
  #3  perf_mmap__read_init (map=map@entry=0x0) at mmap.c:177
  #4  0x0000561ce5c0de39 in perf_evlist__poll_thread (arg=0x561ce68584d0) at util/sideband_evlist.c:62
  #5  0x00007fad78491609 in start_thread (arg=<optimized out>) at pthread_create.c:477
  #6  0x00007fad7823c103 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

The root cause is, evlist__add_bpf_sb_event() just returns 0 if
HAVE_LIBBPF_SUPPORT is not defined (inline function path). So it will
not create a valid evsel for side-band event.

But perf-record still creates BPF side band thread to process the
side-band event, then the error happpens.

We can reproduce this issue by removing the libelf-dev. e.g.
1. apt-get remove libelf-dev
2. perf record -a -- sleep 1

  root@test:~# ./perf record -a -- sleep 1
  perf: Segmentation fault
  Obtained 6 stack frames.
  ./perf(+0x28eee8) [0x5562d6ef6ee8]
  /lib/x86_64-linux-gnu/libc.so.6(+0x46210) [0x7fbfdc65f210]
  ./perf(+0x342e74) [0x5562d6faae74]
  ./perf(+0x257e39) [0x5562d6ebfe39]
  /lib/x86_64-linux-gnu/libpthread.so.0(+0x9609) [0x7fbfdc990609]
  /lib/x86_64-linux-gnu/libc.so.6(clone+0x43) [0x7fbfdc73b103]
  Segmentation fault (core dumped)

To fix this issue,

1. We either install the missing libraries to let HAVE_LIBBPF_SUPPORT
   be defined.
   e.g. apt-get install libelf-dev and install other related libraries.

2. Use this patch to skip the side-band event setup if HAVE_LIBBPF_SUPPORT
   is not set.

Committer notes:

The side band thread is not used just with BPF, it is also used with
--switch-output-event, so narrow the ifdef to the BPF specific part.

Fixes: 23cbb41 ("perf record: Move side band evlist setup to separate routine")
Signed-off-by: Jin Yao <yao.jin@linux.intel.com>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Jin Yao <yao.jin@intel.com>
Cc: Kan Liang <kan.liang@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lore.kernel.org/lkml/20200805022937.29184-1-yao.jin@linux.intel.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
pull bot pushed a commit that referenced this pull request Aug 14, 2020
Yonghong Song says:

====================
Andrii raised a concern that current uapi for bpf iterator map
element is a little restrictive and not suitable for future potential
complex customization. This is a valid suggestion, considering people
may indeed add more complex custimization to the iterator, e.g.,
cgroup_id + user_id, etc. for task or task_file. Another example might
be map_id plus additional control so that the bpf iterator may bail
out a bucket earlier if a bucket has too many elements which may hold
lock too long and impact other parts of systems.

Patch #1 modified uapi with kernel changes. Patch #2
adjusted libbpf api accordingly.

Changelogs:
  v3 -> v4:
    . add a forward declaration of bpf_iter_link_info in
      tools/lib/bpf/bpf.h in case that libbpf is built against
      not-latest uapi bpf.h.
    . target the patch set to "bpf" instead of "bpf-next"
  v2 -> v3:
    . undo "not reject iter_info.map.map_fd == 0" from v1.
      In the future map_fd may become optional, so let us use map_fd == 0
      indicating the map_fd is not set by user space.
    . add link_info_len to bpf_iter_attach_opts to ensure always correct
      link_info_len from user. Otherwise, libbpf may deduce incorrect
      link_info_len if it uses different uapi header than the user app.
  v1 -> v2:
    . ensure link_create target_fd/flags == 0 since they are not used. (Andrii)
    . if either of iter_info ptr == 0 or iter_info_len == 0, but not both,
      return error to user space. (Andrii)
    . do not reject iter_info.map.map_fd == 0, go ahead to use it trying to
      get a map reference since the map_fd is required for map_elem iterator.
    . use bpf_iter_link_info in bpf_iter_attach_opts instead of map_fd.
      this way, user space is responsible to set up bpf_iter_link_info and
      libbpf just passes the data to the kernel, simplifying libbpf design.
      (Andrii)
====================

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
pull bot pushed a commit that referenced this pull request Aug 15, 2020
Freeing chip on error may lead to an Oops at the next time
the system goes to resume. Fix this by removing all
snd_echo_free() calls on error.

Fixes: 47b5d02 ("ALSA: Echoaudio - Add suspend support #2")
Signed-off-by: Dinghao Liu <dinghao.liu@zju.edu.cn>
Link: https://lore.kernel.org/r/20200813074632.17022-1-dinghao.liu@zju.edu.cn
Signed-off-by: Takashi Iwai <tiwai@suse.de>
pull bot pushed a commit that referenced this pull request Oct 1, 2025
Leon Hwang says:

====================
bpf: Allow union argument in trampoline based programs

While tracing 'release_pages' with bpfsnoop[0], the verifier reports:

The function release_pages arg0 type UNION is unsupported.

However, it should be acceptable to trace functions that have 'union'
arguments.

This patch set enables such support in the verifier by allowing 'union'
as a valid argument type.

Changes:
v3 -> v4:
* Address comments from Alexei:
  * Trim bpftrace output in patch #1 log.
  * Drop the referenced commit info and the test output in patch #2 log.

v2 -> v3:
* Address comments from Alexei:
  * Reuse the existing flag BTF_FMODEL_STRUCT_ARG.
  * Update the comment of the flag BTF_FMODEL_STRUCT_ARG.

v1 -> v2:
* Add 16B 'union' argument support in x86_64 trampoline.
* Update selftests using bpf_testmod.
* Add test case about 16-bytes 'union' argument.
* Address comments from Alexei:
  * Study the patch set about 'struct' argument support.
  * Update selftests to cover more cases.
v1: https://lore.kernel.org/bpf/20250905133226.84675-1-leon.hwang@linux.dev/

Links:
[0] https://github.com/bpfsnoop/bpfsnoop
====================

Link: https://patch.msgid.link/20250919044110.23729-1-leon.hwang@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
pull bot pushed a commit that referenced this pull request Oct 3, 2025
generic/091 may fail, then it bisects to the bad commit ba8dac3
("f2fs: fix to zero post-eof page").

What will cause generic/091 to fail is something like below Testcase #1:
1. write 16k as compressed blocks
2. truncate to 12k
3. truncate to 20k
4. verify data in range of [12k, 16k], however data is not zero as
expected

Script of Testcase #1
mkfs.f2fs -f -O extra_attr,compression /dev/vdb
mount -t f2fs -o compress_extension=* /dev/vdb /mnt/f2fs
dd if=/dev/zero of=/mnt/f2fs/file bs=12k count=1
dd if=/dev/random of=/mnt/f2fs/file bs=4k count=1 seek=3 conv=notrunc
sync
truncate -s $((12*1024)) /mnt/f2fs/file
truncate -s $((20*1024)) /mnt/f2fs/file
dd if=/mnt/f2fs/file of=/mnt/f2fs/data bs=4k count=1 skip=3
od /mnt/f2fs/data
umount /mnt/f2fs

Analisys:
in step 2), we will redirty all data pages from #0 to #3 in compressed
cluster, and zero page #3,
in step 3), f2fs_setattr() will call f2fs_zero_post_eof_page() to drop
all page cache post eof, includeing dirtied page #3,
in step 4) when we read data from page #3, it will decompressed cluster
and extra random data to page #3, finally, we hit the non-zeroed data
post eof.

However, the commit ba8dac3 ("f2fs: fix to zero post-eof page") just
let the issue be reproduced easily, w/o the commit, it can reproduce this
bug w/ below Testcase #2:
1. write 16k as compressed blocks
2. truncate to 8k
3. truncate to 12k
4. truncate to 20k
5. verify data in range of [12k, 16k], however data is not zero as
expected

Script of Testcase #2
mkfs.f2fs -f -O extra_attr,compression /dev/vdb
mount -t f2fs -o compress_extension=* /dev/vdb /mnt/f2fs
dd if=/dev/zero of=/mnt/f2fs/file bs=12k count=1
dd if=/dev/random of=/mnt/f2fs/file bs=4k count=1 seek=3 conv=notrunc
sync
truncate -s $((8*1024)) /mnt/f2fs/file
truncate -s $((12*1024)) /mnt/f2fs/file
truncate -s $((20*1024)) /mnt/f2fs/file
echo 3 > /proc/sys/vm/drop_caches
dd if=/mnt/f2fs/file of=/mnt/f2fs/data bs=4k count=1 skip=3
od /mnt/f2fs/data
umount /mnt/f2fs

Anlysis:
in step 2), we will redirty all data pages from #0 to #3 in compressed
cluster, and zero page #2 and #3,
in step 3), we will truncate page #3 in page cache,
in step 4), expand file size,
in step 5), hit random data post eof w/ the same reason in Testcase #1.

Root Cause:
In f2fs_truncate_partial_cluster(), after we truncate partial data block
on compressed cluster, all pages in cluster including the one post eof
will be dirtied, after another tuncation, dirty page post eof will be
dropped, however on-disk compressed cluster is still valid, it may
include non-zero data post eof, result in exposing previous non-zero data
post eof while reading.

Fix:
In f2fs_truncate_partial_cluster(), let change as below to fix:
- call filemap_write_and_wait_range() to flush dirty page
- call truncate_pagecache() to drop pages or zero partial page post eof
- call f2fs_do_truncate_blocks() to truncate non-compress cluster to
  last valid block

Fixes: 3265d3d ("f2fs: support partial truncation on compressed inode")
Reported-by: Jan Prusakowski <jprusakowski@google.com>
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
pull bot pushed a commit that referenced this pull request Oct 3, 2025
As JY reported in bugzilla [1],

Unable to handle kernel NULL pointer dereference at virtual address 0000000000000000
pc : [0xffffffe51d249484] f2fs_is_cp_guaranteed+0x70/0x98
lr : [0xffffffe51d24adbc] f2fs_merge_page_bio+0x520/0x6d4
CPU: 3 UID: 0 PID: 6790 Comm: kworker/u16:3 Tainted: P    B   W  OE      6.12.30-android16-5-maybe-dirty-4k #1 5f7701c9cbf727d1eebe77c89bbbeb3371e895e5
Tainted: [P]=PROPRIETARY_MODULE, [B]=BAD_PAGE, [W]=WARN, [O]=OOT_MODULE, [E]=UNSIGNED_MODULE
Workqueue: writeback wb_workfn (flush-254:49)
Call trace:
 f2fs_is_cp_guaranteed+0x70/0x98
 f2fs_inplace_write_data+0x174/0x2f4
 f2fs_do_write_data_page+0x214/0x81c
 f2fs_write_single_data_page+0x28c/0x764
 f2fs_write_data_pages+0x78c/0xce4
 do_writepages+0xe8/0x2fc
 __writeback_single_inode+0x4c/0x4b4
 writeback_sb_inodes+0x314/0x540
 __writeback_inodes_wb+0xa4/0xf4
 wb_writeback+0x160/0x448
 wb_workfn+0x2f0/0x5dc
 process_scheduled_works+0x1c8/0x458
 worker_thread+0x334/0x3f0
 kthread+0x118/0x1ac
 ret_from_fork+0x10/0x20

[1] https://bugzilla.kernel.org/show_bug.cgi?id=220575

The panic was caused by UAF issue w/ below race condition:

kworker
- writepages
 - f2fs_write_cache_pages
  - f2fs_write_single_data_page
   - f2fs_do_write_data_page
    - f2fs_inplace_write_data
     - f2fs_merge_page_bio
      - add_inu_page
      : cache page #1 into bio & cache bio in
        io->bio_list
  - f2fs_write_single_data_page
   - f2fs_do_write_data_page
    - f2fs_inplace_write_data
     - f2fs_merge_page_bio
      - add_inu_page
      : cache page #2 into bio which is linked
        in io->bio_list
						write
						- f2fs_write_begin
						: write page #1
						 - f2fs_folio_wait_writeback
						  - f2fs_submit_merged_ipu_write
						   - f2fs_submit_write_bio
						   : submit bio which inclues page #1 and #2

						software IRQ
						- f2fs_write_end_io
						 - fscrypt_free_bounce_page
						 : freed bounced page which belongs to page #2
      - inc_page_count( , WB_DATA_TYPE(data_folio), false)
      : data_folio points to fio->encrypted_page
        the bounced page can be freed before
        accessing it in f2fs_is_cp_guarantee()

It can reproduce w/ below testcase:
Run below script in shell #1:
for ((i=1;i>0;i++)) do xfs_io -f /mnt/f2fs/enc/file \
-c "pwrite 0 32k" -c "fdatasync"

Run below script in shell #2:
for ((i=1;i>0;i++)) do xfs_io -f /mnt/f2fs/enc/file \
-c "pwrite 0 32k" -c "fdatasync"

So, in f2fs_merge_page_bio(), let's avoid using fio->encrypted_page after
commit page into internal ipu cache.

Fixes: 0b20fce ("f2fs: cache global IPU bio")
Reported-by: JY <JY.Ho@mediatek.com>
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
pull bot pushed a commit that referenced this pull request Oct 4, 2025
When running as an SNP or TDX guest under KVM, force the legacy PCI hole,
i.e. memory between Top of Lower Usable DRAM and 4GiB, to be mapped as UC
via a forced variable MTRR range.

In most KVM-based setups, legacy devices such as the HPET and TPM are
enumerated via ACPI.  ACPI enumeration includes a Memory32Fixed entry, and
optionally a SystemMemory descriptor for an OperationRegion, e.g. if the
device needs to be accessed via a Control Method.

If a SystemMemory entry is present, then the kernel's ACPI driver will
auto-ioremap the region so that it can be accessed at will.  However, the
ACPI spec doesn't provide a way to enumerate the memory type of
SystemMemory regions, i.e. there's no way to tell software that a region
must be mapped as UC vs. WB, etc.  As a result, Linux's ACPI driver always
maps SystemMemory regions using ioremap_cache(), i.e. as WB on x86.

The dedicated device drivers however, e.g. the HPET driver and TPM driver,
want to map their associated memory as UC or WC, as accessing PCI devices
using WB is unsupported.

On bare metal and non-CoCO, the conflicting requirements "work" as firmware
configures the PCI hole (and other device memory) to be UC in the MTRRs.
So even though the ACPI mappings request WB, they are forced to UC- in the
kernel's tracking due to the kernel properly handling the MTRR overrides,
and thus are compatible with the drivers' requested WC/UC-.

With force WB MTRRs on SNP and TDX guests, the ACPI mappings get their
requested WB if the ACPI mappings are established before the dedicated
driver code attempts to initialize the device.  E.g. if acpi_init()
runs before the corresponding device driver is probed, ACPI's WB mapping
will "win", and result in the driver's ioremap() failing because the
existing WB mapping isn't compatible with the requested WC/UC-.

E.g. when a TPM is emulated by the hypervisor (ignoring the security
implications of relying on what is allegedly an untrusted entity to store
measurements), the TPM driver will request UC and fail:

  [  1.730459] ioremap error for 0xfed40000-0xfed45000, requested 0x2, got 0x0
  [  1.732780] tpm_tis MSFT0101:00: probe with driver tpm_tis failed with error -12

Note, the '0x2' and '0x0' values refer to "enum page_cache_mode", not x86's
memtypes (which frustratingly are an almost pure inversion; 2 == WB, 0 == UC).
E.g. tracing mapping requests for TPM TIS yields:

 Mapping TPM TIS with req_type = 0
 WARNING: CPU: 22 PID: 1 at arch/x86/mm/pat/memtype.c:530 memtype_reserve+0x2ab/0x460
 Modules linked in:
 CPU: 22 UID: 0 PID: 1 Comm: swapper/0 Tainted: G        W           6.16.0-rc7+ #2 VOLUNTARY
 Tainted: [W]=WARN
 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 05/29/2025
 RIP: 0010:memtype_reserve+0x2ab/0x460
  __ioremap_caller+0x16d/0x3d0
  ioremap_cache+0x17/0x30
  x86_acpi_os_ioremap+0xe/0x20
  acpi_os_map_iomem+0x1f3/0x240
  acpi_os_map_memory+0xe/0x20
  acpi_ex_system_memory_space_handler+0x273/0x440
  acpi_ev_address_space_dispatch+0x176/0x4c0
  acpi_ex_access_region+0x2ad/0x530
  acpi_ex_field_datum_io+0xa2/0x4f0
  acpi_ex_extract_from_field+0x296/0x3e0
  acpi_ex_read_data_from_field+0xd1/0x460
  acpi_ex_resolve_node_to_value+0x2ee/0x530
  acpi_ex_resolve_to_value+0x1f2/0x540
  acpi_ds_evaluate_name_path+0x11b/0x190
  acpi_ds_exec_end_op+0x456/0x960
  acpi_ps_parse_loop+0x27a/0xa50
  acpi_ps_parse_aml+0x226/0x600
  acpi_ps_execute_method+0x172/0x3e0
  acpi_ns_evaluate+0x175/0x5f0
  acpi_evaluate_object+0x213/0x490
  acpi_evaluate_integer+0x6d/0x140
  acpi_bus_get_status+0x93/0x150
  acpi_add_single_object+0x43a/0x7c0
  acpi_bus_check_add+0x149/0x3a0
  acpi_bus_check_add_1+0x16/0x30
  acpi_ns_walk_namespace+0x22c/0x360
  acpi_walk_namespace+0x15c/0x170
  acpi_bus_scan+0x1dd/0x200
  acpi_scan_init+0xe5/0x2b0
  acpi_init+0x264/0x5b0
  do_one_initcall+0x5a/0x310
  kernel_init_freeable+0x34f/0x4f0
  kernel_init+0x1b/0x200
  ret_from_fork+0x186/0x1b0
  ret_from_fork_asm+0x1a/0x30
  </TASK>

The above traces are from a Google-VMM based VM, but the same behavior
happens with a QEMU based VM that is modified to add a SystemMemory range
for the TPM TIS address space.

The only reason this doesn't cause problems for HPET, which appears to
require a SystemMemory region, is because HPET gets special treatment via
x86_init.timers.timer_init(), and so gets a chance to create its UC-
mapping before acpi_init() clobbers things.  Disabling the early call to
hpet_time_init() yields the same behavior for HPET:

  [  0.318264] ioremap error for 0xfed00000-0xfed01000, requested 0x2, got 0x0

Hack around the ACPI gap by forcing the legacy PCI hole to UC when
overriding the (virtual) MTRRs for CoCo guest, so that ioremap handling
of MTRRs naturally kicks in and forces the ACPI mappings to be UC.

Note, the requested/mapped memtype doesn't actually matter in terms of
accessing the device.  In practically every setup, legacy PCI devices are
emulated by the hypervisor, and accesses are intercepted and handled as
emulated MMIO, i.e. never access physical memory and thus don't have an
effective memtype.

Even in a theoretical setup where such devices are passed through by the
host, i.e. point at real MMIO memory, it is KVM's (as the hypervisor)
responsibility to force the memory to be WC/UC, e.g. via EPT memtype
under TDX or real hardware MTRRs under SNP.  Not doing so cannot work,
and the hypervisor is highly motivated to do the right thing as letting
the guest access hardware MMIO with WB would likely result in a variety
of fatal #MCs.

In other words, forcing the range to be UC is all about coercing the
kernel's tracking into thinking that it has established UC mappings, so
that the ioremap code doesn't reject mappings from e.g. the TPM driver and
thus prevent the driver from loading and the device from functioning.

Note #2, relying on guest firmware to handle this scenario, e.g. by setting
virtual MTRRs and then consuming them in Linux, is not a viable option, as
the virtual MTRR state is managed by the untrusted hypervisor, and because
OVMF at least has stopped programming virtual MTRRs when running as a TDX
guest.

Link: https://lore.kernel.org/all/8137d98e-8825-415b-9282-1d2a115bb51a@linux.intel.com
Fixes: 8e690b8 ("x86/kvm: Override default caching mode for SEV-SNP and TDX")
Cc: stable@vger.kernel.org
Cc: Peter Gonda <pgonda@google.com>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: Jürgen Groß <jgross@suse.com>
Cc: Korakit Seemakhupt <korakit@google.com>
Cc: Jianxiong Gao <jxgao@google.com>
Cc: Nikolay Borisov <nik.borisov@suse.com>
Suggested-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Tested-by: Korakit Seemakhupt <korakit@google.com>
Link: https://lore.kernel.org/r/20250828005249.39339-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
pull bot pushed a commit that referenced this pull request Oct 5, 2025
We're generally not proponents of rewrites (nasty uncomfortable things
that make you late for dinner!). So why rewrite Binder?

Binder has been evolving over the past 15+ years to meet the evolving
needs of Android. Its responsibilities, expectations, and complexity
have grown considerably during that time. While we expect Binder to
continue to evolve along with Android, there are a number of factors
that currently constrain our ability to develop/maintain it. Briefly
those are:

1. Complexity: Binder is at the intersection of everything in Android and
   fulfills many responsibilities beyond IPC. It has become many things
   to many people, and due to its many features and their interactions
   with each other, its complexity is quite high. In just 6kLOC it must
   deliver transactions to the right threads. It must correctly parse
   and translate the contents of transactions, which can contain several
   objects of different types (e.g., pointers, fds) that can interact
   with each other. It controls the size of thread pools in userspace,
   and ensures that transactions are assigned to threads in ways that
   avoid deadlocks where the threadpool has run out of threads. It must
   track refcounts of objects that are shared by several processes by
   forwarding refcount changes between the processes correctly.  It must
   handle numerous error scenarios and it combines/nests 13 different
   locks, 7 reference counters, and atomic variables. Finally, It must
   do all of this as fast and efficiently as possible. Minor performance
   regressions can cause a noticeably degraded user experience.

2. Things to improve: Thousand-line functions [1], error-prone error
   handling [2], and confusing structure can occur as a code base grows
   organically. After more than a decade of development, this codebase
   could use an overhaul.

[1]: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/android/binder.c?h=v6.5#n2896
[2]: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/android/binder.c?h=v6.5#n3658

3. Security critical: Binder is a critical part of Android's sandboxing
   strategy. Even Android's most de-privileged sandboxes (e.g. the
   Chrome renderer, or SW Codec) have direct access to Binder. More than
   just about any other component, it's important that Binder provide
   robust security, and itself be robust against security
   vulnerabilities.

It's #1 (high complexity) that has made continuing to evolve Binder and
resolving #2 (tech debt) exceptionally difficult without causing #3
(security issues). For Binder to continue to meet Android's needs, we
need better ways to manage (and reduce!) complexity without increasing
the risk.

The biggest change is obviously the choice of programming language. We
decided to use Rust because it directly addresses a number of the
challenges within Binder that we have faced during the last years. It
prevents mistakes with ref counting, locking, bounds checking, and also
does a lot to reduce the complexity of error handling. Additionally,
we've been able to use the more expressive type system to encode the
ownership semantics of the various structs and pointers, which takes the
complexity of managing object lifetimes out of the hands of the
programmer, reducing the risk of use-after-frees and similar problems.

Rust has many different pointer types that it uses to encode ownership
semantics into the type system, and this is probably one of the most
important aspects of how it helps in Binder. The Binder driver has a lot
of different objects that have complex ownership semantics; some
pointers own a refcount, some pointers have exclusive ownership, and
some pointers just reference the object and it is kept alive in some
other manner. With Rust, we can use a different pointer type for each
kind of pointer, which enables the compiler to enforce that the
ownership semantics are implemented correctly.

Another useful feature is Rust's error handling. Rust allows for more
simplified error handling with features such as destructors, and you get
compilation failures if errors are not properly handled. This means that
even though Rust requires you to spend more lines of code than C on
things such as writing down invariants that are left implicit in C, the
Rust driver is still slightly smaller than C binder: Rust is 5.5kLOC and
C is 5.8kLOC. (These numbers are excluding blank lines, comments,
binderfs, and any debugging facilities in C that are not yet implemented
in the Rust driver. The numbers include abstractions in rust/kernel/
that are unlikely to be used by other drivers than Binder.)

Although this rewrite completely rethinks how the code is structured and
how assumptions are enforced, we do not fundamentally change *how* the
driver does the things it does. A lot of careful thought has gone into
the existing design. The rewrite is aimed rather at improving code
health, structure, readability, robustness, security, maintainability
and extensibility. We also include more inline documentation, and
improve how assumptions in the code are enforced. Furthermore, all
unsafe code is annotated with a SAFETY comment that explains why it is
correct.

We have left the binderfs filesystem component in C. Rewriting it in
Rust would be a large amount of work and requires a lot of bindings to
the file system interfaces. Binderfs has not historically had the same
challenges with security and complexity, so rewriting binderfs seems to
have lower value than the rest of Binder.

Correctness and feature parity
------------------------------

Rust binder passes all tests that validate the correctness of Binder in
the Android Open Source Project. We can boot a device, and run a variety
of apps and functionality without issues. We have performed this both on
the Cuttlefish Android emulator device, and on a Pixel 6 Pro.

As for feature parity, Rust binder currently implements all features
that C binder supports, with the exception of some debugging facilities.
The missing debugging facilities will be added before we submit the Rust
implementation upstream.

Tracepoints
-----------

I did not include all of the tracepoints as I felt that the mechansim
for making C access fields of Rust structs should be discussed on list
separately. I also did not include the support for building Rust Binder
as a module since that requires exporting a bunch of additional symbols
on the C side.

Original RFC Link with old benchmark numbers:
	https://lore.kernel.org/r/20231101-rust-binder-v1-0-08ba9197f637@google.com

Co-developed-by: Wedson Almeida Filho <wedsonaf@gmail.com>
Signed-off-by: Wedson Almeida Filho <wedsonaf@gmail.com>
Co-developed-by: Matt Gilbride <mattgilbride@google.com>
Signed-off-by: Matt Gilbride <mattgilbride@google.com>
Acked-by: Carlos Llamas <cmllamas@google.com>
Acked-by: Paul Moore <paul@paul-moore.com>
Signed-off-by: Alice Ryhl <aliceryhl@google.com>
Link: https://lore.kernel.org/r/20250919-rust-binder-v2-1-a384b09f28dd@google.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
pull bot pushed a commit that referenced this pull request Oct 6, 2025
Check for an invalid length during LAUNCH_UPDATE at the start of
snp_launch_update() instead of subtly relying on kvm_gmem_populate() to
detect the bad state.  Code that directly handles userspace input
absolutely should sanitize those inputs; failure to do so is asking for
bugs where KVM consumes an invalid "npages".

Keep the check in gmem, but wrap it in a WARN to flag any bad usage by
the caller.

Note, this is technically an ABI change as KVM would previously allow a
length of '0'.  But allowing a length of '0' is nonsensical and creates
pointless conundrums in KVM.  E.g. an empty range is arguably neither
private nor shared, but LAUNCH_UPDATE will fail if the starting gpa can't
be made private.  In practice, no known or well-behaved VMM passes a
length of '0'.

Note #2, the PAGE_ALIGNED(params.len) check ensures that lengths between
1 and 4095 (inclusive) are also rejected, i.e. that KVM won't end up with
npages=0 when doing "npages = params.len / PAGE_SIZE".

Cc: Thomas Lendacky <thomas.lendacky@amd.com>
Cc: Michael Roth <michael.roth@amd.com>
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Link: https://lore.kernel.org/r/20250919211649.1575654-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
pull bot pushed a commit that referenced this pull request Oct 6, 2025
Don't emulate branch instructions, e.g. CALL/RET/JMP etc., that are
affected by Shadow Stacks and/or Indirect Branch Tracking when said
features are enabled in the guest, as fully emulating CET would require
significant complexity for no practical benefit (KVM shouldn't need to
emulate branch instructions on modern hosts).  Simply doing nothing isn't
an option as that would allow a malicious entity to subvert CET
protections via the emulator.

To detect instructions that are subject to IBT or affect IBT state, use
the existing IsBranch flag along with the source operand type to detect
indirect branches, and the existing NearBranch flag to detect far JMPs
and CALLs, all of which are effectively indirect.  Explicitly check for
emulation of IRET, FAR RET (IMM), and SYSEXIT (the ret-like far branches)
instead of adding another flag, e.g. IsRet, as it's unlikely the emulator
will ever need to check for return-like instructions outside of this one
specific flow.  Use an allow-list instead of a deny-list because (a) it's
a shorter list and (b) so that a missed entry gets a false positive, not a
false negative (i.e. reject emulation instead of clobbering CET state).

For Shadow Stacks, explicitly track instructions that directly affect the
current SSP, as KVM's emulator doesn't have existing flags that can be
used to precisely detect such instructions.  Alternatively, the em_xxx()
helpers could directly check for ShadowStack interactions, but using a
dedicated flag is arguably easier to audit, and allows for handling both
IBT and SHSTK in one fell swoop.

Note!  On far transfers, do NOT consult the current privilege level and
instead treat SHSTK/IBT as being enabled if they're enabled for User *or*
Supervisor mode.  On inter-privilege level far transfers, SHSTK and IBT
can be in play for the target privilege level, i.e. checking the current
privilege could get a false negative, and KVM doesn't know the target
privilege level until emulation gets under way.

Note #2, FAR JMP from 64-bit mode to compatibility mode interacts with
the current SSP, but only to ensure SSP[63:32] == 0.  Don't tag FAR JMP
as SHSTK, which would be rather confusing and would result in FAR JMP
being rejected unnecessarily the vast majority of the time (ignoring that
it's unlikely to ever be emulated).  A future commit will add the #GP(0)
check for the specific FAR JMP scenario.

Note #3, task switches also modify SSP and so need to be rejected.  That
too will be addressed in a future commit.

Suggested-by: Chao Gao <chao.gao@intel.com>
Originally-by: Yang Weijiang <weijiang.yang@intel.com>
Cc: Mathias Krause <minipli@grsecurity.net>
Cc: John Allen <john.allen@amd.com>
Cc: Rick Edgecombe <rick.p.edgecombe@intel.com>
Reviewed-by: Chao Gao <chao.gao@intel.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://lore.kernel.org/r/20250919223258.1604852-19-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
pull bot pushed a commit that referenced this pull request Oct 6, 2025
Before disabling SR-IOV via config space accesses to the parent PF,
sriov_disable() first removes the PCI devices representing the VFs.

Since commit 9d16947 ("PCI: Add global pci_lock_rescan_remove()")
such removal operations are serialized against concurrent remove and
rescan using the pci_rescan_remove_lock. No such locking was ever added
in sriov_disable() however. In particular when commit 18f9e9d
("PCI/IOV: Factor out sriov_add_vfs()") factored out the PCI device
removal into sriov_del_vfs() there was still no locking around the
pci_iov_remove_virtfn() calls.

On s390 the lack of serialization in sriov_disable() may cause double
remove and list corruption with the below (amended) trace being observed:

  PSW:  0704c00180000000 0000000c914e4b38 (klist_put+56)
  GPRS: 000003800313fb48 0000000000000000 0000000100000001 0000000000000001
	00000000f9b520a8 0000000000000000 0000000000002fbd 00000000f4cc9480
	0000000000000001 0000000000000000 0000000000000000 0000000180692828
	00000000818e8000 000003800313fe2c 000003800313fb20 000003800313fad8
  #0 [3800313fb20] device_del at c9158ad5c
  #1 [3800313fb88] pci_remove_bus_device at c915105ba
  #2 [3800313fbd0] pci_iov_remove_virtfn at c9152f198
  #3 [3800313fc28] zpci_iov_remove_virtfn at c90fb67c0
  #4 [3800313fc60] zpci_bus_remove_device at c90fb6104
  #5 [3800313fca0] __zpci_event_availability at c90fb3dca
  #6 [3800313fd08] chsc_process_sei_nt0 at c918fe4a2
  #7 [3800313fd60] crw_collect_info at c91905822
  #8 [3800313fe10] kthread at c90feb390
  #9 [3800313fe68] __ret_from_fork at c90f6aa64
  #10 [3800313fe98] ret_from_fork at c9194f3f2.

This is because in addition to sriov_disable() removing the VFs, the
platform also generates hot-unplug events for the VFs. This being the
reverse operation to the hotplug events generated by sriov_enable() and
handled via pdev->no_vf_scan. And while the event processing takes
pci_rescan_remove_lock and checks whether the struct pci_dev still exists,
the lack of synchronization makes this checking racy.

Other races may also be possible of course though given that this lack of
locking persisted so long observable races seem very rare. Even on s390 the
list corruption was only observed with certain devices since the platform
events are only triggered by config accesses after the removal, so as long
as the removal finished synchronously they would not race. Either way the
locking is missing so fix this by adding it to the sriov_del_vfs() helper.

Just like PCI rescan-remove, locking is also missing in sriov_add_vfs()
including for the error case where pci_stop_and_remove_bus_device() is
called without the PCI rescan-remove lock being held. Even in the non-error
case, adding new PCI devices and buses should be serialized via the PCI
rescan-remove lock. Add the necessary locking.

Fixes: 18f9e9d ("PCI/IOV: Factor out sriov_add_vfs()")
Signed-off-by: Niklas Schnelle <schnelle@linux.ibm.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Benjamin Block <bblock@linux.ibm.com>
Reviewed-by: Farhan Ali <alifm@linux.ibm.com>
Reviewed-by: Julian Ruess <julianr@linux.ibm.com>
Cc: stable@vger.kernel.org
Link: https://patch.msgid.link/20250826-pci_fix_sriov_disable-v1-1-2d0bc938f2a3@linux.ibm.com
pull bot pushed a commit that referenced this pull request Oct 9, 2025
The test starts a workload and then opens events. If the events fail
to open, for example because of perf_event_paranoid, the gopipe of the
workload is leaked and the file descriptor leak check fails when the
test exits. To avoid this cancel the workload when opening the events
fails.

Before:
```
$ perf test -vv 7
  7: PERF_RECORD_* events & perf_sample fields:
 --- start ---
test child forked, pid 1189568
Using CPUID GenuineIntel-6-B7-1
 ------------------------------------------------------------
perf_event_attr:
  type                    	   0 (PERF_TYPE_HARDWARE)
  config                  	   0xa00000000 (cpu_atom/PERF_COUNT_HW_CPU_CYCLES/)
  disabled                	   1
 ------------------------------------------------------------
sys_perf_event_open: pid 0  cpu -1  group_fd -1  flags 0x8
sys_perf_event_open failed, error -13
 ------------------------------------------------------------
perf_event_attr:
  type                             0 (PERF_TYPE_HARDWARE)
  config                           0xa00000000 (cpu_atom/PERF_COUNT_HW_CPU_CYCLES/)
  disabled                         1
  exclude_kernel                   1
 ------------------------------------------------------------
sys_perf_event_open: pid 0  cpu -1  group_fd -1  flags 0x8 = 3
 ------------------------------------------------------------
perf_event_attr:
  type                             0 (PERF_TYPE_HARDWARE)
  config                           0x400000000 (cpu_core/PERF_COUNT_HW_CPU_CYCLES/)
  disabled                         1
 ------------------------------------------------------------
sys_perf_event_open: pid 0  cpu -1  group_fd -1  flags 0x8
sys_perf_event_open failed, error -13
 ------------------------------------------------------------
perf_event_attr:
  type                             0 (PERF_TYPE_HARDWARE)
  config                           0x400000000 (cpu_core/PERF_COUNT_HW_CPU_CYCLES/)
  disabled                         1
  exclude_kernel                   1
 ------------------------------------------------------------
sys_perf_event_open: pid 0  cpu -1  group_fd -1  flags 0x8 = 3
Attempt to add: software/cpu-clock/
..after resolving event: software/config=0/
cpu-clock -> software/cpu-clock/
 ------------------------------------------------------------
perf_event_attr:
  type                             1 (PERF_TYPE_SOFTWARE)
  size                             136
  config                           0x9 (PERF_COUNT_SW_DUMMY)
  sample_type                      IP|TID|TIME|CPU
  read_format                      ID|LOST
  disabled                         1
  inherit                          1
  mmap                             1
  comm                             1
  enable_on_exec                   1
  task                             1
  sample_id_all                    1
  mmap2                            1
  comm_exec                        1
  ksymbol                          1
  bpf_event                        1
  { wakeup_events, wakeup_watermark } 1
 ------------------------------------------------------------
sys_perf_event_open: pid 1189569  cpu 0  group_fd -1  flags 0x8
sys_perf_event_open failed, error -13
perf_evlist__open: Permission denied
 ---- end(-2) ----
Leak of file descriptor 6 that opened: 'pipe:[14200347]'
 ---- unexpected signal (6) ----
iFailed to read build ID for //anon
Failed to read build ID for //anon
Failed to read build ID for //anon
Failed to read build ID for //anon
Failed to read build ID for //anon
Failed to read build ID for //anon
Failed to read build ID for //anon
Failed to read build ID for //anon
Failed to read build ID for //anon
Failed to read build ID for //anon
Failed to read build ID for //anon
Failed to read build ID for //anon
Failed to read build ID for //anon
Failed to read build ID for //anon
Failed to read build ID for //anon
Failed to read build ID for //anon
Failed to read build ID for //anon
Failed to read build ID for //anon
Failed to read build ID for //anon
Failed to read build ID for //anon
Failed to read build ID for //anon
Failed to read build ID for //anon
Failed to read build ID for //anon
Failed to read build ID for //anon
Failed to read build ID for //anon
Failed to read build ID for //anon
Failed to read build ID for //anon
Failed to read build ID for //anon
Failed to read build ID for //anon
Failed to read build ID for //anon
Failed to read build ID for //anon
Failed to read build ID for //anon
    #0 0x565358f6666e in child_test_sig_handler builtin-test.c:311
    #1 0x7f29ce849df0 in __restore_rt libc_sigaction.c:0
    #2 0x7f29ce89e95c in __pthread_kill_implementation pthread_kill.c:44
    #3 0x7f29ce849cc2 in raise raise.c:27
    #4 0x7f29ce8324ac in abort abort.c:81
    #5 0x565358f662d4 in check_leaks builtin-test.c:226
    #6 0x565358f6682e in run_test_child builtin-test.c:344
    #7 0x565358ef7121 in start_command run-command.c:128
    #8 0x565358f67273 in start_test builtin-test.c:545
    #9 0x565358f6771d in __cmd_test builtin-test.c:647
    #10 0x565358f682bd in cmd_test builtin-test.c:849
    #11 0x565358ee5ded in run_builtin perf.c:349
    #12 0x565358ee6085 in handle_internal_command perf.c:401
    #13 0x565358ee61de in run_argv perf.c:448
    #14 0x565358ee6527 in main perf.c:555
    #15 0x7f29ce833ca8 in __libc_start_call_main libc_start_call_main.h:74
    #16 0x7f29ce833d65 in __libc_start_main@@GLIBC_2.34 libc-start.c:128
    #17 0x565358e391c1 in _start perf[851c1]
  7: PERF_RECORD_* events & perf_sample fields                       : FAILED!
```

After:
```
$ perf test 7
  7: PERF_RECORD_* events & perf_sample fields                       : Skip (permissions)
```

Fixes: 16d00fe ("perf tests: Move test__PERF_RECORD into separate object")
Signed-off-by: Ian Rogers <irogers@google.com>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Athira Rajeev <atrajeev@linux.ibm.com>
Cc: Chun-Tse Shao <ctshao@google.com>
Cc: Howard Chu <howardchu95@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Clark <james.clark@linaro.org>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Kan Liang <kan.liang@linux.intel.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
pull bot pushed a commit that referenced this pull request Oct 16, 2025
Since blamed commit, unregister_netdevice_many_notify() takes the netdev
mutex if the device needs it.

If the device list is too long, this will lock more device mutexes than
lockdep can handle:

unshare -n \
 bash -c 'for i in $(seq 1 100);do ip link add foo$i type dummy;done'

BUG: MAX_LOCK_DEPTH too low!
turning off the locking correctness validator.
depth: 48  max: 48!
48 locks held by kworker/u16:1/69:
 #0: ..148 ((wq_completion)netns){+.+.}-{0:0}, at: process_one_work
 #1: ..d40 (net_cleanup_work){+.+.}-{0:0}, at: process_one_work
 #2: ..bd0 (pernet_ops_rwsem){++++}-{4:4}, at: cleanup_net
 #3: ..aa8 (rtnl_mutex){+.+.}-{4:4}, at: default_device_exit_batch
 #4: ..cb0 (&dev_instance_lock_key#3){+.+.}-{4:4}, at: unregister_netdevice_many_notify
[..]

Add a helper to close and then unlock a list of net_devices.
Devices that are not up have to be skipped - netif_close_many always
removes them from the list without any other actions taken, so they'd
remain in locked state.

Close devices whenever we've used up half of the tracking slots or we
processed entire list without hitting the limit.

Fixes: 7e4d784 ("net: hold netdev instance lock during rtnetlink operations")
Signed-off-by: Florian Westphal <fw@strlen.de>
Link: https://patch.msgid.link/20251013185052.14021-1-fw@strlen.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
pull bot pushed a commit that referenced this pull request Oct 19, 2025
Expand the prefault memory selftest to add a regression test for a KVM bug
where KVM's retry logic would result in (breakable) deadlock due to the
memslot deletion waiting on prefaulting to release SRCU, and prefaulting
waiting on the memslot to fully disappear (KVM uses a two-step process to
delete memslots, and KVM x86 retries page faults if a to-be-deleted, a.k.a.
INVALID, memslot is encountered).

To exercise concurrent memslot remove, spawn a second thread to initiate
memslot removal at roughly the same time as prefaulting.  Test memslot
removal for all testcases, i.e. don't limit concurrent removal to only the
success case.  There are essentially three prefault scenarios (so far)
that are of interest:

 1. Success
 2. ENOENT due to no memslot
 3. EAGAIN due to INVALID memslot

For all intents and purposes, #1 and #2 are mutually exclusive, or rather,
easier to test via separate testcases since writing to non-existent memory
is trivial.  But for #3, making it mutually exclusive with #1 _or_ #2 is
actually more complex than testing memslot removal for all scenarios.  The
only requirement to let memslot removal coexist with other scenarios is a
way to guarantee a stable result, e.g. that the "no memslot" test observes
ENOENT, not EAGAIN, for the final checks.

So, rather than make memslot removal mutually exclusive with the ENOENT
scenario, simply restore the memslot and retry prefaulting.  For the "no
memslot" case, KVM_PRE_FAULT_MEMORY should be idempotent, i.e. should
always fail with ENOENT regardless of how many times userspace attempts
prefaulting.

Pass in both the base GPA and the offset (instead of the "full" GPA) so
that the worker can recreate the memslot.

Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Co-developed-by: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/r/20250924174255.2141847-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
pull bot pushed a commit that referenced this pull request Oct 25, 2025
The original code causes a circular locking dependency found by lockdep.

======================================================
WARNING: possible circular locking dependency detected
6.16.0-rc6-lgci-xe-xe-pw-151626v3+ #1 Tainted: G S   U
------------------------------------------------------
xe_fault_inject/5091 is trying to acquire lock:
ffff888156815688 ((work_completion)(&(&devcd->del_wk)->work)){+.+.}-{0:0}, at: __flush_work+0x25d/0x660

but task is already holding lock:

ffff888156815620 (&devcd->mutex){+.+.}-{3:3}, at: dev_coredump_put+0x3f/0xa0
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #2 (&devcd->mutex){+.+.}-{3:3}:
       mutex_lock_nested+0x4e/0xc0
       devcd_data_write+0x27/0x90
       sysfs_kf_bin_write+0x80/0xf0
       kernfs_fop_write_iter+0x169/0x220
       vfs_write+0x293/0x560
       ksys_write+0x72/0xf0
       __x64_sys_write+0x19/0x30
       x64_sys_call+0x2bf/0x2660
       do_syscall_64+0x93/0xb60
       entry_SYSCALL_64_after_hwframe+0x76/0x7e
-> #1 (kn->active#236){++++}-{0:0}:
       kernfs_drain+0x1e2/0x200
       __kernfs_remove+0xae/0x400
       kernfs_remove_by_name_ns+0x5d/0xc0
       remove_files+0x54/0x70
       sysfs_remove_group+0x3d/0xa0
       sysfs_remove_groups+0x2e/0x60
       device_remove_attrs+0xc7/0x100
       device_del+0x15d/0x3b0
       devcd_del+0x19/0x30
       process_one_work+0x22b/0x6f0
       worker_thread+0x1e8/0x3d0
       kthread+0x11c/0x250
       ret_from_fork+0x26c/0x2e0
       ret_from_fork_asm+0x1a/0x30
-> #0 ((work_completion)(&(&devcd->del_wk)->work)){+.+.}-{0:0}:
       __lock_acquire+0x1661/0x2860
       lock_acquire+0xc4/0x2f0
       __flush_work+0x27a/0x660
       flush_delayed_work+0x5d/0xa0
       dev_coredump_put+0x63/0xa0
       xe_driver_devcoredump_fini+0x12/0x20 [xe]
       devm_action_release+0x12/0x30
       release_nodes+0x3a/0x120
       devres_release_all+0x8a/0xd0
       device_unbind_cleanup+0x12/0x80
       device_release_driver_internal+0x23a/0x280
       device_driver_detach+0x14/0x20
       unbind_store+0xaf/0xc0
       drv_attr_store+0x21/0x50
       sysfs_kf_write+0x4a/0x80
       kernfs_fop_write_iter+0x169/0x220
       vfs_write+0x293/0x560
       ksys_write+0x72/0xf0
       __x64_sys_write+0x19/0x30
       x64_sys_call+0x2bf/0x2660
       do_syscall_64+0x93/0xb60
       entry_SYSCALL_64_after_hwframe+0x76/0x7e
other info that might help us debug this:
Chain exists of: (work_completion)(&(&devcd->del_wk)->work) --> kn->active#236 --> &devcd->mutex
 Possible unsafe locking scenario:
       CPU0                    CPU1
       ----                    ----
  lock(&devcd->mutex);
                               lock(kn->active#236);
                               lock(&devcd->mutex);
  lock((work_completion)(&(&devcd->del_wk)->work));
 *** DEADLOCK ***
5 locks held by xe_fault_inject/5091:
 #0: ffff8881129f9488 (sb_writers#5){.+.+}-{0:0}, at: ksys_write+0x72/0xf0
 #1: ffff88810c755078 (&of->mutex#2){+.+.}-{3:3}, at: kernfs_fop_write_iter+0x123/0x220
 #2: ffff8881054811a0 (&dev->mutex){....}-{3:3}, at: device_release_driver_internal+0x55/0x280
 #3: ffff888156815620 (&devcd->mutex){+.+.}-{3:3}, at: dev_coredump_put+0x3f/0xa0
 #4: ffffffff8359e020 (rcu_read_lock){....}-{1:2}, at: __flush_work+0x72/0x660
stack backtrace:
CPU: 14 UID: 0 PID: 5091 Comm: xe_fault_inject Tainted: G S   U              6.16.0-rc6-lgci-xe-xe-pw-151626v3+ #1 PREEMPT_{RT,(lazy)}
Tainted: [S]=CPU_OUT_OF_SPEC, [U]=USER
Hardware name: Micro-Star International Co., Ltd. MS-7D25/PRO Z690-A DDR4(MS-7D25), BIOS 1.10 12/13/2021
Call Trace:
 <TASK>
 dump_stack_lvl+0x91/0xf0
 dump_stack+0x10/0x20
 print_circular_bug+0x285/0x360
 check_noncircular+0x135/0x150
 ? register_lock_class+0x48/0x4a0
 __lock_acquire+0x1661/0x2860
 lock_acquire+0xc4/0x2f0
 ? __flush_work+0x25d/0x660
 ? mark_held_locks+0x46/0x90
 ? __flush_work+0x25d/0x660
 __flush_work+0x27a/0x660
 ? __flush_work+0x25d/0x660
 ? trace_hardirqs_on+0x1e/0xd0
 ? __pfx_wq_barrier_func+0x10/0x10
 flush_delayed_work+0x5d/0xa0
 dev_coredump_put+0x63/0xa0
 xe_driver_devcoredump_fini+0x12/0x20 [xe]
 devm_action_release+0x12/0x30
 release_nodes+0x3a/0x120
 devres_release_all+0x8a/0xd0
 device_unbind_cleanup+0x12/0x80
 device_release_driver_internal+0x23a/0x280
 ? bus_find_device+0xa8/0xe0
 device_driver_detach+0x14/0x20
 unbind_store+0xaf/0xc0
 drv_attr_store+0x21/0x50
 sysfs_kf_write+0x4a/0x80
 kernfs_fop_write_iter+0x169/0x220
 vfs_write+0x293/0x560
 ksys_write+0x72/0xf0
 __x64_sys_write+0x19/0x30
 x64_sys_call+0x2bf/0x2660
 do_syscall_64+0x93/0xb60
 ? __f_unlock_pos+0x15/0x20
 ? __x64_sys_getdents64+0x9b/0x130
 ? __pfx_filldir64+0x10/0x10
 ? do_syscall_64+0x1a2/0xb60
 ? clear_bhb_loop+0x30/0x80
 ? clear_bhb_loop+0x30/0x80
 entry_SYSCALL_64_after_hwframe+0x76/0x7e
RIP: 0033:0x76e292edd574
Code: c7 00 16 00 00 00 b8 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 f3 0f 1e fa 80 3d d5 ea 0e 00 00 74 13 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 54 c3 0f 1f 00 55 48 89 e5 48 83 ec 20 48 89
RSP: 002b:00007fffe247a828 EFLAGS: 00000202 ORIG_RAX: 0000000000000001
RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 000076e292edd574
RDX: 000000000000000c RSI: 00006267f6306063 RDI: 000000000000000b
RBP: 000000000000000c R08: 000076e292fc4b20 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000202 R12: 00006267f6306063
R13: 000000000000000b R14: 00006267e6859c00 R15: 000076e29322a000
 </TASK>
xe 0000:03:00.0: [drm] Xe device coredump has been deleted.

Fixes: 01daccf ("devcoredump : Serialize devcd_del work")
Cc: Mukesh Ojha <quic_mojha@quicinc.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Johannes Berg <johannes@sipsolutions.net>
Cc: Rafael J. Wysocki <rafael@kernel.org>
Cc: Danilo Krummrich <dakr@kernel.org>
Cc: linux-kernel@vger.kernel.org
Cc: stable@vger.kernel.org # v6.1+
Signed-off-by: Maarten Lankhorst <dev@lankhorst.se>
Cc: Matthew Brost <matthew.brost@intel.com>
Acked-by: Mukesh Ojha <mukesh.ojha@oss.qualcomm.com>
Link: https://lore.kernel.org/r/20250723142416.1020423-1-dev@lankhorst.se
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
pull bot pushed a commit that referenced this pull request Nov 6, 2025
Michael Chan says:

====================
bnxt_en: Bug fixes

Patches 1, 3, and 4 are bug fixes related to the FW log tracing driver
coredump feature recently added in 6.13.  Patch #1 adds the necessary
call to shutdown the FW logging DMA during PCI shutdown.  Patch #3 fixes
a possible null pointer derefernce when using early versions of the FW
with this feature.  Patch #4 adds the coredump header information
unconditionally to make it more robust.

Patch #2 fixes a possible memory leak during PTP shutdown.  Patch #5
eliminates a dmesg warning when doing devlink reload.
====================

Link: https://patch.msgid.link/20251104005700.542174-1-michael.chan@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
pull bot pushed a commit that referenced this pull request Nov 8, 2025
On completion of i915_vma_pin_ww(), a synchronous variant of
dma_fence_work_commit() is called.  When pinning a VMA to GGTT address
space on a Cherry View family processor, or on a Broxton generation SoC
with VTD enabled, i.e., when stop_machine() is then called from
intel_ggtt_bind_vma(), that can potentially lead to lock inversion among
reservation_ww and cpu_hotplug locks.

[86.861179] ======================================================
[86.861193] WARNING: possible circular locking dependency detected
[86.861209] 6.15.0-rc5-CI_DRM_16515-gca0305cadc2d+ #1 Tainted: G     U
[86.861226] ------------------------------------------------------
[86.861238] i915_module_loa/1432 is trying to acquire lock:
[86.861252] ffffffff83489090 (cpu_hotplug_lock){++++}-{0:0}, at: stop_machine+0x1c/0x50
[86.861290]
but task is already holding lock:
[86.861303] ffffc90002e0b4c8 (reservation_ww_class_mutex){+.+.}-{3:3}, at: i915_vma_pin.constprop.0+0x39/0x1d0 [i915]
[86.862233]
which lock already depends on the new lock.
[86.862251]
the existing dependency chain (in reverse order) is:
[86.862265]
-> #5 (reservation_ww_class_mutex){+.+.}-{3:3}:
[86.862292]        dma_resv_lockdep+0x19a/0x390
[86.862315]        do_one_initcall+0x60/0x3f0
[86.862334]        kernel_init_freeable+0x3cd/0x680
[86.862353]        kernel_init+0x1b/0x200
[86.862369]        ret_from_fork+0x47/0x70
[86.862383]        ret_from_fork_asm+0x1a/0x30
[86.862399]
-> #4 (reservation_ww_class_acquire){+.+.}-{0:0}:
[86.862425]        dma_resv_lockdep+0x178/0x390
[86.862440]        do_one_initcall+0x60/0x3f0
[86.862454]        kernel_init_freeable+0x3cd/0x680
[86.862470]        kernel_init+0x1b/0x200
[86.862482]        ret_from_fork+0x47/0x70
[86.862495]        ret_from_fork_asm+0x1a/0x30
[86.862509]
-> #3 (&mm->mmap_lock){++++}-{3:3}:
[86.862531]        down_read_killable+0x46/0x1e0
[86.862546]        lock_mm_and_find_vma+0xa2/0x280
[86.862561]        do_user_addr_fault+0x266/0x8e0
[86.862578]        exc_page_fault+0x8a/0x2f0
[86.862593]        asm_exc_page_fault+0x27/0x30
[86.862607]        filldir64+0xeb/0x180
[86.862620]        kernfs_fop_readdir+0x118/0x480
[86.862635]        iterate_dir+0xcf/0x2b0
[86.862648]        __x64_sys_getdents64+0x84/0x140
[86.862661]        x64_sys_call+0x1058/0x2660
[86.862675]        do_syscall_64+0x91/0xe90
[86.862689]        entry_SYSCALL_64_after_hwframe+0x76/0x7e
[86.862703]
-> #2 (&root->kernfs_rwsem){++++}-{3:3}:
[86.862725]        down_write+0x3e/0xf0
[86.862738]        kernfs_add_one+0x30/0x3c0
[86.862751]        kernfs_create_dir_ns+0x53/0xb0
[86.862765]        internal_create_group+0x134/0x4c0
[86.862779]        sysfs_create_group+0x13/0x20
[86.862792]        topology_add_dev+0x1d/0x30
[86.862806]        cpuhp_invoke_callback+0x4b5/0x850
[86.862822]        cpuhp_issue_call+0xbf/0x1f0
[86.862836]        __cpuhp_setup_state_cpuslocked+0x111/0x320
[86.862852]        __cpuhp_setup_state+0xb0/0x220
[86.862866]        topology_sysfs_init+0x30/0x50
[86.862879]        do_one_initcall+0x60/0x3f0
[86.862893]        kernel_init_freeable+0x3cd/0x680
[86.862908]        kernel_init+0x1b/0x200
[86.862921]        ret_from_fork+0x47/0x70
[86.862934]        ret_from_fork_asm+0x1a/0x30
[86.862947]
-> #1 (cpuhp_state_mutex){+.+.}-{3:3}:
[86.862969]        __mutex_lock+0xaa/0xed0
[86.862982]        mutex_lock_nested+0x1b/0x30
[86.862995]        __cpuhp_setup_state_cpuslocked+0x67/0x320
[86.863012]        __cpuhp_setup_state+0xb0/0x220
[86.863026]        page_alloc_init_cpuhp+0x2d/0x60
[86.863041]        mm_core_init+0x22/0x2d0
[86.863054]        start_kernel+0x576/0xbd0
[86.863068]        x86_64_start_reservations+0x18/0x30
[86.863084]        x86_64_start_kernel+0xbf/0x110
[86.863098]        common_startup_64+0x13e/0x141
[86.863114]
-> #0 (cpu_hotplug_lock){++++}-{0:0}:
[86.863135]        __lock_acquire+0x1635/0x2810
[86.863152]        lock_acquire+0xc4/0x2f0
[86.863166]        cpus_read_lock+0x41/0x100
[86.863180]        stop_machine+0x1c/0x50
[86.863194]        bxt_vtd_ggtt_insert_entries__BKL+0x3b/0x60 [i915]
[86.863987]        intel_ggtt_bind_vma+0x43/0x70 [i915]
[86.864735]        __vma_bind+0x55/0x70 [i915]
[86.865510]        fence_work+0x26/0xa0 [i915]
[86.866248]        fence_notify+0xa1/0x140 [i915]
[86.866983]        __i915_sw_fence_complete+0x8f/0x270 [i915]
[86.867719]        i915_sw_fence_commit+0x39/0x60 [i915]
[86.868453]        i915_vma_pin_ww+0x462/0x1360 [i915]
[86.869228]        i915_vma_pin.constprop.0+0x133/0x1d0 [i915]
[86.870001]        initial_plane_vma+0x307/0x840 [i915]
[86.870774]        intel_initial_plane_config+0x33f/0x670 [i915]
[86.871546]        intel_display_driver_probe_nogem+0x1c6/0x260 [i915]
[86.872330]        i915_driver_probe+0x7fa/0xe80 [i915]
[86.873057]        i915_pci_probe+0xe6/0x220 [i915]
[86.873782]        local_pci_probe+0x47/0xb0
[86.873802]        pci_device_probe+0xf3/0x260
[86.873817]        really_probe+0xf1/0x3c0
[86.873833]        __driver_probe_device+0x8c/0x180
[86.873848]        driver_probe_device+0x24/0xd0
[86.873862]        __driver_attach+0x10f/0x220
[86.873876]        bus_for_each_dev+0x7f/0xe0
[86.873892]        driver_attach+0x1e/0x30
[86.873904]        bus_add_driver+0x151/0x290
[86.873917]        driver_register+0x5e/0x130
[86.873931]        __pci_register_driver+0x7d/0x90
[86.873945]        i915_pci_register_driver+0x23/0x30 [i915]
[86.874678]        i915_init+0x37/0x120 [i915]
[86.875347]        do_one_initcall+0x60/0x3f0
[86.875369]        do_init_module+0x97/0x2a0
[86.875385]        load_module+0x2c54/0x2d80
[86.875398]        init_module_from_file+0x96/0xe0
[86.875413]        idempotent_init_module+0x117/0x330
[86.875426]        __x64_sys_finit_module+0x77/0x100
[86.875440]        x64_sys_call+0x24de/0x2660
[86.875454]        do_syscall_64+0x91/0xe90
[86.875470]        entry_SYSCALL_64_after_hwframe+0x76/0x7e
[86.875486]
other info that might help us debug this:
[86.875502] Chain exists of:
  cpu_hotplug_lock --> reservation_ww_class_acquire --> reservation_ww_class_mutex
[86.875539]  Possible unsafe locking scenario:
[86.875552]        CPU0                    CPU1
[86.875563]        ----                    ----
[86.875573]   lock(reservation_ww_class_mutex);
[86.875588]                                lock(reservation_ww_class_acquire);
[86.875606]                                lock(reservation_ww_class_mutex);
[86.875624]   rlock(cpu_hotplug_lock);
[86.875637]
 *** DEADLOCK ***
[86.875650] 3 locks held by i915_module_loa/1432:
[86.875663]  #0: ffff888101f5c1b0 (&dev->mutex){....}-{3:3}, at: __driver_attach+0x104/0x220
[86.875699]  #1: ffffc90002e0b4a0 (reservation_ww_class_acquire){+.+.}-{0:0}, at: i915_vma_pin.constprop.0+0x39/0x1d0 [i915]
[86.876512]  #2: ffffc90002e0b4c8 (reservation_ww_class_mutex){+.+.}-{3:3}, at: i915_vma_pin.constprop.0+0x39/0x1d0 [i915]
[86.877305]
stack backtrace:
[86.877326] CPU: 0 UID: 0 PID: 1432 Comm: i915_module_loa Tainted: G     U              6.15.0-rc5-CI_DRM_16515-gca0305cadc2d+ #1 PREEMPT(voluntary)
[86.877334] Tainted: [U]=USER
[86.877336] Hardware name:  /NUC5CPYB, BIOS PYBSWCEL.86A.0079.2020.0420.1316 04/20/2020
[86.877339] Call Trace:
[86.877344]  <TASK>
[86.877353]  dump_stack_lvl+0x91/0xf0
[86.877364]  dump_stack+0x10/0x20
[86.877369]  print_circular_bug+0x285/0x360
[86.877379]  check_noncircular+0x135/0x150
[86.877390]  __lock_acquire+0x1635/0x2810
[86.877403]  lock_acquire+0xc4/0x2f0
[86.877408]  ? stop_machine+0x1c/0x50
[86.877422]  ? __pfx_bxt_vtd_ggtt_insert_entries__cb+0x10/0x10 [i915]
[86.878173]  cpus_read_lock+0x41/0x100
[86.878182]  ? stop_machine+0x1c/0x50
[86.878191]  ? __pfx_bxt_vtd_ggtt_insert_entries__cb+0x10/0x10 [i915]
[86.878916]  stop_machine+0x1c/0x50
[86.878927]  bxt_vtd_ggtt_insert_entries__BKL+0x3b/0x60 [i915]
[86.879652]  intel_ggtt_bind_vma+0x43/0x70 [i915]
[86.880375]  __vma_bind+0x55/0x70 [i915]
[86.881133]  fence_work+0x26/0xa0 [i915]
[86.881851]  fence_notify+0xa1/0x140 [i915]
[86.882566]  __i915_sw_fence_complete+0x8f/0x270 [i915]
[86.883286]  i915_sw_fence_commit+0x39/0x60 [i915]
[86.884003]  i915_vma_pin_ww+0x462/0x1360 [i915]
[86.884756]  ? i915_vma_pin.constprop.0+0x6c/0x1d0 [i915]
[86.885513]  i915_vma_pin.constprop.0+0x133/0x1d0 [i915]
[86.886281]  initial_plane_vma+0x307/0x840 [i915]
[86.887049]  intel_initial_plane_config+0x33f/0x670 [i915]
[86.887819]  intel_display_driver_probe_nogem+0x1c6/0x260 [i915]
[86.888587]  i915_driver_probe+0x7fa/0xe80 [i915]
[86.889293]  ? mutex_unlock+0x12/0x20
[86.889301]  ? drm_privacy_screen_get+0x171/0x190
[86.889308]  ? acpi_dev_found+0x66/0x80
[86.889321]  i915_pci_probe+0xe6/0x220 [i915]
[86.890038]  local_pci_probe+0x47/0xb0
[86.890049]  pci_device_probe+0xf3/0x260
[86.890058]  really_probe+0xf1/0x3c0
[86.890067]  __driver_probe_device+0x8c/0x180
[86.890072]  driver_probe_device+0x24/0xd0
[86.890078]  __driver_attach+0x10f/0x220
[86.890083]  ? __pfx___driver_attach+0x10/0x10
[86.890088]  bus_for_each_dev+0x7f/0xe0
[86.890097]  driver_attach+0x1e/0x30
[86.890101]  bus_add_driver+0x151/0x290
[86.890107]  driver_register+0x5e/0x130
[86.890113]  __pci_register_driver+0x7d/0x90
[86.890119]  i915_pci_register_driver+0x23/0x30 [i915]
[86.890833]  i915_init+0x37/0x120 [i915]
[86.891482]  ? __pfx_i915_init+0x10/0x10 [i915]
[86.892135]  do_one_initcall+0x60/0x3f0
[86.892145]  ? __kmalloc_cache_noprof+0x33f/0x470
[86.892157]  do_init_module+0x97/0x2a0
[86.892164]  load_module+0x2c54/0x2d80
[86.892168]  ? __kernel_read+0x15c/0x300
[86.892185]  ? kernel_read_file+0x2b1/0x320
[86.892195]  init_module_from_file+0x96/0xe0
[86.892199]  ? init_module_from_file+0x96/0xe0
[86.892211]  idempotent_init_module+0x117/0x330
[86.892224]  __x64_sys_finit_module+0x77/0x100
[86.892230]  x64_sys_call+0x24de/0x2660
[86.892236]  do_syscall_64+0x91/0xe90
[86.892243]  ? irqentry_exit+0x77/0xb0
[86.892249]  ? sysvec_apic_timer_interrupt+0x57/0xc0
[86.892256]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[86.892261] RIP: 0033:0x7303e1b2725d
[86.892271] Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 8b bb 0d 00 f7 d8 64 89 01 48
[86.892276] RSP: 002b:00007ffddd1fdb38 EFLAGS: 00000246 ORIG_RAX: 0000000000000139
[86.892281] RAX: ffffffffffffffda RBX: 00005d771d88fd90 RCX: 00007303e1b2725d
[86.892285] RDX: 0000000000000000 RSI: 00005d771d893aa0 RDI: 000000000000000c
[86.892287] RBP: 00007ffddd1fdbf0 R08: 0000000000000040 R09: 00007ffddd1fdb80
[86.892289] R10: 00007303e1c03b20 R11: 0000000000000246 R12: 00005d771d893aa0
[86.892292] R13: 0000000000000000 R14: 00005d771d88f0d0 R15: 00005d771d895710
[86.892304]  </TASK>

Call asynchronous variant of dma_fence_work_commit() in that case.

v3: Provide more verbose in-line comment (Andi),
  - mention target environments in commit message.

Fixes: 7d1c261 ("drm/i915: Take reservation lock around i915_vma_pin.")
Closes: https://gitlab.freedesktop.org/drm/i915/kernel/-/issues/14985
Cc: Andi Shyti <andi.shyti@kernel.org>
Signed-off-by: Janusz Krzysztofik <janusz.krzysztofik@linux.intel.com>
Reviewed-by: Sebastian Brzezinka <sebastian.brzezinka@intel.com>
Reviewed-by: Krzysztof Karas <krzysztof.karas@intel.com>
Acked-by: Andi Shyti <andi.shyti@linux.intel.com>
Signed-off-by: Andi Shyti <andi.shyti@linux.intel.com>
Link: https://lore.kernel.org/r/20251023082925.351307-6-janusz.krzysztofik@linux.intel.com
(cherry picked from commit 648ef13)
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
pull bot pushed a commit that referenced this pull request Nov 8, 2025
When a connector is connected but inactive (e.g., disabled by desktop
environments), pipe_ctx->stream_res.tg will be destroyed. Then, reading
odm_combine_segments causes kernel NULL pointer dereference.

 BUG: kernel NULL pointer dereference, address: 0000000000000000
 #PF: supervisor read access in kernel mode
 #PF: error_code(0x0000) - not-present page
 PGD 0 P4D 0
 Oops: Oops: 0000 [#1] SMP NOPTI
 CPU: 16 UID: 0 PID: 26474 Comm: cat Not tainted 6.17.0+ #2 PREEMPT(lazy)  e6a17af9ee6db7c63e9d90dbe5b28ccab67520c6
 Hardware name: LENOVO 21Q4/LNVNB161216, BIOS PXCN25WW 03/27/2025
 RIP: 0010:odm_combine_segments_show+0x93/0xf0 [amdgpu]
 Code: 41 83 b8 b0 00 00 00 01 75 6e 48 98 ba a1 ff ff ff 48 c1 e0 0c 48 8d 8c 07 d8 02 00 00 48 85 c9 74 2d 48 8b bc 07 f0 08 00 00 <48> 8b 07 48 8b 80 08 02 00>
 RSP: 0018:ffffd1bf4b953c58 EFLAGS: 00010286
 RAX: 0000000000005000 RBX: ffff8e35976b02d0 RCX: ffff8e3aeed052d8
 RDX: 00000000ffffffa1 RSI: ffff8e35a3120800 RDI: 0000000000000000
 RBP: 0000000000000000 R08: ffff8e3580eb0000 R09: ffff8e35976b02d0
 R10: ffffd1bf4b953c78 R11: 0000000000000000 R12: ffffd1bf4b953d08
 R13: 0000000000040000 R14: 0000000000000001 R15: 0000000000000001
 FS:  00007f44d3f9f740(0000) GS:ffff8e3caa47f000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 0000000000000000 CR3: 00000006485c2000 CR4: 0000000000f50ef0
 PKRU: 55555554
 Call Trace:
  <TASK>
  seq_read_iter+0x125/0x490
  ? __alloc_frozen_pages_noprof+0x18f/0x350
  seq_read+0x12c/0x170
  full_proxy_read+0x51/0x80
  vfs_read+0xbc/0x390
  ? __handle_mm_fault+0xa46/0xef0
  ? do_syscall_64+0x71/0x900
  ksys_read+0x73/0xf0
  do_syscall_64+0x71/0x900
  ? count_memcg_events+0xc2/0x190
  ? handle_mm_fault+0x1d7/0x2d0
  ? do_user_addr_fault+0x21a/0x690
  ? exc_page_fault+0x7e/0x1a0
  entry_SYSCALL_64_after_hwframe+0x6c/0x74
 RIP: 0033:0x7f44d4031687
 Code: 48 89 fa 4c 89 df e8 58 b3 00 00 8b 93 08 03 00 00 59 5e 48 83 f8 fc 74 1a 5b c3 0f 1f 84 00 00 00 00 00 48 8b 44 24 10 0f 05 <5b> c3 0f 1f 80 00 00 00 00>
 RSP: 002b:00007ffdb4b5f0b0 EFLAGS: 00000202 ORIG_RAX: 0000000000000000
 RAX: ffffffffffffffda RBX: 00007f44d3f9f740 RCX: 00007f44d4031687
 RDX: 0000000000040000 RSI: 00007f44d3f5e000 RDI: 0000000000000003
 RBP: 0000000000040000 R08: 0000000000000000 R09: 0000000000000000
 R10: 0000000000000000 R11: 0000000000000202 R12: 00007f44d3f5e000
 R13: 0000000000000003 R14: 0000000000000000 R15: 0000000000040000
  </TASK>
 Modules linked in: tls tcp_diag inet_diag xt_mark ccm snd_hrtimer snd_seq_dummy snd_seq_midi snd_seq_oss snd_seq_midi_event snd_rawmidi snd_seq snd_seq_device x>
  snd_hda_codec_atihdmi snd_hda_codec_realtek_lib lenovo_wmi_helpers think_lmi snd_hda_codec_generic snd_hda_codec_hdmi snd_soc_core kvm snd_compress uvcvideo sn>
  platform_profile joydev amd_pmc mousedev mac_hid sch_fq_codel uinput i2c_dev parport_pc ppdev lp parport nvme_fabrics loop nfnetlink ip_tables x_tables dm_cryp>
 CR2: 0000000000000000
 ---[ end trace 0000000000000000 ]---
 RIP: 0010:odm_combine_segments_show+0x93/0xf0 [amdgpu]
 Code: 41 83 b8 b0 00 00 00 01 75 6e 48 98 ba a1 ff ff ff 48 c1 e0 0c 48 8d 8c 07 d8 02 00 00 48 85 c9 74 2d 48 8b bc 07 f0 08 00 00 <48> 8b 07 48 8b 80 08 02 00>
 RSP: 0018:ffffd1bf4b953c58 EFLAGS: 00010286
 RAX: 0000000000005000 RBX: ffff8e35976b02d0 RCX: ffff8e3aeed052d8
 RDX: 00000000ffffffa1 RSI: ffff8e35a3120800 RDI: 0000000000000000
 RBP: 0000000000000000 R08: ffff8e3580eb0000 R09: ffff8e35976b02d0
 R10: ffffd1bf4b953c78 R11: 0000000000000000 R12: ffffd1bf4b953d08
 R13: 0000000000040000 R14: 0000000000000001 R15: 0000000000000001
 FS:  00007f44d3f9f740(0000) GS:ffff8e3caa47f000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 0000000000000000 CR3: 00000006485c2000 CR4: 0000000000f50ef0
 PKRU: 55555554

Fix this by checking pipe_ctx->stream_res.tg before dereferencing.

Fixes: 07926ba ("drm/amd/display: Add debugfs interface for ODM combine info")
Signed-off-by: Rong Zhang <i@rong.moe>
Reviewed-by: Mario Limoncello <mario.limonciello@amd.com>
Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
(cherry picked from commit f19bbec)
Cc: stable@vger.kernel.org
pull bot pushed a commit that referenced this pull request Nov 10, 2025
Add VMX exit handlers for SEAMCALL and TDCALL to inject a #UD if a non-TD
guest attempts to execute SEAMCALL or TDCALL.  Neither SEAMCALL nor TDCALL
is gated by any software enablement other than VMXON, and so will generate
a VM-Exit instead of e.g. a native #UD when executed from the guest kernel.

Note!  No unprivileged DoS of the L1 kernel is possible as TDCALL and
SEAMCALL #GP at CPL > 0, and the CPL check is performed prior to the VMX
non-root (VM-Exit) check, i.e. userspace can't crash the VM. And for a
nested guest, KVM forwards unknown exits to L1, i.e. an L2 kernel can
crash itself, but not L1.

Note #2!  The Intel® Trust Domain CPU Architectural Extensions spec's
pseudocode shows the CPL > 0 check for SEAMCALL coming _after_ the VM-Exit,
but that appears to be a documentation bug (likely because the CPL > 0
check was incorrectly bundled with other lower-priority #GP checks).
Testing on SPR and EMR shows that the CPL > 0 check is performed before
the VMX non-root check, i.e. SEAMCALL #GPs when executed in usermode.

Note #3!  The aforementioned Trust Domain spec uses confusing pseudocode
that says that SEAMCALL will #UD if executed "inSEAM", but "inSEAM"
specifically means in SEAM Root Mode, i.e. in the TDX-Module.  The long-
form description explicitly states that SEAMCALL generates an exit when
executed in "SEAM VMX non-root operation".  But that's a moot point as the
TDX-Module injects #UD if the guest attempts to execute SEAMCALL, as
documented in the "Unconditionally Blocked Instructions" section of the
TDX-Module base specification.

Cc: stable@vger.kernel.org
Cc: Kai Huang <kai.huang@intel.com>
Cc: Xiaoyao Li <xiaoyao.li@intel.com>
Cc: Rick Edgecombe <rick.p.edgecombe@intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Link: https://lore.kernel.org/r/20251016182148.69085-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
pull bot pushed a commit that referenced this pull request Nov 10, 2025
 into HEAD

KVM/riscv fixes for 6.18, take #2

- Fix check for local interrupts on riscv32
- Read HGEIP CSR on the correct cpu when checking for IMSIC interrupts
- Remove automatic I/O mapping from kvm_arch_prepare_memory_region()
pull bot pushed a commit that referenced this pull request Nov 10, 2025
…/kernel/git/kvmarm/kvmarm into HEAD

KVM/arm654 fixes for 6.18, take #2

* Core fixes

  - Fix trapping regression when no in-kernel irqchip is present
    (20251021094358.1963807-1-sascha.bischoff@arm.com)

  - Check host-provided, untrusted ranges and offsets in pKVM
    (20251016164541.3771235-1-vdonnefort@google.com)
    (20251017075710.2605118-1-sebastianene@google.com)

  - Fix regression restoring the ID_PFR1_EL1 register
    (20251030122707.2033690-1-maz@kernel.org

  - Fix vgic ITS locking issues when LPIs are not directly injected
    (20251107184847.1784820-1-oupton@kernel.org)

* Test fixes

  - Correct target CPU programming in vgic_lpi_stress selftest
    (20251020145946.48288-1-mdittgen@amazon.de)

  - Fix exposure of SCTLR2_EL2 and ZCR_EL2 in get-reg-list selftest
    (20251023-b4-kvm-arm64-get-reg-list-sctlr-el2-v1-1-088f88ff992a@kernel.org)
    (20251024-kvm-arm64-get-reg-list-zcr-el2-v1-1-0cd0ff75e22f@kernel.org)

* Misc

  - Update Oliver's email address
    (20251107012830.1708225-1-oupton@kernel.org)
pull bot pushed a commit that referenced this pull request Nov 13, 2025
When freeing indexed arrays, the corresponding free function should
be called for each entry of the indexed array. For example, for
for 'struct tc_act_attrs' 'tc_act_attrs_free(...)' needs to be called
for each entry.

Previously, memory leaks were reported when enabling the ASAN
analyzer.

=================================================================
==874==ERROR: LeakSanitizer: detected memory leaks

Direct leak of 24 byte(s) in 1 object(s) allocated from:
    #0 0x7f221fd20cb5 in malloc ./debug/gcc/gcc/libsanitizer/asan/asan_malloc_linux.cpp:67
    #1 0x55c98db048af in tc_act_attrs_set_options_vlan_parms ../generated/tc-user.h:2813
    #2 0x55c98db048af in main  ./linux/tools/net/ynl/samples/tc-filter-add.c:71

Direct leak of 24 byte(s) in 1 object(s) allocated from:
    #0 0x7f221fd20cb5 in malloc ./debug/gcc/gcc/libsanitizer/asan/asan_malloc_linux.cpp:67
    #1 0x55c98db04a93 in tc_act_attrs_set_options_vlan_parms ../generated/tc-user.h:2813
    #2 0x55c98db04a93 in main ./linux/tools/net/ynl/samples/tc-filter-add.c:74

Direct leak of 10 byte(s) in 2 object(s) allocated from:
    #0 0x7f221fd20cb5 in malloc ./debug/gcc/gcc/libsanitizer/asan/asan_malloc_linux.cpp:67
    #1 0x55c98db0527d in tc_act_attrs_set_kind ../generated/tc-user.h:1622

SUMMARY: AddressSanitizer: 58 byte(s) leaked in 4 allocation(s).

The following diff illustrates the changes introduced compared to the
previous version of the code.

 void tc_flower_attrs_free(struct tc_flower_attrs *obj)
 {
+	unsigned int i;
+
 	free(obj->indev);
+	for (i = 0; i < obj->_count.act; i++)
+		tc_act_attrs_free(&obj->act[i]);
 	free(obj->act);
 	free(obj->key_eth_dst);
 	free(obj->key_eth_dst_mask);

Signed-off-by: Zahari Doychev <zahari.doychev@linux.com>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Link: https://patch.msgid.link/20251106151529.453026-3-zahari.doychev@linux.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
pull bot pushed a commit that referenced this pull request Nov 20, 2025
Handle skb allocation failures in RX path, to avoid NULL pointer
dereference and RX stalls under memory pressure. If the refill fails
with -ENOMEM, complete napi polling and wake up later to retry via timer.
Also explicitly re-enable RX DMA after oom, so the dmac doesn't remain
stopped in this situation.

Previously, memory pressure could lead to skb allocation failures and
subsequent Oops like:

	Oops: Kernel access of bad area, sig: 11 [#2]
	Hardware name: SonyPS3 Cell Broadband Engine 0x701000 PS3
	NIP [c0003d0000065900] gelic_net_poll+0x6c/0x2d0 [ps3_gelic] (unreliable)
	LR [c0003d00000659c4] gelic_net_poll+0x130/0x2d0 [ps3_gelic]
	Call Trace:
	  gelic_net_poll+0x130/0x2d0 [ps3_gelic] (unreliable)
	  __napi_poll+0x44/0x168
	  net_rx_action+0x178/0x290

Steps to reproduce the issue:
	1. Start a continuous network traffic, like scp of a 20GB file
	2. Inject failslab errors using the kernel fault injection:
	    echo -1 > /sys/kernel/debug/failslab/times
	    echo 30 > /sys/kernel/debug/failslab/interval
	    echo 100 > /sys/kernel/debug/failslab/probability
	3. After some time, traces start to appear, kernel Oopses
	   and the system stops

Step 2 is not always necessary, as it is usually already triggered by
the transfer of a big enough file.

Fixes: 02c1889 ("ps3: gigabit ethernet driver for PS3, take3")
Signed-off-by: Florian Fuchs <fuchsfl@gmail.com>
Link: https://patch.msgid.link/20251113181000.3914980-1-fuchsfl@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
pull bot pushed a commit that referenced this pull request Dec 6, 2025
When populating the initial memory image for a TDX guest, ADD pages to the
TD as part of establishing the mappings in the mirror EPT, as opposed to
creating the mappings and then doing ADD after the fact.  Doing ADD in the
S-EPT callbacks eliminates the need to track "premapped" pages, as the
mirror EPT (M-EPT) and S-EPT are always synchronized, e.g. if ADD fails,
KVM reverts to the previous M-EPT entry (guaranteed to be !PRESENT).

Eliminating the hole where the M-EPT can have a mapping that doesn't exist
in the S-EPT in turn obviates the need to handle errors that are unique to
encountering a missing S-EPT entry (see tdx_is_sept_zap_err_due_to_premap()).

Keeping the M-EPT and S-EPT synchronized also eliminates the need to check
for unconsumed "premap" entries during tdx_td_finalize(), as there simply
can't be any such entries.  Dropping that check in particular reduces the
overall cognitive load, as the management of nr_premapped with respect
to removal of S-EPT is _very_ subtle.  E.g. successful removal of an S-EPT
entry after it completed ADD doesn't adjust nr_premapped, but it's not
clear why that's "ok" but having half-baked entries is not (it's not truly
"ok" in that removing pages from the image will likely prevent the guest
from booting, but from KVM's perspective it's "ok").

Doing ADD in the S-EPT path requires passing an argument via a scratch
field, but the current approach of tracking the number of "premapped"
pages effectively does the same.  And the "premapped" counter is much more
dangerous, as it doesn't have a singular lock to protect its usage, since
nr_premapped can be modified as soon as mmu_lock is dropped, at least in
theory.  I.e. nr_premapped is guarded by slots_lock, but only for "happy"
paths.

Note, this approach was used/tried at various points in TDX development,
but was ultimately discarded due to a desire to avoid stashing temporary
state in kvm_tdx.  But as above, KVM ended up with such state anyways,
and fully committing to using temporary state provides better access
rules (100% guarded by slots_lock), and makes several edge cases flat out
impossible.

Note #2, continue to extend the measurement outside of mmu_lock, as it's
a slow operation (typically 16 SEAMCALLs per page whose data is included
in the measurement), and doesn't *need* to be done under mmu_lock, e.g.
for consistency purposes.  However, MR.EXTEND isn't _that_ slow, e.g.
~1ms latency to measure a full page, so if it needs to be done under
mmu_lock in the future, e.g. because KVM gains a flow that can remove
S-EPT entries during KVM_TDX_INIT_MEM_REGION, then extending the
measurement can also be moved into the S-EPT mapping path (again, only if
absolutely necessary).  P.S. _If_ MR.EXTEND is moved into the S-EPT path,
take care not to return an error up the stack if TDH_MR_EXTEND fails, as
removing the M-EPT entry but not the S-EPT entry would result in
inconsistent state!

Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Yan Zhao <yan.y.zhao@intel.com>
Tested-by: Yan Zhao <yan.y.zhao@intel.com>
Tested-by: Kai Huang <kai.huang@intel.com>
Link: https://patch.msgid.link/20251030200951.3402865-17-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
pull bot pushed a commit that referenced this pull request Dec 6, 2025
During vCPU creation, a vCPU may be destroyed immediately after
kvm_arch_vcpu_create() (e.g., due to vCPU id confiliction). However, the
vcpu_load() inside kvm_arch_vcpu_create() may have associate the vCPU to
pCPU via "list_add(&tdx->cpu_list, &per_cpu(associated_tdvcpus, cpu))"
before invoking tdx_vcpu_free().

Though there's no need to invoke tdh_vp_flush() on the vCPU, failing to
dissociate the vCPU from pCPU (i.e., "list_del(&to_tdx(vcpu)->cpu_list)")
will cause list corruption of the per-pCPU list associated_tdvcpus.

Then, a later list_add() during vcpu_load() would detect list corruption
and print calltrace as shown below.

Dissociate a vCPU from its associated pCPU in tdx_vcpu_free() for the vCPUs
destroyed immediately after creation which must be in
VCPU_TD_STATE_UNINITIALIZED state.

kernel BUG at lib/list_debug.c:29!
Oops: invalid opcode: 0000 [#2] SMP NOPTI
RIP: 0010:__list_add_valid_or_report+0x82/0xd0

Call Trace:
 <TASK>
 tdx_vcpu_load+0xa8/0x120
 vt_vcpu_load+0x25/0x30
 kvm_arch_vcpu_load+0x81/0x300
 vcpu_load+0x55/0x90
 kvm_arch_vcpu_create+0x24f/0x330
 kvm_vm_ioctl_create_vcpu+0x1b1/0x53
 kvm_vm_ioctl+0xc2/0xa60
  __x64_sys_ioctl+0x9a/0xf0
 x64_sys_call+0x10ee/0x20d0
 do_syscall_64+0xc3/0x470
 entry_SYSCALL_64_after_hwframe+0x77/0x7f

Fixes: d789fa6 ("KVM: TDX: Handle vCPU dissociation")
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Yan Zhao <yan.y.zhao@intel.com>
Tested-by: Yan Zhao <yan.y.zhao@intel.com>
Tested-by: Kai Huang <kai.huang@intel.com>
Link: https://patch.msgid.link/20251030200951.3402865-29-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
pull bot pushed a commit that referenced this pull request Dec 6, 2025
According to the APM Volume #2, Section 15.17, Table 15-10 (24593—Rev.
3.42—March 2024), When "GIF==0", an "Debug exception or trap, due to
breakpoint register match" should be "Ignored and discarded".

KVM lacks any handling of this. Even when vGIF is enabled and vGIF==0,
the CPU does not ignore #DBs and relies on the VMM to do so.

Handling this is possible, but the complexity is unjustified given the
rarity of using HW breakpoints when GIF==0 (e.g. near VMRUN). KVM would
need to intercept the #DB, temporarily disable the breakpoint,
singe-step over the instruction (probably reusing NMI singe-stepping),
and re-enable the breakpoint.

Instead, document this as an erratum.

Signed-off-by: Yosry Ahmed <yosry.ahmed@linux.dev>
Reviewed-by: Bagas Sanjaya <bagasdotme@gmail.com>
Link: https://patch.msgid.link/20251030223757.2950309-1-yosry.ahmed@linux.dev
Signed-off-by: Sean Christopherson <seanjc@google.com>
pull bot pushed a commit that referenced this pull request Dec 6, 2025
Rework the handling of the MMIO Stale Data mitigation to clear CPU buffers
immediately prior to VM-Enter, i.e. in the same location that KVM emits a
VERW for unconditional (at runtime) clearing.  Co-locating the code and
using a single ALTERNATIVES_2 makes it more obvious how VMX mitigates the
various vulnerabilities.

Deliberately order the alternatives as:

 0. Do nothing
 1. Clear if vCPU can access MMIO
 2. Clear always

since the last alternative wins in ALTERNATIVES_2(), i.e. so that KVM will
honor the strictest mitigation (always clear CPU buffers) if multiple
mitigations are selected.  E.g. even if the kernel chooses to mitigate
MMIO Stale Data via X86_FEATURE_CLEAR_CPU_BUF_VM_MMIO, another mitigation
may enable X86_FEATURE_CLEAR_CPU_BUF_VM, and that other thing needs to win.

Note, decoupling the MMIO mitigation from the L1TF mitigation also fixes
a mostly-benign flaw where KVM wouldn't do any clearing/flushing if the
L1TF mitigation is configured to conditionally flush the L1D, and the MMIO
mitigation but not any other "clear CPU buffers" mitigation is enabled.
For that specific scenario, KVM would skip clearing CPU buffers for the
MMIO mitigation even though the kernel requested a clear on every VM-Enter.

Note #2, the flaw goes back to the introduction of the MDS mitigation.  The
MDS mitigation was inadvertently fixed by commit 43fb862 ("KVM/VMX:
Move VERW closer to VMentry for MDS mitigation"), but previous kernels
that flush CPU buffers in vmx_vcpu_enter_exit() are affected (though it's
unlikely the flaw is meaningfully exploitable even older kernels).

Fixes: 650b68a ("x86/kvm/vmx: Add MDS protection when L1D Flush is not active")
Suggested-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Reviewed-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Reviewed-by: Brendan Jackman <jackmanb@google.com>
Link: https://patch.msgid.link/20251113233746.1703361-6-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
pull bot pushed a commit that referenced this pull request Dec 7, 2025
… 'T'

When perf report with annotation for a symbol, press 's' and 'T', then exit
the annotate browser. Once annotate the same symbol, the annotate browser
will crash.

The browser.arch was required to be correctly updated when data type
feature was enabled by 'T'. Usually it was initialized by symbol__annotate2
function. If a symbol has already been correctly annotated at the first
time, it should not call the symbol__annotate2 function again, thus the
browser.arch will not get initialized. Then at the second time to show the
annotate browser, the data type needs to be displayed but the browser.arch
is empty.

Stack trace as below:

Perf: Segmentation fault
-------- backtrace --------
    #0 0x55d365 in ui__signal_backtrace setup.c:0
    #1 0x7f5ff1a3e930 in __restore_rt libc.so.6[3e930]
    #2 0x570f08 in arch__is perf[570f08]
    #3 0x562186 in annotate_get_insn_location perf[562186]
    #4 0x562626 in __hist_entry__get_data_type annotate.c:0
    #5 0x56476d in annotation_line__write perf[56476d]
    #6 0x54e2db in annotate_browser__write annotate.c:0
    #7 0x54d061 in ui_browser__list_head_refresh perf[54d061]
    #8 0x54dc9e in annotate_browser__refresh annotate.c:0
    #9 0x54c03d in __ui_browser__refresh browser.c:0
    #10 0x54ccf8 in ui_browser__run perf[54ccf8]
    #11 0x54eb92 in __hist_entry__tui_annotate perf[54eb92]
    #12 0x552293 in do_annotate hists.c:0
    #13 0x55941c in evsel__hists_browse hists.c:0
    #14 0x55b00f in evlist__tui_browse_hists perf[55b00f]
    #15 0x42ff02 in cmd_report perf[42ff02]
    #16 0x494008 in run_builtin perf.c:0
    #17 0x494305 in handle_internal_command perf.c:0
    #18 0x410547 in main perf[410547]
    #19 0x7f5ff1a295d0 in __libc_start_call_main libc.so.6[295d0]
    #20 0x7f5ff1a29680 in __libc_start_main@@GLIBC_2.34 libc.so.6[29680]
    #21 0x410b75 in _start perf[410b75]

Fixes: 1d4374a ("perf annotate: Add 'T' hot key to toggle data type display")
Reviewed-by: James Clark <james.clark@linaro.org>
Tested-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Tianyou Li <tianyou.li@intel.com>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
pull bot pushed a commit that referenced this pull request Dec 7, 2025
When using perf record with the `--overwrite` option, a segmentation fault
occurs if an event fails to open. For example:

  perf record -e cycles-ct -F 1000 -a --overwrite
  Error:
  cycles-ct:H: PMU Hardware doesn't support sampling/overflow-interrupts. Try 'perf stat'
  perf: Segmentation fault
      #0 0x6466b6 in dump_stack debug.c:366
      #1 0x646729 in sighandler_dump_stack debug.c:378
      #2 0x453fd1 in sigsegv_handler builtin-record.c:722
      #3 0x7f8454e65090 in __restore_rt libc-2.32.so[54090]
      #4 0x6c5671 in __perf_event__synthesize_id_index synthetic-events.c:1862
      #5 0x6c5ac0 in perf_event__synthesize_id_index synthetic-events.c:1943
      #6 0x458090 in record__synthesize builtin-record.c:2075
      #7 0x45a85a in __cmd_record builtin-record.c:2888
      #8 0x45deb6 in cmd_record builtin-record.c:4374
      #9 0x4e5e33 in run_builtin perf.c:349
      #10 0x4e60bf in handle_internal_command perf.c:401
      #11 0x4e6215 in run_argv perf.c:448
      #12 0x4e653a in main perf.c:555
      #13 0x7f8454e4fa72 in __libc_start_main libc-2.32.so[3ea72]
      #14 0x43a3ee in _start ??:0

The --overwrite option implies --tail-synthesize, which collects non-sample
events reflecting the system status when recording finishes. However, when
evsel opening fails (e.g., unsupported event 'cycles-ct'), session->evlist
is not initialized and remains NULL. The code unconditionally calls
record__synthesize() in the error path, which iterates through the NULL
evlist pointer and causes a segfault.

To fix it, move the record__synthesize() call inside the error check block, so
it's only called when there was no error during recording, ensuring that evlist
is properly initialized.

Fixes: 4ea648a ("perf record: Add --tail-synthesize option")
Signed-off-by: Shuai Xue <xueshuai@linux.alibaba.com>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
pull bot pushed a commit that referenced this pull request Dec 7, 2025
When interrupting perf stat in repeat mode with a signal the signal is
passed to the child process but the repeat doesn't terminate:
```
$ perf stat -v --null --repeat 10 sleep 1
Control descriptor is not initialized
[ perf stat: executing run #1 ... ]
[ perf stat: executing run #2 ... ]
^Csleep: Interrupt
[ perf stat: executing run #3 ... ]
[ perf stat: executing run #4 ... ]
[ perf stat: executing run #5 ... ]
[ perf stat: executing run #6 ... ]
[ perf stat: executing run #7 ... ]
[ perf stat: executing run #8 ... ]
[ perf stat: executing run #9 ... ]
[ perf stat: executing run #10 ... ]

 Performance counter stats for 'sleep 1' (10 runs):

            0.9500 +- 0.0512 seconds time elapsed  ( +-  5.39% )

0.01user 0.02system 0:09.53elapsed 0%CPU (0avgtext+0avgdata 18940maxresident)k
29944inputs+0outputs (0major+2629minor)pagefaults 0swaps
```

Terminate the repeated run and give a reasonable exit value:
```
$ perf stat -v --null --repeat 10 sleep 1
Control descriptor is not initialized
[ perf stat: executing run #1 ... ]
[ perf stat: executing run #2 ... ]
[ perf stat: executing run #3 ... ]
^Csleep: Interrupt

 Performance counter stats for 'sleep 1' (10 runs):

             0.680 +- 0.321 seconds time elapsed  ( +- 47.16% )

Command exited with non-zero status 130
0.00user 0.01system 0:02.05elapsed 0%CPU (0avgtext+0avgdata 70688maxresident)k
0inputs+0outputs (0major+5002minor)pagefaults 0swaps
```

Note, this also changes the exit value for non-repeat runs when
interrupted by a signal.

Reported-by: Ingo Molnar <mingo@kernel.org>
Closes: https://lore.kernel.org/lkml/aS5wjmbAM9ka3M2g@gmail.com/
Signed-off-by: Ian Rogers <irogers@google.com>
Tested-by: Thomas Richter <tmricht@linux.ibm.com>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
pull bot pushed a commit that referenced this pull request Dec 8, 2025
Commit 1d2da79 ("pinctrl: renesas: rzg2l: Avoid configuring ISEL in
gpio_irq_{en,dis}able*()") dropped the configuration of ISEL from
struct irq_chip::{irq_enable, irq_disable} APIs and moved it to
struct gpio_chip::irq::{child_to_parent_hwirq,
child_irq_domain_ops::free} APIs to fix spurious IRQs.

After commit 1d2da79 ("pinctrl: renesas: rzg2l: Avoid configuring ISEL
in gpio_irq_{en,dis}able*()"), ISEL was no longer configured properly on
resume. This is because the pinctrl resume code used
struct irq_chip::irq_enable  (called from rzg2l_gpio_irq_restore()) to
reconfigure the wakeup interrupts. Some drivers (e.g. Ethernet) may also
reconfigure non-wakeup interrupts on resume through their own code,
eventually calling struct irq_chip::irq_enable.

Fix this by adding ISEL configuration back into the
struct irq_chip::irq_enable API and on resume path for wakeup interrupts.

As struct irq_chip::irq_enable needs now to lock to update the ISEL,
convert the struct rzg2l_pinctrl::lock to a raw spinlock and replace the
locking API calls with the raw variants. Otherwise the lockdep reports
invalid wait context when probing the adv7511 module on RZ/G2L:

 [ BUG: Invalid wait context ]
 6.17.0-rc5-next-20250911-00001-gfcfac22533c9 #18 Not tainted
 -----------------------------
 (udev-worker)/165 is trying to lock:
 ffff00000e3664a8 (&pctrl->lock){....}-{3:3}, at: rzg2l_gpio_irq_enable+0x38/0x78
 other info that might help us debug this:
 context-{5:5}
 3 locks held by (udev-worker)/165:
 #0: ffff00000e890108 (&dev->mutex){....}-{4:4}, at: __driver_attach+0x90/0x1ac
 #1: ffff000011c07240 (request_class){+.+.}-{4:4}, at: __setup_irq+0xb4/0x6dc
 #2: ffff000011c070c8 (lock_class){....}-{2:2}, at: __setup_irq+0xdc/0x6dc
 stack backtrace:
 CPU: 1 UID: 0 PID: 165 Comm: (udev-worker) Not tainted 6.17.0-rc5-next-20250911-00001-gfcfac22533c9 #18 PREEMPT
 Hardware name: Renesas SMARC EVK based on r9a07g044l2 (DT)
 Call trace:
 show_stack+0x18/0x24 (C)
 dump_stack_lvl+0x90/0xd0
 dump_stack+0x18/0x24
 __lock_acquire+0xa14/0x20b4
 lock_acquire+0x1c8/0x354
 _raw_spin_lock_irqsave+0x60/0x88
 rzg2l_gpio_irq_enable+0x38/0x78
 irq_enable+0x40/0x8c
 __irq_startup+0x78/0xa4
 irq_startup+0x108/0x16c
 __setup_irq+0x3c0/0x6dc
 request_threaded_irq+0xec/0x1ac
 devm_request_threaded_irq+0x80/0x134
 adv7511_probe+0x928/0x9a4 [adv7511]
 i2c_device_probe+0x22c/0x3dc
 really_probe+0xbc/0x2a0
 __driver_probe_device+0x78/0x12c
 driver_probe_device+0x40/0x164
 __driver_attach+0x9c/0x1ac
 bus_for_each_dev+0x74/0xd0
 driver_attach+0x24/0x30
 bus_add_driver+0xe4/0x208
 driver_register+0x60/0x128
 i2c_register_driver+0x48/0xd0
 adv7511_init+0x5c/0x1000 [adv7511]
 do_one_initcall+0x64/0x30c
 do_init_module+0x58/0x23c
 load_module+0x1bcc/0x1d40
 init_module_from_file+0x88/0xc4
 idempotent_init_module+0x188/0x27c
 __arm64_sys_finit_module+0x68/0xac
 invoke_syscall+0x48/0x110
 el0_svc_common.constprop.0+0xc0/0xe0
 do_el0_svc+0x1c/0x28
 el0_svc+0x4c/0x160
 el0t_64_sync_handler+0xa0/0xe4
 el0t_64_sync+0x198/0x19c

Having ISEL configuration back into the struct irq_chip::irq_enable API
should be safe with respect to spurious IRQs, as in the probe case IRQs
are enabled anyway in struct gpio_chip::irq::child_to_parent_hwirq. No
spurious IRQs were detected on suspend/resume, boot, ethernet link
insert/remove tests (executed on RZ/G3S). Boot, ethernet link
insert/remove tests were also executed successfully on RZ/G2L.

Fixes: 1d2da79 ("pinctrl: renesas: rzg2l: Avoid configuring ISEL in gpio_irq_{en,dis}able*(")
Cc: stable@vger.kernel.org
Signed-off-by: Claudiu Beznea <claudiu.beznea.uj@bp.renesas.com>
Reviewed-by: Geert Uytterhoeven <geert+renesas@glider.be>
Link: https://patch.msgid.link/20250912095308.3603704-1-claudiu.beznea.uj@bp.renesas.com
Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
pull bot pushed a commit that referenced this pull request Dec 8, 2025
The pic64gx has a second pinmux "downstream" of the iomux0 pinmux. The
documentation for the SoC provides no name for this device, but it is
used to swap pins between either GPIO controller #2 or select other
functions, hence the "gpio2" name. Currently there is no documentation
about what each bit actually does that is publicly available, nor (I
believe) what pins are affected. That info is as follows:

pin     role (1/0)
---     ----------
E14	MAC_0_MDC/GPIO_2_0
E15	MAC_0_MDIO/GPIO_2_1
F16	MAC_1_MDC/GPIO_2_2
F17	MAC_1_MDIO/GPIO_2_3
D19	SPI_0_CLK/GPIO_2_4
B18	SPI_0_SS0/GPIO_2_5
B10	CAN_0_RXBUS/GPIO_2_6
C14	PCIE_PERST_2#/GPIO_2_7
E18	PCIE_WAKE#/GPIO_2_8
D18	PCIE_PERST_1#/GPIO_2_9
E19	SPI_0_DO/GPIO_2_10
C7	SPI_0_DI/GPIO_2_11
D6	QSPI_SS0/GPIO_2_12
D7	QSPI_CLK (B)/GPIO_2_13
C9	QSPI_DATA0/GPIO_2_14
C10	QSPI_DATA1/GPIO_2_15
A5	QSPI_DATA2/GPIO_2_16
A6	QSPI_DATA3/GPIO_2_17
D8	MMUART_3_RXD/GPIO_2_18
D9	MMUART_3_TXD/GPIO_2_19
B8	MMUART_4_RXD/GPIO_2_20
A8	MMUART_4_TXD/GPIO_2_21
C12	CAN_1_TXBUS/GPIO_2_22
B12	CAN_1_RXBUS/GPIO_2_23
A11	CAN_0_TX_EBL_N/GPIO_2_24
A10	CAN_1_TX_EBL_N/GPIO_2_25
D11	MMUART_2_RXD/GPIO_2_26
C11	MMUART_2_TXD/GPIO_2_27
B9	CAN_0_TXBUS/GPIO_2_28

Reviewed-by: Rob Herring (Arm) <robh@kernel.org>
Signed-off-by: Conor Dooley <conor.dooley@microchip.com>
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
pull bot pushed a commit that referenced this pull request Dec 8, 2025
The pic64gx has a second pinmux "downstream" of the iomux0 pinmux. The
documentation for the SoC provides no name for this device, but it is
used to swap pins between either GPIO controller #2 or select other
functions, hence the "gpio2" name. Add a driver for it.

Signed-off-by: Conor Dooley <conor.dooley@microchip.com>
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants