forked from gregkh/linux
-
Notifications
You must be signed in to change notification settings - Fork 22
Amzn2 5.10.144 SVE state trap patch #1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Closed
RamaMalladiAWS
wants to merge
8
commits into
amazonlinux:amazon-5.10.144
from
RamaMalladiAWS:amzn2-5.10.144-svepatch
Closed
Amzn2 5.10.144 SVE state trap patch #1
RamaMalladiAWS
wants to merge
8
commits into
amazonlinux:amazon-5.10.144
from
RamaMalladiAWS:amzn2-5.10.144-svepatch
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Since 8383741 (KVM: arm64: Get rid of host SVE tracking/saving) KVM has not tracked the host SVE state, relying on the fact that we currently disable SVE whenever we perform a syscall. This may not be true in future since performance optimisation may result in us keeping SVE enabled in order to avoid needing to take access traps to reenable it. Handle this by clearing TIF_SVE and converting the stored task state to FPSIMD format when preparing to run the guest. This is done with a new call fpsimd_kvm_prepare() to keep the direct state manipulation functions internal to fpsimd.c. Signed-off-by: Mark Brown <broonie@kernel.org> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> Reviewed-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20221115094640.112848-2-broonie@kernel.org Signed-off-by: Will Deacon <will@kernel.org>
When we save the state for the floating point registers this can be done in the form visible through either the FPSIMD V registers or the SVE Z and P registers. At present we track which format is currently used based on TIF_SVE and the SME streaming mode state but particularly in the SVE case this limits our options for optimising things, especially around syscalls. Introduce a new enum which we place together with saved floating point state in both thread_struct and the KVM guest state which explicitly states which format is active and keep it up to date when we change it. At present we do not use this state except to verify that it has the expected value when loading the state, future patches will introduce functional changes. Signed-off-by: Mark Brown <broonie@kernel.org> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> Reviewed-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20221115094640.112848-3-broonie@kernel.org Signed-off-by: Will Deacon <will@kernel.org>
In order to avoid needlessly saving and restoring the guest registers KVM relies on the host FPSMID code to save the guest registers when we context switch away from the guest. This is done by binding the KVM guest state to the CPU on top of the task state that was originally there, then carefully managing the TIF_SVE flag for the task to cause the host to save the full SVE state when needed regardless of the needs of the host task. This works well enough but isn't terribly direct about what is going on and makes it much more complicated to try to optimise what we're doing with the SVE register state. Let's instead have KVM pass in the register state it wants saving when it binds to the CPU. We introduce a new FP_STATE_CURRENT for use during normal task binding to indicate that we should base our decisions on the current task. This should not be used when actually saving. Ideally we might want to use a separate enum for the type to save but this enum and the enum values would then need to be named which has problems with clarity and ambiguity. In order to ease any future debugging that might be required this patch does not actually update any of the decision making about what to save, it merely starts tracking the new information and warns if the requested state is not what we would otherwise have decided to save. Signed-off-by: Mark Brown <broonie@kernel.org> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> Reviewed-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20221115094640.112848-4-broonie@kernel.org Signed-off-by: Will Deacon <will@kernel.org>
Now that we are explicitly telling the host FP code which register state it needs to save we can remove the manipulation of TIF_SVE from the KVM code, simplifying it and allowing us to optimise our handling of normal tasks. Remove the manipulation of TIF_SVE from KVM and instead rely on to_save to ensure we save the correct data for it. There should be no functional or performance impact from this change. Signed-off-by: Mark Brown <broonie@kernel.org> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> Reviewed-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20221115094640.112848-5-broonie@kernel.org Signed-off-by: Will Deacon <will@kernel.org>
Now that we are recording the type of floating point register state we are saving when we write the register state out to memory we can use that information when we load from memory to decide which format to load, bringing TIF_SVE into line with what we saved rather than relying on TIF_SVE to determine what to load. The SME state details are already recorded directly in the saved SVCR and handled based on the information there. Since we are not changing any of the save paths there should be no functional change from this patch, further patches will make use of this to optimise and clarify the code. Signed-off-by: Mark Brown <broonie@kernel.org> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> Reviewed-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20221115094640.112848-6-broonie@kernel.org Signed-off-by: Will Deacon <will@kernel.org>
Now that we track the type of the stored register state separately to what is active in the task, it is valid to have the FPSIMD register state stored while in streaming mode. Remove the special case handling for SME when setting FPSIMD register state. Signed-off-by: Mark Brown <broonie@kernel.org> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> Reviewed-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20221115094640.112848-7-broonie@kernel.org Signed-off-by: Will Deacon <will@kernel.org>
The syscall ABI says that the SVE register state not shared with FPSIMD may not be preserved on syscall, and this is the only mechanism we have in the ABI to stop tracking the extra SVE state for a process. Currently we do this unconditionally by means of disabling SVE for the process on syscall, causing userspace to take a trap to EL1 if it uses SVE again. These extra traps result in a noticeable overhead for using SVE instead of FPSIMD in some workloads, especially for simple syscalls where we can return directly to userspace and would not otherwise need to update the floating point registers. Tests with fp-pidbench show an approximately 70% overhead on a range of implementations when SVE is in use - while this is an extreme and entirely artificial benchmark it is clear that there is some useful room for improvement here. Now that we have the ability to track the decision about what to save seprately to TIF_SVE we can improve things by leaving TIF_SVE enabled on syscall but only saving the FPSIMD registers if we are in a syscall. This means that if we need to restore the register state from memory (eg, after a context switch or kernel mode NEON) we will drop TIF_SVE and reenable traps for userspace but if we can just return to userspace then traps will remain disabled. Since our current implementation and hence ABI has the effect of zeroing all the SVE register state not shared with FPSIMD on syscall we replace the disabling of TIF_SVE with a flush of the non-shared register state, this means that there is still some overhead for syscalls when SVE is in use but it is very much reduced. Signed-off-by: Mark Brown <broonie@kernel.org> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> Reviewed-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20221115094640.112848-8-broonie@kernel.org Signed-off-by: Will Deacon <will@kernel.org>
…yscall by default" https://lore.kernel.org/linux-arm-kernel/20220620124158.482039-6-broonie@kernel.org/T/#m4951d967db2b2091769db05d2ffc29ddf319b62b Other code changes for successfully merging this SVE patch fix patch to the kernel 5.10.
risbhat
pushed a commit
that referenced
this pull request
Aug 1, 2023
This patch came from the 2.10.5 branch of AmazonFSxLustreClient repo. The patch from that repo is: mdc: hold lock while walking changelog dev list (LU-12566) Prevent the following GPF which is causing stuck processes, when running mount and umount concurrently on the same host. general protection fault: 0000 [#1] SMP PTI ... RIP: 0010:mdc_changelog_cdev_finish+0x3f/0x1b9 [mdc] ... Call Trace: mdc_precleanup+0x2a/0x3c0 [mdc] Original patch was: LU-12566 mdc: hold lock while walking changelog dev list In mdc_changelog_cdev_finish() we need chlg_registered_dev_lock while walking and changing entries on the chlog_registered_devs and ced_obds lists in chlg_registered_dev_find_by_obd(). Move the calling of chlg_registered_dev_find_by_obd() under the mutex, and add assertions to the places where the lists are walked and changed that the mutex is held. Lustre-change: https://review.whamcloud.com/35668 Lustre-commit: a260c530801db7f58efa93b774f06b0ce72649a3 Fixes: 1d40214d96dd ("LU-7659 mdc: expose changelog through char devices") Signed-off-by: Andreas Dilger <adilger@whamcloud.com> Change-Id: Ib62fdff87cde6a4bcfb9bea24a2ea72a933ebbe5 Signed-off-by: Minh Diep <mdiep@whamcloud.com> Reviewed-on: https://review.whamcloud.com/35835 Signed-off-by: Andy Strohman <astroh@amazon.com>
risbhat
pushed a commit
that referenced
this pull request
Aug 1, 2023
Steal time initialization requires mapping a memory region which invokes a memory allocation. Doing this at CPU starting time results in the following trace when CONFIG_DEBUG_ATOMIC_SLEEP is enabled: BUG: sleeping function called from invalid context at mm/slab.h:498 in_atomic(): 1, irqs_disabled(): 128, non_block: 0, pid: 0, name: swapper/1 CPU: 1 PID: 0 Comm: swapper/1 Not tainted 5.9.0-rc5+ #1 Call trace: dump_backtrace+0x0/0x208 show_stack+0x1c/0x28 dump_stack+0xc4/0x11c ___might_sleep+0xf8/0x130 __might_sleep+0x58/0x90 slab_pre_alloc_hook.constprop.101+0xd0/0x118 kmem_cache_alloc_node_trace+0x84/0x270 __get_vm_area_node+0x88/0x210 get_vm_area_caller+0x38/0x40 __ioremap_caller+0x70/0xf8 ioremap_cache+0x78/0xb0 memremap+0x9c/0x1a8 init_stolen_time_cpu+0x54/0xf0 cpuhp_invoke_callback+0xa8/0x720 notify_cpu_starting+0xc8/0xd8 secondary_start_kernel+0x114/0x180 CPU1: Booted secondary processor 0x0000000001 [0x431f0a11] However we don't need to initialize steal time at CPU starting time. We can simply wait until CPU online time, just sacrificing a bit of accuracy by returning zero for steal time until we know better. While at it, add __init to the functions that are only called by pv_time_init() which is __init. Signed-off-by: Andrew Jones <drjones@redhat.com> Fixes: e0685fa ("arm64: Retrieve stolen time as paravirtualized guest") Cc: stable@vger.kernel.org Reviewed-by: Steven Price <steven.price@arm.com> Link: https://lore.kernel.org/r/20200916154530.40809-1-drjones@redhat.com Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
q2ven
pushed a commit
that referenced
this pull request
Aug 1, 2023
This patch came from the 2.10.5 branch of AmazonFSxLustreClient repo. The patch from that repo is: mdc: hold lock while walking changelog dev list (LU-12566) Prevent the following GPF which is causing stuck processes, when running mount and umount concurrently on the same host. general protection fault: 0000 [#1] SMP PTI ... RIP: 0010:mdc_changelog_cdev_finish+0x3f/0x1b9 [mdc] ... Call Trace: mdc_precleanup+0x2a/0x3c0 [mdc] Original patch was: LU-12566 mdc: hold lock while walking changelog dev list In mdc_changelog_cdev_finish() we need chlg_registered_dev_lock while walking and changing entries on the chlog_registered_devs and ced_obds lists in chlg_registered_dev_find_by_obd(). Move the calling of chlg_registered_dev_find_by_obd() under the mutex, and add assertions to the places where the lists are walked and changed that the mutex is held. Lustre-change: https://review.whamcloud.com/35668 Lustre-commit: a260c530801db7f58efa93b774f06b0ce72649a3 Fixes: 1d40214d96dd ("LU-7659 mdc: expose changelog through char devices") Signed-off-by: Andreas Dilger <adilger@whamcloud.com> Change-Id: Ib62fdff87cde6a4bcfb9bea24a2ea72a933ebbe5 Signed-off-by: Minh Diep <mdiep@whamcloud.com> Reviewed-on: https://review.whamcloud.com/35835 Signed-off-by: Andy Strohman <astroh@amazon.com>
q2ven
pushed a commit
that referenced
this pull request
Aug 1, 2023
…oles on aarch64 offline_pages() properly checks for memory holes and bails out. However, we do a page_zone(pfn_to_page(start_pfn)) before calling offline_pages() when offlining a memory block. We should not unconditionally call page_zone(pfn_to_page(start_pfn)) on aarch64 in offlining code, otherwise we can trigger a BUG when hitting a memory hole: kernel BUG at include/linux/mm.h:1383! Internal error: Oops - BUG: 0 [#1] SMP Modules linked in: loop processor efivarfs ip_tables x_tables ext4 mbcache jbd2 dm_mod igb nvme i2c_algo_bit mlx5_core i2c_core nvme_core firmware_class CPU: 13 PID: 1694 Comm: ranbug Not tainted 5.12.0-next-20210524+ #4 Hardware name: MiTAC RAPTOR EV-883832-X3-0001/RAPTOR, BIOS 1.6 06/28/2020 pstate: 60000005 (nZCv daif -PAN -UAO -TCO BTYPE=--) pc : memory_subsys_offline+0x1f8/0x250 lr : memory_subsys_offline+0x1f8/0x250 Call trace: memory_subsys_offline+0x1f8/0x250 device_offline+0x154/0x1d8 online_store+0xa4/0x118 dev_attr_store+0x44/0x78 sysfs_kf_write+0xe8/0x138 kernfs_fop_write_iter+0x26c/0x3d0 new_sync_write+0x2bc/0x4f8 vfs_write+0x718/0xc88 ksys_write+0xf8/0x1e0 __arm64_sys_write+0x74/0xa8 invoke_syscall.constprop.0+0x78/0x1e8 do_el0_svc+0xe4/0x298 el0_svc+0x20/0x30 el0_sync_handler+0xb0/0xb8 el0_sync+0x178/0x180 Kernel panic - not syncing: Oops - BUG: Fatal exception SMP: stopping secondary CPUs Kernel Offset: disabled CPU features: 0x00000251,20000846 Memory Limit: none If nr_vmemmap_pages is set, we know that we are dealing with hotplugged memory that doesn't have any holes. So call page_zone(pfn_to_page(start_pfn)) only when really necessary -- when nr_vmemmap_pages is set and we actually adjust the present pages. Link: https://lkml.kernel.org/r/20210526075226.5572-1-david@redhat.com Fixes: a08a2ae ("mm,memory_hotplug: allocate memmap from the added memory range") Signed-off-by: David Hildenbrand <david@redhat.com> Reported-by: Qian Cai (QUIC) <quic_qiancai@quicinc.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Mike Rapoport <rppt@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> (cherry picked from commit 9281305)
paniakin-aws
pushed a commit
that referenced
this pull request
Aug 1, 2023
This patch came from the 2.10.5 branch of AmazonFSxLustreClient repo. The patch from that repo is: mdc: hold lock while walking changelog dev list (LU-12566) Prevent the following GPF which is causing stuck processes, when running mount and umount concurrently on the same host. general protection fault: 0000 [#1] SMP PTI ... RIP: 0010:mdc_changelog_cdev_finish+0x3f/0x1b9 [mdc] ... Call Trace: mdc_precleanup+0x2a/0x3c0 [mdc] Original patch was: LU-12566 mdc: hold lock while walking changelog dev list In mdc_changelog_cdev_finish() we need chlg_registered_dev_lock while walking and changing entries on the chlog_registered_devs and ced_obds lists in chlg_registered_dev_find_by_obd(). Move the calling of chlg_registered_dev_find_by_obd() under the mutex, and add assertions to the places where the lists are walked and changed that the mutex is held. Lustre-change: https://review.whamcloud.com/35668 Lustre-commit: a260c530801db7f58efa93b774f06b0ce72649a3 Fixes: 1d40214d96dd ("LU-7659 mdc: expose changelog through char devices") Signed-off-by: Andreas Dilger <adilger@whamcloud.com> Change-Id: Ib62fdff87cde6a4bcfb9bea24a2ea72a933ebbe5 Signed-off-by: Minh Diep <mdiep@whamcloud.com> Reviewed-on: https://review.whamcloud.com/35835 Signed-off-by: Andy Strohman <astroh@amazon.com>
paniakin-aws
pushed a commit
that referenced
this pull request
Aug 1, 2023
…oles on aarch64 offline_pages() properly checks for memory holes and bails out. However, we do a page_zone(pfn_to_page(start_pfn)) before calling offline_pages() when offlining a memory block. We should not unconditionally call page_zone(pfn_to_page(start_pfn)) on aarch64 in offlining code, otherwise we can trigger a BUG when hitting a memory hole: kernel BUG at include/linux/mm.h:1383! Internal error: Oops - BUG: 0 [#1] SMP Modules linked in: loop processor efivarfs ip_tables x_tables ext4 mbcache jbd2 dm_mod igb nvme i2c_algo_bit mlx5_core i2c_core nvme_core firmware_class CPU: 13 PID: 1694 Comm: ranbug Not tainted 5.12.0-next-20210524+ #4 Hardware name: MiTAC RAPTOR EV-883832-X3-0001/RAPTOR, BIOS 1.6 06/28/2020 pstate: 60000005 (nZCv daif -PAN -UAO -TCO BTYPE=--) pc : memory_subsys_offline+0x1f8/0x250 lr : memory_subsys_offline+0x1f8/0x250 Call trace: memory_subsys_offline+0x1f8/0x250 device_offline+0x154/0x1d8 online_store+0xa4/0x118 dev_attr_store+0x44/0x78 sysfs_kf_write+0xe8/0x138 kernfs_fop_write_iter+0x26c/0x3d0 new_sync_write+0x2bc/0x4f8 vfs_write+0x718/0xc88 ksys_write+0xf8/0x1e0 __arm64_sys_write+0x74/0xa8 invoke_syscall.constprop.0+0x78/0x1e8 do_el0_svc+0xe4/0x298 el0_svc+0x20/0x30 el0_sync_handler+0xb0/0xb8 el0_sync+0x178/0x180 Kernel panic - not syncing: Oops - BUG: Fatal exception SMP: stopping secondary CPUs Kernel Offset: disabled CPU features: 0x00000251,20000846 Memory Limit: none If nr_vmemmap_pages is set, we know that we are dealing with hotplugged memory that doesn't have any holes. So call page_zone(pfn_to_page(start_pfn)) only when really necessary -- when nr_vmemmap_pages is set and we actually adjust the present pages. Link: https://lkml.kernel.org/r/20210526075226.5572-1-david@redhat.com Fixes: a08a2ae ("mm,memory_hotplug: allocate memmap from the added memory range") Signed-off-by: David Hildenbrand <david@redhat.com> Reported-by: Qian Cai (QUIC) <quic_qiancai@quicinc.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Mike Rapoport <rppt@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> (cherry picked from commit 9281305)
sj-aws
pushed a commit
that referenced
this pull request
Aug 2, 2023
This patch came from the 2.10.5 branch of AmazonFSxLustreClient repo. The patch from that repo is: mdc: hold lock while walking changelog dev list (LU-12566) Prevent the following GPF which is causing stuck processes, when running mount and umount concurrently on the same host. general protection fault: 0000 [#1] SMP PTI ... RIP: 0010:mdc_changelog_cdev_finish+0x3f/0x1b9 [mdc] ... Call Trace: mdc_precleanup+0x2a/0x3c0 [mdc] Original patch was: LU-12566 mdc: hold lock while walking changelog dev list In mdc_changelog_cdev_finish() we need chlg_registered_dev_lock while walking and changing entries on the chlog_registered_devs and ced_obds lists in chlg_registered_dev_find_by_obd(). Move the calling of chlg_registered_dev_find_by_obd() under the mutex, and add assertions to the places where the lists are walked and changed that the mutex is held. Lustre-change: https://review.whamcloud.com/35668 Lustre-commit: a260c530801db7f58efa93b774f06b0ce72649a3 Fixes: 1d40214d96dd ("LU-7659 mdc: expose changelog through char devices") Signed-off-by: Andreas Dilger <adilger@whamcloud.com> Change-Id: Ib62fdff87cde6a4bcfb9bea24a2ea72a933ebbe5 Signed-off-by: Minh Diep <mdiep@whamcloud.com> Reviewed-on: https://review.whamcloud.com/35835 Signed-off-by: Andy Strohman <astroh@amazon.com>
sj-aws
pushed a commit
that referenced
this pull request
Aug 2, 2023
Steal time initialization requires mapping a memory region which invokes a memory allocation. Doing this at CPU starting time results in the following trace when CONFIG_DEBUG_ATOMIC_SLEEP is enabled: BUG: sleeping function called from invalid context at mm/slab.h:498 in_atomic(): 1, irqs_disabled(): 128, non_block: 0, pid: 0, name: swapper/1 CPU: 1 PID: 0 Comm: swapper/1 Not tainted 5.9.0-rc5+ #1 Call trace: dump_backtrace+0x0/0x208 show_stack+0x1c/0x28 dump_stack+0xc4/0x11c ___might_sleep+0xf8/0x130 __might_sleep+0x58/0x90 slab_pre_alloc_hook.constprop.101+0xd0/0x118 kmem_cache_alloc_node_trace+0x84/0x270 __get_vm_area_node+0x88/0x210 get_vm_area_caller+0x38/0x40 __ioremap_caller+0x70/0xf8 ioremap_cache+0x78/0xb0 memremap+0x9c/0x1a8 init_stolen_time_cpu+0x54/0xf0 cpuhp_invoke_callback+0xa8/0x720 notify_cpu_starting+0xc8/0xd8 secondary_start_kernel+0x114/0x180 CPU1: Booted secondary processor 0x0000000001 [0x431f0a11] However we don't need to initialize steal time at CPU starting time. We can simply wait until CPU online time, just sacrificing a bit of accuracy by returning zero for steal time until we know better. While at it, add __init to the functions that are only called by pv_time_init() which is __init. Signed-off-by: Andrew Jones <drjones@redhat.com> Fixes: e0685fa ("arm64: Retrieve stolen time as paravirtualized guest") Cc: stable@vger.kernel.org Reviewed-by: Steven Price <steven.price@arm.com> Link: https://lore.kernel.org/r/20200916154530.40809-1-drjones@redhat.com Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
sj-aws
pushed a commit
that referenced
this pull request
Aug 2, 2023
This patch came from the 2.10.5 branch of AmazonFSxLustreClient repo. The patch from that repo is: mdc: hold lock while walking changelog dev list (LU-12566) Prevent the following GPF which is causing stuck processes, when running mount and umount concurrently on the same host. general protection fault: 0000 [#1] SMP PTI ... RIP: 0010:mdc_changelog_cdev_finish+0x3f/0x1b9 [mdc] ... Call Trace: mdc_precleanup+0x2a/0x3c0 [mdc] Original patch was: LU-12566 mdc: hold lock while walking changelog dev list In mdc_changelog_cdev_finish() we need chlg_registered_dev_lock while walking and changing entries on the chlog_registered_devs and ced_obds lists in chlg_registered_dev_find_by_obd(). Move the calling of chlg_registered_dev_find_by_obd() under the mutex, and add assertions to the places where the lists are walked and changed that the mutex is held. Lustre-change: https://review.whamcloud.com/35668 Lustre-commit: a260c530801db7f58efa93b774f06b0ce72649a3 Fixes: 1d40214d96dd ("LU-7659 mdc: expose changelog through char devices") Signed-off-by: Andreas Dilger <adilger@whamcloud.com> Change-Id: Ib62fdff87cde6a4bcfb9bea24a2ea72a933ebbe5 Signed-off-by: Minh Diep <mdiep@whamcloud.com> Reviewed-on: https://review.whamcloud.com/35835 Signed-off-by: Andy Strohman <astroh@amazon.com>
sj-aws
pushed a commit
that referenced
this pull request
Aug 2, 2023
…oles on aarch64 offline_pages() properly checks for memory holes and bails out. However, we do a page_zone(pfn_to_page(start_pfn)) before calling offline_pages() when offlining a memory block. We should not unconditionally call page_zone(pfn_to_page(start_pfn)) on aarch64 in offlining code, otherwise we can trigger a BUG when hitting a memory hole: kernel BUG at include/linux/mm.h:1383! Internal error: Oops - BUG: 0 [#1] SMP Modules linked in: loop processor efivarfs ip_tables x_tables ext4 mbcache jbd2 dm_mod igb nvme i2c_algo_bit mlx5_core i2c_core nvme_core firmware_class CPU: 13 PID: 1694 Comm: ranbug Not tainted 5.12.0-next-20210524+ #4 Hardware name: MiTAC RAPTOR EV-883832-X3-0001/RAPTOR, BIOS 1.6 06/28/2020 pstate: 60000005 (nZCv daif -PAN -UAO -TCO BTYPE=--) pc : memory_subsys_offline+0x1f8/0x250 lr : memory_subsys_offline+0x1f8/0x250 Call trace: memory_subsys_offline+0x1f8/0x250 device_offline+0x154/0x1d8 online_store+0xa4/0x118 dev_attr_store+0x44/0x78 sysfs_kf_write+0xe8/0x138 kernfs_fop_write_iter+0x26c/0x3d0 new_sync_write+0x2bc/0x4f8 vfs_write+0x718/0xc88 ksys_write+0xf8/0x1e0 __arm64_sys_write+0x74/0xa8 invoke_syscall.constprop.0+0x78/0x1e8 do_el0_svc+0xe4/0x298 el0_svc+0x20/0x30 el0_sync_handler+0xb0/0xb8 el0_sync+0x178/0x180 Kernel panic - not syncing: Oops - BUG: Fatal exception SMP: stopping secondary CPUs Kernel Offset: disabled CPU features: 0x00000251,20000846 Memory Limit: none If nr_vmemmap_pages is set, we know that we are dealing with hotplugged memory that doesn't have any holes. So call page_zone(pfn_to_page(start_pfn)) only when really necessary -- when nr_vmemmap_pages is set and we actually adjust the present pages. Link: https://lkml.kernel.org/r/20210526075226.5572-1-david@redhat.com Fixes: a08a2ae ("mm,memory_hotplug: allocate memmap from the added memory range") Signed-off-by: David Hildenbrand <david@redhat.com> Reported-by: Qian Cai (QUIC) <quic_qiancai@quicinc.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Mike Rapoport <rppt@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> (cherry picked from commit 9281305)
sam-aws
pushed a commit
that referenced
this pull request
Aug 3, 2023
[ Upstream commit 8a4a0b2 ] If we disable quotas while we have a relocation of a metadata block group that has extents belonging to the quota root, we can cause the relocation to fail with -ENOENT. This is because relocation builds backref nodes for extents of the quota root and later needs to walk the backrefs and access the quota root - however if in between a task disables quotas, it results in deleting the quota root from the root tree (with btrfs_del_root(), called from btrfs_quota_disable(). This can be sporadically triggered by test case btrfs/255 from fstests: $ ./check btrfs/255 FSTYP -- btrfs PLATFORM -- Linux/x86_64 debian0 6.4.0-rc6-btrfs-next-134+ #1 SMP PREEMPT_DYNAMIC Thu Jun 15 11:59:28 WEST 2023 MKFS_OPTIONS -- /dev/sdc MOUNT_OPTIONS -- /dev/sdc /home/fdmanana/btrfs-tests/scratch_1 btrfs/255 6s ... _check_dmesg: something found in dmesg (see /home/fdmanana/git/hub/xfstests/results//btrfs/255.dmesg) - output mismatch (see /home/fdmanana/git/hub/xfstests/results//btrfs/255.out.bad) # --- tests/btrfs/255.out 2023-03-02 21:47:53.876609426 +0000 # +++ /home/fdmanana/git/hub/xfstests/results//btrfs/255.out.bad 2023-06-16 10:20:39.267563212 +0100 # @@ -1,2 +1,4 @@ # QA output created by 255 # +ERROR: error during balancing '/home/fdmanana/btrfs-tests/scratch_1': No such file or directory # +There may be more info in syslog - try dmesg | tail # Silence is golden # ... (Run 'diff -u /home/fdmanana/git/hub/xfstests/tests/btrfs/255.out /home/fdmanana/git/hub/xfstests/results//btrfs/255.out.bad' to see the entire diff) Ran: btrfs/255 Failures: btrfs/255 Failed 1 of 1 tests To fix this make the quota disable operation take the cleaner mutex, as relocation of a block group also takes this mutex. This is also what we do when deleting a subvolume/snapshot, we take the cleaner mutex in the cleaner kthread (at cleaner_kthread()) and then we call btrfs_del_root() at btrfs_drop_snapshot() while under the protection of the cleaner mutex. Fixes: bed92ea ("Btrfs: qgroup implementation and prototypes") CC: stable@vger.kernel.org # 5.4+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
sam-aws
pushed a commit
that referenced
this pull request
Aug 3, 2023
[ Upstream commit d4a7ce6 ] The Xeon validation group has been carrying out some loaded tests with various HW configurations, and they have seen some transmit queue time out happening during the test. This will cause the reset adapter function to be called by igc_tx_timeout(). Similar race conditions may arise when the interface is being brought down and up in igc_reinit_locked(), an interrupt being generated, and igc_clean_tx_irq() being called to complete the TX. When the igc_tx_timeout() function is invoked, this patch will turn off all TX ring HW queues during igc_down() process. TX ring HW queues will be activated again during the igc_configure_tx_ring() process when performing the igc_up() procedure later. This patch also moved existing igc_disable_tx_ring_hw() to avoid using forward declaration. Kernel trace: [ 7678.747813] ------------[ cut here ]------------ [ 7678.757914] NETDEV WATCHDOG: enp1s0 (igc): transmit queue 2 timed out [ 7678.770117] WARNING: CPU: 0 PID: 13 at net/sched/sch_generic.c:525 dev_watchdog+0x1ae/0x1f0 [ 7678.784459] Modules linked in: xt_conntrack nft_chain_nat xt_MASQUERADE xt_addrtype nft_compat nf_tables nfnetlink br_netfilter bridge stp llc overlay dm_mod emrcha(PO) emriio(PO) rktpm(PO) cegbuf_mod(PO) patch_update(PO) se(PO) sgx_tgts(PO) mktme(PO) keylocker(PO) svtdx(PO) svfs_pci_hotplug(PO) vtd_mod(PO) davemem(PO) svmabort(PO) svindexio(PO) usbx2(PO) ehci_sched(PO) svheartbeat(PO) ioapic(PO) sv8259(PO) svintr(PO) lt(PO) pcierootport(PO) enginefw_mod(PO) ata(PO) smbus(PO) spiflash_cdf(PO) arden(PO) dsa_iax(PO) oobmsm_punit(PO) cpm(PO) svkdb(PO) ebg_pch(PO) pch(PO) sviotargets(PO) svbdf(PO) svmem(PO) svbios(PO) dram(PO) svtsc(PO) targets(PO) superio(PO) svkernel(PO) cswitch(PO) mcf(PO) pentiumIII_mod(PO) fs_svfs(PO) mdevdefdb(PO) svfs_os_services(O) ixgbe mdio mdio_devres libphy emeraldrapids_svdefs(PO) regsupport(O) libnvdimm nls_cp437 snd_hda_codec_realtek snd_hda_codec_generic ledtrig_audio snd_hda_intel snd_intel_dspcfg snd_hda_codec snd_hwdep x86_pkg_temp_thermal snd_hda_core snd_pcm snd_timer isst_if_mbox_pci [ 7678.784496] input_leds isst_if_mmio sg snd isst_if_common soundcore wmi button sad9(O) drm fuse backlight configfs efivarfs ip_tables x_tables vmd sdhci led_class rtl8150 r8152 hid_generic pegasus mmc_block usbhid mmc_core hid megaraid_sas ixgb igb i2c_algo_bit ice i40e hpsa scsi_transport_sas e1000e e1000 e100 ax88179_178a usbnet xhci_pci sd_mod xhci_hcd t10_pi crc32c_intel crc64_rocksoft igc crc64 crc_t10dif usbcore crct10dif_generic ptp crct10dif_common usb_common pps_core [ 7679.200403] RIP: 0010:dev_watchdog+0x1ae/0x1f0 [ 7679.210201] Code: 28 e9 53 ff ff ff 4c 89 e7 c6 05 06 42 b9 00 01 e8 17 d1 fb ff 44 89 e9 4c 89 e6 48 c7 c7 40 ad fb 81 48 89 c2 e8 52 62 82 ff <0f> 0b e9 72 ff ff ff 65 8b 05 80 7d 7c 7e 89 c0 48 0f a3 05 0a c1 [ 7679.245438] RSP: 0018:ffa00000001f7d90 EFLAGS: 00010282 [ 7679.256021] RAX: 0000000000000000 RBX: ff11000109938440 RCX: 0000000000000000 [ 7679.268710] RDX: ff11000361e26cd8 RSI: ff11000361e1b880 RDI: ff11000361e1b880 [ 7679.281314] RBP: ffa00000001f7da8 R08: ff1100035f8fffe8 R09: 0000000000027ffb [ 7679.293840] R10: 0000000000001f0a R11: ff1100035f840000 R12: ff11000109938000 [ 7679.306276] R13: 0000000000000002 R14: dead000000000122 R15: ffa00000001f7e18 [ 7679.318648] FS: 0000000000000000(0000) GS:ff11000361e00000(0000) knlGS:0000000000000000 [ 7679.332064] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 7679.342757] CR2: 00007ffff7fca168 CR3: 000000013b08a006 CR4: 0000000000471ef8 [ 7679.354984] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 7679.367207] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400 [ 7679.379370] PKRU: 55555554 [ 7679.386446] Call Trace: [ 7679.393152] <TASK> [ 7679.399363] ? __pfx_dev_watchdog+0x10/0x10 [ 7679.407870] call_timer_fn+0x31/0x110 [ 7679.415698] expire_timers+0xb2/0x120 [ 7679.423403] run_timer_softirq+0x179/0x1e0 [ 7679.431532] ? __schedule+0x2b1/0x820 [ 7679.439078] __do_softirq+0xd1/0x295 [ 7679.446426] ? __pfx_smpboot_thread_fn+0x10/0x10 [ 7679.454867] run_ksoftirqd+0x22/0x30 [ 7679.462058] smpboot_thread_fn+0xb7/0x160 [ 7679.469670] kthread+0xcd/0xf0 [ 7679.476097] ? __pfx_kthread+0x10/0x10 [ 7679.483211] ret_from_fork+0x29/0x50 [ 7679.490047] </TASK> [ 7679.495204] ---[ end trace 0000000000000000 ]--- [ 7679.503179] igc 0000:01:00.0 enp1s0: Register Dump [ 7679.511230] igc 0000:01:00.0 enp1s0: Register Name Value [ 7679.519892] igc 0000:01:00.0 enp1s0: CTRL 181c0641 [ 7679.528782] igc 0000:01:00.0 enp1s0: STATUS 40280683 [ 7679.537551] igc 0000:01:00.0 enp1s0: CTRL_EXT 10000040 [ 7679.546284] igc 0000:01:00.0 enp1s0: MDIC 180a3800 [ 7679.554942] igc 0000:01:00.0 enp1s0: ICR 00000081 [ 7679.563503] igc 0000:01:00.0 enp1s0: RCTL 04408022 [ 7679.571963] igc 0000:01:00.0 enp1s0: RDLEN[0-3] 00001000 00001000 00001000 00001000 [ 7679.583075] igc 0000:01:00.0 enp1s0: RDH[0-3] 00000068 000000b6 0000000f 00000031 [ 7679.594162] igc 0000:01:00.0 enp1s0: RDT[0-3] 00000066 000000b2 0000000 00000030 [ 7679.605174] igc 0000:01:00.0 enp1s0: RXDCTL[0-3] 02040808 02040808 02040808 02040808 [ 7679.616196] igc 0000:01:00.0 enp1s0: RDBAL[0-3] 1bb7c000 1bb7f000 1bb82000 0ef33000 [ 7679.627242] igc 0000:01:00.0 enp1s0: RDBAH[0-3] 00000001 00000001 00000001 00000001 [ 7679.638256] igc 0000:01:00.0 enp1s0: TCTL a503f0fa [ 7679.646607] igc 0000:01:00.0 enp1s0: TDBAL[0-3] 2ba4a000 1bb6f000 1bb74000 1bb79000 [ 7679.657609] igc 0000:01:00.0 enp1s0: TDBAH[0-3] 00000001 00000001 00000001 00000001 [ 7679.668551] igc 0000:01:00.0 enp1s0: TDLEN[0-3] 00001000 00001000 00001000 00001000 [ 7679.679470] igc 0000:01:00.0 enp1s0: TDH[0-3] 000000a7 0000002d 000000bf 000000d9 [ 7679.690406] igc 0000:01:00.0 enp1s0: TDT[0-3] 000000a7 0000002d 000000bf 000000d9 [ 7679.701264] igc 0000:01:00.0 enp1s0: TXDCTL[0-3] 02100108 02100108 02100108 02100108 [ 7679.712123] igc 0000:01:00.0 enp1s0: Reset adapter [ 7683.085967] igc 0000:01:00.0 enp1s0: NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX [ 8086.945561] ------------[ cut here ]------------ Entering kdb (current=0xffffffff8220b200, pid 0) on processor 0 Oops: (null) due to oops @ 0xffffffff81573888 RIP: 0010:dql_completed+0x148/0x160 Code: c9 00 48 89 57 58 e9 46 ff ff ff 45 85 e4 41 0f 95 c4 41 39 db 0f 95 c1 41 84 cc 74 05 45 85 ed 78 0a 44 89 c1 e9 27 ff ff ff <0f> 0b 01 f6 44 89 c1 29 f1 0f 48 ca eb 8c cc cc cc cc cc cc cc cc RSP: 0018:ffa0000000003e00 EFLAGS: 00010287 RAX: 000000000000006c RBX: ffa0000003eb0f78 RCX: ff11000109938000 RDX: 0000000000000003 RSI: 0000000000000160 RDI: ff110001002e9480 RBP: ffa0000000003ed8 R08: ff110001002e93c0 R09: ffa0000000003d28 R10: 0000000000007cc0 R11: 0000000000007c54 R12: 00000000ffffffd9 R13: ff1100037039cb00 R14: 00000000ffffffd9 R15: ff1100037039c048 FS: 0000000000000000(0000) GS:ff11000361e00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007ffff7fca168 CR3: 000000013b08a003 CR4: 0000000000471ef8 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400 PKRU: 55555554 Call Trace: <IRQ> ? igc_poll+0x1a9/0x14d0 [igc] __napi_poll+0x2e/0x1b0 net_rx_action+0x126/0x250 __do_softirq+0xd1/0x295 irq_exit_rcu+0xc5/0xf0 common_interrupt+0x86/0xa0 </IRQ> <TASK> asm_common_interrupt+0x27/0x40 RIP: 0010:cpuidle_enter_state+0xd3/0x3e0 Code: 73 f1 ff ff 49 89 c6 8b 05 e2 ca a7 00 85 c0 0f 8f b3 02 00 00 31 ff e8 1b de 75 ff 80 7d d7 00 0f 85 cd 01 00 00 fb 45 85 ff <0f> 88 fd 00 00 00 49 63 cf 4c 2b 75 c8 48 8d 04 49 48 89 ca 48 8d RSP: 0018:ffffffff82203df0 EFLAGS: 00000202 RAX: ff11000361e2a200 RBX: 0000000000000002 RCX: 000000000000001f RDX: 0000000000000000 RSI: 000000003cf3cf3d RDI: 0000000000000000 RBP: ffffffff82203e28 R08: 0000075ae38471c8 R09: 0000000000000018 R10: 000000000000031a R11: ffffffff8238dca0 R12: ffd1ffffff200000 R13: ffffffff8238dca0 R14: 0000075ae38471c8 R15: 0000000000000002 cpuidle_enter+0x2e/0x50 call_cpuidle+0x23/0x40 do_idle+0x1be/0x220 cpu_startup_entry+0x20/0x30 rest_init+0xb5/0xc0 arch_call_rest_init+0xe/0x30 start_kernel+0x448/0x760 x86_64_start_kernel+0x109/0x150 secondary_startup_64_no_verify+0xe0/0xeb </TASK> more> [0]kdb> [0]kdb> [0]kdb> go Catastrophic error detected kdb_continue_catastrophic=0, type go a second time if you really want to continue [0]kdb> go Catastrophic error detected kdb_continue_catastrophic=0, attempting to continue [ 8086.955689] refcount_t: underflow; use-after-free. [ 8086.955697] WARNING: CPU: 0 PID: 0 at lib/refcount.c:28 refcount_warn_saturate+0xc2/0x110 [ 8086.955706] Modules linked in: xt_conntrack nft_chain_nat xt_MASQUERADE xt_addrtype nft_compat nf_tables nfnetlink br_netfilter bridge stp llc overlay dm_mod emrcha(PO) emriio(PO) rktpm(PO) cegbuf_mod(PO) patch_update(PO) se(PO) sgx_tgts(PO) mktme(PO) keylocker(PO) svtdx(PO) svfs_pci_hotplug(PO) vtd_mod(PO) davemem(PO) svmabort(PO) svindexio(PO) usbx2(PO) ehci_sched(PO) svheartbeat(PO) ioapic(PO) sv8259(PO) svintr(PO) lt(PO) pcierootport(PO) enginefw_mod(PO) ata(PO) smbus(PO) spiflash_cdf(PO) arden(PO) dsa_iax(PO) oobmsm_punit(PO) cpm(PO) svkdb(PO) ebg_pch(PO) pch(PO) sviotargets(PO) svbdf(PO) svmem(PO) svbios(PO) dram(PO) svtsc(PO) targets(PO) superio(PO) svkernel(PO) cswitch(PO) mcf(PO) pentiumIII_mod(PO) fs_svfs(PO) mdevdefdb(PO) svfs_os_services(O) ixgbe mdio mdio_devres libphy emeraldrapids_svdefs(PO) regsupport(O) libnvdimm nls_cp437 snd_hda_codec_realtek snd_hda_codec_generic ledtrig_audio snd_hda_intel snd_intel_dspcfg snd_hda_codec snd_hwdep x86_pkg_temp_thermal snd_hda_core snd_pcm snd_timer isst_if_mbox_pci [ 8086.955751] input_leds isst_if_mmio sg snd isst_if_common soundcore wmi button sad9(O) drm fuse backlight configfs efivarfs ip_tables x_tables vmd sdhci led_class rtl8150 r8152 hid_generic pegasus mmc_block usbhid mmc_core hid megaraid_sas ixgb igb i2c_algo_bit ice i40e hpsa scsi_transport_sas e1000e e1000 e100 ax88179_178a usbnet xhci_pci sd_mod xhci_hcd t10_pi crc32c_intel crc64_rocksoft igc crc64 crc_t10dif usbcore crct10dif_generic ptp crct10dif_common usb_common pps_core [ 8086.955784] RIP: 0010:refcount_warn_saturate+0xc2/0x110 [ 8086.955788] Code: 01 e8 82 e7 b4 ff 0f 0b 5d c3 cc cc cc cc 80 3d 68 c6 eb 00 00 75 81 48 c7 c7 a0 87 f6 81 c6 05 58 c6 eb 00 01 e8 5e e7 b4 ff <0f> 0b 5d c3 cc cc cc cc 80 3d 42 c6 eb 00 00 0f 85 59 ff ff ff 48 [ 8086.955790] RSP: 0018:ffa0000000003da0 EFLAGS: 00010286 [ 8086.955793] RAX: 0000000000000000 RBX: ff1100011da40ee0 RCX: ff11000361e1b888 [ 8086.955794] RDX: 00000000ffffffd8 RSI: 0000000000000027 RDI: ff11000361e1b880 [ 8086.955795] RBP: ffa0000000003da0 R08: 80000000ffff9f45 R09: ffa0000000003d28 [ 8086.955796] R10: ff1100035f840000 R11: 0000000000000028 R12: ff11000319ff8000 [ 8086.955797] R13: ff1100011bb79d60 R14: 00000000ffffffd6 R15: ff1100037039cb00 [ 8086.955798] FS: 0000000000000000(0000) GS:ff11000361e00000(0000) knlGS:0000000000000000 [ 8086.955800] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 8086.955801] CR2: 00007ffff7fca168 CR3: 000000013b08a003 CR4: 0000000000471ef8 [ 8086.955803] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 8086.955803] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400 [ 8086.955804] PKRU: 55555554 [ 8086.955805] Call Trace: [ 8086.955806] <IRQ> [ 8086.955808] tcp_wfree+0x112/0x130 [ 8086.955814] skb_release_head_state+0x24/0xa0 [ 8086.955818] napi_consume_skb+0x9c/0x160 [ 8086.955821] igc_poll+0x5d8/0x14d0 [igc] [ 8086.955835] __napi_poll+0x2e/0x1b0 [ 8086.955839] net_rx_action+0x126/0x250 [ 8086.955843] __do_softirq+0xd1/0x295 [ 8086.955846] irq_exit_rcu+0xc5/0xf0 [ 8086.955851] common_interrupt+0x86/0xa0 [ 8086.955857] </IRQ> [ 8086.955857] <TASK> [ 8086.955858] asm_common_interrupt+0x27/0x40 [ 8086.955862] RIP: 0010:cpuidle_enter_state+0xd3/0x3e0 [ 8086.955866] Code: 73 f1 ff ff 49 89 c6 8b 05 e2 ca a7 00 85 c0 0f 8f b3 02 00 00 31 ff e8 1b de 75 ff 80 7d d7 00 0f 85 cd 01 00 00 fb 45 85 ff <0f> 88 fd 00 00 00 49 63 cf 4c 2b 75 c8 48 8d 04 49 48 89 ca 48 8d [ 8086.955867] RSP: 0018:ffffffff82203df0 EFLAGS: 00000202 [ 8086.955869] RAX: ff11000361e2a200 RBX: 0000000000000002 RCX: 000000000000001f [ 8086.955870] RDX: 0000000000000000 RSI: 000000003cf3cf3d RDI: 0000000000000000 [ 8086.955871] RBP: ffffffff82203e28 R08: 0000075ae38471c8 R09: 0000000000000018 [ 8086.955872] R10: 000000000000031a R11: ffffffff8238dca0 R12: ffd1ffffff200000 [ 8086.955873] R13: ffffffff8238dca0 R14: 0000075ae38471c8 R15: 0000000000000002 [ 8086.955875] cpuidle_enter+0x2e/0x50 [ 8086.955880] call_cpuidle+0x23/0x40 [ 8086.955884] do_idle+0x1be/0x220 [ 8086.955887] cpu_startup_entry+0x20/0x30 [ 8086.955889] rest_init+0xb5/0xc0 [ 8086.955892] arch_call_rest_init+0xe/0x30 [ 8086.955895] start_kernel+0x448/0x760 [ 8086.955898] x86_64_start_kernel+0x109/0x150 [ 8086.955900] secondary_startup_64_no_verify+0xe0/0xeb [ 8086.955904] </TASK> [ 8086.955904] ---[ end trace 0000000000000000 ]--- [ 8086.955912] ------------[ cut here ]------------ [ 8086.955913] kernel BUG at lib/dynamic_queue_limits.c:27! [ 8086.955918] invalid opcode: 0000 [#1] SMP [ 8086.955922] RIP: 0010:dql_completed+0x148/0x160 [ 8086.955925] Code: c9 00 48 89 57 58 e9 46 ff ff ff 45 85 e4 41 0f 95 c4 41 39 db 0f 95 c1 41 84 cc 74 05 45 85 ed 78 0a 44 89 c1 e9 27 ff ff ff <0f> 0b 01 f6 44 89 c1 29 f1 0f 48 ca eb 8c cc cc cc cc cc cc cc cc [ 8086.955927] RSP: 0018:ffa0000000003e00 EFLAGS: 00010287 [ 8086.955928] RAX: 000000000000006c RBX: ffa0000003eb0f78 RCX: ff11000109938000 [ 8086.955929] RDX: 0000000000000003 RSI: 0000000000000160 RDI: ff110001002e9480 [ 8086.955930] RBP: ffa0000000003ed8 R08: ff110001002e93c0 R09: ffa0000000003d28 [ 8086.955931] R10: 0000000000007cc0 R11: 0000000000007c54 R12: 00000000ffffffd9 [ 8086.955932] R13: ff1100037039cb00 R14: 00000000ffffffd9 R15: ff1100037039c048 [ 8086.955933] FS: 0000000000000000(0000) GS:ff11000361e00000(0000) knlGS:0000000000000000 [ 8086.955934] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 8086.955935] CR2: 00007ffff7fca168 CR3: 000000013b08a003 CR4: 0000000000471ef8 [ 8086.955936] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 8086.955937] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400 [ 8086.955938] PKRU: 55555554 [ 8086.955939] Call Trace: [ 8086.955939] <IRQ> [ 8086.955940] ? igc_poll+0x1a9/0x14d0 [igc] [ 8086.955949] __napi_poll+0x2e/0x1b0 [ 8086.955952] net_rx_action+0x126/0x250 [ 8086.955956] __do_softirq+0xd1/0x295 [ 8086.955958] irq_exit_rcu+0xc5/0xf0 [ 8086.955961] common_interrupt+0x86/0xa0 [ 8086.955964] </IRQ> [ 8086.955965] <TASK> [ 8086.955965] asm_common_interrupt+0x27/0x40 [ 8086.955968] RIP: 0010:cpuidle_enter_state+0xd3/0x3e0 [ 8086.955971] Code: 73 f1 ff ff 49 89 c6 8b 05 e2 ca a7 00 85 c0 0f 8f b3 02 00 00 31 ff e8 1b de 75 ff 80 7d d7 00 0f 85 cd 01 00 00 fb 45 85 ff <0f> 88 fd 00 00 00 49 63 cf 4c 2b 75 c8 48 8d 04 49 48 89 ca 48 8d [ 8086.955972] RSP: 0018:ffffffff82203df0 EFLAGS: 00000202 [ 8086.955973] RAX: ff11000361e2a200 RBX: 0000000000000002 RCX: 000000000000001f [ 8086.955974] RDX: 0000000000000000 RSI: 000000003cf3cf3d RDI: 0000000000000000 [ 8086.955974] RBP: ffffffff82203e28 R08: 0000075ae38471c8 R09: 0000000000000018 [ 8086.955975] R10: 000000000000031a R11: ffffffff8238dca0 R12: ffd1ffffff200000 [ 8086.955976] R13: ffffffff8238dca0 R14: 0000075ae38471c8 R15: 0000000000000002 [ 8086.955978] cpuidle_enter+0x2e/0x50 [ 8086.955981] call_cpuidle+0x23/0x40 [ 8086.955984] do_idle+0x1be/0x220 [ 8086.955985] cpu_startup_entry+0x20/0x30 [ 8086.955987] rest_init+0xb5/0xc0 [ 8086.955990] arch_call_rest_init+0xe/0x30 [ 8086.955992] start_kernel+0x448/0x760 [ 8086.955994] x86_64_start_kernel+0x109/0x150 [ 8086.955996] secondary_startup_64_no_verify+0xe0/0xeb [ 8086.955998] </TASK> [ 8086.955999] Modules linked in: xt_conntrack nft_chain_nat xt_MASQUERADE xt_addrtype nft_compat nf_tables nfnetlink br_netfilter bridge stp llc overlay dm_mod emrcha(PO) emriio(PO) rktpm(PO) cegbuf_mod(PO) patch_update(PO) se(PO) sgx_tgts(PO) mktme(PO) keylocker(PO) svtdx(PO) svfs_pci_hotplug(PO) vtd_mod(PO) davemem(PO) svmabort(PO) svindexio(PO) usbx2(PO) ehci_sched(PO) svheartbeat(PO) ioapic(PO) sv8259(PO) svintr(PO) lt(PO) pcierootport(PO) enginefw_mod(PO) ata(PO) smbus(PO) spiflash_cdf(PO) arden(PO) dsa_iax(PO) oobmsm_punit(PO) cpm(PO) svkdb(PO) ebg_pch(PO) pch(PO) sviotargets(PO) svbdf(PO) svmem(PO) svbios(PO) dram(PO) svtsc(PO) targets(PO) superio(PO) svkernel(PO) cswitch(PO) mcf(PO) pentiumIII_mod(PO) fs_svfs(PO) mdevdefdb(PO) svfs_os_services(O) ixgbe mdio mdio_devres libphy emeraldrapids_svdefs(PO) regsupport(O) libnvdimm nls_cp437 snd_hda_codec_realtek snd_hda_codec_generic ledtrig_audio snd_hda_intel snd_intel_dspcfg snd_hda_codec snd_hwdep x86_pkg_temp_thermal snd_hda_core snd_pcm snd_timer isst_if_mbox_pci [ 8086.956029] input_leds isst_if_mmio sg snd isst_if_common soundcore wmi button sad9(O) drm fuse backlight configfs efivarfs ip_tables x_tables vmd sdhci led_class rtl8150 r8152 hid_generic pegasus mmc_block usbhid mmc_core hid megaraid_sas ixgb igb i2c_algo_bit ice i40e hpsa scsi_transport_sas e1000e e1000 e100 ax88179_178a usbnet xhci_pci sd_mod xhci_hcd t10_pi crc32c_intel crc64_rocksoft igc crc64 crc_t10dif usbcore crct10dif_generic ptp crct10dif_common usb_common pps_core [16762.543675] INFO: NMI handler (kgdb_nmi_handler) took too long to run: 8675587.593 msecs [16762.543678] INFO: NMI handler (kgdb_nmi_handler) took too long to run: 8675587.595 msecs [16762.543673] INFO: NMI handler (kgdb_nmi_handler) took too long to run: 8675587.495 msecs [16762.543679] INFO: NMI handler (kgdb_nmi_handler) took too long to run: 8675587.599 msecs [16762.543678] INFO: NMI handler (kgdb_nmi_handler) took too long to run: 8675587.598 msecs [16762.543690] INFO: NMI handler (kgdb_nmi_handler) took too long to run: 8675587.605 msecs [16762.543684] INFO: NMI handler (kgdb_nmi_handler) took too long to run: 8675587.599 msecs [16762.543693] INFO: NMI handler (kgdb_nmi_handler) took too long to run: 8675587.613 msecs [16762.543784] ---[ end trace 0000000000000000 ]--- [16762.849099] RIP: 0010:dql_completed+0x148/0x160 PANIC: Fatal exception in interrupt Fixes: 9b27517 ("igc: Add ndo_tx_timeout support") Tested-by: Alejandra Victoria Alcaraz <alejandra.victoria.alcaraz@intel.com> Signed-off-by: Muhammad Husaini Zulkifli <muhammad.husaini.zulkifli@intel.com> Acked-by: Sasha Neftin <sasha.neftin@intel.com> Tested-by: Naama Meir <naamax.meir@linux.intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
sam-aws
pushed a commit
that referenced
this pull request
Aug 3, 2023
[ Upstream commit 0a771f7 ] On error when building the rule, the immediate expression unbinds the chain, hence objects can be deactivated by the transaction records. Otherwise, it is possible to trigger the following warning: WARNING: CPU: 3 PID: 915 at net/netfilter/nf_tables_api.c:2013 nf_tables_chain_destroy+0x1f7/0x210 [nf_tables] CPU: 3 PID: 915 Comm: chain-bind-err- Not tainted 6.1.39 #1 RIP: 0010:nf_tables_chain_destroy+0x1f7/0x210 [nf_tables] Fixes: 4bedf9e ("netfilter: nf_tables: fix chain binding transaction logic") Reported-by: Kevin Rich <kevinrich1337@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Sasha Levin <sashal@kernel.org>
sam-aws
pushed a commit
that referenced
this pull request
Aug 3, 2023
…RULE_CHAIN_ID [ Upstream commit 0ebc106 ] Bail out with EOPNOTSUPP when adding rule to bound chain via NFTA_RULE_CHAIN_ID. The following warning splat is shown when adding a rule to a deleted bound chain: WARNING: CPU: 2 PID: 13692 at net/netfilter/nf_tables_api.c:2013 nf_tables_chain_destroy+0x1f7/0x210 [nf_tables] CPU: 2 PID: 13692 Comm: chain-bound-rul Not tainted 6.1.39 #1 RIP: 0010:nf_tables_chain_destroy+0x1f7/0x210 [nf_tables] Fixes: d0e2c7d ("netfilter: nf_tables: add NFT_CHAIN_BINDING") Reported-by: Kevin Rich <kevinrich1337@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Sasha Levin <sashal@kernel.org>
sam-aws
pushed a commit
that referenced
this pull request
Aug 3, 2023
commit 6609141 upstream. The 'commit 52e4fdf09ebc ("drm/amd/display: use low clocks for no plane configs")' introduced a change that set low clock values for DCN31 and DCN32. As a result of these changes, DC started to spam the log with the following warning: ------------[ cut here ]------------ WARNING: CPU: 8 PID: 1486 at drivers/gpu/drm/amd/amdgpu/../display/dc/dcn20/dcn20_dccg.c:58 dccg2_update_dpp_dto+0x3f/0xf0 [amdgpu] [..] CPU: 8 PID: 1486 Comm: kms_atomic Tainted: G W 5.18.0+ #1 RIP: 0010:dccg2_update_dpp_dto+0x3f/0xf0 [amdgpu] RSP: 0018:ffffbbd8025334d0 EFLAGS: 00010206 RAX: 00000000000001ee RBX: ffffa02c87dd3de0 RCX: 00000000000a7f80 RDX: 000000000007dec3 RSI: 0000000000000000 RDI: ffffa02c87dd3de0 RBP: ffffbbd8025334e8 R08: 0000000000000001 R09: 0000000000000005 R10: 00000000000331a0 R11: ffffffffc0b03d80 R12: ffffa02ca576d000 R13: ffffa02cd02c0000 R14: 00000000001453bc R15: ffffa02cdc280000 [..] dcn20_update_clocks_update_dpp_dto+0x4e/0xa0 [amdgpu] dcn32_update_clocks+0x5d9/0x650 [amdgpu] dcn20_prepare_bandwidth+0x49/0x100 [amdgpu] dcn30_prepare_bandwidth+0x63/0x80 [amdgpu] dc_commit_state_no_check+0x39d/0x13e0 [amdgpu] dc_commit_streams+0x1f9/0x3b0 [amdgpu] dc_commit_state+0x37/0x120 [amdgpu] amdgpu_dm_atomic_commit_tail+0x5e5/0x2520 [amdgpu] ? _raw_spin_unlock_irqrestore+0x1f/0x40 ? down_trylock+0x2c/0x40 ? vprintk_emit+0x186/0x2c0 ? vprintk_default+0x1d/0x20 ? vprintk+0x4e/0x60 We can easily trigger this issue by using a 4k@120 or a 2k@165 and running some of the kms_atomic tests. This warning is triggered because the per-pipe clock update is not happening; this commit fixes this issue by ensuring that DPPCLK is updated when calculating the watermark and dlg is invoked. Fixes: 2641c7b ("drm/amd/display: use low clocks for no plane configs") Reported-by: Mark Broadworth <mark.broadworth@amd.com> Reviewed-by: Rodrigo Siqueira <Rodrigo.Siqueira@amd.com> Signed-off-by: Dmytro Laktyushkin <Dmytro.Laktyushkin@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
sam-aws
pushed a commit
that referenced
this pull request
Aug 3, 2023
[ Upstream commit 8a4a0b2 ] If we disable quotas while we have a relocation of a metadata block group that has extents belonging to the quota root, we can cause the relocation to fail with -ENOENT. This is because relocation builds backref nodes for extents of the quota root and later needs to walk the backrefs and access the quota root - however if in between a task disables quotas, it results in deleting the quota root from the root tree (with btrfs_del_root(), called from btrfs_quota_disable(). This can be sporadically triggered by test case btrfs/255 from fstests: $ ./check btrfs/255 FSTYP -- btrfs PLATFORM -- Linux/x86_64 debian0 6.4.0-rc6-btrfs-next-134+ #1 SMP PREEMPT_DYNAMIC Thu Jun 15 11:59:28 WEST 2023 MKFS_OPTIONS -- /dev/sdc MOUNT_OPTIONS -- /dev/sdc /home/fdmanana/btrfs-tests/scratch_1 btrfs/255 6s ... _check_dmesg: something found in dmesg (see /home/fdmanana/git/hub/xfstests/results//btrfs/255.dmesg) - output mismatch (see /home/fdmanana/git/hub/xfstests/results//btrfs/255.out.bad) # --- tests/btrfs/255.out 2023-03-02 21:47:53.876609426 +0000 # +++ /home/fdmanana/git/hub/xfstests/results//btrfs/255.out.bad 2023-06-16 10:20:39.267563212 +0100 # @@ -1,2 +1,4 @@ # QA output created by 255 # +ERROR: error during balancing '/home/fdmanana/btrfs-tests/scratch_1': No such file or directory # +There may be more info in syslog - try dmesg | tail # Silence is golden # ... (Run 'diff -u /home/fdmanana/git/hub/xfstests/tests/btrfs/255.out /home/fdmanana/git/hub/xfstests/results//btrfs/255.out.bad' to see the entire diff) Ran: btrfs/255 Failures: btrfs/255 Failed 1 of 1 tests To fix this make the quota disable operation take the cleaner mutex, as relocation of a block group also takes this mutex. This is also what we do when deleting a subvolume/snapshot, we take the cleaner mutex in the cleaner kthread (at cleaner_kthread()) and then we call btrfs_del_root() at btrfs_drop_snapshot() while under the protection of the cleaner mutex. Fixes: bed92ea ("Btrfs: qgroup implementation and prototypes") CC: stable@vger.kernel.org # 5.4+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
sam-aws
pushed a commit
that referenced
this pull request
Aug 3, 2023
[ Upstream commit d4a7ce6 ] The Xeon validation group has been carrying out some loaded tests with various HW configurations, and they have seen some transmit queue time out happening during the test. This will cause the reset adapter function to be called by igc_tx_timeout(). Similar race conditions may arise when the interface is being brought down and up in igc_reinit_locked(), an interrupt being generated, and igc_clean_tx_irq() being called to complete the TX. When the igc_tx_timeout() function is invoked, this patch will turn off all TX ring HW queues during igc_down() process. TX ring HW queues will be activated again during the igc_configure_tx_ring() process when performing the igc_up() procedure later. This patch also moved existing igc_disable_tx_ring_hw() to avoid using forward declaration. Kernel trace: [ 7678.747813] ------------[ cut here ]------------ [ 7678.757914] NETDEV WATCHDOG: enp1s0 (igc): transmit queue 2 timed out [ 7678.770117] WARNING: CPU: 0 PID: 13 at net/sched/sch_generic.c:525 dev_watchdog+0x1ae/0x1f0 [ 7678.784459] Modules linked in: xt_conntrack nft_chain_nat xt_MASQUERADE xt_addrtype nft_compat nf_tables nfnetlink br_netfilter bridge stp llc overlay dm_mod emrcha(PO) emriio(PO) rktpm(PO) cegbuf_mod(PO) patch_update(PO) se(PO) sgx_tgts(PO) mktme(PO) keylocker(PO) svtdx(PO) svfs_pci_hotplug(PO) vtd_mod(PO) davemem(PO) svmabort(PO) svindexio(PO) usbx2(PO) ehci_sched(PO) svheartbeat(PO) ioapic(PO) sv8259(PO) svintr(PO) lt(PO) pcierootport(PO) enginefw_mod(PO) ata(PO) smbus(PO) spiflash_cdf(PO) arden(PO) dsa_iax(PO) oobmsm_punit(PO) cpm(PO) svkdb(PO) ebg_pch(PO) pch(PO) sviotargets(PO) svbdf(PO) svmem(PO) svbios(PO) dram(PO) svtsc(PO) targets(PO) superio(PO) svkernel(PO) cswitch(PO) mcf(PO) pentiumIII_mod(PO) fs_svfs(PO) mdevdefdb(PO) svfs_os_services(O) ixgbe mdio mdio_devres libphy emeraldrapids_svdefs(PO) regsupport(O) libnvdimm nls_cp437 snd_hda_codec_realtek snd_hda_codec_generic ledtrig_audio snd_hda_intel snd_intel_dspcfg snd_hda_codec snd_hwdep x86_pkg_temp_thermal snd_hda_core snd_pcm snd_timer isst_if_mbox_pci [ 7678.784496] input_leds isst_if_mmio sg snd isst_if_common soundcore wmi button sad9(O) drm fuse backlight configfs efivarfs ip_tables x_tables vmd sdhci led_class rtl8150 r8152 hid_generic pegasus mmc_block usbhid mmc_core hid megaraid_sas ixgb igb i2c_algo_bit ice i40e hpsa scsi_transport_sas e1000e e1000 e100 ax88179_178a usbnet xhci_pci sd_mod xhci_hcd t10_pi crc32c_intel crc64_rocksoft igc crc64 crc_t10dif usbcore crct10dif_generic ptp crct10dif_common usb_common pps_core [ 7679.200403] RIP: 0010:dev_watchdog+0x1ae/0x1f0 [ 7679.210201] Code: 28 e9 53 ff ff ff 4c 89 e7 c6 05 06 42 b9 00 01 e8 17 d1 fb ff 44 89 e9 4c 89 e6 48 c7 c7 40 ad fb 81 48 89 c2 e8 52 62 82 ff <0f> 0b e9 72 ff ff ff 65 8b 05 80 7d 7c 7e 89 c0 48 0f a3 05 0a c1 [ 7679.245438] RSP: 0018:ffa00000001f7d90 EFLAGS: 00010282 [ 7679.256021] RAX: 0000000000000000 RBX: ff11000109938440 RCX: 0000000000000000 [ 7679.268710] RDX: ff11000361e26cd8 RSI: ff11000361e1b880 RDI: ff11000361e1b880 [ 7679.281314] RBP: ffa00000001f7da8 R08: ff1100035f8fffe8 R09: 0000000000027ffb [ 7679.293840] R10: 0000000000001f0a R11: ff1100035f840000 R12: ff11000109938000 [ 7679.306276] R13: 0000000000000002 R14: dead000000000122 R15: ffa00000001f7e18 [ 7679.318648] FS: 0000000000000000(0000) GS:ff11000361e00000(0000) knlGS:0000000000000000 [ 7679.332064] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 7679.342757] CR2: 00007ffff7fca168 CR3: 000000013b08a006 CR4: 0000000000471ef8 [ 7679.354984] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 7679.367207] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400 [ 7679.379370] PKRU: 55555554 [ 7679.386446] Call Trace: [ 7679.393152] <TASK> [ 7679.399363] ? __pfx_dev_watchdog+0x10/0x10 [ 7679.407870] call_timer_fn+0x31/0x110 [ 7679.415698] expire_timers+0xb2/0x120 [ 7679.423403] run_timer_softirq+0x179/0x1e0 [ 7679.431532] ? __schedule+0x2b1/0x820 [ 7679.439078] __do_softirq+0xd1/0x295 [ 7679.446426] ? __pfx_smpboot_thread_fn+0x10/0x10 [ 7679.454867] run_ksoftirqd+0x22/0x30 [ 7679.462058] smpboot_thread_fn+0xb7/0x160 [ 7679.469670] kthread+0xcd/0xf0 [ 7679.476097] ? __pfx_kthread+0x10/0x10 [ 7679.483211] ret_from_fork+0x29/0x50 [ 7679.490047] </TASK> [ 7679.495204] ---[ end trace 0000000000000000 ]--- [ 7679.503179] igc 0000:01:00.0 enp1s0: Register Dump [ 7679.511230] igc 0000:01:00.0 enp1s0: Register Name Value [ 7679.519892] igc 0000:01:00.0 enp1s0: CTRL 181c0641 [ 7679.528782] igc 0000:01:00.0 enp1s0: STATUS 40280683 [ 7679.537551] igc 0000:01:00.0 enp1s0: CTRL_EXT 10000040 [ 7679.546284] igc 0000:01:00.0 enp1s0: MDIC 180a3800 [ 7679.554942] igc 0000:01:00.0 enp1s0: ICR 00000081 [ 7679.563503] igc 0000:01:00.0 enp1s0: RCTL 04408022 [ 7679.571963] igc 0000:01:00.0 enp1s0: RDLEN[0-3] 00001000 00001000 00001000 00001000 [ 7679.583075] igc 0000:01:00.0 enp1s0: RDH[0-3] 00000068 000000b6 0000000f 00000031 [ 7679.594162] igc 0000:01:00.0 enp1s0: RDT[0-3] 00000066 000000b2 0000000 00000030 [ 7679.605174] igc 0000:01:00.0 enp1s0: RXDCTL[0-3] 02040808 02040808 02040808 02040808 [ 7679.616196] igc 0000:01:00.0 enp1s0: RDBAL[0-3] 1bb7c000 1bb7f000 1bb82000 0ef33000 [ 7679.627242] igc 0000:01:00.0 enp1s0: RDBAH[0-3] 00000001 00000001 00000001 00000001 [ 7679.638256] igc 0000:01:00.0 enp1s0: TCTL a503f0fa [ 7679.646607] igc 0000:01:00.0 enp1s0: TDBAL[0-3] 2ba4a000 1bb6f000 1bb74000 1bb79000 [ 7679.657609] igc 0000:01:00.0 enp1s0: TDBAH[0-3] 00000001 00000001 00000001 00000001 [ 7679.668551] igc 0000:01:00.0 enp1s0: TDLEN[0-3] 00001000 00001000 00001000 00001000 [ 7679.679470] igc 0000:01:00.0 enp1s0: TDH[0-3] 000000a7 0000002d 000000bf 000000d9 [ 7679.690406] igc 0000:01:00.0 enp1s0: TDT[0-3] 000000a7 0000002d 000000bf 000000d9 [ 7679.701264] igc 0000:01:00.0 enp1s0: TXDCTL[0-3] 02100108 02100108 02100108 02100108 [ 7679.712123] igc 0000:01:00.0 enp1s0: Reset adapter [ 7683.085967] igc 0000:01:00.0 enp1s0: NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX [ 8086.945561] ------------[ cut here ]------------ Entering kdb (current=0xffffffff8220b200, pid 0) on processor 0 Oops: (null) due to oops @ 0xffffffff81573888 RIP: 0010:dql_completed+0x148/0x160 Code: c9 00 48 89 57 58 e9 46 ff ff ff 45 85 e4 41 0f 95 c4 41 39 db 0f 95 c1 41 84 cc 74 05 45 85 ed 78 0a 44 89 c1 e9 27 ff ff ff <0f> 0b 01 f6 44 89 c1 29 f1 0f 48 ca eb 8c cc cc cc cc cc cc cc cc RSP: 0018:ffa0000000003e00 EFLAGS: 00010287 RAX: 000000000000006c RBX: ffa0000003eb0f78 RCX: ff11000109938000 RDX: 0000000000000003 RSI: 0000000000000160 RDI: ff110001002e9480 RBP: ffa0000000003ed8 R08: ff110001002e93c0 R09: ffa0000000003d28 R10: 0000000000007cc0 R11: 0000000000007c54 R12: 00000000ffffffd9 R13: ff1100037039cb00 R14: 00000000ffffffd9 R15: ff1100037039c048 FS: 0000000000000000(0000) GS:ff11000361e00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007ffff7fca168 CR3: 000000013b08a003 CR4: 0000000000471ef8 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400 PKRU: 55555554 Call Trace: <IRQ> ? igc_poll+0x1a9/0x14d0 [igc] __napi_poll+0x2e/0x1b0 net_rx_action+0x126/0x250 __do_softirq+0xd1/0x295 irq_exit_rcu+0xc5/0xf0 common_interrupt+0x86/0xa0 </IRQ> <TASK> asm_common_interrupt+0x27/0x40 RIP: 0010:cpuidle_enter_state+0xd3/0x3e0 Code: 73 f1 ff ff 49 89 c6 8b 05 e2 ca a7 00 85 c0 0f 8f b3 02 00 00 31 ff e8 1b de 75 ff 80 7d d7 00 0f 85 cd 01 00 00 fb 45 85 ff <0f> 88 fd 00 00 00 49 63 cf 4c 2b 75 c8 48 8d 04 49 48 89 ca 48 8d RSP: 0018:ffffffff82203df0 EFLAGS: 00000202 RAX: ff11000361e2a200 RBX: 0000000000000002 RCX: 000000000000001f RDX: 0000000000000000 RSI: 000000003cf3cf3d RDI: 0000000000000000 RBP: ffffffff82203e28 R08: 0000075ae38471c8 R09: 0000000000000018 R10: 000000000000031a R11: ffffffff8238dca0 R12: ffd1ffffff200000 R13: ffffffff8238dca0 R14: 0000075ae38471c8 R15: 0000000000000002 cpuidle_enter+0x2e/0x50 call_cpuidle+0x23/0x40 do_idle+0x1be/0x220 cpu_startup_entry+0x20/0x30 rest_init+0xb5/0xc0 arch_call_rest_init+0xe/0x30 start_kernel+0x448/0x760 x86_64_start_kernel+0x109/0x150 secondary_startup_64_no_verify+0xe0/0xeb </TASK> more> [0]kdb> [0]kdb> [0]kdb> go Catastrophic error detected kdb_continue_catastrophic=0, type go a second time if you really want to continue [0]kdb> go Catastrophic error detected kdb_continue_catastrophic=0, attempting to continue [ 8086.955689] refcount_t: underflow; use-after-free. [ 8086.955697] WARNING: CPU: 0 PID: 0 at lib/refcount.c:28 refcount_warn_saturate+0xc2/0x110 [ 8086.955706] Modules linked in: xt_conntrack nft_chain_nat xt_MASQUERADE xt_addrtype nft_compat nf_tables nfnetlink br_netfilter bridge stp llc overlay dm_mod emrcha(PO) emriio(PO) rktpm(PO) cegbuf_mod(PO) patch_update(PO) se(PO) sgx_tgts(PO) mktme(PO) keylocker(PO) svtdx(PO) svfs_pci_hotplug(PO) vtd_mod(PO) davemem(PO) svmabort(PO) svindexio(PO) usbx2(PO) ehci_sched(PO) svheartbeat(PO) ioapic(PO) sv8259(PO) svintr(PO) lt(PO) pcierootport(PO) enginefw_mod(PO) ata(PO) smbus(PO) spiflash_cdf(PO) arden(PO) dsa_iax(PO) oobmsm_punit(PO) cpm(PO) svkdb(PO) ebg_pch(PO) pch(PO) sviotargets(PO) svbdf(PO) svmem(PO) svbios(PO) dram(PO) svtsc(PO) targets(PO) superio(PO) svkernel(PO) cswitch(PO) mcf(PO) pentiumIII_mod(PO) fs_svfs(PO) mdevdefdb(PO) svfs_os_services(O) ixgbe mdio mdio_devres libphy emeraldrapids_svdefs(PO) regsupport(O) libnvdimm nls_cp437 snd_hda_codec_realtek snd_hda_codec_generic ledtrig_audio snd_hda_intel snd_intel_dspcfg snd_hda_codec snd_hwdep x86_pkg_temp_thermal snd_hda_core snd_pcm snd_timer isst_if_mbox_pci [ 8086.955751] input_leds isst_if_mmio sg snd isst_if_common soundcore wmi button sad9(O) drm fuse backlight configfs efivarfs ip_tables x_tables vmd sdhci led_class rtl8150 r8152 hid_generic pegasus mmc_block usbhid mmc_core hid megaraid_sas ixgb igb i2c_algo_bit ice i40e hpsa scsi_transport_sas e1000e e1000 e100 ax88179_178a usbnet xhci_pci sd_mod xhci_hcd t10_pi crc32c_intel crc64_rocksoft igc crc64 crc_t10dif usbcore crct10dif_generic ptp crct10dif_common usb_common pps_core [ 8086.955784] RIP: 0010:refcount_warn_saturate+0xc2/0x110 [ 8086.955788] Code: 01 e8 82 e7 b4 ff 0f 0b 5d c3 cc cc cc cc 80 3d 68 c6 eb 00 00 75 81 48 c7 c7 a0 87 f6 81 c6 05 58 c6 eb 00 01 e8 5e e7 b4 ff <0f> 0b 5d c3 cc cc cc cc 80 3d 42 c6 eb 00 00 0f 85 59 ff ff ff 48 [ 8086.955790] RSP: 0018:ffa0000000003da0 EFLAGS: 00010286 [ 8086.955793] RAX: 0000000000000000 RBX: ff1100011da40ee0 RCX: ff11000361e1b888 [ 8086.955794] RDX: 00000000ffffffd8 RSI: 0000000000000027 RDI: ff11000361e1b880 [ 8086.955795] RBP: ffa0000000003da0 R08: 80000000ffff9f45 R09: ffa0000000003d28 [ 8086.955796] R10: ff1100035f840000 R11: 0000000000000028 R12: ff11000319ff8000 [ 8086.955797] R13: ff1100011bb79d60 R14: 00000000ffffffd6 R15: ff1100037039cb00 [ 8086.955798] FS: 0000000000000000(0000) GS:ff11000361e00000(0000) knlGS:0000000000000000 [ 8086.955800] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 8086.955801] CR2: 00007ffff7fca168 CR3: 000000013b08a003 CR4: 0000000000471ef8 [ 8086.955803] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 8086.955803] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400 [ 8086.955804] PKRU: 55555554 [ 8086.955805] Call Trace: [ 8086.955806] <IRQ> [ 8086.955808] tcp_wfree+0x112/0x130 [ 8086.955814] skb_release_head_state+0x24/0xa0 [ 8086.955818] napi_consume_skb+0x9c/0x160 [ 8086.955821] igc_poll+0x5d8/0x14d0 [igc] [ 8086.955835] __napi_poll+0x2e/0x1b0 [ 8086.955839] net_rx_action+0x126/0x250 [ 8086.955843] __do_softirq+0xd1/0x295 [ 8086.955846] irq_exit_rcu+0xc5/0xf0 [ 8086.955851] common_interrupt+0x86/0xa0 [ 8086.955857] </IRQ> [ 8086.955857] <TASK> [ 8086.955858] asm_common_interrupt+0x27/0x40 [ 8086.955862] RIP: 0010:cpuidle_enter_state+0xd3/0x3e0 [ 8086.955866] Code: 73 f1 ff ff 49 89 c6 8b 05 e2 ca a7 00 85 c0 0f 8f b3 02 00 00 31 ff e8 1b de 75 ff 80 7d d7 00 0f 85 cd 01 00 00 fb 45 85 ff <0f> 88 fd 00 00 00 49 63 cf 4c 2b 75 c8 48 8d 04 49 48 89 ca 48 8d [ 8086.955867] RSP: 0018:ffffffff82203df0 EFLAGS: 00000202 [ 8086.955869] RAX: ff11000361e2a200 RBX: 0000000000000002 RCX: 000000000000001f [ 8086.955870] RDX: 0000000000000000 RSI: 000000003cf3cf3d RDI: 0000000000000000 [ 8086.955871] RBP: ffffffff82203e28 R08: 0000075ae38471c8 R09: 0000000000000018 [ 8086.955872] R10: 000000000000031a R11: ffffffff8238dca0 R12: ffd1ffffff200000 [ 8086.955873] R13: ffffffff8238dca0 R14: 0000075ae38471c8 R15: 0000000000000002 [ 8086.955875] cpuidle_enter+0x2e/0x50 [ 8086.955880] call_cpuidle+0x23/0x40 [ 8086.955884] do_idle+0x1be/0x220 [ 8086.955887] cpu_startup_entry+0x20/0x30 [ 8086.955889] rest_init+0xb5/0xc0 [ 8086.955892] arch_call_rest_init+0xe/0x30 [ 8086.955895] start_kernel+0x448/0x760 [ 8086.955898] x86_64_start_kernel+0x109/0x150 [ 8086.955900] secondary_startup_64_no_verify+0xe0/0xeb [ 8086.955904] </TASK> [ 8086.955904] ---[ end trace 0000000000000000 ]--- [ 8086.955912] ------------[ cut here ]------------ [ 8086.955913] kernel BUG at lib/dynamic_queue_limits.c:27! [ 8086.955918] invalid opcode: 0000 [#1] SMP [ 8086.955922] RIP: 0010:dql_completed+0x148/0x160 [ 8086.955925] Code: c9 00 48 89 57 58 e9 46 ff ff ff 45 85 e4 41 0f 95 c4 41 39 db 0f 95 c1 41 84 cc 74 05 45 85 ed 78 0a 44 89 c1 e9 27 ff ff ff <0f> 0b 01 f6 44 89 c1 29 f1 0f 48 ca eb 8c cc cc cc cc cc cc cc cc [ 8086.955927] RSP: 0018:ffa0000000003e00 EFLAGS: 00010287 [ 8086.955928] RAX: 000000000000006c RBX: ffa0000003eb0f78 RCX: ff11000109938000 [ 8086.955929] RDX: 0000000000000003 RSI: 0000000000000160 RDI: ff110001002e9480 [ 8086.955930] RBP: ffa0000000003ed8 R08: ff110001002e93c0 R09: ffa0000000003d28 [ 8086.955931] R10: 0000000000007cc0 R11: 0000000000007c54 R12: 00000000ffffffd9 [ 8086.955932] R13: ff1100037039cb00 R14: 00000000ffffffd9 R15: ff1100037039c048 [ 8086.955933] FS: 0000000000000000(0000) GS:ff11000361e00000(0000) knlGS:0000000000000000 [ 8086.955934] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 8086.955935] CR2: 00007ffff7fca168 CR3: 000000013b08a003 CR4: 0000000000471ef8 [ 8086.955936] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 8086.955937] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400 [ 8086.955938] PKRU: 55555554 [ 8086.955939] Call Trace: [ 8086.955939] <IRQ> [ 8086.955940] ? igc_poll+0x1a9/0x14d0 [igc] [ 8086.955949] __napi_poll+0x2e/0x1b0 [ 8086.955952] net_rx_action+0x126/0x250 [ 8086.955956] __do_softirq+0xd1/0x295 [ 8086.955958] irq_exit_rcu+0xc5/0xf0 [ 8086.955961] common_interrupt+0x86/0xa0 [ 8086.955964] </IRQ> [ 8086.955965] <TASK> [ 8086.955965] asm_common_interrupt+0x27/0x40 [ 8086.955968] RIP: 0010:cpuidle_enter_state+0xd3/0x3e0 [ 8086.955971] Code: 73 f1 ff ff 49 89 c6 8b 05 e2 ca a7 00 85 c0 0f 8f b3 02 00 00 31 ff e8 1b de 75 ff 80 7d d7 00 0f 85 cd 01 00 00 fb 45 85 ff <0f> 88 fd 00 00 00 49 63 cf 4c 2b 75 c8 48 8d 04 49 48 89 ca 48 8d [ 8086.955972] RSP: 0018:ffffffff82203df0 EFLAGS: 00000202 [ 8086.955973] RAX: ff11000361e2a200 RBX: 0000000000000002 RCX: 000000000000001f [ 8086.955974] RDX: 0000000000000000 RSI: 000000003cf3cf3d RDI: 0000000000000000 [ 8086.955974] RBP: ffffffff82203e28 R08: 0000075ae38471c8 R09: 0000000000000018 [ 8086.955975] R10: 000000000000031a R11: ffffffff8238dca0 R12: ffd1ffffff200000 [ 8086.955976] R13: ffffffff8238dca0 R14: 0000075ae38471c8 R15: 0000000000000002 [ 8086.955978] cpuidle_enter+0x2e/0x50 [ 8086.955981] call_cpuidle+0x23/0x40 [ 8086.955984] do_idle+0x1be/0x220 [ 8086.955985] cpu_startup_entry+0x20/0x30 [ 8086.955987] rest_init+0xb5/0xc0 [ 8086.955990] arch_call_rest_init+0xe/0x30 [ 8086.955992] start_kernel+0x448/0x760 [ 8086.955994] x86_64_start_kernel+0x109/0x150 [ 8086.955996] secondary_startup_64_no_verify+0xe0/0xeb [ 8086.955998] </TASK> [ 8086.955999] Modules linked in: xt_conntrack nft_chain_nat xt_MASQUERADE xt_addrtype nft_compat nf_tables nfnetlink br_netfilter bridge stp llc overlay dm_mod emrcha(PO) emriio(PO) rktpm(PO) cegbuf_mod(PO) patch_update(PO) se(PO) sgx_tgts(PO) mktme(PO) keylocker(PO) svtdx(PO) svfs_pci_hotplug(PO) vtd_mod(PO) davemem(PO) svmabort(PO) svindexio(PO) usbx2(PO) ehci_sched(PO) svheartbeat(PO) ioapic(PO) sv8259(PO) svintr(PO) lt(PO) pcierootport(PO) enginefw_mod(PO) ata(PO) smbus(PO) spiflash_cdf(PO) arden(PO) dsa_iax(PO) oobmsm_punit(PO) cpm(PO) svkdb(PO) ebg_pch(PO) pch(PO) sviotargets(PO) svbdf(PO) svmem(PO) svbios(PO) dram(PO) svtsc(PO) targets(PO) superio(PO) svkernel(PO) cswitch(PO) mcf(PO) pentiumIII_mod(PO) fs_svfs(PO) mdevdefdb(PO) svfs_os_services(O) ixgbe mdio mdio_devres libphy emeraldrapids_svdefs(PO) regsupport(O) libnvdimm nls_cp437 snd_hda_codec_realtek snd_hda_codec_generic ledtrig_audio snd_hda_intel snd_intel_dspcfg snd_hda_codec snd_hwdep x86_pkg_temp_thermal snd_hda_core snd_pcm snd_timer isst_if_mbox_pci [ 8086.956029] input_leds isst_if_mmio sg snd isst_if_common soundcore wmi button sad9(O) drm fuse backlight configfs efivarfs ip_tables x_tables vmd sdhci led_class rtl8150 r8152 hid_generic pegasus mmc_block usbhid mmc_core hid megaraid_sas ixgb igb i2c_algo_bit ice i40e hpsa scsi_transport_sas e1000e e1000 e100 ax88179_178a usbnet xhci_pci sd_mod xhci_hcd t10_pi crc32c_intel crc64_rocksoft igc crc64 crc_t10dif usbcore crct10dif_generic ptp crct10dif_common usb_common pps_core [16762.543675] INFO: NMI handler (kgdb_nmi_handler) took too long to run: 8675587.593 msecs [16762.543678] INFO: NMI handler (kgdb_nmi_handler) took too long to run: 8675587.595 msecs [16762.543673] INFO: NMI handler (kgdb_nmi_handler) took too long to run: 8675587.495 msecs [16762.543679] INFO: NMI handler (kgdb_nmi_handler) took too long to run: 8675587.599 msecs [16762.543678] INFO: NMI handler (kgdb_nmi_handler) took too long to run: 8675587.598 msecs [16762.543690] INFO: NMI handler (kgdb_nmi_handler) took too long to run: 8675587.605 msecs [16762.543684] INFO: NMI handler (kgdb_nmi_handler) took too long to run: 8675587.599 msecs [16762.543693] INFO: NMI handler (kgdb_nmi_handler) took too long to run: 8675587.613 msecs [16762.543784] ---[ end trace 0000000000000000 ]--- [16762.849099] RIP: 0010:dql_completed+0x148/0x160 PANIC: Fatal exception in interrupt Fixes: 9b27517 ("igc: Add ndo_tx_timeout support") Tested-by: Alejandra Victoria Alcaraz <alejandra.victoria.alcaraz@intel.com> Signed-off-by: Muhammad Husaini Zulkifli <muhammad.husaini.zulkifli@intel.com> Acked-by: Sasha Neftin <sasha.neftin@intel.com> Tested-by: Naama Meir <naamax.meir@linux.intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
sam-aws
pushed a commit
that referenced
this pull request
Aug 3, 2023
[ Upstream commit 0a771f7 ] On error when building the rule, the immediate expression unbinds the chain, hence objects can be deactivated by the transaction records. Otherwise, it is possible to trigger the following warning: WARNING: CPU: 3 PID: 915 at net/netfilter/nf_tables_api.c:2013 nf_tables_chain_destroy+0x1f7/0x210 [nf_tables] CPU: 3 PID: 915 Comm: chain-bind-err- Not tainted 6.1.39 #1 RIP: 0010:nf_tables_chain_destroy+0x1f7/0x210 [nf_tables] Fixes: 4bedf9e ("netfilter: nf_tables: fix chain binding transaction logic") Reported-by: Kevin Rich <kevinrich1337@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Sasha Levin <sashal@kernel.org>
sam-aws
pushed a commit
that referenced
this pull request
Aug 3, 2023
…RULE_CHAIN_ID [ Upstream commit 0ebc106 ] Bail out with EOPNOTSUPP when adding rule to bound chain via NFTA_RULE_CHAIN_ID. The following warning splat is shown when adding a rule to a deleted bound chain: WARNING: CPU: 2 PID: 13692 at net/netfilter/nf_tables_api.c:2013 nf_tables_chain_destroy+0x1f7/0x210 [nf_tables] CPU: 2 PID: 13692 Comm: chain-bound-rul Not tainted 6.1.39 #1 RIP: 0010:nf_tables_chain_destroy+0x1f7/0x210 [nf_tables] Fixes: d0e2c7d ("netfilter: nf_tables: add NFT_CHAIN_BINDING") Reported-by: Kevin Rich <kevinrich1337@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Sasha Levin <sashal@kernel.org>
sam-aws
pushed a commit
that referenced
this pull request
Aug 3, 2023
commit b7c822f upstream. Even though the test suite covers this it somehow became obscured that this wasn't working. The test iommufd_ioas.mock_domain.access_domain_destory would blow up rarely. end should be set to 1 because this just pushed an item, the carry, to the pfns list. Sometimes the test would blow up with: BUG: kernel NULL pointer dereference, address: 0000000000000000 #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not-present page PGD 0 P4D 0 Oops: 0000 [#1] SMP CPU: 5 PID: 584 Comm: iommufd Not tainted 6.5.0-rc1-dirty #1236 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014 RIP: 0010:batch_unpin+0xa2/0x100 [iommufd] Code: 17 48 81 fe ff ff 07 00 77 70 48 8b 15 b7 be 97 e2 48 85 d2 74 14 48 8b 14 fa 48 85 d2 74 0b 40 0f b6 f6 48 c1 e6 04 48 01 f2 <48> 8b 3a 48 c1 e0 06 89 ca 48 89 de 48 83 e7 f0 48 01 c7 e8 96 dc RSP: 0018:ffffc90001677a58 EFLAGS: 00010246 RAX: 00007f7e2646f000 RBX: 0000000000000000 RCX: 0000000000000001 RDX: 0000000000000000 RSI: 00000000fefc4c8d RDI: 0000000000fefc4c RBP: ffffc90001677a80 R08: 0000000000000048 R09: 0000000000000200 R10: 0000000000030b98 R11: ffffffff81f3bb40 R12: 0000000000000001 R13: ffff888101f75800 R14: ffffc90001677ad0 R15: 00000000000001fe FS: 00007f9323679740(0000) GS:ffff8881ba540000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000000 CR3: 0000000105ede003 CR4: 00000000003706a0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <TASK> ? show_regs+0x5c/0x70 ? __die+0x1f/0x60 ? page_fault_oops+0x15d/0x440 ? lock_release+0xbc/0x240 ? exc_page_fault+0x4a4/0x970 ? asm_exc_page_fault+0x27/0x30 ? batch_unpin+0xa2/0x100 [iommufd] ? batch_unpin+0xba/0x100 [iommufd] __iopt_area_unfill_domain+0x198/0x430 [iommufd] ? __mutex_lock+0x8c/0xb80 ? __mutex_lock+0x6aa/0xb80 ? xa_erase+0x28/0x30 ? iopt_table_remove_domain+0x162/0x320 [iommufd] ? lock_release+0xbc/0x240 iopt_area_unfill_domain+0xd/0x10 [iommufd] iopt_table_remove_domain+0x195/0x320 [iommufd] iommufd_hw_pagetable_destroy+0xb3/0x110 [iommufd] iommufd_object_destroy_user+0x8e/0xf0 [iommufd] iommufd_device_detach+0xc5/0x140 [iommufd] iommufd_selftest_destroy+0x1f/0x70 [iommufd] iommufd_object_destroy_user+0x8e/0xf0 [iommufd] iommufd_destroy+0x3a/0x50 [iommufd] iommufd_fops_ioctl+0xfb/0x170 [iommufd] __x64_sys_ioctl+0x40d/0x9a0 do_syscall_64+0x3c/0x80 entry_SYSCALL_64_after_hwframe+0x46/0xb0 Link: https://lore.kernel.org/r/3-v1-85aacb2af554+bc-iommufd_syz3_jgg@nvidia.com Cc: <stable@vger.kernel.org> Fixes: f394576 ("iommufd: PFN handling for iopt_pages") Reviewed-by: Kevin Tian <kevin.tian@intel.com> Tested-by: Nicolin Chen <nicolinc@nvidia.com> Reported-by: Nicolin Chen <nicolinc@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
sam-aws
pushed a commit
that referenced
this pull request
Aug 3, 2023
[ Upstream commit 8a4a0b2 ] If we disable quotas while we have a relocation of a metadata block group that has extents belonging to the quota root, we can cause the relocation to fail with -ENOENT. This is because relocation builds backref nodes for extents of the quota root and later needs to walk the backrefs and access the quota root - however if in between a task disables quotas, it results in deleting the quota root from the root tree (with btrfs_del_root(), called from btrfs_quota_disable(). This can be sporadically triggered by test case btrfs/255 from fstests: $ ./check btrfs/255 FSTYP -- btrfs PLATFORM -- Linux/x86_64 debian0 6.4.0-rc6-btrfs-next-134+ #1 SMP PREEMPT_DYNAMIC Thu Jun 15 11:59:28 WEST 2023 MKFS_OPTIONS -- /dev/sdc MOUNT_OPTIONS -- /dev/sdc /home/fdmanana/btrfs-tests/scratch_1 btrfs/255 6s ... _check_dmesg: something found in dmesg (see /home/fdmanana/git/hub/xfstests/results//btrfs/255.dmesg) - output mismatch (see /home/fdmanana/git/hub/xfstests/results//btrfs/255.out.bad) # --- tests/btrfs/255.out 2023-03-02 21:47:53.876609426 +0000 # +++ /home/fdmanana/git/hub/xfstests/results//btrfs/255.out.bad 2023-06-16 10:20:39.267563212 +0100 # @@ -1,2 +1,4 @@ # QA output created by 255 # +ERROR: error during balancing '/home/fdmanana/btrfs-tests/scratch_1': No such file or directory # +There may be more info in syslog - try dmesg | tail # Silence is golden # ... (Run 'diff -u /home/fdmanana/git/hub/xfstests/tests/btrfs/255.out /home/fdmanana/git/hub/xfstests/results//btrfs/255.out.bad' to see the entire diff) Ran: btrfs/255 Failures: btrfs/255 Failed 1 of 1 tests To fix this make the quota disable operation take the cleaner mutex, as relocation of a block group also takes this mutex. This is also what we do when deleting a subvolume/snapshot, we take the cleaner mutex in the cleaner kthread (at cleaner_kthread()) and then we call btrfs_del_root() at btrfs_drop_snapshot() while under the protection of the cleaner mutex. Fixes: bed92ea ("Btrfs: qgroup implementation and prototypes") CC: stable@vger.kernel.org # 5.4+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
sam-aws
pushed a commit
that referenced
this pull request
Aug 3, 2023
[ Upstream commit d4a7ce6 ] The Xeon validation group has been carrying out some loaded tests with various HW configurations, and they have seen some transmit queue time out happening during the test. This will cause the reset adapter function to be called by igc_tx_timeout(). Similar race conditions may arise when the interface is being brought down and up in igc_reinit_locked(), an interrupt being generated, and igc_clean_tx_irq() being called to complete the TX. When the igc_tx_timeout() function is invoked, this patch will turn off all TX ring HW queues during igc_down() process. TX ring HW queues will be activated again during the igc_configure_tx_ring() process when performing the igc_up() procedure later. This patch also moved existing igc_disable_tx_ring_hw() to avoid using forward declaration. Kernel trace: [ 7678.747813] ------------[ cut here ]------------ [ 7678.757914] NETDEV WATCHDOG: enp1s0 (igc): transmit queue 2 timed out [ 7678.770117] WARNING: CPU: 0 PID: 13 at net/sched/sch_generic.c:525 dev_watchdog+0x1ae/0x1f0 [ 7678.784459] Modules linked in: xt_conntrack nft_chain_nat xt_MASQUERADE xt_addrtype nft_compat nf_tables nfnetlink br_netfilter bridge stp llc overlay dm_mod emrcha(PO) emriio(PO) rktpm(PO) cegbuf_mod(PO) patch_update(PO) se(PO) sgx_tgts(PO) mktme(PO) keylocker(PO) svtdx(PO) svfs_pci_hotplug(PO) vtd_mod(PO) davemem(PO) svmabort(PO) svindexio(PO) usbx2(PO) ehci_sched(PO) svheartbeat(PO) ioapic(PO) sv8259(PO) svintr(PO) lt(PO) pcierootport(PO) enginefw_mod(PO) ata(PO) smbus(PO) spiflash_cdf(PO) arden(PO) dsa_iax(PO) oobmsm_punit(PO) cpm(PO) svkdb(PO) ebg_pch(PO) pch(PO) sviotargets(PO) svbdf(PO) svmem(PO) svbios(PO) dram(PO) svtsc(PO) targets(PO) superio(PO) svkernel(PO) cswitch(PO) mcf(PO) pentiumIII_mod(PO) fs_svfs(PO) mdevdefdb(PO) svfs_os_services(O) ixgbe mdio mdio_devres libphy emeraldrapids_svdefs(PO) regsupport(O) libnvdimm nls_cp437 snd_hda_codec_realtek snd_hda_codec_generic ledtrig_audio snd_hda_intel snd_intel_dspcfg snd_hda_codec snd_hwdep x86_pkg_temp_thermal snd_hda_core snd_pcm snd_timer isst_if_mbox_pci [ 7678.784496] input_leds isst_if_mmio sg snd isst_if_common soundcore wmi button sad9(O) drm fuse backlight configfs efivarfs ip_tables x_tables vmd sdhci led_class rtl8150 r8152 hid_generic pegasus mmc_block usbhid mmc_core hid megaraid_sas ixgb igb i2c_algo_bit ice i40e hpsa scsi_transport_sas e1000e e1000 e100 ax88179_178a usbnet xhci_pci sd_mod xhci_hcd t10_pi crc32c_intel crc64_rocksoft igc crc64 crc_t10dif usbcore crct10dif_generic ptp crct10dif_common usb_common pps_core [ 7679.200403] RIP: 0010:dev_watchdog+0x1ae/0x1f0 [ 7679.210201] Code: 28 e9 53 ff ff ff 4c 89 e7 c6 05 06 42 b9 00 01 e8 17 d1 fb ff 44 89 e9 4c 89 e6 48 c7 c7 40 ad fb 81 48 89 c2 e8 52 62 82 ff <0f> 0b e9 72 ff ff ff 65 8b 05 80 7d 7c 7e 89 c0 48 0f a3 05 0a c1 [ 7679.245438] RSP: 0018:ffa00000001f7d90 EFLAGS: 00010282 [ 7679.256021] RAX: 0000000000000000 RBX: ff11000109938440 RCX: 0000000000000000 [ 7679.268710] RDX: ff11000361e26cd8 RSI: ff11000361e1b880 RDI: ff11000361e1b880 [ 7679.281314] RBP: ffa00000001f7da8 R08: ff1100035f8fffe8 R09: 0000000000027ffb [ 7679.293840] R10: 0000000000001f0a R11: ff1100035f840000 R12: ff11000109938000 [ 7679.306276] R13: 0000000000000002 R14: dead000000000122 R15: ffa00000001f7e18 [ 7679.318648] FS: 0000000000000000(0000) GS:ff11000361e00000(0000) knlGS:0000000000000000 [ 7679.332064] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 7679.342757] CR2: 00007ffff7fca168 CR3: 000000013b08a006 CR4: 0000000000471ef8 [ 7679.354984] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 7679.367207] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400 [ 7679.379370] PKRU: 55555554 [ 7679.386446] Call Trace: [ 7679.393152] <TASK> [ 7679.399363] ? __pfx_dev_watchdog+0x10/0x10 [ 7679.407870] call_timer_fn+0x31/0x110 [ 7679.415698] expire_timers+0xb2/0x120 [ 7679.423403] run_timer_softirq+0x179/0x1e0 [ 7679.431532] ? __schedule+0x2b1/0x820 [ 7679.439078] __do_softirq+0xd1/0x295 [ 7679.446426] ? __pfx_smpboot_thread_fn+0x10/0x10 [ 7679.454867] run_ksoftirqd+0x22/0x30 [ 7679.462058] smpboot_thread_fn+0xb7/0x160 [ 7679.469670] kthread+0xcd/0xf0 [ 7679.476097] ? __pfx_kthread+0x10/0x10 [ 7679.483211] ret_from_fork+0x29/0x50 [ 7679.490047] </TASK> [ 7679.495204] ---[ end trace 0000000000000000 ]--- [ 7679.503179] igc 0000:01:00.0 enp1s0: Register Dump [ 7679.511230] igc 0000:01:00.0 enp1s0: Register Name Value [ 7679.519892] igc 0000:01:00.0 enp1s0: CTRL 181c0641 [ 7679.528782] igc 0000:01:00.0 enp1s0: STATUS 40280683 [ 7679.537551] igc 0000:01:00.0 enp1s0: CTRL_EXT 10000040 [ 7679.546284] igc 0000:01:00.0 enp1s0: MDIC 180a3800 [ 7679.554942] igc 0000:01:00.0 enp1s0: ICR 00000081 [ 7679.563503] igc 0000:01:00.0 enp1s0: RCTL 04408022 [ 7679.571963] igc 0000:01:00.0 enp1s0: RDLEN[0-3] 00001000 00001000 00001000 00001000 [ 7679.583075] igc 0000:01:00.0 enp1s0: RDH[0-3] 00000068 000000b6 0000000f 00000031 [ 7679.594162] igc 0000:01:00.0 enp1s0: RDT[0-3] 00000066 000000b2 0000000 00000030 [ 7679.605174] igc 0000:01:00.0 enp1s0: RXDCTL[0-3] 02040808 02040808 02040808 02040808 [ 7679.616196] igc 0000:01:00.0 enp1s0: RDBAL[0-3] 1bb7c000 1bb7f000 1bb82000 0ef33000 [ 7679.627242] igc 0000:01:00.0 enp1s0: RDBAH[0-3] 00000001 00000001 00000001 00000001 [ 7679.638256] igc 0000:01:00.0 enp1s0: TCTL a503f0fa [ 7679.646607] igc 0000:01:00.0 enp1s0: TDBAL[0-3] 2ba4a000 1bb6f000 1bb74000 1bb79000 [ 7679.657609] igc 0000:01:00.0 enp1s0: TDBAH[0-3] 00000001 00000001 00000001 00000001 [ 7679.668551] igc 0000:01:00.0 enp1s0: TDLEN[0-3] 00001000 00001000 00001000 00001000 [ 7679.679470] igc 0000:01:00.0 enp1s0: TDH[0-3] 000000a7 0000002d 000000bf 000000d9 [ 7679.690406] igc 0000:01:00.0 enp1s0: TDT[0-3] 000000a7 0000002d 000000bf 000000d9 [ 7679.701264] igc 0000:01:00.0 enp1s0: TXDCTL[0-3] 02100108 02100108 02100108 02100108 [ 7679.712123] igc 0000:01:00.0 enp1s0: Reset adapter [ 7683.085967] igc 0000:01:00.0 enp1s0: NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX [ 8086.945561] ------------[ cut here ]------------ Entering kdb (current=0xffffffff8220b200, pid 0) on processor 0 Oops: (null) due to oops @ 0xffffffff81573888 RIP: 0010:dql_completed+0x148/0x160 Code: c9 00 48 89 57 58 e9 46 ff ff ff 45 85 e4 41 0f 95 c4 41 39 db 0f 95 c1 41 84 cc 74 05 45 85 ed 78 0a 44 89 c1 e9 27 ff ff ff <0f> 0b 01 f6 44 89 c1 29 f1 0f 48 ca eb 8c cc cc cc cc cc cc cc cc RSP: 0018:ffa0000000003e00 EFLAGS: 00010287 RAX: 000000000000006c RBX: ffa0000003eb0f78 RCX: ff11000109938000 RDX: 0000000000000003 RSI: 0000000000000160 RDI: ff110001002e9480 RBP: ffa0000000003ed8 R08: ff110001002e93c0 R09: ffa0000000003d28 R10: 0000000000007cc0 R11: 0000000000007c54 R12: 00000000ffffffd9 R13: ff1100037039cb00 R14: 00000000ffffffd9 R15: ff1100037039c048 FS: 0000000000000000(0000) GS:ff11000361e00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007ffff7fca168 CR3: 000000013b08a003 CR4: 0000000000471ef8 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400 PKRU: 55555554 Call Trace: <IRQ> ? igc_poll+0x1a9/0x14d0 [igc] __napi_poll+0x2e/0x1b0 net_rx_action+0x126/0x250 __do_softirq+0xd1/0x295 irq_exit_rcu+0xc5/0xf0 common_interrupt+0x86/0xa0 </IRQ> <TASK> asm_common_interrupt+0x27/0x40 RIP: 0010:cpuidle_enter_state+0xd3/0x3e0 Code: 73 f1 ff ff 49 89 c6 8b 05 e2 ca a7 00 85 c0 0f 8f b3 02 00 00 31 ff e8 1b de 75 ff 80 7d d7 00 0f 85 cd 01 00 00 fb 45 85 ff <0f> 88 fd 00 00 00 49 63 cf 4c 2b 75 c8 48 8d 04 49 48 89 ca 48 8d RSP: 0018:ffffffff82203df0 EFLAGS: 00000202 RAX: ff11000361e2a200 RBX: 0000000000000002 RCX: 000000000000001f RDX: 0000000000000000 RSI: 000000003cf3cf3d RDI: 0000000000000000 RBP: ffffffff82203e28 R08: 0000075ae38471c8 R09: 0000000000000018 R10: 000000000000031a R11: ffffffff8238dca0 R12: ffd1ffffff200000 R13: ffffffff8238dca0 R14: 0000075ae38471c8 R15: 0000000000000002 cpuidle_enter+0x2e/0x50 call_cpuidle+0x23/0x40 do_idle+0x1be/0x220 cpu_startup_entry+0x20/0x30 rest_init+0xb5/0xc0 arch_call_rest_init+0xe/0x30 start_kernel+0x448/0x760 x86_64_start_kernel+0x109/0x150 secondary_startup_64_no_verify+0xe0/0xeb </TASK> more> [0]kdb> [0]kdb> [0]kdb> go Catastrophic error detected kdb_continue_catastrophic=0, type go a second time if you really want to continue [0]kdb> go Catastrophic error detected kdb_continue_catastrophic=0, attempting to continue [ 8086.955689] refcount_t: underflow; use-after-free. [ 8086.955697] WARNING: CPU: 0 PID: 0 at lib/refcount.c:28 refcount_warn_saturate+0xc2/0x110 [ 8086.955706] Modules linked in: xt_conntrack nft_chain_nat xt_MASQUERADE xt_addrtype nft_compat nf_tables nfnetlink br_netfilter bridge stp llc overlay dm_mod emrcha(PO) emriio(PO) rktpm(PO) cegbuf_mod(PO) patch_update(PO) se(PO) sgx_tgts(PO) mktme(PO) keylocker(PO) svtdx(PO) svfs_pci_hotplug(PO) vtd_mod(PO) davemem(PO) svmabort(PO) svindexio(PO) usbx2(PO) ehci_sched(PO) svheartbeat(PO) ioapic(PO) sv8259(PO) svintr(PO) lt(PO) pcierootport(PO) enginefw_mod(PO) ata(PO) smbus(PO) spiflash_cdf(PO) arden(PO) dsa_iax(PO) oobmsm_punit(PO) cpm(PO) svkdb(PO) ebg_pch(PO) pch(PO) sviotargets(PO) svbdf(PO) svmem(PO) svbios(PO) dram(PO) svtsc(PO) targets(PO) superio(PO) svkernel(PO) cswitch(PO) mcf(PO) pentiumIII_mod(PO) fs_svfs(PO) mdevdefdb(PO) svfs_os_services(O) ixgbe mdio mdio_devres libphy emeraldrapids_svdefs(PO) regsupport(O) libnvdimm nls_cp437 snd_hda_codec_realtek snd_hda_codec_generic ledtrig_audio snd_hda_intel snd_intel_dspcfg snd_hda_codec snd_hwdep x86_pkg_temp_thermal snd_hda_core snd_pcm snd_timer isst_if_mbox_pci [ 8086.955751] input_leds isst_if_mmio sg snd isst_if_common soundcore wmi button sad9(O) drm fuse backlight configfs efivarfs ip_tables x_tables vmd sdhci led_class rtl8150 r8152 hid_generic pegasus mmc_block usbhid mmc_core hid megaraid_sas ixgb igb i2c_algo_bit ice i40e hpsa scsi_transport_sas e1000e e1000 e100 ax88179_178a usbnet xhci_pci sd_mod xhci_hcd t10_pi crc32c_intel crc64_rocksoft igc crc64 crc_t10dif usbcore crct10dif_generic ptp crct10dif_common usb_common pps_core [ 8086.955784] RIP: 0010:refcount_warn_saturate+0xc2/0x110 [ 8086.955788] Code: 01 e8 82 e7 b4 ff 0f 0b 5d c3 cc cc cc cc 80 3d 68 c6 eb 00 00 75 81 48 c7 c7 a0 87 f6 81 c6 05 58 c6 eb 00 01 e8 5e e7 b4 ff <0f> 0b 5d c3 cc cc cc cc 80 3d 42 c6 eb 00 00 0f 85 59 ff ff ff 48 [ 8086.955790] RSP: 0018:ffa0000000003da0 EFLAGS: 00010286 [ 8086.955793] RAX: 0000000000000000 RBX: ff1100011da40ee0 RCX: ff11000361e1b888 [ 8086.955794] RDX: 00000000ffffffd8 RSI: 0000000000000027 RDI: ff11000361e1b880 [ 8086.955795] RBP: ffa0000000003da0 R08: 80000000ffff9f45 R09: ffa0000000003d28 [ 8086.955796] R10: ff1100035f840000 R11: 0000000000000028 R12: ff11000319ff8000 [ 8086.955797] R13: ff1100011bb79d60 R14: 00000000ffffffd6 R15: ff1100037039cb00 [ 8086.955798] FS: 0000000000000000(0000) GS:ff11000361e00000(0000) knlGS:0000000000000000 [ 8086.955800] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 8086.955801] CR2: 00007ffff7fca168 CR3: 000000013b08a003 CR4: 0000000000471ef8 [ 8086.955803] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 8086.955803] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400 [ 8086.955804] PKRU: 55555554 [ 8086.955805] Call Trace: [ 8086.955806] <IRQ> [ 8086.955808] tcp_wfree+0x112/0x130 [ 8086.955814] skb_release_head_state+0x24/0xa0 [ 8086.955818] napi_consume_skb+0x9c/0x160 [ 8086.955821] igc_poll+0x5d8/0x14d0 [igc] [ 8086.955835] __napi_poll+0x2e/0x1b0 [ 8086.955839] net_rx_action+0x126/0x250 [ 8086.955843] __do_softirq+0xd1/0x295 [ 8086.955846] irq_exit_rcu+0xc5/0xf0 [ 8086.955851] common_interrupt+0x86/0xa0 [ 8086.955857] </IRQ> [ 8086.955857] <TASK> [ 8086.955858] asm_common_interrupt+0x27/0x40 [ 8086.955862] RIP: 0010:cpuidle_enter_state+0xd3/0x3e0 [ 8086.955866] Code: 73 f1 ff ff 49 89 c6 8b 05 e2 ca a7 00 85 c0 0f 8f b3 02 00 00 31 ff e8 1b de 75 ff 80 7d d7 00 0f 85 cd 01 00 00 fb 45 85 ff <0f> 88 fd 00 00 00 49 63 cf 4c 2b 75 c8 48 8d 04 49 48 89 ca 48 8d [ 8086.955867] RSP: 0018:ffffffff82203df0 EFLAGS: 00000202 [ 8086.955869] RAX: ff11000361e2a200 RBX: 0000000000000002 RCX: 000000000000001f [ 8086.955870] RDX: 0000000000000000 RSI: 000000003cf3cf3d RDI: 0000000000000000 [ 8086.955871] RBP: ffffffff82203e28 R08: 0000075ae38471c8 R09: 0000000000000018 [ 8086.955872] R10: 000000000000031a R11: ffffffff8238dca0 R12: ffd1ffffff200000 [ 8086.955873] R13: ffffffff8238dca0 R14: 0000075ae38471c8 R15: 0000000000000002 [ 8086.955875] cpuidle_enter+0x2e/0x50 [ 8086.955880] call_cpuidle+0x23/0x40 [ 8086.955884] do_idle+0x1be/0x220 [ 8086.955887] cpu_startup_entry+0x20/0x30 [ 8086.955889] rest_init+0xb5/0xc0 [ 8086.955892] arch_call_rest_init+0xe/0x30 [ 8086.955895] start_kernel+0x448/0x760 [ 8086.955898] x86_64_start_kernel+0x109/0x150 [ 8086.955900] secondary_startup_64_no_verify+0xe0/0xeb [ 8086.955904] </TASK> [ 8086.955904] ---[ end trace 0000000000000000 ]--- [ 8086.955912] ------------[ cut here ]------------ [ 8086.955913] kernel BUG at lib/dynamic_queue_limits.c:27! [ 8086.955918] invalid opcode: 0000 [#1] SMP [ 8086.955922] RIP: 0010:dql_completed+0x148/0x160 [ 8086.955925] Code: c9 00 48 89 57 58 e9 46 ff ff ff 45 85 e4 41 0f 95 c4 41 39 db 0f 95 c1 41 84 cc 74 05 45 85 ed 78 0a 44 89 c1 e9 27 ff ff ff <0f> 0b 01 f6 44 89 c1 29 f1 0f 48 ca eb 8c cc cc cc cc cc cc cc cc [ 8086.955927] RSP: 0018:ffa0000000003e00 EFLAGS: 00010287 [ 8086.955928] RAX: 000000000000006c RBX: ffa0000003eb0f78 RCX: ff11000109938000 [ 8086.955929] RDX: 0000000000000003 RSI: 0000000000000160 RDI: ff110001002e9480 [ 8086.955930] RBP: ffa0000000003ed8 R08: ff110001002e93c0 R09: ffa0000000003d28 [ 8086.955931] R10: 0000000000007cc0 R11: 0000000000007c54 R12: 00000000ffffffd9 [ 8086.955932] R13: ff1100037039cb00 R14: 00000000ffffffd9 R15: ff1100037039c048 [ 8086.955933] FS: 0000000000000000(0000) GS:ff11000361e00000(0000) knlGS:0000000000000000 [ 8086.955934] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 8086.955935] CR2: 00007ffff7fca168 CR3: 000000013b08a003 CR4: 0000000000471ef8 [ 8086.955936] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 8086.955937] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400 [ 8086.955938] PKRU: 55555554 [ 8086.955939] Call Trace: [ 8086.955939] <IRQ> [ 8086.955940] ? igc_poll+0x1a9/0x14d0 [igc] [ 8086.955949] __napi_poll+0x2e/0x1b0 [ 8086.955952] net_rx_action+0x126/0x250 [ 8086.955956] __do_softirq+0xd1/0x295 [ 8086.955958] irq_exit_rcu+0xc5/0xf0 [ 8086.955961] common_interrupt+0x86/0xa0 [ 8086.955964] </IRQ> [ 8086.955965] <TASK> [ 8086.955965] asm_common_interrupt+0x27/0x40 [ 8086.955968] RIP: 0010:cpuidle_enter_state+0xd3/0x3e0 [ 8086.955971] Code: 73 f1 ff ff 49 89 c6 8b 05 e2 ca a7 00 85 c0 0f 8f b3 02 00 00 31 ff e8 1b de 75 ff 80 7d d7 00 0f 85 cd 01 00 00 fb 45 85 ff <0f> 88 fd 00 00 00 49 63 cf 4c 2b 75 c8 48 8d 04 49 48 89 ca 48 8d [ 8086.955972] RSP: 0018:ffffffff82203df0 EFLAGS: 00000202 [ 8086.955973] RAX: ff11000361e2a200 RBX: 0000000000000002 RCX: 000000000000001f [ 8086.955974] RDX: 0000000000000000 RSI: 000000003cf3cf3d RDI: 0000000000000000 [ 8086.955974] RBP: ffffffff82203e28 R08: 0000075ae38471c8 R09: 0000000000000018 [ 8086.955975] R10: 000000000000031a R11: ffffffff8238dca0 R12: ffd1ffffff200000 [ 8086.955976] R13: ffffffff8238dca0 R14: 0000075ae38471c8 R15: 0000000000000002 [ 8086.955978] cpuidle_enter+0x2e/0x50 [ 8086.955981] call_cpuidle+0x23/0x40 [ 8086.955984] do_idle+0x1be/0x220 [ 8086.955985] cpu_startup_entry+0x20/0x30 [ 8086.955987] rest_init+0xb5/0xc0 [ 8086.955990] arch_call_rest_init+0xe/0x30 [ 8086.955992] start_kernel+0x448/0x760 [ 8086.955994] x86_64_start_kernel+0x109/0x150 [ 8086.955996] secondary_startup_64_no_verify+0xe0/0xeb [ 8086.955998] </TASK> [ 8086.955999] Modules linked in: xt_conntrack nft_chain_nat xt_MASQUERADE xt_addrtype nft_compat nf_tables nfnetlink br_netfilter bridge stp llc overlay dm_mod emrcha(PO) emriio(PO) rktpm(PO) cegbuf_mod(PO) patch_update(PO) se(PO) sgx_tgts(PO) mktme(PO) keylocker(PO) svtdx(PO) svfs_pci_hotplug(PO) vtd_mod(PO) davemem(PO) svmabort(PO) svindexio(PO) usbx2(PO) ehci_sched(PO) svheartbeat(PO) ioapic(PO) sv8259(PO) svintr(PO) lt(PO) pcierootport(PO) enginefw_mod(PO) ata(PO) smbus(PO) spiflash_cdf(PO) arden(PO) dsa_iax(PO) oobmsm_punit(PO) cpm(PO) svkdb(PO) ebg_pch(PO) pch(PO) sviotargets(PO) svbdf(PO) svmem(PO) svbios(PO) dram(PO) svtsc(PO) targets(PO) superio(PO) svkernel(PO) cswitch(PO) mcf(PO) pentiumIII_mod(PO) fs_svfs(PO) mdevdefdb(PO) svfs_os_services(O) ixgbe mdio mdio_devres libphy emeraldrapids_svdefs(PO) regsupport(O) libnvdimm nls_cp437 snd_hda_codec_realtek snd_hda_codec_generic ledtrig_audio snd_hda_intel snd_intel_dspcfg snd_hda_codec snd_hwdep x86_pkg_temp_thermal snd_hda_core snd_pcm snd_timer isst_if_mbox_pci [ 8086.956029] input_leds isst_if_mmio sg snd isst_if_common soundcore wmi button sad9(O) drm fuse backlight configfs efivarfs ip_tables x_tables vmd sdhci led_class rtl8150 r8152 hid_generic pegasus mmc_block usbhid mmc_core hid megaraid_sas ixgb igb i2c_algo_bit ice i40e hpsa scsi_transport_sas e1000e e1000 e100 ax88179_178a usbnet xhci_pci sd_mod xhci_hcd t10_pi crc32c_intel crc64_rocksoft igc crc64 crc_t10dif usbcore crct10dif_generic ptp crct10dif_common usb_common pps_core [16762.543675] INFO: NMI handler (kgdb_nmi_handler) took too long to run: 8675587.593 msecs [16762.543678] INFO: NMI handler (kgdb_nmi_handler) took too long to run: 8675587.595 msecs [16762.543673] INFO: NMI handler (kgdb_nmi_handler) took too long to run: 8675587.495 msecs [16762.543679] INFO: NMI handler (kgdb_nmi_handler) took too long to run: 8675587.599 msecs [16762.543678] INFO: NMI handler (kgdb_nmi_handler) took too long to run: 8675587.598 msecs [16762.543690] INFO: NMI handler (kgdb_nmi_handler) took too long to run: 8675587.605 msecs [16762.543684] INFO: NMI handler (kgdb_nmi_handler) took too long to run: 8675587.599 msecs [16762.543693] INFO: NMI handler (kgdb_nmi_handler) took too long to run: 8675587.613 msecs [16762.543784] ---[ end trace 0000000000000000 ]--- [16762.849099] RIP: 0010:dql_completed+0x148/0x160 PANIC: Fatal exception in interrupt Fixes: 9b27517 ("igc: Add ndo_tx_timeout support") Tested-by: Alejandra Victoria Alcaraz <alejandra.victoria.alcaraz@intel.com> Signed-off-by: Muhammad Husaini Zulkifli <muhammad.husaini.zulkifli@intel.com> Acked-by: Sasha Neftin <sasha.neftin@intel.com> Tested-by: Naama Meir <naamax.meir@linux.intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sasha Levin <sashal@kernel.org>
heynemax
pushed a commit
that referenced
this pull request
Sep 17, 2025
commit 9502dd5 upstream. After commit f7025d8 ("smb: client: allocate crypto only for primary server") and commit b0abcd6 ("smb: client: fix UAF in async decryption"), the channels started reusing AEAD TFM from primary channel to perform synchronous decryption, but that can't done as there could be multiple cifsd threads (one per channel) simultaneously accessing it to perform decryption. This fixes the following KASAN splat when running fstest generic/249 with 'vers=3.1.1,multichannel,max_channels=4,seal' against Windows Server 2022: BUG: KASAN: slab-use-after-free in gf128mul_4k_lle+0xba/0x110 Read of size 8 at addr ffff8881046c18a0 by task cifsd/986 CPU: 3 UID: 0 PID: 986 Comm: cifsd Not tainted 6.15.0-rc1 #1 PREEMPT(voluntary) Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-3.fc41 04/01/2014 Call Trace: <TASK> dump_stack_lvl+0x5d/0x80 print_report+0x156/0x528 ? gf128mul_4k_lle+0xba/0x110 ? __virt_addr_valid+0x145/0x300 ? __phys_addr+0x46/0x90 ? gf128mul_4k_lle+0xba/0x110 kasan_report+0xdf/0x1a0 ? gf128mul_4k_lle+0xba/0x110 gf128mul_4k_lle+0xba/0x110 ghash_update+0x189/0x210 shash_ahash_update+0x295/0x370 ? __pfx_shash_ahash_update+0x10/0x10 ? __pfx_shash_ahash_update+0x10/0x10 ? __pfx_extract_iter_to_sg+0x10/0x10 ? ___kmalloc_large_node+0x10e/0x180 ? __asan_memset+0x23/0x50 crypto_ahash_update+0x3c/0xc0 gcm_hash_assoc_remain_continue+0x93/0xc0 crypt_message+0xe09/0xec0 [cifs] ? __pfx_crypt_message+0x10/0x10 [cifs] ? _raw_spin_unlock+0x23/0x40 ? __pfx_cifs_readv_from_socket+0x10/0x10 [cifs] decrypt_raw_data+0x229/0x380 [cifs] ? __pfx_decrypt_raw_data+0x10/0x10 [cifs] ? __pfx_cifs_read_iter_from_socket+0x10/0x10 [cifs] smb3_receive_transform+0x837/0xc80 [cifs] ? __pfx_smb3_receive_transform+0x10/0x10 [cifs] ? __pfx___might_resched+0x10/0x10 ? __pfx_smb3_is_transform_hdr+0x10/0x10 [cifs] cifs_demultiplex_thread+0x692/0x1570 [cifs] ? __pfx_cifs_demultiplex_thread+0x10/0x10 [cifs] ? rcu_is_watching+0x20/0x50 ? rcu_lockdep_current_cpu_online+0x62/0xb0 ? find_held_lock+0x32/0x90 ? kvm_sched_clock_read+0x11/0x20 ? local_clock_noinstr+0xd/0xd0 ? trace_irq_enable.constprop.0+0xa8/0xe0 ? __pfx_cifs_demultiplex_thread+0x10/0x10 [cifs] kthread+0x1fe/0x380 ? kthread+0x10f/0x380 ? __pfx_kthread+0x10/0x10 ? local_clock_noinstr+0xd/0xd0 ? ret_from_fork+0x1b/0x60 ? local_clock+0x15/0x30 ? lock_release+0x29b/0x390 ? rcu_is_watching+0x20/0x50 ? __pfx_kthread+0x10/0x10 ret_from_fork+0x31/0x60 ? __pfx_kthread+0x10/0x10 ret_from_fork_asm+0x1a/0x30 </TASK> Tested-by: David Howells <dhowells@redhat.com> Reported-by: Steve French <stfrench@microsoft.com> Closes: https://lore.kernel.org/r/CAH2r5mu6Yc0-RJXM3kFyBYUB09XmXBrNodOiCVR4EDrmxq5Szg@mail.gmail.com Fixes: f7025d8 ("smb: client: allocate crypto only for primary server") Fixes: b0abcd6 ("smb: client: fix UAF in async decryption") Signed-off-by: Paulo Alcantara (Red Hat) <pc@manguebit.com> Signed-off-by: Steve French <stfrench@microsoft.com>
heynemax
pushed a commit
that referenced
this pull request
Sep 17, 2025
commit 5c86919 upstream. The following UAF was triggered when running fstests generic/072 with KASAN enabled against Windows Server 2022 and mount options 'multichannel,max_channels=2,vers=3.1.1,mfsymlinks,noperm' BUG: KASAN: slab-use-after-free in smb2_query_info_compound+0x423/0x6d0 [cifs] Read of size 8 at addr ffff888014941048 by task xfs_io/27534 CPU: 0 PID: 27534 Comm: xfs_io Not tainted 6.6.0-rc7 #1 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.2-3-gd478f380-rebuilt.opensuse.org 04/01/2014 Call Trace: dump_stack_lvl+0x4a/0x80 print_report+0xcf/0x650 ? srso_alias_return_thunk+0x5/0x7f ? srso_alias_return_thunk+0x5/0x7f ? __phys_addr+0x46/0x90 kasan_report+0xda/0x110 ? smb2_query_info_compound+0x423/0x6d0 [cifs] ? smb2_query_info_compound+0x423/0x6d0 [cifs] smb2_query_info_compound+0x423/0x6d0 [cifs] ? __pfx_smb2_query_info_compound+0x10/0x10 [cifs] ? srso_alias_return_thunk+0x5/0x7f ? __stack_depot_save+0x39/0x480 ? kasan_save_stack+0x33/0x60 ? kasan_set_track+0x25/0x30 ? ____kasan_slab_free+0x126/0x170 smb2_queryfs+0xc2/0x2c0 [cifs] ? __pfx_smb2_queryfs+0x10/0x10 [cifs] ? __pfx___lock_acquire+0x10/0x10 smb311_queryfs+0x210/0x220 [cifs] ? __pfx_smb311_queryfs+0x10/0x10 [cifs] ? srso_alias_return_thunk+0x5/0x7f ? __lock_acquire+0x480/0x26c0 ? lock_release+0x1ed/0x640 ? srso_alias_return_thunk+0x5/0x7f ? do_raw_spin_unlock+0x9b/0x100 cifs_statfs+0x18c/0x4b0 [cifs] statfs_by_dentry+0x9b/0xf0 fd_statfs+0x4e/0xb0 __do_sys_fstatfs+0x7f/0xe0 ? __pfx___do_sys_fstatfs+0x10/0x10 ? srso_alias_return_thunk+0x5/0x7f ? lockdep_hardirqs_on_prepare+0x136/0x200 ? srso_alias_return_thunk+0x5/0x7f do_syscall_64+0x3f/0x90 entry_SYSCALL_64_after_hwframe+0x6e/0xd8 Allocated by task 27534: kasan_save_stack+0x33/0x60 kasan_set_track+0x25/0x30 __kasan_kmalloc+0x8f/0xa0 open_cached_dir+0x71b/0x1240 [cifs] smb2_query_info_compound+0x5c3/0x6d0 [cifs] smb2_queryfs+0xc2/0x2c0 [cifs] smb311_queryfs+0x210/0x220 [cifs] cifs_statfs+0x18c/0x4b0 [cifs] statfs_by_dentry+0x9b/0xf0 fd_statfs+0x4e/0xb0 __do_sys_fstatfs+0x7f/0xe0 do_syscall_64+0x3f/0x90 entry_SYSCALL_64_after_hwframe+0x6e/0xd8 Freed by task 27534: kasan_save_stack+0x33/0x60 kasan_set_track+0x25/0x30 kasan_save_free_info+0x2b/0x50 ____kasan_slab_free+0x126/0x170 slab_free_freelist_hook+0xd0/0x1e0 __kmem_cache_free+0x9d/0x1b0 open_cached_dir+0xff5/0x1240 [cifs] smb2_query_info_compound+0x5c3/0x6d0 [cifs] smb2_queryfs+0xc2/0x2c0 [cifs] This is a race between open_cached_dir() and cached_dir_lease_break() where the cache entry for the open directory handle receives a lease break while creating it. And before returning from open_cached_dir(), we put the last reference of the new @cfid because of !@cfid->has_lease. Besides the UAF, while running xfstests a lot of missed lease breaks have been noticed in tests that run several concurrent statfs(2) calls on those cached fids CIFS: VFS: \\w22-root1.gandalf.test No task to wake, unknown frame... CIFS: VFS: \\w22-root1.gandalf.test Cmd: 18 Err: 0x0 Flags: 0x1... CIFS: VFS: \\w22-root1.gandalf.test smb buf 00000000715bfe83 len 108 CIFS: VFS: Dump pending requests: CIFS: VFS: \\w22-root1.gandalf.test No task to wake, unknown frame... CIFS: VFS: \\w22-root1.gandalf.test Cmd: 18 Err: 0x0 Flags: 0x1... CIFS: VFS: \\w22-root1.gandalf.test smb buf 000000005aa7316e len 108 ... To fix both, in open_cached_dir() ensure that @cfid->has_lease is set right before sending out compounded request so that any potential lease break will be get processed by demultiplex thread while we're still caching @cfid. And, if open failed for some reason, re-check @cfid->has_lease to decide whether or not put lease reference. Cc: stable@vger.kernel.org Signed-off-by: Paulo Alcantara (SUSE) <pc@manguebit.com> Signed-off-by: Steve French <stfrench@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Shaoying Xu <shaoyi@amazon.com>
heynemax
pushed a commit
that referenced
this pull request
Sep 17, 2025
commit ef7134c upstream. Recently, we got a customer report that CIFS triggers oops while reconnecting to a server. [0] The workload runs on Kubernetes, and some pods mount CIFS servers in non-root network namespaces. The problem rarely happened, but it was always while the pod was dying. The root cause is wrong reference counting for network namespace. CIFS uses kernel sockets, which do not hold refcnt of the netns that the socket belongs to. That means CIFS must ensure the socket is always freed before its netns; otherwise, use-after-free happens. The repro steps are roughly: 1. mount CIFS in a non-root netns 2. drop packets from the netns 3. destroy the netns 4. unmount CIFS We can reproduce the issue quickly with the script [1] below and see the splat [2] if CONFIG_NET_NS_REFCNT_TRACKER is enabled. When the socket is TCP, it is hard to guarantee the netns lifetime without holding refcnt due to async timers. Let's hold netns refcnt for each socket as done for SMC in commit 9744d2b ("smc: Fix use-after-free in tcp_write_timer_handler()."). Note that we need to move put_net() from cifs_put_tcp_session() to clean_demultiplex_info(); otherwise, __sock_create() still could touch a freed netns while cifsd tries to reconnect from cifs_demultiplex_thread(). Also, maybe_get_net() cannot be put just before __sock_create() because the code is not under RCU and there is a small chance that the same address happened to be reallocated to another netns. [0]: CIFS: VFS: \\XXXXXXXXXXX has not responded in 15 seconds. Reconnecting... CIFS: Serverclose failed 4 times, giving up Unable to handle kernel paging request at virtual address 14de99e461f84a07 Mem abort info: ESR = 0x0000000096000004 EC = 0x25: DABT (current EL), IL = 32 bits SET = 0, FnV = 0 EA = 0, S1PTW = 0 FSC = 0x04: level 0 translation fault Data abort info: ISV = 0, ISS = 0x00000004 CM = 0, WnR = 0 [14de99e461f84a07] address between user and kernel address ranges Internal error: Oops: 0000000096000004 [#1] SMP Modules linked in: cls_bpf sch_ingress nls_utf8 cifs cifs_arc4 cifs_md4 dns_resolver tcp_diag inet_diag veth xt_state xt_connmark nf_conntrack_netlink xt_nat xt_statistic xt_MASQUERADE xt_mark xt_addrtype ipt_REJECT nf_reject_ipv4 nft_chain_nat nf_nat xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 xt_comment nft_compat nf_tables nfnetlink overlay nls_ascii nls_cp437 sunrpc vfat fat aes_ce_blk aes_ce_cipher ghash_ce sm4_ce_cipher sm4 sm3_ce sm3 sha3_ce sha512_ce sha512_arm64 sha1_ce ena button sch_fq_codel loop fuse configfs dmi_sysfs sha2_ce sha256_arm64 dm_mirror dm_region_hash dm_log dm_mod dax efivarfs CPU: 5 PID: 2690970 Comm: cifsd Not tainted 6.1.103-109.184.amzn2023.aarch64 #1 Hardware name: Amazon EC2 r7g.4xlarge/, BIOS 1.0 11/1/2018 pstate: 00400005 (nzcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--) pc : fib_rules_lookup+0x44/0x238 lr : __fib_lookup+0x64/0xbc sp : ffff8000265db790 x29: ffff8000265db790 x28: 0000000000000000 x27: 000000000000bd01 x26: 0000000000000000 x25: ffff000b4baf8000 x24: ffff00047b5e4580 x23: ffff8000265db7e0 x22: 0000000000000000 x21: ffff00047b5e4500 x20: ffff0010e3f694f8 x19: 14de99e461f849f7 x18: 0000000000000000 x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000000 x14: 0000000000000000 x13: 0000000000000000 x12: 3f92800abd010002 x11: 0000000000000001 x10: ffff0010e3f69420 x9 : ffff800008a6f294 x8 : 0000000000000000 x7 : 0000000000000006 x6 : 0000000000000000 x5 : 0000000000000001 x4 : ffff001924354280 x3 : ffff8000265db7e0 x2 : 0000000000000000 x1 : ffff0010e3f694f8 x0 : ffff00047b5e4500 Call trace: fib_rules_lookup+0x44/0x238 __fib_lookup+0x64/0xbc ip_route_output_key_hash_rcu+0x2c4/0x398 ip_route_output_key_hash+0x60/0x8c tcp_v4_connect+0x290/0x488 __inet_stream_connect+0x108/0x3d0 inet_stream_connect+0x50/0x78 kernel_connect+0x6c/0xac generic_ip_connect+0x10c/0x6c8 [cifs] __reconnect_target_unlocked+0xa0/0x214 [cifs] reconnect_dfs_server+0x144/0x460 [cifs] cifs_reconnect+0x88/0x148 [cifs] cifs_readv_from_socket+0x230/0x430 [cifs] cifs_read_from_socket+0x74/0xa8 [cifs] cifs_demultiplex_thread+0xf8/0x704 [cifs] kthread+0xd0/0xd4 Code: aa0003f8 f8480f13 eb18027f 540006c0 (b9401264) [1]: CIFS_CRED="/root/cred.cifs" CIFS_USER="Administrator" CIFS_PASS="Password" CIFS_IP="X.X.X.X" CIFS_PATH="//${CIFS_IP}/Users/Administrator/Desktop/CIFS_TEST" CIFS_MNT="/mnt/smb" DEV="enp0s3" cat <<EOF > ${CIFS_CRED} username=${CIFS_USER} password=${CIFS_PASS} domain=EXAMPLE.COM EOF unshare -n bash -c " mkdir -p ${CIFS_MNT} ip netns attach root 1 ip link add eth0 type veth peer veth0 netns root ip link set eth0 up ip -n root link set veth0 up ip addr add 192.168.0.2/24 dev eth0 ip -n root addr add 192.168.0.1/24 dev veth0 ip route add default via 192.168.0.1 dev eth0 ip netns exec root sysctl net.ipv4.ip_forward=1 ip netns exec root iptables -t nat -A POSTROUTING -s 192.168.0.2 -o ${DEV} -j MASQUERADE mount -t cifs ${CIFS_PATH} ${CIFS_MNT} -o vers=3.0,sec=ntlmssp,credentials=${CIFS_CRED},rsize=65536,wsize=65536,cache=none,echo_interval=1 touch ${CIFS_MNT}/a.txt ip netns exec root iptables -t nat -D POSTROUTING -s 192.168.0.2 -o ${DEV} -j MASQUERADE " umount ${CIFS_MNT} [2]: ref_tracker: net notrefcnt@000000004bbc008d has 1/1 users at sk_alloc (./include/net/net_namespace.h:339 net/core/sock.c:2227) inet_create (net/ipv4/af_inet.c:326 net/ipv4/af_inet.c:252) __sock_create (net/socket.c:1576) generic_ip_connect (fs/smb/client/connect.c:3075) cifs_get_tcp_session.part.0 (fs/smb/client/connect.c:3160 fs/smb/client/connect.c:1798) cifs_mount_get_session (fs/smb/client/trace.h:959 fs/smb/client/connect.c:3366) dfs_mount_share (fs/smb/client/dfs.c:63 fs/smb/client/dfs.c:285) cifs_mount (fs/smb/client/connect.c:3622) cifs_smb3_do_mount (fs/smb/client/cifsfs.c:949) smb3_get_tree (fs/smb/client/fs_context.c:784 fs/smb/client/fs_context.c:802 fs/smb/client/fs_context.c:794) vfs_get_tree (fs/super.c:1800) path_mount (fs/namespace.c:3508 fs/namespace.c:3834) __x64_sys_mount (fs/namespace.c:3848 fs/namespace.c:4057 fs/namespace.c:4034 fs/namespace.c:4034) do_syscall_64 (arch/x86/entry/common.c:52 arch/x86/entry/common.c:83) entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:130) Fixes: 26abe14 ("net: Modify sk_alloc to not reference count the netns of kernel sockets.") Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Acked-by: Tom Talpey <tom@talpey.com> Signed-off-by: Steve French <stfrench@microsoft.com>
heynemax
pushed a commit
that referenced
this pull request
Sep 17, 2025
commit 183a087 upstream. It looks like GPUs are used after shutdown is invoked. Thus, breaking virtio gpu in the shutdown callback is not a good idea - guest hangs attempting to finish console drawing, with these warnings: [ 20.504464] WARNING: CPU: 0 PID: 568 at drivers/gpu/drm/virtio/virtgpu_vq.c:358 virtio_gpu_queue_ctrl_sgs+0x236/0x290 [virtio_gpu] [ 20.505685] Modules linked in: nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 rfkill ip_set nf_tables nfnetlink vfat fat intel_rapl_msr intel_rapl_common intel_uncore_frequency_common nfit libnvdimm kvm_intel kvm rapl iTCO_wdt iTCO_vendor_support virtio_gpu virtio_dma_buf pcspkr drm_shmem_helper i2c_i801 drm_kms_helper lpc_ich i2c_smbus virtio_balloon joydev drm fuse xfs libcrc32c ahci libahci crct10dif_pclmul crc32_pclmul crc32c_intel libata virtio_net ghash_clmulni_intel net_failover virtio_blk failover serio_raw dm_mirror dm_region_hash dm_log dm_mod [ 20.511847] CPU: 0 PID: 568 Comm: kworker/0:3 Kdump: loaded Tainted: G W ------- --- 5.14.0-578.6675_1757216455.el9.x86_64 #1 [ 20.513157] Hardware name: Red Hat KVM/RHEL, BIOS edk2-20241117-3.el9 11/17/2024 [ 20.513918] Workqueue: events drm_fb_helper_damage_work [drm_kms_helper] [ 20.514626] RIP: 0010:virtio_gpu_queue_ctrl_sgs+0x236/0x290 [virtio_gpu] [ 20.515332] Code: 00 00 48 85 c0 74 0c 48 8b 78 08 48 89 ee e8 51 50 00 00 65 ff 0d 42 e3 74 3f 0f 85 69 ff ff ff 0f 1f 44 00 00 e9 5f ff ff ff <0f> 0b e9 3f ff ff ff 48 83 3c 24 00 74 0e 49 8b 7f 40 48 85 ff 74 [ 20.517272] RSP: 0018:ff34f0a8c0787ad8 EFLAGS: 00010282 [ 20.517820] RAX: 00000000fffffffb RBX: 0000000000000000 RCX: 0000000000000820 [ 20.518565] RDX: 0000000000000000 RSI: ff34f0a8c0787be0 RDI: ff218bef03a26300 [ 20.519308] RBP: ff218bef03a26300 R08: 0000000000000001 R09: ff218bef07224360 [ 20.520059] R10: 0000000000008dc0 R11: 0000000000000002 R12: ff218bef02630028 [ 20.520806] R13: ff218bef0263fb48 R14: ff218bef00cb8000 R15: ff218bef07224360 [ 20.521555] FS: 0000000000000000(0000) GS:ff218bef7ba00000(0000) knlGS:0000000000000000 [ 20.522397] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 20.522996] CR2: 000055ac4f7871c0 CR3: 000000010b9f2002 CR4: 0000000000771ef0 [ 20.523740] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 20.524477] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400 [ 20.525223] PKRU: 55555554 [ 20.525515] Call Trace: [ 20.525777] <TASK> [ 20.526003] ? show_trace_log_lvl+0x1c4/0x2df [ 20.526464] ? show_trace_log_lvl+0x1c4/0x2df [ 20.526925] ? virtio_gpu_queue_fenced_ctrl_buffer+0x82/0x2c0 [virtio_gpu] [ 20.527643] ? virtio_gpu_queue_ctrl_sgs+0x236/0x290 [virtio_gpu] [ 20.528282] ? __warn+0x7e/0xd0 [ 20.528621] ? virtio_gpu_queue_ctrl_sgs+0x236/0x290 [virtio_gpu] [ 20.529256] ? report_bug+0x100/0x140 [ 20.529643] ? handle_bug+0x3c/0x70 [ 20.530010] ? exc_invalid_op+0x14/0x70 [ 20.530421] ? asm_exc_invalid_op+0x16/0x20 [ 20.530862] ? virtio_gpu_queue_ctrl_sgs+0x236/0x290 [virtio_gpu] [ 20.531506] ? virtio_gpu_queue_ctrl_sgs+0x174/0x290 [virtio_gpu] [ 20.532148] virtio_gpu_queue_fenced_ctrl_buffer+0x82/0x2c0 [virtio_gpu] [ 20.532843] virtio_gpu_primary_plane_update+0x3e2/0x460 [virtio_gpu] [ 20.533520] drm_atomic_helper_commit_planes+0x108/0x320 [drm_kms_helper] [ 20.534233] drm_atomic_helper_commit_tail+0x45/0x80 [drm_kms_helper] [ 20.534914] commit_tail+0xd2/0x130 [drm_kms_helper] [ 20.535446] drm_atomic_helper_commit+0x11b/0x140 [drm_kms_helper] [ 20.536097] drm_atomic_commit+0xa4/0xe0 [drm] [ 20.536588] ? __pfx___drm_printfn_info+0x10/0x10 [drm] [ 20.537162] drm_atomic_helper_dirtyfb+0x192/0x270 [drm_kms_helper] [ 20.537823] drm_fbdev_shmem_helper_fb_dirty+0x43/0xa0 [drm_shmem_helper] [ 20.538536] drm_fb_helper_damage_work+0x87/0x160 [drm_kms_helper] [ 20.539188] process_one_work+0x194/0x380 [ 20.539612] worker_thread+0x2fe/0x410 [ 20.540007] ? __pfx_worker_thread+0x10/0x10 [ 20.540456] kthread+0xdd/0x100 [ 20.540791] ? __pfx_kthread+0x10/0x10 [ 20.541190] ret_from_fork+0x29/0x50 [ 20.541566] </TASK> [ 20.541802] ---[ end trace 0000000000000000 ]--- It looks like the shutdown is called in the middle of console drawing, so we should either wait for it to finish, or let drm handle the shutdown. This patch implements this second option: Add an option for drivers to bypass the common break+reset handling. As DRM is careful to flush/synchronize outstanding buffers, it looks like GPU can just have a NOP there. Reviewed-by: Eric Auger <eric.auger@redhat.com> Tested-by: Eric Auger <eric.auger@redhat.com> Fixes: 8bd2fa0 ("virtio: break and reset virtio devices on device_shutdown()") Cc: Eric Auger <eauger@redhat.com> Cc: Jocelyn Falempe <jfalempe@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Message-Id: <8490dbeb6f79ed039e6c11d121002618972538a3.1744293540.git.mst@redhat.com> Signed-off-by: Aaron Ahmed <aarnahmd@amazon.com>
heynemax
pushed a commit
that referenced
this pull request
Sep 17, 2025
[ Upstream commit 711741f ] Fix cifs_signal_cifsd_for_reconnect() to take the correct lock order and prevent the following deadlock from happening ====================================================== WARNING: possible circular locking dependency detected 6.16.0-rc3-build2+ #1301 Tainted: G S W ------------------------------------------------------ cifsd/6055 is trying to acquire lock: ffff88810ad56038 (&tcp_ses->srv_lock){+.+.}-{3:3}, at: cifs_signal_cifsd_for_reconnect+0x134/0x200 but task is already holding lock: ffff888119c64330 (&ret_buf->chan_lock){+.+.}-{3:3}, at: cifs_signal_cifsd_for_reconnect+0xcf/0x200 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #2 (&ret_buf->chan_lock){+.+.}-{3:3}: validate_chain+0x1cf/0x270 __lock_acquire+0x60e/0x780 lock_acquire.part.0+0xb4/0x1f0 _raw_spin_lock+0x2f/0x40 cifs_setup_session+0x81/0x4b0 cifs_get_smb_ses+0x771/0x900 cifs_mount_get_session+0x7e/0x170 cifs_mount+0x92/0x2d0 cifs_smb3_do_mount+0x161/0x460 smb3_get_tree+0x55/0x90 vfs_get_tree+0x46/0x180 do_new_mount+0x1b0/0x2e0 path_mount+0x6ee/0x740 do_mount+0x98/0xe0 __do_sys_mount+0x148/0x180 do_syscall_64+0xa4/0x260 entry_SYSCALL_64_after_hwframe+0x76/0x7e -> #1 (&ret_buf->ses_lock){+.+.}-{3:3}: validate_chain+0x1cf/0x270 __lock_acquire+0x60e/0x780 lock_acquire.part.0+0xb4/0x1f0 _raw_spin_lock+0x2f/0x40 cifs_match_super+0x101/0x320 sget+0xab/0x270 cifs_smb3_do_mount+0x1e0/0x460 smb3_get_tree+0x55/0x90 vfs_get_tree+0x46/0x180 do_new_mount+0x1b0/0x2e0 path_mount+0x6ee/0x740 do_mount+0x98/0xe0 __do_sys_mount+0x148/0x180 do_syscall_64+0xa4/0x260 entry_SYSCALL_64_after_hwframe+0x76/0x7e -> #0 (&tcp_ses->srv_lock){+.+.}-{3:3}: check_noncircular+0x95/0xc0 check_prev_add+0x115/0x2f0 validate_chain+0x1cf/0x270 __lock_acquire+0x60e/0x780 lock_acquire.part.0+0xb4/0x1f0 _raw_spin_lock+0x2f/0x40 cifs_signal_cifsd_for_reconnect+0x134/0x200 __cifs_reconnect+0x8f/0x500 cifs_handle_standard+0x112/0x280 cifs_demultiplex_thread+0x64d/0xbc0 kthread+0x2f7/0x310 ret_from_fork+0x2a/0x230 ret_from_fork_asm+0x1a/0x30 other info that might help us debug this: Chain exists of: &tcp_ses->srv_lock --> &ret_buf->ses_lock --> &ret_buf->chan_lock Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(&ret_buf->chan_lock); lock(&ret_buf->ses_lock); lock(&ret_buf->chan_lock); lock(&tcp_ses->srv_lock); *** DEADLOCK *** 3 locks held by cifsd/6055: #0: ffffffff857de398 (&cifs_tcp_ses_lock){+.+.}-{3:3}, at: cifs_signal_cifsd_for_reconnect+0x7b/0x200 #1: ffff888119c64060 (&ret_buf->ses_lock){+.+.}-{3:3}, at: cifs_signal_cifsd_for_reconnect+0x9c/0x200 #2: ffff888119c64330 (&ret_buf->chan_lock){+.+.}-{3:3}, at: cifs_signal_cifsd_for_reconnect+0xcf/0x200 Cc: linux-cifs@vger.kernel.org Reported-by: David Howells <dhowells@redhat.com> Fixes: d7d7a66 ("cifs: avoid use of global locks for high contention data") Reviewed-by: David Howells <dhowells@redhat.com> Tested-by: David Howells <dhowells@redhat.com> Signed-off-by: Paulo Alcantara (Red Hat) <pc@manguebit.org> Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Steve French <stfrench@microsoft.com> Signed-off-by: Sasha Levin <sashal@kernel.org> [v6.1: replaced SERVER_IS_CHAN with CIFS_SERVER_IS_CHAN; resolved contextual conflicts in cifs_signal_cifsd_for_reconnect()] Signed-off-by: Shaoying Xu <shaoyi@amazon.com>
heynemax
pushed a commit
that referenced
this pull request
Sep 17, 2025
commit 9502dd5 upstream. After commit f7025d8 ("smb: client: allocate crypto only for primary server") and commit b0abcd6 ("smb: client: fix UAF in async decryption"), the channels started reusing AEAD TFM from primary channel to perform synchronous decryption, but that can't done as there could be multiple cifsd threads (one per channel) simultaneously accessing it to perform decryption. This fixes the following KASAN splat when running fstest generic/249 with 'vers=3.1.1,multichannel,max_channels=4,seal' against Windows Server 2022: BUG: KASAN: slab-use-after-free in gf128mul_4k_lle+0xba/0x110 Read of size 8 at addr ffff8881046c18a0 by task cifsd/986 CPU: 3 UID: 0 PID: 986 Comm: cifsd Not tainted 6.15.0-rc1 #1 PREEMPT(voluntary) Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-3.fc41 04/01/2014 Call Trace: <TASK> dump_stack_lvl+0x5d/0x80 print_report+0x156/0x528 ? gf128mul_4k_lle+0xba/0x110 ? __virt_addr_valid+0x145/0x300 ? __phys_addr+0x46/0x90 ? gf128mul_4k_lle+0xba/0x110 kasan_report+0xdf/0x1a0 ? gf128mul_4k_lle+0xba/0x110 gf128mul_4k_lle+0xba/0x110 ghash_update+0x189/0x210 shash_ahash_update+0x295/0x370 ? __pfx_shash_ahash_update+0x10/0x10 ? __pfx_shash_ahash_update+0x10/0x10 ? __pfx_extract_iter_to_sg+0x10/0x10 ? ___kmalloc_large_node+0x10e/0x180 ? __asan_memset+0x23/0x50 crypto_ahash_update+0x3c/0xc0 gcm_hash_assoc_remain_continue+0x93/0xc0 crypt_message+0xe09/0xec0 [cifs] ? __pfx_crypt_message+0x10/0x10 [cifs] ? _raw_spin_unlock+0x23/0x40 ? __pfx_cifs_readv_from_socket+0x10/0x10 [cifs] decrypt_raw_data+0x229/0x380 [cifs] ? __pfx_decrypt_raw_data+0x10/0x10 [cifs] ? __pfx_cifs_read_iter_from_socket+0x10/0x10 [cifs] smb3_receive_transform+0x837/0xc80 [cifs] ? __pfx_smb3_receive_transform+0x10/0x10 [cifs] ? __pfx___might_resched+0x10/0x10 ? __pfx_smb3_is_transform_hdr+0x10/0x10 [cifs] cifs_demultiplex_thread+0x692/0x1570 [cifs] ? __pfx_cifs_demultiplex_thread+0x10/0x10 [cifs] ? rcu_is_watching+0x20/0x50 ? rcu_lockdep_current_cpu_online+0x62/0xb0 ? find_held_lock+0x32/0x90 ? kvm_sched_clock_read+0x11/0x20 ? local_clock_noinstr+0xd/0xd0 ? trace_irq_enable.constprop.0+0xa8/0xe0 ? __pfx_cifs_demultiplex_thread+0x10/0x10 [cifs] kthread+0x1fe/0x380 ? kthread+0x10f/0x380 ? __pfx_kthread+0x10/0x10 ? local_clock_noinstr+0xd/0xd0 ? ret_from_fork+0x1b/0x60 ? local_clock+0x15/0x30 ? lock_release+0x29b/0x390 ? rcu_is_watching+0x20/0x50 ? __pfx_kthread+0x10/0x10 ret_from_fork+0x31/0x60 ? __pfx_kthread+0x10/0x10 ret_from_fork_asm+0x1a/0x30 </TASK> Tested-by: David Howells <dhowells@redhat.com> Reported-by: Steve French <stfrench@microsoft.com> Closes: https://lore.kernel.org/r/CAH2r5mu6Yc0-RJXM3kFyBYUB09XmXBrNodOiCVR4EDrmxq5Szg@mail.gmail.com Fixes: f7025d8 ("smb: client: allocate crypto only for primary server") Fixes: b0abcd6 ("smb: client: fix UAF in async decryption") Signed-off-by: Paulo Alcantara (Red Hat) <pc@manguebit.com> Signed-off-by: Steve French <stfrench@microsoft.com> Signed-off-by: Andrew Paniakin <apanyaki@amazon.com>
heynemax
pushed a commit
that referenced
this pull request
Sep 17, 2025
[ Upstream commit 6d19c44 ] Hardware returns a unique identifier for a decrypted packet's xfrm state, this state is looked up in an xarray. However, the state might have been freed by the time of this lookup. Currently, if the state is not found, only a counter is incremented. The secpath (sp) extension on the skb is not removed, resulting in sp->len becoming 0. Subsequently, functions like __xfrm_policy_check() attempt to access fields such as xfrm_input_state(skb)->xso.type (which dereferences sp->xvec[sp->len - 1]) without first validating sp->len. This leads to a crash when dereferencing an invalid state pointer. This patch prevents the crash by explicitly removing the secpath extension from the skb if the xfrm state is not found after hardware decryption. This ensures downstream functions do not operate on a zero-length secpath. BUG: unable to handle page fault for address: ffffffff000002c8 #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not-present page PGD 282e067 P4D 282e067 PUD 0 Oops: Oops: 0000 [#1] SMP CPU: 12 UID: 0 PID: 0 Comm: swapper/12 Not tainted 6.15.0-rc7_for_upstream_min_debug_2025_05_27_22_44 #1 NONE Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014 RIP: 0010:__xfrm_policy_check+0x61a/0xa30 Code: b6 77 7f 83 e6 02 74 14 4d 8b af d8 00 00 00 41 0f b6 45 05 c1 e0 03 48 98 49 01 c5 41 8b 45 00 83 e8 01 48 98 49 8b 44 c5 10 <0f> b6 80 c8 02 00 00 83 e0 0c 3c 04 0f 84 0c 02 00 00 31 ff 80 fa RSP: 0018:ffff88885fb04918 EFLAGS: 00010297 RAX: ffffffff00000000 RBX: 0000000000000002 RCX: 0000000000000000 RDX: 0000000000000002 RSI: 0000000000000002 RDI: 0000000000000000 RBP: ffffffff8311af80 R08: 0000000000000020 R09: 00000000c2eda353 R10: ffff88812be2bbc8 R11: 000000001faab533 R12: ffff88885fb049c8 R13: ffff88812be2bbc8 R14: 0000000000000000 R15: ffff88811896ae00 FS: 0000000000000000(0000) GS:ffff8888dca82000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: ffffffff000002c8 CR3: 0000000243050002 CR4: 0000000000372eb0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <IRQ> ? try_to_wake_up+0x108/0x4c0 ? udp4_lib_lookup2+0xbe/0x150 ? udp_lib_lport_inuse+0x100/0x100 ? __udp4_lib_lookup+0x2b0/0x410 __xfrm_policy_check2.constprop.0+0x11e/0x130 udp_queue_rcv_one_skb+0x1d/0x530 udp_unicast_rcv_skb+0x76/0x90 __udp4_lib_rcv+0xa64/0xe90 ip_protocol_deliver_rcu+0x20/0x130 ip_local_deliver_finish+0x75/0xa0 ip_local_deliver+0xc1/0xd0 ? ip_protocol_deliver_rcu+0x130/0x130 ip_sublist_rcv+0x1f9/0x240 ? ip_rcv_finish_core+0x430/0x430 ip_list_rcv+0xfc/0x130 __netif_receive_skb_list_core+0x181/0x1e0 netif_receive_skb_list_internal+0x200/0x360 ? mlx5e_build_rx_skb+0x1bc/0xda0 [mlx5_core] gro_receive_skb+0xfd/0x210 mlx5e_handle_rx_cqe_mpwrq+0x141/0x280 [mlx5_core] mlx5e_poll_rx_cq+0xcc/0x8e0 [mlx5_core] ? mlx5e_handle_rx_dim+0x91/0xd0 [mlx5_core] mlx5e_napi_poll+0x114/0xab0 [mlx5_core] __napi_poll+0x25/0x170 net_rx_action+0x32d/0x3a0 ? mlx5_eq_comp_int+0x8d/0x280 [mlx5_core] ? notifier_call_chain+0x33/0xa0 handle_softirqs+0xda/0x250 irq_exit_rcu+0x6d/0xc0 common_interrupt+0x81/0xa0 </IRQ> Fixes: b2ac754 ("net/mlx5e: IPsec: Add Connect-X IPsec Rx data path offload") Signed-off-by: Jianbo Liu <jianbol@nvidia.com> Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com> Reviewed-by: Yael Chemla <ychemla@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/1753256672-337784-3-git-send-email-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Hagar Hemdan <hagarhem@amazon.com>
heynemax
pushed a commit
that referenced
this pull request
Sep 17, 2025
commit 183a087 upstream. It looks like GPUs are used after shutdown is invoked. Thus, breaking virtio gpu in the shutdown callback is not a good idea - guest hangs attempting to finish console drawing, with these warnings: [ 20.504464] WARNING: CPU: 0 PID: 568 at drivers/gpu/drm/virtio/virtgpu_vq.c:358 virtio_gpu_queue_ctrl_sgs+0x236/0x290 [virtio_gpu] [ 20.505685] Modules linked in: nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 rfkill ip_set nf_tables nfnetlink vfat fat intel_rapl_msr intel_rapl_common intel_uncore_frequency_common nfit libnvdimm kvm_intel kvm rapl iTCO_wdt iTCO_vendor_support virtio_gpu virtio_dma_buf pcspkr drm_shmem_helper i2c_i801 drm_kms_helper lpc_ich i2c_smbus virtio_balloon joydev drm fuse xfs libcrc32c ahci libahci crct10dif_pclmul crc32_pclmul crc32c_intel libata virtio_net ghash_clmulni_intel net_failover virtio_blk failover serio_raw dm_mirror dm_region_hash dm_log dm_mod [ 20.511847] CPU: 0 PID: 568 Comm: kworker/0:3 Kdump: loaded Tainted: G W ------- --- 5.14.0-578.6675_1757216455.el9.x86_64 #1 [ 20.513157] Hardware name: Red Hat KVM/RHEL, BIOS edk2-20241117-3.el9 11/17/2024 [ 20.513918] Workqueue: events drm_fb_helper_damage_work [drm_kms_helper] [ 20.514626] RIP: 0010:virtio_gpu_queue_ctrl_sgs+0x236/0x290 [virtio_gpu] [ 20.515332] Code: 00 00 48 85 c0 74 0c 48 8b 78 08 48 89 ee e8 51 50 00 00 65 ff 0d 42 e3 74 3f 0f 85 69 ff ff ff 0f 1f 44 00 00 e9 5f ff ff ff <0f> 0b e9 3f ff ff ff 48 83 3c 24 00 74 0e 49 8b 7f 40 48 85 ff 74 [ 20.517272] RSP: 0018:ff34f0a8c0787ad8 EFLAGS: 00010282 [ 20.517820] RAX: 00000000fffffffb RBX: 0000000000000000 RCX: 0000000000000820 [ 20.518565] RDX: 0000000000000000 RSI: ff34f0a8c0787be0 RDI: ff218bef03a26300 [ 20.519308] RBP: ff218bef03a26300 R08: 0000000000000001 R09: ff218bef07224360 [ 20.520059] R10: 0000000000008dc0 R11: 0000000000000002 R12: ff218bef02630028 [ 20.520806] R13: ff218bef0263fb48 R14: ff218bef00cb8000 R15: ff218bef07224360 [ 20.521555] FS: 0000000000000000(0000) GS:ff218bef7ba00000(0000) knlGS:0000000000000000 [ 20.522397] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 20.522996] CR2: 000055ac4f7871c0 CR3: 000000010b9f2002 CR4: 0000000000771ef0 [ 20.523740] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 20.524477] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400 [ 20.525223] PKRU: 55555554 [ 20.525515] Call Trace: [ 20.525777] <TASK> [ 20.526003] ? show_trace_log_lvl+0x1c4/0x2df [ 20.526464] ? show_trace_log_lvl+0x1c4/0x2df [ 20.526925] ? virtio_gpu_queue_fenced_ctrl_buffer+0x82/0x2c0 [virtio_gpu] [ 20.527643] ? virtio_gpu_queue_ctrl_sgs+0x236/0x290 [virtio_gpu] [ 20.528282] ? __warn+0x7e/0xd0 [ 20.528621] ? virtio_gpu_queue_ctrl_sgs+0x236/0x290 [virtio_gpu] [ 20.529256] ? report_bug+0x100/0x140 [ 20.529643] ? handle_bug+0x3c/0x70 [ 20.530010] ? exc_invalid_op+0x14/0x70 [ 20.530421] ? asm_exc_invalid_op+0x16/0x20 [ 20.530862] ? virtio_gpu_queue_ctrl_sgs+0x236/0x290 [virtio_gpu] [ 20.531506] ? virtio_gpu_queue_ctrl_sgs+0x174/0x290 [virtio_gpu] [ 20.532148] virtio_gpu_queue_fenced_ctrl_buffer+0x82/0x2c0 [virtio_gpu] [ 20.532843] virtio_gpu_primary_plane_update+0x3e2/0x460 [virtio_gpu] [ 20.533520] drm_atomic_helper_commit_planes+0x108/0x320 [drm_kms_helper] [ 20.534233] drm_atomic_helper_commit_tail+0x45/0x80 [drm_kms_helper] [ 20.534914] commit_tail+0xd2/0x130 [drm_kms_helper] [ 20.535446] drm_atomic_helper_commit+0x11b/0x140 [drm_kms_helper] [ 20.536097] drm_atomic_commit+0xa4/0xe0 [drm] [ 20.536588] ? __pfx___drm_printfn_info+0x10/0x10 [drm] [ 20.537162] drm_atomic_helper_dirtyfb+0x192/0x270 [drm_kms_helper] [ 20.537823] drm_fbdev_shmem_helper_fb_dirty+0x43/0xa0 [drm_shmem_helper] [ 20.538536] drm_fb_helper_damage_work+0x87/0x160 [drm_kms_helper] [ 20.539188] process_one_work+0x194/0x380 [ 20.539612] worker_thread+0x2fe/0x410 [ 20.540007] ? __pfx_worker_thread+0x10/0x10 [ 20.540456] kthread+0xdd/0x100 [ 20.540791] ? __pfx_kthread+0x10/0x10 [ 20.541190] ret_from_fork+0x29/0x50 [ 20.541566] </TASK> [ 20.541802] ---[ end trace 0000000000000000 ]--- It looks like the shutdown is called in the middle of console drawing, so we should either wait for it to finish, or let drm handle the shutdown. This patch implements this second option: Add an option for drivers to bypass the common break+reset handling. As DRM is careful to flush/synchronize outstanding buffers, it looks like GPU can just have a NOP there. Reviewed-by: Eric Auger <eric.auger@redhat.com> Tested-by: Eric Auger <eric.auger@redhat.com> Fixes: 8bd2fa0 ("virtio: break and reset virtio devices on device_shutdown()") Cc: Eric Auger <eauger@redhat.com> Cc: Jocelyn Falempe <jfalempe@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Message-Id: <8490dbeb6f79ed039e6c11d121002618972538a3.1744293540.git.mst@redhat.com> Signed-off-by: Aaron Ahmed <aarnahmd@amazon.com>
nathan-zcgao
pushed a commit
that referenced
this pull request
Sep 17, 2025
[upstream 051f3ca] On AMD Family17h-based (EPYC) system, a logical NUMA node can contain upto 8 cores (16 threads) with the following topology. ---------------------------- C0 | T0 T1 | || | T0 T1 | C4 --------| || |-------- C1 | T0 T1 | L3 || L3 | T0 T1 | C5 --------| || |-------- C2 | T0 T1 | #0 || #1 | T0 T1 | C6 --------| || |-------- C3 | T0 T1 | || | T0 T1 | C7 ---------------------------- Here, there are 2 last-level (L3) caches per logical NUMA node. A socket can contain upto 4 NUMA nodes, and a system can support upto 2 sockets. With full system configuration, current scheduler creates 4 sched domains: domain0 SMT (span a core) domain1 MC (span a last-level-cache) domain2 NUMA (span a socket: 4 nodes) domain3 NUMA (span a system: 8 nodes) Note that there is no domain to represent cpus spaning a logical NUMA node. With this hierarchy of sched domains, the scheduler does not balance properly in the following cases: Case1: When running 8 tasks, a properly balanced system should schedule a task per logical NUMA node. This is not the case for the current scheduler. Case2: In some cases, threads are scheduled on the same cpu, while other cpus are idle. This results in run-to-run inconsistency. For example: taskset -c 0-7 sysbench --num-threads=8 --test=cpu \ --cpu-max-prime=100000 run Total execution time ranges from 25.1s to 33.5s depending on threads placement, where 25.1s is when all 8 threads are balanced properly on 8 cpus. Introducing NUMA identity node sched domain, which is based on how SRAT/SLIT table define a logical NUMA node. This results in the following hierarchy of sched domains on the same system described above. domain0 SMT (span a core) domain1 MC (span a last-level-cache) domain2 NODE (span a logical NUMA node) domain3 NUMA (span a socket: 4 nodes) domain4 NUMA (span a system: 8 nodes) This fixes the improper load balancing cases mentioned above. Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: bp@suse.de Link: http://lkml.kernel.org/r/1504768805-46716-1-git-send-email-suravee.suthikulpanit@amd.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Balbir Singh <sblbir@amazon.com>
nathan-zcgao
pushed a commit
that referenced
this pull request
Sep 17, 2025
This is a fix against the issue that crash dump kernel may hang up during booting, which can happen on any ACPI-based system with "ACPI Reclaim Memory." (kernel messages after panic kicked off kdump) (snip...) Bye! (snip...) ACPI: Core revision 20170728 pud=000000002e7d0003, *pmd=000000002e7c0003, *pte=00e8000039710707 Internal error: Oops: 96000021 [#1] SMP Modules linked in: CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.14.0-rc6 #1 task: ffff000008d05180 task.stack: ffff000008cc0000 PC is at acpi_ns_lookup+0x25c/0x3c0 LR is at acpi_ds_load1_begin_op+0xa4/0x294 (snip...) Process swapper/0 (pid: 0, stack limit = 0xffff000008cc0000) Call trace: (snip...) [<ffff0000084a6764>] acpi_ns_lookup+0x25c/0x3c0 [<ffff00000849b4f8>] acpi_ds_load1_begin_op+0xa4/0x294 [<ffff0000084ad4ac>] acpi_ps_build_named_op+0xc4/0x198 [<ffff0000084ad6cc>] acpi_ps_create_op+0x14c/0x270 [<ffff0000084acfa8>] acpi_ps_parse_loop+0x188/0x5c8 [<ffff0000084ae048>] acpi_ps_parse_aml+0xb0/0x2b8 [<ffff0000084a8e10>] acpi_ns_one_complete_parse+0x144/0x184 [<ffff0000084a8e98>] acpi_ns_parse_table+0x48/0x68 [<ffff0000084a82cc>] acpi_ns_load_table+0x4c/0xdc [<ffff0000084b32f8>] acpi_tb_load_namespace+0xe4/0x264 [<ffff000008baf9b4>] acpi_load_tables+0x48/0xc0 [<ffff000008badc20>] acpi_early_init+0x9c/0xd0 [<ffff000008b70d50>] start_kernel+0x3b4/0x43c Code: b9008fb9 2a000318 36380054 32190318 (b94002c0) ---[ end trace c46ed37f9651c58e ]--- Kernel panic - not syncing: Fatal exception Rebooting in 10 seconds.. (diagnosis) * This fault is a data abort, alignment fault (ESR=0x96000021) during reading out ACPI table. * Initial ACPI tables are normally stored in system ram and marked as "ACPI Reclaim memory" by the firmware. * After the commit f56ab9a ("efi/arm: Don't mark ACPI reclaim memory as MEMBLOCK_NOMAP"), those regions are differently handled as they are "memblock-reserved", without NOMAP bit. * So they are now excluded from device tree's "usable-memory-range" which kexec-tools determines based on a current view of /proc/iomem. * When crash dump kernel boots up, it tries to accesses ACPI tables by mapping them with ioremap(), not ioremap_cache(), in acpi_os_ioremap() since they are no longer part of mapped system ram. * Given that ACPI accessor/helper functions are compiled in without unaligned access support (ACPI_MISALIGNMENT_NOT_SUPPORTED), any unaligned access to ACPI tables can cause a fatal panic. With this patch, acpi_os_ioremap() always honors memory attribute information provided by the firmware (EFI) and retaining cacheability allows the kernel safe access to ACPI tables. [09ffcb0 with changes] Signed-off-by: AKASHI Takahiro <takahiro.akashi@linaro.org> Reviewed-by: James Morse <james.morse@arm.com> Reviewed-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> Reported-by and Tested-by: Bhupesh Sharma <bhsharma@redhat.com> Signed-off-by: Will Deacon <will.deacon@arm.com> Signed-off-by: Balbir Singh <sblbir@amzn.com>
nathan-zcgao
pushed a commit
that referenced
this pull request
Sep 17, 2025
This patch came from the 2.10.5 branch of AmazonFSxLustreClient repo. The patch from that repo is: mdc: hold lock while walking changelog dev list (LU-12566) Prevent the following GPF which is causing stuck processes, when running mount and umount concurrently on the same host. general protection fault: 0000 [#1] SMP PTI ... RIP: 0010:mdc_changelog_cdev_finish+0x3f/0x1b9 [mdc] ... Call Trace: mdc_precleanup+0x2a/0x3c0 [mdc] Original patch was: LU-12566 mdc: hold lock while walking changelog dev list In mdc_changelog_cdev_finish() we need chlg_registered_dev_lock while walking and changing entries on the chlog_registered_devs and ced_obds lists in chlg_registered_dev_find_by_obd(). Move the calling of chlg_registered_dev_find_by_obd() under the mutex, and add assertions to the places where the lists are walked and changed that the mutex is held. Lustre-change: https://review.whamcloud.com/35668 Lustre-commit: a260c530801db7f58efa93b774f06b0ce72649a3 Fixes: 1d40214d96dd ("LU-7659 mdc: expose changelog through char devices") Signed-off-by: Andreas Dilger <adilger@whamcloud.com> Change-Id: Ib62fdff87cde6a4bcfb9bea24a2ea72a933ebbe5 Signed-off-by: Minh Diep <mdiep@whamcloud.com> Reviewed-on: https://review.whamcloud.com/35835 Signed-off-by: Andy Strohman <astroh@amazon.com>
nathan-zcgao
pushed a commit
that referenced
this pull request
Sep 17, 2025
In the Amazon Linux 4.14.y kernel the xen-blkfront driver can encounter a deadlock condition when handling certain errors from the backend: [ 60.870492] BUG: spinlock recursion on CPU#1, swapper/1/0 [ 60.874250] lock: 0xffff8881e096ac00, .magic: dead4ead, .owner: swapper/1/0, .owner_cpu: 1 [ 60.879879] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 4.14.173-138.231.amzn2.x86_64 #1 [ 60.885249] Hardware name: Xen HVM domU, BIOS 4.2.amazon 08/24/2006 [ 60.889623] Call Trace: [ 60.891664] <IRQ> [ 60.893420] dump_stack+0x66/0x85 [ 60.895987] do_raw_spin_lock+0x8d/0xc0 [ 60.898884] blk_queue_write_cache+0x1d/0x140 [ 60.902041] xlvbd_flush+0x26/0xa0 [xen_blkfront] This is caused by the blkfront interrupt handler updating the write cache mode; blk_queue_write_cache() tries to lock the queue lock, which has already been locked by the handler. The blk_queue_write_cache() call can't be delayed to the end of the handler when it drops the lock since it locks the queue lock with spin_lock_irq(), and accidentally enables interrupts before the handler has returned. Instead add a work_struct to blkfront_info and defer the write cache update until after the handler has returned. This bug was introduced as part of the "xen-blkfront: resurrect request-based mode" patch. While the 4.14.y kernel will move away from this and use blk-mq instead soon, this is a smaller-impact fix for current users. Signed-off-by: Samuel Mendoza-Jonas <samjonas@amazon.com> Reviewed-by: Frank van der Linden <fllinden@amazon.com> Reviewed-by: Andy Strohman <astroh@amazon.com> Reviewed-by: Balbir Singh <sblbir@amazon.com>
nathan-zcgao
pushed a commit
that referenced
this pull request
Sep 17, 2025
Steal time initialization requires mapping a memory region which invokes a memory allocation. Doing this at CPU starting time results in the following trace when CONFIG_DEBUG_ATOMIC_SLEEP is enabled: BUG: sleeping function called from invalid context at mm/slab.h:498 in_atomic(): 1, irqs_disabled(): 128, non_block: 0, pid: 0, name: swapper/1 CPU: 1 PID: 0 Comm: swapper/1 Not tainted 5.9.0-rc5+ #1 Call trace: dump_backtrace+0x0/0x208 show_stack+0x1c/0x28 dump_stack+0xc4/0x11c ___might_sleep+0xf8/0x130 __might_sleep+0x58/0x90 slab_pre_alloc_hook.constprop.101+0xd0/0x118 kmem_cache_alloc_node_trace+0x84/0x270 __get_vm_area_node+0x88/0x210 get_vm_area_caller+0x38/0x40 __ioremap_caller+0x70/0xf8 ioremap_cache+0x78/0xb0 memremap+0x9c/0x1a8 init_stolen_time_cpu+0x54/0xf0 cpuhp_invoke_callback+0xa8/0x720 notify_cpu_starting+0xc8/0xd8 secondary_start_kernel+0x114/0x180 CPU1: Booted secondary processor 0x0000000001 [0x431f0a11] However we don't need to initialize steal time at CPU starting time. We can simply wait until CPU online time, just sacrificing a bit of accuracy by returning zero for steal time until we know better. While at it, add __init to the functions that are only called by pv_time_init() which is __init. Signed-off-by: Andrew Jones <drjones@redhat.com> Fixes: e0685fa ("arm64: Retrieve stolen time as paravirtualized guest") Cc: stable@vger.kernel.org Reviewed-by: Steven Price <steven.price@arm.com> Link: https://lore.kernel.org/r/20200916154530.40809-1-drjones@redhat.com Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> (cherry picked from commit 75df529) Signed-off-by: hailmo <hailmo@amazon.com>
nathan-zcgao
pushed a commit
that referenced
this pull request
Sep 17, 2025
…mm_struct mmap_sem is on the hot path of kernel, and it very contended, but it is abused too. It is used to protect arg_start|end and evn_start|end when reading /proc/$PID/cmdline and /proc/$PID/environ, but it doesn't make sense since those proc files just expect to read 4 values atomically and not related to VM, they could be set to arbitrary values by C/R. And, the mmap_sem contention may cause unexpected issue like below: INFO: task ps:14018 blocked for more than 120 seconds. Tainted: G E 4.9.79-009.ali3000.alios7.x86_64 #1 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. ps D 0 14018 1 0x00000004 Call Trace: schedule+0x36/0x80 rwsem_down_read_failed+0xf0/0x150 call_rwsem_down_read_failed+0x18/0x30 down_read+0x20/0x40 proc_pid_cmdline_read+0xd9/0x4e0 __vfs_read+0x37/0x150 vfs_read+0x96/0x130 SyS_read+0x55/0xc0 entry_SYSCALL_64_fastpath+0x1a/0xc5 Both Alexey Dobriyan and Michal Hocko suggested to use dedicated lock for them to mitigate the abuse of mmap_sem. So, introduce a new spinlock in mm_struct to protect the concurrent access to arg_start|end, env_start|end and others, as well as replace write map_sem to read to protect the race condition between prctl and sys_brk which might break check_data_rlimit(), and makes prctl more friendly to other VM operations. This patch just eliminates the abuse of mmap_sem, but it can't resolve the above hung task warning completely since the later access_remote_vm() call needs acquire mmap_sem. The mmap_sem scalability issue will be solved in the future. [yang.shi@linux.alibaba.com: add comment about mmap_sem and arg_lock] Link: http://lkml.kernel.org/r/1524077799-80690-1-git-send-email-yang.shi@linux.alibaba.com Link: http://lkml.kernel.org/r/1523730291-109696-1-git-send-email-yang.shi@linux.alibaba.com Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com> Reviewed-by: Cyrill Gorcunov <gorcunov@openvz.org> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mateusz Guzik <mguzik@redhat.com> Cc: Kirill Tkhai <ktkhai@virtuozzo.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
nathan-zcgao
pushed a commit
that referenced
this pull request
Sep 17, 2025
Chip Select memory size reporting on AMD Family 17h was recently fixed in order to account for interleaving. However, the current method is not robust. The Chip Select Address Mask can be used to find the memory size. There are a couple of cases. 1) For single-rank and dual-rank non-interleaved, use the address mask plus 1 as the size. 2) For dual-rank interleaved, do #1 but "de-interleave" the address mask first. Always "de-interleave" the address mask in order to simplify the code flow. Bit mask manipulation is necessary to check for interleaving, so just go ahead and do the de-interleaving. In the non-interleaved case, the original and de-interleaved address masks will be the same. To de-interleave the mask, count the number of zero bits in the middle of the mask and swap them with the most significant bits. For example, Original=0xFFFF9FE, De-interleaved=0x3FFFFFE Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: "linux-edac@vger.kernel.org" <linux-edac@vger.kernel.org> Cc: James Morse <james.morse@arm.com> Cc: Mauro Carvalho Chehab <mchehab@kernel.org> Cc: Tony Luck <tony.luck@intel.com> Link: https://lkml.kernel.org/r/20190821235938.118710-5-Yazen.Ghannam@amd.com (cherry picked from commit e53a3b2)
nathan-zcgao
pushed a commit
that referenced
this pull request
Sep 17, 2025
Commit fa765c4 upstream shutdown_pirq and startup_pirq are not taking the irq_mapping_update_lock because they can't due to lock inversion. Both are called with the irq_desc->lock being taking. The lock order, however, is first irq_mapping_update_lock and then irq_desc->lock. This opens multiple races: - shutdown_pirq can be interrupted by a function that allocates an event channel: CPU0 CPU1 shutdown_pirq { xen_evtchn_close(e) __startup_pirq { EVTCHNOP_bind_pirq -> returns just freed evtchn e set_evtchn_to_irq(e, irq) } xen_irq_info_cleanup() { set_evtchn_to_irq(e, -1) } } Assume here event channel e refers here to the same event channel number. After this race the evtchn_to_irq mapping for e is invalid (-1). - __startup_pirq races with __unbind_from_irq in a similar way. Because __startup_pirq doesn't take irq_mapping_update_lock it can grab the evtchn that __unbind_from_irq is currently freeing and cleaning up. In this case even though the event channel is allocated, its mapping can be unset in evtchn_to_irq. The fix is to first cleanup the mappings and then close the event channel. In this way, when an event channel gets allocated it's potential previous evtchn_to_irq mappings are guaranteed to be unset already. This is also the reverse order of the allocation where first the event channel is allocated and then the mappings are setup. On a 5.10 kernel prior to commit 3fcdaf3 ("xen/events: modify internal [un]bind interfaces"), we hit a BUG like the following during probing of NVMe devices. The issue is that during nvme_setup_io_queues, pci_free_irq is called for every device which results in a call to shutdown_pirq. With many nvme devices it's therefore likely to hit this race during boot because there will be multiple calls to shutdown_pirq and startup_pirq are running potentially in parallel. ------------[ cut here ]------------ blkfront: xvda: barrier or flush: disabled; persistent grants: enabled; indirect descriptors: enabled; bounce buffer: enabled kernel BUG at drivers/xen/events/events_base.c:499! invalid opcode: 0000 [#1] SMP PTI CPU: 44 PID: 375 Comm: kworker/u257:23 Not tainted 5.10.201-191.748.amzn2.x86_64 #1 Hardware name: Xen HVM domU, BIOS 4.11.amazon 08/24/2006 Workqueue: nvme-reset-wq nvme_reset_work RIP: 0010:bind_evtchn_to_cpu+0xdf/0xf0 Code: 5d 41 5e c3 cc cc cc cc 44 89 f7 e8 2b 55 ad ff 49 89 c5 48 85 c0 0f 84 64 ff ff ff 4c 8b 68 30 41 83 fe ff 0f 85 60 ff ff ff <0f> 0b 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 0f 1f 44 00 00 RSP: 0000:ffffc9000d533b08 EFLAGS: 00010046 RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000006 RDX: 0000000000000028 RSI: 00000000ffffffff RDI: 00000000ffffffff RBP: ffff888107419680 R08: 0000000000000000 R09: ffffffff82d72b00 R10: 0000000000000000 R11: 0000000000000000 R12: 00000000000001ed R13: 0000000000000000 R14: 00000000ffffffff R15: 0000000000000002 FS: 0000000000000000(0000) GS:ffff88bc8b500000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000000 CR3: 0000000002610001 CR4: 00000000001706e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: ? show_trace_log_lvl+0x1c1/0x2d9 ? show_trace_log_lvl+0x1c1/0x2d9 ? set_affinity_irq+0xdc/0x1c0 ? __die_body.cold+0x8/0xd ? die+0x2b/0x50 ? do_trap+0x90/0x110 ? bind_evtchn_to_cpu+0xdf/0xf0 ? do_error_trap+0x65/0x80 ? bind_evtchn_to_cpu+0xdf/0xf0 ? exc_invalid_op+0x4e/0x70 ? bind_evtchn_to_cpu+0xdf/0xf0 ? asm_exc_invalid_op+0x12/0x20 ? bind_evtchn_to_cpu+0xdf/0xf0 ? bind_evtchn_to_cpu+0xc5/0xf0 set_affinity_irq+0xdc/0x1c0 irq_do_set_affinity+0x1d7/0x1f0 irq_setup_affinity+0xd6/0x1a0 irq_startup+0x8a/0xf0 __setup_irq+0x639/0x6d0 ? nvme_suspend+0x150/0x150 request_threaded_irq+0x10c/0x180 ? nvme_suspend+0x150/0x150 pci_request_irq+0xa8/0xf0 ? __blk_mq_free_request+0x74/0xa0 queue_request_irq+0x6f/0x80 nvme_create_queue+0x1af/0x200 nvme_create_io_queues+0xbd/0xf0 nvme_setup_io_queues+0x246/0x320 ? nvme_irq_check+0x30/0x30 nvme_reset_work+0x1c8/0x400 process_one_work+0x1b0/0x350 worker_thread+0x49/0x310 ? process_one_work+0x350/0x350 kthread+0x11b/0x140 ? __kthread_bind_mask+0x60/0x60 ret_from_fork+0x22/0x30 Modules linked in: ---[ end trace a11715de1eee1873 ]--- Fixes: d46a78b ("xen: implement pirq type event channels") Cc: stable@vger.kernel.org Co-debugged-by: Andrew Panyakin <apanyaki@amazon.com> Signed-off-by: Maximilian Heyne <mheyne@amazon.de> Reviewed-by: Juergen Gross <jgross@suse.com> Link: https://lore.kernel.org/r/20240124163130.31324-1-mheyne@amazon.de Signed-off-by: Juergen Gross <jgross@suse.com>
nathan-zcgao
pushed a commit
that referenced
this pull request
Sep 17, 2025
commit 7e41969 upstream We observed a null-ptr-deref in fou_gro_receive() while shutting down a host. [0] The NULL pointer is sk->sk_user_data, and the offset 8 is of the protocol in struct fou. When fou_release() is called due to netns dismantle or explicit tunnel teardown, udp_tunnel_sock_release() sets NULL to sk->sk_user_data. Then, the tunnel socket is destroyed after a single RCU grace period. So, in-flight udp4_gro_receive() could find the socket and execute the FOU GRO handler, where sk->sk_user_data could be NULL. Let's use rcu_dereference_sk_user_data() in fou_from_sock() and add NULL checks in FOU GRO handlers. [0]: BUG: kernel NULL pointer dereference, address: 0000000000000008 PF: supervisor read access in kernel mode PF: error_code(0x0000) - not-present page PGD 80000001032f4067 P4D 80000001032f4067 PUD 103240067 PMD 0 SMP PTI CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.10.216-204.855.amzn2.x86_64 #1 Hardware name: Amazon EC2 c5.large/, BIOS 1.0 10/16/2017 RIP: 0010:fou_gro_receive (net/ipv4/fou.c:233) [fou] Code: 41 5f c3 cc cc cc cc e8 e7 2e 69 f4 0f 1f 80 00 00 00 00 0f 1f 44 00 00 49 89 f8 41 54 48 89 f7 48 89 d6 49 8b 80 88 02 00 00 <0f> b6 48 08 0f b7 42 4a 66 25 fd fd 80 cc 02 66 89 42 4a 0f b6 42 RSP: 0018:ffffa330c0003d08 EFLAGS: 00010297 RAX: 0000000000000000 RBX: ffff93d9e3a6b900 RCX: 0000000000000010 RDX: ffff93d9e3a6b900 RSI: ffff93d9e3a6b900 RDI: ffff93dac2e24d08 RBP: ffff93d9e3a6b900 R08: ffff93dacbce6400 R09: 0000000000000002 R10: 0000000000000000 R11: ffffffffb5f369b0 R12: ffff93dacbce6400 R13: ffff93dac2e24d08 R14: 0000000000000000 R15: ffffffffb4edd1c0 FS: 0000000000000000(0000) GS:ffff93daee800000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000008 CR3: 0000000102140001 CR4: 00000000007706f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 PKRU: 55555554 Call Trace: <IRQ> ? show_trace_log_lvl (arch/x86/kernel/dumpstack.c:259) ? __die_body.cold (arch/x86/kernel/dumpstack.c:478 arch/x86/kernel/dumpstack.c:420) ? no_context (arch/x86/mm/fault.c:752) ? exc_page_fault (arch/x86/include/asm/irqflags.h:49 arch/x86/include/asm/irqflags.h:89 arch/x86/mm/fault.c:1435 arch/x86/mm/fault.c:1483) ? asm_exc_page_fault (arch/x86/include/asm/idtentry.h:571) ? fou_gro_receive (net/ipv4/fou.c:233) [fou] udp_gro_receive (include/linux/netdevice.h:2552 net/ipv4/udp_offload.c:559) udp4_gro_receive (net/ipv4/udp_offload.c:604) inet_gro_receive (net/ipv4/af_inet.c:1549 (discriminator 7)) dev_gro_receive (net/core/dev.c:6035 (discriminator 4)) napi_gro_receive (net/core/dev.c:6170) ena_clean_rx_irq (drivers/amazon/net/ena/ena_netdev.c:1558) [ena] ena_io_poll (drivers/amazon/net/ena/ena_netdev.c:1742) [ena] napi_poll (net/core/dev.c:6847) net_rx_action (net/core/dev.c:6917) __do_softirq (arch/x86/include/asm/jump_label.h:25 include/linux/jump_label.h:200 include/trace/events/irq.h:142 kernel/softirq.c:299) asm_call_irq_on_stack (arch/x86/entry/entry_64.S:809) </IRQ> do_softirq_own_stack (arch/x86/include/asm/irq_stack.h:27 arch/x86/include/asm/irq_stack.h:77 arch/x86/kernel/irq_64.c:77) irq_exit_rcu (kernel/softirq.c:393 kernel/softirq.c:423 kernel/softirq.c:435) common_interrupt (arch/x86/kernel/irq.c:239) asm_common_interrupt (arch/x86/include/asm/idtentry.h:626) RIP: 0010:acpi_idle_do_entry (arch/x86/include/asm/irqflags.h:49 arch/x86/include/asm/irqflags.h:89 drivers/acpi/processor_idle.c:114 drivers/acpi/processor_idle.c:575) Code: 8b 15 d1 3c c4 02 ed c3 cc cc cc cc 65 48 8b 04 25 40 ef 01 00 48 8b 00 a8 08 75 eb 0f 1f 44 00 00 0f 00 2d d5 09 55 00 fb f4 <fa> c3 cc cc cc cc e9 be fc ff ff 66 66 2e 0f 1f 84 00 00 00 00 00 RSP: 0018:ffffffffb5603e58 EFLAGS: 00000246 RAX: 0000000000004000 RBX: ffff93dac0929c00 RCX: ffff93daee833900 RDX: ffff93daee800000 RSI: ffff93daee87dc00 RDI: ffff93daee87dc64 RBP: 0000000000000001 R08: ffffffffb5e7b6c0 R09: 0000000000000044 R10: ffff93daee831b04 R11: 00000000000001cd R12: 0000000000000001 R13: ffffffffb5e7b740 R14: 0000000000000001 R15: 0000000000000000 ? sched_clock_cpu (kernel/sched/clock.c:371) acpi_idle_enter (drivers/acpi/processor_idle.c:712 (discriminator 3)) cpuidle_enter_state (drivers/cpuidle/cpuidle.c:237) cpuidle_enter (drivers/cpuidle/cpuidle.c:353) cpuidle_idle_call (kernel/sched/idle.c:158 kernel/sched/idle.c:239) do_idle (kernel/sched/idle.c:302) cpu_startup_entry (kernel/sched/idle.c:395 (discriminator 1)) start_kernel (init/main.c:1048) secondary_startup_64_no_verify (arch/x86/kernel/head_64.S:310) Modules linked in: udp_diag tcp_diag inet_diag nft_nat ipip tunnel4 dummy fou ip_tunnel nft_masq nft_chain_nat nf_nat wireguard nft_ct curve25519_x86_64 libcurve25519_generic nf_conntrack libchacha20poly1305 nf_defrag_ipv6 nf_defrag_ipv4 nft_objref chacha_x86_64 nft_counter nf_tables nfnetlink poly1305_x86_64 ip6_udp_tunnel udp_tunnel libchacha crc32_pclmul ghash_clmulni_intel aesni_intel crypto_simd cryptd glue_helper mousedev psmouse button ena ptp pps_core crc32c_intel CR2: 0000000000000008 Fixes: d92283e ("fou: change to use UDP socket GRO") Reported-by: Alphonse Kurian <alkurian@amazon.com> Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
nathan-zcgao
pushed a commit
that referenced
this pull request
Sep 17, 2025
commit ef7134c upstream. Recently, we got a customer report that CIFS triggers oops while reconnecting to a server. [0] The workload runs on Kubernetes, and some pods mount CIFS servers in non-root network namespaces. The problem rarely happened, but it was always while the pod was dying. The root cause is wrong reference counting for network namespace. CIFS uses kernel sockets, which do not hold refcnt of the netns that the socket belongs to. That means CIFS must ensure the socket is always freed before its netns; otherwise, use-after-free happens. The repro steps are roughly: 1. mount CIFS in a non-root netns 2. drop packets from the netns 3. destroy the netns 4. unmount CIFS We can reproduce the issue quickly with the script [1] below and see the splat [2] if CONFIG_NET_NS_REFCNT_TRACKER is enabled. When the socket is TCP, it is hard to guarantee the netns lifetime without holding refcnt due to async timers. Let's hold netns refcnt for each socket as done for SMC in commit 9744d2b ("smc: Fix use-after-free in tcp_write_timer_handler()."). Note that we need to move put_net() from cifs_put_tcp_session() to clean_demultiplex_info(); otherwise, __sock_create() still could touch a freed netns while cifsd tries to reconnect from cifs_demultiplex_thread(). Also, maybe_get_net() cannot be put just before __sock_create() because the code is not under RCU and there is a small chance that the same address happened to be reallocated to another netns. [0]: CIFS: VFS: \\XXXXXXXXXXX has not responded in 15 seconds. Reconnecting... CIFS: Serverclose failed 4 times, giving up Unable to handle kernel paging request at virtual address 14de99e461f84a07 Mem abort info: ESR = 0x0000000096000004 EC = 0x25: DABT (current EL), IL = 32 bits SET = 0, FnV = 0 EA = 0, S1PTW = 0 FSC = 0x04: level 0 translation fault Data abort info: ISV = 0, ISS = 0x00000004 CM = 0, WnR = 0 [14de99e461f84a07] address between user and kernel address ranges Internal error: Oops: 0000000096000004 [#1] SMP Modules linked in: cls_bpf sch_ingress nls_utf8 cifs cifs_arc4 cifs_md4 dns_resolver tcp_diag inet_diag veth xt_state xt_connmark nf_conntrack_netlink xt_nat xt_statistic xt_MASQUERADE xt_mark xt_addrtype ipt_REJECT nf_reject_ipv4 nft_chain_nat nf_nat xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 xt_comment nft_compat nf_tables nfnetlink overlay nls_ascii nls_cp437 sunrpc vfat fat aes_ce_blk aes_ce_cipher ghash_ce sm4_ce_cipher sm4 sm3_ce sm3 sha3_ce sha512_ce sha512_arm64 sha1_ce ena button sch_fq_codel loop fuse configfs dmi_sysfs sha2_ce sha256_arm64 dm_mirror dm_region_hash dm_log dm_mod dax efivarfs CPU: 5 PID: 2690970 Comm: cifsd Not tainted 6.1.103-109.184.amzn2023.aarch64 #1 Hardware name: Amazon EC2 r7g.4xlarge/, BIOS 1.0 11/1/2018 pstate: 00400005 (nzcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--) pc : fib_rules_lookup+0x44/0x238 lr : __fib_lookup+0x64/0xbc sp : ffff8000265db790 x29: ffff8000265db790 x28: 0000000000000000 x27: 000000000000bd01 x26: 0000000000000000 x25: ffff000b4baf8000 x24: ffff00047b5e4580 x23: ffff8000265db7e0 x22: 0000000000000000 x21: ffff00047b5e4500 x20: ffff0010e3f694f8 x19: 14de99e461f849f7 x18: 0000000000000000 x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000000 x14: 0000000000000000 x13: 0000000000000000 x12: 3f92800abd010002 x11: 0000000000000001 x10: ffff0010e3f69420 x9 : ffff800008a6f294 x8 : 0000000000000000 x7 : 0000000000000006 x6 : 0000000000000000 x5 : 0000000000000001 x4 : ffff001924354280 x3 : ffff8000265db7e0 x2 : 0000000000000000 x1 : ffff0010e3f694f8 x0 : ffff00047b5e4500 Call trace: fib_rules_lookup+0x44/0x238 __fib_lookup+0x64/0xbc ip_route_output_key_hash_rcu+0x2c4/0x398 ip_route_output_key_hash+0x60/0x8c tcp_v4_connect+0x290/0x488 __inet_stream_connect+0x108/0x3d0 inet_stream_connect+0x50/0x78 kernel_connect+0x6c/0xac generic_ip_connect+0x10c/0x6c8 [cifs] __reconnect_target_unlocked+0xa0/0x214 [cifs] reconnect_dfs_server+0x144/0x460 [cifs] cifs_reconnect+0x88/0x148 [cifs] cifs_readv_from_socket+0x230/0x430 [cifs] cifs_read_from_socket+0x74/0xa8 [cifs] cifs_demultiplex_thread+0xf8/0x704 [cifs] kthread+0xd0/0xd4 Code: aa0003f8 f8480f13 eb18027f 540006c0 (b9401264) [1]: CIFS_CRED="/root/cred.cifs" CIFS_USER="Administrator" CIFS_PASS="Password" CIFS_IP="X.X.X.X" CIFS_PATH="//${CIFS_IP}/Users/Administrator/Desktop/CIFS_TEST" CIFS_MNT="/mnt/smb" DEV="enp0s3" cat <<EOF > ${CIFS_CRED} username=${CIFS_USER} password=${CIFS_PASS} domain=EXAMPLE.COM EOF unshare -n bash -c " mkdir -p ${CIFS_MNT} ip netns attach root 1 ip link add eth0 type veth peer veth0 netns root ip link set eth0 up ip -n root link set veth0 up ip addr add 192.168.0.2/24 dev eth0 ip -n root addr add 192.168.0.1/24 dev veth0 ip route add default via 192.168.0.1 dev eth0 ip netns exec root sysctl net.ipv4.ip_forward=1 ip netns exec root iptables -t nat -A POSTROUTING -s 192.168.0.2 -o ${DEV} -j MASQUERADE mount -t cifs ${CIFS_PATH} ${CIFS_MNT} -o vers=3.0,sec=ntlmssp,credentials=${CIFS_CRED},rsize=65536,wsize=65536,cache=none,echo_interval=1 touch ${CIFS_MNT}/a.txt ip netns exec root iptables -t nat -D POSTROUTING -s 192.168.0.2 -o ${DEV} -j MASQUERADE " umount ${CIFS_MNT} [2]: ref_tracker: net notrefcnt@000000004bbc008d has 1/1 users at sk_alloc (./include/net/net_namespace.h:339 net/core/sock.c:2227) inet_create (net/ipv4/af_inet.c:326 net/ipv4/af_inet.c:252) __sock_create (net/socket.c:1576) generic_ip_connect (fs/smb/client/connect.c:3075) cifs_get_tcp_session.part.0 (fs/smb/client/connect.c:3160 fs/smb/client/connect.c:1798) cifs_mount_get_session (fs/smb/client/trace.h:959 fs/smb/client/connect.c:3366) dfs_mount_share (fs/smb/client/dfs.c:63 fs/smb/client/dfs.c:285) cifs_mount (fs/smb/client/connect.c:3622) cifs_smb3_do_mount (fs/smb/client/cifsfs.c:949) smb3_get_tree (fs/smb/client/fs_context.c:784 fs/smb/client/fs_context.c:802 fs/smb/client/fs_context.c:794) vfs_get_tree (fs/super.c:1800) path_mount (fs/namespace.c:3508 fs/namespace.c:3834) __x64_sys_mount (fs/namespace.c:3848 fs/namespace.c:4057 fs/namespace.c:4034 fs/namespace.c:4034) do_syscall_64 (arch/x86/entry/common.c:52 arch/x86/entry/common.c:83) entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:130) Fixes: 26abe14 ("net: Modify sk_alloc to not reference count the netns of kernel sockets.") Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Acked-by: Tom Talpey <tom@talpey.com> Signed-off-by: Steve French <stfrench@microsoft.com>
nathan-zcgao
pushed a commit
that referenced
this pull request
Sep 17, 2025
commit 24a9799 upstream. The UAF bug is due to smb2_reconnect_server() accessing a session that is already being teared down by another thread that is executing __cifs_put_smb_ses(). This can happen when (a) the client has connection to the server but no session or (b) another thread ends up setting @ses->ses_status again to something different than SES_EXITING. To fix this, we need to make sure to unconditionally set @ses->ses_status to SES_EXITING and prevent any other threads from setting a new status while we're still tearing it down. The following can be reproduced by adding some delay to right after the ipc is freed in __cifs_put_smb_ses() - which will give smb2_reconnect_server() worker a chance to run and then accessing @ses->ipc: kinit ... mount.cifs //srv/share /mnt/1 -o sec=krb5,nohandlecache,echo_interval=10 [disconnect srv] ls /mnt/1 &>/dev/null sleep 30 kdestroy [reconnect srv] sleep 10 umount /mnt/1 ... CIFS: VFS: Verify user has a krb5 ticket and keyutils is installed CIFS: VFS: \\srv Send error in SessSetup = -126 CIFS: VFS: Verify user has a krb5 ticket and keyutils is installed CIFS: VFS: \\srv Send error in SessSetup = -126 general protection fault, probably for non-canonical address 0x6b6b6b6b6b6b6b6b: 0000 [#1] PREEMPT SMP NOPTI CPU: 3 PID: 50 Comm: kworker/3:1 Not tainted 6.9.0-rc2 #1 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-1.fc39 04/01/2014 Workqueue: cifsiod smb2_reconnect_server [cifs] RIP: 0010:__list_del_entry_valid_or_report+0x33/0xf0 Code: 4f 08 48 85 d2 74 42 48 85 c9 74 59 48 b8 00 01 00 00 00 00 ad de 48 39 c2 74 61 48 b8 22 01 00 00 00 00 74 69 <48> 8b 01 48 39 f8 75 7b 48 8b 72 08 48 39 c6 0f 85 88 00 00 00 b8 RSP: 0018:ffffc900001bfd70 EFLAGS: 00010a83 RAX: dead000000000122 RBX: ffff88810da53838 RCX: 6b6b6b6b6b6b6b6b RDX: 6b6b6b6b6b6b6b6b RSI: ffffffffc02f6878 RDI: ffff88810da53800 RBP: ffff88810da53800 R08: 0000000000000001 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000001 R12: ffff88810c064000 R13: 0000000000000001 R14: ffff88810c064000 R15: ffff8881039cc000 FS: 0000000000000000(0000) GS:ffff888157c00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007fe3728b1000 CR3: 000000010caa4000 CR4: 0000000000750ef0 PKRU: 55555554 Call Trace: <TASK> ? die_addr+0x36/0x90 ? exc_general_protection+0x1c1/0x3f0 ? asm_exc_general_protection+0x26/0x30 ? __list_del_entry_valid_or_report+0x33/0xf0 __cifs_put_smb_ses+0x1ae/0x500 [cifs] smb2_reconnect_server+0x4ed/0x710 [cifs] process_one_work+0x205/0x6b0 worker_thread+0x191/0x360 ? __pfx_worker_thread+0x10/0x10 kthread+0xe2/0x110 ? __pfx_kthread+0x10/0x10 ret_from_fork+0x34/0x50 ? __pfx_kthread+0x10/0x10 ret_from_fork_asm+0x1a/0x30 </TASK> Cc: stable@vger.kernel.org Signed-off-by: Paulo Alcantara (Red Hat) <pc@manguebit.com> Signed-off-by: Steve French <stfrench@microsoft.com> [v4.14: added ses_lock to cifs_ses; used ses->status and old statusEnum values; resolved contextual conflicts in cifs_put_smb_ses] Signed-off-by: Jay Wang <wanjay@amazon.com>
nathan-zcgao
pushed a commit
that referenced
this pull request
Sep 17, 2025
commit c79a39d upstream. On a board running ntpd and gpsd, I'm seeing a consistent use-after-free in sys_exit() from gpsd when rebooting: pps pps1: removed ------------[ cut here ]------------ kobject: '(null)' (00000000db4bec24): is not initialized, yet kobject_put() is being called. WARNING: CPU: 2 PID: 440 at lib/kobject.c:734 kobject_put+0x120/0x150 CPU: 2 UID: 299 PID: 440 Comm: gpsd Not tainted 6.11.0-rc6-00308-gb31c44928842 #1 Hardware name: Raspberry Pi 4 Model B Rev 1.1 (DT) pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--) pc : kobject_put+0x120/0x150 lr : kobject_put+0x120/0x150 sp : ffffffc0803d3ae0 x29: ffffffc0803d3ae0 x28: ffffff8042dc9738 x27: 0000000000000001 x26: 0000000000000000 x25: ffffff8042dc9040 x24: ffffff8042dc9440 x23: ffffff80402a4620 x22: ffffff8042ef4bd0 x21: ffffff80405cb600 x20: 000000000008001b x19: ffffff8040b3b6e0 x18: 0000000000000000 x17: 0000000000000000 x16: 0000000000000000 x15: 696e6920746f6e20 x14: 7369203a29343263 x13: 205d303434542020 x12: 0000000000000000 x11: 0000000000000000 x10: 0000000000000000 x9 : 0000000000000000 x8 : 0000000000000000 x7 : 0000000000000000 x6 : 0000000000000000 x5 : 0000000000000000 x4 : 0000000000000000 x3 : 0000000000000000 x2 : 0000000000000000 x1 : 0000000000000000 x0 : 0000000000000000 Call trace: kobject_put+0x120/0x150 cdev_put+0x20/0x3c __fput+0x2c4/0x2d8 ____fput+0x1c/0x38 task_work_run+0x70/0xfc do_exit+0x2a0/0x924 do_group_exit+0x34/0x90 get_signal+0x7fc/0x8c0 do_signal+0x128/0x13b4 do_notify_resume+0xdc/0x160 el0_svc+0xd4/0xf8 el0t_64_sync_handler+0x140/0x14c el0t_64_sync+0x190/0x194 ---[ end trace 0000000000000000 ]--- ...followed by more symptoms of corruption, with similar stacks: refcount_t: underflow; use-after-free. kernel BUG at lib/list_debug.c:62! Kernel panic - not syncing: Oops - BUG: Fatal exception This happens because pps_device_destruct() frees the pps_device with the embedded cdev immediately after calling cdev_del(), but, as the comment above cdev_del() notes, fops for previously opened cdevs are still callable even after cdev_del() returns. I think this bug has always been there: I can't explain why it suddenly started happening every time I reboot this particular board. In commit d953e0e ("pps: Fix a use-after free bug when unregistering a source."), George Spelvin suggested removing the embedded cdev. That seems like the simplest way to fix this, so I've implemented his suggestion, using __register_chrdev() with pps_idr becoming the source of truth for which minor corresponds to which device. But now that pps_idr defines userspace visibility instead of cdev_add(), we need to be sure the pps->dev refcount can't reach zero while userspace can still find it again. So, the idr_remove() call moves to pps_unregister_cdev(), and pps_idr now holds a reference to pps->dev. pps_core: source serial1 got cdev (251:1) <...> pps pps1: removed pps_core: unregistering pps1 pps_core: deallocating pps1 Fixes: d953e0e ("pps: Fix a use-after free bug when unregistering a source.") Cc: stable@vger.kernel.org Signed-off-by: Calvin Owens <calvin@wbinvd.org> Reviewed-by: Michal Schmidt <mschmidt@redhat.com> Link: https://lore.kernel.org/r/a17975fd5ae99385791929e563f72564edbcf28f.1731383727.git.calvin@wbinvd.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> [ Resolving merge conflict by keeping the original code only changing the reference in dbg_*() calls. ] Signed-off-by: Stanislav Uschakow <suschako@amazon.de>
nathan-zcgao
pushed a commit
that referenced
this pull request
Sep 17, 2025
[ Upstream commit f4f8a78 ] The opt_num field is controlled by user mode and is not currently validated inside the kernel. An attacker can take advantage of this to trigger an OOB read and potentially leak information. BUG: KASAN: slab-out-of-bounds in nf_osf_match_one+0xbed/0xd10 net/netfilter/nfnetlink_osf.c:88 Read of size 2 at addr ffff88804bc64272 by task poc/6431 CPU: 1 PID: 6431 Comm: poc Not tainted 6.0.0-rc4 #1 Call Trace: nf_osf_match_one+0xbed/0xd10 net/netfilter/nfnetlink_osf.c:88 nf_osf_find+0x186/0x2f0 net/netfilter/nfnetlink_osf.c:281 nft_osf_eval+0x37f/0x590 net/netfilter/nft_osf.c:47 expr_call_ops_eval net/netfilter/nf_tables_core.c:214 nft_do_chain+0x2b0/0x1490 net/netfilter/nf_tables_core.c:264 nft_do_chain_ipv4+0x17c/0x1f0 net/netfilter/nft_chain_filter.c:23 [..] Also add validation to genre, subtype and version fields. Fixes: 11eeef4 ("netfilter: passive OS fingerprint xtables match") Reported-by: Lucas Leong <wmliang@infosec.exchange> Signed-off-by: Wander Lairson Costa <wander@redhat.com> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Sasha Levin <sashal@kernel.org> [mheyne: contextual change: xt_osf_add_callback moved later to nfnl_osf_add_callback.] Signed-off-by: Maximilian Heyne <mheyne@amazon.de>
nathan-zcgao
pushed a commit
that referenced
this pull request
Sep 17, 2025
[ Upstream commit 79ebe91 ] Open /dev/nbdX first, the config_refs will be 1 and the pointers in nbd_device are still null. Disconnect /dev/nbdX, then reference a null recv_workq. The protection by config_refs in nbd_genl_disconnect is useless. [ 656.366194] BUG: kernel NULL pointer dereference, address: 0000000000000020 [ 656.368943] #PF: supervisor write access in kernel mode [ 656.369844] #PF: error_code(0x0002) - not-present page [ 656.370717] PGD 10cc87067 P4D 10cc87067 PUD 1074b4067 PMD 0 [ 656.371693] Oops: 0002 [#1] SMP [ 656.372242] CPU: 5 PID: 7977 Comm: nbd-client Not tainted 5.11.0-rc5-00040-g76c057c84d28 #1 [ 656.373661] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20190727_073836-buildvm-ppc64le-16.ppc.fedoraproject.org-3.fc31 04/01/2014 [ 656.375904] RIP: 0010:mutex_lock+0x29/0x60 [ 656.376627] Code: 00 0f 1f 44 00 00 55 48 89 fd 48 83 05 6f d7 fe 08 01 e8 7a c3 ff ff 48 83 05 6a d7 fe 08 01 31 c0 65 48 8b 14 25 00 6d 01 00 <f0> 48 0f b1 55 d [ 656.378934] RSP: 0018:ffffc900005eb9b0 EFLAGS: 00010246 [ 656.379350] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000 [ 656.379915] RDX: ffff888104cf2600 RSI: ffffffffaae8f452 RDI: 0000000000000020 [ 656.380473] RBP: 0000000000000020 R08: 0000000000000000 R09: ffff88813bd6b318 [ 656.381039] R10: 00000000000000c7 R11: fefefefefefefeff R12: ffff888102710b40 [ 656.381599] R13: ffffc900005eb9e0 R14: ffffffffb2930680 R15: ffff88810770ef00 [ 656.382166] FS: 00007fdf117ebb40(0000) GS:ffff88813bd40000(0000) knlGS:0000000000000000 [ 656.382806] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 656.383261] CR2: 0000000000000020 CR3: 0000000100c84000 CR4: 00000000000006e0 [ 656.383819] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 656.384370] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 656.384927] Call Trace: [ 656.385111] flush_workqueue+0x92/0x6c0 [ 656.385395] nbd_disconnect_and_put+0x81/0xd0 [ 656.385716] nbd_genl_disconnect+0x125/0x2a0 [ 656.386034] genl_family_rcv_msg_doit.isra.0+0x102/0x1b0 [ 656.386422] genl_rcv_msg+0xfc/0x2b0 [ 656.386685] ? nbd_ioctl+0x490/0x490 [ 656.386954] ? genl_family_rcv_msg_doit.isra.0+0x1b0/0x1b0 [ 656.387354] netlink_rcv_skb+0x62/0x180 [ 656.387638] genl_rcv+0x34/0x60 [ 656.387874] netlink_unicast+0x26d/0x590 [ 656.388162] netlink_sendmsg+0x398/0x6c0 [ 656.388451] ? netlink_rcv_skb+0x180/0x180 [ 656.388750] ____sys_sendmsg+0x1da/0x320 [ 656.389038] ? ____sys_recvmsg+0x130/0x220 [ 656.389334] ___sys_sendmsg+0x8e/0xf0 [ 656.389605] ? ___sys_recvmsg+0xa2/0xf0 [ 656.389889] ? handle_mm_fault+0x1671/0x21d0 [ 656.390201] __sys_sendmsg+0x6d/0xe0 [ 656.390464] __x64_sys_sendmsg+0x23/0x30 [ 656.390751] do_syscall_64+0x45/0x70 [ 656.391017] entry_SYSCALL_64_after_hwframe+0x44/0xa9 To fix it, just add if (nbd->recv_workq) to nbd_disconnect_and_put(). Fixes: e9e006f ("nbd: fix max number of supported devs") Signed-off-by: Sun Ke <sunke32@huawei.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Link: https://lore.kernel.org/r/20210512114331.1233964-2-sunke32@huawei.com Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org> [mheyne: fixed some contextual problems] Signed-off-by: Maximilian Heyne <mheyne@amazon.de>
nathan-zcgao
pushed a commit
that referenced
this pull request
Sep 17, 2025
[ Upstream commit 3cffa2d ] Commit 9eed321 ("net: lapbether: only support ethernet devices") has been able to keep syzbot away from net/lapb, until today. In the following splat [1], the issue is that a lapbether device has been created on a bonding device without members. Then adding a non ARPHRD_ETHER member forced the bonding master to change its type. The fix is to make sure we call dev_close() in bond_setup_by_slave() so that the potential linked lapbether devices (or any other devices having assumptions on the physical device) are removed. A similar bug has been addressed in commit 40baec2 ("bonding: fix panic on non-ARPHRD_ETHER enslave failure") [1] skbuff: skb_under_panic: text:ffff800089508810 len:44 put:40 head:ffff0000c78e7c00 data:ffff0000c78e7bea tail:0x16 end:0x140 dev:bond0 kernel BUG at net/core/skbuff.c:192 ! Internal error: Oops - BUG: 00000000f2000800 [#1] PREEMPT SMP Modules linked in: CPU: 0 PID: 6007 Comm: syz-executor383 Not tainted 6.6.0-rc3-syzkaller-gbf6547d8715b #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 08/04/2023 pstate: 60400005 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--) pc : skb_panic net/core/skbuff.c:188 [inline] pc : skb_under_panic+0x13c/0x140 net/core/skbuff.c:202 lr : skb_panic net/core/skbuff.c:188 [inline] lr : skb_under_panic+0x13c/0x140 net/core/skbuff.c:202 sp : ffff800096a06aa0 x29: ffff800096a06ab0 x28: ffff800096a06ba0 x27: dfff800000000000 x26: ffff0000ce9b9b50 x25: 0000000000000016 x24: ffff0000c78e7bea x23: ffff0000c78e7c00 x22: 000000000000002c x21: 0000000000000140 x20: 0000000000000028 x19: ffff800089508810 x18: ffff800096a06100 x17: 0000000000000000 x16: ffff80008a629a3c x15: 0000000000000001 x14: 1fffe00036837a32 x13: 0000000000000000 x12: 0000000000000000 x11: 0000000000000201 x10: 0000000000000000 x9 : cb50b496c519aa00 x8 : cb50b496c519aa00 x7 : 0000000000000001 x6 : 0000000000000001 x5 : ffff800096a063b8 x4 : ffff80008e280f80 x3 : ffff8000805ad11c x2 : 0000000000000001 x1 : 0000000100000201 x0 : 0000000000000086 Call trace: skb_panic net/core/skbuff.c:188 [inline] skb_under_panic+0x13c/0x140 net/core/skbuff.c:202 skb_push+0xf0/0x108 net/core/skbuff.c:2446 ip6gre_header+0xbc/0x738 net/ipv6/ip6_gre.c:1384 dev_hard_header include/linux/netdevice.h:3136 [inline] lapbeth_data_transmit+0x1c4/0x298 drivers/net/wan/lapbether.c:257 lapb_data_transmit+0x8c/0xb0 net/lapb/lapb_iface.c:447 lapb_transmit_buffer+0x178/0x204 net/lapb/lapb_out.c:149 lapb_send_control+0x220/0x320 net/lapb/lapb_subr.c:251 __lapb_disconnect_request+0x9c/0x17c net/lapb/lapb_iface.c:326 lapb_device_event+0x288/0x4e0 net/lapb/lapb_iface.c:492 notifier_call_chain+0x1a4/0x510 kernel/notifier.c:93 raw_notifier_call_chain+0x3c/0x50 kernel/notifier.c:461 call_netdevice_notifiers_info net/core/dev.c:1970 [inline] call_netdevice_notifiers_extack net/core/dev.c:2008 [inline] call_netdevice_notifiers net/core/dev.c:2022 [inline] __dev_close_many+0x1b8/0x3c4 net/core/dev.c:1508 dev_close_many+0x1e0/0x470 net/core/dev.c:1559 dev_close+0x174/0x250 net/core/dev.c:1585 lapbeth_device_event+0x2e4/0x958 drivers/net/wan/lapbether.c:466 notifier_call_chain+0x1a4/0x510 kernel/notifier.c:93 raw_notifier_call_chain+0x3c/0x50 kernel/notifier.c:461 call_netdevice_notifiers_info net/core/dev.c:1970 [inline] call_netdevice_notifiers_extack net/core/dev.c:2008 [inline] call_netdevice_notifiers net/core/dev.c:2022 [inline] __dev_close_many+0x1b8/0x3c4 net/core/dev.c:1508 dev_close_many+0x1e0/0x470 net/core/dev.c:1559 dev_close+0x174/0x250 net/core/dev.c:1585 bond_enslave+0x2298/0x30cc drivers/net/bonding/bond_main.c:2332 bond_do_ioctl+0x268/0xc64 drivers/net/bonding/bond_main.c:4539 dev_ifsioc+0x754/0x9ac dev_ioctl+0x4d8/0xd34 net/core/dev_ioctl.c:786 sock_do_ioctl+0x1d4/0x2d0 net/socket.c:1217 sock_ioctl+0x4e8/0x834 net/socket.c:1322 vfs_ioctl fs/ioctl.c:51 [inline] __do_sys_ioctl fs/ioctl.c:871 [inline] __se_sys_ioctl fs/ioctl.c:857 [inline] __arm64_sys_ioctl+0x14c/0x1c8 fs/ioctl.c:857 __invoke_syscall arch/arm64/kernel/syscall.c:37 [inline] invoke_syscall+0x98/0x2b8 arch/arm64/kernel/syscall.c:51 el0_svc_common+0x130/0x23c arch/arm64/kernel/syscall.c:136 do_el0_svc+0x48/0x58 arch/arm64/kernel/syscall.c:155 el0_svc+0x58/0x16c arch/arm64/kernel/entry-common.c:678 el0t_64_sync_handler+0x84/0xfc arch/arm64/kernel/entry-common.c:696 el0t_64_sync+0x190/0x194 arch/arm64/kernel/entry.S:591 Code: aa1803e6 aa1903e7 a90023f5 94785b8b (d4210000) Fixes: 872254d ("net/bonding: Enable bonding to enslave non ARPHRD_ETHER") Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Jay Vosburgh <jay.vosburgh@canonical.com> Reviewed-by: Hangbin Liu <liuhangbin@gmail.com> Link: https://lore.kernel.org/r/20231109180102.4085183-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org> [mheyne: dev_open doesn't support extack argument yet. So dropping the NULL argument from the backport] Signed-off-by: Maximilian Heyne <mheyne@amazon.de>
nathan-zcgao
pushed a commit
that referenced
this pull request
Sep 17, 2025
commit 3534e5a upstream. Fault inject on pool metadata device reports: BUG: KASAN: use-after-free in dm_pool_register_metadata_threshold+0x40/0x80 Read of size 8 at addr ffff8881b9d50068 by task dmsetup/950 CPU: 7 PID: 950 Comm: dmsetup Tainted: G W 5.19.0-rc6 #1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-1.fc33 04/01/2014 Call Trace: <TASK> dump_stack_lvl+0x34/0x44 print_address_description.constprop.0.cold+0xeb/0x3f4 kasan_report.cold+0xe6/0x147 dm_pool_register_metadata_threshold+0x40/0x80 pool_ctr+0xa0a/0x1150 dm_table_add_target+0x2c8/0x640 table_load+0x1fd/0x430 ctl_ioctl+0x2c4/0x5a0 dm_ctl_ioctl+0xa/0x10 __x64_sys_ioctl+0xb3/0xd0 do_syscall_64+0x35/0x80 entry_SYSCALL_64_after_hwframe+0x46/0xb0 This can be easily reproduced using: echo offline > /sys/block/sda/device/state dd if=/dev/zero of=/dev/mapper/thin bs=4k count=10 dmsetup load pool --table "0 20971520 thin-pool /dev/sda /dev/sdb 128 0 0" If a metadata commit fails, the transaction will be aborted and the metadata space maps will be destroyed. If a DM table reload then happens for this failed thin-pool, a use-after-free will occur in dm_sm_register_threshold_callback (called from dm_pool_register_metadata_threshold). Fix this by in dm_pool_register_metadata_threshold() by returning the -EINVAL error if the thin-pool is in fail mode. Also fail pool_ctr() with a new error message: "Error registering metadata threshold". Fixes: ac8c3f3 ("dm thin: generate event when metadata threshold passed") Cc: stable@vger.kernel.org Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: Luo Meng <luomeng12@huawei.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> [v4.14: resolved contextual conflicts from different lock functions in dm_pool_register_metadata_threshold] Signed-off-by: Shaoying Xu <shaoyi@amazon.com>
nathan-zcgao
pushed a commit
that referenced
this pull request
Sep 17, 2025
commit 183a087 upstream. It looks like GPUs are used after shutdown is invoked. Thus, breaking virtio gpu in the shutdown callback is not a good idea - guest hangs attempting to finish console drawing, with these warnings: [ 20.504464] WARNING: CPU: 0 PID: 568 at drivers/gpu/drm/virtio/virtgpu_vq.c:358 virtio_gpu_queue_ctrl_sgs+0x236/0x290 [virtio_gpu] [ 20.505685] Modules linked in: nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 rfkill ip_set nf_tables nfnetlink vfat fat intel_rapl_msr intel_rapl_common intel_uncore_frequency_common nfit libnvdimm kvm_intel kvm rapl iTCO_wdt iTCO_vendor_support virtio_gpu virtio_dma_buf pcspkr drm_shmem_helper i2c_i801 drm_kms_helper lpc_ich i2c_smbus virtio_balloon joydev drm fuse xfs libcrc32c ahci libahci crct10dif_pclmul crc32_pclmul crc32c_intel libata virtio_net ghash_clmulni_intel net_failover virtio_blk failover serio_raw dm_mirror dm_region_hash dm_log dm_mod [ 20.511847] CPU: 0 PID: 568 Comm: kworker/0:3 Kdump: loaded Tainted: G W ------- --- 5.14.0-578.6675_1757216455.el9.x86_64 #1 [ 20.513157] Hardware name: Red Hat KVM/RHEL, BIOS edk2-20241117-3.el9 11/17/2024 [ 20.513918] Workqueue: events drm_fb_helper_damage_work [drm_kms_helper] [ 20.514626] RIP: 0010:virtio_gpu_queue_ctrl_sgs+0x236/0x290 [virtio_gpu] [ 20.515332] Code: 00 00 48 85 c0 74 0c 48 8b 78 08 48 89 ee e8 51 50 00 00 65 ff 0d 42 e3 74 3f 0f 85 69 ff ff ff 0f 1f 44 00 00 e9 5f ff ff ff <0f> 0b e9 3f ff ff ff 48 83 3c 24 00 74 0e 49 8b 7f 40 48 85 ff 74 [ 20.517272] RSP: 0018:ff34f0a8c0787ad8 EFLAGS: 00010282 [ 20.517820] RAX: 00000000fffffffb RBX: 0000000000000000 RCX: 0000000000000820 [ 20.518565] RDX: 0000000000000000 RSI: ff34f0a8c0787be0 RDI: ff218bef03a26300 [ 20.519308] RBP: ff218bef03a26300 R08: 0000000000000001 R09: ff218bef07224360 [ 20.520059] R10: 0000000000008dc0 R11: 0000000000000002 R12: ff218bef02630028 [ 20.520806] R13: ff218bef0263fb48 R14: ff218bef00cb8000 R15: ff218bef07224360 [ 20.521555] FS: 0000000000000000(0000) GS:ff218bef7ba00000(0000) knlGS:0000000000000000 [ 20.522397] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 20.522996] CR2: 000055ac4f7871c0 CR3: 000000010b9f2002 CR4: 0000000000771ef0 [ 20.523740] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 20.524477] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400 [ 20.525223] PKRU: 55555554 [ 20.525515] Call Trace: [ 20.525777] <TASK> [ 20.526003] ? show_trace_log_lvl+0x1c4/0x2df [ 20.526464] ? show_trace_log_lvl+0x1c4/0x2df [ 20.526925] ? virtio_gpu_queue_fenced_ctrl_buffer+0x82/0x2c0 [virtio_gpu] [ 20.527643] ? virtio_gpu_queue_ctrl_sgs+0x236/0x290 [virtio_gpu] [ 20.528282] ? __warn+0x7e/0xd0 [ 20.528621] ? virtio_gpu_queue_ctrl_sgs+0x236/0x290 [virtio_gpu] [ 20.529256] ? report_bug+0x100/0x140 [ 20.529643] ? handle_bug+0x3c/0x70 [ 20.530010] ? exc_invalid_op+0x14/0x70 [ 20.530421] ? asm_exc_invalid_op+0x16/0x20 [ 20.530862] ? virtio_gpu_queue_ctrl_sgs+0x236/0x290 [virtio_gpu] [ 20.531506] ? virtio_gpu_queue_ctrl_sgs+0x174/0x290 [virtio_gpu] [ 20.532148] virtio_gpu_queue_fenced_ctrl_buffer+0x82/0x2c0 [virtio_gpu] [ 20.532843] virtio_gpu_primary_plane_update+0x3e2/0x460 [virtio_gpu] [ 20.533520] drm_atomic_helper_commit_planes+0x108/0x320 [drm_kms_helper] [ 20.534233] drm_atomic_helper_commit_tail+0x45/0x80 [drm_kms_helper] [ 20.534914] commit_tail+0xd2/0x130 [drm_kms_helper] [ 20.535446] drm_atomic_helper_commit+0x11b/0x140 [drm_kms_helper] [ 20.536097] drm_atomic_commit+0xa4/0xe0 [drm] [ 20.536588] ? __pfx___drm_printfn_info+0x10/0x10 [drm] [ 20.537162] drm_atomic_helper_dirtyfb+0x192/0x270 [drm_kms_helper] [ 20.537823] drm_fbdev_shmem_helper_fb_dirty+0x43/0xa0 [drm_shmem_helper] [ 20.538536] drm_fb_helper_damage_work+0x87/0x160 [drm_kms_helper] [ 20.539188] process_one_work+0x194/0x380 [ 20.539612] worker_thread+0x2fe/0x410 [ 20.540007] ? __pfx_worker_thread+0x10/0x10 [ 20.540456] kthread+0xdd/0x100 [ 20.540791] ? __pfx_kthread+0x10/0x10 [ 20.541190] ret_from_fork+0x29/0x50 [ 20.541566] </TASK> [ 20.541802] ---[ end trace 0000000000000000 ]--- It looks like the shutdown is called in the middle of console drawing, so we should either wait for it to finish, or let drm handle the shutdown. This patch implements this second option: Add an option for drivers to bypass the common break+reset handling. As DRM is careful to flush/synchronize outstanding buffers, it looks like GPU can just have a NOP there. Reviewed-by: Eric Auger <eric.auger@redhat.com> Tested-by: Eric Auger <eric.auger@redhat.com> Fixes: 8bd2fa0 ("virtio: break and reset virtio devices on device_shutdown()") Cc: Eric Auger <eauger@redhat.com> Cc: Jocelyn Falempe <jfalempe@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Message-Id: <8490dbeb6f79ed039e6c11d121002618972538a3.1744293540.git.mst@redhat.com> [Resolved contextual conflicts.] Signed-off-by: Aaron Ahmed <aarnahmd@amazon.com>
nathan-zcgao
pushed a commit
that referenced
this pull request
Sep 17, 2025
[ Upstream commit 22664c0 ] Previously when destroying a QP/RQ, the result of the firmware destruction function was ignored and upper layers weren't informed about the failure. Which in turn could lead to various problems since when upper layer isn't aware of the failure it continues its operation thinking that the related QP/RQ was successfully destroyed while it actually wasn't, which could lead to the below kernel WARN. Currently, we return the correct firmware destruction status to upper layers which in case of the RQ would be mlx5_ib_destroy_wq() which was already capable of handling RQ destruction failure or in case of a QP to destroy_qp_common(), which now would actually warn upon qp destruction failure. WARNING: CPU: 3 PID: 995 at drivers/infiniband/core/rdma_core.c:940 uverbs_destroy_ufile_hw+0xcb/0xe0 [ib_uverbs] Modules linked in: xt_conntrack xt_MASQUERADE nf_conntrack_netlink nfnetlink xt_addrtype iptable_nat nf_nat br_netfilter rpcrdma rdma_ucm ib_iser libiscsi scsi_transport_iscsi rdma_cm ib_umad ib_ipoib iw_cm ib_cm mlx5_ib ib_uverbs ib_core overlay mlx5_core fuse CPU: 3 PID: 995 Comm: python3 Not tainted 5.16.0-rc5+ #1 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014 RIP: 0010:uverbs_destroy_ufile_hw+0xcb/0xe0 [ib_uverbs] Code: 41 5c 41 5d 41 5e e9 44 34 f0 e0 48 89 df e8 4c 77 ff ff 49 8b 86 10 01 00 00 48 85 c0 74 a1 4c 89 e7 ff d0 eb 9a 0f 0b eb c1 <0f> 0b be 04 00 00 00 48 89 df e8 b6 f6 ff ff e9 75 ff ff ff 90 0f RSP: 0018:ffff8881533e3e78 EFLAGS: 00010287 RAX: ffff88811b2cf3e0 RBX: ffff888106209700 RCX: 0000000000000000 RDX: ffff888106209780 RSI: ffff8881533e3d30 RDI: ffff888109b101a0 RBP: 0000000000000001 R08: ffff888127cb381c R09: 0de9890000000009 R10: ffff888127cb3800 R11: 0000000000000000 R12: ffff888106209780 R13: ffff888106209750 R14: ffff888100f20660 R15: 0000000000000000 FS: 00007f8be353b740(0000) GS:ffff88852c980000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f8bd5b117c0 CR3: 000000012cd8a004 CR4: 0000000000370ea0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <TASK> ib_uverbs_close+0x1a/0x90 [ib_uverbs] __fput+0x82/0x230 task_work_run+0x59/0x90 exit_to_user_mode_prepare+0x138/0x140 syscall_exit_to_user_mode+0x1d/0x50 ? __x64_sys_close+0xe/0x40 do_syscall_64+0x4a/0x90 entry_SYSCALL_64_after_hwframe+0x44/0xae RIP: 0033:0x7f8be3ae0abb Code: 03 00 00 00 0f 05 48 3d 00 f0 ff ff 77 41 c3 48 83 ec 18 89 7c 24 0c e8 83 43 f9 ff 8b 7c 24 0c 41 89 c0 b8 03 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 35 44 89 c7 89 44 24 0c e8 c1 43 f9 ff 8b 44 RSP: 002b:00007ffdb51909c0 EFLAGS: 00000293 ORIG_RAX: 0000000000000003 RAX: 0000000000000000 RBX: 0000557bb7f7c020 RCX: 00007f8be3ae0abb RDX: 0000557bb7c74010 RSI: 0000557bb7f14ca0 RDI: 0000000000000005 RBP: 0000557bb7fbd598 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000293 R12: 0000557bb7fbd5b8 R13: 0000557bb7fbd5a8 R14: 0000000000001000 R15: 0000557bb7f7c020 </TASK> Signed-off-by: Patrisious Haddad <phaddad@nvidia.com> Link: https://lore.kernel.org/r/c6df677f931d18090bafbe7f7dbb9524047b7d9b.1685953497.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org> Stable-dep-of: 5d2ea5a ("RDMA/mlx5: Fix error flow upon firmware failure for RQ destruction") [v4.14: partially picked up changes of returning firmware result in mlx5_core_destroy_rq() which is called in mlx5_core_destroy_rq_tracked()] Signed-off-by: Shaoying Xu <shaoyi@amazon.com>
nathan-zcgao
pushed a commit
that referenced
this pull request
Sep 17, 2025
[ Upstream commit 5d2ea5a ] Upon RQ destruction if the firmware command fails which is the last resource to be destroyed some SW resources were already cleaned regardless of the failure. Now properly rollback the object to its original state upon such failure. In order to avoid a use-after free in case someone tries to destroy the object again, which results in the following kernel trace: refcount_t: underflow; use-after-free. WARNING: CPU: 0 PID: 37589 at lib/refcount.c:28 refcount_warn_saturate+0xf4/0x148 Modules linked in: rdma_ucm(OE) rdma_cm(OE) iw_cm(OE) ib_ipoib(OE) ib_cm(OE) ib_umad(OE) mlx5_ib(OE) rfkill mlx5_core(OE) mlxdevm(OE) ib_uverbs(OE) ib_core(OE) psample mlxfw(OE) mlx_compat(OE) macsec tls pci_hyperv_intf sunrpc vfat fat virtio_net net_failover failover fuse loop nfnetlink vsock_loopback vmw_vsock_virtio_transport_common vmw_vsock_vmci_transport vmw_vmci vsock xfs crct10dif_ce ghash_ce sha2_ce sha256_arm64 sha1_ce virtio_console virtio_gpu virtio_blk virtio_dma_buf virtio_mmio dm_mirror dm_region_hash dm_log dm_mod xpmem(OE) CPU: 0 UID: 0 PID: 37589 Comm: python3 Kdump: loaded Tainted: G OE ------- --- 6.12.0-54.el10.aarch64 #1 Tainted: [O]=OOT_MODULE, [E]=UNSIGNED_MODULE Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 02/06/2015 pstate: 60400005 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--) pc : refcount_warn_saturate+0xf4/0x148 lr : refcount_warn_saturate+0xf4/0x148 sp : ffff80008b81b7e0 x29: ffff80008b81b7e0 x28: ffff000133d51600 x27: 0000000000000001 x26: 0000000000000000 x25: 00000000ffffffea x24: ffff00010ae80f00 x23: ffff00010ae80f80 x22: ffff0000c66e5d08 x21: 0000000000000000 x20: ffff0000c66e0000 x19: ffff00010ae80340 x18: 0000000000000006 x17: 0000000000000000 x16: 0000000000000020 x15: ffff80008b81b37f x14: 0000000000000000 x13: 2e656572662d7265 x12: ffff80008283ef78 x11: ffff80008257efd0 x10: ffff80008283efd0 x9 : ffff80008021ed90 x8 : 0000000000000001 x7 : 00000000000bffe8 x6 : c0000000ffff7fff x5 : ffff0001fb8e3408 x4 : 0000000000000000 x3 : ffff800179993000 x2 : 0000000000000000 x1 : 0000000000000000 x0 : ffff000133d51600 Call trace: refcount_warn_saturate+0xf4/0x148 mlx5_core_put_rsc+0x88/0xa0 [mlx5_ib] mlx5_core_destroy_rq_tracked+0x64/0x98 [mlx5_ib] mlx5_ib_destroy_wq+0x34/0x80 [mlx5_ib] ib_destroy_wq_user+0x30/0xc0 [ib_core] uverbs_free_wq+0x28/0x58 [ib_uverbs] destroy_hw_idr_uobject+0x34/0x78 [ib_uverbs] uverbs_destroy_uobject+0x48/0x240 [ib_uverbs] __uverbs_cleanup_ufile+0xd4/0x1a8 [ib_uverbs] uverbs_destroy_ufile_hw+0x48/0x120 [ib_uverbs] ib_uverbs_close+0x2c/0x100 [ib_uverbs] __fput+0xd8/0x2f0 __fput_sync+0x50/0x70 __arm64_sys_close+0x40/0x90 invoke_syscall.constprop.0+0x74/0xd0 do_el0_svc+0x48/0xe8 el0_svc+0x44/0x1d0 el0t_64_sync_handler+0x120/0x130 el0t_64_sync+0x1a4/0x1a8 Fixes: e2013b2 ("net/mlx5_core: Add RQ and SQ event handling") Signed-off-by: Patrisious Haddad <phaddad@nvidia.com> Link: https://patch.msgid.link/3181433ccdd695c63560eeeb3f0c990961732101.1745839855.git.leon@kernel.org Signed-off-by: Leon Romanovsky <leon@kernel.org> [v4.14: replaced struct mlx5_ib_dev by mlx5_core_dev; accessed qp_table by &dev->priv in modify_resource_common_state(); deleted return lines for void mlx5_core_destroy_rq_tracked()] Signed-off-by: Shaoying Xu <shaoyi@amazon.com>
nathan-zcgao
pushed a commit
that referenced
this pull request
Sep 17, 2025
…deomode_to_var [ upstream 17186f1 ] If fb_add_videomode() in do_register_framebuffer() fails to allocate memory for fb_videomode, it will later lead to a null-ptr dereference in fb_videomode_to_var(), as the fb_info is registered while not having the mode in modelist that is expected to be there, i.e. the one that is described in fb_info->var. ================================================================ general protection fault, probably for non-canonical address 0xdffffc0000000001: 0000 [#1] PREEMPT SMP KASAN NOPTI KASAN: null-ptr-deref in range [0x0000000000000008-0x000000000000000f] CPU: 1 PID: 30371 Comm: syz-executor.1 Not tainted 5.10.226-syzkaller #0 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014 RIP: 0010:fb_videomode_to_var+0x24/0x610 drivers/video/fbdev/core/modedb.c:901 Call Trace: display_to_var+0x3a/0x7c0 drivers/video/fbdev/core/fbcon.c:929 fbcon_resize+0x3e2/0x8f0 drivers/video/fbdev/core/fbcon.c:2071 resize_screen drivers/tty/vt/vt.c:1176 [inline] vc_do_resize+0x53a/0x1170 drivers/tty/vt/vt.c:1263 fbcon_modechanged+0x3ac/0x6e0 drivers/video/fbdev/core/fbcon.c:2720 fbcon_update_vcs+0x43/0x60 drivers/video/fbdev/core/fbcon.c:2776 do_fb_ioctl+0x6d2/0x740 drivers/video/fbdev/core/fbmem.c:1128 fb_ioctl+0xe7/0x150 drivers/video/fbdev/core/fbmem.c:1203 vfs_ioctl fs/ioctl.c:48 [inline] __do_sys_ioctl fs/ioctl.c:753 [inline] __se_sys_ioctl fs/ioctl.c:739 [inline] __x64_sys_ioctl+0x19a/0x210 fs/ioctl.c:739 do_syscall_64+0x33/0x40 arch/x86/entry/common.c:46 entry_SYSCALL_64_after_hwframe+0x67/0xd1 ================================================================ Even though fbcon_init() checks beforehand if fb_match_mode() in var_to_display() fails, it can not prevent the panic because fbcon_init() does not return error code. Considering this and the comment in the code about fb_match_mode() returning NULL - "This should not happen" - it is better to prevent registering the fb_info if its mode was not set successfully. Also move fb_add_videomode() closer to the beginning of do_register_framebuffer() to avoid having to do the cleanup on fail. Found by Linux Verification Center (linuxtesting.org) with Syzkaller. Fixes: 1da177e ("Linux-2.6.12-rc2") Cc: stable@vger.kernel.org Signed-off-by: Murad Masimov <m.masimov@mt-integration.ru> Signed-off-by: Helge Deller <deller@gmx.de> [ not through cherry-pick, pick up the only necessary CVE fix code ] Signed-off-by: Yifei Ma <yifeima@amazon.com>
nathan-zcgao
pushed a commit
that referenced
this pull request
Sep 17, 2025
[ Upstream commit 10876da ] syzkaller reported a null-ptr-deref in sock_omalloc() while allocating a CALIPSO option. [0] The NULL is of struct sock, which was fetched by sk_to_full_sk() in calipso_req_setattr(). Since commit a1a5344 ("tcp: avoid two atomic ops for syncookies"), reqsk->rsk_listener could be NULL when SYN Cookie is returned to its client, as hinted by the leading SYN Cookie log. Here are 3 options to fix the bug: 1) Return 0 in calipso_req_setattr() 2) Return an error in calipso_req_setattr() 3) Alaways set rsk_listener 1) is no go as it bypasses LSM, but 2) effectively disables SYN Cookie for CALIPSO. 3) is also no go as there have been many efforts to reduce atomic ops and make TCP robust against DDoS. See also commit 3b24d85 ("tcp/dccp: do not touch listener sk_refcnt under synflood"). As of the blamed commit, SYN Cookie already did not need refcounting, and no one has stumbled on the bug for 9 years, so no CALIPSO user will care about SYN Cookie. Let's return an error in calipso_req_setattr() and calipso_req_delattr() in the SYN Cookie case. This can be reproduced by [1] on Fedora and now connect() of nc times out. [0]: TCP: request_sock_TCPv6: Possible SYN flooding on port [::]:20002. Sending cookies. Oops: general protection fault, probably for non-canonical address 0xdffffc0000000006: 0000 [#1] PREEMPT SMP KASAN NOPTI KASAN: null-ptr-deref in range [0x0000000000000030-0x0000000000000037] CPU: 3 UID: 0 PID: 12262 Comm: syz.1.2611 Not tainted 6.14.0 #2 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014 RIP: 0010:read_pnet include/net/net_namespace.h:406 [inline] RIP: 0010:sock_net include/net/sock.h:655 [inline] RIP: 0010:sock_kmalloc+0x35/0x170 net/core/sock.c:2806 Code: 89 d5 41 54 55 89 f5 53 48 89 fb e8 25 e3 c6 fd e8 f0 91 e3 00 48 8d 7b 30 48 b8 00 00 00 00 00 fc ff df 48 89 fa 48 c1 ea 03 <80> 3c 02 00 0f 85 26 01 00 00 48 b8 00 00 00 00 00 fc ff df 4c 8b RSP: 0018:ffff88811af89038 EFLAGS: 00010216 RAX: dffffc0000000000 RBX: 0000000000000000 RCX: ffff888105266400 RDX: 0000000000000006 RSI: ffff88800c890000 RDI: 0000000000000030 RBP: 0000000000000050 R08: 0000000000000000 R09: ffff88810526640e R10: ffffed1020a4cc81 R11: ffff88810526640f R12: 0000000000000000 R13: 0000000000000820 R14: ffff888105266400 R15: 0000000000000050 FS: 00007f0653a07640(0000) GS:ffff88811af80000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f863ba096f4 CR3: 00000000163c0005 CR4: 0000000000770ef0 PKRU: 80000000 Call Trace: <IRQ> ipv6_renew_options+0x279/0x950 net/ipv6/exthdrs.c:1288 calipso_req_setattr+0x181/0x340 net/ipv6/calipso.c:1204 calipso_req_setattr+0x56/0x80 net/netlabel/netlabel_calipso.c:597 netlbl_req_setattr+0x18a/0x440 net/netlabel/netlabel_kapi.c:1249 selinux_netlbl_inet_conn_request+0x1fb/0x320 security/selinux/netlabel.c:342 selinux_inet_conn_request+0x1eb/0x2c0 security/selinux/hooks.c:5551 security_inet_conn_request+0x50/0xa0 security/security.c:4945 tcp_v6_route_req+0x22c/0x550 net/ipv6/tcp_ipv6.c:825 tcp_conn_request+0xec8/0x2b70 net/ipv4/tcp_input.c:7275 tcp_v6_conn_request+0x1e3/0x440 net/ipv6/tcp_ipv6.c:1328 tcp_rcv_state_process+0xafa/0x52b0 net/ipv4/tcp_input.c:6781 tcp_v6_do_rcv+0x8a6/0x1a40 net/ipv6/tcp_ipv6.c:1667 tcp_v6_rcv+0x505e/0x5b50 net/ipv6/tcp_ipv6.c:1904 ip6_protocol_deliver_rcu+0x17c/0x1da0 net/ipv6/ip6_input.c:436 ip6_input_finish+0x103/0x180 net/ipv6/ip6_input.c:480 NF_HOOK include/linux/netfilter.h:314 [inline] NF_HOOK include/linux/netfilter.h:308 [inline] ip6_input+0x13c/0x6b0 net/ipv6/ip6_input.c:491 dst_input include/net/dst.h:469 [inline] ip6_rcv_finish net/ipv6/ip6_input.c:79 [inline] ip6_rcv_finish+0xb6/0x490 net/ipv6/ip6_input.c:69 NF_HOOK include/linux/netfilter.h:314 [inline] NF_HOOK include/linux/netfilter.h:308 [inline] ipv6_rcv+0xf9/0x490 net/ipv6/ip6_input.c:309 __netif_receive_skb_one_core+0x12e/0x1f0 net/core/dev.c:5896 __netif_receive_skb+0x1d/0x170 net/core/dev.c:6009 process_backlog+0x41e/0x13b0 net/core/dev.c:6357 __napi_poll+0xbd/0x710 net/core/dev.c:7191 napi_poll net/core/dev.c:7260 [inline] net_rx_action+0x9de/0xde0 net/core/dev.c:7382 handle_softirqs+0x19a/0x770 kernel/softirq.c:561 do_softirq.part.0+0x36/0x70 kernel/softirq.c:462 </IRQ> <TASK> do_softirq arch/x86/include/asm/preempt.h:26 [inline] __local_bh_enable_ip+0xf1/0x110 kernel/softirq.c:389 local_bh_enable include/linux/bottom_half.h:33 [inline] rcu_read_unlock_bh include/linux/rcupdate.h:919 [inline] __dev_queue_xmit+0xc2a/0x3c40 net/core/dev.c:4679 dev_queue_xmit include/linux/netdevice.h:3313 [inline] neigh_hh_output include/net/neighbour.h:523 [inline] neigh_output include/net/neighbour.h:537 [inline] ip6_finish_output2+0xd69/0x1f80 net/ipv6/ip6_output.c:141 __ip6_finish_output net/ipv6/ip6_output.c:215 [inline] ip6_finish_output+0x5dc/0xd60 net/ipv6/ip6_output.c:226 NF_HOOK_COND include/linux/netfilter.h:303 [inline] ip6_output+0x24b/0x8d0 net/ipv6/ip6_output.c:247 dst_output include/net/dst.h:459 [inline] NF_HOOK include/linux/netfilter.h:314 [inline] NF_HOOK include/linux/netfilter.h:308 [inline] ip6_xmit+0xbbc/0x20d0 net/ipv6/ip6_output.c:366 inet6_csk_xmit+0x39a/0x720 net/ipv6/inet6_connection_sock.c:135 __tcp_transmit_skb+0x1a7b/0x3b40 net/ipv4/tcp_output.c:1471 tcp_transmit_skb net/ipv4/tcp_output.c:1489 [inline] tcp_send_syn_data net/ipv4/tcp_output.c:4059 [inline] tcp_connect+0x1c0c/0x4510 net/ipv4/tcp_output.c:4148 tcp_v6_connect+0x156c/0x2080 net/ipv6/tcp_ipv6.c:333 __inet_stream_connect+0x3a7/0xed0 net/ipv4/af_inet.c:677 tcp_sendmsg_fastopen+0x3e2/0x710 net/ipv4/tcp.c:1039 tcp_sendmsg_locked+0x1e82/0x3570 net/ipv4/tcp.c:1091 tcp_sendmsg+0x2f/0x50 net/ipv4/tcp.c:1358 inet6_sendmsg+0xb9/0x150 net/ipv6/af_inet6.c:659 sock_sendmsg_nosec net/socket.c:718 [inline] __sock_sendmsg+0xf4/0x2a0 net/socket.c:733 __sys_sendto+0x29a/0x390 net/socket.c:2187 __do_sys_sendto net/socket.c:2194 [inline] __se_sys_sendto net/socket.c:2190 [inline] __x64_sys_sendto+0xe1/0x1c0 net/socket.c:2190 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0xc3/0x1d0 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x77/0x7f RIP: 0033:0x7f06553c47ed Code: 02 b8 ff ff ff ff c3 66 0f 1f 44 00 00 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 a8 ff ff ff f7 d8 64 89 01 48 RSP: 002b:00007f0653a06fc8 EFLAGS: 00000246 ORIG_RAX: 000000000000002c RAX: ffffffffffffffda RBX: 00007f0655605fa0 RCX: 00007f06553c47ed RDX: 0000000000000000 RSI: 0000000000000000 RDI: 000000000000000b RBP: 00007f065545db38 R08: 0000200000000140 R09: 000000000000001c R10: f7384d4ea84b01bd R11: 0000000000000246 R12: 0000000000000000 R13: 00007f0655605fac R14: 00007f0655606038 R15: 00007f06539e7000 </TASK> Modules linked in: [1]: dnf install -y selinux-policy-targeted policycoreutils netlabel_tools procps-ng nmap-ncat mount -t selinuxfs none /sys/fs/selinux load_policy netlabelctl calipso add pass doi:1 netlabelctl map del default netlabelctl map add default address:::1 protocol:calipso,1 sysctl net.ipv4.tcp_syncookies=2 nc -l ::1 80 & nc ::1 80 Fixes: e1adea9 ("calipso: Allow request sockets to be relabelled by the lsm.") Reported-by: syzkaller <syzkaller@googlegroups.com> Reported-by: John Cheung <john.cs.hey@gmail.com> Closes: https://lore.kernel.org/netdev/CAP=Rh=MvfhrGADy+-WJiftV2_WzMH4VEhEFmeT28qY+4yxNu4w@mail.gmail.com/ Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Acked-by: Paul Moore <paul@paul-moore.com> Link: https://patch.msgid.link/20250617224125.17299-1-kuni1840@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Mark Bundschuh <mkbund@amazon.com> Signed-off-by: Nathan Gao <zcgao@amazon.com>
nathan-zcgao
pushed a commit
that referenced
this pull request
Sep 17, 2025
[ Upstream commit c457dc1 ] When memory is insufficient, the allocation of nfs_lock_context in nfs_get_lock_context() fails and returns -ENOMEM. If we mistakenly treat an nfs4_unlockdata structure (whose l_ctx member has been set to -ENOMEM) as valid and proceed to execute rpc_run_task(), this will trigger a NULL pointer dereference in nfs4_locku_prepare. For example: BUG: kernel NULL pointer dereference, address: 000000000000000c PGD 0 P4D 0 Oops: Oops: 0000 [#1] SMP PTI CPU: 15 UID: 0 PID: 12 Comm: kworker/u64:0 Not tainted 6.15.0-rc2-dirty #60 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-2.fc40 Workqueue: rpciod rpc_async_schedule RIP: 0010:nfs4_locku_prepare+0x35/0xc2 Code: 89 f2 48 89 fd 48 c7 c7 68 69 ef b5 53 48 8b 8e 90 00 00 00 48 89 f3 RSP: 0018:ffffbbafc006bdb8 EFLAGS: 00010246 RAX: 000000000000004b RBX: ffff9b964fc1fa00 RCX: 0000000000000000 RDX: 0000000000000000 RSI: fffffffffffffff4 RDI: ffff9ba53fddbf40 RBP: ffff9ba539934000 R08: 0000000000000000 R09: ffffbbafc006bc38 R10: ffffffffb6b689c8 R11: 0000000000000003 R12: ffff9ba539934030 R13: 0000000000000001 R14: 0000000004248060 R15: ffffffffb56d1c30 FS: 0000000000000000(0000) GS:ffff9ba5881f0000(0000) knlGS:00000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 000000000000000c CR3: 000000093f244000 CR4: 00000000000006f0 Call Trace: <TASK> __rpc_execute+0xbc/0x480 rpc_async_schedule+0x2f/0x40 process_one_work+0x232/0x5d0 worker_thread+0x1da/0x3d0 ? __pfx_worker_thread+0x10/0x10 kthread+0x10d/0x240 ? __pfx_kthread+0x10/0x10 ret_from_fork+0x34/0x50 ? __pfx_kthread+0x10/0x10 ret_from_fork_asm+0x1a/0x30 </TASK> Modules linked in: CR2: 000000000000000c ---[ end trace 0000000000000000 ]--- Free the allocated nfs4_unlockdata when nfs_get_lock_context() fails and return NULL to terminate subsequent rpc_run_task, preventing NULL pointer dereference. Fixes: f30cb75 ("NFS: Always wait for I/O completion before unlock") Signed-off-by: Li Lingfeng <lilingfeng3@huawei.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Link: https://lore.kernel.org/r/20250417072508.3850532-1-lilingfeng3@huawei.com Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Sasha Levin <sashal@kernel.org> [Resolve contextual conflicts in nfs4_alloc_unlockdata function entry] Signed-off-by: Jay Wang <wanjay@amazon.com>
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
This PR attempts to address some perf issues due to SVE state trap in the kernel when doing a context switch. This patch is back port of patch at https://lore.kernel.org/linux-arm-kernel/20220620124158.482039-6-broonie@kernel.org/T/.
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.