forked from gregkh/linux
-
Notifications
You must be signed in to change notification settings - Fork 24
Amazon 5.10.y/btrfs soft lockup #2
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Closed
vewe-richard
wants to merge
737
commits into
amazonlinux:amazon-5.10.y/master
from
vewe-richard:amazon-5.10.y/btrfs-soft-lockup
Closed
Amazon 5.10.y/btrfs soft lockup #2
vewe-richard
wants to merge
737
commits into
amazonlinux:amazon-5.10.y/master
from
vewe-richard:amazon-5.10.y/btrfs-soft-lockup
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
The huge_ptep_set_access_flags() can not make the huge pte old according to the discussion [1], that means we will always mornitor the young state of the hugetlb though we stopped accessing the hugetlb, as a result DAMON will get inaccurate accessing statistics. So changing to use set_huge_pte_at() to make the huge pte old to fix this issue. [1] https://lore.kernel.org/all/Yqy97gXI4Nqb7dYo@arm.com/ Link: https://lkml.kernel.org/r/1655692482-28797-1-git-send-email-baolin.wang@linux.alibaba.com Fixes: 49f4203 ("mm/damon: add access checking for hugetlb pages") Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: SeongJae Park <sj@kernel.org> Acked-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
damon_reclaim_init() allocates a memory chunk for ctx with damon_new_ctx(). When damon_select_ops() fails, ctx is not released, which will lead to a memory leak. We should release the ctx with damon_destroy_ctx() when damon_select_ops() fails to fix the memory leak. Link: https://lkml.kernel.org/r/20220714063746.2343549-1-niejianglei2021@163.com Fixes: 4d69c34 ("mm/damon/reclaim: use damon_select_ops() instead of damon_{v,p}a_set_operations()") Signed-off-by: Jianglei Nie <niejianglei2021@163.com> Reviewed-by: SeongJae Park <sj@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
When user tries to create a DAMON context via the DAMON debugfs interface
with a name of an already existing context, the context directory creation
fails but a new context is created and added in the internal data
structure, due to absence of the directory creation success check. As a
result, memory could leak and DAMON cannot be turned on. An example test
case is as below:
# cd /sys/kernel/debug/damon/
# echo "off" > monitor_on
# echo paddr > target_ids
# echo "abc" > mk_context
# echo "abc" > mk_context
# echo $$ > abc/target_ids
# echo "on" > monitor_on <<< fails
Return value of 'debugfs_create_dir()' is expected to be ignored in
general, but this is an exceptional case as DAMON feature is depending
on the debugfs functionality and it has the potential duplicate name
issue. This commit therefore fixes the issue by checking the directory
creation failure and immediately return the error in the case.
Link: https://lkml.kernel.org/r/20220821180853.2400-1-sj@kernel.org
Fixes: 75c1c2b ("mm/damon/dbgfs: support multiple contexts")
Signed-off-by: Badari Pulavarty <badari.pulavarty@intel.com>
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: <stable@vger.kernel.org> [ 5.15.x]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
This reverts commit 509c2c9. Commit "1a072f13b2dc Mitigate unbalanced RETs on vmexit via serialising wrmsr" addresses this with less performance impact. [ Hailmo: Resolved conflicts when rebasing onto 5.10.190 and adding SRSO and GDS support ] Signed-off-by: Suraj Jitindar Singh <surajjs@amazon.com>
…-rwx-segments" This reverts commit 8f4f2c9. This causes arm64 debug builds to fail with: *** ERROR: No build ID note found in /builddir/build/BUILDROOT/kernel-5.15.63-32.131.amzn2.aarch64/usr/lib/debug/lib/modules/5.15.63-32.131.amzn2.aarch64/vmlinux This is due to the notes section which contains the build id being missing from the linux elf. Revert this commit until this can be remedied. Signed-off-by: Suraj Jitindar Singh <surajjs@amazon.com>
Source: https://github.com/amzn/amzn-drivers/ Change Log: ## r2.8.0 release notes **Notes** * The driver is now dependent on the ptp module for loading See README for more details. **New Features** * Add support for PTP HW clock * Add support for SRD metrics Feature's enablement and documentation would be in future release **Bug Fixes** * Fix potential sign extension issue * Reduce memory footprint of some structs * Fix updating rx_copybreak issue * Fix xdp drops handling due to multibuf packets * Handle ena_calc_io_queue_size() possible errors * Destroy correct amount of xdp queues upon failure **Minor Changes** * Remove wide LLQ comment on supported versions * Backport uapi/bpf.h inclusion * Add a counter for driver's reset failures * Take xdp packets stats into account in ena_get_stats64() * Make queue stats code cleaner by removing if block * Remove redundant empty line * Remove confusing comment * Remove flag reading code duplication * Replace ENA local ENA_NAPI_BUDGET to global NAPI_POLL_WEIGHT * Change default print level for netif_ prints * Relocate skb_tx_timestamp() to improve time stamping accuracy * Backport bpf_warn_invalid_xdp_action() change * Fix incorrect indentation using spaces * Driver now compiles with Linux kernel 5.19 Signed-off-by: Suraj Jitindar Singh <surajjs@amazon.com>
Signed-off-by: Shaoying Xu <shaoyi@amazon.com>
When calling debugfs_lookup() the result must have dput() called on it, otherwise the memory will leak over time. Fix this up by properly calling dput(). Link: https://lkml.kernel.org/r/20220902191149.112434-1-sj@kernel.org Fixes: 75c1c2b ("mm/damon/dbgfs: support multiple contexts") Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: SeongJae Park <sj@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
When damon_sysfs_add_target couldn't find proper task, New allocated damon_target structure isn't registered yet, So, it's impossible to free new allocated one by damon_sysfs_destroy_targets. By calling damon_add_target as soon as allocating new target, Fix this possible memory leak. Link: https://lkml.kernel.org/r/20220926160611.48536-1-sj@kernel.org Fixes: a61ea56 ("mm/damon/sysfs: link DAMON for virtual address spaces monitoring") Signed-off-by: Levi Yun <ppbuk5246@gmail.com> Signed-off-by: SeongJae Park <sj@kernel.org> Reviewed-by: SeongJae Park <sj@kernel.org> Cc: <stable@vger.kernel.org> [5.17.x] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: SeongJae Park <sjpark@amazon.com>
commit 9e7a4d9 upstream. Usage of spin locks was not allowed for tracing programs due to insufficient preemption checks. The verifier does not currently prevent LSM programs from using spin locks, but the helpers are not exposed via bpf_lsm_func_proto. Based on the discussion in [1], non-sleepable LSM programs should be able to use bpf_spin_{lock, unlock}. Sleepable LSM programs can be preempted which means that allowng spin locks will need more work (disabling preemption and the verifier ensuring that no sleepable helpers are called when a spin lock is held). [1]: https://lore.kernel.org/bpf/20201103153132.2717326-1-kpsingh@chromium.org/T/#md601a053229287659071600d3483523f752cd2fb Signed-off-by: KP Singh <kpsingh@google.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Song Liu <songliubraving@fb.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/bpf/20201106103747.2780972-2-kpsingh@chromium.org Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
commit 4cf1bc1 upstream. Similar to bpf_local_storage for sockets and inodes add local storage for task_struct. The life-cycle of storage is managed with the life-cycle of the task_struct. i.e. the storage is destroyed along with the owning task with a callback to the bpf_task_storage_free from the task_free LSM hook. The BPF LSM allocates an __rcu pointer to the bpf_local_storage in the security blob which are now stackable and can co-exist with other LSMs. The userspace map operations can be done by using a pid fd as a key passed to the lookup, update and delete operations. Signed-off-by: KP Singh <kpsingh@google.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Song Liu <songliubraving@fb.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/bpf/20201106103747.2780972-3-kpsingh@chromium.org Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Instead of putting io_uring's registered files in unix_gc() we want it to be done by io_uring itself. The trick here is to consider io_uring registered files for cycle detection but not actually putting them down. Because io_uring can't register other ring instances, this will remove all refs to the ring file triggering the ->release path and clean up with io_ring_ctx_free(). Cc: stable@vger.kernel.org Fixes: 6b06314 ("io_uring: add file set registration") Reported-and-tested-by: David Bouman <dbouman03@gmail.com> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@canonical.com> [axboe: add kerneldoc comment to skb, fold in skb leak fix] Signed-off-by: Jens Axboe <axboe@kernel.dk>
Since commit 0f91d13 ("mm/damon: simplify stop mechanism") delete kdamond_stop and change to use kthread stop mechanism, these obsolete comments should be removed accordingly. Link: https://lkml.kernel.org/r/20220531020421.46849-1-zhouchengming@bytedance.com Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com> Reviewed-by: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The kernel is in lockdown mode when secureboot is enabled and hence debugfs cannot be used. Add support for this and other general cases where debugfs cannot be read and communicate the same to the user before running tests. Signed-off-by: Gautam <gautammenghani201@gmail.com> Reviewed-by: SeongJae Park <sj@kernel.org> Signed-off-by: Shuah Khan <skhan@linuxfoundation.org>
… due to online tuning support Patch series "mm/damon: trivial cleanups". This patchset contains trivial cleansups for DAMON code. This patch (of 6): Commit 81a8418 ("Docs/admin-guide/mm/damon/reclaim: document 'commit_inputs' parameter") has documented the 'commit_inputs' parameter which allows online parameter update, but it didn't remove a paragraph saying the online parameter update is impossible. This commit removes the obsolete paragraph. Link: https://lkml.kernel.org/r/20220606182310.48781-1-sj@kernel.org Link: https://lkml.kernel.org/r/20220606182310.48781-2-sj@kernel.org Fixes: 81a8418 ("Docs/admin-guide/mm/damon/reclaim: document 'commit_inputs' parameter") Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The function for knowing if given monitoring context's targets will have pid or not is defined and used in dbgfs only. However, the logic is also needed for sysfs. This commit moves the code to damon.h and makes both dbgfs and sysfs to use it. Link: https://lkml.kernel.org/r/20220606182310.48781-3-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
DAMON_RECLAIM's handling of 'commit_inputs' parameter is duplicated in 'after_aggregation()' and 'after_wmarks_check()' callbacks. This commit deduplicates the code for better maintenance. Link: https://lkml.kernel.org/r/20220606182310.48781-4-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
DAMON sysfs interface's DAMON context building and its online parameter update have duplicated code. This commit removes the duplicate. Link: https://lkml.kernel.org/r/20220606182310.48781-5-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
DAMON_RECLAIM's 'enabled' parameter store callback ('enabled_store()')
schedules the parameter check timer ('damon_reclaim_timer') if the
parameter is set as 'Y'. Then, the timer schedules itself to check if
user has set the parameter as 'N'. It's unnecessarily complex.
This commit makes it simpler by making the parameter store callback to
schedule the timer regardless of the parameter value and disabling the
timer's self scheduling.
Link: https://lkml.kernel.org/r/20220606182310.48781-6-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
This commit adds 'damon_reclaim_' prefix to 'enabled_store()', so that we can distinguish it easily from the stack trace using 'faddr2line.sh' like tools. Link: https://lkml.kernel.org/r/20220606182310.48781-7-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
…and 'damos_action' values Patch series "Extend DAMOS for Proactive LRU-lists Sorting". Introduction ============ In short, this patchset 1) extends DAMON-based Operation Schemes (DAMOS) for low overhead data access pattern based LRU-lists sorting, and 2) implements a static kernel module for easy use of conservatively-tuned version of that using the extended DAMOS capability. Background ---------- As page-granularity access checking overhead could be significant on huge systems, LRU lists are normally not proactively sorted but partially and reactively sorted for special events including specific user requests, system calls and memory pressure. As a result, LRU lists are sometimes not so perfectly prepared to be used as a trustworthy access pattern source for some situations including reclamation target pages selection under sudden memory pressure. DAMON-based Proactive LRU-lists Sorting --------------------------------------- Because DAMON can identify access patterns of best-effort accuracy while inducing only user-specified range of overhead, using DAMON for Proactive LRU-lists Sorting (PLRUS) could be helpful for this situation. The idea is quite simple. Find hot pages and cold pages using DAMON, and prioritize hot pages while deprioritizing cold pages on their LRU-lists. This patchset extends DAMON to support such schemes by introducing a couple of new DAMOS actions for prioritizing and deprioritizing memory regions of specific access patterns on their LRU-lists. In detail, this patchset simply uses 'mark_page_accessed()' and 'deactivate_page()' functions for prioritization and deprioritization of pages on their LRU lists, respectively. To make the scheme easy to use without complex tuning for common situations, this patchset further implements a static kernel module called 'DAMON_LRU_SORT' using the extended DAMOS functionality. It proactively sorts LRU-lists using DAMON with conservatively chosen default hotness/coldness thresholds and small CPU usage quota limit. That is, the module under its default parameters will make no harm for common situation but provide some level of benefit for systems having clear hot/cold access pattern under only memory pressure while consuming only limited small portion of CPU time. Related Works ------------- Proactive reclamation is well known to be helpful for reducing non-optimal reclamation target selection caused performance drops. However, proactive reclamation is not a best option for some cases, because it could incur additional I/O. For an example, it could be prohitive for systems using storage devices that total number of writes is limited, or cloud block storages that charges every I/O. Some proactive reclamation approaches[1,2] induce a level of memory pressure using memcg files or swappiness while monitoring PSI. As reclamation target selection is still relying on the original LRU-lists mechanism, using DAMON-based proactive reclamation before inducing the proactive reclamation could allow more memory saving with same level of performance overhead, or less performance overhead with same level of memory saving. [1] https://blogs.oracle.com/linux/post/anticipating-your-memory-needs [2] https://www.pdl.cmu.edu/ftp/NVM/tmo_asplos22.pdf Evaluation ========== In short, PLRUS achieves 10% memory PSI (some) reduction, 14% major page faults reduction, and 3.74% speedup under memory pressure. Setup ----- To show the effect of PLRUS, I run PARSEC3 and SPLASH-2X benchmarks under below variant systems and measure a few metrics including the runtime of each workload, number of system-wide major page faults, and system-wide memory PSI (some). - orig: v5.18-rc4 based mm-unstable kernel + this patchset, but no DAMON scheme applied. - mprs: Same to 'orig' but artificial memory pressure is induced. - plrus: Same to 'mprs' but a radically tuned PLRUS scheme is applied to the entire physical address space of the system. For the artificial memory pressure, I set 'memory.limit_in_bytes' to 75% of the running workload's peak RSS, wait 1 second, remove the pressure by setting it to 200% of the peak RSS, wait 10 seconds, and repeat the procedure until the workload finishes[1]. I use zram based swap device. The tests are automated[2]. [1] https://github.com/awslabs/damon-tests/blob/next/perf/runners/back/0009_memcg_pressure.sh [2] https://github.com/awslabs/damon-tests/blob/next/perf/full_once_config.sh Radically Tuned PLRUS --------------------- To show effect of PLRUS on the PARSEC3/SPLASH-2X workloads which runs for no long time, we use radically tuned version of PLRUS. The version asks DAMON to do the proactive LRU-lists sorting as below. 1. Find any memory regions shown some accesses (approximately >=20 accesses per 100 sampling) and prioritize pages of the regions on their LRU lists using up to 2% CPU time. Under the CPU time limit, prioritize regions having higher access frequency and kept the access frequency longer first. 2. Find any memory regions shown no access for at least >=5 seconds and deprioritize pages of the rgions on their LRU lists using up to 2% CPU time. Under the CPU time limit, deprioritize regions that not accessed for longer time first. Results ------- I repeat the tests 25 times and calculate average of the measured numbers. The results are as below: metric orig mprs plrus plrus/mprs runtime_seconds 190.06 292.83 281.87 0.96 pgmajfaults 852.55 8769420.00 7525040.00 0.86 memory_psi_some_us 106911.00 6943420.00 6220920.00 0.90 The first row is for legend. The first cell shows the metric that the following cells of the row shows. Second, third, and fourth cells show the metrics under the configs shown at the first row of the cell, and the fifth cell shows the metric under 'plrus' divided by the metric under 'mprs'. Second row shows the averaged runtime of the workloads in seconds. Third row shows the number of system-wide major page faults while the test was ongoing. Fourth row shows the system-wide memory pressure stall for some processes in microseconds while the test was ongoing. In short, PLRUS achieves 10% memory PSI (some) reduction, 14% major page faults reduction, and 3.74% speedup under memory pressure. We also confirmed the CPU usage of kdamond was 2.61% of single CPU, which is below 4% as expected. Sequence of Patches =================== The first and second patch cleans up DAMON debugfs interface and DAMOS_PAGEOUT handling code of physical address space monitoring operations implementation for easier extension of the code. The thrid and fourth patches implement a new DAMOS action called 'lru_prio', which prioritizes pages under memory regions which have a user-specified access pattern, and document it, respectively. The fifth and sixth patches implement yet another new DAMOS action called 'lru_deprio', which deprioritizes pages under memory regions which have a user-specified access pattern, and document it, respectively. The seventh patch implements a static kernel module called 'damon_lru_sort', which utilizes the DAMON-based proactive LRU-lists sorting under conservatively chosen default parameter. Finally, the eighth patch documents 'damon_lru_sort'. This patch (of 8): DAMON debugfs interface assumes users will write 'damos_action' value directly to the 'schemes' file. This makes adding new 'damos_action' in the middle of its definition breaks the backward compatibility of DAMON debugfs interface, as values of some 'damos_action' could be changed. To mitigate the situation, this commit adds mappings between the user inputs and 'damos_action' value and makes DAMON debugfs code uses those. Link: https://lkml.kernel.org/r/20220613192301.8817-1-sj@kernel.org Link: https://lkml.kernel.org/r/20220613192301.8817-2-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
This commit moves code for 'DAMOS_PAGEOUT' handling of the physical address space monitoring operations set to a separate function so that its caller, 'damon_pa_apply_scheme()', can be more easily extended for additional DAMOS actions later. Link: https://lkml.kernel.org/r/20220613192301.8817-4-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
This commit adds a new DAMOS action called 'LRU_PRIO' for the physical address space. The action prioritizes pages in the memory regions of the user-specified target access pattern on their LRU lists. This is hence supposed to be used for frequently accessed (hot) memory regions so that hot pages could be more likely protected under memory pressure. Internally, it simply calls 'mark_page_accessed()'. Link: https://lkml.kernel.org/r/20220613192301.8817-5-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
This commit documents the 'lru_prio' scheme action for DAMON sysfs interface. Link: https://lkml.kernel.org/r/20220613192301.8817-6-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
This commit adds a new DAMON-based operation scheme action called 'LRU_DEPRIO' for physical address space. The action deprioritizes pages in the memory area of the target access pattern on their LRU lists. This is hence supposed to be used for rarely accessed (cold) memory regions so that cold pages could be more likely reclaimed first under memory pressure. Internally, it simply calls 'lru_deactivate()'. Using this with 'LRU_PRIO' action for hot pages, users can proactively sort LRU lists based on the access pattern. That is, it can make the LRU lists somewhat more trustworthy source of access temperature. As a result, efficiency of LRU-lists based mechanisms including the reclamation target selection could be improved. Link: https://lkml.kernel.org/r/20220613192301.8817-7-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
This commit documents the 'LRU_DEPRIO' scheme action for DAMON sysfs interface.` Link: https://lkml.kernel.org/r/20220613192301.8817-8-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Users can do data access-aware LRU-lists sorting using 'LRU_PRIO' and 'LRU_DEPRIO' DAMOS actions. However, finding best parameters including the hotness/coldness thresholds, CPU quota, and watermarks could be challenging for some users. To make the scheme easy to be used without complex tuning for common situations, this commit implements a static kernel module called 'DAMON_LRU_SORT' using the 'LRU_PRIO' and 'LRU_DEPRIO' DAMOS actions. It proactively sorts LRU-lists using DAMON with conservatively chosen default values of the parameters. That is, the module under its default parameters will make no harm for common situations but provide some level of efficiency improvements for systems having clear hot/cold access pattern under a level of memory pressure while consuming only a limited small portion of CPU time. Link: https://lkml.kernel.org/r/20220613192301.8817-9-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
This commit documents the usage of DAMON_LRU_SORT for admins. Link: https://lkml.kernel.org/r/20220613192301.8817-10-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
damon_lru_sort_init() returns an error when damon_select_ops() fails without freeing 'ctx' which allocated before. This commit fixes the potential memory leak by freeing 'ctx' under the situation. Link: https://lkml.kernel.org/r/20220714170458.49727-1-sj@kernel.org Fixes: 40e983c ("mm/damon: introduce DAMON-based LRU-lists Sorting") Signed-off-by: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The workflow example code is not working since it got the file names wrong. So fix this. Fixes: b184027 ("Docs/admin-guide/mm/damon/usage: document DAMON sysfs interface") Reviewed-by: SeongJae Park <sj@kernel.org> Signed-off-by: Kairui Song <kasong@tencent.com> Link: https://lore.kernel.org/r/20220823114053.53305-1-ryncsn@gmail.com Signed-off-by: Jonathan Corbet <corbet@lwn.net>
nathan-zcgao
pushed a commit
that referenced
this pull request
Jan 27, 2026
As reported by syzbot and experienced by Pavel, using cpus_read_lock() in wake_up_all_idle_cpus() generates lock inversion (against mmap_sem and possibly others). Instead, shrink the preempt disable region by iterating all CPUs and checking the online status for each individual CPU while having preemption disabled. Fixes: 8850cb6 ("sched: Simplify wake_up_*idle*()") Reported-by: syzbot+d5b23b18d2f4feae8a67@syzkaller.appspotmail.com Reported-by: Pavel Machek <pavel@ucw.cz> Reported-by: Qian Cai <quic_qiancai@quicinc.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Qian Cai <quic_qiancai@quicinc.com>
nathan-zcgao
pushed a commit
that referenced
this pull request
Jan 27, 2026
commit f79a609 upstream. log_max_qp in driver's default profile #2 was set to 18, but FW actually supports 17 at the most - a situation that led to the concerning print when the driver is loaded: "log_max_qp value in current profile is 18, changing to HCA capabaility limit (17)" The expected behavior from mlx5_profile #2 is to match the maximum FW capability in regards to log_max_qp. Thus, log_max_qp in profile #2 is initialized to a defined static value (0xff) - which basically means that when loading this profile, log_max_qp value will be what the currently installed FW supports at most. Signed-off-by: Maher Sanalla <msanalla@nvidia.com> Reviewed-by: Maor Gottlieb <maorg@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com> [v5.10: replaced prof->log_max_qp with profile[prof_sel].log_max_qp] Signed-off-by: Shaoying Xu <shaoyi@amazon.com>
nathan-zcgao
pushed a commit
that referenced
this pull request
Jan 27, 2026
[ Upstream commit 9afb4b2 ] To clear the flow table on flow table free, the following sequence normally happens in order: 1) gc_step work is stopped to disable any further stats/del requests. 2) All flow table entries are set to teardown state. 3) Run gc_step which will queue HW del work for each flow table entry. 4) Waiting for the above del work to finish (flush). 5) Run gc_step again, deleting all entries from the flow table. 6) Flow table is freed. But if a flow table entry already has pending HW stats or HW add work step 3 will not queue HW del work (it will be skipped), step 4 will wait for the pending add/stats to finish, and step 5 will queue HW del work which might execute after freeing of the flow table. To fix the above, this patch flushes the pending work, then it sets the teardown flag to all flows in the flowtable and it forces a garbage collector run to queue work to remove the flows from hardware, then it flushes this new pending work and (finally) it forces another garbage collector run to remove the entry from the software flowtable. Stack trace: [47773.882335] BUG: KASAN: use-after-free in down_read+0x99/0x460 [47773.883634] Write of size 8 at addr ffff888103b45aa8 by task kworker/u20:6/543704 [47773.885634] CPU: 3 PID: 543704 Comm: kworker/u20:6 Not tainted 5.12.0-rc7+ #2 [47773.886745] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009) [47773.888438] Workqueue: nf_ft_offload_del flow_offload_work_handler [nf_flow_table] [47773.889727] Call Trace: [47773.890214] dump_stack+0xbb/0x107 [47773.890818] print_address_description.constprop.0+0x18/0x140 [47773.892990] kasan_report.cold+0x7c/0xd8 [47773.894459] kasan_check_range+0x145/0x1a0 [47773.895174] down_read+0x99/0x460 [47773.899706] nf_flow_offload_tuple+0x24f/0x3c0 [nf_flow_table] [47773.907137] flow_offload_work_handler+0x72d/0xbe0 [nf_flow_table] [47773.913372] process_one_work+0x8ac/0x14e0 [47773.921325] [47773.921325] Allocated by task 592159: [47773.922031] kasan_save_stack+0x1b/0x40 [47773.922730] __kasan_kmalloc+0x7a/0x90 [47773.923411] tcf_ct_flow_table_get+0x3cb/0x1230 [act_ct] [47773.924363] tcf_ct_init+0x71c/0x1156 [act_ct] [47773.925207] tcf_action_init_1+0x45b/0x700 [47773.925987] tcf_action_init+0x453/0x6b0 [47773.926692] tcf_exts_validate+0x3d0/0x600 [47773.927419] fl_change+0x757/0x4a51 [cls_flower] [47773.928227] tc_new_tfilter+0x89a/0x2070 [47773.936652] [47773.936652] Freed by task 543704: [47773.937303] kasan_save_stack+0x1b/0x40 [47773.938039] kasan_set_track+0x1c/0x30 [47773.938731] kasan_set_free_info+0x20/0x30 [47773.939467] __kasan_slab_free+0xe7/0x120 [47773.940194] slab_free_freelist_hook+0x86/0x190 [47773.941038] kfree+0xce/0x3a0 [47773.941644] tcf_ct_flow_table_cleanup_work Original patch description and stack trace by Paul Blakey. Fixes: c29f74e ("netfilter: nf_flow_table: hardware offload support") Reported-by: Paul Blakey <paulb@nvidia.com> Tested-by: Paul Blakey <paulb@nvidia.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Nathan Gao <zcgao@amazon.com>
nathan-zcgao
pushed a commit
that referenced
this pull request
Jan 27, 2026
commit c886d70 upstream. Dipanjan reported a syzbot splat at close time: WARNING: CPU: 1 PID: 10818 at net/ipv4/af_inet.c:153 inet_sock_destruct+0x6d0/0x8e0 net/ipv4/af_inet.c:153 Modules linked in: uio_ivshmem(OE) uio(E) CPU: 1 PID: 10818 Comm: kworker/1:16 Tainted: G OE 5.19.0-rc6-g2eae0556bb9d #2 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1ubuntu1.1 04/01/2014 Workqueue: events mptcp_worker RIP: 0010:inet_sock_destruct+0x6d0/0x8e0 net/ipv4/af_inet.c:153 Code: 21 02 00 00 41 8b 9c 24 28 02 00 00 e9 07 ff ff ff e8 34 4d 91 f9 89 ee 4c 89 e7 e8 4a 47 60 ff e9 a6 fc ff ff e8 20 4d 91 f9 <0f> 0b e9 84 fe ff ff e8 14 4d 91 f9 0f 0b e9 d4 fd ff ff e8 08 4d RSP: 0018:ffffc9001b35fa78 EFLAGS: 00010246 RAX: 0000000000000000 RBX: 00000000002879d0 RCX: ffff8881326f3b00 RDX: 0000000000000000 RSI: ffff8881326f3b00 RDI: 0000000000000002 RBP: ffff888179662674 R08: ffffffff87e983a0 R09: 0000000000000000 R10: 0000000000000005 R11: 00000000000004ea R12: ffff888179662400 R13: ffff888179662428 R14: 0000000000000001 R15: ffff88817e38e258 FS: 0000000000000000(0000) GS:ffff8881f5f00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000020007bc0 CR3: 0000000179592000 CR4: 0000000000150ee0 Call Trace: <TASK> __sk_destruct+0x4f/0x8e0 net/core/sock.c:2067 sk_destruct+0xbd/0xe0 net/core/sock.c:2112 __sk_free+0xef/0x3d0 net/core/sock.c:2123 sk_free+0x78/0xa0 net/core/sock.c:2134 sock_put include/net/sock.h:1927 [inline] __mptcp_close_ssk+0x50f/0x780 net/mptcp/protocol.c:2351 __mptcp_destroy_sock+0x332/0x760 net/mptcp/protocol.c:2828 mptcp_worker+0x5d2/0xc90 net/mptcp/protocol.c:2586 process_one_work+0x9cc/0x1650 kernel/workqueue.c:2289 worker_thread+0x623/0x1070 kernel/workqueue.c:2436 kthread+0x2e9/0x3a0 kernel/kthread.c:376 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:302 </TASK> The root cause of the problem is that an mptcp-level (re)transmit can race with mptcp_close() and the packet scheduler checks the subflow state before acquiring the socket lock: we can try to (re)transmit on an already closed ssk. Fix the issue checking again the subflow socket status under the subflow socket lock protection. Additionally add the missing check for the fallback-to-tcp case. Fixes: d5f4919 ("mptcp: allow picking different xmit subflows") Reported-by: Dipanjan Das <mail.dipanjan.das@gmail.com> Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net> [mheyne@: Fixed a couple of conflicts due to some missing commit: - Commit d9ca1de ("mptcp: move page frag allocation in mptcp_sendmsg()") reworked a couple of functions. Before this commit there used to be a check for EAGAIN after mptcp_sendmsg_frag. This is equivalent to the "continue" from this backport and therefore is dropped. Other commits that affected the context: - 5cf92bb ("mptcp: re-enable sndbuf autotune") - b713d00 ("mptcp: really share subflow snd_wnd") - f70cad1 ("mptcp: stop relying on tcp_tx_skb_cache")] Signed-off-by: Maximilian Heyne <mheyne@amazon.de>
nathan-zcgao
pushed a commit
that referenced
this pull request
Jan 27, 2026
As reported by syzbot and experienced by Pavel, using cpus_read_lock() in wake_up_all_idle_cpus() generates lock inversion (against mmap_sem and possibly others). Instead, shrink the preempt disable region by iterating all CPUs and checking the online status for each individual CPU while having preemption disabled. Fixes: 8850cb6 ("sched: Simplify wake_up_*idle*()") Reported-by: syzbot+d5b23b18d2f4feae8a67@syzkaller.appspotmail.com Reported-by: Pavel Machek <pavel@ucw.cz> Reported-by: Qian Cai <quic_qiancai@quicinc.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Qian Cai <quic_qiancai@quicinc.com>
nathan-zcgao
pushed a commit
that referenced
this pull request
Jan 27, 2026
commit f79a609 upstream. log_max_qp in driver's default profile #2 was set to 18, but FW actually supports 17 at the most - a situation that led to the concerning print when the driver is loaded: "log_max_qp value in current profile is 18, changing to HCA capabaility limit (17)" The expected behavior from mlx5_profile #2 is to match the maximum FW capability in regards to log_max_qp. Thus, log_max_qp in profile #2 is initialized to a defined static value (0xff) - which basically means that when loading this profile, log_max_qp value will be what the currently installed FW supports at most. Signed-off-by: Maher Sanalla <msanalla@nvidia.com> Reviewed-by: Maor Gottlieb <maorg@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com> [v5.10: replaced prof->log_max_qp with profile[prof_sel].log_max_qp] Signed-off-by: Shaoying Xu <shaoyi@amazon.com>
nathan-zcgao
pushed a commit
that referenced
this pull request
Jan 27, 2026
[ Upstream commit 9afb4b2 ] To clear the flow table on flow table free, the following sequence normally happens in order: 1) gc_step work is stopped to disable any further stats/del requests. 2) All flow table entries are set to teardown state. 3) Run gc_step which will queue HW del work for each flow table entry. 4) Waiting for the above del work to finish (flush). 5) Run gc_step again, deleting all entries from the flow table. 6) Flow table is freed. But if a flow table entry already has pending HW stats or HW add work step 3 will not queue HW del work (it will be skipped), step 4 will wait for the pending add/stats to finish, and step 5 will queue HW del work which might execute after freeing of the flow table. To fix the above, this patch flushes the pending work, then it sets the teardown flag to all flows in the flowtable and it forces a garbage collector run to queue work to remove the flows from hardware, then it flushes this new pending work and (finally) it forces another garbage collector run to remove the entry from the software flowtable. Stack trace: [47773.882335] BUG: KASAN: use-after-free in down_read+0x99/0x460 [47773.883634] Write of size 8 at addr ffff888103b45aa8 by task kworker/u20:6/543704 [47773.885634] CPU: 3 PID: 543704 Comm: kworker/u20:6 Not tainted 5.12.0-rc7+ #2 [47773.886745] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009) [47773.888438] Workqueue: nf_ft_offload_del flow_offload_work_handler [nf_flow_table] [47773.889727] Call Trace: [47773.890214] dump_stack+0xbb/0x107 [47773.890818] print_address_description.constprop.0+0x18/0x140 [47773.892990] kasan_report.cold+0x7c/0xd8 [47773.894459] kasan_check_range+0x145/0x1a0 [47773.895174] down_read+0x99/0x460 [47773.899706] nf_flow_offload_tuple+0x24f/0x3c0 [nf_flow_table] [47773.907137] flow_offload_work_handler+0x72d/0xbe0 [nf_flow_table] [47773.913372] process_one_work+0x8ac/0x14e0 [47773.921325] [47773.921325] Allocated by task 592159: [47773.922031] kasan_save_stack+0x1b/0x40 [47773.922730] __kasan_kmalloc+0x7a/0x90 [47773.923411] tcf_ct_flow_table_get+0x3cb/0x1230 [act_ct] [47773.924363] tcf_ct_init+0x71c/0x1156 [act_ct] [47773.925207] tcf_action_init_1+0x45b/0x700 [47773.925987] tcf_action_init+0x453/0x6b0 [47773.926692] tcf_exts_validate+0x3d0/0x600 [47773.927419] fl_change+0x757/0x4a51 [cls_flower] [47773.928227] tc_new_tfilter+0x89a/0x2070 [47773.936652] [47773.936652] Freed by task 543704: [47773.937303] kasan_save_stack+0x1b/0x40 [47773.938039] kasan_set_track+0x1c/0x30 [47773.938731] kasan_set_free_info+0x20/0x30 [47773.939467] __kasan_slab_free+0xe7/0x120 [47773.940194] slab_free_freelist_hook+0x86/0x190 [47773.941038] kfree+0xce/0x3a0 [47773.941644] tcf_ct_flow_table_cleanup_work Original patch description and stack trace by Paul Blakey. Fixes: c29f74e ("netfilter: nf_flow_table: hardware offload support") Reported-by: Paul Blakey <paulb@nvidia.com> Tested-by: Paul Blakey <paulb@nvidia.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Nathan Gao <zcgao@amazon.com>
nathan-zcgao
pushed a commit
that referenced
this pull request
Jan 27, 2026
commit c886d70 upstream. Dipanjan reported a syzbot splat at close time: WARNING: CPU: 1 PID: 10818 at net/ipv4/af_inet.c:153 inet_sock_destruct+0x6d0/0x8e0 net/ipv4/af_inet.c:153 Modules linked in: uio_ivshmem(OE) uio(E) CPU: 1 PID: 10818 Comm: kworker/1:16 Tainted: G OE 5.19.0-rc6-g2eae0556bb9d #2 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1ubuntu1.1 04/01/2014 Workqueue: events mptcp_worker RIP: 0010:inet_sock_destruct+0x6d0/0x8e0 net/ipv4/af_inet.c:153 Code: 21 02 00 00 41 8b 9c 24 28 02 00 00 e9 07 ff ff ff e8 34 4d 91 f9 89 ee 4c 89 e7 e8 4a 47 60 ff e9 a6 fc ff ff e8 20 4d 91 f9 <0f> 0b e9 84 fe ff ff e8 14 4d 91 f9 0f 0b e9 d4 fd ff ff e8 08 4d RSP: 0018:ffffc9001b35fa78 EFLAGS: 00010246 RAX: 0000000000000000 RBX: 00000000002879d0 RCX: ffff8881326f3b00 RDX: 0000000000000000 RSI: ffff8881326f3b00 RDI: 0000000000000002 RBP: ffff888179662674 R08: ffffffff87e983a0 R09: 0000000000000000 R10: 0000000000000005 R11: 00000000000004ea R12: ffff888179662400 R13: ffff888179662428 R14: 0000000000000001 R15: ffff88817e38e258 FS: 0000000000000000(0000) GS:ffff8881f5f00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000020007bc0 CR3: 0000000179592000 CR4: 0000000000150ee0 Call Trace: <TASK> __sk_destruct+0x4f/0x8e0 net/core/sock.c:2067 sk_destruct+0xbd/0xe0 net/core/sock.c:2112 __sk_free+0xef/0x3d0 net/core/sock.c:2123 sk_free+0x78/0xa0 net/core/sock.c:2134 sock_put include/net/sock.h:1927 [inline] __mptcp_close_ssk+0x50f/0x780 net/mptcp/protocol.c:2351 __mptcp_destroy_sock+0x332/0x760 net/mptcp/protocol.c:2828 mptcp_worker+0x5d2/0xc90 net/mptcp/protocol.c:2586 process_one_work+0x9cc/0x1650 kernel/workqueue.c:2289 worker_thread+0x623/0x1070 kernel/workqueue.c:2436 kthread+0x2e9/0x3a0 kernel/kthread.c:376 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:302 </TASK> The root cause of the problem is that an mptcp-level (re)transmit can race with mptcp_close() and the packet scheduler checks the subflow state before acquiring the socket lock: we can try to (re)transmit on an already closed ssk. Fix the issue checking again the subflow socket status under the subflow socket lock protection. Additionally add the missing check for the fallback-to-tcp case. Fixes: d5f4919 ("mptcp: allow picking different xmit subflows") Reported-by: Dipanjan Das <mail.dipanjan.das@gmail.com> Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net> [mheyne@: Fixed a couple of conflicts due to some missing commit: - Commit d9ca1de ("mptcp: move page frag allocation in mptcp_sendmsg()") reworked a couple of functions. Before this commit there used to be a check for EAGAIN after mptcp_sendmsg_frag. This is equivalent to the "continue" from this backport and therefore is dropped. Other commits that affected the context: - 5cf92bb ("mptcp: re-enable sndbuf autotune") - b713d00 ("mptcp: really share subflow snd_wnd") - f70cad1 ("mptcp: stop relying on tcp_tx_skb_cache")] Signed-off-by: Maximilian Heyne <mheyne@amazon.de>
paniakin-aws
pushed a commit
that referenced
this pull request
Jan 30, 2026
[ Upstream commit a80c9d9 ] A null-ptr-deref was reported in the SCTP transmit path when SCTP-AUTH key initialization fails: ================================================================== KASAN: null-ptr-deref in range [0x0000000000000018-0x000000000000001f] CPU: 0 PID: 16 Comm: ksoftirqd/0 Tainted: G W 6.6.0 #2 RIP: 0010:sctp_packet_bundle_auth net/sctp/output.c:264 [inline] RIP: 0010:sctp_packet_append_chunk+0xb36/0x1260 net/sctp/output.c:401 Call Trace: sctp_packet_transmit_chunk+0x31/0x250 net/sctp/output.c:189 sctp_outq_flush_data+0xa29/0x26d0 net/sctp/outqueue.c:1111 sctp_outq_flush+0xc80/0x1240 net/sctp/outqueue.c:1217 sctp_cmd_interpreter.isra.0+0x19a5/0x62c0 net/sctp/sm_sideeffect.c:1787 sctp_side_effects net/sctp/sm_sideeffect.c:1198 [inline] sctp_do_sm+0x1a3/0x670 net/sctp/sm_sideeffect.c:1169 sctp_assoc_bh_rcv+0x33e/0x640 net/sctp/associola.c:1052 sctp_inq_push+0x1dd/0x280 net/sctp/inqueue.c:88 sctp_rcv+0x11ae/0x3100 net/sctp/input.c:243 sctp6_rcv+0x3d/0x60 net/sctp/ipv6.c:1127 The issue is triggered when sctp_auth_asoc_init_active_key() fails in sctp_sf_do_5_1C_ack() while processing an INIT_ACK. In this case, the command sequence is currently: - SCTP_CMD_PEER_INIT - SCTP_CMD_TIMER_STOP (T1_INIT) - SCTP_CMD_TIMER_START (T1_COOKIE) - SCTP_CMD_NEW_STATE (COOKIE_ECHOED) - SCTP_CMD_ASSOC_SHKEY - SCTP_CMD_GEN_COOKIE_ECHO If SCTP_CMD_ASSOC_SHKEY fails, asoc->shkey remains NULL, while asoc->peer.auth_capable and asoc->peer.peer_chunks have already been set by SCTP_CMD_PEER_INIT. This allows a DATA chunk with auth = 1 and shkey = NULL to be queued by sctp_datamsg_from_user(). Since command interpretation stops on failure, no COOKIE_ECHO should been sent via SCTP_CMD_GEN_COOKIE_ECHO. However, the T1_COOKIE timer has already been started, and it may enqueue a COOKIE_ECHO into the outqueue later. As a result, the DATA chunk can be transmitted together with the COOKIE_ECHO in sctp_outq_flush_data(), leading to the observed issue. Similar to the other places where it calls sctp_auth_asoc_init_active_key() right after sctp_process_init(), this patch moves the SCTP_CMD_ASSOC_SHKEY immediately after SCTP_CMD_PEER_INIT, before stopping T1_INIT and starting T1_COOKIE. This ensures that if shared key generation fails, authenticated DATA cannot be sent. It also allows the T1_INIT timer to retransmit INIT, giving the client another chance to process INIT_ACK and retry key setup. Fixes: 730fc3d ("[SCTP]: Implete SCTP-AUTH parameter processing") Reported-by: Zhen Chen <chenzhen126@huawei.com> Tested-by: Zhen Chen <chenzhen126@huawei.com> Signed-off-by: Xin Long <lucien.xin@gmail.com> Link: https://patch.msgid.link/44881224b375aa8853f5e19b4055a1a56d895813.1768324226.git.lucien.xin@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
paniakin-aws
pushed a commit
that referenced
this pull request
Jan 30, 2026
commit ca1a47c upstream. Patch series "mm/hugetlb: fixes for PMD table sharing (incl. using mmu_gather)", v3. One functional fix, one performance regression fix, and two related comment fixes. I cleaned up my prototype I recently shared [1] for the performance fix, deferring most of the cleanups I had in the prototype to a later point. While doing that I identified the other things. The goal of this patch set is to be backported to stable trees "fairly" easily. At least patch #1 and #4. Patch #1 fixes hugetlb_pmd_shared() not detecting any sharing Patch #2 + #3 are simple comment fixes that patch #4 interacts with. Patch #4 is a fix for the reported performance regression due to excessive IPI broadcasts during fork()+exit(). The last patch is all about TLB flushes, IPIs and mmu_gather. Read: complicated There are plenty of cleanups in the future to be had + one reasonable optimization on x86. But that's all out of scope for this series. Runtime tested, with a focus on fixing the performance regression using the original reproducer [2] on x86. This patch (of 4): We switched from (wrongly) using the page count to an independent shared count. Now, shared page tables have a refcount of 1 (excluding speculative references) and instead use ptdesc->pt_share_count to identify sharing. We didn't convert hugetlb_pmd_shared(), so right now, we would never detect a shared PMD table as such, because sharing/unsharing no longer touches the refcount of a PMD table. Page migration, like mbind() or migrate_pages() would allow for migrating folios mapped into such shared PMD tables, even though the folios are not exclusive. In smaps we would account them as "private" although they are "shared", and we would be wrongly setting the PM_MMAP_EXCLUSIVE in the pagemap interface. Fix it by properly using ptdesc_pmd_is_shared() in hugetlb_pmd_shared(). Link: https://lkml.kernel.org/r/20251223214037.580860-1-david@kernel.org Link: https://lkml.kernel.org/r/20251223214037.580860-2-david@kernel.org Link: https://lore.kernel.org/all/8cab934d-4a56-44aa-b641-bfd7e23bd673@kernel.org/ [1] Link: https://lore.kernel.org/all/8cab934d-4a56-44aa-b641-bfd7e23bd673@kernel.org/ [2] Fixes: 59d9094 ("mm: hugetlb: independent PMD page table shared count") Signed-off-by: David Hildenbrand (Red Hat) <david@kernel.org> Reviewed-by: Rik van Riel <riel@surriel.com> Reviewed-by: Lance Yang <lance.yang@linux.dev> Tested-by: Lance Yang <lance.yang@linux.dev> Reviewed-by: Harry Yoo <harry.yoo@oracle.com> Tested-by: Laurence Oberman <loberman@redhat.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Acked-by: Oscar Salvador <osalvador@suse.de> Cc: Liu Shixin <liushixin2@huawei.com> Cc: Uschakow, Stanislav" <suschako@amazon.de> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
paniakin-aws
pushed a commit
that referenced
this pull request
Jan 30, 2026
…itives [ Upstream commit c06343b ] The "valid" readout delay between the two reads of the watchdog is larger than the valid delta between the resulting watchdog and clocksource intervals, which results in false positive watchdog results. Assume TSC is the clocksource and HPET is the watchdog and both have a uncertainty margin of 250us (default). The watchdog readout does: 1) wdnow = read(HPET); 2) csnow = read(TSC); 3) wdend = read(HPET); The valid window for the delta between #1 and #3 is calculated by the uncertainty margins of the watchdog and the clocksource: m = 2 * watchdog.uncertainty_margin + cs.uncertainty margin; which results in 750us for the TSC/HPET case. The actual interval comparison uses a smaller margin: m = watchdog.uncertainty_margin + cs.uncertainty margin; which results in 500us for the TSC/HPET case. That means the following scenario will trigger the watchdog: Watchdog cycle N: 1) wdnow[N] = read(HPET); 2) csnow[N] = read(TSC); 3) wdend[N] = read(HPET); Assume the delay between #1 and #2 is 100us and the delay between #1 and Watchdog cycle N + 1: 4) wdnow[N + 1] = read(HPET); 5) csnow[N + 1] = read(TSC); 6) wdend[N + 1] = read(HPET); If the delay between #4 and gregkh#6 is within the 750us margin then any delay between #4 and #5 which is larger than 600us will fail the interval check and mark the TSC unstable because the intervals are calculated against the previous value: wd_int = wdnow[N + 1] - wdnow[N]; cs_int = csnow[N + 1] - csnow[N]; Putting the above delays in place this results in: cs_int = (wdnow[N + 1] + 610us) - (wdnow[N] + 100us); -> cs_int = wd_int + 510us; which is obviously larger than the allowed 500us margin and results in marking TSC unstable. Fix this by using the same margin as the interval comparison. If the delay between two watchdog reads is larger than that, then the readout was either disturbed by interconnect congestion, NMIs or SMIs. Fixes: 4ac1dd3 ("clocksource: Set cs_watchdog_read() checks based on .uncertainty_margin") Reported-by: Daniel J Blueman <daniel@quora.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Paul E. McKenney <paulmck@kernel.org> Tested-by: Paul E. McKenney <paulmck@kernel.org> Link: https://lore.kernel.org/lkml/20250602223251.496591-1-daniel@quora.org/ Link: https://patch.msgid.link/87bjjxc9dq.ffs@tglx Signed-off-by: Sasha Levin <sashal@kernel.org>
paniakin-aws
pushed a commit
that referenced
this pull request
Jan 30, 2026
[ Upstream commit 9910159 ] When one iio device is a consumer of another, it is possible that the ->info_exist_lock of both ends up being taken when reading the value of the consumer device. Since they currently belong to the same lockdep class (being initialized in a single location with mutex_init()), that results in a lockdep warning CPU0 ---- lock(&iio_dev_opaque->info_exist_lock); lock(&iio_dev_opaque->info_exist_lock); *** DEADLOCK *** May be due to missing lock nesting notation 4 locks held by sensors/414: #0: c31fd6dc (&p->lock){+.+.}-{3:3}, at: seq_read_iter+0x44/0x4e4 #1: c4f5a1c4 (&of->mutex){+.+.}-{3:3}, at: kernfs_seq_start+0x1c/0xac #2: c2827548 (kn->active#34){.+.+}-{0:0}, at: kernfs_seq_start+0x30/0xac #3: c1dd2b6 (&iio_dev_opaque->info_exist_lock){+.+.}-{3:3}, at: iio_read_channel_processed_scale+0x24/0xd8 stack backtrace: CPU: 0 UID: 0 PID: 414 Comm: sensors Not tainted 6.17.11 #5 NONE Hardware name: Generic AM33XX (Flattened Device Tree) Call trace: unwind_backtrace from show_stack+0x10/0x14 show_stack from dump_stack_lvl+0x44/0x60 dump_stack_lvl from print_deadlock_bug+0x2b8/0x334 print_deadlock_bug from __lock_acquire+0x13a4/0x2ab0 __lock_acquire from lock_acquire+0xd0/0x2c0 lock_acquire from __mutex_lock+0xa0/0xe8c __mutex_lock from mutex_lock_nested+0x1c/0x24 mutex_lock_nested from iio_read_channel_raw+0x20/0x6c iio_read_channel_raw from rescale_read_raw+0x128/0x1c4 rescale_read_raw from iio_channel_read+0xe4/0xf4 iio_channel_read from iio_read_channel_processed_scale+0x6c/0xd8 iio_read_channel_processed_scale from iio_hwmon_read_val+0x68/0xbc iio_hwmon_read_val from dev_attr_show+0x18/0x48 dev_attr_show from sysfs_kf_seq_show+0x80/0x110 sysfs_kf_seq_show from seq_read_iter+0xdc/0x4e4 seq_read_iter from vfs_read+0x238/0x2e4 vfs_read from ksys_read+0x6c/0xec ksys_read from ret_fast_syscall+0x0/0x1c Just as the mlock_key already has its own lockdep class, add a lock_class_key for the info_exist mutex. Note that this has in theory been a problem since before IIO first left staging, but it only occurs when a chain of consumers is in use and that is not often done. Fixes: ac917a8 ("staging:iio:core set the iio_dev.info pointer to null on unregister under lock.") Signed-off-by: Rasmus Villemoes <ravi@prevas.dk> Reviewed-by: Peter Rosin <peda@axentia.se> Cc: <stable@vger.kernel.org> Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
paniakin-aws
pushed a commit
that referenced
this pull request
Jan 30, 2026
[ Upstream commit a80c9d9 ] A null-ptr-deref was reported in the SCTP transmit path when SCTP-AUTH key initialization fails: ================================================================== KASAN: null-ptr-deref in range [0x0000000000000018-0x000000000000001f] CPU: 0 PID: 16 Comm: ksoftirqd/0 Tainted: G W 6.6.0 #2 RIP: 0010:sctp_packet_bundle_auth net/sctp/output.c:264 [inline] RIP: 0010:sctp_packet_append_chunk+0xb36/0x1260 net/sctp/output.c:401 Call Trace: sctp_packet_transmit_chunk+0x31/0x250 net/sctp/output.c:189 sctp_outq_flush_data+0xa29/0x26d0 net/sctp/outqueue.c:1111 sctp_outq_flush+0xc80/0x1240 net/sctp/outqueue.c:1217 sctp_cmd_interpreter.isra.0+0x19a5/0x62c0 net/sctp/sm_sideeffect.c:1787 sctp_side_effects net/sctp/sm_sideeffect.c:1198 [inline] sctp_do_sm+0x1a3/0x670 net/sctp/sm_sideeffect.c:1169 sctp_assoc_bh_rcv+0x33e/0x640 net/sctp/associola.c:1052 sctp_inq_push+0x1dd/0x280 net/sctp/inqueue.c:88 sctp_rcv+0x11ae/0x3100 net/sctp/input.c:243 sctp6_rcv+0x3d/0x60 net/sctp/ipv6.c:1127 The issue is triggered when sctp_auth_asoc_init_active_key() fails in sctp_sf_do_5_1C_ack() while processing an INIT_ACK. In this case, the command sequence is currently: - SCTP_CMD_PEER_INIT - SCTP_CMD_TIMER_STOP (T1_INIT) - SCTP_CMD_TIMER_START (T1_COOKIE) - SCTP_CMD_NEW_STATE (COOKIE_ECHOED) - SCTP_CMD_ASSOC_SHKEY - SCTP_CMD_GEN_COOKIE_ECHO If SCTP_CMD_ASSOC_SHKEY fails, asoc->shkey remains NULL, while asoc->peer.auth_capable and asoc->peer.peer_chunks have already been set by SCTP_CMD_PEER_INIT. This allows a DATA chunk with auth = 1 and shkey = NULL to be queued by sctp_datamsg_from_user(). Since command interpretation stops on failure, no COOKIE_ECHO should been sent via SCTP_CMD_GEN_COOKIE_ECHO. However, the T1_COOKIE timer has already been started, and it may enqueue a COOKIE_ECHO into the outqueue later. As a result, the DATA chunk can be transmitted together with the COOKIE_ECHO in sctp_outq_flush_data(), leading to the observed issue. Similar to the other places where it calls sctp_auth_asoc_init_active_key() right after sctp_process_init(), this patch moves the SCTP_CMD_ASSOC_SHKEY immediately after SCTP_CMD_PEER_INIT, before stopping T1_INIT and starting T1_COOKIE. This ensures that if shared key generation fails, authenticated DATA cannot be sent. It also allows the T1_INIT timer to retransmit INIT, giving the client another chance to process INIT_ACK and retry key setup. Fixes: 730fc3d ("[SCTP]: Implete SCTP-AUTH parameter processing") Reported-by: Zhen Chen <chenzhen126@huawei.com> Tested-by: Zhen Chen <chenzhen126@huawei.com> Signed-off-by: Xin Long <lucien.xin@gmail.com> Link: https://patch.msgid.link/44881224b375aa8853f5e19b4055a1a56d895813.1768324226.git.lucien.xin@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
paniakin-aws
pushed a commit
that referenced
this pull request
Jan 30, 2026
…itives [ Upstream commit c06343b ] The "valid" readout delay between the two reads of the watchdog is larger than the valid delta between the resulting watchdog and clocksource intervals, which results in false positive watchdog results. Assume TSC is the clocksource and HPET is the watchdog and both have a uncertainty margin of 250us (default). The watchdog readout does: 1) wdnow = read(HPET); 2) csnow = read(TSC); 3) wdend = read(HPET); The valid window for the delta between #1 and #3 is calculated by the uncertainty margins of the watchdog and the clocksource: m = 2 * watchdog.uncertainty_margin + cs.uncertainty margin; which results in 750us for the TSC/HPET case. The actual interval comparison uses a smaller margin: m = watchdog.uncertainty_margin + cs.uncertainty margin; which results in 500us for the TSC/HPET case. That means the following scenario will trigger the watchdog: Watchdog cycle N: 1) wdnow[N] = read(HPET); 2) csnow[N] = read(TSC); 3) wdend[N] = read(HPET); Assume the delay between #1 and #2 is 100us and the delay between #1 and Watchdog cycle N + 1: 4) wdnow[N + 1] = read(HPET); 5) csnow[N + 1] = read(TSC); 6) wdend[N + 1] = read(HPET); If the delay between #4 and gregkh#6 is within the 750us margin then any delay between #4 and #5 which is larger than 600us will fail the interval check and mark the TSC unstable because the intervals are calculated against the previous value: wd_int = wdnow[N + 1] - wdnow[N]; cs_int = csnow[N + 1] - csnow[N]; Putting the above delays in place this results in: cs_int = (wdnow[N + 1] + 610us) - (wdnow[N] + 100us); -> cs_int = wd_int + 510us; which is obviously larger than the allowed 500us margin and results in marking TSC unstable. Fix this by using the same margin as the interval comparison. If the delay between two watchdog reads is larger than that, then the readout was either disturbed by interconnect congestion, NMIs or SMIs. Fixes: 4ac1dd3 ("clocksource: Set cs_watchdog_read() checks based on .uncertainty_margin") Reported-by: Daniel J Blueman <daniel@quora.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Paul E. McKenney <paulmck@kernel.org> Tested-by: Paul E. McKenney <paulmck@kernel.org> Link: https://lore.kernel.org/lkml/20250602223251.496591-1-daniel@quora.org/ Link: https://patch.msgid.link/87bjjxc9dq.ffs@tglx Signed-off-by: Sasha Levin <sashal@kernel.org>
paniakin-aws
pushed a commit
that referenced
this pull request
Jan 30, 2026
[ Upstream commit 9910159 ] When one iio device is a consumer of another, it is possible that the ->info_exist_lock of both ends up being taken when reading the value of the consumer device. Since they currently belong to the same lockdep class (being initialized in a single location with mutex_init()), that results in a lockdep warning CPU0 ---- lock(&iio_dev_opaque->info_exist_lock); lock(&iio_dev_opaque->info_exist_lock); *** DEADLOCK *** May be due to missing lock nesting notation 4 locks held by sensors/414: #0: c31fd6dc (&p->lock){+.+.}-{3:3}, at: seq_read_iter+0x44/0x4e4 #1: c4f5a1c4 (&of->mutex){+.+.}-{3:3}, at: kernfs_seq_start+0x1c/0xac #2: c2827548 (kn->active#34){.+.+}-{0:0}, at: kernfs_seq_start+0x30/0xac #3: c1dd2b6 (&iio_dev_opaque->info_exist_lock){+.+.}-{3:3}, at: iio_read_channel_processed_scale+0x24/0xd8 stack backtrace: CPU: 0 UID: 0 PID: 414 Comm: sensors Not tainted 6.17.11 #5 NONE Hardware name: Generic AM33XX (Flattened Device Tree) Call trace: unwind_backtrace from show_stack+0x10/0x14 show_stack from dump_stack_lvl+0x44/0x60 dump_stack_lvl from print_deadlock_bug+0x2b8/0x334 print_deadlock_bug from __lock_acquire+0x13a4/0x2ab0 __lock_acquire from lock_acquire+0xd0/0x2c0 lock_acquire from __mutex_lock+0xa0/0xe8c __mutex_lock from mutex_lock_nested+0x1c/0x24 mutex_lock_nested from iio_read_channel_raw+0x20/0x6c iio_read_channel_raw from rescale_read_raw+0x128/0x1c4 rescale_read_raw from iio_channel_read+0xe4/0xf4 iio_channel_read from iio_read_channel_processed_scale+0x6c/0xd8 iio_read_channel_processed_scale from iio_hwmon_read_val+0x68/0xbc iio_hwmon_read_val from dev_attr_show+0x18/0x48 dev_attr_show from sysfs_kf_seq_show+0x80/0x110 sysfs_kf_seq_show from seq_read_iter+0xdc/0x4e4 seq_read_iter from vfs_read+0x238/0x2e4 vfs_read from ksys_read+0x6c/0xec ksys_read from ret_fast_syscall+0x0/0x1c Just as the mlock_key already has its own lockdep class, add a lock_class_key for the info_exist mutex. Note that this has in theory been a problem since before IIO first left staging, but it only occurs when a chain of consumers is in use and that is not often done. Fixes: ac917a8 ("staging:iio:core set the iio_dev.info pointer to null on unregister under lock.") Signed-off-by: Rasmus Villemoes <ravi@prevas.dk> Reviewed-by: Peter Rosin <peda@axentia.se> Cc: <stable@vger.kernel.org> Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
paniakin-aws
pushed a commit
that referenced
this pull request
Jan 30, 2026
…alled from deferred_grow_zone() Commit 3acb913 ("mm/mm_init: use deferred_init_memmap_chunk() in deferred_grow_zone()") made deferred_grow_zone() call deferred_init_memmap_chunk() within a pgdat_resize_lock() critical section with irqs disabled. It did check for irqs_disabled() in deferred_init_memmap_chunk() to avoid calling cond_resched(). For a PREEMPT_RT kernel build, however, spin_lock_irqsave() does not disable interrupt but rcu_read_lock() is called. This leads to the following bug report. BUG: sleeping function called from invalid context at mm/mm_init.c:2091 in_atomic(): 0, irqs_disabled(): 0, non_block: 0, pid: 1, name: swapper/0 preempt_count: 0, expected: 0 RCU nest depth: 1, expected: 0 3 locks held by swapper/0/1: #0: ffff80008471b7a0 (sched_domains_mutex){+.+.}-{4:4}, at: sched_domains_mutex_lock+0x28/0x40 #1: ffff003bdfffef48 (&pgdat->node_size_lock){+.+.}-{3:3}, at: deferred_grow_zone+0x140/0x278 #2: ffff800084acf600 (rcu_read_lock){....}-{1:3}, at: rt_spin_lock+0x1b4/0x408 CPU: 0 UID: 0 PID: 1 Comm: swapper/0 Tainted: G W 6.19.0-rc6-test #1 PREEMPT_{RT,(full) } Tainted: [W]=WARN Call trace: show_stack+0x20/0x38 (C) dump_stack_lvl+0xdc/0xf8 dump_stack+0x1c/0x28 __might_resched+0x384/0x530 deferred_init_memmap_chunk+0x560/0x688 deferred_grow_zone+0x190/0x278 _deferred_grow_zone+0x18/0x30 get_page_from_freelist+0x780/0xf78 __alloc_frozen_pages_noprof+0x1dc/0x348 alloc_slab_page+0x30/0x110 allocate_slab+0x98/0x2a0 new_slab+0x4c/0x80 ___slab_alloc+0x5a4/0x770 __slab_alloc.constprop.0+0x88/0x1e0 __kmalloc_node_noprof+0x2c0/0x598 __sdt_alloc+0x3b8/0x728 build_sched_domains+0xe0/0x1260 sched_init_domains+0x14c/0x1c8 sched_init_smp+0x9c/0x1d0 kernel_init_freeable+0x218/0x358 kernel_init+0x28/0x208 ret_from_fork+0x10/0x20 Fix it adding a new argument to deferred_init_memmap_chunk() to explicitly tell it if cond_resched() is allowed or not instead of relying on some current state information which may vary depending on the exact kernel configuration options that are enabled. Link: https://lkml.kernel.org/r/20260122184343.546627-1-longman@redhat.com Fixes: 3acb913 ("mm/mm_init: use deferred_init_memmap_chunk() in deferred_grow_zone()") Signed-off-by: Waiman Long <longman@redhat.com> Suggested-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: "Paul E . McKenney" <paulmck@kernel.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: <stable@vger.kernrl.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
paniakin-aws
pushed a commit
that referenced
this pull request
Jan 30, 2026
Since the commit 25c6a5a ("net: phy: micrel: Dynamically control external clock of KSZ PHY"), the clock of Micrel PHY has been enabled by phy_driver::resume() and disabled by phy_driver::suspend(). However, devm_clk_get_optional_enabled() is used in kszphy_probe(), so the clock will automatically be disabled when the device is unbound from the bus. Therefore, this could cause the clock to be disabled twice, resulting in clk driver warnings. For example, this issue can be reproduced on i.MX6ULL platform, and we can see the following logs when removing the FEC MAC drivers. $ echo 2188000.ethernet > /sys/bus/platform/drivers/fec/unbind $ echo 20b4000.ethernet > /sys/bus/platform/drivers/fec/unbind [ 109.758207] ------------[ cut here ]------------ [ 109.758240] WARNING: drivers/clk/clk.c:1188 at clk_core_disable+0xb4/0xd0, CPU#0: sh/639 [ 109.771011] enet2_ref already disabled [ 109.793359] Call trace: [ 109.822006] clk_core_disable from clk_disable+0x28/0x34 [ 109.827340] clk_disable from clk_disable_unprepare+0xc/0x18 [ 109.833029] clk_disable_unprepare from devm_clk_release+0x1c/0x28 [ 109.839241] devm_clk_release from devres_release_all+0x98/0x100 [ 109.845278] devres_release_all from device_unbind_cleanup+0xc/0x70 [ 109.851571] device_unbind_cleanup from device_release_driver_internal+0x1a4/0x1f4 [ 109.859170] device_release_driver_internal from bus_remove_device+0xbc/0xe4 [ 109.866243] bus_remove_device from device_del+0x140/0x458 [ 109.871757] device_del from phy_mdio_device_remove+0xc/0x24 [ 109.877452] phy_mdio_device_remove from mdiobus_unregister+0x40/0xac [ 109.883918] mdiobus_unregister from fec_enet_mii_remove+0x40/0x78 [ 109.890125] fec_enet_mii_remove from fec_drv_remove+0x4c/0x158 [ 109.896076] fec_drv_remove from device_release_driver_internal+0x17c/0x1f4 [ 109.962748] WARNING: drivers/clk/clk.c:1047 at clk_core_unprepare+0xfc/0x13c, CPU#0: sh/639 [ 109.975805] enet2_ref already unprepared [ 110.002866] Call trace: [ 110.031758] clk_core_unprepare from clk_unprepare+0x24/0x2c [ 110.037440] clk_unprepare from devm_clk_release+0x1c/0x28 [ 110.042957] devm_clk_release from devres_release_all+0x98/0x100 [ 110.048989] devres_release_all from device_unbind_cleanup+0xc/0x70 [ 110.055280] device_unbind_cleanup from device_release_driver_internal+0x1a4/0x1f4 [ 110.062877] device_release_driver_internal from bus_remove_device+0xbc/0xe4 [ 110.069950] bus_remove_device from device_del+0x140/0x458 [ 110.075469] device_del from phy_mdio_device_remove+0xc/0x24 [ 110.081165] phy_mdio_device_remove from mdiobus_unregister+0x40/0xac [ 110.087632] mdiobus_unregister from fec_enet_mii_remove+0x40/0x78 [ 110.093836] fec_enet_mii_remove from fec_drv_remove+0x4c/0x158 [ 110.099782] fec_drv_remove from device_release_driver_internal+0x17c/0x1f4 After analyzing the process of removing the FEC driver, as shown below, it can be seen that the clock was disabled twice by the PHY driver. fec_drv_remove() --> fec_enet_close() --> phy_stop() --> phy_suspend() --> kszphy_suspend() #1 The clock is disabled --> fec_enet_mii_remove() --> mdiobus_unregister() --> phy_mdio_device_remove() --> device_del() --> devm_clk_release() #2 The clock is disabled again Therefore, devm_clk_get_optional() is used to fix the above issue. And to avoid the issue mentioned by the commit 9853294 ("net: phy: micrel: use devm_clk_get_optional_enabled for the rmii-ref clock"), the clock is enabled by clk_prepare_enable() to get the correct clock rate. Fixes: 25c6a5a ("net: phy: micrel: Dynamically control external clock of KSZ PHY") Signed-off-by: Wei Fang <wei.fang@nxp.com> Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Link: https://patch.msgid.link/20260126081544.983517-1-wei.fang@nxp.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
paniakin-aws
pushed a commit
that referenced
this pull request
Jan 30, 2026
[ Upstream commit a80c9d9 ] A null-ptr-deref was reported in the SCTP transmit path when SCTP-AUTH key initialization fails: ================================================================== KASAN: null-ptr-deref in range [0x0000000000000018-0x000000000000001f] CPU: 0 PID: 16 Comm: ksoftirqd/0 Tainted: G W 6.6.0 #2 RIP: 0010:sctp_packet_bundle_auth net/sctp/output.c:264 [inline] RIP: 0010:sctp_packet_append_chunk+0xb36/0x1260 net/sctp/output.c:401 Call Trace: sctp_packet_transmit_chunk+0x31/0x250 net/sctp/output.c:189 sctp_outq_flush_data+0xa29/0x26d0 net/sctp/outqueue.c:1111 sctp_outq_flush+0xc80/0x1240 net/sctp/outqueue.c:1217 sctp_cmd_interpreter.isra.0+0x19a5/0x62c0 net/sctp/sm_sideeffect.c:1787 sctp_side_effects net/sctp/sm_sideeffect.c:1198 [inline] sctp_do_sm+0x1a3/0x670 net/sctp/sm_sideeffect.c:1169 sctp_assoc_bh_rcv+0x33e/0x640 net/sctp/associola.c:1052 sctp_inq_push+0x1dd/0x280 net/sctp/inqueue.c:88 sctp_rcv+0x11ae/0x3100 net/sctp/input.c:243 sctp6_rcv+0x3d/0x60 net/sctp/ipv6.c:1127 The issue is triggered when sctp_auth_asoc_init_active_key() fails in sctp_sf_do_5_1C_ack() while processing an INIT_ACK. In this case, the command sequence is currently: - SCTP_CMD_PEER_INIT - SCTP_CMD_TIMER_STOP (T1_INIT) - SCTP_CMD_TIMER_START (T1_COOKIE) - SCTP_CMD_NEW_STATE (COOKIE_ECHOED) - SCTP_CMD_ASSOC_SHKEY - SCTP_CMD_GEN_COOKIE_ECHO If SCTP_CMD_ASSOC_SHKEY fails, asoc->shkey remains NULL, while asoc->peer.auth_capable and asoc->peer.peer_chunks have already been set by SCTP_CMD_PEER_INIT. This allows a DATA chunk with auth = 1 and shkey = NULL to be queued by sctp_datamsg_from_user(). Since command interpretation stops on failure, no COOKIE_ECHO should been sent via SCTP_CMD_GEN_COOKIE_ECHO. However, the T1_COOKIE timer has already been started, and it may enqueue a COOKIE_ECHO into the outqueue later. As a result, the DATA chunk can be transmitted together with the COOKIE_ECHO in sctp_outq_flush_data(), leading to the observed issue. Similar to the other places where it calls sctp_auth_asoc_init_active_key() right after sctp_process_init(), this patch moves the SCTP_CMD_ASSOC_SHKEY immediately after SCTP_CMD_PEER_INIT, before stopping T1_INIT and starting T1_COOKIE. This ensures that if shared key generation fails, authenticated DATA cannot be sent. It also allows the T1_INIT timer to retransmit INIT, giving the client another chance to process INIT_ACK and retry key setup. Fixes: 730fc3d ("[SCTP]: Implete SCTP-AUTH parameter processing") Reported-by: Zhen Chen <chenzhen126@huawei.com> Tested-by: Zhen Chen <chenzhen126@huawei.com> Signed-off-by: Xin Long <lucien.xin@gmail.com> Link: https://patch.msgid.link/44881224b375aa8853f5e19b4055a1a56d895813.1768324226.git.lucien.xin@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
paniakin-aws
pushed a commit
that referenced
this pull request
Jan 30, 2026
[ Upstream commit 9910159 ] When one iio device is a consumer of another, it is possible that the ->info_exist_lock of both ends up being taken when reading the value of the consumer device. Since they currently belong to the same lockdep class (being initialized in a single location with mutex_init()), that results in a lockdep warning CPU0 ---- lock(&iio_dev_opaque->info_exist_lock); lock(&iio_dev_opaque->info_exist_lock); *** DEADLOCK *** May be due to missing lock nesting notation 4 locks held by sensors/414: #0: c31fd6dc (&p->lock){+.+.}-{3:3}, at: seq_read_iter+0x44/0x4e4 #1: c4f5a1c4 (&of->mutex){+.+.}-{3:3}, at: kernfs_seq_start+0x1c/0xac #2: c2827548 (kn->active#34){.+.+}-{0:0}, at: kernfs_seq_start+0x30/0xac #3: c1dd2b6 (&iio_dev_opaque->info_exist_lock){+.+.}-{3:3}, at: iio_read_channel_processed_scale+0x24/0xd8 stack backtrace: CPU: 0 UID: 0 PID: 414 Comm: sensors Not tainted 6.17.11 #5 NONE Hardware name: Generic AM33XX (Flattened Device Tree) Call trace: unwind_backtrace from show_stack+0x10/0x14 show_stack from dump_stack_lvl+0x44/0x60 dump_stack_lvl from print_deadlock_bug+0x2b8/0x334 print_deadlock_bug from __lock_acquire+0x13a4/0x2ab0 __lock_acquire from lock_acquire+0xd0/0x2c0 lock_acquire from __mutex_lock+0xa0/0xe8c __mutex_lock from mutex_lock_nested+0x1c/0x24 mutex_lock_nested from iio_read_channel_raw+0x20/0x6c iio_read_channel_raw from rescale_read_raw+0x128/0x1c4 rescale_read_raw from iio_channel_read+0xe4/0xf4 iio_channel_read from iio_read_channel_processed_scale+0x6c/0xd8 iio_read_channel_processed_scale from iio_hwmon_read_val+0x68/0xbc iio_hwmon_read_val from dev_attr_show+0x18/0x48 dev_attr_show from sysfs_kf_seq_show+0x80/0x110 sysfs_kf_seq_show from seq_read_iter+0xdc/0x4e4 seq_read_iter from vfs_read+0x238/0x2e4 vfs_read from ksys_read+0x6c/0xec ksys_read from ret_fast_syscall+0x0/0x1c Just as the mlock_key already has its own lockdep class, add a lock_class_key for the info_exist mutex. Note that this has in theory been a problem since before IIO first left staging, but it only occurs when a chain of consumers is in use and that is not often done. Fixes: ac917a8 ("staging:iio:core set the iio_dev.info pointer to null on unregister under lock.") Signed-off-by: Rasmus Villemoes <ravi@prevas.dk> Reviewed-by: Peter Rosin <peda@axentia.se> Cc: <stable@vger.kernel.org> Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
paniakin-aws
pushed a commit
that referenced
this pull request
Feb 2, 2026
There was a lockdep warning in sprd_gpio: [ 6.258269][T329@C6] [ BUG: Invalid wait context ] [ 6.258270][T329@C6] 6.18.0-android17-0-g30527ad7aaae-ab00009-4k #1 Tainted: G W OE [ 6.258272][T329@C6] ----------------------------- [ 6.258273][T329@C6] modprobe/329 is trying to lock: [ 6.258275][T329@C6] ffffff8081c91690 (&sprd_gpio->lock){....}-{3:3}, at: sprd_gpio_irq_unmask+0x4c/0xa4 [gpio_sprd] [ 6.258282][T329@C6] other info that might help us debug this: [ 6.258283][T329@C6] context-{5:5} [ 6.258285][T329@C6] 3 locks held by modprobe/329: [ 6.258286][T329@C6] #0: ffffff808baca108 (&dev->mutex){....}-{4:4}, at: __driver_attach+0xc4/0x204 [ 6.258295][T329@C6] #1: ffffff80965e7240 (request_class#4){+.+.}-{4:4}, at: __setup_irq+0x1cc/0x82c [ 6.258304][T329@C6] #2: ffffff80965e70c8 (lock_class#4){....}-{2:2}, at: __setup_irq+0x21c/0x82c [ 6.258313][T329@C6] stack backtrace: [ 6.258314][T329@C6] CPU: 6 UID: 0 PID: 329 Comm: modprobe Tainted: G W OE 6.18.0-android17-0-g30527ad7aaae-ab00009-4k #1 PREEMPT 3ad5b0f45741a16e5838da790706e16ceb6717df [ 6.258316][T329@C6] Tainted: [W]=WARN, [O]=OOT_MODULE, [E]=UNSIGNED_MODULE [ 6.258317][T329@C6] Hardware name: Unisoc UMS9632-base Board (DT) [ 6.258318][T329@C6] Call trace: [ 6.258318][T329@C6] show_stack+0x20/0x30 (C) [ 6.258321][T329@C6] __dump_stack+0x28/0x3c [ 6.258324][T329@C6] dump_stack_lvl+0xac/0xf0 [ 6.258326][T329@C6] dump_stack+0x18/0x3c [ 6.258329][T329@C6] __lock_acquire+0x824/0x2c28 [ 6.258331][T329@C6] lock_acquire+0x148/0x2cc [ 6.258333][T329@C6] _raw_spin_lock_irqsave+0x6c/0xb4 [ 6.258334][T329@C6] sprd_gpio_irq_unmask+0x4c/0xa4 [gpio_sprd 814535e93c6d8e0853c45c02eab0fa88a9da6487] [ 6.258337][T329@C6] irq_startup+0x238/0x350 [ 6.258340][T329@C6] __setup_irq+0x504/0x82c [ 6.258342][T329@C6] request_threaded_irq+0x118/0x184 [ 6.258344][T329@C6] devm_request_threaded_irq+0x94/0x120 [ 6.258347][T329@C6] sc8546_init_irq+0x114/0x170 [sc8546_charger 223586ccafc27439f7db4f95b0c8e6e882349a99] [ 6.258352][T329@C6] sc8546_charger_probe+0x53c/0x5a0 [sc8546_charger 223586ccafc27439f7db4f95b0c8e6e882349a99] [ 6.258358][T329@C6] i2c_device_probe+0x2c8/0x350 [ 6.258361][T329@C6] really_probe+0x1a8/0x46c [ 6.258363][T329@C6] __driver_probe_device+0xa4/0x10c [ 6.258366][T329@C6] driver_probe_device+0x44/0x1b4 [ 6.258369][T329@C6] __driver_attach+0xd0/0x204 [ 6.258371][T329@C6] bus_for_each_dev+0x10c/0x168 [ 6.258373][T329@C6] driver_attach+0x2c/0x3c [ 6.258376][T329@C6] bus_add_driver+0x154/0x29c [ 6.258378][T329@C6] driver_register+0x70/0x10c [ 6.258381][T329@C6] i2c_register_driver+0x48/0xc8 [ 6.258384][T329@C6] init_module+0x28/0xfd8 [sc8546_charger 223586ccafc27439f7db4f95b0c8e6e882349a99] [ 6.258389][T329@C6] do_one_initcall+0x128/0x42c [ 6.258392][T329@C6] do_init_module+0x60/0x254 [ 6.258395][T329@C6] load_module+0x1054/0x1220 [ 6.258397][T329@C6] __arm64_sys_finit_module+0x240/0x35c [ 6.258400][T329@C6] invoke_syscall+0x60/0xec [ 6.258402][T329@C6] el0_svc_common+0xb0/0xe4 [ 6.258405][T329@C6] do_el0_svc+0x24/0x30 [ 6.258407][T329@C6] el0_svc+0x54/0x1c4 [ 6.258409][T329@C6] el0t_64_sync_handler+0x68/0xdc [ 6.258411][T329@C6] el0t_64_sync+0x1c4/0x1c8 This is because the spin_lock would change to rt_mutex in PREEMPT_RT, however the sprd_gpio->lock would use in hard-irq, this is unsafe. So change the spin_lock_t to raw_spin_lock_t to use the spinlock in hard-irq. Signed-off-by: Xuewen Yan <xuewen.yan@unisoc.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Link: https://lore.kernel.org/r/20260126094209.9855-1-xuewen.yan@unisoc.com [Bartosz: tweaked the commit message] Signed-off-by: Bartosz Golaszewski <bartosz.golaszewski@oss.qualcomm.com>
mngyadam
pushed a commit
that referenced
this pull request
Feb 2, 2026
A "small" CIFS buffer is not big enough in general to hold a setacl request for SMB2, and we end up overflowing the buffer in send_set_info(). For instance: # mount.cifs //127.0.0.1/test /mnt/test -o username=test,password=test,nounix,cifsacl # touch /mnt/test/acltest # getcifsacl /mnt/test/acltest REVISION:0x1 CONTROL:0x9004 OWNER:S-1-5-21-2926364953-924364008-418108241-1000 GROUP:S-1-22-2-1001 ACL:S-1-5-21-2926364953-924364008-418108241-1000:ALLOWED/0x0/0x1e01ff ACL:S-1-22-2-1001:ALLOWED/0x0/R ACL:S-1-22-2-1001:ALLOWED/0x0/R ACL:S-1-5-21-2926364953-924364008-418108241-1000:ALLOWED/0x0/0x1e01ff ACL:S-1-1-0:ALLOWED/0x0/R # setcifsacl -a "ACL:S-1-22-2-1004:ALLOWED/0x0/R" /mnt/test/acltest this setacl will cause the following KASAN splat: [ 330.777927] BUG: KASAN: slab-out-of-bounds in send_set_info+0x4dd/0xc20 [cifs] [ 330.779696] Write of size 696 at addr ffff88010d5e2860 by task setcifsacl/1012 [ 330.781882] CPU: 1 PID: 1012 Comm: setcifsacl Not tainted 4.18.0-rc2+ #2 [ 330.783140] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014 [ 330.784395] Call Trace: [ 330.784789] dump_stack+0xc2/0x16b [ 330.786777] print_address_description+0x6a/0x270 [ 330.787520] kasan_report+0x258/0x380 [ 330.788845] memcpy+0x34/0x50 [ 330.789369] send_set_info+0x4dd/0xc20 [cifs] [ 330.799511] SMB2_set_acl+0x76/0xa0 [cifs] [ 330.801395] set_smb2_acl+0x7ac/0xf30 [cifs] [ 330.830888] cifs_xattr_set+0x963/0xe40 [cifs] [ 330.840367] __vfs_setxattr+0x84/0xb0 [ 330.842060] __vfs_setxattr_noperm+0xe6/0x370 [ 330.843848] vfs_setxattr+0xc2/0xd0 [ 330.845519] setxattr+0x258/0x320 [ 330.859211] path_setxattr+0x15b/0x1b0 [ 330.864392] __x64_sys_setxattr+0xc0/0x160 [ 330.866133] do_syscall_64+0x14e/0x4b0 [ 330.876631] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [ 330.878503] RIP: 0033:0x7ff2e507db0a [ 330.880151] Code: 48 8b 0d 89 93 2c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 49 89 ca b8 bc 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 56 93 2c 00 f7 d8 64 89 01 48 [ 330.885358] RSP: 002b:00007ffdc4903c18 EFLAGS: 00000246 ORIG_RAX: 00000000000000bc [ 330.887733] RAX: ffffffffffffffda RBX: 000055d1170de140 RCX: 00007ff2e507db0a [ 330.890067] RDX: 000055d1170de7d0 RSI: 000055d115b39184 RDI: 00007ffdc4904818 [ 330.892410] RBP: 0000000000000001 R08: 0000000000000000 R09: 000055d1170de7e4 [ 330.894785] R10: 00000000000002b8 R11: 0000000000000246 R12: 0000000000000007 [ 330.897148] R13: 000055d1170de0c0 R14: 0000000000000008 R15: 000055d1170de550 [ 330.901057] Allocated by task 1012: [ 330.902888] kasan_kmalloc+0xa0/0xd0 [ 330.904714] kmem_cache_alloc+0xc8/0x1d0 [ 330.906615] mempool_alloc+0x11e/0x380 [ 330.908496] cifs_small_buf_get+0x35/0x60 [cifs] [ 330.910510] smb2_plain_req_init+0x4a/0xd60 [cifs] [ 330.912551] send_set_info+0x198/0xc20 [cifs] [ 330.914535] SMB2_set_acl+0x76/0xa0 [cifs] [ 330.916465] set_smb2_acl+0x7ac/0xf30 [cifs] [ 330.918453] cifs_xattr_set+0x963/0xe40 [cifs] [ 330.920426] __vfs_setxattr+0x84/0xb0 [ 330.922284] __vfs_setxattr_noperm+0xe6/0x370 [ 330.924213] vfs_setxattr+0xc2/0xd0 [ 330.926008] setxattr+0x258/0x320 [ 330.927762] path_setxattr+0x15b/0x1b0 [ 330.929592] __x64_sys_setxattr+0xc0/0x160 [ 330.931459] do_syscall_64+0x14e/0x4b0 [ 330.933314] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [ 330.936843] Freed by task 0: [ 330.938588] (stack is not available) [ 330.941886] The buggy address belongs to the object at ffff88010d5e2800 which belongs to the cache cifs_small_rq of size 448 [ 330.946362] The buggy address is located 96 bytes inside of 448-byte region [ffff88010d5e2800, ffff88010d5e29c0) [ 330.950722] The buggy address belongs to the page: [ 330.952789] page:ffffea0004357880 count:1 mapcount:0 mapping:ffff880108fdca80 index:0x0 compound_mapcount: 0 [ 330.955665] flags: 0x17ffffc0008100(slab|head) [ 330.957760] raw: 0017ffffc0008100 dead000000000100 dead000000000200 ffff880108fdca80 [ 330.960356] raw: 0000000000000000 0000000080100010 00000001ffffffff 0000000000000000 [ 330.963005] page dumped because: kasan: bad access detected [ 330.967039] Memory state around the buggy address: [ 330.969255] ffff88010d5e2880: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 [ 330.971833] ffff88010d5e2900: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 [ 330.974397] >ffff88010d5e2980: 00 00 00 00 00 00 00 00 fc fc fc fc fc fc fc fc [ 330.976956] ^ [ 330.979226] ffff88010d5e2a00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc [ 330.981755] ffff88010d5e2a80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc [ 330.984225] ================================================================== Fix this by allocating a regular CIFS buffer in smb2_plain_req_init() if the request command is SMB2_SET_INFO. Reported-by: Jianhong Yin <jiyin@redhat.com> Fixes: 366ed84 ("cifs: Use smb 2 - 3 and cifsacl mount options setacl function") CC: Stable <stable@vger.kernel.org> Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-and-tested-by: Aurelien Aptel <aaptel@suse.com> Signed-off-by: Steve French <stfrench@microsoft.com> Signed-off-by: Boris Protopopov <pboris@amazon.com>
mngyadam
pushed a commit
that referenced
this pull request
Feb 2, 2026
[ Upstream commit 10876da ] syzkaller reported a null-ptr-deref in sock_omalloc() while allocating a CALIPSO option. [0] The NULL is of struct sock, which was fetched by sk_to_full_sk() in calipso_req_setattr(). Since commit a1a5344 ("tcp: avoid two atomic ops for syncookies"), reqsk->rsk_listener could be NULL when SYN Cookie is returned to its client, as hinted by the leading SYN Cookie log. Here are 3 options to fix the bug: 1) Return 0 in calipso_req_setattr() 2) Return an error in calipso_req_setattr() 3) Alaways set rsk_listener 1) is no go as it bypasses LSM, but 2) effectively disables SYN Cookie for CALIPSO. 3) is also no go as there have been many efforts to reduce atomic ops and make TCP robust against DDoS. See also commit 3b24d85 ("tcp/dccp: do not touch listener sk_refcnt under synflood"). As of the blamed commit, SYN Cookie already did not need refcounting, and no one has stumbled on the bug for 9 years, so no CALIPSO user will care about SYN Cookie. Let's return an error in calipso_req_setattr() and calipso_req_delattr() in the SYN Cookie case. This can be reproduced by [1] on Fedora and now connect() of nc times out. [0]: TCP: request_sock_TCPv6: Possible SYN flooding on port [::]:20002. Sending cookies. Oops: general protection fault, probably for non-canonical address 0xdffffc0000000006: 0000 [#1] PREEMPT SMP KASAN NOPTI KASAN: null-ptr-deref in range [0x0000000000000030-0x0000000000000037] CPU: 3 UID: 0 PID: 12262 Comm: syz.1.2611 Not tainted 6.14.0 #2 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014 RIP: 0010:read_pnet include/net/net_namespace.h:406 [inline] RIP: 0010:sock_net include/net/sock.h:655 [inline] RIP: 0010:sock_kmalloc+0x35/0x170 net/core/sock.c:2806 Code: 89 d5 41 54 55 89 f5 53 48 89 fb e8 25 e3 c6 fd e8 f0 91 e3 00 48 8d 7b 30 48 b8 00 00 00 00 00 fc ff df 48 89 fa 48 c1 ea 03 <80> 3c 02 00 0f 85 26 01 00 00 48 b8 00 00 00 00 00 fc ff df 4c 8b RSP: 0018:ffff88811af89038 EFLAGS: 00010216 RAX: dffffc0000000000 RBX: 0000000000000000 RCX: ffff888105266400 RDX: 0000000000000006 RSI: ffff88800c890000 RDI: 0000000000000030 RBP: 0000000000000050 R08: 0000000000000000 R09: ffff88810526640e R10: ffffed1020a4cc81 R11: ffff88810526640f R12: 0000000000000000 R13: 0000000000000820 R14: ffff888105266400 R15: 0000000000000050 FS: 00007f0653a07640(0000) GS:ffff88811af80000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f863ba096f4 CR3: 00000000163c0005 CR4: 0000000000770ef0 PKRU: 80000000 Call Trace: <IRQ> ipv6_renew_options+0x279/0x950 net/ipv6/exthdrs.c:1288 calipso_req_setattr+0x181/0x340 net/ipv6/calipso.c:1204 calipso_req_setattr+0x56/0x80 net/netlabel/netlabel_calipso.c:597 netlbl_req_setattr+0x18a/0x440 net/netlabel/netlabel_kapi.c:1249 selinux_netlbl_inet_conn_request+0x1fb/0x320 security/selinux/netlabel.c:342 selinux_inet_conn_request+0x1eb/0x2c0 security/selinux/hooks.c:5551 security_inet_conn_request+0x50/0xa0 security/security.c:4945 tcp_v6_route_req+0x22c/0x550 net/ipv6/tcp_ipv6.c:825 tcp_conn_request+0xec8/0x2b70 net/ipv4/tcp_input.c:7275 tcp_v6_conn_request+0x1e3/0x440 net/ipv6/tcp_ipv6.c:1328 tcp_rcv_state_process+0xafa/0x52b0 net/ipv4/tcp_input.c:6781 tcp_v6_do_rcv+0x8a6/0x1a40 net/ipv6/tcp_ipv6.c:1667 tcp_v6_rcv+0x505e/0x5b50 net/ipv6/tcp_ipv6.c:1904 ip6_protocol_deliver_rcu+0x17c/0x1da0 net/ipv6/ip6_input.c:436 ip6_input_finish+0x103/0x180 net/ipv6/ip6_input.c:480 NF_HOOK include/linux/netfilter.h:314 [inline] NF_HOOK include/linux/netfilter.h:308 [inline] ip6_input+0x13c/0x6b0 net/ipv6/ip6_input.c:491 dst_input include/net/dst.h:469 [inline] ip6_rcv_finish net/ipv6/ip6_input.c:79 [inline] ip6_rcv_finish+0xb6/0x490 net/ipv6/ip6_input.c:69 NF_HOOK include/linux/netfilter.h:314 [inline] NF_HOOK include/linux/netfilter.h:308 [inline] ipv6_rcv+0xf9/0x490 net/ipv6/ip6_input.c:309 __netif_receive_skb_one_core+0x12e/0x1f0 net/core/dev.c:5896 __netif_receive_skb+0x1d/0x170 net/core/dev.c:6009 process_backlog+0x41e/0x13b0 net/core/dev.c:6357 __napi_poll+0xbd/0x710 net/core/dev.c:7191 napi_poll net/core/dev.c:7260 [inline] net_rx_action+0x9de/0xde0 net/core/dev.c:7382 handle_softirqs+0x19a/0x770 kernel/softirq.c:561 do_softirq.part.0+0x36/0x70 kernel/softirq.c:462 </IRQ> <TASK> do_softirq arch/x86/include/asm/preempt.h:26 [inline] __local_bh_enable_ip+0xf1/0x110 kernel/softirq.c:389 local_bh_enable include/linux/bottom_half.h:33 [inline] rcu_read_unlock_bh include/linux/rcupdate.h:919 [inline] __dev_queue_xmit+0xc2a/0x3c40 net/core/dev.c:4679 dev_queue_xmit include/linux/netdevice.h:3313 [inline] neigh_hh_output include/net/neighbour.h:523 [inline] neigh_output include/net/neighbour.h:537 [inline] ip6_finish_output2+0xd69/0x1f80 net/ipv6/ip6_output.c:141 __ip6_finish_output net/ipv6/ip6_output.c:215 [inline] ip6_finish_output+0x5dc/0xd60 net/ipv6/ip6_output.c:226 NF_HOOK_COND include/linux/netfilter.h:303 [inline] ip6_output+0x24b/0x8d0 net/ipv6/ip6_output.c:247 dst_output include/net/dst.h:459 [inline] NF_HOOK include/linux/netfilter.h:314 [inline] NF_HOOK include/linux/netfilter.h:308 [inline] ip6_xmit+0xbbc/0x20d0 net/ipv6/ip6_output.c:366 inet6_csk_xmit+0x39a/0x720 net/ipv6/inet6_connection_sock.c:135 __tcp_transmit_skb+0x1a7b/0x3b40 net/ipv4/tcp_output.c:1471 tcp_transmit_skb net/ipv4/tcp_output.c:1489 [inline] tcp_send_syn_data net/ipv4/tcp_output.c:4059 [inline] tcp_connect+0x1c0c/0x4510 net/ipv4/tcp_output.c:4148 tcp_v6_connect+0x156c/0x2080 net/ipv6/tcp_ipv6.c:333 __inet_stream_connect+0x3a7/0xed0 net/ipv4/af_inet.c:677 tcp_sendmsg_fastopen+0x3e2/0x710 net/ipv4/tcp.c:1039 tcp_sendmsg_locked+0x1e82/0x3570 net/ipv4/tcp.c:1091 tcp_sendmsg+0x2f/0x50 net/ipv4/tcp.c:1358 inet6_sendmsg+0xb9/0x150 net/ipv6/af_inet6.c:659 sock_sendmsg_nosec net/socket.c:718 [inline] __sock_sendmsg+0xf4/0x2a0 net/socket.c:733 __sys_sendto+0x29a/0x390 net/socket.c:2187 __do_sys_sendto net/socket.c:2194 [inline] __se_sys_sendto net/socket.c:2190 [inline] __x64_sys_sendto+0xe1/0x1c0 net/socket.c:2190 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0xc3/0x1d0 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x77/0x7f RIP: 0033:0x7f06553c47ed Code: 02 b8 ff ff ff ff c3 66 0f 1f 44 00 00 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 a8 ff ff ff f7 d8 64 89 01 48 RSP: 002b:00007f0653a06fc8 EFLAGS: 00000246 ORIG_RAX: 000000000000002c RAX: ffffffffffffffda RBX: 00007f0655605fa0 RCX: 00007f06553c47ed RDX: 0000000000000000 RSI: 0000000000000000 RDI: 000000000000000b RBP: 00007f065545db38 R08: 0000200000000140 R09: 000000000000001c R10: f7384d4ea84b01bd R11: 0000000000000246 R12: 0000000000000000 R13: 00007f0655605fac R14: 00007f0655606038 R15: 00007f06539e7000 </TASK> Modules linked in: [1]: dnf install -y selinux-policy-targeted policycoreutils netlabel_tools procps-ng nmap-ncat mount -t selinuxfs none /sys/fs/selinux load_policy netlabelctl calipso add pass doi:1 netlabelctl map del default netlabelctl map add default address:::1 protocol:calipso,1 sysctl net.ipv4.tcp_syncookies=2 nc -l ::1 80 & nc ::1 80 Fixes: e1adea9 ("calipso: Allow request sockets to be relabelled by the lsm.") Reported-by: syzkaller <syzkaller@googlegroups.com> Reported-by: John Cheung <john.cs.hey@gmail.com> Closes: https://lore.kernel.org/netdev/CAP=Rh=MvfhrGADy+-WJiftV2_WzMH4VEhEFmeT28qY+4yxNu4w@mail.gmail.com/ Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Acked-by: Paul Moore <paul@paul-moore.com> Link: https://patch.msgid.link/20250617224125.17299-1-kuni1840@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Mark Bundschuh <mkbund@amazon.com> Signed-off-by: Nathan Gao <zcgao@amazon.com>
mngyadam
pushed a commit
that referenced
this pull request
Feb 2, 2026
As reported by syzbot and experienced by Pavel, using cpus_read_lock() in wake_up_all_idle_cpus() generates lock inversion (against mmap_sem and possibly others). Instead, shrink the preempt disable region by iterating all CPUs and checking the online status for each individual CPU while having preemption disabled. Fixes: 8850cb6 ("sched: Simplify wake_up_*idle*()") Reported-by: syzbot+d5b23b18d2f4feae8a67@syzkaller.appspotmail.com Reported-by: Pavel Machek <pavel@ucw.cz> Reported-by: Qian Cai <quic_qiancai@quicinc.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Qian Cai <quic_qiancai@quicinc.com>
mngyadam
pushed a commit
that referenced
this pull request
Feb 2, 2026
commit f79a609 upstream. log_max_qp in driver's default profile #2 was set to 18, but FW actually supports 17 at the most - a situation that led to the concerning print when the driver is loaded: "log_max_qp value in current profile is 18, changing to HCA capabaility limit (17)" The expected behavior from mlx5_profile #2 is to match the maximum FW capability in regards to log_max_qp. Thus, log_max_qp in profile #2 is initialized to a defined static value (0xff) - which basically means that when loading this profile, log_max_qp value will be what the currently installed FW supports at most. Signed-off-by: Maher Sanalla <msanalla@nvidia.com> Reviewed-by: Maor Gottlieb <maorg@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com> [v5.10: replaced prof->log_max_qp with profile[prof_sel].log_max_qp] Signed-off-by: Shaoying Xu <shaoyi@amazon.com>
mngyadam
pushed a commit
that referenced
this pull request
Feb 2, 2026
[ Upstream commit 9afb4b2 ] To clear the flow table on flow table free, the following sequence normally happens in order: 1) gc_step work is stopped to disable any further stats/del requests. 2) All flow table entries are set to teardown state. 3) Run gc_step which will queue HW del work for each flow table entry. 4) Waiting for the above del work to finish (flush). 5) Run gc_step again, deleting all entries from the flow table. 6) Flow table is freed. But if a flow table entry already has pending HW stats or HW add work step 3 will not queue HW del work (it will be skipped), step 4 will wait for the pending add/stats to finish, and step 5 will queue HW del work which might execute after freeing of the flow table. To fix the above, this patch flushes the pending work, then it sets the teardown flag to all flows in the flowtable and it forces a garbage collector run to queue work to remove the flows from hardware, then it flushes this new pending work and (finally) it forces another garbage collector run to remove the entry from the software flowtable. Stack trace: [47773.882335] BUG: KASAN: use-after-free in down_read+0x99/0x460 [47773.883634] Write of size 8 at addr ffff888103b45aa8 by task kworker/u20:6/543704 [47773.885634] CPU: 3 PID: 543704 Comm: kworker/u20:6 Not tainted 5.12.0-rc7+ #2 [47773.886745] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009) [47773.888438] Workqueue: nf_ft_offload_del flow_offload_work_handler [nf_flow_table] [47773.889727] Call Trace: [47773.890214] dump_stack+0xbb/0x107 [47773.890818] print_address_description.constprop.0+0x18/0x140 [47773.892990] kasan_report.cold+0x7c/0xd8 [47773.894459] kasan_check_range+0x145/0x1a0 [47773.895174] down_read+0x99/0x460 [47773.899706] nf_flow_offload_tuple+0x24f/0x3c0 [nf_flow_table] [47773.907137] flow_offload_work_handler+0x72d/0xbe0 [nf_flow_table] [47773.913372] process_one_work+0x8ac/0x14e0 [47773.921325] [47773.921325] Allocated by task 592159: [47773.922031] kasan_save_stack+0x1b/0x40 [47773.922730] __kasan_kmalloc+0x7a/0x90 [47773.923411] tcf_ct_flow_table_get+0x3cb/0x1230 [act_ct] [47773.924363] tcf_ct_init+0x71c/0x1156 [act_ct] [47773.925207] tcf_action_init_1+0x45b/0x700 [47773.925987] tcf_action_init+0x453/0x6b0 [47773.926692] tcf_exts_validate+0x3d0/0x600 [47773.927419] fl_change+0x757/0x4a51 [cls_flower] [47773.928227] tc_new_tfilter+0x89a/0x2070 [47773.936652] [47773.936652] Freed by task 543704: [47773.937303] kasan_save_stack+0x1b/0x40 [47773.938039] kasan_set_track+0x1c/0x30 [47773.938731] kasan_set_free_info+0x20/0x30 [47773.939467] __kasan_slab_free+0xe7/0x120 [47773.940194] slab_free_freelist_hook+0x86/0x190 [47773.941038] kfree+0xce/0x3a0 [47773.941644] tcf_ct_flow_table_cleanup_work Original patch description and stack trace by Paul Blakey. Fixes: c29f74e ("netfilter: nf_flow_table: hardware offload support") Reported-by: Paul Blakey <paulb@nvidia.com> Tested-by: Paul Blakey <paulb@nvidia.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Nathan Gao <zcgao@amazon.com>
mngyadam
pushed a commit
that referenced
this pull request
Feb 2, 2026
commit c886d70 upstream. Dipanjan reported a syzbot splat at close time: WARNING: CPU: 1 PID: 10818 at net/ipv4/af_inet.c:153 inet_sock_destruct+0x6d0/0x8e0 net/ipv4/af_inet.c:153 Modules linked in: uio_ivshmem(OE) uio(E) CPU: 1 PID: 10818 Comm: kworker/1:16 Tainted: G OE 5.19.0-rc6-g2eae0556bb9d #2 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1ubuntu1.1 04/01/2014 Workqueue: events mptcp_worker RIP: 0010:inet_sock_destruct+0x6d0/0x8e0 net/ipv4/af_inet.c:153 Code: 21 02 00 00 41 8b 9c 24 28 02 00 00 e9 07 ff ff ff e8 34 4d 91 f9 89 ee 4c 89 e7 e8 4a 47 60 ff e9 a6 fc ff ff e8 20 4d 91 f9 <0f> 0b e9 84 fe ff ff e8 14 4d 91 f9 0f 0b e9 d4 fd ff ff e8 08 4d RSP: 0018:ffffc9001b35fa78 EFLAGS: 00010246 RAX: 0000000000000000 RBX: 00000000002879d0 RCX: ffff8881326f3b00 RDX: 0000000000000000 RSI: ffff8881326f3b00 RDI: 0000000000000002 RBP: ffff888179662674 R08: ffffffff87e983a0 R09: 0000000000000000 R10: 0000000000000005 R11: 00000000000004ea R12: ffff888179662400 R13: ffff888179662428 R14: 0000000000000001 R15: ffff88817e38e258 FS: 0000000000000000(0000) GS:ffff8881f5f00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000020007bc0 CR3: 0000000179592000 CR4: 0000000000150ee0 Call Trace: <TASK> __sk_destruct+0x4f/0x8e0 net/core/sock.c:2067 sk_destruct+0xbd/0xe0 net/core/sock.c:2112 __sk_free+0xef/0x3d0 net/core/sock.c:2123 sk_free+0x78/0xa0 net/core/sock.c:2134 sock_put include/net/sock.h:1927 [inline] __mptcp_close_ssk+0x50f/0x780 net/mptcp/protocol.c:2351 __mptcp_destroy_sock+0x332/0x760 net/mptcp/protocol.c:2828 mptcp_worker+0x5d2/0xc90 net/mptcp/protocol.c:2586 process_one_work+0x9cc/0x1650 kernel/workqueue.c:2289 worker_thread+0x623/0x1070 kernel/workqueue.c:2436 kthread+0x2e9/0x3a0 kernel/kthread.c:376 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:302 </TASK> The root cause of the problem is that an mptcp-level (re)transmit can race with mptcp_close() and the packet scheduler checks the subflow state before acquiring the socket lock: we can try to (re)transmit on an already closed ssk. Fix the issue checking again the subflow socket status under the subflow socket lock protection. Additionally add the missing check for the fallback-to-tcp case. Fixes: d5f4919 ("mptcp: allow picking different xmit subflows") Reported-by: Dipanjan Das <mail.dipanjan.das@gmail.com> Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net> [mheyne@: Fixed a couple of conflicts due to some missing commit: - Commit d9ca1de ("mptcp: move page frag allocation in mptcp_sendmsg()") reworked a couple of functions. Before this commit there used to be a check for EAGAIN after mptcp_sendmsg_frag. This is equivalent to the "continue" from this backport and therefore is dropped. Other commits that affected the context: - 5cf92bb ("mptcp: re-enable sndbuf autotune") - b713d00 ("mptcp: really share subflow snd_wnd") - f70cad1 ("mptcp: stop relying on tcp_tx_skb_cache")] Signed-off-by: Maximilian Heyne <mheyne@amazon.de>
mngyadam
pushed a commit
that referenced
this pull request
Feb 2, 2026
commit 96611c2 upstream. As reported by syzbot and experienced by Pavel, using cpus_read_lock() in wake_up_all_idle_cpus() generates lock inversion (against mmap_sem and possibly others). Instead, shrink the preempt disable region by iterating all CPUs and checking the online status for each individual CPU while having preemption disabled. Fixes: 8850cb6 ("sched: Simplify wake_up_*idle*()") Reported-by: syzbot+d5b23b18d2f4feae8a67@syzkaller.appspotmail.com Reported-by: Pavel Machek <pavel@ucw.cz> Reported-by: Qian Cai <quic_qiancai@quicinc.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Qian Cai <quic_qiancai@quicinc.com> (cherry picked from commit 96611c2)
mngyadam
pushed a commit
that referenced
this pull request
Feb 2, 2026
[ Upstream commit 711741f ] Fix cifs_signal_cifsd_for_reconnect() to take the correct lock order and prevent the following deadlock from happening ====================================================== WARNING: possible circular locking dependency detected 6.16.0-rc3-build2+ #1301 Tainted: G S W ------------------------------------------------------ cifsd/6055 is trying to acquire lock: ffff88810ad56038 (&tcp_ses->srv_lock){+.+.}-{3:3}, at: cifs_signal_cifsd_for_reconnect+0x134/0x200 but task is already holding lock: ffff888119c64330 (&ret_buf->chan_lock){+.+.}-{3:3}, at: cifs_signal_cifsd_for_reconnect+0xcf/0x200 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #2 (&ret_buf->chan_lock){+.+.}-{3:3}: validate_chain+0x1cf/0x270 __lock_acquire+0x60e/0x780 lock_acquire.part.0+0xb4/0x1f0 _raw_spin_lock+0x2f/0x40 cifs_setup_session+0x81/0x4b0 cifs_get_smb_ses+0x771/0x900 cifs_mount_get_session+0x7e/0x170 cifs_mount+0x92/0x2d0 cifs_smb3_do_mount+0x161/0x460 smb3_get_tree+0x55/0x90 vfs_get_tree+0x46/0x180 do_new_mount+0x1b0/0x2e0 path_mount+0x6ee/0x740 do_mount+0x98/0xe0 __do_sys_mount+0x148/0x180 do_syscall_64+0xa4/0x260 entry_SYSCALL_64_after_hwframe+0x76/0x7e -> #1 (&ret_buf->ses_lock){+.+.}-{3:3}: validate_chain+0x1cf/0x270 __lock_acquire+0x60e/0x780 lock_acquire.part.0+0xb4/0x1f0 _raw_spin_lock+0x2f/0x40 cifs_match_super+0x101/0x320 sget+0xab/0x270 cifs_smb3_do_mount+0x1e0/0x460 smb3_get_tree+0x55/0x90 vfs_get_tree+0x46/0x180 do_new_mount+0x1b0/0x2e0 path_mount+0x6ee/0x740 do_mount+0x98/0xe0 __do_sys_mount+0x148/0x180 do_syscall_64+0xa4/0x260 entry_SYSCALL_64_after_hwframe+0x76/0x7e -> #0 (&tcp_ses->srv_lock){+.+.}-{3:3}: check_noncircular+0x95/0xc0 check_prev_add+0x115/0x2f0 validate_chain+0x1cf/0x270 __lock_acquire+0x60e/0x780 lock_acquire.part.0+0xb4/0x1f0 _raw_spin_lock+0x2f/0x40 cifs_signal_cifsd_for_reconnect+0x134/0x200 __cifs_reconnect+0x8f/0x500 cifs_handle_standard+0x112/0x280 cifs_demultiplex_thread+0x64d/0xbc0 kthread+0x2f7/0x310 ret_from_fork+0x2a/0x230 ret_from_fork_asm+0x1a/0x30 other info that might help us debug this: Chain exists of: &tcp_ses->srv_lock --> &ret_buf->ses_lock --> &ret_buf->chan_lock Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(&ret_buf->chan_lock); lock(&ret_buf->ses_lock); lock(&ret_buf->chan_lock); lock(&tcp_ses->srv_lock); *** DEADLOCK *** 3 locks held by cifsd/6055: #0: ffffffff857de398 (&cifs_tcp_ses_lock){+.+.}-{3:3}, at: cifs_signal_cifsd_for_reconnect+0x7b/0x200 #1: ffff888119c64060 (&ret_buf->ses_lock){+.+.}-{3:3}, at: cifs_signal_cifsd_for_reconnect+0x9c/0x200 #2: ffff888119c64330 (&ret_buf->chan_lock){+.+.}-{3:3}, at: cifs_signal_cifsd_for_reconnect+0xcf/0x200 Cc: linux-cifs@vger.kernel.org Reported-by: David Howells <dhowells@redhat.com> Fixes: d7d7a66 ("cifs: avoid use of global locks for high contention data") Reviewed-by: David Howells <dhowells@redhat.com> Tested-by: David Howells <dhowells@redhat.com> Signed-off-by: Paulo Alcantara (Red Hat) <pc@manguebit.org> Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Steve French <stfrench@microsoft.com> Signed-off-by: Sasha Levin <sashal@kernel.org> [v6.1: replaced SERVER_IS_CHAN with CIFS_SERVER_IS_CHAN; resolved contextual conflicts in cifs_signal_cifsd_for_reconnect()] Signed-off-by: Shaoying Xu <shaoyi@amazon.com>
mngyadam
pushed a commit
that referenced
this pull request
Feb 2, 2026
Aishwarya reports that warnings are sometimes seen when running the ftrace kselftests, e.g. | WARNING: CPU: 5 PID: 2066 at arch/arm64/kernel/stacktrace.c:141 arch_stack_walk+0x4a0/0x4c0 | Modules linked in: | CPU: 5 UID: 0 PID: 2066 Comm: ftracetest Not tainted 6.13.0-rc2 #2 | Hardware name: linux,dummy-virt (DT) | pstate: 604000c5 (nZCv daIF +PAN -UAO -TCO -DIT -SSBS BTYPE=--) | pc : arch_stack_walk+0x4a0/0x4c0 | lr : arch_stack_walk+0x248/0x4c0 | sp : ffff800083643d20 | x29: ffff800083643dd0 x28: ffff00007b891400 x27: ffff00007b891928 | x26: 0000000000000001 x25: 00000000000000c0 x24: ffff800082f39d80 | x23: ffff80008003ee8c x22: ffff80008004baa8 x21: ffff8000800533e0 | x20: ffff800083643e10 x19: ffff80008003eec8 x18: 0000000000000000 | x17: 0000000000000000 x16: ffff800083640000 x15: 0000000000000000 | x14: 02a37a802bbb8a92 x13: 00000000000001a9 x12: 0000000000000001 | x11: ffff800082ffad60 x10: ffff800083643d20 x9 : ffff80008003eed0 | x8 : ffff80008004baa8 x7 : ffff800086f2be80 x6 : ffff0000057cf000 | x5 : 0000000000000000 x4 : 0000000000000000 x3 : ffff800086f2b690 | x2 : ffff80008004baa8 x1 : ffff80008004baa8 x0 : ffff80008004baa8 | Call trace: | arch_stack_walk+0x4a0/0x4c0 (P) | arch_stack_walk+0x248/0x4c0 (L) | profile_pc+0x44/0x80 | profile_tick+0x50/0x80 (F) | tick_nohz_handler+0xcc/0x160 (F) | __hrtimer_run_queues+0x2ac/0x340 (F) | hrtimer_interrupt+0xf4/0x268 (F) | arch_timer_handler_virt+0x34/0x60 (F) | handle_percpu_devid_irq+0x88/0x220 (F) | generic_handle_domain_irq+0x34/0x60 (F) | gic_handle_irq+0x54/0x140 (F) | call_on_irq_stack+0x24/0x58 (F) | do_interrupt_handler+0x88/0x98 | el1_interrupt+0x34/0x68 (F) | el1h_64_irq_handler+0x18/0x28 | el1h_64_irq+0x6c/0x70 | queued_spin_lock_slowpath+0x78/0x460 (P) The warning in question is: WARN_ON_ONCE(state->common.pc == orig_pc)) ... in kunwind_recover_return_address(), which is triggered when return_to_handler() is encountered in the trace, but ftrace_graph_ret_addr() cannot find a corresponding original return address on the fgraph return stack. This happens because the stacktrace code encounters an exception boundary where the LR was not live at the time of the exception, but the LR happens to contain return_to_handler(); either because the task recently returned there, or due to unfortunate usage of the LR at a scratch register. In such cases attempts to recover the return address via ftrace_graph_ret_addr() may fail, triggering the WARN_ON_ONCE() above and aborting the unwind (hence the stacktrace terminating after reporting the PC at the time of the exception). Handling unreliable LR values in these cases is likely to require some larger rework, so for the moment avoid this problem by restoring the old behaviour of skipping the LR at exception boundaries, which the stacktrace code did prior to commit: c2c6b27 ("arm64: stacktrace: unwind exception boundaries") This commit is effectively a partial revert, keeping the structures and logic to explicitly identify exception boundaries while still skipping reporting of the LR. The logic to explicitly identify exception boundaries is still useful for general robustness and as a building block for future support for RELIABLE_STACKTRACE. Fixes: c2c6b27 ("arm64: stacktrace: unwind exception boundaries") Signed-off-by: Mark Rutland <mark.rutland@arm.com> Reported-by: Aishwarya TCV <aishwarya.tcv@arm.com> Cc: Will Deacon <will@kernel.org> Link: https://lore.kernel.org/r/20241211140704.2498712-2-mark.rutland@arm.com Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
mngyadam
pushed a commit
that referenced
this pull request
Feb 2, 2026
The arm64 stacktrace code has a few error conditions where a
WARN_ON_ONCE() is triggered before the stacktrace is terminated and an
error is returned to the caller. The conditions shouldn't be triggered
when unwinding the current task, but it is possible to trigger these
when unwinding another task which is not blocked, as the stack of that
task is concurrently modified. Kent reports that these warnings can be
triggered while running filesystem tests on bcachefs, which calls the
stacktrace code directly.
To produce a meaningful stacktrace of another task, the task in question
should be blocked, but the stacktrace code is expected to be robust to
cases where it is not blocked. Note that this is purely about not
unuduly scaring the user and/or crashing the kernel; stacktraces in such
cases are meaningless and may leak kernel secrets from the stack of the
task being unwound.
Ideally we'd pin the task in a blocked state during the unwind, as we do
for /proc/${PID}/wchan since commit:
42a20f8 ("sched: Add wrapper for get_wchan() to keep task blocked")
... but a bunch of places don't do that, notably /proc/${PID}/stack,
where we don't pin the task in a blocked state, but do restrict the
output to privileged users since commit:
f8a00ce ("proc: restrict kernel stack dumps to root")
... and so it's possible to trigger these warnings accidentally, e.g. by
reading /proc/*/stack (as root):
| for n in $(seq 1 10); do
| while true; do cat /proc/*/stack > /dev/null 2>&1; done &
| done
| ------------[ cut here ]------------
| WARNING: CPU: 3 PID: 166 at arch/arm64/kernel/stacktrace.c:207 arch_stack_walk+0x1c8/0x370
| Modules linked in:
| CPU: 3 UID: 0 PID: 166 Comm: cat Not tainted 6.13.0-rc2-00003-g3dafa7a7925d #2
| Hardware name: linux,dummy-virt (DT)
| pstate: 81400005 (Nzcv daif +PAN -UAO -TCO +DIT -SSBS BTYPE=--)
| pc : arch_stack_walk+0x1c8/0x370
| lr : arch_stack_walk+0x1b0/0x370
| sp : ffff800080773890
| x29: ffff800080773930 x28: fff0000005c44500 x27: fff00000058fa038
| x26: 000000007ffff000 x25: 0000000000000000 x24: 0000000000000000
| x23: ffffa35a8d9600ec x22: 0000000000000000 x21: fff00000043a33c0
| x20: ffff800080773970 x19: ffffa35a8d960168 x18: 0000000000000000
| x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000000
| x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000000
| x11: 0000000000000000 x10: 0000000000000000 x9 : 0000000000000000
| x8 : ffff8000807738e0 x7 : ffff8000806e3800 x6 : ffff8000806e3818
| x5 : ffff800080773920 x4 : ffff8000806e4000 x3 : ffff8000807738e0
| x2 : 0000000000000018 x1 : ffff8000806e3800 x0 : 0000000000000000
| Call trace:
| arch_stack_walk+0x1c8/0x370 (P)
| stack_trace_save_tsk+0x8c/0x108
| proc_pid_stack+0xb0/0x134
| proc_single_show+0x60/0x120
| seq_read_iter+0x104/0x438
| seq_read+0xf8/0x140
| vfs_read+0xc4/0x31c
| ksys_read+0x70/0x108
| __arm64_sys_read+0x1c/0x28
| invoke_syscall+0x48/0x104
| el0_svc_common.constprop.0+0x40/0xe0
| do_el0_svc+0x1c/0x28
| el0_svc+0x30/0xcc
| el0t_64_sync_handler+0x10c/0x138
| el0t_64_sync+0x198/0x19c
| ---[ end trace 0000000000000000 ]---
Fix this by only warning when unwinding the current task. When unwinding
another task the error conditions will be handled by returning an error
without producing a warning.
The two warnings in kunwind_next_frame_record_meta() were added recently
as part of commit:
c2c6b27 ("arm64: stacktrace: unwind exception boundaries")
The warning when recovering the fgraph return address has changed form
many times, but was originally introduced back in commit:
9f41631 ("arm64: fix unwind_frame() for filtered out fn for function graph tracing")
Fixes: c2c6b27 ("arm64: stacktrace: unwind exception boundaries")
Fixes: 9f41631 ("arm64: fix unwind_frame() for filtered out fn for function graph tracing")
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Reported-by: Kent Overstreet <kent.overstreet@linux.dev>
Cc: Kees Cook <keescook@chromium.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20241211140704.2498712-3-mark.rutland@arm.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Issue #, if available:
vewe-richard#1
Description of changes:
Do not call generic_file_buffered_read() directly, call generic_file_read_iter() which will check number of iov to read, and return immediately when the number of iov is zero.
Else, generic_file_buffered_read() will fall into an infinite loop, eat up all cpu resources, cause system lockup.