-
Notifications
You must be signed in to change notification settings - Fork 6
crypto: ecdsa: fixup boot hang (fedora38-hardware). #5
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Currently only a single default signature encoding is supported per akcipher. A subsequent commit will allow a second encoding for ecdsa, namely P1363 alternatively to X9.62. To accommodate for that, amend struct akcipher_request and struct crypto_akcipher_sync_data to store the desired signature encoding for verify and sign ops. Amend akcipher_request_set_crypt(), crypto_sig_verify() and crypto_sig_sign() with an additional parameter which specifies the desired signature encoding. Adjust all callers. Signed-off-by: Lukas Wunner <lukas@wunner.de>
Alternatively to the X9.62 encoding of ecdsa signatures, which uses ASN.1 and is already supported by the kernel, there's another common encoding called P1363. It stores r and s as the concatenation of two big endian, unsigned integers. The name originates from IEEE P1363. The Security Protocol and Data Model (SPDM) specification prescribes that ecdsa signatures are encoded according to P1363: "For ECDSA signatures, excluding SM2, in SPDM, the signature shall be the concatenation of r and s. The size of r shall be the size of the selected curve. Likewise, the size of s shall be the size of the selected curve. See BaseAsymAlgo in NEGOTIATE_ALGORITHMS for the size of r and s. The byte order for r and s shall be in big endian order. When placing ECDSA signatures into an SPDM signature field, r shall come first followed by s." (SPDM 1.2.1 margin no 44, https://www.dmtf.org/sites/default/files/standards/documents/DSP0274_1.2.1.pdf) A subsequent commit introduces an SPDM library to enable PCI device authentication, so add support for P1363 ecdsa signature verification. Signed-off-by: Lukas Wunner <lukas@wunner.de>
The Security Protocol and Data Model (SPDM) allows for authentication, measurement, key exchange and encrypted sessions with devices. A commonly used term for authentication and measurement is attestation. SPDM was conceived by the Distributed Management Task Force (DMTF). Its specification defines messages, data objects and sequences for performing exchanges with devices over various transports and physical media: https://www.dmtf.org/dsp/DSP0274 This implementation supports SPDM 1.0 through 1.3 (the latest version). It is designed to be transport-agnostic as the kernel already contains two different mechanisms which can serve as transports: * PCIe Data Object Exchange (PCIe r6.1 sec 6.30, drivers/pci/doe.c) * Management Component Transport Protocol (MCTP, Documentation/networking/mctp.rst) Use cases for SPDM include, but are not limited to: * PCIe Component Measurement and Authentication (PCIe r6.1 sec 6.31) * Compute Express Link (CXL r3.0 sec 14.11.6) * Open Compute Project (Attestation of System Components r1.0) https://www.opencompute.org/documents/attestation-v1-0-20201104-pdf The initial focus of this implementation is enabling PCIe CMA device authentication. As such, only a subset of the SPDM specification is contained herein, namely the exchanges GET_VERSION, GET_CAPABILITIES, NEGOTIATE_ALGORITHMS, GET_DIGESTS, GET_CERTIFICATE and CHALLENGE. The set of supported algorithms is limited to those declared mandatory in PCIe r6.1 sec 6.31.3. Adding more algorithms is straightforward as long as the crypto subsystem supports them. Future commits will extend this implementation with support for measurement, key exchange and encrypted sessions. So far, only the SPDM requester role is implemented. Care was taken to allow for effortless addition of the responder role at a later stage. This could be needed for a PCIe host bridge operating in endpoint mode. The responder role will be able to reuse struct definitions and helpers such as spdm_create_combined_prefix(). Those can be moved to spdm_common.{h,c} files upon introduction of the responder role. For now, all is kept in a single source file to avoid polluting the global namespace with unnecessary symbols. Credits: Jonathan wrote a proof-of-concept of this SPDM implementation. Lukas reworked it for upstream. Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Signed-off-by: Lukas Wunner <lukas@wunner.de> [to be deleted:] There are a large number of open questions around how we do this that need to be resolved. These include: * Key chain management - Current approach is to use a keychain provide as part of per transport initialization for the root certificates which are assumed to be loaded into that keychain, perhaps in an initrd script.
Component Measurement and Authentication (CMA, PCIe r6.1 sec 6.31) allows for measurement and authentication of PCIe devices. It is based on the Security Protocol and Data Model specification (SPDM, https://www.dmtf.org/dsp/DSP0274). CMA-SPDM in turn forms the basis for Integrity and Data Encryption (IDE, PCIe r6.1 sec 6.33) because the key material used by IDE is exchanged over a CMA-SPDM session. As a first step, authenticate CMA-capable devices on enumeration. A subsequent commit will expose the result in sysfs. When allocating SPDM session state with spdm_create(), the maximum SPDM message length needs to be passed. Make the PCI_DOE_MAX_LENGTH macro public and calculate the maximum payload length from it. Credits: Jonathan wrote a proof-of-concept of this CMA implementation. Lukas reworked it for upstream. Wilfred contributed fixes for issues discovered during testing. Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Signed-off-by: Wilfred Mallawa <wilfred.mallawa@wdc.com> Signed-off-by: Lukas Wunner <lukas@wunner.de>
PCIe r6.1 sec 6.31.3 stipulates requirements for X.509 Leaf Certificates presented by devices, in particular the presence of a Subject Alternative Name extension with a name that encodes the Vendor ID, Device ID, Device Serial Number, etc. This prevents a mismatch between the device identity in Config Space and the certificate. A device cannot misappropriate a certificate from a different device without also spoofing Config Space. As a corollary, it cannot dupe an arbitrary driver into binding to it. (Only those which bind to the device identity in the Subject Alternative Name work.) Parse the Subject Alternative Name using a small ASN.1 module and validate its contents. The theory of operation is explained in a code comment at the top of the newly added cma-x509.c. This functionality is introduced in a separate commit on top of basic CMA-SPDM support to split the code into digestible, reviewable chunks. The CMA OID added here is taken from the official OID Repository (it's not documented in the PCIe Base Spec): https://oid-rep.orange-labs.fr/get/2.23.147 Signed-off-by: Lukas Wunner <lukas@wunner.de>
CMA-SPDM state is lost when a device undergoes a Conventional Reset. (But not a Function Level Reset, PCIe r6.1 sec 6.6.2.) A D3cold to D0 transition implies a Conventional Reset (PCIe r6.1 sec 5.8). Thus, reauthenticate devices on resume from D3cold and on recovery from a Secondary Bus Reset or DPC-induced Hot Reset. The requirement to reauthenticate devices on resume from system sleep (and in the future reestablish IDE encryption) is the reason why SPDM needs to be in-kernel: During ->resume_noirq, which is the first phase after system sleep, the PCI core walks down the hierarchy, puts each device in D0, restores its config space and invokes the driver's ->resume_noirq callback. The driver is afforded the right to access the device already during this phase. To retain this usage model in the face of authentication and encryption, CMA-SPDM reauthentication and IDE reestablishment must happen during the ->resume_noirq phase, before the driver's first access to the device. The driver is thus afforded seamless authenticated and encrypted access until the last moment before suspend and from the first moment after resume. During the ->resume_noirq phase, device interrupts are not yet enabled. It is thus impossible to defer CMA-SPDM reauthentication to a user space component on an attached disk or on the network, making an in-kernel SPDM implementation mandatory. The same catch-22 exists on recovery from a Conventional Reset: A user space SPDM implementation might live on a device which underwent reset, rendering its execution impossible. Signed-off-by: Lukas Wunner <lukas@wunner.de>
The PCI core has just been amended to authenticate CMA-capable devices on enumeration and store the result in an "authenticated" bit in struct pci_dev->spdm_state. Expose the bit to user space through an eponymous sysfs attribute. Allow user space to trigger reauthentication (e.g. after it has updated the CMA keyring) by writing to the sysfs attribute. Subject to further discussion, a future commit might add a user-defined policy to forbid driver binding to devices which failed authentication, similar to the "authorized" attribute for USB. Alternatively, authentication success might be signaled to user space through a uevent, whereupon it may bind a (blacklisted) driver. A uevent signaling authentication failure might similarly cause user space to unbind or outright remove the potentially malicious device. Traffic from devices which failed authentication could also be filtered through ACS I/O Request Blocking Enable (PCIe r6.1 sec 7.7.11.3) or through Link Disable (PCIe r6.1 sec 7.5.3.7). Unlike an IOMMU, that will not only protect the host, but also prevent malicious peer-to-peer traffic to other devices. Signed-off-by: Lukas Wunner <lukas@wunner.de>
At any given time, only a single entity in a physical system may have an SPDM connection to a device. That's because the GET_VERSION request (which begins an authentication sequence) resets "the connection and all context associated with that connection" (SPDM 1.3.0 margin no 158). Thus, when a device is passed through to a guest and the guest has authenticated it, a subsequent authentication by the host would reset the device's CMA-SPDM session behind the guest's back. Prevent by letting the guest claim exclusive CMA ownership of the device during passthrough. Refuse CMA reauthentication on the host as long. After passthrough has concluded, reauthenticate the device on the host. Store the flag indicating guest ownership in struct pci_dev's priv_flags to avoid the concurrency issues observed by commit 44bda4b ("PCI: Fix is_added/is_busmaster race condition"). Side note: The Data Object Exchange r1.1 ECN (published Oct 11 2022) retrofits DOE with Connection IDs. In theory these allow simultaneous CMA-SPDM connections by multiple entities to the same device. But the first hardware generation capable of CMA-SPDM only supports DOE r1.0. The specification also neglects to reserve unique Connection IDs for hosts and guests, which further limits its usefulness. In general, forcing the transport to compensate for SPDM's lack of a connection identifier feels like a questionable layering violation. Signed-off-by: Lukas Wunner <lukas@wunner.de> Cc: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Signed-off-by: Lukas Wunner <lukas@wunner.de>
Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Signed-off-by: Lukas Wunner <lukas@wunner.de>
Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Signed-off-by: Lukas Wunner <lukas@wunner.de>
Signed-off-by: Lukas Wunner <lukas@wunner.de>
When testing on Fedora38 (x86-HW), I ran into a hang during the boot process. Which seemed to be caused by a NULL dereference of the encoding of the signature `req->enc`. This patch adds a NULL check, that now allows the boot to proceed but I suppose it fails the ecdsa_verify. Also switch to `strncmp()` to be 'safer'. Squish into commit: 09d3a2c Signed-off-by: Wilfred Mallawa <wilfred.mallawa@wdc.com>
Did you get a stacktrace when the kernel oopsed or just a hang without screen output or logs? If you did not get a stacktrace, could you add a Thanks! |
It was basically a hang during the boot process.
Sure, this is the result:
It looks like
|
I neglected to update the ECDSA test vectors in I've amended the test vectors now and that should fix the issue. I've added a debug patch (crypto: ecdsa - Assume x962 encoding by default) which emits a WARN splat if ecdsa verification is attempted without specifying the signature encoding. In theory you should never see the WARN splat. Just so that you don't experience another boot hang in case I've missed a code path where no encoding is specified. About Thanks again! |
Awesome! thanks for the update! |
bbe5fa8
to
93559c5
Compare
closing this, issue is addressed. |
The driver creates /sys/kernel/debug/dri/0/mob_ttm even when the corresponding ttm_resource_manager is not allocated. This leads to a crash when trying to read from this file. Add a check to create mob_ttm, system_mob_ttm, and gmr_ttm debug file only when the corresponding ttm_resource_manager is allocated. crash> bt PID: 3133409 TASK: ffff8fe4834a5000 CPU: 3 COMMAND: "grep" #0 [ffffb954506b3b20] machine_kexec at ffffffffb2a6bec3 #1 [ffffb954506b3b78] __crash_kexec at ffffffffb2bb598a #2 [ffffb954506b3c38] crash_kexec at ffffffffb2bb68c1 #3 [ffffb954506b3c50] oops_end at ffffffffb2a2a9b1 #4 [ffffb954506b3c70] no_context at ffffffffb2a7e913 #5 [ffffb954506b3cc8] __bad_area_nosemaphore at ffffffffb2a7ec8c #6 [ffffb954506b3d10] do_page_fault at ffffffffb2a7f887 torvalds#7 [ffffb954506b3d40] page_fault at ffffffffb360116e [exception RIP: ttm_resource_manager_debug+0x11] RIP: ffffffffc04afd11 RSP: ffffb954506b3df0 RFLAGS: 00010246 RAX: ffff8fe41a6d1200 RBX: 0000000000000000 RCX: 0000000000000940 RDX: 0000000000000000 RSI: ffffffffc04b4338 RDI: 0000000000000000 RBP: ffffb954506b3e08 R8: ffff8fee3ffad000 R9: 0000000000000000 R10: ffff8fe41a76a000 R11: 0000000000000001 R12: 00000000ffffffff R13: 0000000000000001 R14: ffff8fe5bb6f3900 R15: ffff8fe41a6d1200 ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018 torvalds#8 [ffffb954506b3e00] ttm_resource_manager_show at ffffffffc04afde7 [ttm] torvalds#9 [ffffb954506b3e30] seq_read at ffffffffb2d8f9f3 RIP: 00007f4c4eda8985 RSP: 00007ffdbba9e9f8 RFLAGS: 00000246 RAX: ffffffffffffffda RBX: 000000000037e000 RCX: 00007f4c4eda8985 RDX: 000000000037e000 RSI: 00007f4c41573000 RDI: 0000000000000003 RBP: 000000000037e000 R8: 0000000000000000 R9: 000000000037fe30 R10: 0000000000000000 R11: 0000000000000246 R12: 00007f4c41573000 R13: 0000000000000003 R14: 00007f4c41572010 R15: 0000000000000003 ORIG_RAX: 0000000000000000 CS: 0033 SS: 002b Signed-off-by: Jocelyn Falempe <jfalempe@redhat.com> Fixes: af4a25b ("drm/vmwgfx: Add debugfs entries for various ttm resource managers") Cc: <stable@vger.kernel.org> Reviewed-by: Zack Rusin <zack.rusin@broadcom.com> Link: https://patchwork.freedesktop.org/patch/msgid/20240312093551.196609-1-jfalempe@redhat.com
…git/netfilter/nf Pablo Neira Ayuso says: ==================== Netfilter fixes for net The following patchset contains Netfilter fixes for net: Patch #1 unlike early commit path stage which triggers a call to abort, an explicit release of the batch is required on abort, otherwise mutex is released and commit_list remains in place. Patch #2 release mutex after nft_gc_seq_end() in commit path, otherwise async GC worker could collect expired objects. Patch #3 flush pending destroy work in module removal path, otherwise UaF is possible. Patch #4 and #6 restrict the table dormant flag with basechain updates to fix state inconsistency in the hook registration. Patch #5 adds missing RCU read side lock to flowtable type to avoid races with module removal. * tag 'nf-24-04-04' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf: netfilter: nf_tables: discard table flag update with pending basechain deletion netfilter: nf_tables: Fix potential data-race in __nft_flowtable_type_get() netfilter: nf_tables: reject new basechain after table flag update netfilter: nf_tables: flush pending destroy work before exit_net release netfilter: nf_tables: release mutex after nft_gc_seq_end from abort path netfilter: nf_tables: release batch on table validation from abort path ==================== Link: https://lore.kernel.org/r/20240404104334.1627-1-pablo@netfilter.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
…git/netfilter/nf netfilter pull request 24-04-11 Pablo Neira Ayuso says: ==================== Netfilter fixes for net The following patchset contains Netfilter fixes for net: Patches #1 and #2 add missing rcu read side lock when iterating over expression and object type list which could race with module removal. Patch #3 prevents promisc packet from visiting the bridge/input hook to amend a recent fix to address conntrack confirmation race in br_netfilter and nf_conntrack_bridge. Patch #4 adds and uses iterate decorator type to fetch the current pipapo set backend datastructure view when netlink dumps the set elements. Patch #5 fixes removal of duplicate elements in the pipapo set backend. Patch #6 flowtable validates pppoe header before accessing it. Patch torvalds#7 fixes flowtable datapath for pppoe packets, otherwise lookup fails and pppoe packets follow classic path. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
…git/netfilter/nf Pablo Neira Ayuso says: ==================== Netfilter fixes for net The following patchset contains Netfilter fixes for net: Patch #1 syzbot reports that nf_reinject() could be called without rcu_read_lock() when flushing pending packets at nfnetlink queue removal, from Eric Dumazet. Patch #2 flushes ipset list:set when canceling garbage collection to reference to other lists to fix a race, from Jozsef Kadlecsik. Patch #3 restores q-in-q matching with nft_payload by reverting f6ae9f1 ("netfilter: nft_payload: add C-VLAN support"). Patch #4 fixes vlan mangling in skbuff when vlan offload is present in skbuff, without this patch nft_payload corrupts packets in this case. Patch #5 fixes possible nul-deref in tproxy no IP address is found in netdevice, reported by syzbot and patch from Florian Westphal. Patch #6 removes a superfluous restriction which prevents loose fib lookups from input and forward hooks, from Eric Garver. My assessment is that patches #1, #2 and #5 address possible kernel crash, anything else in this batch fixes broken features. netfilter pull request 24-05-29 * tag 'nf-24-05-29' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf: netfilter: nft_fib: allow from forward/input without iif selector netfilter: tproxy: bail out if IP has been disabled on the device netfilter: nft_payload: skbuff vlan metadata mangle support netfilter: nft_payload: restore vlan q-in-q match support netfilter: ipset: Add list flush to cancel_gc netfilter: nfnetlink_queue: acquire rcu_read_lock() in instance_destroy_rcu() ==================== Link: https://lore.kernel.org/r/20240528225519.1155786-1-pablo@netfilter.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>
With commit c4cb231 ("iommu/amd: Add support for enable/disable IOPF") we are hitting below issue. This happens because in IOPF enablement path it holds spin lock with irq disable and then tries to take mutex lock. dmesg: ----- [ 0.938739] ============================= [ 0.938740] [ BUG: Invalid wait context ] [ 0.938742] 6.10.0-rc1+ #1 Not tainted [ 0.938745] ----------------------------- [ 0.938746] swapper/0/1 is trying to lock: [ 0.938748] ffffffff8c9f01d8 (&port_lock_key){....}-{3:3}, at: serial8250_console_write+0x78/0x4a0 [ 0.938767] other info that might help us debug this: [ 0.938768] context-{5:5} [ 0.938769] 7 locks held by swapper/0/1: [ 0.938772] #0: ffff888101a91310 (&group->mutex){+.+.}-{4:4}, at: bus_iommu_probe+0x70/0x160 [ 0.938790] #1: ffff888101d1f1b8 (&domain->lock){....}-{3:3}, at: amd_iommu_attach_device+0xa5/0x700 [ 0.938799] #2: ffff888101cc3d18 (&dev_data->lock){....}-{3:3}, at: amd_iommu_attach_device+0xc5/0x700 [ 0.938806] #3: ffff888100052830 (&iommu->lock){....}-{2:2}, at: amd_iommu_iopf_add_device+0x3f/0xa0 [ 0.938813] #4: ffffffff8945a340 (console_lock){+.+.}-{0:0}, at: _printk+0x48/0x50 [ 0.938822] #5: ffffffff8945a390 (console_srcu){....}-{0:0}, at: console_flush_all+0x58/0x4e0 [ 0.938867] #6: ffffffff82459f80 (console_owner){....}-{0:0}, at: console_flush_all+0x1f0/0x4e0 [ 0.938872] stack backtrace: [ 0.938874] CPU: 2 PID: 1 Comm: swapper/0 Not tainted 6.10.0-rc1+ #1 [ 0.938877] Hardware name: HP HP EliteBook 745 G3/807E, BIOS N73 Ver. 01.39 04/16/2019 Fix above issue by re-arranging code in attach device path: - move device PASID/IOPF enablement outside lock in AMD IOMMU driver. This is safe as core layer holds group->mutex lock before calling iommu_ops->attach_dev. Reported-by: Borislav Petkov <bp@alien8.de> Reported-by: Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com> Reported-by: Chris Bainbridge <chris.bainbridge@gmail.com> Fixes: c4cb231 ("iommu/amd: Add support for enable/disable IOPF") Tested-by: Borislav Petkov <bp@alien8.de> Tested-by: Chris Bainbridge <chris.bainbridge@gmail.com> Tested-by: Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com> Signed-off-by: Vasant Hegde <vasant.hegde@amd.com> Link: https://lore.kernel.org/r/20240530084801.10758-1-vasant.hegde@amd.com Signed-off-by: Joerg Roedel <jroedel@suse.de>
…PLES event" This reverts commit 7d1405c. This causes segfaults in some cases, as reported by Milian: ``` sudo /usr/bin/perf record -z --call-graph dwarf -e cycles -e raw_syscalls:sys_enter ls ... [ perf record: Woken up 3 times to write data ] malloc(): invalid next size (unsorted) Aborted ``` Backtrace with GDB + debuginfod: ``` malloc(): invalid next size (unsorted) Thread 1 "perf" received signal SIGABRT, Aborted. __pthread_kill_implementation (threadid=<optimized out>, signo=signo@entry=6, no_tid=no_tid@entry=0) at pthread_kill.c:44 Downloading source file /usr/src/debug/glibc/glibc/nptl/pthread_kill.c 44 return INTERNAL_SYSCALL_ERROR_P (ret) ? INTERNAL_SYSCALL_ERRNO (ret) : 0; (gdb) bt #0 __pthread_kill_implementation (threadid=<optimized out>, signo=signo@entry=6, no_tid=no_tid@entry=0) at pthread_kill.c:44 #1 0x00007ffff6ea8eb3 in __pthread_kill_internal (threadid=<optimized out>, signo=6) at pthread_kill.c:78 #2 0x00007ffff6e50a30 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/ raise.c:26 #3 0x00007ffff6e384c3 in __GI_abort () at abort.c:79 #4 0x00007ffff6e39354 in __libc_message_impl (fmt=fmt@entry=0x7ffff6fc22ea "%s\n") at ../sysdeps/posix/libc_fatal.c:132 #5 0x00007ffff6eb3085 in malloc_printerr (str=str@entry=0x7ffff6fc5850 "malloc(): invalid next size (unsorted)") at malloc.c:5772 #6 0x00007ffff6eb657c in _int_malloc (av=av@entry=0x7ffff6ff6ac0 <main_arena>, bytes=bytes@entry=368) at malloc.c:4081 torvalds#7 0x00007ffff6eb877e in __libc_calloc (n=<optimized out>, elem_size=<optimized out>) at malloc.c:3754 torvalds#8 0x000055555569bdb6 in perf_session.do_write_header () torvalds#9 0x00005555555a373a in __cmd_record.constprop.0 () torvalds#10 0x00005555555a6846 in cmd_record () torvalds#11 0x000055555564db7f in run_builtin () torvalds#12 0x000055555558ed77 in main () ``` Valgrind memcheck: ``` ==45136== Invalid write of size 8 ==45136== at 0x2B38A5: perf_event__synthesize_id_sample (in /usr/bin/perf) ==45136== by 0x157069: __cmd_record.constprop.0 (in /usr/bin/perf) ==45136== by 0x15A845: cmd_record (in /usr/bin/perf) ==45136== by 0x201B7E: run_builtin (in /usr/bin/perf) ==45136== by 0x142D76: main (in /usr/bin/perf) ==45136== Address 0x6a866a8 is 0 bytes after a block of size 40 alloc'd ==45136== at 0x4849BF3: calloc (vg_replace_malloc.c:1675) ==45136== by 0x3574AB: zalloc (in /usr/bin/perf) ==45136== by 0x1570E0: __cmd_record.constprop.0 (in /usr/bin/perf) ==45136== by 0x15A845: cmd_record (in /usr/bin/perf) ==45136== by 0x201B7E: run_builtin (in /usr/bin/perf) ==45136== by 0x142D76: main (in /usr/bin/perf) ==45136== ==45136== Syscall param write(buf) points to unaddressable byte(s) ==45136== at 0x575953D: __libc_write (write.c:26) ==45136== by 0x575953D: write (write.c:24) ==45136== by 0x35761F: ion (in /usr/bin/perf) ==45136== by 0x357778: writen (in /usr/bin/perf) ==45136== by 0x1548F7: record__write (in /usr/bin/perf) ==45136== by 0x15708A: __cmd_record.constprop.0 (in /usr/bin/perf) ==45136== by 0x15A845: cmd_record (in /usr/bin/perf) ==45136== by 0x201B7E: run_builtin (in /usr/bin/perf) ==45136== by 0x142D76: main (in /usr/bin/perf) ==45136== Address 0x6a866a8 is 0 bytes after a block of size 40 alloc'd ==45136== at 0x4849BF3: calloc (vg_replace_malloc.c:1675) ==45136== by 0x3574AB: zalloc (in /usr/bin/perf) ==45136== by 0x1570E0: __cmd_record.constprop.0 (in /usr/bin/perf) ==45136== by 0x15A845: cmd_record (in /usr/bin/perf) ==45136== by 0x201B7E: run_builtin (in /usr/bin/perf) ==45136== by 0x142D76: main (in /usr/bin/perf) ==45136== ----- Closes: https://lore.kernel.org/linux-perf-users/23879991.0LEYPuXRzz@milian-workstation/ Reported-by: Milian Wolff <milian.wolff@kdab.com> Tested-by: Milian Wolff <milian.wolff@kdab.com> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Ian Rogers <irogers@google.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Kan Liang <kan.liang@linux.intel.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: stable@kernel.org # 6.8+ Link: https://lore.kernel.org/lkml/Zl9ksOlHJHnKM70p@x1 Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
We have been seeing crashes on duplicate keys in btrfs_set_item_key_safe(): BTRFS critical (device vdb): slot 4 key (450 108 8192) new key (450 108 8192) ------------[ cut here ]------------ kernel BUG at fs/btrfs/ctree.c:2620! invalid opcode: 0000 [#1] PREEMPT SMP PTI CPU: 0 PID: 3139 Comm: xfs_io Kdump: loaded Not tainted 6.9.0 #6 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-2.fc40 04/01/2014 RIP: 0010:btrfs_set_item_key_safe+0x11f/0x290 [btrfs] With the following stack trace: #0 btrfs_set_item_key_safe (fs/btrfs/ctree.c:2620:4) #1 btrfs_drop_extents (fs/btrfs/file.c:411:4) #2 log_one_extent (fs/btrfs/tree-log.c:4732:9) #3 btrfs_log_changed_extents (fs/btrfs/tree-log.c:4955:9) #4 btrfs_log_inode (fs/btrfs/tree-log.c:6626:9) #5 btrfs_log_inode_parent (fs/btrfs/tree-log.c:7070:8) #6 btrfs_log_dentry_safe (fs/btrfs/tree-log.c:7171:8) torvalds#7 btrfs_sync_file (fs/btrfs/file.c:1933:8) torvalds#8 vfs_fsync_range (fs/sync.c:188:9) torvalds#9 vfs_fsync (fs/sync.c:202:9) torvalds#10 do_fsync (fs/sync.c:212:9) torvalds#11 __do_sys_fdatasync (fs/sync.c:225:9) torvalds#12 __se_sys_fdatasync (fs/sync.c:223:1) torvalds#13 __x64_sys_fdatasync (fs/sync.c:223:1) torvalds#14 do_syscall_x64 (arch/x86/entry/common.c:52:14) torvalds#15 do_syscall_64 (arch/x86/entry/common.c:83:7) torvalds#16 entry_SYSCALL_64+0xaf/0x14c (arch/x86/entry/entry_64.S:121) So we're logging a changed extent from fsync, which is splitting an extent in the log tree. But this split part already exists in the tree, triggering the BUG(). This is the state of the log tree at the time of the crash, dumped with drgn (https://github.com/osandov/drgn/blob/main/contrib/btrfs_tree.py) to get more details than btrfs_print_leaf() gives us: >>> print_extent_buffer(prog.crashed_thread().stack_trace()[0]["eb"]) leaf 33439744 level 0 items 72 generation 9 owner 18446744073709551610 leaf 33439744 flags 0x100000000000000 fs uuid e5bd3946-400c-4223-8923-190ef1f18677 chunk uuid d58cb17e-6d02-494a-829a-18b7d8a399da item 0 key (450 INODE_ITEM 0) itemoff 16123 itemsize 160 generation 7 transid 9 size 8192 nbytes 8473563889606862198 block group 0 mode 100600 links 1 uid 0 gid 0 rdev 0 sequence 204 flags 0x10(PREALLOC) atime 1716417703.220000000 (2024-05-22 15:41:43) ctime 1716417704.983333333 (2024-05-22 15:41:44) mtime 1716417704.983333333 (2024-05-22 15:41:44) otime 17592186044416.000000000 (559444-03-08 01:40:16) item 1 key (450 INODE_REF 256) itemoff 16110 itemsize 13 index 195 namelen 3 name: 193 item 2 key (450 XATTR_ITEM 1640047104) itemoff 16073 itemsize 37 location key (0 UNKNOWN.0 0) type XATTR transid 7 data_len 1 name_len 6 name: user.a data a item 3 key (450 EXTENT_DATA 0) itemoff 16020 itemsize 53 generation 9 type 1 (regular) extent data disk byte 303144960 nr 12288 extent data offset 0 nr 4096 ram 12288 extent compression 0 (none) item 4 key (450 EXTENT_DATA 4096) itemoff 15967 itemsize 53 generation 9 type 2 (prealloc) prealloc data disk byte 303144960 nr 12288 prealloc data offset 4096 nr 8192 item 5 key (450 EXTENT_DATA 8192) itemoff 15914 itemsize 53 generation 9 type 2 (prealloc) prealloc data disk byte 303144960 nr 12288 prealloc data offset 8192 nr 4096 ... So the real problem happened earlier: notice that items 4 (4k-12k) and 5 (8k-12k) overlap. Both are prealloc extents. Item 4 straddles i_size and item 5 starts at i_size. Here is the state of the filesystem tree at the time of the crash: >>> root = prog.crashed_thread().stack_trace()[2]["inode"].root >>> ret, nodes, slots = btrfs_search_slot(root, BtrfsKey(450, 0, 0)) >>> print_extent_buffer(nodes[0]) leaf 30425088 level 0 items 184 generation 9 owner 5 leaf 30425088 flags 0x100000000000000 fs uuid e5bd3946-400c-4223-8923-190ef1f18677 chunk uuid d58cb17e-6d02-494a-829a-18b7d8a399da ... item 179 key (450 INODE_ITEM 0) itemoff 4907 itemsize 160 generation 7 transid 7 size 4096 nbytes 12288 block group 0 mode 100600 links 1 uid 0 gid 0 rdev 0 sequence 6 flags 0x10(PREALLOC) atime 1716417703.220000000 (2024-05-22 15:41:43) ctime 1716417703.220000000 (2024-05-22 15:41:43) mtime 1716417703.220000000 (2024-05-22 15:41:43) otime 1716417703.220000000 (2024-05-22 15:41:43) item 180 key (450 INODE_REF 256) itemoff 4894 itemsize 13 index 195 namelen 3 name: 193 item 181 key (450 XATTR_ITEM 1640047104) itemoff 4857 itemsize 37 location key (0 UNKNOWN.0 0) type XATTR transid 7 data_len 1 name_len 6 name: user.a data a item 182 key (450 EXTENT_DATA 0) itemoff 4804 itemsize 53 generation 9 type 1 (regular) extent data disk byte 303144960 nr 12288 extent data offset 0 nr 8192 ram 12288 extent compression 0 (none) item 183 key (450 EXTENT_DATA 8192) itemoff 4751 itemsize 53 generation 9 type 2 (prealloc) prealloc data disk byte 303144960 nr 12288 prealloc data offset 8192 nr 4096 Item 5 in the log tree corresponds to item 183 in the filesystem tree, but nothing matches item 4. Furthermore, item 183 is the last item in the leaf. btrfs_log_prealloc_extents() is responsible for logging prealloc extents beyond i_size. It first truncates any previously logged prealloc extents that start beyond i_size. Then, it walks the filesystem tree and copies the prealloc extent items to the log tree. If it hits the end of a leaf, then it calls btrfs_next_leaf(), which unlocks the tree and does another search. However, while the filesystem tree is unlocked, an ordered extent completion may modify the tree. In particular, it may insert an extent item that overlaps with an extent item that was already copied to the log tree. This may manifest in several ways depending on the exact scenario, including an EEXIST error that is silently translated to a full sync, overlapping items in the log tree, or this crash. This particular crash is triggered by the following sequence of events: - Initially, the file has i_size=4k, a regular extent from 0-4k, and a prealloc extent beyond i_size from 4k-12k. The prealloc extent item is the last item in its B-tree leaf. - The file is fsync'd, which copies its inode item and both extent items to the log tree. - An xattr is set on the file, which sets the BTRFS_INODE_COPY_EVERYTHING flag. - The range 4k-8k in the file is written using direct I/O. i_size is extended to 8k, but the ordered extent is still in flight. - The file is fsync'd. Since BTRFS_INODE_COPY_EVERYTHING is set, this calls copy_inode_items_to_log(), which calls btrfs_log_prealloc_extents(). - btrfs_log_prealloc_extents() finds the 4k-12k prealloc extent in the filesystem tree. Since it starts before i_size, it skips it. Since it is the last item in its B-tree leaf, it calls btrfs_next_leaf(). - btrfs_next_leaf() unlocks the path. - The ordered extent completion runs, which converts the 4k-8k part of the prealloc extent to written and inserts the remaining prealloc part from 8k-12k. - btrfs_next_leaf() does a search and finds the new prealloc extent 8k-12k. - btrfs_log_prealloc_extents() copies the 8k-12k prealloc extent into the log tree. Note that it overlaps with the 4k-12k prealloc extent that was copied to the log tree by the first fsync. - fsync calls btrfs_log_changed_extents(), which tries to log the 4k-8k extent that was written. - This tries to drop the range 4k-8k in the log tree, which requires adjusting the start of the 4k-12k prealloc extent in the log tree to 8k. - btrfs_set_item_key_safe() sees that there is already an extent starting at 8k in the log tree and calls BUG(). Fix this by detecting when we're about to insert an overlapping file extent item in the log tree and truncating the part that would overlap. CC: stable@vger.kernel.org # 6.1+ Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>
The syzbot fuzzer found that the interrupt-URB completion callback in the cdc-wdm driver was taking too long, and the driver's immediate resubmission of interrupt URBs with -EPROTO status combined with the dummy-hcd emulation to cause a CPU lockup: cdc_wdm 1-1:1.0: nonzero urb status received: -71 cdc_wdm 1-1:1.0: wdm_int_callback - 0 bytes watchdog: BUG: soft lockup - CPU#0 stuck for 26s! [syz-executor782:6625] CPU#0 Utilization every 4s during lockup: #1: 98% system, 0% softirq, 3% hardirq, 0% idle #2: 98% system, 0% softirq, 3% hardirq, 0% idle #3: 98% system, 0% softirq, 3% hardirq, 0% idle #4: 98% system, 0% softirq, 3% hardirq, 0% idle #5: 98% system, 1% softirq, 3% hardirq, 0% idle Modules linked in: irq event stamp: 73096 hardirqs last enabled at (73095): [<ffff80008037bc00>] console_emit_next_record kernel/printk/printk.c:2935 [inline] hardirqs last enabled at (73095): [<ffff80008037bc00>] console_flush_all+0x650/0xb74 kernel/printk/printk.c:2994 hardirqs last disabled at (73096): [<ffff80008af10b00>] __el1_irq arch/arm64/kernel/entry-common.c:533 [inline] hardirqs last disabled at (73096): [<ffff80008af10b00>] el1_interrupt+0x24/0x68 arch/arm64/kernel/entry-common.c:551 softirqs last enabled at (73048): [<ffff8000801ea530>] softirq_handle_end kernel/softirq.c:400 [inline] softirqs last enabled at (73048): [<ffff8000801ea530>] handle_softirqs+0xa60/0xc34 kernel/softirq.c:582 softirqs last disabled at (73043): [<ffff800080020de8>] __do_softirq+0x14/0x20 kernel/softirq.c:588 CPU: 0 PID: 6625 Comm: syz-executor782 Tainted: G W 6.10.0-rc2-syzkaller-g8867bbd4a056 #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 04/02/2024 Testing showed that the problem did not occur if the two error messages -- the first two lines above -- were removed; apparently adding material to the kernel log takes a surprisingly large amount of time. In any case, the best approach for preventing these lockups and to avoid spamming the log with thousands of error messages per second is to ratelimit the two dev_err() calls. Therefore we replace them with dev_err_ratelimited(). Signed-off-by: Alan Stern <stern@rowland.harvard.edu> Suggested-by: Greg KH <gregkh@linuxfoundation.org> Reported-and-tested-by: syzbot+5f996b83575ef4058638@syzkaller.appspotmail.com Closes: https://lore.kernel.org/linux-usb/00000000000073d54b061a6a1c65@google.com/ Reported-and-tested-by: syzbot+1b2abad17596ad03dcff@syzkaller.appspotmail.com Closes: https://lore.kernel.org/linux-usb/000000000000f45085061aa9b37e@google.com/ Fixes: 9908a32 ("USB: remove err() macro from usb class drivers") Link: https://lore.kernel.org/linux-usb/40dfa45b-5f21-4eef-a8c1-51a2f320e267@rowland.harvard.edu/ Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/29855215-52f5-4385-b058-91f42c2bee18@rowland.harvard.edu Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Luis has been reporting an assert failure when freeing an inode cluster during inode inactivation for a while. The assert looks like: XFS: Assertion failed: bp->b_flags & XBF_DONE, file: fs/xfs/xfs_trans_buf.c, line: 241 ------------[ cut here ]------------ kernel BUG at fs/xfs/xfs_message.c:102! Oops: invalid opcode: 0000 [#1] PREEMPT SMP KASAN NOPTI CPU: 4 PID: 73 Comm: kworker/4:1 Not tainted 6.10.0-rc1 #4 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014 Workqueue: xfs-inodegc/loop5 xfs_inodegc_worker [xfs] RIP: 0010:assfail (fs/xfs/xfs_message.c:102) xfs RSP: 0018:ffff88810188f7f0 EFLAGS: 00010202 RAX: 0000000000000000 RBX: ffff88816e748250 RCX: 1ffffffff844b0e7 RDX: 0000000000000004 RSI: ffff88810188f558 RDI: ffffffffc2431fa0 RBP: 1ffff11020311f01 R08: 0000000042431f9f R09: ffffed1020311e9b R10: ffff88810188f4df R11: ffffffffac725d70 R12: ffff88817a3f4000 R13: ffff88812182f000 R14: ffff88810188f998 R15: ffffffffc2423f80 FS: 0000000000000000(0000) GS:ffff8881c8400000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 000055fe9d0f109c CR3: 000000014426c002 CR4: 0000000000770ef0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400 PKRU: 55555554 Call Trace: <TASK> xfs_trans_read_buf_map (fs/xfs/xfs_trans_buf.c:241 (discriminator 1)) xfs xfs_imap_to_bp (fs/xfs/xfs_trans.h:210 fs/xfs/libxfs/xfs_inode_buf.c:138) xfs xfs_inode_item_precommit (fs/xfs/xfs_inode_item.c:145) xfs xfs_trans_run_precommits (fs/xfs/xfs_trans.c:931) xfs __xfs_trans_commit (fs/xfs/xfs_trans.c:966) xfs xfs_inactive_ifree (fs/xfs/xfs_inode.c:1811) xfs xfs_inactive (fs/xfs/xfs_inode.c:2013) xfs xfs_inodegc_worker (fs/xfs/xfs_icache.c:1841 fs/xfs/xfs_icache.c:1886) xfs process_one_work (kernel/workqueue.c:3231) worker_thread (kernel/workqueue.c:3306 (discriminator 2) kernel/workqueue.c:3393 (discriminator 2)) kthread (kernel/kthread.c:389) ret_from_fork (arch/x86/kernel/process.c:147) ret_from_fork_asm (arch/x86/entry/entry_64.S:257) </TASK> And occurs when the the inode precommit handlers is attempt to look up the inode cluster buffer to attach the inode for writeback. The trail of logic that I can reconstruct is as follows. 1. the inode is clean when inodegc runs, so it is not attached to a cluster buffer when precommit runs. 2. #1 implies the inode cluster buffer may be clean and not pinned by dirty inodes when inodegc runs. 3. #2 implies that the inode cluster buffer can be reclaimed by memory pressure at any time. 4. The assert failure implies that the cluster buffer was attached to the transaction, but not marked done. It had been accessed earlier in the transaction, but not marked done. 5. #4 implies the cluster buffer has been invalidated (i.e. marked stale). 6. #5 implies that the inode cluster buffer was instantiated uninitialised in the transaction in xfs_ifree_cluster(), which only instantiates the buffers to invalidate them and never marks them as done. Given factors 1-3, this issue is highly dependent on timing and environmental factors. Hence the issue can be very difficult to reproduce in some situations, but highly reliable in others. Luis has an environment where it can be reproduced easily by g/531 but, OTOH, I've reproduced it only once in ~2000 cycles of g/531. I think the fix is to have xfs_ifree_cluster() set the XBF_DONE flag on the cluster buffers, even though they may not be initialised. The reasons why I think this is safe are: 1. A buffer cache lookup hit on a XBF_STALE buffer will clear the XBF_DONE flag. Hence all future users of the buffer know they have to re-initialise the contents before use and mark it done themselves. 2. xfs_trans_binval() sets the XFS_BLI_STALE flag, which means the buffer remains locked until the journal commit completes and the buffer is unpinned. Hence once marked XBF_STALE/XFS_BLI_STALE by xfs_ifree_cluster(), the only context that can access the freed buffer is the currently running transaction. 3. #2 implies that future buffer lookups in the currently running transaction will hit the transaction match code and not the buffer cache. Hence XBF_STALE and XFS_BLI_STALE will not be cleared unless the transaction initialises and logs the buffer with valid contents again. At which point, the buffer will be marked marked XBF_DONE again, so having XBF_DONE already set on the stale buffer is a moot point. 4. #2 also implies that any concurrent access to that cluster buffer will block waiting on the buffer lock until the inode cluster has been fully freed and is no longer an active inode cluster buffer. 5. #4 + #1 means that any future user of the disk range of that buffer will always see the range of disk blocks covered by the cluster buffer as not done, and hence must initialise the contents themselves. 6. Setting XBF_DONE in xfs_ifree_cluster() then means the unlinked inode precommit code will see a XBF_DONE buffer from the transaction match as it expects. It can then attach the stale but newly dirtied inode to the stale but newly dirtied cluster buffer without unexpected failures. The stale buffer will then sail through the journal and do the right thing with the attached stale inode during unpin. Hence the fix is just one line of extra code. The explanation of why we have to set XBF_DONE in xfs_ifree_cluster, OTOH, is long and complex.... Fixes: 82842fe ("xfs: fix AGF vs inode cluster buffer deadlock") Signed-off-by: Dave Chinner <dchinner@redhat.com> Tested-by: Luis Chamberlain <mcgrof@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
…git/netfilter/nf Pablo Neira Ayuso says: ==================== Netfilter fixes for net The following patchset contains Netfilter fixes for net: Patch #1 fixes the suspicious RCU usage warning that resulted from the recent fix for the race between namespace cleanup and gc in ipset left out checking the pernet exit phase when calling rcu_dereference_protected(), from Jozsef Kadlecsik. Patch #2 fixes incorrect input and output netdevice in SRv6 prerouting hooks, from Jianguo Wu. Patch #3 moves nf_hooks_lwtunnel sysctl toggle to the netfilter core. The connection tracking system is loaded on-demand, this ensures availability of this knob regardless. Patch #4-#5 adds selftests for SRv6 netfilter hooks also from Jianguo Wu. netfilter pull request 24-06-19 * tag 'nf-24-06-19' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf: selftests: add selftest for the SRv6 End.DX6 behavior with netfilter selftests: add selftest for the SRv6 End.DX4 behavior with netfilter netfilter: move the sysctl nf_hooks_lwtunnel into the netfilter core seg6: fix parameter passing when calling NF_HOOK() in End.DX4 and End.DX6 behaviors netfilter: ipset: Fix suspicious rcu_dereference_protected() ==================== Link: https://lore.kernel.org/r/20240619170537.2846-1-pablo@netfilter.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>
…play During inode logging (and log replay too), we are holding a transaction handle and we often need to call btrfs_iget(), which will read an inode from its subvolume btree if it's not loaded in memory and that results in allocating an inode with GFP_KERNEL semantics at the btrfs_alloc_inode() callback - and this may recurse into the filesystem in case we are under memory pressure and attempt to commit the current transaction, resulting in a deadlock since the logging (or log replay) task is holding a transaction handle open. Syzbot reported this with the following stack traces: WARNING: possible circular locking dependency detected 6.10.0-rc2-syzkaller-00361-g061d1af7b030 #0 Not tainted ------------------------------------------------------ syz-executor.1/9919 is trying to acquire lock: ffffffff8dd3aac0 (fs_reclaim){+.+.}-{0:0}, at: might_alloc include/linux/sched/mm.h:334 [inline] ffffffff8dd3aac0 (fs_reclaim){+.+.}-{0:0}, at: slab_pre_alloc_hook mm/slub.c:3891 [inline] ffffffff8dd3aac0 (fs_reclaim){+.+.}-{0:0}, at: slab_alloc_node mm/slub.c:3981 [inline] ffffffff8dd3aac0 (fs_reclaim){+.+.}-{0:0}, at: kmem_cache_alloc_lru_noprof+0x58/0x2f0 mm/slub.c:4020 but task is already holding lock: ffff88804b569358 (&ei->log_mutex){+.+.}-{3:3}, at: btrfs_log_inode+0x39c/0x4660 fs/btrfs/tree-log.c:6481 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #3 (&ei->log_mutex){+.+.}-{3:3}: __mutex_lock_common kernel/locking/mutex.c:608 [inline] __mutex_lock+0x175/0x9c0 kernel/locking/mutex.c:752 btrfs_log_inode+0x39c/0x4660 fs/btrfs/tree-log.c:6481 btrfs_log_inode_parent+0x8cb/0x2a90 fs/btrfs/tree-log.c:7079 btrfs_log_dentry_safe+0x59/0x80 fs/btrfs/tree-log.c:7180 btrfs_sync_file+0x9c1/0xe10 fs/btrfs/file.c:1959 vfs_fsync_range+0x141/0x230 fs/sync.c:188 generic_write_sync include/linux/fs.h:2794 [inline] btrfs_do_write_iter+0x584/0x10c0 fs/btrfs/file.c:1705 new_sync_write fs/read_write.c:497 [inline] vfs_write+0x6b6/0x1140 fs/read_write.c:590 ksys_write+0x12f/0x260 fs/read_write.c:643 do_syscall_32_irqs_on arch/x86/entry/common.c:165 [inline] __do_fast_syscall_32+0x73/0x120 arch/x86/entry/common.c:386 do_fast_syscall_32+0x32/0x80 arch/x86/entry/common.c:411 entry_SYSENTER_compat_after_hwframe+0x84/0x8e -> #2 (btrfs_trans_num_extwriters){++++}-{0:0}: join_transaction+0x164/0xf40 fs/btrfs/transaction.c:315 start_transaction+0x427/0x1a70 fs/btrfs/transaction.c:700 btrfs_commit_super+0xa1/0x110 fs/btrfs/disk-io.c:4170 close_ctree+0xcb0/0xf90 fs/btrfs/disk-io.c:4324 generic_shutdown_super+0x159/0x3d0 fs/super.c:642 kill_anon_super+0x3a/0x60 fs/super.c:1226 btrfs_kill_super+0x3b/0x50 fs/btrfs/super.c:2096 deactivate_locked_super+0xbe/0x1a0 fs/super.c:473 deactivate_super+0xde/0x100 fs/super.c:506 cleanup_mnt+0x222/0x450 fs/namespace.c:1267 task_work_run+0x14e/0x250 kernel/task_work.c:180 resume_user_mode_work include/linux/resume_user_mode.h:50 [inline] exit_to_user_mode_loop kernel/entry/common.c:114 [inline] exit_to_user_mode_prepare include/linux/entry-common.h:328 [inline] __syscall_exit_to_user_mode_work kernel/entry/common.c:207 [inline] syscall_exit_to_user_mode+0x278/0x2a0 kernel/entry/common.c:218 __do_fast_syscall_32+0x80/0x120 arch/x86/entry/common.c:389 do_fast_syscall_32+0x32/0x80 arch/x86/entry/common.c:411 entry_SYSENTER_compat_after_hwframe+0x84/0x8e -> #1 (btrfs_trans_num_writers){++++}-{0:0}: __lock_release kernel/locking/lockdep.c:5468 [inline] lock_release+0x33e/0x6c0 kernel/locking/lockdep.c:5774 percpu_up_read include/linux/percpu-rwsem.h:99 [inline] __sb_end_write include/linux/fs.h:1650 [inline] sb_end_intwrite include/linux/fs.h:1767 [inline] __btrfs_end_transaction+0x5ca/0x920 fs/btrfs/transaction.c:1071 btrfs_commit_inode_delayed_inode+0x228/0x330 fs/btrfs/delayed-inode.c:1301 btrfs_evict_inode+0x960/0xe80 fs/btrfs/inode.c:5291 evict+0x2ed/0x6c0 fs/inode.c:667 iput_final fs/inode.c:1741 [inline] iput.part.0+0x5a8/0x7f0 fs/inode.c:1767 iput+0x5c/0x80 fs/inode.c:1757 dentry_unlink_inode+0x295/0x480 fs/dcache.c:400 __dentry_kill+0x1d0/0x600 fs/dcache.c:603 dput.part.0+0x4b1/0x9b0 fs/dcache.c:845 dput+0x1f/0x30 fs/dcache.c:835 ovl_stack_put+0x60/0x90 fs/overlayfs/util.c:132 ovl_destroy_inode+0xc6/0x190 fs/overlayfs/super.c:182 destroy_inode+0xc4/0x1b0 fs/inode.c:311 iput_final fs/inode.c:1741 [inline] iput.part.0+0x5a8/0x7f0 fs/inode.c:1767 iput+0x5c/0x80 fs/inode.c:1757 dentry_unlink_inode+0x295/0x480 fs/dcache.c:400 __dentry_kill+0x1d0/0x600 fs/dcache.c:603 shrink_kill fs/dcache.c:1048 [inline] shrink_dentry_list+0x140/0x5d0 fs/dcache.c:1075 prune_dcache_sb+0xeb/0x150 fs/dcache.c:1156 super_cache_scan+0x32a/0x550 fs/super.c:221 do_shrink_slab+0x44f/0x11c0 mm/shrinker.c:435 shrink_slab_memcg mm/shrinker.c:548 [inline] shrink_slab+0xa87/0x1310 mm/shrinker.c:626 shrink_one+0x493/0x7c0 mm/vmscan.c:4790 shrink_many mm/vmscan.c:4851 [inline] lru_gen_shrink_node+0x89f/0x1750 mm/vmscan.c:4951 shrink_node mm/vmscan.c:5910 [inline] kswapd_shrink_node mm/vmscan.c:6720 [inline] balance_pgdat+0x1105/0x1970 mm/vmscan.c:6911 kswapd+0x5ea/0xbf0 mm/vmscan.c:7180 kthread+0x2c1/0x3a0 kernel/kthread.c:389 ret_from_fork+0x45/0x80 arch/x86/kernel/process.c:147 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:244 -> #0 (fs_reclaim){+.+.}-{0:0}: check_prev_add kernel/locking/lockdep.c:3134 [inline] check_prevs_add kernel/locking/lockdep.c:3253 [inline] validate_chain kernel/locking/lockdep.c:3869 [inline] __lock_acquire+0x2478/0x3b30 kernel/locking/lockdep.c:5137 lock_acquire kernel/locking/lockdep.c:5754 [inline] lock_acquire+0x1b1/0x560 kernel/locking/lockdep.c:5719 __fs_reclaim_acquire mm/page_alloc.c:3801 [inline] fs_reclaim_acquire+0x102/0x160 mm/page_alloc.c:3815 might_alloc include/linux/sched/mm.h:334 [inline] slab_pre_alloc_hook mm/slub.c:3891 [inline] slab_alloc_node mm/slub.c:3981 [inline] kmem_cache_alloc_lru_noprof+0x58/0x2f0 mm/slub.c:4020 btrfs_alloc_inode+0x118/0xb20 fs/btrfs/inode.c:8411 alloc_inode+0x5d/0x230 fs/inode.c:261 iget5_locked fs/inode.c:1235 [inline] iget5_locked+0x1c9/0x2c0 fs/inode.c:1228 btrfs_iget_locked fs/btrfs/inode.c:5590 [inline] btrfs_iget_path fs/btrfs/inode.c:5607 [inline] btrfs_iget+0xfb/0x230 fs/btrfs/inode.c:5636 add_conflicting_inode fs/btrfs/tree-log.c:5657 [inline] copy_inode_items_to_log+0x1039/0x1e30 fs/btrfs/tree-log.c:5928 btrfs_log_inode+0xa48/0x4660 fs/btrfs/tree-log.c:6592 log_new_delayed_dentries fs/btrfs/tree-log.c:6363 [inline] btrfs_log_inode+0x27dd/0x4660 fs/btrfs/tree-log.c:6718 btrfs_log_all_parents fs/btrfs/tree-log.c:6833 [inline] btrfs_log_inode_parent+0x22ba/0x2a90 fs/btrfs/tree-log.c:7141 btrfs_log_dentry_safe+0x59/0x80 fs/btrfs/tree-log.c:7180 btrfs_sync_file+0x9c1/0xe10 fs/btrfs/file.c:1959 vfs_fsync_range+0x141/0x230 fs/sync.c:188 generic_write_sync include/linux/fs.h:2794 [inline] btrfs_do_write_iter+0x584/0x10c0 fs/btrfs/file.c:1705 do_iter_readv_writev+0x504/0x780 fs/read_write.c:741 vfs_writev+0x36f/0xde0 fs/read_write.c:971 do_pwritev+0x1b2/0x260 fs/read_write.c:1072 __do_compat_sys_pwritev2 fs/read_write.c:1218 [inline] __se_compat_sys_pwritev2 fs/read_write.c:1210 [inline] __ia32_compat_sys_pwritev2+0x121/0x1b0 fs/read_write.c:1210 do_syscall_32_irqs_on arch/x86/entry/common.c:165 [inline] __do_fast_syscall_32+0x73/0x120 arch/x86/entry/common.c:386 do_fast_syscall_32+0x32/0x80 arch/x86/entry/common.c:411 entry_SYSENTER_compat_after_hwframe+0x84/0x8e other info that might help us debug this: Chain exists of: fs_reclaim --> btrfs_trans_num_extwriters --> &ei->log_mutex Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(&ei->log_mutex); lock(btrfs_trans_num_extwriters); lock(&ei->log_mutex); lock(fs_reclaim); *** DEADLOCK *** 7 locks held by syz-executor.1/9919: #0: ffff88802be20420 (sb_writers#23){.+.+}-{0:0}, at: do_pwritev+0x1b2/0x260 fs/read_write.c:1072 #1: ffff888065c0f8f0 (&sb->s_type->i_mutex_key#33){++++}-{3:3}, at: inode_lock include/linux/fs.h:791 [inline] #1: ffff888065c0f8f0 (&sb->s_type->i_mutex_key#33){++++}-{3:3}, at: btrfs_inode_lock+0xc8/0x110 fs/btrfs/inode.c:385 #2: ffff888065c0f778 (&ei->i_mmap_lock){++++}-{3:3}, at: btrfs_inode_lock+0xee/0x110 fs/btrfs/inode.c:388 #3: ffff88802be20610 (sb_internal#4){.+.+}-{0:0}, at: btrfs_sync_file+0x95b/0xe10 fs/btrfs/file.c:1952 #4: ffff8880546323f0 (btrfs_trans_num_writers){++++}-{0:0}, at: join_transaction+0x430/0xf40 fs/btrfs/transaction.c:290 #5: ffff888054632418 (btrfs_trans_num_extwriters){++++}-{0:0}, at: join_transaction+0x430/0xf40 fs/btrfs/transaction.c:290 #6: ffff88804b569358 (&ei->log_mutex){+.+.}-{3:3}, at: btrfs_log_inode+0x39c/0x4660 fs/btrfs/tree-log.c:6481 stack backtrace: CPU: 2 PID: 9919 Comm: syz-executor.1 Not tainted 6.10.0-rc2-syzkaller-00361-g061d1af7b030 #0 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-debian-1.16.2-1 04/01/2014 Call Trace: <TASK> __dump_stack lib/dump_stack.c:88 [inline] dump_stack_lvl+0x116/0x1f0 lib/dump_stack.c:114 check_noncircular+0x31a/0x400 kernel/locking/lockdep.c:2187 check_prev_add kernel/locking/lockdep.c:3134 [inline] check_prevs_add kernel/locking/lockdep.c:3253 [inline] validate_chain kernel/locking/lockdep.c:3869 [inline] __lock_acquire+0x2478/0x3b30 kernel/locking/lockdep.c:5137 lock_acquire kernel/locking/lockdep.c:5754 [inline] lock_acquire+0x1b1/0x560 kernel/locking/lockdep.c:5719 __fs_reclaim_acquire mm/page_alloc.c:3801 [inline] fs_reclaim_acquire+0x102/0x160 mm/page_alloc.c:3815 might_alloc include/linux/sched/mm.h:334 [inline] slab_pre_alloc_hook mm/slub.c:3891 [inline] slab_alloc_node mm/slub.c:3981 [inline] kmem_cache_alloc_lru_noprof+0x58/0x2f0 mm/slub.c:4020 btrfs_alloc_inode+0x118/0xb20 fs/btrfs/inode.c:8411 alloc_inode+0x5d/0x230 fs/inode.c:261 iget5_locked fs/inode.c:1235 [inline] iget5_locked+0x1c9/0x2c0 fs/inode.c:1228 btrfs_iget_locked fs/btrfs/inode.c:5590 [inline] btrfs_iget_path fs/btrfs/inode.c:5607 [inline] btrfs_iget+0xfb/0x230 fs/btrfs/inode.c:5636 add_conflicting_inode fs/btrfs/tree-log.c:5657 [inline] copy_inode_items_to_log+0x1039/0x1e30 fs/btrfs/tree-log.c:5928 btrfs_log_inode+0xa48/0x4660 fs/btrfs/tree-log.c:6592 log_new_delayed_dentries fs/btrfs/tree-log.c:6363 [inline] btrfs_log_inode+0x27dd/0x4660 fs/btrfs/tree-log.c:6718 btrfs_log_all_parents fs/btrfs/tree-log.c:6833 [inline] btrfs_log_inode_parent+0x22ba/0x2a90 fs/btrfs/tree-log.c:7141 btrfs_log_dentry_safe+0x59/0x80 fs/btrfs/tree-log.c:7180 btrfs_sync_file+0x9c1/0xe10 fs/btrfs/file.c:1959 vfs_fsync_range+0x141/0x230 fs/sync.c:188 generic_write_sync include/linux/fs.h:2794 [inline] btrfs_do_write_iter+0x584/0x10c0 fs/btrfs/file.c:1705 do_iter_readv_writev+0x504/0x780 fs/read_write.c:741 vfs_writev+0x36f/0xde0 fs/read_write.c:971 do_pwritev+0x1b2/0x260 fs/read_write.c:1072 __do_compat_sys_pwritev2 fs/read_write.c:1218 [inline] __se_compat_sys_pwritev2 fs/read_write.c:1210 [inline] __ia32_compat_sys_pwritev2+0x121/0x1b0 fs/read_write.c:1210 do_syscall_32_irqs_on arch/x86/entry/common.c:165 [inline] __do_fast_syscall_32+0x73/0x120 arch/x86/entry/common.c:386 do_fast_syscall_32+0x32/0x80 arch/x86/entry/common.c:411 entry_SYSENTER_compat_after_hwframe+0x84/0x8e RIP: 0023:0xf7334579 Code: b8 01 10 06 03 (...) RSP: 002b:00000000f5f265ac EFLAGS: 00000292 ORIG_RAX: 000000000000017b RAX: ffffffffffffffda RBX: 0000000000000004 RCX: 00000000200002c0 RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000000000000000 RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000292 R12: 0000000000000000 R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000 Fix this by ensuring we are under a NOFS scope whenever we call btrfs_iget() during inode logging and log replay. Reported-by: syzbot+8576cfa84070dce4d59b@syzkaller.appspotmail.com Link: https://lore.kernel.org/linux-btrfs/000000000000274a3a061abbd928@google.com/ Fixes: 712e36c ("btrfs: use GFP_KERNEL in btrfs_alloc_inode") Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
The code in ocfs2_dio_end_io_write() estimates number of necessary transaction credits using ocfs2_calc_extend_credits(). This however does not take into account that the IO could be arbitrarily large and can contain arbitrary number of extents. Extent tree manipulations do often extend the current transaction but not in all of the cases. For example if we have only single block extents in the tree, ocfs2_mark_extent_written() will end up calling ocfs2_replace_extent_rec() all the time and we will never extend the current transaction and eventually exhaust all the transaction credits if the IO contains many single block extents. Once that happens a WARN_ON(jbd2_handle_buffer_credits(handle) <= 0) is triggered in jbd2_journal_dirty_metadata() and subsequently OCFS2 aborts in response to this error. This was actually triggered by one of our customers on a heavily fragmented OCFS2 filesystem. To fix the issue make sure the transaction always has enough credits for one extent insert before each call of ocfs2_mark_extent_written(). Heming Zhao said: ------ PANIC: "Kernel panic - not syncing: OCFS2: (device dm-1): panic forced after error" PID: xxx TASK: xxxx CPU: 5 COMMAND: "SubmitThread-CA" #0 machine_kexec at ffffffff8c069932 #1 __crash_kexec at ffffffff8c1338fa #2 panic at ffffffff8c1d69b9 #3 ocfs2_handle_error at ffffffffc0c86c0c [ocfs2] #4 __ocfs2_abort at ffffffffc0c88387 [ocfs2] #5 ocfs2_journal_dirty at ffffffffc0c51e98 [ocfs2] #6 ocfs2_split_extent at ffffffffc0c27ea3 [ocfs2] torvalds#7 ocfs2_change_extent_flag at ffffffffc0c28053 [ocfs2] torvalds#8 ocfs2_mark_extent_written at ffffffffc0c28347 [ocfs2] torvalds#9 ocfs2_dio_end_io_write at ffffffffc0c2bef9 [ocfs2] torvalds#10 ocfs2_dio_end_io at ffffffffc0c2c0f5 [ocfs2] torvalds#11 dio_complete at ffffffff8c2b9fa7 torvalds#12 do_blockdev_direct_IO at ffffffff8c2bc09f torvalds#13 ocfs2_direct_IO at ffffffffc0c2b653 [ocfs2] torvalds#14 generic_file_direct_write at ffffffff8c1dcf14 torvalds#15 __generic_file_write_iter at ffffffff8c1dd07b torvalds#16 ocfs2_file_write_iter at ffffffffc0c49f1f [ocfs2] torvalds#17 aio_write at ffffffff8c2cc72e torvalds#18 kmem_cache_alloc at ffffffff8c248dde torvalds#19 do_io_submit at ffffffff8c2ccada torvalds#20 do_syscall_64 at ffffffff8c004984 torvalds#21 entry_SYSCALL_64_after_hwframe at ffffffff8c8000ba Link: https://lkml.kernel.org/r/20240617095543.6971-1-jack@suse.cz Link: https://lkml.kernel.org/r/20240614145243.8837-1-jack@suse.cz Fixes: c15471f ("ocfs2: fix sparse file & data ordering issue in direct io") Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> Reviewed-by: Heming Zhao <heming.zhao@suse.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Gang He <ghe@suse.com> Cc: Jun Piao <piaojun@huawei.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "Improve the copy of task comm", v8. Using {memcpy,strncpy,strcpy,kstrdup} to copy the task comm relies on the length of task comm. Changes in the task comm could result in a destination string that is overflow. Therefore, we should explicitly ensure the destination string is always NUL-terminated, regardless of the task comm. This approach will facilitate future extensions to the task comm. As suggested by Linus [0], we can identify all relevant code with the following git grep command: git grep 'memcpy.*->comm\>' git grep 'kstrdup.*->comm\>' git grep 'strncpy.*->comm\>' git grep 'strcpy.*->comm\>' PATCH #2~#4: memcpy PATCH #5~#6: kstrdup PATCH torvalds#7: strcpy Please note that strncpy() is not included in this series as it is being tracked by another effort. [1] This patch (of 7): We want to eliminate the use of __get_task_comm() for the following reasons: - The task_lock() is unnecessary Quoted from Linus [0]: : Since user space can randomly change their names anyway, using locking : was always wrong for readers (for writers it probably does make sense : to have some lock - although practically speaking nobody cares there : either, but at least for a writer some kind of race could have : long-term mixed results Link: https://lkml.kernel.org/r/20241007144911.27693-1-laoar.shao@gmail.com Link: https://lkml.kernel.org/r/20241007144911.27693-2-laoar.shao@gmail.com Link: https://lore.kernel.org/all/CAHk-=wivfrF0_zvf+oj6==Sh=-npJooP8chLPEfaFV0oNYTTBA@mail.gmail.com [0] Link: https://lore.kernel.org/all/CAHk-=whWtUC-AjmGJveAETKOMeMFSTwKwu99v7+b6AyHMmaDFA@mail.gmail.com/ Link: https://lore.kernel.org/all/CAHk-=wjAmmHUg6vho1KjzQi2=psR30+CogFd4aXrThr2gsiS4g@mail.gmail.com/ [0] Link: KSPP#90 [1] Signed-off-by: Yafang Shao <laoar.shao@gmail.com> Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <brauner@kernel.org> Cc: Jan Kara <jack@suse.cz> Cc: Eric Biederman <ebiederm@xmission.com> Cc: Kees Cook <keescook@chromium.org> Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com> Cc: Matus Jokay <matus.jokay@stuba.sk> Cc: Alejandro Colomar <alx@kernel.org> Cc: "Serge E. Hallyn" <serge@hallyn.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Justin Stitt <justinstitt@google.com> Cc: Steven Rostedt (Google) <rostedt@goodmis.org> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Andy Shevchenko <andy.shevchenko@gmail.com> Cc: Daniel Vetter <daniel.vetter@ffwll.ch> Cc: David Airlie <airlied@gmail.com> Cc: Eric Paris <eparis@redhat.com> Cc: James Morris <jmorris@namei.org> Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Maxime Ripard <mripard@kernel.org> Cc: Ondrej Mosnacek <omosnace@redhat.com> Cc: Paul Moore <paul@paul-moore.com> Cc: Quentin Monnet <qmo@kernel.org> Cc: Simon Horman <horms@kernel.org> Cc: Stephen Smalley <stephen.smalley.work@gmail.com> Cc: Thomas Zimmermann <tzimmermann@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "page allocation tag compression", v4. This patchset implements several improvements: 1. Gracefully handles module unloading while there are used allocations allocated from that module; 2. Provides an option to store page allocation tag references in the page flags, removing dependency on page extensions and eliminating the memory overhead from storing page allocation references (~0.2% of total system memory). This also improves page allocation performance when CONFIG_MEM_ALLOC_PROFILING is enabled by eliminating page extension lookup. Page allocation performance overhead is reduced from 41% to 5.5%. Patch #1 introduces mas_for_each_rev() helper function. Patch #2 introduces shutdown_mem_profiling() helper function to be used when disabling memory allocation profiling. Patch #3 copies module tags into virtually contiguous memory which serves two purposes: - Lets us deal with the situation when module is unloaded while there are still live allocations from that module. Since we are using a copy version of the tags we can safely unload the module. Space and gaps in this contiguous memory are managed using a maple tree. - Enables simple indexing of the tags in the later patches. Patch #4 changes the way we allocate virtually contiguous memory for module tags to reserve only vitrual area and populate physical pages only as needed at module load time. Patch #5 abstracts page allocation tag reference to simplify later changes. Patch #6 adds compression option to the sysctl.vm.mem_profiling boot parameter for storing page allocation tag references inside page flags if they fit. If the number of available page flag bits is insufficient to address all kernel allocations, memory allocation profiling gets disabled with an appropriate warning. This patch (of 6): Add mas_for_each_rev() function to iterate maple tree nodes in reverse order. Link: https://lkml.kernel.org/r/20241023170759.999909-1-surenb@google.com Link: https://lkml.kernel.org/r/20241023170759.999909-2-surenb@google.com Signed-off-by: Suren Baghdasaryan <surenb@google.com> Suggested-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@Oracle.com> Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Ard Biesheuvel <ardb@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Borislav Petkov (AMD) <bp@alien8.de> Cc: Christoph Hellwig <hch@infradead.org> Cc: Daniel Gomez <da.gomez@samsung.com> Cc: David Hildenbrand <david@redhat.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: David Rientjes <rientjes@google.com> Cc: Dennis Zhou <dennis@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Kalesh Singh <kaleshsingh@google.com> Cc: Kees Cook <keescook@chromium.org> Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: Luis Chamberlain <mcgrof@kernel.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Minchan Kim <minchan@google.com> Cc: Paul E. McKenney <paulmck@kernel.org> Cc: Petr Pavlu <petr.pavlu@suse.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Sami Tolvanen <samitolvanen@google.com> Cc: Sourav Panda <souravpanda@google.com> Cc: Steven Rostedt (Google) <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Thomas Huth <thuth@redhat.com> Cc: Uladzislau Rezki (Sony) <urezki@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Xiongwei Song <xiongwei.song@windriver.com> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Its used from trace__run(), for the 'perf trace' live mode, i.e. its strace-like, non-perf.data file processing mode, the most common one. The trace__run() function will set trace->host using machine__new_host() that is supposed to give a machine instance representing the running machine, and since we'll use perf_env__arch_strerrno() to get the right errno -> string table, we need to use machine->env, so initialize it in machine__new_host(). Before the patch: (gdb) run trace --errno-summary -a sleep 1 <SNIP> Summary of events: gvfs-afc-volume (3187), 2 events, 0.0% syscall calls errors total min avg max stddev (msec) (msec) (msec) (msec) (%) --------------- -------- ------ -------- --------- --------- --------- ------ pselect6 1 0 0.000 0.000 0.000 0.000 0.00% GUsbEventThread (3519), 2 events, 0.0% syscall calls errors total min avg max stddev (msec) (msec) (msec) (msec) (%) --------------- -------- ------ -------- --------- --------- --------- ------ poll 1 0 0.000 0.000 0.000 0.000 0.00% <SNIP> Program received signal SIGSEGV, Segmentation fault. 0x00000000005caba0 in perf_env__arch_strerrno (env=0x0, err=110) at util/env.c:478 478 if (env->arch_strerrno == NULL) (gdb) bt #0 0x00000000005caba0 in perf_env__arch_strerrno (env=0x0, err=110) at util/env.c:478 #1 0x00000000004b75d2 in thread__dump_stats (ttrace=0x14f58f0, trace=0x7fffffffa5b0, fp=0x7ffff6ff74e0 <_IO_2_1_stderr_>) at builtin-trace.c:4673 #2 0x00000000004b78bf in trace__fprintf_thread (fp=0x7ffff6ff74e0 <_IO_2_1_stderr_>, thread=0x10fa0b0, trace=0x7fffffffa5b0) at builtin-trace.c:4708 #3 0x00000000004b7ad9 in trace__fprintf_thread_summary (trace=0x7fffffffa5b0, fp=0x7ffff6ff74e0 <_IO_2_1_stderr_>) at builtin-trace.c:4747 #4 0x00000000004b656e in trace__run (trace=0x7fffffffa5b0, argc=2, argv=0x7fffffffde60) at builtin-trace.c:4456 #5 0x00000000004ba43e in cmd_trace (argc=2, argv=0x7fffffffde60) at builtin-trace.c:5487 #6 0x00000000004c0414 in run_builtin (p=0xec3068 <commands+648>, argc=5, argv=0x7fffffffde60) at perf.c:351 torvalds#7 0x00000000004c06bb in handle_internal_command (argc=5, argv=0x7fffffffde60) at perf.c:404 torvalds#8 0x00000000004c0814 in run_argv (argcp=0x7fffffffdc4c, argv=0x7fffffffdc40) at perf.c:448 torvalds#9 0x00000000004c0b5d in main (argc=5, argv=0x7fffffffde60) at perf.c:560 (gdb) After: root@number:~# perf trace -a --errno-summary sleep 1 <SNIP> pw-data-loop (2685), 1410 events, 16.0% syscall calls errors total min avg max stddev (msec) (msec) (msec) (msec) (%) --------------- -------- ------ -------- --------- --------- --------- ------ epoll_wait 188 0 983.428 0.000 5.231 15.595 8.68% ioctl 94 0 0.811 0.004 0.009 0.016 2.82% read 188 0 0.322 0.001 0.002 0.006 5.15% write 141 0 0.280 0.001 0.002 0.018 8.39% timerfd_settime 94 0 0.138 0.001 0.001 0.007 6.47% gnome-control-c (179406), 1848 events, 20.9% syscall calls errors total min avg max stddev (msec) (msec) (msec) (msec) (%) --------------- -------- ------ -------- --------- --------- --------- ------ poll 222 0 959.577 0.000 4.322 21.414 11.40% recvmsg 150 0 0.539 0.001 0.004 0.013 5.12% write 300 0 0.442 0.001 0.001 0.007 3.29% read 150 0 0.183 0.001 0.001 0.009 5.53% getpid 102 0 0.101 0.000 0.001 0.008 7.82% root@number:~# Fixes: 54373b5 ("perf env: Introduce perf_env__arch_strerrno()") Reported-by: Veronika Molnarova <vmolnaro@redhat.com> Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com> Acked-by: Veronika Molnarova <vmolnaro@redhat.com> Acked-by: Michael Petlan <mpetlan@redhat.com> Tested-by: Michael Petlan <mpetlan@redhat.com> Link: https://lore.kernel.org/r/Z0XffUgNSv_9OjOi@x1 Signed-off-by: Namhyung Kim <namhyung@kernel.org>
…s_lock For storing a value to a queue attribute, the queue_attr_store function first freezes the queue (->q_usage_counter(io)) and then acquire ->sysfs_lock. This seems not correct as the usual ordering should be to acquire ->sysfs_lock before freezing the queue. This incorrect ordering causes the following lockdep splat which we are able to reproduce always simply by accessing /sys/kernel/debug file using ls command: [ 57.597146] WARNING: possible circular locking dependency detected [ 57.597154] 6.12.0-10553-gb86545e02e8c torvalds#20 Tainted: G W [ 57.597162] ------------------------------------------------------ [ 57.597168] ls/4605 is trying to acquire lock: [ 57.597176] c00000003eb56710 (&mm->mmap_lock){++++}-{4:4}, at: __might_fault+0x58/0xc0 [ 57.597200] but task is already holding lock: [ 57.597207] c0000018e27c6810 (&sb->s_type->i_mutex_key#3){++++}-{4:4}, at: iterate_dir+0x94/0x1d4 [ 57.597226] which lock already depends on the new lock. [ 57.597233] the existing dependency chain (in reverse order) is: [ 57.597241] -> #5 (&sb->s_type->i_mutex_key#3){++++}-{4:4}: [ 57.597255] down_write+0x6c/0x18c [ 57.597264] start_creating+0xb4/0x24c [ 57.597274] debugfs_create_dir+0x2c/0x1e8 [ 57.597283] blk_register_queue+0xec/0x294 [ 57.597292] add_disk_fwnode+0x2e4/0x548 [ 57.597302] brd_alloc+0x2c8/0x338 [ 57.597309] brd_init+0x100/0x178 [ 57.597317] do_one_initcall+0x88/0x3e4 [ 57.597326] kernel_init_freeable+0x3cc/0x6e0 [ 57.597334] kernel_init+0x34/0x1cc [ 57.597342] ret_from_kernel_user_thread+0x14/0x1c [ 57.597350] -> #4 (&q->debugfs_mutex){+.+.}-{4:4}: [ 57.597362] __mutex_lock+0xfc/0x12a0 [ 57.597370] blk_register_queue+0xd4/0x294 [ 57.597379] add_disk_fwnode+0x2e4/0x548 [ 57.597388] brd_alloc+0x2c8/0x338 [ 57.597395] brd_init+0x100/0x178 [ 57.597402] do_one_initcall+0x88/0x3e4 [ 57.597410] kernel_init_freeable+0x3cc/0x6e0 [ 57.597418] kernel_init+0x34/0x1cc [ 57.597426] ret_from_kernel_user_thread+0x14/0x1c [ 57.597434] -> #3 (&q->sysfs_lock){+.+.}-{4:4}: [ 57.597446] __mutex_lock+0xfc/0x12a0 [ 57.597454] queue_attr_store+0x9c/0x110 [ 57.597462] sysfs_kf_write+0x70/0xb0 [ 57.597471] kernfs_fop_write_iter+0x1b0/0x2ac [ 57.597480] vfs_write+0x3dc/0x6e8 [ 57.597488] ksys_write+0x84/0x140 [ 57.597495] system_call_exception+0x130/0x360 [ 57.597504] system_call_common+0x160/0x2c4 [ 57.597516] -> #2 (&q->q_usage_counter(io)torvalds#21){++++}-{0:0}: [ 57.597530] __submit_bio+0x5ec/0x828 [ 57.597538] submit_bio_noacct_nocheck+0x1e4/0x4f0 [ 57.597547] iomap_readahead+0x2a0/0x448 [ 57.597556] xfs_vm_readahead+0x28/0x3c [ 57.597564] read_pages+0x88/0x41c [ 57.597571] page_cache_ra_unbounded+0x1ac/0x2d8 [ 57.597580] filemap_get_pages+0x188/0x984 [ 57.597588] filemap_read+0x13c/0x4bc [ 57.597596] xfs_file_buffered_read+0x88/0x17c [ 57.597605] xfs_file_read_iter+0xac/0x158 [ 57.597614] vfs_read+0x2d4/0x3b4 [ 57.597622] ksys_read+0x84/0x144 [ 57.597629] system_call_exception+0x130/0x360 [ 57.597637] system_call_common+0x160/0x2c4 [ 57.597647] -> #1 (mapping.invalidate_lock#2){++++}-{4:4}: [ 57.597661] down_read+0x6c/0x220 [ 57.597669] filemap_fault+0x870/0x100c [ 57.597677] xfs_filemap_fault+0xc4/0x18c [ 57.597684] __do_fault+0x64/0x164 [ 57.597693] __handle_mm_fault+0x1274/0x1dac [ 57.597702] handle_mm_fault+0x248/0x484 [ 57.597711] ___do_page_fault+0x428/0xc0c [ 57.597719] hash__do_page_fault+0x30/0x68 [ 57.597727] do_hash_fault+0x90/0x35c [ 57.597736] data_access_common_virt+0x210/0x220 [ 57.597745] _copy_from_user+0xf8/0x19c [ 57.597754] sel_write_load+0x178/0xd54 [ 57.597762] vfs_write+0x108/0x6e8 [ 57.597769] ksys_write+0x84/0x140 [ 57.597777] system_call_exception+0x130/0x360 [ 57.597785] system_call_common+0x160/0x2c4 [ 57.597794] -> #0 (&mm->mmap_lock){++++}-{4:4}: [ 57.597806] __lock_acquire+0x17cc/0x2330 [ 57.597814] lock_acquire+0x138/0x400 [ 57.597822] __might_fault+0x7c/0xc0 [ 57.597830] filldir64+0xe8/0x390 [ 57.597839] dcache_readdir+0x80/0x2d4 [ 57.597846] iterate_dir+0xd8/0x1d4 [ 57.597855] sys_getdents64+0x88/0x2d4 [ 57.597864] system_call_exception+0x130/0x360 [ 57.597872] system_call_common+0x160/0x2c4 [ 57.597881] other info that might help us debug this: [ 57.597888] Chain exists of: &mm->mmap_lock --> &q->debugfs_mutex --> &sb->s_type->i_mutex_key#3 [ 57.597905] Possible unsafe locking scenario: [ 57.597911] CPU0 CPU1 [ 57.597917] ---- ---- [ 57.597922] rlock(&sb->s_type->i_mutex_key#3); [ 57.597932] lock(&q->debugfs_mutex); [ 57.597940] lock(&sb->s_type->i_mutex_key#3); [ 57.597950] rlock(&mm->mmap_lock); [ 57.597958] *** DEADLOCK *** [ 57.597965] 2 locks held by ls/4605: [ 57.597971] #0: c0000000137c12f8 (&f->f_pos_lock){+.+.}-{4:4}, at: fdget_pos+0xcc/0x154 [ 57.597989] #1: c0000018e27c6810 (&sb->s_type->i_mutex_key#3){++++}-{4:4}, at: iterate_dir+0x94/0x1d4 Prevent the above lockdep warning by acquiring ->sysfs_lock before freezing the queue while storing a queue attribute in queue_attr_store function. Later, we also found[1] another function __blk_mq_update_nr_ hw_queues where we first freeze queue and then acquire the ->sysfs_lock. So we've also updated lock ordering in __blk_mq_update_nr_hw_queues function and ensured that in all code paths we follow the correct lock ordering i.e. acquire ->sysfs_lock before freezing the queue. [1] https://lore.kernel.org/all/CAFj5m9Ke8+EHKQBs_Nk6hqd=LGXtk4mUxZUN5==ZcCjnZSBwHw@mail.gmail.com/ Reported-by: kjain@linux.ibm.com Fixes: af28141 ("block: freeze the queue in queue_attr_store") Tested-by: kjain@linux.ibm.com Cc: hch@lst.de Cc: axboe@kernel.dk Cc: ritesh.list@gmail.com Cc: ming.lei@redhat.com Cc: gjoyce@linux.ibm.com Signed-off-by: Nilay Shroff <nilay@linux.ibm.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20241210144222.1066229-1-nilay@linux.ibm.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
syzbot reports that a recent fix causes nesting issues between the (now) raw timeoutlock and the eventfd locking: ============================= [ BUG: Invalid wait context ] 6.13.0-rc4-00080-g9828a4c0901f torvalds#29 Not tainted ----------------------------- kworker/u32:0/68094 is trying to lock: ffff000014d7a520 (&ctx->wqh#2){..-.}-{3:3}, at: eventfd_signal_mask+0x64/0x180 other info that might help us debug this: context-{5:5} 6 locks held by kworker/u32:0/68094: #0: ffff0000c1d98148 ((wq_completion)iou_exit){+.+.}-{0:0}, at: process_one_work+0x4e8/0xfc0 #1: ffff80008d927c78 ((work_completion)(&ctx->exit_work)){+.+.}-{0:0}, at: process_one_work+0x53c/0xfc0 #2: ffff0000c59bc3d8 (&ctx->completion_lock){+.+.}-{3:3}, at: io_kill_timeouts+0x40/0x180 #3: ffff0000c59bc358 (&ctx->timeout_lock){-.-.}-{2:2}, at: io_kill_timeouts+0x48/0x180 #4: ffff800085127aa0 (rcu_read_lock){....}-{1:3}, at: rcu_lock_acquire+0x8/0x38 #5: ffff800085127aa0 (rcu_read_lock){....}-{1:3}, at: rcu_lock_acquire+0x8/0x38 stack backtrace: CPU: 7 UID: 0 PID: 68094 Comm: kworker/u32:0 Not tainted 6.13.0-rc4-00080-g9828a4c0901f torvalds#29 Hardware name: linux,dummy-virt (DT) Workqueue: iou_exit io_ring_exit_work Call trace: show_stack+0x1c/0x30 (C) __dump_stack+0x24/0x30 dump_stack_lvl+0x60/0x80 dump_stack+0x14/0x20 __lock_acquire+0x19f8/0x60c8 lock_acquire+0x1a4/0x540 _raw_spin_lock_irqsave+0x90/0xd0 eventfd_signal_mask+0x64/0x180 io_eventfd_signal+0x64/0x108 io_req_local_work_add+0x294/0x430 __io_req_task_work_add+0x1c0/0x270 io_kill_timeout+0x1f0/0x288 io_kill_timeouts+0xd4/0x180 io_uring_try_cancel_requests+0x2e8/0x388 io_ring_exit_work+0x150/0x550 process_one_work+0x5e8/0xfc0 worker_thread+0x7ec/0xc80 kthread+0x24c/0x300 ret_from_fork+0x10/0x20 because after the preempt-rt fix for the timeout lock nesting inside the io-wq lock, we now have the eventfd spinlock nesting inside the raw timeout spinlock. Rather than play whack-a-mole with other nesting on the timeout lock, split the deletion and killing of timeouts so queueing the task_work for the timeout cancelations can get done outside of the timeout lock. Reported-by: syzbot+b1fc199a40b65d601b65@syzkaller.appspotmail.com Fixes: 020b40f ("io_uring: make ctx->timeout_lock a raw spinlock") Signed-off-by: Jens Axboe <axboe@kernel.dk>
…le_direct_reclaim() The task sometimes continues looping in throttle_direct_reclaim() because allow_direct_reclaim(pgdat) keeps returning false. #0 [ffff80002cb6f8d0] __switch_to at ffff8000080095ac #1 [ffff80002cb6f900] __schedule at ffff800008abbd1c #2 [ffff80002cb6f990] schedule at ffff800008abc50c #3 [ffff80002cb6f9b0] throttle_direct_reclaim at ffff800008273550 #4 [ffff80002cb6fa20] try_to_free_pages at ffff800008277b68 #5 [ffff80002cb6fae0] __alloc_pages_nodemask at ffff8000082c4660 #6 [ffff80002cb6fc50] alloc_pages_vma at ffff8000082e4a98 torvalds#7 [ffff80002cb6fca0] do_anonymous_page at ffff80000829f5a8 torvalds#8 [ffff80002cb6fce0] __handle_mm_fault at ffff8000082a5974 torvalds#9 [ffff80002cb6fd90] handle_mm_fault at ffff8000082a5bd4 At this point, the pgdat contains the following two zones: NODE: 4 ZONE: 0 ADDR: ffff00817fffe540 NAME: "DMA32" SIZE: 20480 MIN/LOW/HIGH: 11/28/45 VM_STAT: NR_FREE_PAGES: 359 NR_ZONE_INACTIVE_ANON: 18813 NR_ZONE_ACTIVE_ANON: 0 NR_ZONE_INACTIVE_FILE: 50 NR_ZONE_ACTIVE_FILE: 0 NR_ZONE_UNEVICTABLE: 0 NR_ZONE_WRITE_PENDING: 0 NR_MLOCK: 0 NR_BOUNCE: 0 NR_ZSPAGES: 0 NR_FREE_CMA_PAGES: 0 NODE: 4 ZONE: 1 ADDR: ffff00817fffec00 NAME: "Normal" SIZE: 8454144 PRESENT: 98304 MIN/LOW/HIGH: 68/166/264 VM_STAT: NR_FREE_PAGES: 146 NR_ZONE_INACTIVE_ANON: 94668 NR_ZONE_ACTIVE_ANON: 3 NR_ZONE_INACTIVE_FILE: 735 NR_ZONE_ACTIVE_FILE: 78 NR_ZONE_UNEVICTABLE: 0 NR_ZONE_WRITE_PENDING: 0 NR_MLOCK: 0 NR_BOUNCE: 0 NR_ZSPAGES: 0 NR_FREE_CMA_PAGES: 0 In allow_direct_reclaim(), while processing ZONE_DMA32, the sum of inactive/active file-backed pages calculated in zone_reclaimable_pages() based on the result of zone_page_state_snapshot() is zero. Additionally, since this system lacks swap, the calculation of inactive/ active anonymous pages is skipped. crash> p nr_swap_pages nr_swap_pages = $1937 = { counter = 0 } As a result, ZONE_DMA32 is deemed unreclaimable and skipped, moving on to the processing of the next zone, ZONE_NORMAL, despite ZONE_DMA32 having free pages significantly exceeding the high watermark. The problem is that the pgdat->kswapd_failures hasn't been incremented. crash> px ((struct pglist_data *) 0xffff00817fffe540)->kswapd_failures $1935 = 0x0 This is because the node deemed balanced. The node balancing logic in balance_pgdat() evaluates all zones collectively. If one or more zones (e.g., ZONE_DMA32) have enough free pages to meet their watermarks, the entire node is deemed balanced. This causes balance_pgdat() to exit early before incrementing the kswapd_failures, as it considers the overall memory state acceptable, even though some zones (like ZONE_NORMAL) remain under significant pressure. The patch ensures that zone_reclaimable_pages() includes free pages (NR_FREE_PAGES) in its calculation when no other reclaimable pages are available (e.g., file-backed or anonymous pages). This change prevents zones like ZONE_DMA32, which have sufficient free pages, from being mistakenly deemed unreclaimable. By doing so, the patch ensures proper node balancing, avoids masking pressure on other zones like ZONE_NORMAL, and prevents infinite loops in throttle_direct_reclaim() caused by allow_direct_reclaim(pgdat) repeatedly returning false. The kernel hangs due to a task stuck in throttle_direct_reclaim(), caused by a node being incorrectly deemed balanced despite pressure in certain zones, such as ZONE_NORMAL. This issue arises from zone_reclaimable_pages() returning 0 for zones without reclaimable file- backed or anonymous pages, causing zones like ZONE_DMA32 with sufficient free pages to be skipped. The lack of swap or reclaimable pages results in ZONE_DMA32 being ignored during reclaim, masking pressure in other zones. Consequently, pgdat->kswapd_failures remains 0 in balance_pgdat(), preventing fallback mechanisms in allow_direct_reclaim() from being triggered, leading to an infinite loop in throttle_direct_reclaim(). This patch modifies zone_reclaimable_pages() to account for free pages (NR_FREE_PAGES) when no other reclaimable pages exist. This ensures zones with sufficient free pages are not skipped, enabling proper balancing and reclaim behavior. [akpm@linux-foundation.org: coding-style cleanups] Link: https://lkml.kernel.org/r/20241130164346.436469-1-snishika@redhat.com Link: https://lkml.kernel.org/r/20241130161236.433747-2-snishika@redhat.com Fixes: 5a1c84b ("mm: remove reclaim and compaction retry approximations") Signed-off-by: Seiji Nishikawa <snishika@redhat.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
…nt message Address a bug in the kernel that triggers a "sleeping function called from invalid context" warning when /sys/kernel/debug/kmemleak is printed under specific conditions: - CONFIG_PREEMPT_RT=y - Set SELinux as the LSM for the system - Set kptr_restrict to 1 - kmemleak buffer contains at least one item BUG: sleeping function called from invalid context at kernel/locking/spinlock_rt.c:48 in_atomic(): 1, irqs_disabled(): 1, non_block: 0, pid: 136, name: cat preempt_count: 1, expected: 0 RCU nest depth: 2, expected: 2 6 locks held by cat/136: #0: ffff32e64bcbf950 (&p->lock){+.+.}-{3:3}, at: seq_read_iter+0xb8/0xe30 #1: ffffafe6aaa9dea0 (scan_mutex){+.+.}-{3:3}, at: kmemleak_seq_start+0x34/0x128 #3: ffff32e6546b1cd0 (&object->lock){....}-{2:2}, at: kmemleak_seq_show+0x3c/0x1e0 #4: ffffafe6aa8d8560 (rcu_read_lock){....}-{1:2}, at: has_ns_capability_noaudit+0x8/0x1b0 #5: ffffafe6aabbc0f8 (notif_lock){+.+.}-{2:2}, at: avc_compute_av+0xc4/0x3d0 irq event stamp: 136660 hardirqs last enabled at (136659): [<ffffafe6a80fd7a0>] _raw_spin_unlock_irqrestore+0xa8/0xd8 hardirqs last disabled at (136660): [<ffffafe6a80fd85c>] _raw_spin_lock_irqsave+0x8c/0xb0 softirqs last enabled at (0): [<ffffafe6a5d50b28>] copy_process+0x11d8/0x3df8 softirqs last disabled at (0): [<0000000000000000>] 0x0 Preemption disabled at: [<ffffafe6a6598a4c>] kmemleak_seq_show+0x3c/0x1e0 CPU: 1 UID: 0 PID: 136 Comm: cat Tainted: G E 6.11.0-rt7+ torvalds#34 Tainted: [E]=UNSIGNED_MODULE Hardware name: linux,dummy-virt (DT) Call trace: dump_backtrace+0xa0/0x128 show_stack+0x1c/0x30 dump_stack_lvl+0xe8/0x198 dump_stack+0x18/0x20 rt_spin_lock+0x8c/0x1a8 avc_perm_nonode+0xa0/0x150 cred_has_capability.isra.0+0x118/0x218 selinux_capable+0x50/0x80 security_capable+0x7c/0xd0 has_ns_capability_noaudit+0x94/0x1b0 has_capability_noaudit+0x20/0x30 restricted_pointer+0x21c/0x4b0 pointer+0x298/0x760 vsnprintf+0x330/0xf70 seq_printf+0x178/0x218 print_unreferenced+0x1a4/0x2d0 kmemleak_seq_show+0xd0/0x1e0 seq_read_iter+0x354/0xe30 seq_read+0x250/0x378 full_proxy_read+0xd8/0x148 vfs_read+0x190/0x918 ksys_read+0xf0/0x1e0 __arm64_sys_read+0x70/0xa8 invoke_syscall.constprop.0+0xd4/0x1d8 el0_svc+0x50/0x158 el0t_64_sync+0x17c/0x180 %pS and %pK, in the same back trace line, are redundant, and %pS can void %pK service in certain contexts. %pS alone already provides the necessary information, and if it cannot resolve the symbol, it falls back to printing the raw address voiding the original intent behind the %pK. Additionally, %pK requires a privilege check CAP_SYSLOG enforced through the LSM, which can trigger a "sleeping function called from invalid context" warning under RT_PREEMPT kernels when the check occurs in an atomic context. This issue may also affect other LSMs. This change avoids the unnecessary privilege check and resolves the sleeping function warning without any loss of information. Link: https://lkml.kernel.org/r/20241217142032.55793-1-acarmina@redhat.com Fixes: 3a6f33d ("mm/kmemleak: use %pK to display kernel pointers in backtrace") Signed-off-by: Alessandro Carminati <acarmina@redhat.com> Acked-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Cc: Clément Léger <clement.leger@bootlin.com> Cc: Alessandro Carminati <acarmina@redhat.com> Cc: Eric Chanudet <echanude@redhat.com> Cc: Gabriele Paoloni <gpaoloni@redhat.com> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Weißschuh <thomas.weissschuh@linutronix.de> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
gtp_newlink() links the device to a list in dev_net(dev) instead of src_net, where a udp tunnel socket is created. Even when src_net is removed, the device stays alive on dev_net(dev). Then, removing src_net triggers the splat below. [0] In this example, gtp0 is created in ns2, and the udp socket is created in ns1. ip netns add ns1 ip netns add ns2 ip -n ns1 link add netns ns2 name gtp0 type gtp role sgsn ip netns del ns1 Let's link the device to the socket's netns instead. Now, gtp_net_exit_batch_rtnl() needs another netdev iteration to remove all gtp devices in the netns. [0]: ref_tracker: net notrefcnt@000000003d6e7d05 has 1/2 users at sk_alloc (./include/net/net_namespace.h:345 net/core/sock.c:2236) inet_create (net/ipv4/af_inet.c:326 net/ipv4/af_inet.c:252) __sock_create (net/socket.c:1558) udp_sock_create4 (net/ipv4/udp_tunnel_core.c:18) gtp_create_sock (./include/net/udp_tunnel.h:59 drivers/net/gtp.c:1423) gtp_create_sockets (drivers/net/gtp.c:1447) gtp_newlink (drivers/net/gtp.c:1507) rtnl_newlink (net/core/rtnetlink.c:3786 net/core/rtnetlink.c:3897 net/core/rtnetlink.c:4012) rtnetlink_rcv_msg (net/core/rtnetlink.c:6922) netlink_rcv_skb (net/netlink/af_netlink.c:2542) netlink_unicast (net/netlink/af_netlink.c:1321 net/netlink/af_netlink.c:1347) netlink_sendmsg (net/netlink/af_netlink.c:1891) ____sys_sendmsg (net/socket.c:711 net/socket.c:726 net/socket.c:2583) ___sys_sendmsg (net/socket.c:2639) __sys_sendmsg (net/socket.c:2669) do_syscall_64 (arch/x86/entry/common.c:52 arch/x86/entry/common.c:83) WARNING: CPU: 1 PID: 60 at lib/ref_tracker.c:179 ref_tracker_dir_exit (lib/ref_tracker.c:179) Modules linked in: CPU: 1 UID: 0 PID: 60 Comm: kworker/u16:2 Not tainted 6.13.0-rc5-00147-g4c1224501e9d #5 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014 Workqueue: netns cleanup_net RIP: 0010:ref_tracker_dir_exit (lib/ref_tracker.c:179) Code: 00 00 00 fc ff df 4d 8b 26 49 bd 00 01 00 00 00 00 ad de 4c 39 f5 0f 85 df 00 00 00 48 8b 74 24 08 48 89 df e8 a5 cc 12 02 90 <0f> 0b 90 48 8d 6b 44 be 04 00 00 00 48 89 ef e8 80 de 67 ff 48 89 RSP: 0018:ff11000009a07b60 EFLAGS: 00010286 RAX: 0000000000002bd3 RBX: ff1100000f4e1aa0 RCX: 1ffffffff0e40ac6 RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffffff8423ee3c RBP: ff1100000f4e1af0 R08: 0000000000000001 R09: fffffbfff0e395ae R10: 0000000000000001 R11: 0000000000036001 R12: ff1100000f4e1af0 R13: dead000000000100 R14: ff1100000f4e1af0 R15: dffffc0000000000 FS: 0000000000000000(0000) GS:ff1100006ce80000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f9b2464bd98 CR3: 0000000005286005 CR4: 0000000000771ef0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400 PKRU: 55555554 Call Trace: <TASK> ? __warn (kernel/panic.c:748) ? ref_tracker_dir_exit (lib/ref_tracker.c:179) ? report_bug (lib/bug.c:201 lib/bug.c:219) ? handle_bug (arch/x86/kernel/traps.c:285) ? exc_invalid_op (arch/x86/kernel/traps.c:309 (discriminator 1)) ? asm_exc_invalid_op (./arch/x86/include/asm/idtentry.h:621) ? _raw_spin_unlock_irqrestore (./arch/x86/include/asm/irqflags.h:42 ./arch/x86/include/asm/irqflags.h:97 ./arch/x86/include/asm/irqflags.h:155 ./include/linux/spinlock_api_smp.h:151 kernel/locking/spinlock.c:194) ? ref_tracker_dir_exit (lib/ref_tracker.c:179) ? __pfx_ref_tracker_dir_exit (lib/ref_tracker.c:158) ? kfree (mm/slub.c:4613 mm/slub.c:4761) net_free (net/core/net_namespace.c:476 net/core/net_namespace.c:467) cleanup_net (net/core/net_namespace.c:664 (discriminator 3)) process_one_work (kernel/workqueue.c:3229) worker_thread (kernel/workqueue.c:3304 kernel/workqueue.c:3391) kthread (kernel/kthread.c:389) ret_from_fork (arch/x86/kernel/process.c:147) ret_from_fork_asm (arch/x86/entry/entry_64.S:257) </TASK> Fixes: 459aa66 ("gtp: add initial driver for datapath of GPRS Tunneling Protocol (GTP-U)") Reported-by: Xiao Liang <shaw.leon@gmail.com> Closes: https://lore.kernel.org/netdev/20250104125732.17335-1-shaw.leon@gmail.com/ Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
pfcp_newlink() links the device to a list in dev_net(dev) instead of net, where a udp tunnel socket is created. Even when net is removed, the device stays alive on dev_net(dev). Then, removing net triggers the splat below. [0] In this example, pfcp0 is created in ns2, but the udp socket is created in ns1. ip netns add ns1 ip netns add ns2 ip -n ns1 link add netns ns2 name pfcp0 type pfcp ip netns del ns1 Let's link the device to the socket's netns instead. Now, pfcp_net_exit() needs another netdev iteration to remove all pfcp devices in the netns. pfcp_dev_list is not used under RCU, so the list API is converted to the non-RCU variant. pfcp_net_exit() can be converted to .exit_batch_rtnl() in net-next. [0]: ref_tracker: net notrefcnt@00000000128b34dc has 1/1 users at sk_alloc (./include/net/net_namespace.h:345 net/core/sock.c:2236) inet_create (net/ipv4/af_inet.c:326 net/ipv4/af_inet.c:252) __sock_create (net/socket.c:1558) udp_sock_create4 (net/ipv4/udp_tunnel_core.c:18) pfcp_create_sock (drivers/net/pfcp.c:168) pfcp_newlink (drivers/net/pfcp.c:182 drivers/net/pfcp.c:197) rtnl_newlink (net/core/rtnetlink.c:3786 net/core/rtnetlink.c:3897 net/core/rtnetlink.c:4012) rtnetlink_rcv_msg (net/core/rtnetlink.c:6922) netlink_rcv_skb (net/netlink/af_netlink.c:2542) netlink_unicast (net/netlink/af_netlink.c:1321 net/netlink/af_netlink.c:1347) netlink_sendmsg (net/netlink/af_netlink.c:1891) ____sys_sendmsg (net/socket.c:711 net/socket.c:726 net/socket.c:2583) ___sys_sendmsg (net/socket.c:2639) __sys_sendmsg (net/socket.c:2669) do_syscall_64 (arch/x86/entry/common.c:52 arch/x86/entry/common.c:83) entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:130) WARNING: CPU: 1 PID: 11 at lib/ref_tracker.c:179 ref_tracker_dir_exit (lib/ref_tracker.c:179) Modules linked in: CPU: 1 UID: 0 PID: 11 Comm: kworker/u16:0 Not tainted 6.13.0-rc5-00147-g4c1224501e9d #5 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014 Workqueue: netns cleanup_net RIP: 0010:ref_tracker_dir_exit (lib/ref_tracker.c:179) Code: 00 00 00 fc ff df 4d 8b 26 49 bd 00 01 00 00 00 00 ad de 4c 39 f5 0f 85 df 00 00 00 48 8b 74 24 08 48 89 df e8 a5 cc 12 02 90 <0f> 0b 90 48 8d 6b 44 be 04 00 00 00 48 89 ef e8 80 de 67 ff 48 89 RSP: 0018:ff11000007f3fb60 EFLAGS: 00010286 RAX: 00000000000020ef RBX: ff1100000d6481e0 RCX: 1ffffffff0e40d82 RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffffff8423ee3c RBP: ff1100000d648230 R08: 0000000000000001 R09: fffffbfff0e395af R10: 0000000000000001 R11: 0000000000000000 R12: ff1100000d648230 R13: dead000000000100 R14: ff1100000d648230 R15: dffffc0000000000 FS: 0000000000000000(0000) GS:ff1100006ce80000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00005620e1363990 CR3: 000000000eeb2002 CR4: 0000000000771ef0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400 PKRU: 55555554 Call Trace: <TASK> ? __warn (kernel/panic.c:748) ? ref_tracker_dir_exit (lib/ref_tracker.c:179) ? report_bug (lib/bug.c:201 lib/bug.c:219) ? handle_bug (arch/x86/kernel/traps.c:285) ? exc_invalid_op (arch/x86/kernel/traps.c:309 (discriminator 1)) ? asm_exc_invalid_op (./arch/x86/include/asm/idtentry.h:621) ? _raw_spin_unlock_irqrestore (./arch/x86/include/asm/irqflags.h:42 ./arch/x86/include/asm/irqflags.h:97 ./arch/x86/include/asm/irqflags.h:155 ./include/linux/spinlock_api_smp.h:151 kernel/locking/spinlock.c:194) ? ref_tracker_dir_exit (lib/ref_tracker.c:179) ? __pfx_ref_tracker_dir_exit (lib/ref_tracker.c:158) ? kfree (mm/slub.c:4613 mm/slub.c:4761) net_free (net/core/net_namespace.c:476 net/core/net_namespace.c:467) cleanup_net (net/core/net_namespace.c:664 (discriminator 3)) process_one_work (kernel/workqueue.c:3229) worker_thread (kernel/workqueue.c:3304 kernel/workqueue.c:3391) kthread (kernel/kthread.c:389) ret_from_fork (arch/x86/kernel/process.c:147) ret_from_fork_asm (arch/x86/entry/entry_64.S:257) </TASK> Fixes: 76c8764 ("pfcp: add PFCP module") Reported-by: Xiao Liang <shaw.leon@gmail.com> Closes: https://lore.kernel.org/netdev/20250104125732.17335-1-shaw.leon@gmail.com/ Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
This fixes the following hard lockup in isolate_lru_folios() during memory reclaim. If the LRU mostly contains ineligible folios this may trigger watchdog. watchdog: Watchdog detected hard LOCKUP on cpu 173 RIP: 0010:native_queued_spin_lock_slowpath+0x255/0x2a0 Call Trace: _raw_spin_lock_irqsave+0x31/0x40 folio_lruvec_lock_irqsave+0x5f/0x90 folio_batch_move_lru+0x91/0x150 lru_add_drain_per_cpu+0x1c/0x40 process_one_work+0x17d/0x350 worker_thread+0x27b/0x3a0 kthread+0xe8/0x120 ret_from_fork+0x34/0x50 ret_from_fork_asm+0x1b/0x30 lruvec->lru_lock owner: PID: 2865 TASK: ffff888139214d40 CPU: 40 COMMAND: "kswapd0" #0 [fffffe0000945e60] crash_nmi_callback at ffffffffa567a555 #1 [fffffe0000945e68] nmi_handle at ffffffffa563b171 #2 [fffffe0000945eb0] default_do_nmi at ffffffffa6575920 #3 [fffffe0000945ed0] exc_nmi at ffffffffa6575af4 #4 [fffffe0000945ef0] end_repeat_nmi at ffffffffa6601dde [exception RIP: isolate_lru_folios+403] RIP: ffffffffa597df53 RSP: ffffc90006fb7c28 RFLAGS: 00000002 RAX: 0000000000000001 RBX: ffffc90006fb7c60 RCX: ffffea04a2196f88 RDX: ffffc90006fb7c60 RSI: ffffc90006fb7c60 RDI: ffffea04a2197048 RBP: ffff88812cbd3010 R8: ffffea04a2197008 R9: 0000000000000001 R10: 0000000000000000 R11: 0000000000000001 R12: ffffea04a2197008 R13: ffffea04a2197048 R14: ffffc90006fb7de8 R15: 0000000003e3e937 ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018 <NMI exception stack> #5 [ffffc90006fb7c28] isolate_lru_folios at ffffffffa597df53 #6 [ffffc90006fb7cf8] shrink_active_list at ffffffffa597f788 torvalds#7 [ffffc90006fb7da8] balance_pgdat at ffffffffa5986db0 torvalds#8 [ffffc90006fb7ec0] kswapd at ffffffffa5987354 torvalds#9 [ffffc90006fb7ef8] kthread at ffffffffa5748238 crash> Scenario: User processe are requesting a large amount of memory and keep page active. Then a module continuously requests memory from ZONE_DMA32 area. Memory reclaim will be triggered due to ZONE_DMA32 watermark alarm reached. However pages in the LRU(active_anon) list are mostly from the ZONE_NORMAL area. Reproduce: Terminal 1: Construct to continuously increase pages active(anon). mkdir /tmp/memory mount -t tmpfs -o size=1024000M tmpfs /tmp/memory dd if=/dev/zero of=/tmp/memory/block bs=4M tail /tmp/memory/block Terminal 2: vmstat -a 1 active will increase. procs ---memory--- ---swap-- ---io---- -system-- ---cpu--- ... r b swpd free inact active si so bi bo 1 0 0 1445623076 45898836 83646008 0 0 0 1 0 0 1445623076 43450228 86094616 0 0 0 1 0 0 1445623076 41003480 88541364 0 0 0 1 0 0 1445623076 38557088 90987756 0 0 0 1 0 0 1445623076 36109688 93435156 0 0 0 1 0 0 1445619552 33663256 95881632 0 0 0 1 0 0 1445619804 31217140 98327792 0 0 0 1 0 0 1445619804 28769988 100774944 0 0 0 1 0 0 1445619804 26322348 103222584 0 0 0 1 0 0 1445619804 23875592 105669340 0 0 0 cat /proc/meminfo | head Active(anon) increase. MemTotal: 1579941036 kB MemFree: 1445618500 kB MemAvailable: 1453013224 kB Buffers: 6516 kB Cached: 128653956 kB SwapCached: 0 kB Active: 118110812 kB Inactive: 11436620 kB Active(anon): 115345744 kB Inactive(anon): 945292 kB When the Active(anon) is 115345744 kB, insmod module triggers the ZONE_DMA32 watermark. perf record -e vmscan:mm_vmscan_lru_isolate -aR perf script isolate_mode=0 classzone=1 order=1 nr_requested=32 nr_scanned=2 nr_skipped=2 nr_taken=0 lru=active_anon isolate_mode=0 classzone=1 order=1 nr_requested=32 nr_scanned=0 nr_skipped=0 nr_taken=0 lru=active_anon isolate_mode=0 classzone=1 order=0 nr_requested=32 nr_scanned=28835844 nr_skipped=28835844 nr_taken=0 lru=active_anon isolate_mode=0 classzone=1 order=1 nr_requested=32 nr_scanned=28835844 nr_skipped=28835844 nr_taken=0 lru=active_anon isolate_mode=0 classzone=1 order=0 nr_requested=32 nr_scanned=29 nr_skipped=29 nr_taken=0 lru=active_anon isolate_mode=0 classzone=1 order=0 nr_requested=32 nr_scanned=0 nr_skipped=0 nr_taken=0 lru=active_anon See nr_scanned=28835844. 28835844 * 4k = 115343376KB approximately equal to 115345744 kB. If increase Active(anon) to 1000G then insmod module triggers the ZONE_DMA32 watermark. hard lockup will occur. In my device nr_scanned = 0000000003e3e937 when hard lockup. Convert to memory size 0x0000000003e3e937 * 4KB = 261072092 KB. [ffffc90006fb7c28] isolate_lru_folios at ffffffffa597df53 ffffc90006fb7c30: 0000000000000020 0000000000000000 ffffc90006fb7c40: ffffc90006fb7d40 ffff88812cbd3000 ffffc90006fb7c50: ffffc90006fb7d30 0000000106fb7de8 ffffc90006fb7c60: ffffea04a2197008 ffffea0006ed4a48 ffffc90006fb7c70: 0000000000000000 0000000000000000 ffffc90006fb7c80: 0000000000000000 0000000000000000 ffffc90006fb7c90: 0000000000000000 0000000000000000 ffffc90006fb7ca0: 0000000000000000 0000000003e3e937 ffffc90006fb7cb0: 0000000000000000 0000000000000000 ffffc90006fb7cc0: 8d7c0b56b7874b00 ffff88812cbd3000 About the Fixes: Why did it take eight years to be discovered? The problem requires the following conditions to occur: 1. The device memory should be large enough. 2. Pages in the LRU(active_anon) list are mostly from the ZONE_NORMAL area. 3. The memory in ZONE_DMA32 needs to reach the watermark. If the memory is not large enough, or if the usage design of ZONE_DMA32 area memory is reasonable, this problem is difficult to detect. notes: The problem is most likely to occur in ZONE_DMA32 and ZONE_NORMAL, but other suitable scenarios may also trigger the problem. Link: https://lkml.kernel.org/r/20241119060842.274072-1-liuye@kylinos.cn Fixes: b2e1875 ("mm, vmscan: begin reclaiming pages on a per-node basis") Signed-off-by: liuye <liuye@kylinos.cn> Cc: Hugh Dickins <hughd@google.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Yang Shi <yang@os.amperecomputing.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
When COWing a relocation tree path, at relocation.c:replace_path(), we can trigger a lockdep splat while we are in the btrfs_search_slot() call against the relocation root. This happens in that callchain at ctree.c:read_block_for_search() when we happen to find a child extent buffer already loaded through the fs tree with a lockdep class set to the fs tree. So when we attempt to lock that extent buffer through a relocation tree we have to reset the lockdep class to the class for a relocation tree, since a relocation tree has extent buffers that used to belong to a fs tree and may currently be already loaded (we swap extent buffers between the two trees at the end of replace_path()). However we are missing calls to btrfs_maybe_reset_lockdep_class() to reset the lockdep class at ctree.c:read_block_for_search() before we read lock an extent buffer, just like we did for btrfs_search_slot() in commit b40130b ("btrfs: fix lockdep splat with reloc root extent buffers"). So add the missing btrfs_maybe_reset_lockdep_class() calls before the attempts to read lock an extent buffer at ctree.c:read_block_for_search(). The lockdep splat was reported by syzbot and it looks like this: ====================================================== WARNING: possible circular locking dependency detected 6.13.0-rc5-syzkaller-00163-gab75170520d4 #0 Not tainted ------------------------------------------------------ syz.0.0/5335 is trying to acquire lock: ffff8880545dbc38 (btrfs-tree-01){++++}-{4:4}, at: btrfs_tree_read_lock_nested+0x2f/0x250 fs/btrfs/locking.c:146 but task is already holding lock: ffff8880545dba58 (btrfs-treloc-02/1){+.+.}-{4:4}, at: btrfs_tree_lock_nested+0x2f/0x250 fs/btrfs/locking.c:189 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> l1k#2 (btrfs-treloc-02/1){+.+.}-{4:4}: reacquire_held_locks+0x3eb/0x690 kernel/locking/lockdep.c:5374 __lock_release kernel/locking/lockdep.c:5563 [inline] lock_release+0x396/0xa30 kernel/locking/lockdep.c:5870 up_write+0x79/0x590 kernel/locking/rwsem.c:1629 btrfs_force_cow_block+0x14b3/0x1fd0 fs/btrfs/ctree.c:660 btrfs_cow_block+0x371/0x830 fs/btrfs/ctree.c:755 btrfs_search_slot+0xc01/0x3180 fs/btrfs/ctree.c:2153 replace_path+0x1243/0x2740 fs/btrfs/relocation.c:1224 merge_reloc_root+0xc46/0x1ad0 fs/btrfs/relocation.c:1692 merge_reloc_roots+0x3b3/0x980 fs/btrfs/relocation.c:1942 relocate_block_group+0xb0a/0xd40 fs/btrfs/relocation.c:3754 btrfs_relocate_block_group+0x77d/0xd90 fs/btrfs/relocation.c:4087 btrfs_relocate_chunk+0x12c/0x3b0 fs/btrfs/volumes.c:3494 __btrfs_balance+0x1b0f/0x26b0 fs/btrfs/volumes.c:4278 btrfs_balance+0xbdc/0x10c0 fs/btrfs/volumes.c:4655 btrfs_ioctl_balance+0x493/0x7c0 fs/btrfs/ioctl.c:3670 vfs_ioctl fs/ioctl.c:51 [inline] __do_sys_ioctl fs/ioctl.c:906 [inline] __se_sys_ioctl+0xf5/0x170 fs/ioctl.c:892 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x77/0x7f -> l1k#1 (btrfs-tree-01/1){+.+.}-{4:4}: lock_acquire+0x1ed/0x550 kernel/locking/lockdep.c:5849 down_write_nested+0xa2/0x220 kernel/locking/rwsem.c:1693 btrfs_tree_lock_nested+0x2f/0x250 fs/btrfs/locking.c:189 btrfs_init_new_buffer fs/btrfs/extent-tree.c:5052 [inline] btrfs_alloc_tree_block+0x41c/0x1440 fs/btrfs/extent-tree.c:5132 btrfs_force_cow_block+0x526/0x1fd0 fs/btrfs/ctree.c:573 btrfs_cow_block+0x371/0x830 fs/btrfs/ctree.c:755 btrfs_search_slot+0xc01/0x3180 fs/btrfs/ctree.c:2153 btrfs_insert_empty_items+0x9c/0x1a0 fs/btrfs/ctree.c:4351 btrfs_insert_empty_item fs/btrfs/ctree.h:688 [inline] btrfs_insert_inode_ref+0x2bb/0xf80 fs/btrfs/inode-item.c:330 btrfs_rename_exchange fs/btrfs/inode.c:7990 [inline] btrfs_rename2+0xcb7/0x2b90 fs/btrfs/inode.c:8374 vfs_rename+0xbdb/0xf00 fs/namei.c:5067 do_renameat2+0xd94/0x13f0 fs/namei.c:5224 __do_sys_renameat2 fs/namei.c:5258 [inline] __se_sys_renameat2 fs/namei.c:5255 [inline] __x64_sys_renameat2+0xce/0xe0 fs/namei.c:5255 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x77/0x7f -> #0 (btrfs-tree-01){++++}-{4:4}: check_prev_add kernel/locking/lockdep.c:3161 [inline] check_prevs_add kernel/locking/lockdep.c:3280 [inline] validate_chain+0x18ef/0x5920 kernel/locking/lockdep.c:3904 __lock_acquire+0x1397/0x2100 kernel/locking/lockdep.c:5226 lock_acquire+0x1ed/0x550 kernel/locking/lockdep.c:5849 down_read_nested+0xb5/0xa50 kernel/locking/rwsem.c:1649 btrfs_tree_read_lock_nested+0x2f/0x250 fs/btrfs/locking.c:146 btrfs_tree_read_lock fs/btrfs/locking.h:188 [inline] read_block_for_search+0x718/0xbb0 fs/btrfs/ctree.c:1610 btrfs_search_slot+0x1274/0x3180 fs/btrfs/ctree.c:2237 replace_path+0x1243/0x2740 fs/btrfs/relocation.c:1224 merge_reloc_root+0xc46/0x1ad0 fs/btrfs/relocation.c:1692 merge_reloc_roots+0x3b3/0x980 fs/btrfs/relocation.c:1942 relocate_block_group+0xb0a/0xd40 fs/btrfs/relocation.c:3754 btrfs_relocate_block_group+0x77d/0xd90 fs/btrfs/relocation.c:4087 btrfs_relocate_chunk+0x12c/0x3b0 fs/btrfs/volumes.c:3494 __btrfs_balance+0x1b0f/0x26b0 fs/btrfs/volumes.c:4278 btrfs_balance+0xbdc/0x10c0 fs/btrfs/volumes.c:4655 btrfs_ioctl_balance+0x493/0x7c0 fs/btrfs/ioctl.c:3670 vfs_ioctl fs/ioctl.c:51 [inline] __do_sys_ioctl fs/ioctl.c:906 [inline] __se_sys_ioctl+0xf5/0x170 fs/ioctl.c:892 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x77/0x7f other info that might help us debug this: Chain exists of: btrfs-tree-01 --> btrfs-tree-01/1 --> btrfs-treloc-02/1 Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(btrfs-treloc-02/1); lock(btrfs-tree-01/1); lock(btrfs-treloc-02/1); rlock(btrfs-tree-01); *** DEADLOCK *** 8 locks held by syz.0.0/5335: #0: ffff88801e3ae420 (sb_writers#13){.+.+}-{0:0}, at: mnt_want_write_file+0x5e/0x200 fs/namespace.c:559 l1k#1: ffff888052c760d0 (&fs_info->reclaim_bgs_lock){+.+.}-{4:4}, at: __btrfs_balance+0x4c2/0x26b0 fs/btrfs/volumes.c:4183 l1k#2: ffff888052c74850 (&fs_info->cleaner_mutex){+.+.}-{4:4}, at: btrfs_relocate_block_group+0x775/0xd90 fs/btrfs/relocation.c:4086 l1k#3: ffff88801e3ae610 (sb_internal#2){.+.+}-{0:0}, at: merge_reloc_root+0xf11/0x1ad0 fs/btrfs/relocation.c:1659 l1k#4: ffff888052c76470 (btrfs_trans_num_writers){++++}-{0:0}, at: join_transaction+0x405/0xda0 fs/btrfs/transaction.c:288 l1k#5: ffff888052c76498 (btrfs_trans_num_extwriters){++++}-{0:0}, at: join_transaction+0x405/0xda0 fs/btrfs/transaction.c:288 l1k#6: ffff8880545db878 (btrfs-tree-01/1){+.+.}-{4:4}, at: btrfs_tree_lock_nested+0x2f/0x250 fs/btrfs/locking.c:189 torvalds#7: ffff8880545dba58 (btrfs-treloc-02/1){+.+.}-{4:4}, at: btrfs_tree_lock_nested+0x2f/0x250 fs/btrfs/locking.c:189 stack backtrace: CPU: 0 UID: 0 PID: 5335 Comm: syz.0.0 Not tainted 6.13.0-rc5-syzkaller-00163-gab75170520d4 #0 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2~bpo12+1 04/01/2014 Call Trace: <TASK> __dump_stack lib/dump_stack.c:94 [inline] dump_stack_lvl+0x241/0x360 lib/dump_stack.c:120 print_circular_bug+0x13a/0x1b0 kernel/locking/lockdep.c:2074 check_noncircular+0x36a/0x4a0 kernel/locking/lockdep.c:2206 check_prev_add kernel/locking/lockdep.c:3161 [inline] check_prevs_add kernel/locking/lockdep.c:3280 [inline] validate_chain+0x18ef/0x5920 kernel/locking/lockdep.c:3904 __lock_acquire+0x1397/0x2100 kernel/locking/lockdep.c:5226 lock_acquire+0x1ed/0x550 kernel/locking/lockdep.c:5849 down_read_nested+0xb5/0xa50 kernel/locking/rwsem.c:1649 btrfs_tree_read_lock_nested+0x2f/0x250 fs/btrfs/locking.c:146 btrfs_tree_read_lock fs/btrfs/locking.h:188 [inline] read_block_for_search+0x718/0xbb0 fs/btrfs/ctree.c:1610 btrfs_search_slot+0x1274/0x3180 fs/btrfs/ctree.c:2237 replace_path+0x1243/0x2740 fs/btrfs/relocation.c:1224 merge_reloc_root+0xc46/0x1ad0 fs/btrfs/relocation.c:1692 merge_reloc_roots+0x3b3/0x980 fs/btrfs/relocation.c:1942 relocate_block_group+0xb0a/0xd40 fs/btrfs/relocation.c:3754 btrfs_relocate_block_group+0x77d/0xd90 fs/btrfs/relocation.c:4087 btrfs_relocate_chunk+0x12c/0x3b0 fs/btrfs/volumes.c:3494 __btrfs_balance+0x1b0f/0x26b0 fs/btrfs/volumes.c:4278 btrfs_balance+0xbdc/0x10c0 fs/btrfs/volumes.c:4655 btrfs_ioctl_balance+0x493/0x7c0 fs/btrfs/ioctl.c:3670 vfs_ioctl fs/ioctl.c:51 [inline] __do_sys_ioctl fs/ioctl.c:906 [inline] __se_sys_ioctl+0xf5/0x170 fs/ioctl.c:892 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x77/0x7f RIP: 0033:0x7f1ac6985d29 Code: ff ff c3 (...) RSP: 002b:00007f1ac63fe038 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 RAX: ffffffffffffffda RBX: 00007f1ac6b76160 RCX: 00007f1ac6985d29 RDX: 0000000020000180 RSI: 00000000c4009420 RDI: 0000000000000007 RBP: 00007f1ac6a01b08 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000 R13: 0000000000000001 R14: 00007f1ac6b76160 R15: 00007fffda145a88 </TASK> Reported-by: syzbot+63913e558c084f7f8fdc@syzkaller.appspotmail.com Link: https://lore.kernel.org/linux-btrfs/677b3014.050a0220.3b53b0.0064.GAE@google.com/ Fixes: 9978599 ("btrfs: reduce lock contention when eb cache miss for btree search") Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
We have several places across the kernel where we want to access another task's syscall arguments, such as ptrace(2), seccomp(2), etc., by making a call to syscall_get_arguments(). This works for register arguments right away by accessing the task's `regs' member of `struct pt_regs', however for stack arguments seen with 32-bit/o32 kernels things are more complicated. Technically they ought to be obtained from the user stack with calls to an access_remote_vm(), but we have an easier way available already. So as to be able to access syscall stack arguments as regular function arguments following the MIPS calling convention we copy them over from the user stack to the kernel stack in arch/mips/kernel/scall32-o32.S, in handle_sys(), to the current stack frame's outgoing argument space at the top of the stack, which is where the handler called expects to see its incoming arguments. This area is also pointed at by the `pt_regs' pointer obtained by task_pt_regs(). Make the o32 stack argument space a proper member of `struct pt_regs' then, by renaming the existing member from `pad0' to `args' and using generated offsets to access the space. No functional change though. With the change in place the o32 kernel stack frame layout at the entry to a syscall handler invoked by handle_sys() is therefore as follows: $sp + 68 -> | ... | <- pt_regs.regs[9] +---------------------+ $sp + 64 -> | $t0 | <- pt_regs.regs[8] +---------------------+ $sp + 60 -> | $a3/argument l1k#4 | <- pt_regs.regs[7] +---------------------+ $sp + 56 -> | $a2/argument l1k#3 | <- pt_regs.regs[6] +---------------------+ $sp + 52 -> | $a1/argument l1k#2 | <- pt_regs.regs[5] +---------------------+ $sp + 48 -> | $a0/argument l1k#1 | <- pt_regs.regs[4] +---------------------+ $sp + 44 -> | $v1 | <- pt_regs.regs[3] +---------------------+ $sp + 40 -> | $v0 | <- pt_regs.regs[2] +---------------------+ $sp + 36 -> | $at | <- pt_regs.regs[1] +---------------------+ $sp + 32 -> | $zero | <- pt_regs.regs[0] +---------------------+ $sp + 28 -> | stack argument torvalds#8 | <- pt_regs.args[7] +---------------------+ $sp + 24 -> | stack argument torvalds#7 | <- pt_regs.args[6] +---------------------+ $sp + 20 -> | stack argument l1k#6 | <- pt_regs.args[5] +---------------------+ $sp + 16 -> | stack argument l1k#5 | <- pt_regs.args[4] +---------------------+ $sp + 12 -> | psABI space for $a3 | <- pt_regs.args[3] +---------------------+ $sp + 8 -> | psABI space for $a2 | <- pt_regs.args[2] +---------------------+ $sp + 4 -> | psABI space for $a1 | <- pt_regs.args[1] +---------------------+ $sp + 0 -> | psABI space for $a0 | <- pt_regs.args[0] +---------------------+ holding user data received and with the first 4 frame slots reserved by the psABI for the compiler to spill the incoming arguments from $a0-$a3 registers (which it sometimes does according to its needs) and the next 4 frame slots designated by the psABI for any stack function arguments that follow. This data is also available for other tasks to peek/poke at as reqired and where permitted. Signed-off-by: Maciej W. Rozycki <macro@orcam.me.uk> Signed-off-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
This makes ptrace/get_syscall_info selftest pass on mips o32 and mips64 o32 by fixing the following two test assertions: 1. get_syscall_info test assertion on mips o32: # get_syscall_info.c:218:get_syscall_info:Expected exp_args[5] (3134521044) == info.entry.args[4] (4911432) # get_syscall_info.c:219:get_syscall_info:wait l1k#1: entry stop mismatch 2. get_syscall_info test assertion on mips64 o32: # get_syscall_info.c:209:get_syscall_info:Expected exp_args[2] (3134324433) == info.entry.args[1] (18446744072548908753) # get_syscall_info.c:210:get_syscall_info:wait l1k#1: entry stop mismatch The first assertion happens due to mips_get_syscall_arg() trying to access another task's context but failing to do it properly because get_user() it calls just peeks at the current task's context. It usually does not crash because the default user stack always gets assigned the same VMA, but it is pure luck which mips_get_syscall_arg() wouldn't have if e.g. the stack was switched (via setcontext(3) or however) or a non-default process's thread peeked at, and in any case irrelevant data is obtained just as observed with the test case. mips_get_syscall_arg() ought to be using access_remote_vm() instead to retrieve the other task's stack contents, but given that the data has been already obtained and saved in `struct pt_regs' it would be an overkill. The first assertion is fixed for mips o32 by using struct pt_regs.args instead of get_user() to obtain syscall arguments. This approach works due to this piece in arch/mips/kernel/scall32-o32.S: /* * Ok, copy the args from the luser stack to the kernel stack. */ .set push .set noreorder .set nomacro load_a4: user_lw(t5, 16(t0)) # argument l1k#5 from usp load_a5: user_lw(t6, 20(t0)) # argument l1k#6 from usp load_a6: user_lw(t7, 24(t0)) # argument torvalds#7 from usp load_a7: user_lw(t8, 28(t0)) # argument torvalds#8 from usp loads_done: sw t5, PT_ARG4(sp) # argument l1k#5 to ksp sw t6, PT_ARG5(sp) # argument l1k#6 to ksp sw t7, PT_ARG6(sp) # argument torvalds#7 to ksp sw t8, PT_ARG7(sp) # argument torvalds#8 to ksp .set pop .section __ex_table,"a" PTR_WD load_a4, bad_stack_a4 PTR_WD load_a5, bad_stack_a5 PTR_WD load_a6, bad_stack_a6 PTR_WD load_a7, bad_stack_a7 .previous arch/mips/kernel/scall64-o32.S has analogous code for mips64 o32 that allows fixing the issue by obtaining syscall arguments from struct pt_regs.regs[4..11] instead of the erroneous use of get_user(). The second assertion is fixed by truncating 64-bit values to 32-bit syscall arguments. Fixes: c0ff3c5 ("MIPS: Enable HAVE_ARCH_TRACEHOOK.") Signed-off-by: Dmitry V. Levin <ldv@strace.io> Signed-off-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
When a bio with REQ_PREFLUSH is submitted to dm, __send_empty_flush() generates a flush_bio with REQ_OP_WRITE | REQ_PREFLUSH | REQ_SYNC, which causes the flush_bio to be throttled by wbt_wait(). An example from v5.4, similar problem also exists in upstream: crash> bt 2091206 PID: 2091206 TASK: ffff2050df92a300 CPU: 109 COMMAND: "kworker/u260:0" #0 [ffff800084a2f7f0] __switch_to at ffff80004008aeb8 #1 [ffff800084a2f820] __schedule at ffff800040bfa0c4 #2 [ffff800084a2f880] schedule at ffff800040bfa4b4 #3 [ffff800084a2f8a0] io_schedule at ffff800040bfa9c4 #4 [ffff800084a2f8c0] rq_qos_wait at ffff8000405925bc #5 [ffff800084a2f940] wbt_wait at ffff8000405bb3a0 #6 [ffff800084a2f9a0] __rq_qos_throttle at ffff800040592254 torvalds#7 [ffff800084a2f9c0] blk_mq_make_request at ffff80004057cf38 torvalds#8 [ffff800084a2fa60] generic_make_request at ffff800040570138 torvalds#9 [ffff800084a2fae0] submit_bio at ffff8000405703b4 torvalds#10 [ffff800084a2fb50] xlog_write_iclog at ffff800001280834 [xfs] torvalds#11 [ffff800084a2fbb0] xlog_sync at ffff800001280c3c [xfs] torvalds#12 [ffff800084a2fbf0] xlog_state_release_iclog at ffff800001280df4 [xfs] torvalds#13 [ffff800084a2fc10] xlog_write at ffff80000128203c [xfs] torvalds#14 [ffff800084a2fcd0] xlog_cil_push at ffff8000012846dc [xfs] torvalds#15 [ffff800084a2fda0] xlog_cil_push_work at ffff800001284a2c [xfs] torvalds#16 [ffff800084a2fdb0] process_one_work at ffff800040111d08 torvalds#17 [ffff800084a2fe00] worker_thread at ffff8000401121cc torvalds#18 [ffff800084a2fe70] kthread at ffff800040118de4 After commit 2def284 ("xfs: don't allow log IO to be throttled"), the metadata submitted by xlog_write_iclog() should not be throttled. But due to the existence of the dm layer, throttling flush_bio indirectly causes the metadata bio to be throttled. Fix this by conditionally adding REQ_IDLE to flush_bio.bi_opf, which makes wbt_should_throttle() return false to avoid wbt_wait(). Signed-off-by: Jinliang Zheng <alexjlzheng@tencent.com> Reviewed-by: Tianxiang Peng <txpeng@tencent.com> Reviewed-by: Hao Peng <flyingpeng@tencent.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Despite the fact that several lockdep-related checks are skipped when calling trylock* versions of the locking primitives, for example mutex_trylock, each time the mutex is acquired, a held_lock is still placed onto the lockdep stack by __lock_acquire() which is called regardless of whether the trylock* or regular locking API was used. This means that if the caller successfully acquires more than MAX_LOCK_DEPTH locks of the same class, even when using mutex_trylock, lockdep will still complain that the maximum depth of the held lock stack has been reached and disable itself. For example, the following error currently occurs in the ARM version of KVM, once the code tries to lock all vCPUs of a VM configured with more than MAX_LOCK_DEPTH vCPUs, a situation that can easily happen on modern systems, where having more than 48 CPUs is common, and it's also common to run VMs that have vCPU counts approaching that number: [ 328.171264] BUG: MAX_LOCK_DEPTH too low! [ 328.175227] turning off the locking correctness validator. [ 328.180726] Please attach the output of /proc/lock_stat to the bug report [ 328.187531] depth: 48 max: 48! [ 328.190678] 48 locks held by qemu-kvm/11664: [ 328.194957] #0: ffff800086de5ba0 (&kvm->lock){+.+.}-{3:3}, at: kvm_ioctl_create_device+0x174/0x5b0 [ 328.204048] #1: ffff0800e78800b8 (&vcpu->mutex){+.+.}-{3:3}, at: lock_all_vcpus+0x16c/0x2a0 [ 328.212521] #2: ffff07ffeee51e98 (&vcpu->mutex){+.+.}-{3:3}, at: lock_all_vcpus+0x16c/0x2a0 [ 328.220991] #3: ffff0800dc7d80b8 (&vcpu->mutex){+.+.}-{3:3}, at: lock_all_vcpus+0x16c/0x2a0 [ 328.229463] #4: ffff07ffe0c980b8 (&vcpu->mutex){+.+.}-{3:3}, at: lock_all_vcpus+0x16c/0x2a0 [ 328.237934] #5: ffff0800a3883c78 (&vcpu->mutex){+.+.}-{3:3}, at: lock_all_vcpus+0x16c/0x2a0 [ 328.246405] #6: ffff07fffbe480b8 (&vcpu->mutex){+.+.}-{3:3}, at: lock_all_vcpus+0x16c/0x2a0 Luckily, in all instances that require locking all vCPUs, the 'kvm->lock' is taken a priori, and that fact makes it possible to use the little known feature of lockdep, called a 'nest_lock', to avoid this warning and subsequent lockdep self-disablement. The action of 'nested lock' being provided to lockdep's lock_acquire(), causes the lockdep to detect that the top of the held lock stack contains a lock of the same class and then increment its reference counter instead of pushing a new held_lock item onto that stack. See __lock_acquire for more information. Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Message-ID: <20250512180407.659015-2-mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Use kvm_trylock_all_vcpus instead of a custom implementation when locking all vCPUs of a VM, to avoid triggering a lockdep warning, in the case in which the VM is configured to have more than MAX_LOCK_DEPTH vCPUs. This fixes the following false lockdep warning: [ 328.171264] BUG: MAX_LOCK_DEPTH too low! [ 328.175227] turning off the locking correctness validator. [ 328.180726] Please attach the output of /proc/lock_stat to the bug report [ 328.187531] depth: 48 max: 48! [ 328.190678] 48 locks held by qemu-kvm/11664: [ 328.194957] #0: ffff800086de5ba0 (&kvm->lock){+.+.}-{3:3}, at: kvm_ioctl_create_device+0x174/0x5b0 [ 328.204048] #1: ffff0800e78800b8 (&vcpu->mutex){+.+.}-{3:3}, at: lock_all_vcpus+0x16c/0x2a0 [ 328.212521] #2: ffff07ffeee51e98 (&vcpu->mutex){+.+.}-{3:3}, at: lock_all_vcpus+0x16c/0x2a0 [ 328.220991] #3: ffff0800dc7d80b8 (&vcpu->mutex){+.+.}-{3:3}, at: lock_all_vcpus+0x16c/0x2a0 [ 328.229463] #4: ffff07ffe0c980b8 (&vcpu->mutex){+.+.}-{3:3}, at: lock_all_vcpus+0x16c/0x2a0 [ 328.237934] #5: ffff0800a3883c78 (&vcpu->mutex){+.+.}-{3:3}, at: lock_all_vcpus+0x16c/0x2a0 [ 328.246405] #6: ffff07fffbe480b8 (&vcpu->mutex){+.+.}-{3:3}, at: lock_all_vcpus+0x16c/0x2a0 Suggested-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Acked-by: Marc Zyngier <maz@kernel.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Message-ID: <20250512180407.659015-6-mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Finally trying this on hardware! When testing on Fedora38 (x86-HW), I ran into a hang during the boot process. Which seemed to be caused by a NULL de-reference of the encoding of the signature
req->enc
.This patch adds a NULL check, that now allows the boot to proceed but I suppose it fails the ecdsa_verify. Perhaps there's needs to be a followup to this to fix that?.
Also switch to
strncmp()
to be 'safer'.Squish into commit: 09d3a2c