proper license on Microsoft-related files #7
Conversation
Is there some way to just turn off github pull requests? They're all jokes.
You could ask support@github.com or whatever their help email is.
Don't do that; by tomorrow people will lose the urge to troll ;)
Are you considering using GitHub as the standard public repository? Pull requests generally work out well for open source projects, but I'd be curious how they work out for a project the size of Linux.
@torvalds you could just report the user. You might have to go to their profile page and click the gear to get to the reporting page.
Reported @Holek, you guys all should too as @Spaceghost said :P
@diegoviola called @Holek's mommy.
I would laugh if @Holek is some millionaire troll :P
@Holek I see you contribute to wikipedia.pl quite a lot. I would have thought that someone who probably deals with a lot of spam himself would have had more sense than this.
Hello, everybody! First of all, I apologize for that crappy prank: that should not have happened. And I'm not saying that because people are mad, but because I should have known better before doing that. I have huge respect for Linux and the people doing their job on this project, and making them waste time on such pull requests was irresponsible. I sincerely apologize for my childish behaviour, and wish you the best of code!
Good job for coming around, Holek! :)
commit 1780f2d upstream.

Affected kernels 2.6.36 - 3.0

AppArmor may do a GFP_KERNEL memory allocation with task_lock(tsk->group_leader); held when called from security_task_setrlimit. This will only occur when the task's current policy has been replaced, and the task's creds have not been updated before entering the LSM security_task_setrlimit() hook.

BUG: sleeping function called from invalid context at mm/slub.c:847
in_atomic(): 1, irqs_disabled(): 0, pid: 1583, name: cupsd
2 locks held by cupsd/1583:
 #0: (tasklist_lock){.+.+.+}, at: [<ffffffff8104dafa>] do_prlimit+0x61/0x189
 #1: (&(&p->alloc_lock)->rlock){+.+.+.}, at: [<ffffffff8104db2d>] do_prlimit+0x94/0x189
Pid: 1583, comm: cupsd Not tainted 3.0.0-rc2-git1 #7
Call Trace:
 [<ffffffff8102ebf2>] __might_sleep+0x10d/0x112
 [<ffffffff810e6f46>] slab_pre_alloc_hook.isra.49+0x2d/0x33
 [<ffffffff810e7bc4>] kmem_cache_alloc+0x22/0x132
 [<ffffffff8105b6e6>] prepare_creds+0x35/0xe4
 [<ffffffff811c0675>] aa_replace_current_profile+0x35/0xb2
 [<ffffffff811c4d2d>] aa_current_profile+0x45/0x4c
 [<ffffffff811c4d4d>] apparmor_task_setrlimit+0x19/0x3a
 [<ffffffff811beaa5>] security_task_setrlimit+0x11/0x13
 [<ffffffff8104db6b>] do_prlimit+0xd2/0x189
 [<ffffffff8104dea9>] sys_setrlimit+0x3b/0x48
 [<ffffffff814062bb>] system_call_fastpath+0x16/0x1b

Signed-off-by: John Johansen <john.johansen@canonical.com>
Reported-by: Miles Lane <miles.lane@gmail.com>
Signed-off-by: James Morris <jmorris@namei.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
This patch updates 'kvm run' to boot to host filesystem via 9p '/bin/sh' by default:

  $ ./kvm run
  # kvm run -k ../../arch/x86/boot/bzImage -m 320 -c 2 --name guest-3462
  [ 0.000000] Linux version 3.1.0-rc1+ (penberg@tiger) (gcc version 4.4.3 (Ubuntu 4.4.3-4ubuntu5) ) #7 SMP PREEMPT Tue Aug 9 16:39:20 EEST 2011
  [ 0.000000] Command line: notsc noapic noacpi pci=conf1 reboot=k panic=1 console=ttyS0 earlyprintk=serial init=/bin/sh root=/dev/vda rw root=/dev/root rootflags=rw,trans=virtio,version=9p2000.u rootfstype=9p
  [snip]
  [ 1.803261] VFS: Mounted root (9p filesystem) on device 0:13.
  [ 1.805153] devtmpfs: mounted
  [ 1.808353] Freeing unused kernel memory: 924k freed
  [ 1.810592] Write protecting the kernel read-only data: 12288k
  [ 1.816268] Freeing unused kernel memory: 632k freed
  [ 1.826030] Freeing unused kernel memory: 1448k freed
  sh: cannot set terminal process group (-1): Inappropriate ioctl for device
  sh: no job control in this shell
  sh-4.1#

Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Asias He <asias.hejun@gmail.com>
Cc: Cyrill Gorcunov <gorcunov@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Prasad Joshi <prasadjoshi124@gmail.com>
Cc: Sasha Levin <levinsasha928@gmail.com>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
This patch validates sdev pointer in scsi_dh_activate before proceeding further.

Without this check we might see the panic as below. I have seen this panic multiple times..

Call trace:
 #0 [ffff88007d647b50] machine_kexec at ffffffff81020902
 #1 [ffff88007d647ba0] crash_kexec at ffffffff810875b0
 #2 [ffff88007d647c70] oops_end at ffffffff8139c650
 #3 [ffff88007d647c90] __bad_area_nosemaphore at ffffffff8102dd15
 #4 [ffff88007d647d50] page_fault at ffffffff8139b8cf
    [exception RIP: scsi_dh_activate+0x82]
    RIP: ffffffffa0041922 RSP: ffff88007d647e00 RFLAGS: 00010046
    RAX: 0000000000000000 RBX: 0000000000000000 RCX: 00000000000093c5
    RDX: 00000000000093c5 RSI: ffffffffa02e6640 RDI: ffff88007cc88988
    RBP: 000000000000000f R8: ffff88007d646000 R9: 0000000000000000
    R10: ffff880082293790 R11: 00000000ffffffff R12: ffff88007cc88988
    R13: 0000000000000000 R14: 0000000000000286 R15: ffff880037b845e0
    ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0000
 #5 [ffff88007d647e38] run_workqueue at ffffffff81060268
 #6 [ffff88007d647e78] worker_thread at ffffffff81060386
 #7 [ffff88007d647ee8] kthread at ffffffff81064436
 #8 [ffff88007d647f48] kernel_thread at ffffffff81003fba

Signed-off-by: Babu Moger <babu.moger@netapp.com>
Cc: stable@kernel.org
Signed-off-by: James Bottomley <JBottomley@Parallels.com>
commit a18a920 upstream.

This patch validates sdev pointer in scsi_dh_activate before proceeding further.

Without this check we might see the panic as below. I have seen this panic multiple times..

Call trace:
 #0 [ffff88007d647b50] machine_kexec at ffffffff81020902
 #1 [ffff88007d647ba0] crash_kexec at ffffffff810875b0
 #2 [ffff88007d647c70] oops_end at ffffffff8139c650
 #3 [ffff88007d647c90] __bad_area_nosemaphore at ffffffff8102dd15
 #4 [ffff88007d647d50] page_fault at ffffffff8139b8cf
    [exception RIP: scsi_dh_activate+0x82]
    RIP: ffffffffa0041922 RSP: ffff88007d647e00 RFLAGS: 00010046
    RAX: 0000000000000000 RBX: 0000000000000000 RCX: 00000000000093c5
    RDX: 00000000000093c5 RSI: ffffffffa02e6640 RDI: ffff88007cc88988
    RBP: 000000000000000f R8: ffff88007d646000 R9: 0000000000000000
    R10: ffff880082293790 R11: 00000000ffffffff R12: ffff88007cc88988
    R13: 0000000000000000 R14: 0000000000000286 R15: ffff880037b845e0
    ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0000
 #5 [ffff88007d647e38] run_workqueue at ffffffff81060268
 #6 [ffff88007d647e78] worker_thread at ffffffff81060386
 #7 [ffff88007d647ee8] kthread at ffffffff81064436
 #8 [ffff88007d647f48] kernel_thread at ffffffff81003fba

Signed-off-by: Babu Moger <babu.moger@netapp.com>
Signed-off-by: James Bottomley <JBottomley@Parallels.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
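The pointer check described in the scsi_dh_activate commit message can be sketched as a small userspace model. This is not the kernel patch: the `model_` types, the field names, and the return codes are hypothetical stand-ins chosen for illustration. The only thing taken from the message is the idea that the sdev pointer is validated before it is used.

```c
#include <stddef.h>

/* Hypothetical stand-ins for the kernel structures; not the real definitions. */
struct model_sdev {
    void *dh_data;                  /* device-handler private data, may be NULL */
};

struct model_queue {
    struct model_sdev *queuedata;   /* assumed to be NULL once the device is gone */
};

/* Hypothetical return codes for this sketch. */
enum { MODEL_DH_OK = 0, MODEL_DH_NOSYS = 1 };

/*
 * Validate the sdev pointer before touching anything behind it.
 * Without the first test, a NULL queuedata would be dereferenced,
 * which is the kind of fault the call trace above shows.
 */
int model_dh_activate(const struct model_queue *q)
{
    const struct model_sdev *sdev = q->queuedata;

    if (sdev == NULL || sdev->dh_data == NULL)
        return MODEL_DH_NOSYS;
    return MODEL_DH_OK;
}
```

The caller gets an error code for a torn-down device instead of a page fault, which is the behaviour the commit message asks for.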
If the pte mapping in generic_perform_write() is unmapped between iov_iter_fault_in_readable() and iov_iter_copy_from_user_atomic(), the "copied" parameter to ->end_write can be zero. ext4 couldn't cope with it with delayed allocations enabled. This skips the i_disksize enlargement logic if copied is zero and no new data was appended to the inode.

gdb> bt
#0 0xffffffff811afe80 in ext4_da_should_update_i_disksize (file=0xffff88003f606a80, mapping=0xffff88001d3824e0, pos=0x108000, len=0x1000, copied=0x0, page=0xffffea0000d792e8, fsdata=0x0) at fs/ext4/inode.c:2467
#1 ext4_da_write_end (file=0xffff88003f606a80, mapping=0xffff88001d3824e0, pos=0x108000, len=0x1000, copied=0x0, page=0xffffea0000d792e8, fsdata=0x0) at fs/ext4/inode.c:2512
#2 0xffffffff810d97f1 in generic_perform_write (iocb=<value optimized out>, iov=<value optimized out>, nr_segs=<value optimized out>, pos=0x108000, ppos=0xffff88001e26be40, count=<value optimized out>, written=0x0) at mm/filemap.c:2440
#3 generic_file_buffered_write (iocb=<value optimized out>, iov=<value optimized out>, nr_segs=<value optimized out>, pos=0x108000, ppos=0xffff88001e26be40, count=<value optimized out>, written=0x0) at mm/filemap.c:2482
#4 0xffffffff810db5d1 in __generic_file_aio_write (iocb=0xffff88001e26bde8, iov=0xffff88001e26bec8, nr_segs=0x1, ppos=0xffff88001e26be40) at mm/filemap.c:2600
#5 0xffffffff810db853 in generic_file_aio_write (iocb=0xffff88001e26bde8, iov=0xffff88001e26bec8, nr_segs=<value optimized out>, pos=<value optimized out>) at mm/filemap.c:2632
#6 0xffffffff811a71aa in ext4_file_write (iocb=0xffff88001e26bde8, iov=0xffff88001e26bec8, nr_segs=0x1, pos=0x108000) at fs/ext4/file.c:136
#7 0xffffffff811375aa in do_sync_write (filp=0xffff88003f606a80, buf=<value optimized out>, len=<value optimized out>, ppos=0xffff88001e26bf48) at fs/read_write.c:406
#8 0xffffffff81137e56 in vfs_write (file=0xffff88003f606a80, buf=0x1ec2960 <Address 0x1ec2960 out of bounds>, count=0x4000, pos=0xffff88001e26bf48) at fs/read_write.c:435
#9 0xffffffff8113816c in sys_write (fd=<value optimized out>, buf=0x1ec2960 <Address 0x1ec2960 out of bounds>, count=0x4000) at fs/read_write.c:487
#10 <signal handler called>
#11 0x00007f120077a390 in __brk_reservation_fn_dmi_alloc__ ()
#12 0x0000000000000000 in ?? ()
gdb> print offset
$22 = 0xffffffffffffffff
gdb> print idx
$23 = 0xffffffff
gdb> print inode->i_blkbits
$24 = 0xc
gdb> up
#1 ext4_da_write_end (file=0xffff88003f606a80, mapping=0xffff88001d3824e0, pos=0x108000, len=0x1000, copied=0x0, page=0xffffea0000d792e8, fsdata=0x0) at fs/ext4/inode.c:2512
2512 if (ext4_da_should_update_i_disksize(page, end)) {
gdb> print start
$25 = 0x0
gdb> print end
$26 = 0xffffffffffffffff
gdb> print pos
$27 = 0x108000
gdb> print new_i_size
$28 = 0x108000
gdb> print ((struct ext4_inode_info *)((char *)inode-((int)(&((struct ext4_inode_info *)0)->vfs_inode))))->i_disksize
$29 = 0xd9000
gdb> down
2467 for (i = 0; i < idx; i++)
gdb> print i
$30 = 0xd44acbee

This is 100% reproducible with some autonuma development code tuned in a very aggressive manner (not normal way even for knumad) which does "exotic" changes to the ptes. It wouldn't normally trigger but I don't see why it can't happen normally if the page is added to swap cache in between the two faults leading to "copied" being zero (which then hangs in ext4). So it should be fixed. Especially possible with lumpy reclaim (albeit disabled if compaction is enabled) as that would ignore the young bits in the ptes.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@kernel.org
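The values printed in the gdb session can be reproduced with a little unsigned arithmetic. The sketch below assumes that `end` is computed as `start + copied - 1`, with `start` being the offset of `pos` within a 4 KiB page, and that `idx` is a 32-bit value. Both assumptions are consistent with the printed `start`, `end`, `offset` and `idx`, but they are inferred from the trace rather than quoted from the kernel source. All `model_` names are hypothetical.

```c
#include <stdint.h>

#define MODEL_PAGE_SIZE 0x1000u   /* 4 KiB pages, matching len=0x1000 in the trace */

/* Offset of the last byte written within the page: start + copied - 1. */
uint64_t model_write_end(uint64_t pos, uint64_t copied)
{
    uint64_t start = pos & (MODEL_PAGE_SIZE - 1);

    /* Wraps to all ones when start == 0 and copied == 0. */
    return start + copied - 1;
}

/* Block index derived from that offset, truncated to 32 bits. */
uint32_t model_block_index(uint64_t offset, unsigned int blkbits)
{
    return (uint32_t)(offset >> blkbits);
}

/*
 * Guard in the spirit of the commit message: only consider enlarging
 * i_disksize when some data was actually copied.
 */
int model_should_check_disksize(uint64_t copied, uint64_t new_i_size,
                                uint64_t i_disksize)
{
    return copied != 0 && new_i_size > i_disksize;
}
```

With `pos = 0x108000` and `copied = 0`, `model_write_end` wraps to `0xffffffffffffffff`, and shifting that by `i_blkbits = 0xc` then truncating gives `0xffffffff`, so a loop of the form `for (i = 0; i < idx; i++)` runs for roughly four billion iterations. That matches the hang described in the message, and the guard avoids entering that path when nothing was copied.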
The current selection of the GPTIMER was the result of a hardware issue in early versions of the Beagleboards (Ax and B1 thru B4). [1] [2]

It has been a long time since the hardware issue was fixed. This patch uses GPTIMER 1 for all newer board revisions incl. Beagleboard XM.

[1] http://thread.gmane.org/gmane.comp.hardware.beagleboard.general/91
[2] Errata #7 at http://elinux.org/BeagleBoard#Errata

Signed-off-by: Sanjeev Premi <premi@ti.com>
Cc: Paul Walmsley <paul@pwsan.com>
Reviewed-by: Paul Walmsley <paul@pwsan.com>
$ wget "http://pkgs.fedoraproject.org/gitweb/?p=kernel.git;a=blob_plain;f=mac80211_offchannel_rework_revert.patch;h=859799714cd85a58450ecde4a1dabc5adffd5100;hb=refs/heads/f16" -O mac80211_offchannel_rework_revert.patch
$ patch -p1 --dry-run < mac80211_offchannel_rework_revert.patch
patching file net/mac80211/ieee80211_i.h
Hunk #1 succeeded at 702 (offset 8 lines).
Hunk #2 succeeded at 712 (offset 8 lines).
Hunk #3 succeeded at 1143 (offset -57 lines).
patching file net/mac80211/main.c
patching file net/mac80211/offchannel.c
Hunk #1 succeeded at 18 (offset 1 line).
Hunk #2 succeeded at 42 (offset 1 line).
Hunk #3 succeeded at 78 (offset 1 line).
Hunk #4 succeeded at 96 (offset 1 line).
Hunk #5 succeeded at 162 (offset 1 line).
Hunk #6 succeeded at 182 (offset 1 line).
patching file net/mac80211/rx.c
Hunk #1 succeeded at 421 (offset 4 lines).
Hunk #2 succeeded at 2864 (offset 87 lines).
patching file net/mac80211/scan.c
Hunk #1 succeeded at 213 (offset 1 line).
Hunk #2 succeeded at 256 (offset 2 lines).
Hunk #3 succeeded at 288 (offset 2 lines).
Hunk #4 succeeded at 333 (offset 2 lines).
Hunk #5 succeeded at 482 (offset 2 lines).
Hunk #6 succeeded at 498 (offset 2 lines).
Hunk #7 succeeded at 516 (offset 2 lines).
Hunk #8 succeeded at 530 (offset 2 lines).
Hunk #9 succeeded at 555 (offset 2 lines).
patching file net/mac80211/tx.c
Hunk #1 succeeded at 259 (offset 1 line).
patching file net/mac80211/work.c
Hunk #1 succeeded at 899 (offset -2 lines).
Hunk #2 succeeded at 949 (offset -2 lines).
Hunk #3 succeeded at 1046 (offset -2 lines).
Hunk #4 succeeded at 1054 (offset -2 lines).
If the netdev is already in NETREG_UNREGISTERING/_UNREGISTERED state, do not update the real num tx queues. netdev_queue_update_kobjects() is already called via remove_queue_kobjects() at NETREG_UNREGISTERING time. So, when upper layer driver, e.g., FCoE protocol stack is monitoring the netdev event of NETDEV_UNREGISTER and calls back to LLD ndo_fcoe_disable() to remove extra queues allocated for FCoE, the associated txq sysfs kobjects are already removed, and trying to update the real num queues would cause something like below:

...
PID: 25138 TASK: ffff88021e64c440 CPU: 3 COMMAND: "kworker/3:3"
 #0 [ffff88021f007760] machine_kexec at ffffffff810226d9
 #1 [ffff88021f0077d0] crash_kexec at ffffffff81089d2d
 #2 [ffff88021f0078a0] oops_end at ffffffff813bca78
 #3 [ffff88021f0078d0] no_context at ffffffff81029e72
 #4 [ffff88021f007920] __bad_area_nosemaphore at ffffffff8102a155
 #5 [ffff88021f0079f0] bad_area_nosemaphore at ffffffff8102a23e
 #6 [ffff88021f007a00] do_page_fault at ffffffff813bf32e
 #7 [ffff88021f007b10] page_fault at ffffffff813bc045
    [exception RIP: sysfs_find_dirent+17]
    RIP: ffffffff81178611 RSP: ffff88021f007bc0 RFLAGS: 00010246
    RAX: ffff88021e64c440 RBX: ffffffff8156cc63 RCX: 0000000000000004
    RDX: ffffffff8156cc63 RSI: 0000000000000000 RDI: 0000000000000000
    RBP: ffff88021f007be0 R8: 0000000000000004 R9: 0000000000000008
    R10: ffffffff816fed00 R11: 0000000000000004 R12: 0000000000000000
    R13: ffffffff8156cc63 R14: 0000000000000000 R15: ffff8802222a0000
    ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
 #8 [ffff88021f007be8] sysfs_get_dirent at ffffffff81178c07
 #9 [ffff88021f007c18] sysfs_remove_group at ffffffff8117ac27
#10 [ffff88021f007c48] netdev_queue_update_kobjects at ffffffff813178f9
#11 [ffff88021f007c88] netif_set_real_num_tx_queues at ffffffff81303e38
#12 [ffff88021f007cc8] ixgbe_set_num_queues at ffffffffa0249763 [ixgbe]
#13 [ffff88021f007cf8] ixgbe_init_interrupt_scheme at ffffffffa024ea89 [ixgbe]
#14 [ffff88021f007d48] ixgbe_fcoe_disable at ffffffffa0267113 [ixgbe]
#15 [ffff88021f007d68] vlan_dev_fcoe_disable at ffffffffa014fef5 [8021q]
#16 [ffff88021f007d78] fcoe_interface_cleanup at ffffffffa02b7dfd [fcoe]
#17 [ffff88021f007df8] fcoe_destroy_work at ffffffffa02b7f08 [fcoe]
#18 [ffff88021f007e18] process_one_work at ffffffff8105d7ca
#19 [ffff88021f007e68] worker_thread at ffffffff81060513
#20 [ffff88021f007ee8] kthread at ffffffff810648b6
#21 [ffff88021f007f48] kernel_thread_helper at ffffffff813c40f4

Signed-off-by: Yi Zou <yi.zou@intel.com>
Tested-by: Ross Brattain <ross.b.brattain@intel.com>
Tested-by: Stephen Ko <stephen.s.ko@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
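The state check described in the netdev commit message can be modelled in userspace as follows. This is a sketch under stated assumptions, not the kernel change: the `model_` enum, the struct, the return values, and the `kobject_updates` counter (which stands in for the call to netdev_queue_update_kobjects()) are all hypothetical.

```c
/* Hypothetical mirror of the registration states named in the message. */
enum model_reg_state {
    MODEL_NETREG_UNINITIALIZED,
    MODEL_NETREG_REGISTERED,
    MODEL_NETREG_UNREGISTERING,
    MODEL_NETREG_UNREGISTERED
};

struct model_netdev {
    enum model_reg_state reg_state;
    unsigned int real_num_tx_queues;
    int kobject_updates;   /* counts updates that would touch txq sysfs kobjects */
};

/*
 * Leave the queue count alone once the device is unregistering or
 * unregistered, because its txq kobjects are assumed to be gone already.
 * Returns 0 on success or no-op, -1 on an invalid queue count.
 */
int model_set_real_num_tx_queues(struct model_netdev *dev, unsigned int txq)
{
    if (txq < 1)
        return -1;

    if (dev->reg_state == MODEL_NETREG_UNREGISTERING ||
        dev->reg_state == MODEL_NETREG_UNREGISTERED)
        return 0;

    if (dev->reg_state == MODEL_NETREG_REGISTERED)
        dev->kobject_updates++;   /* stand-in for the sysfs kobject update */

    dev->real_num_tx_queues = txq;
    return 0;
}
```

In this model a late callback such as the FCoE disable path becomes a no-op during teardown instead of reaching into sysfs entries that no longer exist.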
…S block during isolation for migration

commit 0bf380b upstream.

When isolating for migration, migration starts at the start of a zone which is not necessarily pageblock aligned. Further, it stops isolating when COMPACT_CLUSTER_MAX pages are isolated so migrate_pfn is generally not aligned. This allows isolate_migratepages() to call pfn_to_page() on an invalid PFN which can result in a crash. This was originally reported against a 3.0-based kernel with the following trace in a crash dump.

PID: 9902 TASK: d47aecd0 CPU: 0 COMMAND: "memcg_process_s"
 #0 [d72d3ad0] crash_kexec at c028cfdb
 #1 [d72d3b24] oops_end at c05c5322
 #2 [d72d3b38] __bad_area_nosemaphore at c0227e60
 #3 [d72d3bec] bad_area at c0227fb6
 #4 [d72d3c00] do_page_fault at c05c72e
 #5 [d72d3c80] error_code (via page_fault) at c05c47a4
    EAX: 00000000 EBX: 000c0000 ECX: 00000001 EDX: 00000807 EBP: 000c0000
    DS: 007b ESI: 00000001 ES: 007b EDI: f3000a80 GS: 6f50
    CS: 0060 EIP: c030b15a ERR: ffffffff EFLAGS: 00010002
 #6 [d72d3cb4] isolate_migratepages at c030b15a
 #7 [d72d3d14] zone_watermark_ok at c02d26cb
 #8 [d72d3d2c] compact_zone at c030b8d
 #9 [d72d3d68] compact_zone_order at c030bba1
#10 [d72d3db4] try_to_compact_pages at c030bc84
#11 [d72d3ddc] __alloc_pages_direct_compact at c02d61e7
#12 [d72d3e08] __alloc_pages_slowpath at c02d66c7
#13 [d72d3e78] __alloc_pages_nodemask at c02d6a97
#14 [d72d3eb8] alloc_pages_vma at c030a845
#15 [d72d3ed4] do_huge_pmd_anonymous_page at c03178eb
#16 [d72d3f00] handle_mm_fault at c02f36c6
#17 [d72d3f30] do_page_fault at c05c70ed
#18 [d72d3fb] error_code (via page_fault) at c05c47a4
    EAX: b71ff000 EBX: 00000001 ECX: 00001600 EDX: 00000431
    DS: 007b ESI: 08048950 ES: 007b EDI: bfaa3788
    SS: 007b ESP: bfaa36e0 EBP: bfaa3828 GS: 6f50
    CS: 0073 EIP: 080487c8 ERR: ffffffff EFLAGS: 00010202

It was also reported by Herbert van den Bergh against a 3.1-based kernel with the following snippet from the console log.

BUG: unable to handle kernel paging request at 01c00008
IP: [<c0522399>] isolate_migratepages+0x119/0x390
*pdpt = 000000002f7ce001 *pde = 0000000000000000

It is expected that it also affects 3.2.x and current mainline.

The problem is that pfn_valid is only called on the first PFN being checked and that PFN is not necessarily aligned. Let's say we have a case like this

H = MAX_ORDER_NR_PAGES boundary
| = pageblock boundary
m = cc->migrate_pfn
f = cc->free_pfn
o = memory hole

H------|------H------|----m-Hoooooo|ooooooH-f----|------H

The migrate_pfn is just below a memory hole and the free scanner is beyond the hole. When isolate_migratepages started, it scans from migrate_pfn to migrate_pfn+pageblock_nr_pages which is now in a memory hole. It checks pfn_valid() on the first PFN but then scans into the hole where there are not necessarily valid struct pages.

This patch ensures that isolate_migratepages calls pfn_valid when necessary.

Reported-by: Herbert van den Bergh <herbert.van.den.bergh@oracle.com>
Tested-by: Herbert van den Bergh <herbert.van.den.bergh@oracle.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Michal Nazarewicz <mina86@mina86.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
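The scanning problem in the diagram can be illustrated with a toy model in which validity is tracked per MAX_ORDER-sized block. The message only says that pfn_valid is called "when necessary"; re-checking at each MAX_ORDER_NR_PAGES boundary and skipping an invalid block is an assumption made for this sketch. The block size of 8 and all `model_` names are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

#define MODEL_MAX_ORDER_NR_PAGES 8u   /* tiny power of two, for illustration only */

/* valid_block[i] says whether the i-th MAX_ORDER-sized block has struct pages. */
struct model_zone {
    const bool *valid_block;
    size_t nr_blocks;
};

static bool model_pfn_valid(const struct model_zone *z, size_t pfn)
{
    size_t block = pfn / MODEL_MAX_ORDER_NR_PAGES;

    return block < z->nr_blocks && z->valid_block[block];
}

/*
 * Scan [low_pfn, end_pfn) and return the number of pages visited.
 * The first PFN is assumed to have been validated by the caller.
 * Validity is re-checked whenever the scan crosses a block boundary,
 * and an invalid block (a memory hole) is skipped as a whole.
 * *invalid_touched counts visits to invalid PFNs; it stays 0 when the
 * boundary check does its job.
 */
size_t model_scan(const struct model_zone *z, size_t low_pfn, size_t end_pfn,
                  size_t *invalid_touched)
{
    size_t visited = 0;

    *invalid_touched = 0;
    for (; low_pfn < end_pfn; low_pfn++) {
        if ((low_pfn & (MODEL_MAX_ORDER_NR_PAGES - 1)) == 0 &&
            !model_pfn_valid(z, low_pfn)) {
            /* Jump to the last PFN of the hole; the loop increment moves past it. */
            low_pfn += MODEL_MAX_ORDER_NR_PAGES - 1;
            continue;
        }
        if (!model_pfn_valid(z, low_pfn))
            (*invalid_touched)++;   /* would be a bad pfn_to_page() in the kernel */
        visited++;
    }
    return visited;
}
```

Starting the scan a few pages below a hole, as `m` does in the diagram, the model visits the remaining valid pages, skips the hole at its boundary, and resumes in the next valid block without touching an invalid PFN.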
…S block during isolation for migration commit 0bf380b upstream. When isolating for migration, migration starts at the start of a zone which is not necessarily pageblock aligned. Further, it stops isolating when COMPACT_CLUSTER_MAX pages are isolated so migrate_pfn is generally not aligned. This allows isolate_migratepages() to call pfn_to_page() on an invalid PFN which can result in a crash. This was originally reported against a 3.0-based kernel with the following trace in a crash dump. PID: 9902 TASK: d47aecd0 CPU: 0 COMMAND: "memcg_process_s" #0 [d72d3ad0] crash_kexec at c028cfdb #1 [d72d3b24] oops_end at c05c5322 #2 [d72d3b38] __bad_area_nosemaphore at c0227e60 #3 [d72d3bec] bad_area at c0227fb6 #4 [d72d3c00] do_page_fault at c05c72e #5 [d72d3c80] error_code (via page_fault) at c05c47a4 EAX: 00000000 EBX: 000c0000 ECX: 00000001 EDX: 00000807 EBP: 000c0000 DS: 007b ESI: 00000001 ES: 007b EDI: f3000a80 GS: 6f50 CS: 0060 EIP: c030b15a ERR: ffffffff EFLAGS: 00010002 torvalds#6 [d72d3cb4] isolate_migratepages at c030b15a torvalds#7 [d72d3d14] zone_watermark_ok at c02d26cb torvalds#8 [d72d3d2c] compact_zone at c030b8d torvalds#9 [d72d3d68] compact_zone_order at c030bba1 torvalds#10 [d72d3db4] try_to_compact_pages at c030bc84 torvalds#11 [d72d3ddc] __alloc_pages_direct_compact at c02d61e7 torvalds#12 [d72d3e08] __alloc_pages_slowpath at c02d66c7 torvalds#13 [d72d3e78] __alloc_pages_nodemask at c02d6a97 torvalds#14 [d72d3eb8] alloc_pages_vma at c030a845 torvalds#15 [d72d3ed4] do_huge_pmd_anonymous_page at c03178eb torvalds#16 [d72d3f00] handle_mm_fault at c02f36c6 torvalds#17 [d72d3f30] do_page_fault at c05c70ed torvalds#18 [d72d3fb] error_code (via page_fault) at c05c47a4 EAX: b71ff000 EBX: 00000001 ECX: 00001600 EDX: 00000431 DS: 007b ESI: 08048950 ES: 007b EDI: bfaa3788 SS: 007b ESP: bfaa36e0 EBP: bfaa3828 GS: 6f50 CS: 0073 EIP: 080487c8 ERR: ffffffff EFLAGS: 00010202 It was also reported by Herbert van den Bergh against 3.1-based kernel with the following snippet 
from the console log. BUG: unable to handle kernel paging request at 01c00008 IP: [<c0522399>] isolate_migratepages+0x119/0x390 *pdpt = 000000002f7ce001 *pde = 0000000000000000 It is expected that it also affects 3.2.x and current mainline. The problem is that pfn_valid is only called on the first PFN being checked and that PFN is not necessarily aligned. Lets say we have a case like this H = MAX_ORDER_NR_PAGES boundary | = pageblock boundary m = cc->migrate_pfn f = cc->free_pfn o = memory hole H------|------H------|----m-Hoooooo|ooooooH-f----|------H The migrate_pfn is just below a memory hole and the free scanner is beyond the hole. When isolate_migratepages started, it scans from migrate_pfn to migrate_pfn+pageblock_nr_pages which is now in a memory hole. It checks pfn_valid() on the first PFN but then scans into the hole where there are not necessarily valid struct pages. This patch ensures that isolate_migratepages calls pfn_valid when necessary. Reported-by: Herbert van den Bergh <herbert.van.den.bergh@oracle.com> Tested-by: Herbert van den Bergh <herbert.van.den.bergh@oracle.com> Signed-off-by: Mel Gorman <mgorman@suse.de> Acked-by: Michal Nazarewicz <mina86@mina86.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
…S block during isolation for migration commit 0bf380b upstream. When isolating for migration, migration starts at the start of a zone which is not necessarily pageblock aligned. Further, it stops isolating when COMPACT_CLUSTER_MAX pages are isolated so migrate_pfn is generally not aligned. This allows isolate_migratepages() to call pfn_to_page() on an invalid PFN which can result in a crash. This was originally reported against a 3.0-based kernel with the following trace in a crash dump. PID: 9902 TASK: d47aecd0 CPU: 0 COMMAND: "memcg_process_s" #0 [d72d3ad0] crash_kexec at c028cfdb #1 [d72d3b24] oops_end at c05c5322 #2 [d72d3b38] __bad_area_nosemaphore at c0227e60 #3 [d72d3bec] bad_area at c0227fb6 #4 [d72d3c00] do_page_fault at c05c72e #5 [d72d3c80] error_code (via page_fault) at c05c47a4 EAX: 00000000 EBX: 000c0000 ECX: 00000001 EDX: 00000807 EBP: 000c0000 DS: 007b ESI: 00000001 ES: 007b EDI: f3000a80 GS: 6f50 CS: 0060 EIP: c030b15a ERR: ffffffff EFLAGS: 00010002 #6 [d72d3cb4] isolate_migratepages at c030b15a #7 [d72d3d14] zone_watermark_ok at c02d26cb #8 [d72d3d2c] compact_zone at c030b8d #9 [d72d3d68] compact_zone_order at c030bba1 torvalds#10 [d72d3db4] try_to_compact_pages at c030bc84 torvalds#11 [d72d3ddc] __alloc_pages_direct_compact at c02d61e7 torvalds#12 [d72d3e08] __alloc_pages_slowpath at c02d66c7 torvalds#13 [d72d3e78] __alloc_pages_nodemask at c02d6a97 torvalds#14 [d72d3eb8] alloc_pages_vma at c030a845 torvalds#15 [d72d3ed4] do_huge_pmd_anonymous_page at c03178eb torvalds#16 [d72d3f00] handle_mm_fault at c02f36c6 torvalds#17 [d72d3f30] do_page_fault at c05c70ed torvalds#18 [d72d3fb] error_code (via page_fault) at c05c47a4 EAX: b71ff000 EBX: 00000001 ECX: 00001600 EDX: 00000431 DS: 007b ESI: 08048950 ES: 007b EDI: bfaa3788 SS: 007b ESP: bfaa36e0 EBP: bfaa3828 GS: 6f50 CS: 0073 EIP: 080487c8 ERR: ffffffff EFLAGS: 00010202 It was also reported by Herbert van den Bergh against 3.1-based kernel with the following snippet from the console log. 
BUG: unable to handle kernel paging request at 01c00008 IP: [<c0522399>] isolate_migratepages+0x119/0x390 *pdpt = 000000002f7ce001 *pde = 0000000000000000 It is expected that it also affects 3.2.x and current mainline. The problem is that pfn_valid is only called on the first PFN being checked and that PFN is not necessarily aligned. Lets say we have a case like this H = MAX_ORDER_NR_PAGES boundary | = pageblock boundary m = cc->migrate_pfn f = cc->free_pfn o = memory hole H------|------H------|----m-Hoooooo|ooooooH-f----|------H The migrate_pfn is just below a memory hole and the free scanner is beyond the hole. When isolate_migratepages started, it scans from migrate_pfn to migrate_pfn+pageblock_nr_pages which is now in a memory hole. It checks pfn_valid() on the first PFN but then scans into the hole where there are not necessarily valid struct pages. This patch ensures that isolate_migratepages calls pfn_valid when necessary. Reported-by: Herbert van den Bergh <herbert.van.den.bergh@oracle.com> Tested-by: Herbert van den Bergh <herbert.van.den.bergh@oracle.com> Signed-off-by: Mel Gorman <mgorman@suse.de> Acked-by: Michal Nazarewicz <mina86@mina86.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
…S block during isolation for migration commit 0bf380b upstream. When isolating for migration, migration starts at the start of a zone which is not necessarily pageblock aligned. Further, it stops isolating when COMPACT_CLUSTER_MAX pages are isolated so migrate_pfn is generally not aligned. This allows isolate_migratepages() to call pfn_to_page() on an invalid PFN which can result in a crash. This was originally reported against a 3.0-based kernel with the following trace in a crash dump. PID: 9902 TASK: d47aecd0 CPU: 0 COMMAND: "memcg_process_s" #0 [d72d3ad0] crash_kexec at c028cfdb #1 [d72d3b24] oops_end at c05c5322 #2 [d72d3b38] __bad_area_nosemaphore at c0227e60 #3 [d72d3bec] bad_area at c0227fb6 #4 [d72d3c00] do_page_fault at c05c72e #5 [d72d3c80] error_code (via page_fault) at c05c47a4 EAX: 00000000 EBX: 000c0000 ECX: 00000001 EDX: 00000807 EBP: 000c0000 DS: 007b ESI: 00000001 ES: 007b EDI: f3000a80 GS: 6f50 CS: 0060 EIP: c030b15a ERR: ffffffff EFLAGS: 00010002 #6 [d72d3cb4] isolate_migratepages at c030b15a #7 [d72d3d14] zone_watermark_ok at c02d26cb #8 [d72d3d2c] compact_zone at c030b8d #9 [d72d3d68] compact_zone_order at c030bba1 torvalds#10 [d72d3db4] try_to_compact_pages at c030bc84 torvalds#11 [d72d3ddc] __alloc_pages_direct_compact at c02d61e7 torvalds#12 [d72d3e08] __alloc_pages_slowpath at c02d66c7 torvalds#13 [d72d3e78] __alloc_pages_nodemask at c02d6a97 torvalds#14 [d72d3eb8] alloc_pages_vma at c030a845 torvalds#15 [d72d3ed4] do_huge_pmd_anonymous_page at c03178eb torvalds#16 [d72d3f00] handle_mm_fault at c02f36c6 torvalds#17 [d72d3f30] do_page_fault at c05c70ed torvalds#18 [d72d3fb] error_code (via page_fault) at c05c47a4 EAX: b71ff000 EBX: 00000001 ECX: 00001600 EDX: 00000431 DS: 007b ESI: 08048950 ES: 007b EDI: bfaa3788 SS: 007b ESP: bfaa36e0 EBP: bfaa3828 GS: 6f50 CS: 0073 EIP: 080487c8 ERR: ffffffff EFLAGS: 00010202 It was also reported by Herbert van den Bergh against 3.1-based kernel with the following snippet from the console log. 
BUG: unable to handle kernel paging request at 01c00008 IP: [<c0522399>] isolate_migratepages+0x119/0x390 *pdpt = 000000002f7ce001 *pde = 0000000000000000 It is expected that it also affects 3.2.x and current mainline. The problem is that pfn_valid is only called on the first PFN being checked and that PFN is not necessarily aligned. Lets say we have a case like this H = MAX_ORDER_NR_PAGES boundary | = pageblock boundary m = cc->migrate_pfn f = cc->free_pfn o = memory hole H------|------H------|----m-Hoooooo|ooooooH-f----|------H The migrate_pfn is just below a memory hole and the free scanner is beyond the hole. When isolate_migratepages started, it scans from migrate_pfn to migrate_pfn+pageblock_nr_pages which is now in a memory hole. It checks pfn_valid() on the first PFN but then scans into the hole where there are not necessarily valid struct pages. This patch ensures that isolate_migratepages calls pfn_valid when necessary. Reported-by: Herbert van den Bergh <herbert.van.den.bergh@oracle.com> Tested-by: Herbert van den Bergh <herbert.van.den.bergh@oracle.com> Signed-off-by: Mel Gorman <mgorman@suse.de> Acked-by: Michal Nazarewicz <mina86@mina86.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
- treat tailcall count as 32-bit for access and update
- change out_offset scope from file to function
- minor format/structure changes for consistency

Testing: (skipping fentry, fexit, freplace)
========

root@qemu-armhf:/usr/libexec/kselftests-bpf# modprobe test_bpf test_suite=test_tail_calls
test_bpf: #0 Tail call leaf jited:1 967 PASS
test_bpf: #1 Tail call 2 jited:1 1427 PASS
test_bpf: #2 Tail call 3 jited:1 2373 PASS
test_bpf: #3 Tail call 4 jited:1 2304 PASS
test_bpf: #4 Tail call load/store leaf jited:1 1684 PASS
test_bpf: #5 Tail call load/store jited:1 2249 PASS
test_bpf: #6 Tail call error path, max count reached jited:1 22538 PASS
test_bpf: #7 Tail call count preserved across function calls jited:1 1055668 PASS
test_bpf: #8 Tail call error path, NULL target jited:1 513 PASS
test_bpf: #9 Tail call error path, index out of range jited:1 392 PASS
test_bpf: test_tail_calls: Summary: 10 PASSED, 0 FAILED, [10/10 JIT'ed]

root@qemu-armhf:/usr/libexec/kselftests-bpf# ./test_progs -n 397/1-12,17-18,23-24,27-31
397/1 tailcalls/tailcall_1:OK
397/2 tailcalls/tailcall_2:OK
397/3 tailcalls/tailcall_3:OK
397/4 tailcalls/tailcall_4:OK
397/5 tailcalls/tailcall_5:OK
397/6 tailcalls/tailcall_6:OK
397/7 tailcalls/tailcall_bpf2bpf_1:OK
397/8 tailcalls/tailcall_bpf2bpf_2:OK
397/9 tailcalls/tailcall_bpf2bpf_3:OK
397/10 tailcalls/tailcall_bpf2bpf_4:OK
397/11 tailcalls/tailcall_bpf2bpf_5:OK
397/12 tailcalls/tailcall_bpf2bpf_6:OK
397/17 tailcalls/tailcall_poke:OK
397/18 tailcalls/tailcall_bpf2bpf_hierarchy_1:OK
397/23 tailcalls/tailcall_bpf2bpf_hierarchy_2:OK
397/24 tailcalls/tailcall_bpf2bpf_hierarchy_3:OK
397/27 tailcalls/tailcall_failure:OK
397/28 tailcalls/reject_tail_call_spin_lock:OK
397/29 tailcalls/reject_tail_call_rcu_lock:OK
397/30 tailcalls/reject_tail_call_preempt_lock:OK
397/31 tailcalls/reject_tail_call_ref:OK
397 tailcalls:OK
Summary: 1/21 PASSED, 0 SKIPPED, 0 FAILED

Signed-off-by: Tony Ambardar <tony.ambardar@gmail.com>
…CAN XL step 3/3"

Vincent Mailhol <mailhol@kernel.org> says:

In November last year, I sent an RFC to introduce CAN XL [1]. That RFC, despite positive feedback, was put on hold due to some unanswered questions concerning the PWM encoding [2].

While stuck, some small preparation work was done in parallel in [3] by refactoring the struct can_priv and doing some trivial clean-up and renaming. Initially, [3] received zero feedback but was eventually merged after splitting it into smaller parts and resending it.

Finally, in July this year, we clarified the remaining mysteries about PWM calculation, thus unlocking the series. Summer being a bit busy because of some personal matters brings us to now.

After doing all the refactoring and adding all the CAN XL features, the final result is more than 30 patches, definitely too many for a single series. So I am splitting the remaining changes in three:

- can: rework the CAN MTU logic [4]
- can: netlink: preparation before introduction of CAN XL (this series)
- CAN XL (will come right after the two preparation series get merged)

And thus, this series continues and finishes the preparation work done in [3] and [4]. It contains all the refactoring needed to smoothly introduce CAN XL. The goal is to:

- split the functions into smaller pieces: CAN XL will introduce a fair amount of code. And some functions which are already fairly long (86 lines for can_validate(), 215 lines for can_changelink()) would grow to disproportionate sizes if the CAN XL logic were to be inlined in those functions.
- repurpose the existing code to handle both CAN FD and CAN XL: a huge part of CAN XL simply reuses the CAN FD logic. All the existing CAN FD logic is made more generic to handle both CAN FD and XL.

In more detail:

- Patch #1 moves struct data_bittiming_params from dev.h to bittiming.h and patch #2 makes can_get_relative_tdco() FD agnostic before also moving it to bittiming.h.
- Patch #3 adds some comments to netlink.h tagging which IFLA symbols are FD specific.
- Patches #4 to #6 are refactoring can_validate() and can_validate_bittiming().
- Patches #7 to #11 are refactoring can_changelink() and can_tdc_changelink().
- Patches #12 and #13 are refactoring can_get_size() and can_tdc_get_size().
- Patches #14 to #17 are refactoring can_fill_info() and can_tdc_fill_info().
- Patch #18 makes can_calc_tdco() FD agnostic.
- Patch #19 adds can_get_ctrlmode_str() which converts control mode flags into strings. This is done in preparation of patch #20.
- Patch #20 is the final patch and improves the user experience by providing detailed error messages whenever invalid parameters are provided. All those error messages came in handy when debugging the upcoming CAN XL patches.

Aside from the last patch, the other changes do not impact any of the existing functionalities.

The follow-up series which introduces CAN XL is nearly completed but will be sent only once this one is approved: one thing at a time, I do not want to overwhelm people (including myself).

[1] https://lore.kernel.org/linux-can/20241110155902.72807-16-mailhol.vincent@wanadoo.fr/
[2] https://lore.kernel.org/linux-can/c4771c16-c578-4a6d-baee-918fe276dbe9@wanadoo.fr/
[3] https://lore.kernel.org/linux-can/20241110155902.72807-16-mailhol.vincent@wanadoo.fr/
[4] https://lore.kernel.org/linux-can/20250923-can-fix-mtu-v2-0-984f9868db69@kernel.org/

Link: https://patch.msgid.link/20250923-canxl-netlink-prep-v4-0-e720d28f66fe@kernel.org
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
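To give a rough idea of what turning control mode flags into strings buys for error reporting (patches #19 and #20), here is a small userspace sketch. It is not the code from the series: the `EX_CTRLMODE_*` values, `ex_ctrlmode_str()` and `ex_validate()` are hypothetical names invented for this illustration, and the message wording is made up.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical flag values, for illustration only. */
#define EX_CTRLMODE_LOOPBACK   (UINT32_C(1) << 0)
#define EX_CTRLMODE_LISTENONLY (UINT32_C(1) << 1)
#define EX_CTRLMODE_FD         (UINT32_C(1) << 2)

/* One name per bit position, in the same order as the flags above. */
static const char *const ex_ctrlmode_names[] = {
	"loopback",
	"listen-only",
	"fd",
};

#define EX_NR_NAMES (sizeof(ex_ctrlmode_names) / sizeof(ex_ctrlmode_names[0]))

/* Return the name of a single flag, or "<unknown>" for anything else. */
static const char *ex_ctrlmode_str(uint32_t flag)
{
	size_t bit;

	/* Reject zero and values with more than one bit set. */
	if (flag == 0 || (flag & (flag - 1)) != 0)
		return "<unknown>";

	for (bit = 0; bit < EX_NR_NAMES; bit++)
		if (flag == (UINT32_C(1) << bit))
			return ex_ctrlmode_names[bit];

	return "<unknown>";
}

/*
 * Check the requested modes against the supported ones. On failure, write a
 * message naming the lowest offending flag into msg and return -1.
 */
static int ex_validate(uint32_t requested, uint32_t supported,
		       char *msg, size_t len)
{
	uint32_t bad = requested & ~supported;

	assert(msg != NULL && len > 0);
	msg[0] = '\0';

	if (bad == 0)
		return 0;

	bad &= ~bad + 1;	/* isolate the lowest set bit */
	snprintf(msg, len, "%s: mode not supported", ex_ctrlmode_str(bad));
	return -1;
}
```

Calling `ex_validate(EX_CTRLMODE_FD | EX_CTRLMODE_LISTENONLY, EX_CTRLMODE_LOOPBACK, msg, sizeof(msg))` fails and names the listen-only flag in `msg`, which is more useful to a user than a bare error code.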
For the latest kernel, with arm's multi_v7_defconfig plus CONFIG_PREEMPT=y, CONFIG_DEBUG_PREEMPT=y and CONFIG_ARM_LPAE=y, if a user program tries to access any valid kernel address, for example:

```c
#include <signal.h>
#include <stdint.h>

/* SIGSEGV handler: spin forever so the process stays alive after the fault. */
static void han(int x)
{
	while (1);
}

int main(void)
{
	signal(SIGSEGV, han);
	/* 0xc0331fd4 is just a kernel address in kernel .text section */
	__asm__ volatile (""::"r"(*(int *)(uintptr_t)0xc0331fd4):"memory");
	while (1);
	return 0;
}
```

the following WARN will be triggered:

[ 1.089103] BUG: using smp_processor_id() in preemptible [00000000] code: init/1
[ 1.093367] caller is __do_user_fault+0x20/0x6c
[ 1.094355] CPU: 0 UID: 0 PID: 1 Comm: init Not tainted 6.14.3 #7
[ 1.094585] Hardware name: Generic DT based system
[ 1.094706] Call trace:
[ 1.095211] unwind_backtrace from show_stack+0x10/0x14
[ 1.095329] show_stack from dump_stack_lvl+0x50/0x5c
[ 1.095352] dump_stack_lvl from check_preemption_disabled+0x104/0x108
[ 1.095448] check_preemption_disabled from __do_user_fault+0x20/0x6c
[ 1.095459] __do_user_fault from do_page_fault+0x334/0x3dc
[ 1.095505] do_page_fault from do_DataAbort+0x30/0xa8
[ 1.095528] do_DataAbort from __dabt_usr+0x54/0x60
[ 1.095570] Exception stack(0xf0825fb0 to 0xf0825ff8)

This WARN indicates that the current CPU is not stable, which means that current can be migrated to other CPUs. Therefore, in some scenarios, mitigation measures may be missed, such as:

1. Thread A attacks on cpu0 and triggers do_page_fault
2. Thread A migrates to cpu1 before bp_hardening
3. Thread A does bp_hardening on cpu1
4. Thread A migrates to cpu0
5. Thread A does ret_to_user on cpu0

Assuming that none of the context_switch() calls mentioned above triggers switch_mm(), none of them triggers the mitigation either. Thread A has then successfully bypassed the mitigation on cpu0.
Over the past six years, there have been continuous reports of this bug:

2025.4.24 https://lore.kernel.org/all/20250424100437.27477-1-xieyuanbin1@huawei.com/
2022.6.22 https://lore.kernel.org/all/795c9463-452e-bf64-1cc0-c318ccecb1da@I-love.SAKURA.ne.jp/
2021.3.25 https://lore.kernel.org/all/20210325095049.6948-1-liu.xiang@zlingsmart.com/
2021.3.12 https://lore.kernel.org/all/20210312041246.15113-1-qiang.zhang@windriver.com/
2021.3.11 https://lore.kernel.org/all/0000000000007604cb05bd3e6968@google.com/
2019.5.27 https://lore.kernel.org/all/1558949979-129251-1-git-send-email-gaoyongliang@huawei.com/
2019.3.19 https://lore.kernel.org/all/20190319203239.gl46fxnfz6gzeeic@linutronix.de/

To fix it, we must check whether mitigations are needed before enabling interrupts (with PREEMPT) or before calling mm_read_lock() (without PREEMPT).

Fixes: f5fe12b ("ARM: spectre-v2: harden user aborts in kernel space")
Signed-off-by: Xie Yuanbin <xieyuanbin1@huawei.com>
…ge_order()

Patch series "mm: MM owner tracking for large folios (!hugetlb) + CONFIG_NO_PAGE_MAPCOUNT", v3.

Let's add an "easy" way to decide -- without false positives, without page-mapcounts and without page table/rmap scanning -- whether a large folio is "certainly mapped exclusively" into a single MM, or whether it "maybe mapped shared" into multiple MMs.

Use that information to implement Copy-on-Write reuse, to convert folio_likely_mapped_shared() to folio_maybe_mapped_shared(), and to introduce a kernel config option that lets us not use+maintain per-page mapcounts in large folios anymore.

The bigger picture was presented at LSF/MM [1].

This series is effectively a follow-up on my early work [2], which implemented a more precise, but also more complicated, way to identify whether a large folio is "mapped shared" into multiple MMs or "mapped exclusively" into a single MM.

1 Patch Organization
====================

Patch #1 -> #6: make more room in order-1 folios, so we have two "unsigned long" available for our purposes
Patch #7 -> #11: preparations
Patch #12: MM owner tracking for large folios
Patch #13: COW reuse for PTE-mapped anon THP
Patch #14: folio_maybe_mapped_shared()
Patch #15 -> #20: introduce and implement CONFIG_NO_PAGE_MAPCOUNT

2 MM owner tracking
===================

We assign each MM a unique ID ("MM ID"), to be able to squeeze more information in our folios. On 32bit we use 15-bit IDs, on 64bit we use 31-bit IDs.

For each large folio, we now store two MM-ID+mapcount ("slot") combinations:

* mm0_id + mm0_mapcount
* mm1_id + mm1_mapcount

On 32bit, we use a 16-bit per-MM mapcount, on 64bit an ordinary 32bit mapcount. This way, we require 2x "unsigned long" on 32bit and 64bit for both slots.
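The two-slot layout above can be modelled in plain userspace C. This is a simplified sketch under stated assumptions, not the kernel implementation: `struct ex_folio`, `ex_map()`, `ex_unmap()` and `ex_maybe_mapped_shared()` are hypothetical names, there is no locking, and the ID and counter widths are ordinary C types rather than the packed fields described above. A folio counts as exclusively mapped when one tracked MM accounts for every mapping, or when only a single mapping is left.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define EX_MM_ID_NONE 0u	/* hypothetical marker for a free slot */
#define EX_NR_SLOTS   2

/* Hypothetical, unpacked model of the per-folio tracking state. */
struct ex_folio {
	int large_mapcount;			/* mappings across all MMs */
	uint32_t mm_id[EX_NR_SLOTS];		/* MM ID stored in each slot */
	int mm_mapcount[EX_NR_SLOTS];		/* mappings held by that MM */
};

/* Account one new mapping of a folio page by the MM with the given ID. */
static void ex_map(struct ex_folio *f, uint32_t id)
{
	int i;

	assert(id != EX_MM_ID_NONE);
	f->large_mapcount++;

	/* The MM already owns a slot: just bump its counter. */
	for (i = 0; i < EX_NR_SLOTS; i++) {
		if (f->mm_id[i] == id) {
			f->mm_mapcount[i]++;
			return;
		}
	}

	/* Otherwise claim a free slot, if there is one. */
	for (i = 0; i < EX_NR_SLOTS; i++) {
		if (f->mm_id[i] == EX_MM_ID_NONE) {
			f->mm_id[i] = id;
			f->mm_mapcount[i] = 1;
			return;
		}
	}

	/* No free slot: this MM stays untracked. */
}

/* Drop one mapping held by the MM with the given ID. */
static void ex_unmap(struct ex_folio *f, uint32_t id)
{
	int i;

	assert(f->large_mapcount > 0);
	f->large_mapcount--;

	for (i = 0; i < EX_NR_SLOTS; i++) {
		if (f->mm_id[i] == id) {
			if (--f->mm_mapcount[i] == 0)
				f->mm_id[i] = EX_MM_ID_NONE;
			return;
		}
	}
}

/* false: certainly mapped exclusively; true: maybe mapped shared. */
static bool ex_maybe_mapped_shared(const struct ex_folio *f)
{
	int i;

	/* Zero or one mapping can only belong to a single MM. */
	if (f->large_mapcount <= 1)
		return false;

	for (i = 0; i < EX_NR_SLOTS; i++)
		if (f->mm_id[i] != EX_MM_ID_NONE &&
		    f->mm_mapcount[i] == f->large_mapcount)
			return false;

	return true;
}
```

In this model a slot counter never exceeds the number of mappings its MM really holds, so a match against the total mapcount cannot report "exclusive" for a folio that is actually shared; the imprecision only goes in the "maybe shared" direction.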
Paired with the large mapcount, we can reliably identify whether one of these MMs is the current owner (-> owns all mappings) or even holds all folio references (-> owns all mappings, and all references are from mappings).

As long as only two MMs map folio pages at a time, we can reliably and precisely identify whether a large folio is "mapped shared" or "mapped exclusively".

Any additional MM that starts mapping the folio while there are no free slots becomes an "untracked MM". If one such "untracked MM" is the last one mapping a folio exclusively, we will not detect the folio as "mapped exclusively" but instead as "maybe mapped shared". (exception: only a single mapping remains)

So that's where the approach gets imprecise.

For now, we use a bit-spinlock to sync the large mapcount + slots, and make sure we do keep the machinery fast, to not degrade (un)map performance drastically: for example, we make sure to only use a single atomic (when grabbing the bit-spinlock), like we would already perform when updating the large mapcount.

3 CONFIG_NO_PAGE_MAPCOUNT
=========================

Patches #15 -> #20 spell out and document what exactly is affected when not maintaining the per-page mapcounts in large folios anymore.

Most importantly, as we cannot maintain folio->_nr_pages_mapped anymore when (un)mapping pages, we'll account a complete folio as mapped if a single page is mapped. In addition, we'll not detect partially mapped anonymous folios as such in all cases yet.

Likely less relevant changes include that we might now under-estimate the USS (Unique Set Size) of a process, but never over-estimate it.

The goal is to make CONFIG_NO_PAGE_MAPCOUNT the default at some point, to then slowly make it the only option, as we learn about real-life impacts and possible ways to mitigate them.

4 Performance
=============

Detailed performance numbers were included in v1 [3], and not that much changed between v1 and v2.
I did plenty of measurements on different systems in the meantime, that all revealed slightly different results.

The pte-mapped-folio micro-benchmarks [4] are fairly sensitive to code layout changes on some systems. Especially the fork() benchmark started being more-shaky-than-before on recent kernels for some reason.

In summary, with my micro-benchmarks:

* Small folios are not impacted.
* CoW performance seems to be mostly unchanged across all folio sizes.
* CoW reuse performance of large folios now matches CoW reuse performance of small folios, because we now actually implement the CoW reuse optimization. On an Intel Xeon Silver 4210R I measured a ~65% reduction in runtime, on an arm64 system I measured ~54% reduction.
* munmap() performance improves with CONFIG_NO_PAGE_MAPCOUNT. I saw double-digit % reduction (up to ~30% on an Intel Xeon Silver 4210R and up to ~70% on an AmpereOne A192-32X) with larger folios. The larger the folios, the larger the performance improvement.
* munmap() performance very slightly (couple percent) degrades without CONFIG_NO_PAGE_MAPCOUNT for smaller folios. For larger folios, there seems to be no change at all.
* fork() performance improves with CONFIG_NO_PAGE_MAPCOUNT. I saw double-digit % reduction (up to ~20% on an Intel Xeon Silver 4210R and up to ~10% on an AmpereOne A192-32X) with larger folios. The larger the folios, the larger the performance improvement.
* While fork() performance without CONFIG_NO_PAGE_MAPCOUNT seems to be almost unchanged on some systems, I saw some degradation for smaller folios on the AmpereOne A192-32X. I did not investigate the details yet, but I suspect code layout changes or suboptimal code placement / inlining.

I'm not too worried about the fork() micro-benchmarks for smaller folios given how shaky the results are lately and by how much we improved fork() performance recently.
I also ran the case-anon-cow-rand and case-anon-cow-seq parts of vm-scalability, to assess the scalability and the impact of the bit-spinlock. My measurements on a 2-socket 10-core Intel Xeon Silver 4210R system revealed no significant changes.

Similarly, running these benchmarks with 2 MiB THPs enabled on the AmpereOne A192-32X with 192 cores, I got < 1% difference with < 1% stdev, which is nice.

So far, I did not get my hands on a similarly large system with multiple sockets.

I found no other fitting scalability benchmarks that seem to really hammer on concurrent mapping/unmapping of large folio pages like case-anon-cow-seq does.

5 Concerns
==========

5.1 Bit spinlock
----------------

I'm not quite happy about the bit-spinlock, but so far it does not seem to affect scalability in my measurements.

If it ever becomes a problem we could either investigate improving the locking, or simply stop the MM tracking once there are "too many mappings" and assume that the folio is "mapped shared" until it is freed.

This would be similar (but slightly different) to the "0,1,2,stopped" counting idea Willy had at some point. Adding that logic to "stop tracking" adds more code to the hot path, so I avoided that for now.

5.2 folio_maybe_mapped_shared()
-------------------------------

I documented the change from folio_likely_mapped_shared() to folio_maybe_mapped_shared() quite extensively. If we run into surprises, I have some ideas on how to resolve them. For now, I think we should be fine.

5.3 Added code to map/unmap hot path
------------------------------------

So far, it looks like the added code on the rmap hot path does not really seem to matter much in the bigger picture. I'd like to further reduce it (and possibly improve fork() performance further), but I don't easily see how right now. Well, and I am out of puff 🙂

That said, alternatives I considered (e.g., per-MM per-folio mapcount) would add a lot more overhead to these hot paths.

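To make the hot-path bookkeeping a bit more concrete, here is a minimal userspace sketch of the two-slot MM tracking described earlier in this cover letter. It is only a model under stated assumptions: the struct, the `toy_*` names and the `TOY_MM_ID_NONE` marker are hypothetical, there is no locking (the series uses a bit-spinlock), and the check compares slot mapcounts against the large mapcount rather than reproducing the kernel implementation.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical marker for a free slot; real MM IDs are assumed to be != 0. */
#define TOY_MM_ID_NONE 0u

/* Toy model of the per-folio tracking state, not the kernel layout. */
struct toy_large_folio {
    int large_mapcount;      /* mappings of folio pages across all MMs */
    unsigned int mm_id[2];   /* the two tracked MM IDs ("slots") */
    int mm_mapcount[2];      /* mappings contributed by each tracked MM */
};

/* MM 'mm_id' maps 'nr' pages of the folio. */
static void toy_map(struct toy_large_folio *f, unsigned int mm_id, int nr)
{
    assert(mm_id != TOY_MM_ID_NONE);
    f->large_mapcount += nr;

    /* Already tracked: just bump that slot. */
    for (int i = 0; i < 2; i++) {
        if (f->mm_id[i] == mm_id) {
            f->mm_mapcount[i] += nr;
            return;
        }
    }
    /* Otherwise take a free slot if there is one. */
    for (int i = 0; i < 2; i++) {
        if (f->mm_id[i] == TOY_MM_ID_NONE) {
            f->mm_id[i] = mm_id;
            f->mm_mapcount[i] = nr;
            return;
        }
    }
    /* No free slot: "untracked MM", visible only in large_mapcount. */
}

/* MM 'mm_id' unmaps 'nr' pages of the folio. */
static void toy_unmap(struct toy_large_folio *f, unsigned int mm_id, int nr)
{
    f->large_mapcount -= nr;
    for (int i = 0; i < 2; i++) {
        if (f->mm_id[i] == mm_id) {
            f->mm_mapcount[i] -= nr;
            if (f->mm_mapcount[i] == 0)
                f->mm_id[i] = TOY_MM_ID_NONE; /* slot becomes free */
            return;
        }
    }
}

/* false: certainly mapped exclusively; true: maybe mapped shared. */
static bool toy_maybe_mapped_shared(const struct toy_large_folio *f)
{
    /* Exception from the text: at most a single mapping remains. */
    if (f->large_mapcount <= 1)
        return false;
    /* A tracked MM that owns all mappings is the exclusive owner. */
    for (int i = 0; i < 2; i++) {
        if (f->mm_id[i] != TOY_MM_ID_NONE &&
            f->mm_mapcount[i] == f->large_mapcount)
            return false;
    }
    return true;
}
```

In this model, a third MM that maps the folio while both slots are taken stays untracked; if it later ends up as the only MM mapping the folio with more than one mapping, the check still answers "maybe mapped shared", which is the imprecision the cover letter describes.
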
6 Future Work
=============

6.1 Large mapcount
------------------

It would be very handy if the large mapcount would count how often folio pages are actually mapped into page tables: a PMD on x86-64 would count 512 times. Calculating the average per-page mapcount will be easy, and remapping (PMD->PTE) folios would get even faster.

That would also remove the need for the entire mapcount (except for PMD-sized folios for memory statistics reasons ...), and allow for mapping folios larger than PMDs (e.g., 4 MiB) easily.

We likely would also have to take the same number of folio references to make our folio_mapcount() == folio_ref_count() work, and we'd want to be able to avoid mapcount+refcount overflows: this could already become an issue with pte-mapped PUD-sized folios (fsdax).

One approach we discussed in the THP cabal meeting is (1) extending the mapcount for large folios to 64bit (at least on 64bit systems) and (2) keeping the refcount at 32bit, but (3) having exactly one reference if the mapcount != 0.

It should be doable, but there are some corner cases to consider on the unmap path; it is something that I will be looking into next.

6.2 hugetlb
-----------

I'd love to make use of the same tracking also for hugetlb.

The real problem is PMD table sharing: getting a page mapped by MM X and unmapped by MM Y will not work. With mshare, that problem should not exist (all mapping/unmapping will be routed through the mshare MM).

[1] https://lwn.net/Articles/974223/
[2] https://lore.kernel.org/linux-mm/a9922f58-8129-4f15-b160-e0ace581bcbe@redhat.com/T/
[3] https://lkml.kernel.org/r/20240829165627.2256514-1-david@redhat.com
[4] https://gitlab.com/davidhildenbrand/scratchspace/-/raw/main/pte-mapped-folio-benchmarks.c

This patch (of 20):

Let's factor it out into a simple helper function. This helper will also come in handy when working with code where we know that our folio is large.

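As a rough illustration of the factoring described above (not the kernel code itself), the split between "order of a folio we already know is large" and "order of any folio" could look like the following. The `toy_folio` struct and all `toy_*` names are assumptions made up for this sketch; the kernel keeps the order inside struct folio in its own way.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct folio; not the kernel layout. */
struct toy_folio {
    bool large;          /* models "this is a large folio" */
    unsigned int order;  /* only meaningful when 'large' is true */
};

/* Caller already knows the folio is large, so no size check is needed. */
static inline unsigned int toy_folio_large_order(const struct toy_folio *folio)
{
    assert(folio->large);
    return folio->order;
}

/* General variant: small folios have order 0, large ones use the helper. */
static inline unsigned int toy_folio_order(const struct toy_folio *folio)
{
    if (!folio->large)
        return 0;
    return toy_folio_large_order(folio);
}

/* Number of pages in the folio, derived from its order. */
static inline unsigned long toy_folio_nr_pages(const struct toy_folio *folio)
{
    return 1UL << toy_folio_order(folio);
}
```

Code paths that have already established that the folio is large can call the `_large_` variant directly and skip the branch.
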
Maybe in the future we'll have the order readily available for small and large folios; in that case, folio_large_order() would simply translate to folio_order().

Link: https://lkml.kernel.org/r/20250303163014.1128035-1-david@redhat.com
Link: https://lkml.kernel.org/r/20250303163014.1128035-2-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Lance Yang <ioworker0@gmail.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Andy Lutomirks^H^Hski <luto@kernel.org>
Cc: Borislav Betkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jann Horn <jannh@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Matthew Wilcow (Oracle) <willy@infradead.org>
Cc: Michal Koutn <mkoutny@suse.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: tejun heo <tj@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Zefan Li <lizefan.x@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
(cherry picked from commit 6220ea5)
Signed-off-by: David Hildenbrand <david@redhat.com>
…ge_order() Patch series "mm: MM owner tracking for large folios (!hugetlb) + CONFIG_NO_PAGE_MAPCOUNT", v3. Let's add an "easy" way to decide -- without false positives, without page-mapcounts and without page table/rmap scanning -- whether a large folio is "certainly mapped exclusively" into a single MM, or whether it "maybe mapped shared" into multiple MMs. Use that information to implement Copy-on-Write reuse, to convert folio_likely_mapped_shared() to folio_maybe_mapped_share(), and to introduce a kernel config option that lets us not use+maintain per-page mapcounts in large folios anymore. The bigger picture was presented at LSF/MM [1]. This series is effectively a follow-up on my early work [2], which implemented a more precise, but also more complicated, way to identify whether a large folio is "mapped shared" into multiple MMs or "mapped exclusively" into a single MM. 1 Patch Organization ==================== Patch #1 -> torvalds#6: make more room in order-1 folios, so we have two "unsigned long" available for our purposes Patch torvalds#7 -> torvalds#11: preparations Patch torvalds#12: MM owner tracking for large folios Patch torvalds#13: COW reuse for PTE-mapped anon THP Patch torvalds#14: folio_maybe_mapped_shared() Patch torvalds#15 -> torvalds#20: introduce and implement CONFIG_NO_PAGE_MAPCOUNT 2 MM owner tracking =================== We assign each MM a unique ID ("MM ID"), to be able to squeeze more information in our folios. On 32bit we use 15-bit IDs, on 64bit we use 31-bit IDs. For each large folios, we now store two MM-ID+mapcount ("slot") combinations: * mm0_id + mm0_mapcount * mm1_id + mm1_mapcount On 32bit, we use a 16-bit per-MM mapcount, on 64bit an ordinary 32bit mapcount. This way, we require 2x "unsigned long" on 32bit and 64bit for both slots. 
Paired with the large mapcount, we can reliably identify whether one of these MMs is the current owner (-> owns all mappings) or even holds all folio references (-> owns all mappings, and all references are from mappings). As long as only two MMs map folio pages at a time, we can reliably and precisely identify whether a large folio is "mapped shared" or "mapped exclusively". Any additional MM that starts mapping the folio while there are no free slots becomes an "untracked MM". If one such "untracked MM" is the last one mapping a folio exclusively, we will not detect the folio as "mapped exclusively" but instead as "maybe mapped shared". (exception: only a single mapping remains) So that's where the approach gets imprecise. For now, we use a bit-spinlock to sync the large mapcount + slots, and make sure we do keep the machinery fast, to not degrade (un)map performance drastically: for example, we make sure to only use a single atomic (when grabbing the bit-spinlock), like we would already perform when updating the large mapcount. 3 CONFIG_NO_PAGE_MAPCOUNT ========================= patch torvalds#15 -> torvalds#20 spell out and document what exactly is affected when not maintaining the per-page mapcounts in large folios anymore. Most importantly, as we cannot maintain folio->_nr_pages_mapped anymore when (un)mapping pages, we'll account a complete folio as mapped if a single page is mapped. In addition, we'll not detect partially mapped anonymous folios as such in all cases yet. Likely less relevant changes include that we might now under-estimate the USS (Unique Set Size) of a process, but never over-estimate it. The goal is to make CONFIG_NO_PAGE_MAPCOUNT the default at some point, to then slowly make it the only option, as we learn about real-life impacts and possible ways to mitigate them. 4 Performance ============= Detailed performance numbers were included in v1 [3], and not that much changed between v1 and v2. 
I did plenty of measurements on different systems in the meantime, that all revealed slightly different results. The pte-mapped-folio micro-benchmarks [4] are fairly sensitive to code layout changes on some systems. Especially the fork() benchmark started being more-shaky-than-before on recent kernels for some reason. In summary, with my micro-benchmarks: * Small folios are not impacted. * CoW performance seems to be mostly unchanged across all folios sizes. * CoW reuse performance of large folios now matches CoW reuse performance of small folios, because we now actually implement the CoW reuse optimization. On an Intel Xeon Silver 4210R I measured a ~65% reduction in runtime, on an arm64 system I measured ~54% reduction. * munmap() performance improves with CONFIG_NO_PAGE_MAPCOUNT. I saw double-digit % reduction (up to ~30% on an Intel Xeon Silver 4210R and up to ~70% on an AmpereOne A192-32X) with larger folios. The larger the folios, the larger the performance improvement. * munmao() performance very slightly (couple percent) degrades without CONFIG_NO_PAGE_MAPCOUNT for smaller folios. For larger folios, there seems to be no change at all. * fork() performance improves with CONFIG_NO_PAGE_MAPCOUNT. I saw double-digit % reduction (up to ~20% on an Intel Xeon Silver 4210R and up to ~10% on an AmpereOne A192-32X) with larger folios. The larger the folios, the larger the performance improvement. * While fork() performance without CONFIG_NO_PAGE_MAPCOUNT seems to be almost unchanged on some systems, I saw some degradation for smaller folios on the AmpereOne A192-32X. I did not investigate the details yet, but I suspect code layout changes or suboptimal code placement / inlining. I'm not to worried about the fork() micro-benchmarks for smaller folios given how shaky the results are lately and by how much we improved fork() performance recently. 
I also ran case-anon-cow-rand and case-anon-cow-seq, which are part of vm-scalability, to assess the scalability and the impact of the bit-spinlock. My measurements on a 2-socket 10-core Intel Xeon Silver 4210R system revealed no significant changes.

Similarly, running these benchmarks with 2 MiB THPs enabled on the AmpereOne A192-32X with 192 cores, I got < 1% difference with < 1% stdev, which is nice.

So far, I did not get my hands on a similarly large system with multiple sockets.

I found no other fitting scalability benchmarks that seem to really hammer on concurrent mapping/unmapping of large folio pages like case-anon-cow-seq does.

5 Concerns
==========

5.1 Bit spinlock
----------------

I'm not quite happy about the bit-spinlock, but so far it does not seem to affect scalability in my measurements.

If it ever becomes a problem we could either investigate improving the locking, or simply stop the MM tracking once there are "too many mappings" and assume that the folio is "mapped shared" until it is freed.

This would be similar (but slightly different) to the "0,1,2,stopped" counting idea Willy had at some point. Adding that logic to "stop tracking" adds more code to the hot path, so I avoided that for now.

5.2 folio_maybe_mapped_shared()
-------------------------------

I documented the change from folio_likely_mapped_shared() to folio_maybe_mapped_shared() quite extensively. If we run into surprises, I have some ideas on how to resolve them. For now, I think we should be fine.

5.3 Added code to map/unmap hot path
------------------------------------

So far, it looks like the added code on the rmap hot path does not seem to matter much in the bigger picture. I'd like to further reduce it (and possibly improve fork() performance further), but I don't easily see how right now. Well, and I am out of puff 🙂

That said, alternatives I considered (e.g., per-MM per-folio mapcount) would add a lot more overhead to these hot paths.

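To make the two-slot tracking and its imprecision more tangible, here is a minimal single-threaded userspace sketch. It is not the kernel implementation: `struct folio_model`, `model_map()`, `model_unmap()`, `model_maybe_mapped_shared()` and the "ID 0 means free slot" convention are all hypothetical names and assumptions made for illustration, and the bit-spinlock that would serialize these updates in the kernel is left out entirely.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption for this sketch only: MM ID 0 marks a free slot. */
#define MODEL_MM_ID_NONE 0u

/* Hypothetical, simplified stand-in for the per-folio tracking state. */
struct folio_model {
	int large_mapcount;   /* total number of mappings of the folio */
	uint32_t mm_id[2];    /* MM IDs of the (up to) two tracked MMs */
	int mm_mapcount[2];   /* mappings accounted to each tracked MM */
};

/* Account one new mapping of the folio by the MM with the given ID. */
static void model_map(struct folio_model *f, uint32_t id)
{
	f->large_mapcount++;

	/* Already tracked: bump the per-MM mapcount of that slot. */
	for (int i = 0; i < 2; i++) {
		if (f->mm_id[i] == id) {
			f->mm_mapcount[i]++;
			return;
		}
	}
	/* Not tracked yet: claim a free slot if there is one. */
	for (int i = 0; i < 2; i++) {
		if (f->mm_id[i] == MODEL_MM_ID_NONE) {
			f->mm_id[i] = id;
			f->mm_mapcount[i] = 1;
			return;
		}
	}
	/* No free slot: this MM remains an "untracked MM". */
}

/* Account the removal of one mapping by the MM with the given ID. */
static void model_unmap(struct folio_model *f, uint32_t id)
{
	f->large_mapcount--;

	for (int i = 0; i < 2; i++) {
		if (f->mm_id[i] == id) {
			/* Free the slot once its last tracked mapping is gone. */
			if (--f->mm_mapcount[i] == 0)
				f->mm_id[i] = MODEL_MM_ID_NONE;
			return;
		}
	}
	/* Untracked MM: only the large mapcount changes. */
}

/* false: certainly mapped exclusively; true: maybe mapped shared. */
static bool model_maybe_mapped_shared(const struct folio_model *f)
{
	/* Exception from the text: at most a single mapping remains. */
	if (f->large_mapcount <= 1)
		return false;

	/* A tracked MM that accounts for all mappings is the owner. */
	for (int i = 0; i < 2; i++) {
		if (f->mm_id[i] != MODEL_MM_ID_NONE &&
		    f->mm_mapcount[i] == f->large_mapcount)
			return false;
	}
	return true;
}
```

In this model, a slot's mapcount never exceeds the number of mappings its MM really holds, so the check can report "maybe mapped shared" for a folio that an untracked MM maps exclusively, but it cannot report "mapped exclusively" for a folio that is actually shared.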
6 Future Work
=============

6.1 Large mapcount
------------------

It would be very handy if the large mapcount would count how often folio pages are actually mapped into page tables: a PMD on x86-64 would count 512 times. Calculating the average per-page mapcount will be easy, and remapping (PMD->PTE) folios would get even faster.

That would also remove the need for the entire mapcount (except for PMD-sized folios for memory statistics reasons ...), and allow for mapping folios larger than PMDs (e.g., 4 MiB) easily.

We likely would also have to take the same number of folio references to make our folio_mapcount() == folio_ref_count() work, and we'd want to be able to avoid mapcount+refcount overflows: this could already become an issue with pte-mapped PUD-sized folios (fsdax).

One approach we discussed in the THP cabal meeting is (1) extending the mapcount for large folios to 64bit (at least on 64bit systems) and (2) keeping the refcount at 32bit, but (3) having exactly one reference if the mapcount != 0.

It should be doable, but there are some corner cases to consider on the unmap path; it is something that I will be looking into next.

6.2 hugetlb
-----------

I'd love to make use of the same tracking also for hugetlb.

The real problem is PMD table sharing: getting a page mapped by MM X and unmapped by MM Y will not work. With mshare, that problem should not exist (all mapping/unmapping will be routed through the mshare MM).

[1] https://lwn.net/Articles/974223/
[2] https://lore.kernel.org/linux-mm/a9922f58-8129-4f15-b160-e0ace581bcbe@redhat.com/T/
[3] https://lkml.kernel.org/r/20240829165627.2256514-1-david@redhat.com
[4] https://gitlab.com/davidhildenbrand/scratchspace/-/raw/main/pte-mapped-folio-benchmarks.c

This patch (of 20):

Let's factor it out into a simple helper function. This helper will also come in handy when working with code where we know that our folio is large.

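As a rough illustration of the factoring described here, the following userspace sketch splits the order lookup into a helper for folios known to be large and a wrapper that handles small folios. `struct folio_model`, its `large` flag and the assumption that the order lives in the low 8 bits of `flags_1` are simplifications for this example and do not reproduce the kernel's struct folio.

```c
#include <stdbool.h>

/* Hypothetical, simplified stand-in for struct folio. */
struct folio_model {
	bool large;            /* models a "this folio is large" test */
	unsigned long flags_1; /* assumed: low 8 bits hold the order */
};

/* Order of a folio the caller already knows to be large. */
static inline unsigned int folio_large_order(const struct folio_model *folio)
{
	return folio->flags_1 & 0xff;
}

/* Order of any folio: small folios always have order 0. */
static inline unsigned int folio_order(const struct folio_model *folio)
{
	if (!folio->large)
		return 0;
	return folio_large_order(folio);
}
```

Callers that have already established that the folio is large can use folio_large_order() directly and skip the extra check.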
Maybe in the future we'll have the order readily available for small and large folios; in that case, folio_large_order() would simply translate to folio_order().

Link: https://lkml.kernel.org/r/20250303163014.1128035-1-david@redhat.com
Link: https://lkml.kernel.org/r/20250303163014.1128035-2-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Lance Yang <ioworker0@gmail.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jann Horn <jannh@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michal Koutný <mkoutny@suse.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Zefan Li <lizefan.x@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
(cherry picked from commit 6220ea5)
Signed-off-by: David Hildenbrand <david@redhat.com>
…ge_order() Patch series "mm: MM owner tracking for large folios (!hugetlb) + CONFIG_NO_PAGE_MAPCOUNT", v3. Let's add an "easy" way to decide -- without false positives, without page-mapcounts and without page table/rmap scanning -- whether a large folio is "certainly mapped exclusively" into a single MM, or whether it "maybe mapped shared" into multiple MMs. Use that information to implement Copy-on-Write reuse, to convert folio_likely_mapped_shared() to folio_maybe_mapped_share(), and to introduce a kernel config option that lets us not use+maintain per-page mapcounts in large folios anymore. The bigger picture was presented at LSF/MM [1]. This series is effectively a follow-up on my early work [2], which implemented a more precise, but also more complicated, way to identify whether a large folio is "mapped shared" into multiple MMs or "mapped exclusively" into a single MM. 1 Patch Organization ==================== Patch #1 -> torvalds#6: make more room in order-1 folios, so we have two "unsigned long" available for our purposes Patch torvalds#7 -> torvalds#11: preparations Patch torvalds#12: MM owner tracking for large folios Patch torvalds#13: COW reuse for PTE-mapped anon THP Patch torvalds#14: folio_maybe_mapped_shared() Patch torvalds#15 -> torvalds#20: introduce and implement CONFIG_NO_PAGE_MAPCOUNT 2 MM owner tracking =================== We assign each MM a unique ID ("MM ID"), to be able to squeeze more information in our folios. On 32bit we use 15-bit IDs, on 64bit we use 31-bit IDs. For each large folios, we now store two MM-ID+mapcount ("slot") combinations: * mm0_id + mm0_mapcount * mm1_id + mm1_mapcount On 32bit, we use a 16-bit per-MM mapcount, on 64bit an ordinary 32bit mapcount. This way, we require 2x "unsigned long" on 32bit and 64bit for both slots. 
Paired with the large mapcount, we can reliably identify whether one of these MMs is the current owner (-> owns all mappings) or even holds all folio references (-> owns all mappings, and all references are from mappings). As long as only two MMs map folio pages at a time, we can reliably and precisely identify whether a large folio is "mapped shared" or "mapped exclusively". Any additional MM that starts mapping the folio while there are no free slots becomes an "untracked MM". If one such "untracked MM" is the last one mapping a folio exclusively, we will not detect the folio as "mapped exclusively" but instead as "maybe mapped shared". (exception: only a single mapping remains) So that's where the approach gets imprecise. For now, we use a bit-spinlock to sync the large mapcount + slots, and make sure we do keep the machinery fast, to not degrade (un)map performance drastically: for example, we make sure to only use a single atomic (when grabbing the bit-spinlock), like we would already perform when updating the large mapcount. 3 CONFIG_NO_PAGE_MAPCOUNT ========================= patch torvalds#15 -> torvalds#20 spell out and document what exactly is affected when not maintaining the per-page mapcounts in large folios anymore. Most importantly, as we cannot maintain folio->_nr_pages_mapped anymore when (un)mapping pages, we'll account a complete folio as mapped if a single page is mapped. In addition, we'll not detect partially mapped anonymous folios as such in all cases yet. Likely less relevant changes include that we might now under-estimate the USS (Unique Set Size) of a process, but never over-estimate it. The goal is to make CONFIG_NO_PAGE_MAPCOUNT the default at some point, to then slowly make it the only option, as we learn about real-life impacts and possible ways to mitigate them. 4 Performance ============= Detailed performance numbers were included in v1 [3], and not that much changed between v1 and v2. 
I did plenty of measurements on different systems in the meantime, that all revealed slightly different results. The pte-mapped-folio micro-benchmarks [4] are fairly sensitive to code layout changes on some systems. Especially the fork() benchmark started being more-shaky-than-before on recent kernels for some reason. In summary, with my micro-benchmarks: * Small folios are not impacted. * CoW performance seems to be mostly unchanged across all folios sizes. * CoW reuse performance of large folios now matches CoW reuse performance of small folios, because we now actually implement the CoW reuse optimization. On an Intel Xeon Silver 4210R I measured a ~65% reduction in runtime, on an arm64 system I measured ~54% reduction. * munmap() performance improves with CONFIG_NO_PAGE_MAPCOUNT. I saw double-digit % reduction (up to ~30% on an Intel Xeon Silver 4210R and up to ~70% on an AmpereOne A192-32X) with larger folios. The larger the folios, the larger the performance improvement. * munmao() performance very slightly (couple percent) degrades without CONFIG_NO_PAGE_MAPCOUNT for smaller folios. For larger folios, there seems to be no change at all. * fork() performance improves with CONFIG_NO_PAGE_MAPCOUNT. I saw double-digit % reduction (up to ~20% on an Intel Xeon Silver 4210R and up to ~10% on an AmpereOne A192-32X) with larger folios. The larger the folios, the larger the performance improvement. * While fork() performance without CONFIG_NO_PAGE_MAPCOUNT seems to be almost unchanged on some systems, I saw some degradation for smaller folios on the AmpereOne A192-32X. I did not investigate the details yet, but I suspect code layout changes or suboptimal code placement / inlining. I'm not to worried about the fork() micro-benchmarks for smaller folios given how shaky the results are lately and by how much we improved fork() performance recently. 
I also ran case-anon-cow-rand and case-anon-cow-seq, which are part of vm-scalability, to assess the scalability and the impact of the bit-spinlock. My measurements on a 2-socket, 10-core Intel Xeon Silver 4210R system revealed no significant changes.

Similarly, running these benchmarks with 2 MiB THPs enabled on the AmpereOne A192-32X with 192 cores, I got < 1% difference with < 1% stdev, which is nice.

So far, I did not get my hands on a similarly large system with multiple sockets.

I found no other fitting scalability benchmarks that seem to really hammer on concurrent mapping/unmapping of large folio pages like case-anon-cow-seq does.

5 Concerns
==========

5.1 Bit spinlock
----------------

I'm not quite happy about the bit-spinlock, but so far it does not seem to affect scalability in my measurements.

If it ever becomes a problem we could either investigate improving the locking, or simply stop the MM tracking once there are "too many mappings" and assume that the folio is "mapped shared" until it is freed.

This would be similar (but slightly different) to the "0,1,2,stopped" counting idea Willy had at some point. Adding that logic to "stop tracking" adds more code to the hot path, so I avoided that for now.

5.2 folio_maybe_mapped_shared()
-------------------------------

I documented the change from folio_likely_mapped_shared() to folio_maybe_mapped_shared() quite extensively. If we run into surprises, I have some ideas on how to resolve them. For now, I think we should be fine.

5.3 Added code to map/unmap hot path
------------------------------------

So far, it looks like the added code on the rmap hot path does not really seem to matter much in the bigger picture. I'd like to further reduce it (and possibly improve fork() performance further), but I don't easily see how right now. Well, and I am out of puff 🙂

That said, alternatives I considered (e.g., per-MM per-folio mapcount) would add a lot more overhead to these hot paths.
6 Future Work
=============

6.1 Large mapcount
------------------

It would be very handy if the large mapcount would count how often folio pages are actually mapped into page tables: a PMD on x86-64 would count 512 times. Calculating the average per-page mapcount will be easy, and remapping (PMD->PTE) folios would get even faster.

That would also remove the need for the entire mapcount (except for PMD-sized folios for memory statistics reasons ...), and allow for mapping folios larger than PMDs (e.g., 4 MiB) easily.

We likely would also have to take the same number of folio references to make our folio_mapcount() == folio_ref_count() work, and we'd want to be able to avoid mapcount+refcount overflows: this could already become an issue with pte-mapped PUD-sized folios (fsdax).

One approach we discussed in the THP cabal meeting is (1) extending the mapcount for large folios to 64bit (at least on 64bit systems) and (2) keeping the refcount at 32bit, but (3) having exactly one reference if the mapcount != 0.

It should be doable, but there are some corner cases to consider on the unmap path; it is something that I will be looking into next.

6.2 hugetlb
-----------

I'd love to make use of the same tracking also for hugetlb.

The real problem is PMD table sharing: getting a page mapped by MM X and unmapped by MM Y will not work. With mshare, that problem should not exist (all mapping/unmapping will be routed through the mshare MM).

[1] https://lwn.net/Articles/974223/
[2] https://lore.kernel.org/linux-mm/a9922f58-8129-4f15-b160-e0ace581bcbe@redhat.com/T/
[3] https://lkml.kernel.org/r/20240829165627.2256514-1-david@redhat.com
[4] https://gitlab.com/davidhildenbrand/scratchspace/-/raw/main/pte-mapped-folio-benchmarks.c

This patch (of 20):

Let's factor it out into a simple helper function. This helper will also come in handy when working with code where we know that our folio is large.
Maybe in the future we'll have the order readily available for small and large folios; in that case, folio_large_order() would simply translate to folio_order().

Link: https://lkml.kernel.org/r/20250303163014.1128035-1-david@redhat.com
Link: https://lkml.kernel.org/r/20250303163014.1128035-2-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Lance Yang <ioworker0@gmail.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Andy Lutomirks^H^Hski <luto@kernel.org>
Cc: Borislav Betkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jann Horn <jannh@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Matthew Wilcow (Oracle) <willy@infradead.org>
Cc: Michal Koutn <mkoutny@suse.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: tejun heo <tj@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Zefan Li <lizefan.x@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
(cherry picked from commit 6220ea5)
Signed-off-by: David Hildenbrand <david@redhat.com>
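The two-slot MM tracking described in the cover letter above can be illustrated with a small userspace model. This is only a sketch under stated assumptions: `struct folio_model`, `MM_ID_NONE`, and the `model_*` functions are hypothetical names invented here, MM IDs are assumed to be non-zero integers, and the real kernel code keeps a cached flag under a bit-spinlock rather than recomputing the answer as done below.

```c
#include <stdbool.h>

#define MM_ID_NONE 0 /* hypothetical marker for a free slot */

/* Hypothetical model of a large folio's mapcount plus two MM slots. */
struct folio_model {
    int large_mapcount; /* page mappings summed over all MMs */
    int mm_id[2];       /* the two tracked MMs, or MM_ID_NONE */
    int mm_mapcount[2]; /* mappings contributed by each tracked MM */
};

static int find_slot(const struct folio_model *f, int mm_id)
{
    for (int i = 0; i < 2; i++)
        if (f->mm_id[i] == mm_id)
            return i;
    return -1;
}

/* MM mm_id maps nr pages of the folio. */
void model_map(struct folio_model *f, int mm_id, int nr)
{
    int slot = find_slot(f, mm_id);

    f->large_mapcount += nr;
    if (slot < 0)
        slot = find_slot(f, MM_ID_NONE); /* try to claim a free slot */
    if (slot < 0)
        return; /* no free slot: this MM stays an "untracked MM" */
    f->mm_id[slot] = mm_id;
    f->mm_mapcount[slot] += nr;
}

/* MM mm_id unmaps nr pages of the folio. */
void model_unmap(struct folio_model *f, int mm_id, int nr)
{
    int slot = find_slot(f, mm_id);

    f->large_mapcount -= nr;
    if (slot < 0)
        return; /* untracked MM: only the large mapcount changes */
    f->mm_mapcount[slot] -= nr;
    if (f->mm_mapcount[slot] <= 0) { /* nothing tracked anymore: free the slot */
        f->mm_id[slot] = MM_ID_NONE;
        f->mm_mapcount[slot] = 0;
    }
}

/* false: certainly mapped exclusively; true: maybe mapped shared. */
bool model_maybe_mapped_shared(const struct folio_model *f)
{
    if (f->large_mapcount <= 1)
        return false; /* at most a single mapping remains */
    for (int i = 0; i < 2; i++)
        if (f->mm_id[i] != MM_ID_NONE &&
            f->mm_mapcount[i] == f->large_mapcount)
            return false; /* one tracked MM owns all mappings */
    return true;
}
```

With three MMs mapping the folio, the third one finds no free slot. Once the two tracked MMs unmap everything, the model still answers "maybe mapped shared" although only the untracked MM is left, which is the imprecision the cover letter describes; the answer becomes exact again once only a single mapping remains.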
- treat tailcall count as 32-bit for access and update
- change out_offset scope from file to function
- minor format/structure changes for consistency

Testing: (skipping fentry, fexit, freplace)
========

root@qemu-armhf:/usr/libexec/kselftests-bpf# modprobe test_bpf test_suite=test_tail_calls
test_bpf: #0 Tail call leaf jited:1 967 PASS
test_bpf: #1 Tail call 2 jited:1 1427 PASS
test_bpf: #2 Tail call 3 jited:1 2373 PASS
test_bpf: #3 Tail call 4 jited:1 2304 PASS
test_bpf: #4 Tail call load/store leaf jited:1 1684 PASS
test_bpf: #5 Tail call load/store jited:1 2249 PASS
test_bpf: #6 Tail call error path, max count reached jited:1 22538 PASS
test_bpf: #7 Tail call count preserved across function calls jited:1 1055668 PASS
test_bpf: #8 Tail call error path, NULL target jited:1 513 PASS
test_bpf: #9 Tail call error path, index out of range jited:1 392 PASS
test_bpf: test_tail_calls: Summary: 10 PASSED, 0 FAILED, [10/10 JIT'ed]

root@qemu-armhf:/usr/libexec/kselftests-bpf# ./test_progs -n 397/1-12,17-18,23-24,27-31
397/1 tailcalls/tailcall_1:OK
397/2 tailcalls/tailcall_2:OK
397/3 tailcalls/tailcall_3:OK
397/4 tailcalls/tailcall_4:OK
397/5 tailcalls/tailcall_5:OK
397/6 tailcalls/tailcall_6:OK
397/7 tailcalls/tailcall_bpf2bpf_1:OK
397/8 tailcalls/tailcall_bpf2bpf_2:OK
397/9 tailcalls/tailcall_bpf2bpf_3:OK
397/10 tailcalls/tailcall_bpf2bpf_4:OK
397/11 tailcalls/tailcall_bpf2bpf_5:OK
397/12 tailcalls/tailcall_bpf2bpf_6:OK
397/17 tailcalls/tailcall_poke:OK
397/18 tailcalls/tailcall_bpf2bpf_hierarchy_1:OK
397/23 tailcalls/tailcall_bpf2bpf_hierarchy_2:OK
397/24 tailcalls/tailcall_bpf2bpf_hierarchy_3:OK
397/27 tailcalls/tailcall_failure:OK
397/28 tailcalls/reject_tail_call_spin_lock:OK
397/29 tailcalls/reject_tail_call_rcu_lock:OK
397/30 tailcalls/reject_tail_call_preempt_lock:OK
397/31 tailcalls/reject_tail_call_ref:OK
397 tailcalls:OK
Summary: 1/21 PASSED, 0 SKIPPED, 0 FAILED

Signed-off-by: Tony Ambardar <tony.ambardar@gmail.com>
Before disabling SR-IOV via config space accesses to the parent PF, sriov_disable() first removes the PCI devices representing the VFs. Since commit 9d16947 ("PCI: Add global pci_lock_rescan_remove()") such removal operations are serialized against concurrent remove and rescan using the pci_rescan_remove_lock. No such locking was ever added in sriov_disable() however. In particular when commit 18f9e9d ("PCI/IOV: Factor out sriov_add_vfs()") factored out the PCI device removal into sriov_del_vfs() there was still no locking around the pci_iov_remove_virtfn() calls. On s390 the lack of serialization in sriov_disable() may cause double remove and list corruption with the below (amended) trace being observed: PSW: 0704c00180000000 0000000c914e4b38 (klist_put+56) GPRS: 000003800313fb48 0000000000000000 0000000100000001 0000000000000001 00000000f9b520a8 0000000000000000 0000000000002fbd 00000000f4cc9480 0000000000000001 0000000000000000 0000000000000000 0000000180692828 00000000818e8000 000003800313fe2c 000003800313fb20 000003800313fad8 #0 [3800313fb20] device_del at c9158ad5c #1 [3800313fb88] pci_remove_bus_device at c915105ba #2 [3800313fbd0] pci_iov_remove_virtfn at c9152f198 #3 [3800313fc28] zpci_iov_remove_virtfn at c90fb67c0 #4 [3800313fc60] zpci_bus_remove_device at c90fb6104 #5 [3800313fca0] __zpci_event_availability at c90fb3dca #6 [3800313fd08] chsc_process_sei_nt0 at c918fe4a2 #7 [3800313fd60] crw_collect_info at c91905822 #8 [3800313fe10] kthread at c90feb390 #9 [3800313fe68] __ret_from_fork at c90f6aa64 #10 [3800313fe98] ret_from_fork at c9194f3f2 This is because in addition to sriov_disable() removing the VFs, the platform also generates hot-unplug events for the VFs. This is the reverse operation to the hotplug events generated by sriov_enable() and handled via pdev->no_vf_scan.
And while the event processing takes pci_rescan_remove_lock and checks whether the struct pci_dev still exists, the lack of synchronization makes this checking racy. Other races may also be possible, of course, though given that this lack of locking persisted for so long, observable races seem very rare. Even on s390 the list corruption was only observed with certain devices since the platform events are only triggered by config accesses after the removal, so as long as the removal finished synchronously they would not race. Either way the locking is missing, so fix this by adding it to the sriov_del_vfs() helper. Likewise, PCI rescan-remove locking is also missing in sriov_add_vfs(), including for the error case where pci_stop_and_remove_bus_device() is called without the PCI rescan-remove lock being held. Even in the non-error case, adding new PCI devices and buses should be serialized via the PCI rescan-remove lock. Add the necessary locking. Fixes: 18f9e9d ("PCI/IOV: Factor out sriov_add_vfs()") Signed-off-by: Niklas Schnelle <schnelle@linux.ibm.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Reviewed-by: Benjamin Block <bblock@linux.ibm.com> Reviewed-by: Farhan Ali <alifm@linux.ibm.com> Reviewed-by: Julian Ruess <julianr@linux.ibm.com> Cc: stable@vger.kernel.org Link: https://patch.msgid.link/20250826-pci_fix_sriov_disable-v1-1-2d0bc938f2a3@linux.ibm.com
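The serialization argument in the message above can be sketched as a userspace analogue. This is not the kernel patch: a pthread mutex stands in for the PCI rescan-remove lock, and `struct vf_model`, `model_sriov_del_vfs()` and `model_hot_unplug_event()` are hypothetical names used only for illustration. The point it shows is that the "does the device still exist" check is only reliable when every remover holds the same lock.

```c
#include <pthread.h>
#include <stdbool.h>

/* Stand-in for the PCI rescan-remove lock (assumption: one global mutex). */
static pthread_mutex_t rescan_remove_lock = PTHREAD_MUTEX_INITIALIZER;

/* Hypothetical model of one VF device. */
struct vf_model {
    bool present;     /* the device is still registered */
    int remove_count; /* how often the removal actually ran */
};

/* Caller must hold rescan_remove_lock, so the check cannot race. */
static void vf_remove_locked(struct vf_model *vf)
{
    if (vf->present) {
        vf->present = false;
        vf->remove_count++;
    }
}

/* Models the fixed VF removal helper: the whole loop runs under the lock. */
void model_sriov_del_vfs(struct vf_model *vfs, int nr)
{
    pthread_mutex_lock(&rescan_remove_lock);
    for (int i = 0; i < nr; i++)
        vf_remove_locked(&vfs[i]);
    pthread_mutex_unlock(&rescan_remove_lock);
}

/* Models the platform hot-unplug event handler for a single VF. */
void model_hot_unplug_event(struct vf_model *vf)
{
    pthread_mutex_lock(&rescan_remove_lock);
    vf_remove_locked(vf);
    pthread_mutex_unlock(&rescan_remove_lock);
}
```

Whichever path gets to a VF first removes it, and the other path sees `present == false` and does nothing, so a device is never removed twice regardless of ordering.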
If clk_core_populate_parent_map() fails, core->parents will be immediately released within clk_core_populate_parent_map(). Therefore it can't be released in __clk_release() again. This fixes the following KASAN reported issue: ================================================================== BUG: KASAN: slab-use-after-free in __clk_release+0x80/0x160 Read of size 8 at addr ffffff8043fd0980 by task kworker/u6:0/27 CPU: 1 PID: 27 Comm: kworker/u6:0 Tainted: G W 6.6.69-yocto-standard+ #7 Hardware name: Raspberry Pi 4 Model B (DT) Workqueue: events_unbound deferred_probe_work_func Call trace: dump_backtrace+0x98/0xf8 show_stack+0x20/0x38 dump_stack_lvl+0x48/0x60 print_report+0xf8/0x5d8 kasan_report+0xb4/0x100 __asan_load8+0x9c/0xc0 __clk_release+0x80/0x160 __clk_register+0x6dc/0xfb8 devm_clk_hw_register+0x70/0x108 bcm2835_register_clock+0x284/0x358 bcm2835_clk_probe+0x2c4/0x438 platform_probe+0x98/0x110 really_probe+0x1e4/0x3e8 __driver_probe_device+0xc0/0x1a0 driver_probe_device+0x110/0x1e8 __device_attach_driver+0xf0/0x1a8 bus_for_each_drv+0xf8/0x178 __device_attach+0x120/0x240 device_initial_probe+0x1c/0x30 bus_probe_device+0xdc/0xe8 deferred_probe_work_func+0xe8/0x130 process_one_work+0x2a4/0x698 worker_thread+0x53c/0x708 kthread+0x1b4/0x1c8 ret_from_fork+0x10/0x20 Allocated by task 27: kasan_save_stack+0x3c/0x68 kasan_set_track+0x2c/0x40 kasan_save_alloc_info+0x24/0x38 __kasan_kmalloc+0xd4/0xd8 __kmalloc+0x74/0x238 __clk_register+0x718/0xfb8 devm_clk_hw_register+0x70/0x108 bcm2835_register_clock+0x284/0x358 bcm2835_clk_probe+0x2c4/0x438 platform_probe+0x98/0x110 really_probe+0x1e4/0x3e8 __driver_probe_device+0xc0/0x1a0 driver_probe_device+0x110/0x1e8 __device_attach_driver+0xf0/0x1a8 bus_for_each_drv+0xf8/0x178 __device_attach+0x120/0x240 device_initial_probe+0x1c/0x30 bus_probe_device+0xdc/0xe8 deferred_probe_work_func+0xe8/0x130 process_one_work+0x2a4/0x698 worker_thread+0x53c/0x708 kthread+0x1b4/0x1c8 ret_from_fork+0x10/0x20 Freed by task 27: 
kasan_save_stack+0x3c/0x68 kasan_set_track+0x2c/0x40 kasan_save_free_info+0x38/0x60 __kasan_slab_free+0x100/0x170 slab_free_freelist_hook+0xcc/0x218 __kmem_cache_free+0x158/0x210 kfree+0x88/0x140 __clk_register+0x9d0/0xfb8 devm_clk_hw_register+0x70/0x108 bcm2835_register_clock+0x284/0x358 bcm2835_clk_probe+0x2c4/0x438 platform_probe+0x98/0x110 really_probe+0x1e4/0x3e8 __driver_probe_device+0xc0/0x1a0 driver_probe_device+0x110/0x1e8 __device_attach_driver+0xf0/0x1a8 bus_for_each_drv+0xf8/0x178 __device_attach+0x120/0x240 device_initial_probe+0x1c/0x30 bus_probe_device+0xdc/0xe8 deferred_probe_work_func+0xe8/0x130 process_one_work+0x2a4/0x698 worker_thread+0x53c/0x708 kthread+0x1b4/0x1c8 ret_from_fork+0x10/0x20 The buggy address belongs to the object at ffffff8043fd0800 which belongs to the cache kmalloc-512 of size 512 The buggy address is located 384 bytes inside of freed 512-byte region [ffffff8043fd0800, ffffff8043fd0a00) The buggy address belongs to the physical page: page:fffffffe010ff400 refcount:1 mapcount:0 mapping:0000000000000000 index:0xffffff8043fd0e00 pfn:0x43fd0 head:fffffffe010ff400 order:3 entire_mapcount:0 nr_pages_mapped:0 pincount:0 flags: 0x4000000000000840(slab|head|zone=1) page_type: 0xffffffff() raw: 4000000000000840 ffffff8040002f40 ffffff8040000a50 ffffff8040000a50 raw: ffffff8043fd0e00 0000000000150002 00000001ffffffff 0000000000000000 page dumped because: kasan: bad access detected Memory state around the buggy address: ffffff8043fd0880: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ffffff8043fd0900: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb >ffffff8043fd0980: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ^ ffffff8043fd0a00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc ffffff8043fd0a80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc ================================================================== Fixes: 9d05ae5 ("clk: Initialize struct clk_core kref earlier") Signed-off-by: Yun Zhou <yun.zhou@windriver.com>
When s_start() fails to allocate memory for set_event_iter, it returns NULL before acquiring event_mutex. However, the corresponding s_stop() function always tries to unlock the mutex, causing a lock imbalance warning: WARNING: bad unlock balance detected! 6.17.0-rc7-00175-g2b2e0c04f78c #7 Not tainted ------------------------------------- syz.0.85611/376514 is trying to release lock (event_mutex) at: [<ffffffff8dafc7a4>] traverse.part.0.constprop.0+0x2c4/0x650 fs/seq_file.c:131 but there are no more locks to release! The issue was introduced by commit b355247 ("tracing: Cache ':mod:' events for modules not loaded yet") which added the kzalloc() allocation before the mutex lock, creating a path where s_start() could return without locking the mutex while s_stop() would still try to unlock it. Fix this by unconditionally acquiring the mutex immediately after allocation, regardless of whether the allocation succeeded. Fixes: b355247 ("tracing: Cache ":mod:" events for modules not loaded yet") Signed-off-by: Sasha Levin <sashal@kernel.org>
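The fix pattern described above (take the lock unconditionally after the allocation, because the stop callback always unlocks) can be sketched in userspace. This is a simplified model, not the tracing code: `model_start()`, `model_stop()`, `model_lock_depth()` and the `fail_alloc` switch are hypothetical, a pthread mutex stands in for event_mutex, and freeing the iterator in the stop callback is a simplification made for this sketch.

```c
#include <pthread.h>
#include <stdlib.h>

static pthread_mutex_t event_lock = PTHREAD_MUTEX_INITIALIZER;
static int lock_depth; /* lock/unlock balance, only touched under the lock */

/* Hypothetical iterator state. */
struct iter_model {
    int pos;
};

/* fail_alloc != 0 simulates the allocation failing. */
struct iter_model *model_start(int fail_alloc)
{
    struct iter_model *iter = fail_alloc ? NULL : calloc(1, sizeof(*iter));

    /*
     * Lock whether or not the allocation succeeded: the stop callback
     * runs in both cases and always unlocks.
     */
    pthread_mutex_lock(&event_lock);
    lock_depth++;

    return iter; /* NULL tells the caller there is nothing to iterate */
}

/* Always called after model_start(), even when it returned NULL. */
void model_stop(struct iter_model *iter)
{
    lock_depth--;
    pthread_mutex_unlock(&event_lock);
    free(iter); /* free(NULL) is a no-op */
}

/* Reports the current balance; 0 means every lock had its unlock. */
int model_lock_depth(void)
{
    return lock_depth;
}
```

Returning NULL before the `pthread_mutex_lock()` call would reproduce the imbalance from the warning: `model_stop()` would then unlock a mutex that was never taken.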
When s_start() fails to allocate memory for set_event_iter, it returns NULL before acquiring event_mutex. However, the corresponding s_stop() function always tries to unlock the mutex, causing a lock imbalance warning: WARNING: bad unlock balance detected! 6.17.0-rc7-00175-g2b2e0c04f78c #7 Not tainted ------------------------------------- syz.0.85611/376514 is trying to release lock (event_mutex) at: [<ffffffff8dafc7a4>] traverse.part.0.constprop.0+0x2c4/0x650 fs/seq_file.c:131 but there are no more locks to release! The issue was introduced by commit b355247 ("tracing: Cache ':mod:' events for modules not loaded yet") which added the kzalloc() allocation before the mutex lock, creating a path where s_start() could return without locking the mutex while s_stop() would still try to unlock it. Fix this by unconditionally acquiring the mutex immediately after allocation, regardless of whether the allocation succeeded. Link: https://lore.kernel.org/20250929113238.3722055-1-sashal@kernel.org Fixes: b355247 ("tracing: Cache ":mod:" events for modules not loaded yet") Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
The test starts a workload and then opens events. If the events fail to open, for example because of perf_event_paranoid, the gopipe of the workload is leaked and the file descriptor leak check fails when the test exits. To avoid this cancel the workload when opening the events fails. Before: ``` $ perf test -vv 7 7: PERF_RECORD_* events & perf_sample fields: --- start --- test child forked, pid 1189568 Using CPUID GenuineIntel-6-B7-1 ------------------------------------------------------------ perf_event_attr: type 0 (PERF_TYPE_HARDWARE) config 0xa00000000 (cpu_atom/PERF_COUNT_HW_CPU_CYCLES/) disabled 1 ------------------------------------------------------------ sys_perf_event_open: pid 0 cpu -1 group_fd -1 flags 0x8 sys_perf_event_open failed, error -13 ------------------------------------------------------------ perf_event_attr: type 0 (PERF_TYPE_HARDWARE) config 0xa00000000 (cpu_atom/PERF_COUNT_HW_CPU_CYCLES/) disabled 1 exclude_kernel 1 ------------------------------------------------------------ sys_perf_event_open: pid 0 cpu -1 group_fd -1 flags 0x8 = 3 ------------------------------------------------------------ perf_event_attr: type 0 (PERF_TYPE_HARDWARE) config 0x400000000 (cpu_core/PERF_COUNT_HW_CPU_CYCLES/) disabled 1 ------------------------------------------------------------ sys_perf_event_open: pid 0 cpu -1 group_fd -1 flags 0x8 sys_perf_event_open failed, error -13 ------------------------------------------------------------ perf_event_attr: type 0 (PERF_TYPE_HARDWARE) config 0x400000000 (cpu_core/PERF_COUNT_HW_CPU_CYCLES/) disabled 1 exclude_kernel 1 ------------------------------------------------------------ sys_perf_event_open: pid 0 cpu -1 group_fd -1 flags 0x8 = 3 Attempt to add: software/cpu-clock/ ..after resolving event: software/config=0/ cpu-clock -> software/cpu-clock/ ------------------------------------------------------------ perf_event_attr: type 1 (PERF_TYPE_SOFTWARE) size 136 config 0x9 (PERF_COUNT_SW_DUMMY) sample_type 
IP|TID|TIME|CPU read_format ID|LOST disabled 1 inherit 1 mmap 1 comm 1 enable_on_exec 1 task 1 sample_id_all 1 mmap2 1 comm_exec 1 ksymbol 1 bpf_event 1 { wakeup_events, wakeup_watermark } 1 ------------------------------------------------------------ sys_perf_event_open: pid 1189569 cpu 0 group_fd -1 flags 0x8 sys_perf_event_open failed, error -13 perf_evlist__open: Permission denied ---- end(-2) ---- Leak of file descriptor 6 that opened: 'pipe:[14200347]' ---- unexpected signal (6) ---- iFailed to read build ID for //anon Failed to read build ID for //anon Failed to read build ID for //anon Failed to read build ID for //anon Failed to read build ID for //anon Failed to read build ID for //anon Failed to read build ID for //anon Failed to read build ID for //anon Failed to read build ID for //anon Failed to read build ID for //anon Failed to read build ID for //anon Failed to read build ID for //anon Failed to read build ID for //anon Failed to read build ID for //anon Failed to read build ID for //anon Failed to read build ID for //anon Failed to read build ID for //anon Failed to read build ID for //anon Failed to read build ID for //anon Failed to read build ID for //anon Failed to read build ID for //anon Failed to read build ID for //anon Failed to read build ID for //anon Failed to read build ID for //anon Failed to read build ID for //anon Failed to read build ID for //anon Failed to read build ID for //anon Failed to read build ID for //anon Failed to read build ID for //anon Failed to read build ID for //anon Failed to read build ID for //anon Failed to read build ID for //anon #0 0x565358f6666e in child_test_sig_handler builtin-test.c:311 #1 0x7f29ce849df0 in __restore_rt libc_sigaction.c:0 #2 0x7f29ce89e95c in __pthread_kill_implementation pthread_kill.c:44 #3 0x7f29ce849cc2 in raise raise.c:27 #4 0x7f29ce8324ac in abort abort.c:81 #5 0x565358f662d4 in check_leaks builtin-test.c:226 torvalds#6 0x565358f6682e in run_test_child builtin-test.c:344 
#7 0x565358ef7121 in start_command run-command.c:128 #8 0x565358f67273 in start_test builtin-test.c:545 #9 0x565358f6771d in __cmd_test builtin-test.c:647 #10 0x565358f682bd in cmd_test builtin-test.c:849 #11 0x565358ee5ded in run_builtin perf.c:349 #12 0x565358ee6085 in handle_internal_command perf.c:401 #13 0x565358ee61de in run_argv perf.c:448 #14 0x565358ee6527 in main perf.c:555 #15 0x7f29ce833ca8 in __libc_start_call_main libc_start_call_main.h:74 #16 0x7f29ce833d65 in __libc_start_main@@GLIBC_2.34 libc-start.c:128 #17 0x565358e391c1 in _start perf[851c1] 7: PERF_RECORD_* events & perf_sample fields : FAILED! ``` After: ``` $ perf test 7 7: PERF_RECORD_* events & perf_sample fields : Skip (permissions) ``` Fixes: 16d00fe ("perf tests: Move test__PERF_RECORD into separate object") Signed-off-by: Ian Rogers <irogers@google.com> Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Athira Rajeev <atrajeev@linux.ibm.com> Cc: Chun-Tse Shao <ctshao@google.com> Cc: Howard Chu <howardchu95@gmail.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: James Clark <james.clark@linaro.org> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Kan Liang <kan.liang@linux.intel.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
If clk_core_populate_parent_map() fails, core->parents will be immediately released within clk_core_populate_parent_map(). Therefore it can't be released in __clk_release() again. This fixes the following KASAN reported issue: ================================================================== BUG: KASAN: slab-use-after-free in __clk_release+0x80/0x160 Read of size 8 at addr ffffff8043fd0980 by task kworker/u6:0/27 CPU: 1 PID: 27 Comm: kworker/u6:0 Tainted: G W 6.6.69-yocto-standard+ #7 Hardware name: Raspberry Pi 4 Model B (DT) Workqueue: events_unbound deferred_probe_work_func Call trace: dump_backtrace+0x98/0xf8 show_stack+0x20/0x38 dump_stack_lvl+0x48/0x60 print_report+0xf8/0x5d8 kasan_report+0xb4/0x100 __asan_load8+0x9c/0xc0 __clk_release+0x80/0x160 __clk_register+0x6dc/0xfb8 devm_clk_hw_register+0x70/0x108 bcm2835_register_clock+0x284/0x358 bcm2835_clk_probe+0x2c4/0x438 platform_probe+0x98/0x110 really_probe+0x1e4/0x3e8 __driver_probe_device+0xc0/0x1a0 driver_probe_device+0x110/0x1e8 __device_attach_driver+0xf0/0x1a8 bus_for_each_drv+0xf8/0x178 __device_attach+0x120/0x240 device_initial_probe+0x1c/0x30 bus_probe_device+0xdc/0xe8 deferred_probe_work_func+0xe8/0x130 process_one_work+0x2a4/0x698 worker_thread+0x53c/0x708 kthread+0x1b4/0x1c8 ret_from_fork+0x10/0x20 Allocated by task 27: kasan_save_stack+0x3c/0x68 kasan_set_track+0x2c/0x40 kasan_save_alloc_info+0x24/0x38 __kasan_kmalloc+0xd4/0xd8 __kmalloc+0x74/0x238 __clk_register+0x718/0xfb8 devm_clk_hw_register+0x70/0x108 bcm2835_register_clock+0x284/0x358 bcm2835_clk_probe+0x2c4/0x438 platform_probe+0x98/0x110 really_probe+0x1e4/0x3e8 __driver_probe_device+0xc0/0x1a0 driver_probe_device+0x110/0x1e8 __device_attach_driver+0xf0/0x1a8 bus_for_each_drv+0xf8/0x178 __device_attach+0x120/0x240 device_initial_probe+0x1c/0x30 bus_probe_device+0xdc/0xe8 deferred_probe_work_func+0xe8/0x130 process_one_work+0x2a4/0x698 worker_thread+0x53c/0x708 kthread+0x1b4/0x1c8 ret_from_fork+0x10/0x20 Freed by task 27: 
kasan_save_stack+0x3c/0x68 kasan_set_track+0x2c/0x40 kasan_save_free_info+0x38/0x60 __kasan_slab_free+0x100/0x170 slab_free_freelist_hook+0xcc/0x218 __kmem_cache_free+0x158/0x210 kfree+0x88/0x140 __clk_register+0x9d0/0xfb8 devm_clk_hw_register+0x70/0x108 bcm2835_register_clock+0x284/0x358 bcm2835_clk_probe+0x2c4/0x438 platform_probe+0x98/0x110 really_probe+0x1e4/0x3e8 __driver_probe_device+0xc0/0x1a0 driver_probe_device+0x110/0x1e8 __device_attach_driver+0xf0/0x1a8 bus_for_each_drv+0xf8/0x178 __device_attach+0x120/0x240 device_initial_probe+0x1c/0x30 bus_probe_device+0xdc/0xe8 deferred_probe_work_func+0xe8/0x130 process_one_work+0x2a4/0x698 worker_thread+0x53c/0x708 kthread+0x1b4/0x1c8 ret_from_fork+0x10/0x20 The buggy address belongs to the object at ffffff8043fd0800 which belongs to the cache kmalloc-512 of size 512 The buggy address is located 384 bytes inside of freed 512-byte region [ffffff8043fd0800, ffffff8043fd0a00) The buggy address belongs to the physical page: page:fffffffe010ff400 refcount:1 mapcount:0 mapping:0000000000000000 index:0xffffff8043fd0e00 pfn:0x43fd0 head:fffffffe010ff400 order:3 entire_mapcount:0 nr_pages_mapped:0 pincount:0 flags: 0x4000000000000840(slab|head|zone=1) page_type: 0xffffffff() raw: 4000000000000840 ffffff8040002f40 ffffff8040000a50 ffffff8040000a50 raw: ffffff8043fd0e00 0000000000150002 00000001ffffffff 0000000000000000 page dumped because: kasan: bad access detected Memory state around the buggy address: ffffff8043fd0880: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ffffff8043fd0900: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb >ffffff8043fd0980: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ^ ffffff8043fd0a00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc ffffff8043fd0a80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc ================================================================== Fixes: fc0c209 ("clk: Allow parents to be specified without string names") Signed-off-by: Yun Zhou <yun.zhou@windriver.com> 
Reviewed-by: Brian Masney <bmasney@redhat.com>
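The ownership problem behind this report can be illustrated with a small userspace model: a helper that frees its own allocation on failure must leave nothing behind that a later release path would touch again. This sketch shows one common way to get that property (clearing the pointer) and is not claimed to be what the patch itself does; `struct core_model`, `model_populate_parent_map()`, `model_release()` and the `fail` switch are hypothetical.

```c
#include <stdlib.h>

/* Hypothetical stand-ins for the parent map and its owner. */
struct parent_model {
    char *name;
};

struct core_model {
    struct parent_model *parents;
    int num_parents;
};

/* fail != 0 simulates an error after the array was allocated. */
int model_populate_parent_map(struct core_model *core, int num, int fail)
{
    core->num_parents = num;
    core->parents = calloc(num, sizeof(*core->parents));
    if (!core->parents)
        return -1;

    if (fail) {
        /* The helper cleans up after itself ... */
        free(core->parents);
        /* ... and leaves no dangling pointer for the release path. */
        core->parents = NULL;
        return -1;
    }
    return 0;
}

/* Release path: must tolerate a map that was already torn down. */
void model_release(struct core_model *core)
{
    if (!core->parents)
        return; /* nothing left to free */

    for (int i = 0; i < core->num_parents; i++)
        free(core->parents[i].name);
    free(core->parents);
    core->parents = NULL;
}
```

Without the `core->parents = NULL` assignment in the failure branch, `model_release()` would walk and free an array that is already gone, which is the userspace equivalent of the use-after-free shown in the KASAN report.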
When s_start() fails to allocate memory for set_event_iter, it returns NULL before acquiring event_mutex. However, the corresponding s_stop() function always tries to unlock the mutex, causing a lock imbalance warning: WARNING: bad unlock balance detected! 6.17.0-rc7-00175-g2b2e0c04f78c #7 Not tainted ------------------------------------- syz.0.85611/376514 is trying to release lock (event_mutex) at: [<ffffffff8dafc7a4>] traverse.part.0.constprop.0+0x2c4/0x650 fs/seq_file.c:131 but there are no more locks to release! The issue was introduced by commit b355247 ("tracing: Cache ':mod:' events for modules not loaded yet") which added the kzalloc() allocation before the mutex lock, creating a path where s_start() could return without locking the mutex while s_stop() would still try to unlock it. Fix this by unconditionally acquiring the mutex immediately after allocation, regardless of whether the allocation succeeded. Cc: stable@vger.kernel.org Link: https://lore.kernel.org/20250929113238.3722055-1-sashal@kernel.org Fixes: b355247 ("tracing: Cache ":mod:" events for modules not loaded yet") Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
The following lockdep splat was observed while the kernel auto-onlined a CXL memory region: [ 51.926183] ====================================================== [ 51.933441] WARNING: possible circular locking dependency detected [ 51.940701] 6.17.0djtest+ #53 Tainted: G W [ 51.947290] ------------------------------------------------------ [ 51.954553] systemd-udevd/3334 is trying to acquire lock: [ 51.960938] ffffffff90346188 (hmem_resource_lock){+.+.}-{4:4}, at: hmem_register_resource+0x31/0x50 [ 51.971429] but task is already holding lock: [ 51.978548] ffffffff90338890 ((node_chain).rwsem){++++}-{4:4}, at: blocking_notifier_call_chain+0x2e/0x70 [ 51.989621] which lock already depends on the new lock. [ 51.999605] the existing dependency chain (in reverse order) is: [ 52.008539] -> #6 ((node_chain).rwsem){++++}-{4:4}: [ 52.016195] down_read+0x45/0x190 [ 52.020789] blocking_notifier_call_chain+0x2e/0x70 [ 52.027131] node_notify+0x1f/0x30 [ 52.031809] online_pages+0xc1/0x330 [ 52.036684] memory_subsys_online+0x22a/0x280 [ 52.042431] device_online+0x50/0x90 [ 52.047298] state_store+0x9b/0xa0 [ 52.051956] dev_attr_store+0x18/0x30 [ 52.056907] sysfs_kf_write+0x4e/0x70 [ 52.061854] kernfs_fop_write_iter+0x187/0x260 [ 52.067673] vfs_write+0x21f/0x590 [ 52.072313] ksys_write+0x73/0xf0 [ 52.076854] __x64_sys_write+0x1d/0x30 [ 52.081874] x64_sys_call+0x7d/0x1d80 [ 52.086797] do_syscall_64+0x6c/0x2f0 [ 52.091717] entry_SYSCALL_64_after_hwframe+0x76/0x7e [ 52.098198] -> #5 (mem_hotplug_lock){++++}-{0:0}: [ 52.105512] percpu_down_write+0x4b/0x260 [ 52.110825] try_online_node+0x21/0x50 [ 52.115844] cpu_up+0x43/0xd0 [ 52.119989] cpuhp_bringup_mask+0x60/0xa0 [ 52.125305] bringup_nonboot_cpus+0x76/0x110 [ 52.130912] smp_init+0x2e/0x90 [ 52.135235] kernel_init_freeable+0x19a/0x300 [ 52.140930] kernel_init+0x1e/0x140 [ 52.145635] ret_from_fork+0x159/0x200 [ 52.150633] ret_from_fork_asm+0x1a/0x30 [ 52.155826] -> #4 (cpu_hotplug_lock){++++}-{0:0}: [ 52.163081] 
__cpuhp_state_add_instance+0x51/0x200 [ 52.169238] iova_domain_init_rcaches+0x1ed/0x200 [ 52.175301] iommu_setup_dma_ops+0x1b4/0x500 [ 52.180877] bus_iommu_probe+0xd2/0x180 [ 52.185954] iommu_device_register+0x9f/0xe0 [ 52.191530] intel_iommu_init+0xd3b/0xf20 [ 52.196810] pci_iommu_init+0x16/0x40 [ 52.201695] do_one_initcall+0x5c/0x2d0 [ 52.206767] kernel_init_freeable+0x281/0x300 [ 52.212432] kernel_init+0x1e/0x140 [ 52.217109] ret_from_fork+0x159/0x200 [ 52.222082] ret_from_fork_asm+0x1a/0x30 [ 52.227253] -> #3 (&group->mutex){+.+.}-{4:4}: [ 52.234196] __mutex_lock+0xa9/0x11e0 [ 52.239066] mutex_lock_nested+0x1f/0x30 [ 52.244236] __iommu_probe_device+0x28c/0x5e0 [ 52.249893] probe_iommu_group+0x2f/0x50 [ 52.255064] bus_for_each_dev+0x7e/0xd0 [ 52.260126] bus_iommu_probe+0x3f/0x180 [ 52.265190] iommu_device_register+0x9f/0xe0 [ 52.270751] intel_iommu_init+0xd3b/0xf20 [ 52.276016] pci_iommu_init+0x16/0x40 [ 52.280892] do_one_initcall+0x5c/0x2d0 [ 52.285956] kernel_init_freeable+0x281/0x300 [ 52.291613] kernel_init+0x1e/0x140 [ 52.296284] ret_from_fork+0x159/0x200 [ 52.301253] ret_from_fork_asm+0x1a/0x30 [ 52.306421] -> #2 (iommu_probe_device_lock){+.+.}-{4:4}: [ 52.314333] __mutex_lock+0xa9/0x11e0 [ 52.319201] mutex_lock_nested+0x1f/0x30 [ 52.324372] iommu_probe_device+0x21/0x70 [ 52.329638] iommu_bus_notifier+0x2c/0x80 [ 52.334903] notifier_call_chain+0x4b/0x110 [ 52.340357] blocking_notifier_call_chain+0x4a/0x70 [ 52.346594] bus_notify+0x3b/0x50 [ 52.351079] device_add+0x65d/0x8b0 [ 52.355750] platform_device_add+0xf8/0x250 [ 52.361205] platform_device_register_full+0x154/0x1f0 [ 52.367739] platform_device_register_simple.constprop.0.isra.0+0x37/0x50 [ 52.376119] efisubsys_init+0xaf/0x570 [ 52.381090] do_one_initcall+0x5c/0x2d0 [ 52.386152] kernel_init_freeable+0x281/0x300 [ 52.391809] kernel_init+0x1e/0x140 [ 52.396481] ret_from_fork+0x159/0x200 [ 52.401450] ret_from_fork_asm+0x1a/0x30 [ 52.406620] -> #1 (&(&priv->bus_notifier)->rwsem){++++}-{4:4}: [ 52.415109] 
down_read+0x45/0x190 [ 52.419593] blocking_notifier_call_chain+0x2e/0x70 [ 52.425828] bus_notify+0x3b/0x50 [ 52.430311] device_add+0x65d/0x8b0 [ 52.434981] platform_device_add+0xf8/0x250 [ 52.440435] __hmem_register_resource+0x70/0xc0 [ 52.446279] hmem_register_resource+0x3b/0x50 [ 52.451923] hmat_register_target+0x3c/0x190 [ 52.457488] hmat_init+0x13f/0x370 [ 52.462067] do_one_initcall+0x5c/0x2d0 [ 52.467132] kernel_init_freeable+0x281/0x300 [ 52.472790] kernel_init+0x1e/0x140 [ 52.477464] ret_from_fork+0x159/0x200 [ 52.482433] ret_from_fork_asm+0x1a/0x30 [ 52.487604] -> #0 (hmem_resource_lock){+.+.}-{4:4}: [ 52.495030] __lock_acquire+0x14a4/0x2290 [ 52.500290] lock_acquire+0xdd/0x2f0 [ 52.505070] __mutex_lock+0xa9/0x11e0 [ 52.509944] mutex_lock_nested+0x1f/0x30 [ 52.515115] hmem_register_resource+0x31/0x50 [ 52.520771] hmat_register_target+0x3c/0x190 [ 52.526319] hmat_callback+0x6b/0x80 [ 52.531098] notifier_call_chain+0x4b/0x110 [ 52.536552] blocking_notifier_call_chain+0x4a/0x70 [ 52.542788] node_notify+0x1f/0x30 [ 52.547369] online_pages+0x288/0x330 [ 52.552246] memory_subsys_online+0x22a/0x280 [ 52.557902] device_online+0x50/0x90 [ 52.562669] state_store+0x9b/0xa0 [ 52.567247] dev_attr_store+0x18/0x30 [ 52.572123] sysfs_kf_write+0x4e/0x70 [ 52.576998] kernfs_fop_write_iter+0x187/0x260 [ 52.582750] vfs_write+0x21f/0x590 [ 52.587327] ksys_write+0x73/0xf0 [ 52.591811] __x64_sys_write+0x1d/0x30 [ 52.596779] x64_sys_call+0x7d/0x1d80 [ 52.601653] do_syscall_64+0x6c/0x2f0 [ 52.606528] entry_SYSCALL_64_after_hwframe+0x76/0x7e [ 52.612968] other info that might help us debug this: [ 52.622356] Chain exists of: hmem_resource_lock --> mem_hotplug_lock --> (node_chain).rwsem [ 52.635550] Possible unsafe locking scenario: [ 52.642495] CPU0 CPU1 [ 52.647752] ---- ---- [ 52.653014] rlock((node_chain).rwsem); [ 52.657589] lock(mem_hotplug_lock); [ 52.664701] lock((node_chain).rwsem); [ 52.672015] lock(hmem_resource_lock); [ 52.676497] *** DEADLOCK *** [ 52.683541] 8 locks 
held by systemd-udevd/3334: [ 52.688801] #0: ff36b6d49fbf0410 (sb_writers#3){.+.+}-{0:0}, at: ksys_write+0x73/0xf0 [ 52.697870] #1: ff36b6d4ece03a88 (&of->mutex#2){+.+.}-{4:4}, at: kernfs_fop_write_iter+0x12c/0x260 [ 52.708210] #2: ff36b6d4ece1cbb8 (kn->active#62){.+.+}-{0:0}, at: kernfs_fop_write_iter+0x141/0x260 [ 52.718645] #3: ffffffff90333cc8 (device_hotplug_lock){+.+.}-{4:4}, at: lock_device_hotplug_sysfs+0x1b/0x50 [ 52.729863] #4: ff36b6d4ece4b108 (&dev->mutex){....}-{4:4}, at: device_online+0x23/0x90 [ 52.739130] #5: ffffffff900664d0 (cpu_hotplug_lock){++++}-{0:0}, at: mem_hotplug_begin+0x12/0x30 [ 52.749288] torvalds#6: ffffffff9024c810 (mem_hotplug_lock){++++}-{0:0}, at: mem_hotplug_begin+0x1e/0x30 [ 52.759446] torvalds#7: ffffffff90338890 ((node_chain).rwsem){++++}-{4:4}, at: blocking_notifier_call_chain+0x2e/0x70 [ 52.770860] stack backtrace: [ 52.776068] CPU: 0 UID: 0 PID: 3334 Comm: systemd-udevd Tainted: G W 6.17.0djtest+ #53 PREEMPT(voluntary) [ 52.776071] Tainted: [W]=WARN [ 52.776072] Hardware name: Intel Corporation AvenueCity/AvenueCity, BIOS BHSDCRB1.IPC.3545.P03.2509232237 09/23/2025 [ 52.776073] Call Trace: [ 52.776074] <TASK> [ 52.776076] dump_stack_lvl+0x72/0xa0 [ 52.776080] dump_stack+0x14/0x1a [ 52.776082] print_circular_bug.cold+0x188/0x1c6 [ 52.776084] check_noncircular+0x12f/0x160 [ 52.776087] ? __lock_acquire+0x486/0x2290 [ 52.776089] ? __lock_acquire+0x486/0x2290 [ 52.776091] __lock_acquire+0x14a4/0x2290 [ 52.776095] lock_acquire+0xdd/0x2f0 [ 52.776096] ? hmem_register_resource+0x31/0x50 [ 52.776100] ? hmem_register_resource+0x31/0x50 [ 52.776101] __mutex_lock+0xa9/0x11e0 [ 52.776104] ? hmem_register_resource+0x31/0x50 [ 52.776104] ? __kernfs_create_file+0xb5/0x110 [ 52.776110] mutex_lock_nested+0x1f/0x30 [ 52.776112] ? 
mutex_lock_nested+0x1f/0x30 [ 52.776114] hmem_register_resource+0x31/0x50 [ 52.776115] hmat_register_target+0x3c/0x190 [ 52.776119] hmat_callback+0x6b/0x80 [ 52.776120] notifier_call_chain+0x4b/0x110 [ 52.776123] blocking_notifier_call_chain+0x4a/0x70 [ 52.776125] node_notify+0x1f/0x30 [ 52.776126] online_pages+0x288/0x330 [ 52.776129] memory_subsys_online+0x22a/0x280 [ 52.776132] device_online+0x50/0x90 [ 52.776134] state_store+0x9b/0xa0 [ 52.776136] dev_attr_store+0x18/0x30 [ 52.776137] sysfs_kf_write+0x4e/0x70 [ 52.776139] kernfs_fop_write_iter+0x187/0x260 [ 52.776142] vfs_write+0x21f/0x590 [ 52.776146] ksys_write+0x73/0xf0 [ 52.776148] __x64_sys_write+0x1d/0x30 [ 52.776150] x64_sys_call+0x7d/0x1d80 [ 52.776152] do_syscall_64+0x6c/0x2f0 [ 52.776154] entry_SYSCALL_64_after_hwframe+0x76/0x7e [ 52.776156] RIP: 0033:0x7f11142fda57 [ 52.776158] Code: 0f 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24 [ 52.776160] RSP: 002b:00007ffd0bd530f8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001 [ 52.776163] RAX: ffffffffffffffda RBX: 000000000000000e RCX: 00007f11142fda57 [ 52.776164] RDX: 000000000000000e RSI: 00007ffd0bd537c0 RDI: 0000000000000006 [ 52.776166] RBP: 00007ffd0bd537c0 R08: 00007f11143f70a0 R09: 00007ffd0bd53190 [ 52.776167] R10: 0000000000000000 R11: 0000000000000246 R12: 000000000000000e [ 52.776168] R13: 000055814e03e780 R14: 000000000000000e R15: 00007f11143f69e0 [ 52.776171] </TASK> The lock ordering can cause potential deadlock. There are instances where hmem_resource_lock is taken after (node_chain).rwsem, and vice versa. Narrow the scope of hmem_resource_lock in hmem_register_resource() to avoid the circular locking dependency. The locking is only needed when hmem_active needs to be protected. Fixes: 7dab174 ("dax/hmem: Move hmem device registration to dax_hmem.ko") Signed-off-by: Dave Jiang <dave.jiang@intel.com>
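The idea of narrowing a lock's scope can be illustrated with a user-space model. This is a sketch under stated assumptions, not the actual patch: `mock_mutex`, `hmem_resource_lock_model`, `hmem_active_model`, `register_device_model()`, `register_wide()` and `register_narrow()` are hypothetical names, and the condition under which registration proceeds is invented for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Counting mock lock standing in for hmem_resource_lock. */
struct mock_mutex { int depth; };

static struct mock_mutex hmem_resource_lock_model;
static bool hmem_active_model;        /* stand-in for the protected flag */
static int held_during_register = -1; /* lock depth seen by the registration step */

static void mock_lock(struct mock_mutex *m)
{
    m->depth++;
}

static void mock_unlock(struct mock_mutex *m)
{
    assert(m->depth > 0);
    m->depth--;
}

/*
 * Placeholder for the step that, in the splat, ends up in device_add() and
 * the bus notifier chain. It only records whether the lock was held.
 */
static void register_device_model(void)
{
    held_during_register = hmem_resource_lock_model.depth;
}

/* Wide scope: the lock is held across the whole registration. */
static void register_wide(void)
{
    mock_lock(&hmem_resource_lock_model);
    if (hmem_active_model) /* assumed condition, for illustration only */
        register_device_model();
    mock_unlock(&hmem_resource_lock_model);
}

/* Narrow scope: the lock covers only the read of the protected flag. */
static void register_narrow(void)
{
    bool active;

    mock_lock(&hmem_resource_lock_model);
    active = hmem_active_model;
    mock_unlock(&hmem_resource_lock_model);

    if (active)
        register_device_model();
}
```

In the narrow variant the registration step runs with the lock released, so it can no longer create a dependency from this lock to whatever locks that step takes. The cost is that the flag is acted on as a snapshot; whether that is acceptable depends on the real driver, which this sketch does not model.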
When injecting AER errors under PREEMPT_RT, the kernel may trigger a lockdep warning about an invalid wait context:

```
[ 1850.950780] [ BUG: Invalid wait context ]
[ 1850.951152] 6.17.0-11316-g7a405dbb0f03-dirty #7 Not tainted
[ 1850.951457] -----------------------------
[ 1850.951680] irq/16-PCIe PME/56 is trying to lock:
[ 1850.952004] ffff800082865238 (inject_lock){+.+.}-{3:3}, at: aer_inj_read_config+0x38/0x1dc
[ 1850.952731] other info that might help us debug this:
[ 1850.952997] context-{5:5}
[ 1850.953192] 5 locks held by irq/16-PCIe PME/56:
[ 1850.953415] #0: ffff800082647390 (local_bh){.+.+}-{1:3}, at: __local_bh_disable_ip+0x30/0x268
[ 1850.953931] #1: ffff8000826c6b38 (rcu_read_lock){....}-{1:3}, at: rcu_lock_acquire+0x4/0x48
[ 1850.954453] #2: ffff000004bb6c58 (&data->lock){+...}-{3:3}, at: pcie_pme_irq+0x34/0xc4
[ 1850.954949] #3: ffff8000826c6b38 (rcu_read_lock){....}-{1:3}, at: rcu_lock_acquire+0x4/0x48
[ 1850.955420] #4: ffff800082863d10 (pci_lock){....}-{2:2}, at: pci_bus_read_config_dword+0x5c/0xd8
```

This happens because the AER injection path (`aer_inj_read_config()`) is called in the context of the PCIe PME interrupt thread, which runs through `irq_forced_thread_fn()` under PREEMPT_RT. In this context, `pci_lock` (a `raw_spinlock_t`) is held with interrupts disabled (`raw_spin_lock_irqsave()`), and then `aer_inj_read_config()` tries to acquire `inject_lock`, which is an `rt_spin_lock`. (Thanks Waiman Long)

`rt_spin_lock` may sleep, so acquiring it while holding a raw spinlock with IRQs disabled violates the lock ordering rules. This leads to the “Invalid wait context” lockdep warning.

In other words, the lock order looks like this:

```
raw_spin_lock_irqsave(&pci_lock);
    ↓
rt_spin_lock(&inject_lock); <-- not allowed
```

To fix this, convert `inject_lock` from an `rt_spin_lock` to a `raw_spinlock_t`. A raw spinlock is safe here and consistent with the surrounding locking scheme.

This resolves the lockdep “Invalid wait context” warning observed when injecting correctable AER errors through `/dev/aer_inject` on PREEMPT_RT.

This was discovered while testing PCIe AER error injection on an arm64 QEMU virtual machine:

```
qemu-system-aarch64 \
  -nographic \
  -machine virt,highmem=off,gic-version=3 \
  -cpu cortex-a72 \
  -kernel arch/arm64/boot/Image \
  -initrd initramfs.cpio.gz \
  -append "console=ttyAMA0 root=/dev/ram rdinit=/linuxrc nokaslr" \
  -m 2G \
  -smp 1 \
  -netdev user,id=net0,hostfwd=tcp::2223-:22 \
  -device virtio-net-pci,netdev=net0 \
  -device pcie-root-port,id=rp0,chassis=1,slot=0x0 \
  -device pci-testdev -s -S
```

Injecting a correctable PCIe error via /dev/aer_inject caused a BUG report with "Invalid wait context" in the irq/PCIe thread.

```
~ # export HEX="00020000000000000100000000000000000000000000000000000000"
~ # echo -n "$HEX" | xxd -r -p | tee /dev/aer_inject >/dev/null
[ 1850.947170] pcieport 0000:00:02.0: aer_inject: Injecting errors 00000001/00000000 into device 0000:00:02.0
[ 1850.949951]
[ 1850.950479] =============================
[ 1850.950780] [ BUG: Invalid wait context ]
[ 1850.951152] 6.17.0-11316-g7a405dbb0f03-dirty #7 Not tainted
[ 1850.951457] -----------------------------
[ 1850.951680] irq/16-PCIe PME/56 is trying to lock:
[ 1850.952004] ffff800082865238 (inject_lock){+.+.}-{3:3}, at: aer_inj_read_config+0x38/0x1dc
[ 1850.952731] other info that might help us debug this:
[ 1850.952997] context-{5:5}
[ 1850.953192] 5 locks held by irq/16-PCIe PME/56:
[ 1850.953415] #0: ffff800082647390 (local_bh){.+.+}-{1:3}, at: __local_bh_disable_ip+0x30/0x268
[ 1850.953931] #1: ffff8000826c6b38 (rcu_read_lock){....}-{1:3}, at: rcu_lock_acquire+0x4/0x48
[ 1850.954453] #2: ffff000004bb6c58 (&data->lock){+...}-{3:3}, at: pcie_pme_irq+0x34/0xc4
[ 1850.954949] #3: ffff8000826c6b38 (rcu_read_lock){....}-{1:3}, at: rcu_lock_acquire+0x4/0x48
[ 1850.955420] #4: ffff800082863d10 (pci_lock){....}-{2:2}, at: pci_bus_read_config_dword+0x5c/0xd8
[ 1850.955932] stack backtrace:
[ 1850.956412] CPU: 0 UID: 0 PID: 56 Comm: irq/16-PCIe PME Not tainted 6.17.0-11316-g7a405dbb0f03-dirty #7 PREEMPT_{RT,(full)}
[ 1850.957039] Hardware name: linux,dummy-virt (DT)
[ 1850.957409] Call trace:
[ 1850.957727] show_stack+0x18/0x24 (C)
[ 1850.958089] dump_stack_lvl+0x40/0xbc
[ 1850.958339] dump_stack+0x18/0x24
[ 1850.958586] __lock_acquire+0xa84/0x3008
[ 1850.958907] lock_acquire+0x128/0x2a8
[ 1850.959171] rt_spin_lock+0x50/0x1b8
[ 1850.959476] aer_inj_read_config+0x38/0x1dc
[ 1850.959821] pci_bus_read_config_dword+0x80/0xd8
[ 1850.960079] pcie_capability_read_dword+0xac/0xd8
[ 1850.960454] pcie_pme_irq+0x44/0xc4
[ 1850.960728] irq_forced_thread_fn+0x30/0x94
[ 1850.960984] irq_thread+0x1ac/0x3a4
[ 1850.961308] kthread+0x1b4/0x208
[ 1850.961557] ret_from_fork+0x10/0x20
[ 1850.963088] pcieport 0000:00:02.0: AER: Correctable error message received from 0000:00:02.0
[ 1850.963330] pcieport 0000:00:02.0: PCIe Bus Error: severity=Correctable, type=Physical Layer, (Receiver ID)
[ 1850.963351] pcieport 0000:00:02.0: device [1b36:000c] error status/mask=00000001/0000e000
[ 1850.963385] pcieport 0000:00:02.0: [ 0] RxErr (First)
```

Signed-off-by: Guangbo Cui <jckeep.cuiguangbo@gmail.com>
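The rule behind the warning can be approximated with a small model of the wait-type check. This is a simplified sketch, not lockdep's implementation: the names `wait_rank`, `held_lock_model`, `current_limit()` and `may_acquire()` are hypothetical, and the numeric ranks are simply read off the `{outer:inner}` annotations in the log above, on the assumption that the second number of each held lock bounds what may be taken while it is held.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative ranks; the values mirror the {2:2} and {3:3} annotations in the log. */
enum wait_rank { RANK_RAW_SPIN = 2, RANK_RT_SPIN = 3 };

struct held_lock_model {
    const char *name;
    int inner_rank; /* second number of the {outer:inner} pair */
};

/* The five held locks from the report, with their inner ranks. */
static const struct held_lock_model held_from_log[] = {
    { "local_bh",       3 },
    { "rcu_read_lock",  3 },
    { "&data->lock",    3 },
    { "rcu_read_lock",  3 },
    { "pci_lock",       2 },
};

#define HELD_COUNT (sizeof(held_from_log) / sizeof(held_from_log[0]))

/* The lowest inner rank among the held locks bounds what may be taken next. */
static int current_limit(const struct held_lock_model *held, size_t n,
                         int context_rank)
{
    int limit = context_rank;

    for (size_t i = 0; i < n; i++)
        if (held[i].inner_rank < limit)
            limit = held[i].inner_rank;
    return limit;
}

/* Returns 1 if a lock of next_rank may be taken, 0 for "invalid wait context". */
static int may_acquire(const struct held_lock_model *held, size_t n,
                       int context_rank, int next_rank)
{
    return next_rank <= current_limit(held, n, context_rank);
}
```

With the context rank 5 taken from `context-{5:5}`, `may_acquire(held_from_log, HELD_COUNT, 5, RANK_RT_SPIN)` is rejected because `pci_lock` lowers the limit to 2, whereas the same call with `RANK_RAW_SPIN` is accepted. That matches the direction of the fix described above.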
…under PREEMPT_RT

Under PREEMPT_RT, IRQ handlers are forced to run in threads. However, lockdep did not correctly account for this case, causing false-positive warnings about hardirq context violations when analyzing lock acquisition in such threaded IRQs (see function `task_wait_context`).

This patch updates `irq_forced_thread_fn` to explicitly call `lockdep_hardirq_enter()` and `lockdep_hardirq_exit()` when PREEMPT_RT is enabled, ensuring lockdep correctly tracks the hardirq context even when the IRQ is executed in a forced thread.

This was discovered while testing PCIe AER error injection on an arm64 QEMU virtual machine:

```
qemu-system-aarch64 \
  -nographic \
  -machine virt,highmem=off,gic-version=3 \
  -cpu cortex-a72 \
  -kernel arch/arm64/boot/Image \
  -initrd initramfs.cpio.gz \
  -append "console=ttyAMA0 root=/dev/ram rdinit=/linuxrc earlyprintk nokaslr" \
  -m 2G \
  -smp 1 \
  -netdev user,id=net0,hostfwd=tcp::2223-:22 \
  -device virtio-net-pci,netdev=net0 \
  -device pcie-root-port,id=rp0,chassis=1,slot=0x0 \
  -device pci-testdev -s -S
```

Injecting a correctable PCIe error via /dev/aer_inject caused a BUG report with "Invalid wait context" in the irq/PCIe thread.

```
~ # export HEX="00020000000000000100000000000000000000000000000000000000"
~ # echo -n "$HEX" | xxd -r -p | tee /dev/aer_inject >/dev/null
[ 1850.947170] pcieport 0000:00:02.0: aer_inject: Injecting errors 00000001/00000000 into device 0000:00:02.0
[ 1850.949951]
[ 1850.950479] =============================
[ 1850.950780] [ BUG: Invalid wait context ]
[ 1850.951152] 6.17.0-11316-g7a405dbb0f03-dirty #7 Not tainted
[ 1850.951457] -----------------------------
[ 1850.951680] irq/16-PCIe PME/56 is trying to lock:
[ 1850.952004] ffff800082865238 (inject_lock){+.+.}-{3:3}, at: aer_inj_read_config+0x38/0x1dc
[ 1850.952731] other info that might help us debug this:
[ 1850.952997] context-{5:5}
[ 1850.953192] 5 locks held by irq/16-PCIe PME/56:
[ 1850.953415] #0: ffff800082647390 (local_bh){.+.+}-{1:3}, at: __local_bh_disable_ip+0x30/0x268
[ 1850.953931] #1: ffff8000826c6b38 (rcu_read_lock){....}-{1:3}, at: rcu_lock_acquire+0x4/0x48
[ 1850.954453] #2: ffff000004bb6c58 (&data->lock){+...}-{3:3}, at: pcie_pme_irq+0x34/0xc4
[ 1850.954949] #3: ffff8000826c6b38 (rcu_read_lock){....}-{1:3}, at: rcu_lock_acquire+0x4/0x48
[ 1850.955420] #4: ffff800082863d10 (pci_lock){....}-{2:2}, at: pci_bus_read_config_dword+0x5c/0xd8
[ 1850.955932] stack backtrace:
[ 1850.956412] CPU: 0 UID: 0 PID: 56 Comm: irq/16-PCIe PME Not tainted 6.17.0-11316-g7a405dbb0f03-dirty #7 PREEMPT_{RT,(full)}
[ 1850.957039] Hardware name: linux,dummy-virt (DT)
[ 1850.957409] Call trace:
[ 1850.957727] show_stack+0x18/0x24 (C)
[ 1850.958089] dump_stack_lvl+0x40/0xbc
[ 1850.958339] dump_stack+0x18/0x24
[ 1850.958586] __lock_acquire+0xa84/0x3008
[ 1850.958907] lock_acquire+0x128/0x2a8
[ 1850.959171] rt_spin_lock+0x50/0x1b8
[ 1850.959476] aer_inj_read_config+0x38/0x1dc
[ 1850.959821] pci_bus_read_config_dword+0x80/0xd8
[ 1850.960079] pcie_capability_read_dword+0xac/0xd8
[ 1850.960454] pcie_pme_irq+0x44/0xc4
[ 1850.960728] irq_forced_thread_fn+0x30/0x94
[ 1850.960984] irq_thread+0x1ac/0x3a4
[ 1850.961308] kthread+0x1b4/0x208
[ 1850.961557] ret_from_fork+0x10/0x20
[ 1850.963088] pcieport 0000:00:02.0: AER: Correctable error message received from 0000:00:02.0
[ 1850.963330] pcieport 0000:00:02.0: PCIe Bus Error: severity=Correctable, type=Physical Layer, (Receiver ID)
[ 1850.963351] pcieport 0000:00:02.0: device [1b36:000c] error status/mask=00000001/0000e000
[ 1850.963385] pcieport 0000:00:02.0: [ 0] RxErr (First)
```

Signed-off-by: Guangbo Cui <2407018371@qq.com>
- treat tailcall count as 32-bit for access and update
- change out_offset scope from file to function
- minor format/structure changes for consistency

Testing: (skipping fentry, fexit, freplace)
========

root@qemu-armhf:/usr/libexec/kselftests-bpf# modprobe test_bpf test_suite=test_tail_calls
test_bpf: #0 Tail call leaf jited:1 967 PASS
test_bpf: #1 Tail call 2 jited:1 1427 PASS
test_bpf: #2 Tail call 3 jited:1 2373 PASS
test_bpf: #3 Tail call 4 jited:1 2304 PASS
test_bpf: #4 Tail call load/store leaf jited:1 1684 PASS
test_bpf: #5 Tail call load/store jited:1 2249 PASS
test_bpf: #6 Tail call error path, max count reached jited:1 22538 PASS
test_bpf: #7 Tail call count preserved across function calls jited:1 1055668 PASS
test_bpf: #8 Tail call error path, NULL target jited:1 513 PASS
test_bpf: #9 Tail call error path, index out of range jited:1 392 PASS
test_bpf: test_tail_calls: Summary: 10 PASSED, 0 FAILED, [10/10 JIT'ed]

root@qemu-armhf:/usr/libexec/kselftests-bpf# ./test_progs -n 397/1-12,17-18,23-24,27-31
397/1 tailcalls/tailcall_1:OK
397/2 tailcalls/tailcall_2:OK
397/3 tailcalls/tailcall_3:OK
397/4 tailcalls/tailcall_4:OK
397/5 tailcalls/tailcall_5:OK
397/6 tailcalls/tailcall_6:OK
397/7 tailcalls/tailcall_bpf2bpf_1:OK
397/8 tailcalls/tailcall_bpf2bpf_2:OK
397/9 tailcalls/tailcall_bpf2bpf_3:OK
397/10 tailcalls/tailcall_bpf2bpf_4:OK
397/11 tailcalls/tailcall_bpf2bpf_5:OK
397/12 tailcalls/tailcall_bpf2bpf_6:OK
397/17 tailcalls/tailcall_poke:OK
397/18 tailcalls/tailcall_bpf2bpf_hierarchy_1:OK
397/23 tailcalls/tailcall_bpf2bpf_hierarchy_2:OK
397/24 tailcalls/tailcall_bpf2bpf_hierarchy_3:OK
397/27 tailcalls/tailcall_failure:OK
397/28 tailcalls/reject_tail_call_spin_lock:OK
397/29 tailcalls/reject_tail_call_rcu_lock:OK
397/30 tailcalls/reject_tail_call_preempt_lock:OK
397/31 tailcalls/reject_tail_call_ref:OK
397 tailcalls:OK
Summary: 1/21 PASSED, 0 SKIPPED, 0 FAILED

Signed-off-by: Tony Ambardar <tony.ambardar@gmail.com>