Skip to content

Conversation

mdminhazulhaque
Copy link

The USB device with ID 0x015E is known as Huwai WIXUBB116. The module works fine when the ID is changed to 0x015E.

The USB device with ID 0x015E is known as Huwai WIXUBB116. The module works fine when the ID is changed to 0x015E.
@davispuh
Copy link
Contributor

that doesn't change anything...
0x15E is equal to 0x015E (and also equal to 0x0000015E)

http://ideone.com/wzJoG

tklauser pushed a commit to tklauser/linux that referenced this pull request Sep 18, 2012
…d reasons

We've had some reports of a deadlock where rpciod ends up with a stack
trace like this:

    PID: 2507   TASK: ffff88103691ab40  CPU: 14  COMMAND: "rpciod/14"
     #0 [ffff8810343bf2f0] schedule at ffffffff814dabd9
     #1 [ffff8810343bf3b8] nfs_wait_bit_killable at ffffffffa038fc04 [nfs]
     #2 [ffff8810343bf3c8] __wait_on_bit at ffffffff814dbc2f
     #3 [ffff8810343bf418] out_of_line_wait_on_bit at ffffffff814dbcd8
     #4 [ffff8810343bf488] nfs_commit_inode at ffffffffa039e0c1 [nfs]
     #5 [ffff8810343bf4f8] nfs_release_page at ffffffffa038bef6 [nfs]
     torvalds#6 [ffff8810343bf528] try_to_release_page at ffffffff8110c670
     torvalds#7 [ffff8810343bf538] shrink_page_list.clone.0 at ffffffff81126271
     torvalds#8 [ffff8810343bf668] shrink_inactive_list at ffffffff81126638
     torvalds#9 [ffff8810343bf818] shrink_zone at ffffffff8112788f
    torvalds#10 [ffff8810343bf8c8] do_try_to_free_pages at ffffffff81127b1e
    torvalds#11 [ffff8810343bf958] try_to_free_pages at ffffffff8112812f
    torvalds#12 [ffff8810343bfa08] __alloc_pages_nodemask at ffffffff8111fdad
    torvalds#13 [ffff8810343bfb28] kmem_getpages at ffffffff81159942
    torvalds#14 [ffff8810343bfb58] fallback_alloc at ffffffff8115a55a
    torvalds#15 [ffff8810343bfbd8] ____cache_alloc_node at ffffffff8115a2d9
    torvalds#16 [ffff8810343bfc38] kmem_cache_alloc at ffffffff8115b09b
    torvalds#17 [ffff8810343bfc78] sk_prot_alloc at ffffffff81411808
    torvalds#18 [ffff8810343bfcb8] sk_alloc at ffffffff8141197c
    torvalds#19 [ffff8810343bfce8] inet_create at ffffffff81483ba6
    torvalds#20 [ffff8810343bfd38] __sock_create at ffffffff8140b4a7
    torvalds#21 [ffff8810343bfd98] xs_create_sock at ffffffffa01f649b [sunrpc]
    torvalds#22 [ffff8810343bfdd8] xs_tcp_setup_socket at ffffffffa01f6965 [sunrpc]
    torvalds#23 [ffff8810343bfe38] worker_thread at ffffffff810887d0
    torvalds#24 [ffff8810343bfee8] kthread at ffffffff8108dd96
    torvalds#25 [ffff8810343bff48] kernel_thread at ffffffff8100c1ca

rpciod is trying to allocate memory for a new socket to talk to the
server. The VM ends up calling ->releasepage to get more memory, and it
tries to do a blocking commit. That commit can't succeed however without
a connected socket, so we deadlock.

Fix this by setting PF_FSTRANS on the workqueue task prior to doing the
socket allocation, and having nfs_release_page check for that flag when
deciding whether to do a commit call. Also, set PF_FSTRANS
unconditionally in rpc_async_schedule since that function can also do
allocations sometimes.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: stable@vger.kernel.org
wrobelda referenced this pull request in wrobelda/linux-sunxi Sep 19, 2012
wrobelda referenced this pull request in wrobelda/linux-sunxi Sep 19, 2012
…d reasons

commit 5cf02d0 upstream.

We've had some reports of a deadlock where rpciod ends up with a stack
trace like this:

    PID: 2507   TASK: ffff88103691ab40  CPU: 14  COMMAND: "rpciod/14"
     #0 [ffff8810343bf2f0] schedule at ffffffff814dabd9
     linux-sunxi#1 [ffff8810343bf3b8] nfs_wait_bit_killable at ffffffffa038fc04 [nfs]
     linux-sunxi#2 [ffff8810343bf3c8] __wait_on_bit at ffffffff814dbc2f
     linux-sunxi#3 [ffff8810343bf418] out_of_line_wait_on_bit at ffffffff814dbcd8
     linux-sunxi#4 [ffff8810343bf488] nfs_commit_inode at ffffffffa039e0c1 [nfs]
     linux-sunxi#5 [ffff8810343bf4f8] nfs_release_page at ffffffffa038bef6 [nfs]
     linux-sunxi#6 [ffff8810343bf528] try_to_release_page at ffffffff8110c670
     linux-sunxi#7 [ffff8810343bf538] shrink_page_list.clone.0 at ffffffff81126271
     linux-sunxi#8 [ffff8810343bf668] shrink_inactive_list at ffffffff81126638
     linux-sunxi#9 [ffff8810343bf818] shrink_zone at ffffffff8112788f
    linux-sunxi#10 [ffff8810343bf8c8] do_try_to_free_pages at ffffffff81127b1e
    linux-sunxi#11 [ffff8810343bf958] try_to_free_pages at ffffffff8112812f
    linux-sunxi#12 [ffff8810343bfa08] __alloc_pages_nodemask at ffffffff8111fdad
    linux-sunxi#13 [ffff8810343bfb28] kmem_getpages at ffffffff81159942
    linux-sunxi#14 [ffff8810343bfb58] fallback_alloc at ffffffff8115a55a
    linux-sunxi#15 [ffff8810343bfbd8] ____cache_alloc_node at ffffffff8115a2d9
    linux-sunxi#16 [ffff8810343bfc38] kmem_cache_alloc at ffffffff8115b09b
    linux-sunxi#17 [ffff8810343bfc78] sk_prot_alloc at ffffffff81411808
    linux-sunxi#18 [ffff8810343bfcb8] sk_alloc at ffffffff8141197c
    linux-sunxi#19 [ffff8810343bfce8] inet_create at ffffffff81483ba6
    linux-sunxi#20 [ffff8810343bfd38] __sock_create at ffffffff8140b4a7
    linux-sunxi#21 [ffff8810343bfd98] xs_create_sock at ffffffffa01f649b [sunrpc]
    linux-sunxi#22 [ffff8810343bfdd8] xs_tcp_setup_socket at ffffffffa01f6965 [sunrpc]
    linux-sunxi#23 [ffff8810343bfe38] worker_thread at ffffffff810887d0
    linux-sunxi#24 [ffff8810343bfee8] kthread at ffffffff8108dd96
    linux-sunxi#25 [ffff8810343bff48] kernel_thread at ffffffff8100c1ca

rpciod is trying to allocate memory for a new socket to talk to the
server. The VM ends up calling ->releasepage to get more memory, and it
tries to do a blocking commit. That commit can't succeed however without
a connected socket, so we deadlock.

Fix this by setting PF_FSTRANS on the workqueue task prior to doing the
socket allocation, and having nfs_release_page check for that flag when
deciding whether to do a commit call. Also, set PF_FSTRANS
unconditionally in rpc_async_schedule since that function can also do
allocations sometimes.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
heftig referenced this pull request in zen-kernel/zen-kernel Sep 29, 2012
…d reasons

commit 5cf02d0 upstream.

We've had some reports of a deadlock where rpciod ends up with a stack
trace like this:

    PID: 2507   TASK: ffff88103691ab40  CPU: 14  COMMAND: "rpciod/14"
     #0 [ffff8810343bf2f0] schedule at ffffffff814dabd9
     #1 [ffff8810343bf3b8] nfs_wait_bit_killable at ffffffffa038fc04 [nfs]
     #2 [ffff8810343bf3c8] __wait_on_bit at ffffffff814dbc2f
     #3 [ffff8810343bf418] out_of_line_wait_on_bit at ffffffff814dbcd8
     #4 [ffff8810343bf488] nfs_commit_inode at ffffffffa039e0c1 [nfs]
     #5 [ffff8810343bf4f8] nfs_release_page at ffffffffa038bef6 [nfs]
     #6 [ffff8810343bf528] try_to_release_page at ffffffff8110c670
     #7 [ffff8810343bf538] shrink_page_list.clone.0 at ffffffff81126271
     #8 [ffff8810343bf668] shrink_inactive_list at ffffffff81126638
     #9 [ffff8810343bf818] shrink_zone at ffffffff8112788f
    #10 [ffff8810343bf8c8] do_try_to_free_pages at ffffffff81127b1e
    #11 [ffff8810343bf958] try_to_free_pages at ffffffff8112812f
    #12 [ffff8810343bfa08] __alloc_pages_nodemask at ffffffff8111fdad
    #13 [ffff8810343bfb28] kmem_getpages at ffffffff81159942
    #14 [ffff8810343bfb58] fallback_alloc at ffffffff8115a55a
    #15 [ffff8810343bfbd8] ____cache_alloc_node at ffffffff8115a2d9
    #16 [ffff8810343bfc38] kmem_cache_alloc at ffffffff8115b09b
    #17 [ffff8810343bfc78] sk_prot_alloc at ffffffff81411808
    #18 [ffff8810343bfcb8] sk_alloc at ffffffff8141197c
    #19 [ffff8810343bfce8] inet_create at ffffffff81483ba6
    #20 [ffff8810343bfd38] __sock_create at ffffffff8140b4a7
    #21 [ffff8810343bfd98] xs_create_sock at ffffffffa01f649b [sunrpc]
    #22 [ffff8810343bfdd8] xs_tcp_setup_socket at ffffffffa01f6965 [sunrpc]
    #23 [ffff8810343bfe38] worker_thread at ffffffff810887d0
    #24 [ffff8810343bfee8] kthread at ffffffff8108dd96
    #25 [ffff8810343bff48] kernel_thread at ffffffff8100c1ca

rpciod is trying to allocate memory for a new socket to talk to the
server. The VM ends up calling ->releasepage to get more memory, and it
tries to do a blocking commit. That commit can't succeed however without
a connected socket, so we deadlock.

Fix this by setting PF_FSTRANS on the workqueue task prior to doing the
socket allocation, and having nfs_release_page check for that flag when
deciding whether to do a commit call. Also, set PF_FSTRANS
unconditionally in rpc_async_schedule since that function can also do
allocations sometimes.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
koenkooi pushed a commit to koenkooi/linux that referenced this pull request Oct 2, 2012
…d reasons

commit 5cf02d0 upstream.

We've had some reports of a deadlock where rpciod ends up with a stack
trace like this:

    PID: 2507   TASK: ffff88103691ab40  CPU: 14  COMMAND: "rpciod/14"
     #0 [ffff8810343bf2f0] schedule at ffffffff814dabd9
     #1 [ffff8810343bf3b8] nfs_wait_bit_killable at ffffffffa038fc04 [nfs]
     #2 [ffff8810343bf3c8] __wait_on_bit at ffffffff814dbc2f
     #3 [ffff8810343bf418] out_of_line_wait_on_bit at ffffffff814dbcd8
     #4 [ffff8810343bf488] nfs_commit_inode at ffffffffa039e0c1 [nfs]
     #5 [ffff8810343bf4f8] nfs_release_page at ffffffffa038bef6 [nfs]
     #6 [ffff8810343bf528] try_to_release_page at ffffffff8110c670
     #7 [ffff8810343bf538] shrink_page_list.clone.0 at ffffffff81126271
     #8 [ffff8810343bf668] shrink_inactive_list at ffffffff81126638
     #9 [ffff8810343bf818] shrink_zone at ffffffff8112788f
    torvalds#10 [ffff8810343bf8c8] do_try_to_free_pages at ffffffff81127b1e
    torvalds#11 [ffff8810343bf958] try_to_free_pages at ffffffff8112812f
    torvalds#12 [ffff8810343bfa08] __alloc_pages_nodemask at ffffffff8111fdad
    torvalds#13 [ffff8810343bfb28] kmem_getpages at ffffffff81159942
    torvalds#14 [ffff8810343bfb58] fallback_alloc at ffffffff8115a55a
    torvalds#15 [ffff8810343bfbd8] ____cache_alloc_node at ffffffff8115a2d9
    torvalds#16 [ffff8810343bfc38] kmem_cache_alloc at ffffffff8115b09b
    torvalds#17 [ffff8810343bfc78] sk_prot_alloc at ffffffff81411808
    torvalds#18 [ffff8810343bfcb8] sk_alloc at ffffffff8141197c
    torvalds#19 [ffff8810343bfce8] inet_create at ffffffff81483ba6
    torvalds#20 [ffff8810343bfd38] __sock_create at ffffffff8140b4a7
    torvalds#21 [ffff8810343bfd98] xs_create_sock at ffffffffa01f649b [sunrpc]
    torvalds#22 [ffff8810343bfdd8] xs_tcp_setup_socket at ffffffffa01f6965 [sunrpc]
    torvalds#23 [ffff8810343bfe38] worker_thread at ffffffff810887d0
    torvalds#24 [ffff8810343bfee8] kthread at ffffffff8108dd96
    torvalds#25 [ffff8810343bff48] kernel_thread at ffffffff8100c1ca

rpciod is trying to allocate memory for a new socket to talk to the
server. The VM ends up calling ->releasepage to get more memory, and it
tries to do a blocking commit. That commit can't succeed however without
a connected socket, so we deadlock.

Fix this by setting PF_FSTRANS on the workqueue task prior to doing the
socket allocation, and having nfs_release_page check for that flag when
deciding whether to do a commit call. Also, set PF_FSTRANS
unconditionally in rpc_async_schedule since that function can also do
allocations sometimes.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
koenkooi pushed a commit to koenkooi/linux that referenced this pull request Oct 4, 2012
…d reasons

commit 5cf02d0 upstream.

We've had some reports of a deadlock where rpciod ends up with a stack
trace like this:

    PID: 2507   TASK: ffff88103691ab40  CPU: 14  COMMAND: "rpciod/14"
     #0 [ffff8810343bf2f0] schedule at ffffffff814dabd9
     #1 [ffff8810343bf3b8] nfs_wait_bit_killable at ffffffffa038fc04 [nfs]
     #2 [ffff8810343bf3c8] __wait_on_bit at ffffffff814dbc2f
     #3 [ffff8810343bf418] out_of_line_wait_on_bit at ffffffff814dbcd8
     #4 [ffff8810343bf488] nfs_commit_inode at ffffffffa039e0c1 [nfs]
     #5 [ffff8810343bf4f8] nfs_release_page at ffffffffa038bef6 [nfs]
     #6 [ffff8810343bf528] try_to_release_page at ffffffff8110c670
     #7 [ffff8810343bf538] shrink_page_list.clone.0 at ffffffff81126271
     #8 [ffff8810343bf668] shrink_inactive_list at ffffffff81126638
     #9 [ffff8810343bf818] shrink_zone at ffffffff8112788f
    torvalds#10 [ffff8810343bf8c8] do_try_to_free_pages at ffffffff81127b1e
    torvalds#11 [ffff8810343bf958] try_to_free_pages at ffffffff8112812f
    torvalds#12 [ffff8810343bfa08] __alloc_pages_nodemask at ffffffff8111fdad
    torvalds#13 [ffff8810343bfb28] kmem_getpages at ffffffff81159942
    torvalds#14 [ffff8810343bfb58] fallback_alloc at ffffffff8115a55a
    torvalds#15 [ffff8810343bfbd8] ____cache_alloc_node at ffffffff8115a2d9
    torvalds#16 [ffff8810343bfc38] kmem_cache_alloc at ffffffff8115b09b
    torvalds#17 [ffff8810343bfc78] sk_prot_alloc at ffffffff81411808
    torvalds#18 [ffff8810343bfcb8] sk_alloc at ffffffff8141197c
    torvalds#19 [ffff8810343bfce8] inet_create at ffffffff81483ba6
    torvalds#20 [ffff8810343bfd38] __sock_create at ffffffff8140b4a7
    torvalds#21 [ffff8810343bfd98] xs_create_sock at ffffffffa01f649b [sunrpc]
    torvalds#22 [ffff8810343bfdd8] xs_tcp_setup_socket at ffffffffa01f6965 [sunrpc]
    torvalds#23 [ffff8810343bfe38] worker_thread at ffffffff810887d0
    torvalds#24 [ffff8810343bfee8] kthread at ffffffff8108dd96
    torvalds#25 [ffff8810343bff48] kernel_thread at ffffffff8100c1ca

rpciod is trying to allocate memory for a new socket to talk to the
server. The VM ends up calling ->releasepage to get more memory, and it
tries to do a blocking commit. That commit can't succeed however without
a connected socket, so we deadlock.

Fix this by setting PF_FSTRANS on the workqueue task prior to doing the
socket allocation, and having nfs_release_page check for that flag when
deciding whether to do a commit call. Also, set PF_FSTRANS
unconditionally in rpc_async_schedule since that function can also do
allocations sometimes.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
noamc referenced this pull request in Mellanox/linux Oct 16, 2012
…fpga] IPI fixes

1. IPI was not going from cpu1 -> cpu0. Turns out that the IRQ mode
   programmed IDU_IRQ_MOD_TCPU_ALLRECP mismatched in value in kernel and
   ISS. kernel (per IDU specs) defines it as '3' whereas ISS believes it
   shd be 7 (which is clearly a BUG).

2. the IDU IRQs are hardwaired hence need to be setup as IRQF_PERCPU to
    disallow any irq-affinity changes.

Signed-off-by: Vineet Gupta <vgupta@synopsys.com>
koenkooi pushed a commit to koenkooi/linux that referenced this pull request Oct 17, 2012
…d reasons

commit 5cf02d0 upstream.

We've had some reports of a deadlock where rpciod ends up with a stack
trace like this:

    PID: 2507   TASK: ffff88103691ab40  CPU: 14  COMMAND: "rpciod/14"
     #0 [ffff8810343bf2f0] schedule at ffffffff814dabd9
     #1 [ffff8810343bf3b8] nfs_wait_bit_killable at ffffffffa038fc04 [nfs]
     #2 [ffff8810343bf3c8] __wait_on_bit at ffffffff814dbc2f
     #3 [ffff8810343bf418] out_of_line_wait_on_bit at ffffffff814dbcd8
     #4 [ffff8810343bf488] nfs_commit_inode at ffffffffa039e0c1 [nfs]
     #5 [ffff8810343bf4f8] nfs_release_page at ffffffffa038bef6 [nfs]
     #6 [ffff8810343bf528] try_to_release_page at ffffffff8110c670
     #7 [ffff8810343bf538] shrink_page_list.clone.0 at ffffffff81126271
     #8 [ffff8810343bf668] shrink_inactive_list at ffffffff81126638
     #9 [ffff8810343bf818] shrink_zone at ffffffff8112788f
    torvalds#10 [ffff8810343bf8c8] do_try_to_free_pages at ffffffff81127b1e
    torvalds#11 [ffff8810343bf958] try_to_free_pages at ffffffff8112812f
    torvalds#12 [ffff8810343bfa08] __alloc_pages_nodemask at ffffffff8111fdad
    torvalds#13 [ffff8810343bfb28] kmem_getpages at ffffffff81159942
    torvalds#14 [ffff8810343bfb58] fallback_alloc at ffffffff8115a55a
    torvalds#15 [ffff8810343bfbd8] ____cache_alloc_node at ffffffff8115a2d9
    torvalds#16 [ffff8810343bfc38] kmem_cache_alloc at ffffffff8115b09b
    torvalds#17 [ffff8810343bfc78] sk_prot_alloc at ffffffff81411808
    torvalds#18 [ffff8810343bfcb8] sk_alloc at ffffffff8141197c
    torvalds#19 [ffff8810343bfce8] inet_create at ffffffff81483ba6
    torvalds#20 [ffff8810343bfd38] __sock_create at ffffffff8140b4a7
    torvalds#21 [ffff8810343bfd98] xs_create_sock at ffffffffa01f649b [sunrpc]
    torvalds#22 [ffff8810343bfdd8] xs_tcp_setup_socket at ffffffffa01f6965 [sunrpc]
    torvalds#23 [ffff8810343bfe38] worker_thread at ffffffff810887d0
    torvalds#24 [ffff8810343bfee8] kthread at ffffffff8108dd96
    torvalds#25 [ffff8810343bff48] kernel_thread at ffffffff8100c1ca

rpciod is trying to allocate memory for a new socket to talk to the
server. The VM ends up calling ->releasepage to get more memory, and it
tries to do a blocking commit. That commit can't succeed however without
a connected socket, so we deadlock.

Fix this by setting PF_FSTRANS on the workqueue task prior to doing the
socket allocation, and having nfs_release_page check for that flag when
deciding whether to do a commit call. Also, set PF_FSTRANS
unconditionally in rpc_async_schedule since that function can also do
allocations sometimes.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
hknkkn pushed a commit to hknkkn/linux-dynticks that referenced this pull request Oct 29, 2012
Printing the "start_ip" for every secondary cpu is very noisy on a large
system - and doesn't add any value. Drop this message.

Console log before:
Booting Node   0, Processors  #1
smpboot cpu 1: start_ip = 96000
 #2
smpboot cpu 2: start_ip = 96000
 #3
smpboot cpu 3: start_ip = 96000
 #4
smpboot cpu 4: start_ip = 96000
       ...
 torvalds#31
smpboot cpu 31: start_ip = 96000
Brought up 32 CPUs

Console log after:
Booting Node   0, Processors  #1 #2 #3 #4 #5 torvalds#6 torvalds#7 Ok.
Booting Node   1, Processors  torvalds#8 torvalds#9 torvalds#10 torvalds#11 torvalds#12 torvalds#13 torvalds#14 torvalds#15 Ok.
Booting Node   0, Processors  torvalds#16 torvalds#17 torvalds#18 torvalds#19 torvalds#20 torvalds#21 torvalds#22 torvalds#23 Ok.
Booting Node   1, Processors  torvalds#24 torvalds#25 torvalds#26 torvalds#27 torvalds#28 torvalds#29 torvalds#30 torvalds#31
Brought up 32 CPUs

Acked-by: Borislav Petkov <bp@amd64.org>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Link: http://lkml.kernel.org/r/4f452eb42507460426@agluck-desktop.sc.intel.com
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
koenkooi pushed a commit to koenkooi/linux that referenced this pull request Oct 31, 2012
…d reasons

commit 5cf02d0 upstream.

We've had some reports of a deadlock where rpciod ends up with a stack
trace like this:

    PID: 2507   TASK: ffff88103691ab40  CPU: 14  COMMAND: "rpciod/14"
     #0 [ffff8810343bf2f0] schedule at ffffffff814dabd9
     #1 [ffff8810343bf3b8] nfs_wait_bit_killable at ffffffffa038fc04 [nfs]
     #2 [ffff8810343bf3c8] __wait_on_bit at ffffffff814dbc2f
     #3 [ffff8810343bf418] out_of_line_wait_on_bit at ffffffff814dbcd8
     #4 [ffff8810343bf488] nfs_commit_inode at ffffffffa039e0c1 [nfs]
     #5 [ffff8810343bf4f8] nfs_release_page at ffffffffa038bef6 [nfs]
     #6 [ffff8810343bf528] try_to_release_page at ffffffff8110c670
     #7 [ffff8810343bf538] shrink_page_list.clone.0 at ffffffff81126271
     #8 [ffff8810343bf668] shrink_inactive_list at ffffffff81126638
     #9 [ffff8810343bf818] shrink_zone at ffffffff8112788f
    torvalds#10 [ffff8810343bf8c8] do_try_to_free_pages at ffffffff81127b1e
    torvalds#11 [ffff8810343bf958] try_to_free_pages at ffffffff8112812f
    torvalds#12 [ffff8810343bfa08] __alloc_pages_nodemask at ffffffff8111fdad
    torvalds#13 [ffff8810343bfb28] kmem_getpages at ffffffff81159942
    torvalds#14 [ffff8810343bfb58] fallback_alloc at ffffffff8115a55a
    torvalds#15 [ffff8810343bfbd8] ____cache_alloc_node at ffffffff8115a2d9
    torvalds#16 [ffff8810343bfc38] kmem_cache_alloc at ffffffff8115b09b
    torvalds#17 [ffff8810343bfc78] sk_prot_alloc at ffffffff81411808
    torvalds#18 [ffff8810343bfcb8] sk_alloc at ffffffff8141197c
    torvalds#19 [ffff8810343bfce8] inet_create at ffffffff81483ba6
    torvalds#20 [ffff8810343bfd38] __sock_create at ffffffff8140b4a7
    torvalds#21 [ffff8810343bfd98] xs_create_sock at ffffffffa01f649b [sunrpc]
    torvalds#22 [ffff8810343bfdd8] xs_tcp_setup_socket at ffffffffa01f6965 [sunrpc]
    torvalds#23 [ffff8810343bfe38] worker_thread at ffffffff810887d0
    torvalds#24 [ffff8810343bfee8] kthread at ffffffff8108dd96
    torvalds#25 [ffff8810343bff48] kernel_thread at ffffffff8100c1ca

rpciod is trying to allocate memory for a new socket to talk to the
server. The VM ends up calling ->releasepage to get more memory, and it
tries to do a blocking commit. That commit can't succeed however without
a connected socket, so we deadlock.

Fix this by setting PF_FSTRANS on the workqueue task prior to doing the
socket allocation, and having nfs_release_page check for that flag when
deciding whether to do a commit call. Also, set PF_FSTRANS
unconditionally in rpc_async_schedule since that function can also do
allocations sometimes.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
vineetgarc referenced this pull request in foss-for-synopsys-dwc-arc-processors/linux Oct 31, 2012
1. IPI was not going from cpu1 -> cpu0. Turns out that the IRQ mode
   programmed IDU_IRQ_MOD_TCPU_ALLRECP mismatched in value in kernel and
   ISS. kernel (per IDU specs) defines it as '3' whereas ISS believes it
   shd be 7 (which is clearly a BUG).

2. the IDU IRQs are hardwaired hence need to be setup as IRQF_PERCPU to
    disallow any irq-affinity changes.

Signed-off-by: Vineet Gupta <vgupta@synopsys.com>
koenkooi pushed a commit to koenkooi/linux that referenced this pull request Nov 14, 2012
…d reasons

commit 5cf02d0 upstream.

We've had some reports of a deadlock where rpciod ends up with a stack
trace like this:

    PID: 2507   TASK: ffff88103691ab40  CPU: 14  COMMAND: "rpciod/14"
     #0 [ffff8810343bf2f0] schedule at ffffffff814dabd9
     #1 [ffff8810343bf3b8] nfs_wait_bit_killable at ffffffffa038fc04 [nfs]
     #2 [ffff8810343bf3c8] __wait_on_bit at ffffffff814dbc2f
     #3 [ffff8810343bf418] out_of_line_wait_on_bit at ffffffff814dbcd8
     #4 [ffff8810343bf488] nfs_commit_inode at ffffffffa039e0c1 [nfs]
     #5 [ffff8810343bf4f8] nfs_release_page at ffffffffa038bef6 [nfs]
     #6 [ffff8810343bf528] try_to_release_page at ffffffff8110c670
     #7 [ffff8810343bf538] shrink_page_list.clone.0 at ffffffff81126271
     #8 [ffff8810343bf668] shrink_inactive_list at ffffffff81126638
     #9 [ffff8810343bf818] shrink_zone at ffffffff8112788f
    torvalds#10 [ffff8810343bf8c8] do_try_to_free_pages at ffffffff81127b1e
    torvalds#11 [ffff8810343bf958] try_to_free_pages at ffffffff8112812f
    torvalds#12 [ffff8810343bfa08] __alloc_pages_nodemask at ffffffff8111fdad
    torvalds#13 [ffff8810343bfb28] kmem_getpages at ffffffff81159942
    torvalds#14 [ffff8810343bfb58] fallback_alloc at ffffffff8115a55a
    torvalds#15 [ffff8810343bfbd8] ____cache_alloc_node at ffffffff8115a2d9
    torvalds#16 [ffff8810343bfc38] kmem_cache_alloc at ffffffff8115b09b
    torvalds#17 [ffff8810343bfc78] sk_prot_alloc at ffffffff81411808
    torvalds#18 [ffff8810343bfcb8] sk_alloc at ffffffff8141197c
    torvalds#19 [ffff8810343bfce8] inet_create at ffffffff81483ba6
    torvalds#20 [ffff8810343bfd38] __sock_create at ffffffff8140b4a7
    torvalds#21 [ffff8810343bfd98] xs_create_sock at ffffffffa01f649b [sunrpc]
    torvalds#22 [ffff8810343bfdd8] xs_tcp_setup_socket at ffffffffa01f6965 [sunrpc]
    torvalds#23 [ffff8810343bfe38] worker_thread at ffffffff810887d0
    torvalds#24 [ffff8810343bfee8] kthread at ffffffff8108dd96
    torvalds#25 [ffff8810343bff48] kernel_thread at ffffffff8100c1ca

rpciod is trying to allocate memory for a new socket to talk to the
server. The VM ends up calling ->releasepage to get more memory, and it
tries to do a blocking commit. That commit can't succeed however without
a connected socket, so we deadlock.

Fix this by setting PF_FSTRANS on the workqueue task prior to doing the
socket allocation, and having nfs_release_page check for that flag when
deciding whether to do a commit call. Also, set PF_FSTRANS
unconditionally in rpc_async_schedule since that function can also do
allocations sometimes.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
kees pushed a commit to kees/linux that referenced this pull request Nov 16, 2012
…d reasons

BugLink: http://bugs.launchpad.net/bugs/1035435

commit 5cf02d0 upstream.

We've had some reports of a deadlock where rpciod ends up with a stack
trace like this:

    PID: 2507   TASK: ffff88103691ab40  CPU: 14  COMMAND: "rpciod/14"
     #0 [ffff8810343bf2f0] schedule at ffffffff814dabd9
     #1 [ffff8810343bf3b8] nfs_wait_bit_killable at ffffffffa038fc04 [nfs]
     #2 [ffff8810343bf3c8] __wait_on_bit at ffffffff814dbc2f
     #3 [ffff8810343bf418] out_of_line_wait_on_bit at ffffffff814dbcd8
     #4 [ffff8810343bf488] nfs_commit_inode at ffffffffa039e0c1 [nfs]
     #5 [ffff8810343bf4f8] nfs_release_page at ffffffffa038bef6 [nfs]
     torvalds#6 [ffff8810343bf528] try_to_release_page at ffffffff8110c670
     torvalds#7 [ffff8810343bf538] shrink_page_list.clone.0 at ffffffff81126271
     torvalds#8 [ffff8810343bf668] shrink_inactive_list at ffffffff81126638
     torvalds#9 [ffff8810343bf818] shrink_zone at ffffffff8112788f
    torvalds#10 [ffff8810343bf8c8] do_try_to_free_pages at ffffffff81127b1e
    torvalds#11 [ffff8810343bf958] try_to_free_pages at ffffffff8112812f
    torvalds#12 [ffff8810343bfa08] __alloc_pages_nodemask at ffffffff8111fdad
    torvalds#13 [ffff8810343bfb28] kmem_getpages at ffffffff81159942
    torvalds#14 [ffff8810343bfb58] fallback_alloc at ffffffff8115a55a
    torvalds#15 [ffff8810343bfbd8] ____cache_alloc_node at ffffffff8115a2d9
    torvalds#16 [ffff8810343bfc38] kmem_cache_alloc at ffffffff8115b09b
    torvalds#17 [ffff8810343bfc78] sk_prot_alloc at ffffffff81411808
    torvalds#18 [ffff8810343bfcb8] sk_alloc at ffffffff8141197c
    torvalds#19 [ffff8810343bfce8] inet_create at ffffffff81483ba6
    torvalds#20 [ffff8810343bfd38] __sock_create at ffffffff8140b4a7
    torvalds#21 [ffff8810343bfd98] xs_create_sock at ffffffffa01f649b [sunrpc]
    torvalds#22 [ffff8810343bfdd8] xs_tcp_setup_socket at ffffffffa01f6965 [sunrpc]
    torvalds#23 [ffff8810343bfe38] worker_thread at ffffffff810887d0
    torvalds#24 [ffff8810343bfee8] kthread at ffffffff8108dd96
    torvalds#25 [ffff8810343bff48] kernel_thread at ffffffff8100c1ca

rpciod is trying to allocate memory for a new socket to talk to the
server. The VM ends up calling ->releasepage to get more memory, and it
tries to do a blocking commit. That commit can't succeed however without
a connected socket, so we deadlock.

Fix this by setting PF_FSTRANS on the workqueue task prior to doing the
socket allocation, and having nfs_release_page check for that flag when
deciding whether to do a commit call. Also, set PF_FSTRANS
unconditionally in rpc_async_schedule since that function can also do
allocations sometimes.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Signed-off-by: Herton Ronaldo Krzesinski <herton.krzesinski@canonical.com>
koenkooi pushed a commit to koenkooi/linux that referenced this pull request Nov 21, 2012
…d reasons

commit 5cf02d0 upstream.

We've had some reports of a deadlock where rpciod ends up with a stack
trace like this:

    PID: 2507   TASK: ffff88103691ab40  CPU: 14  COMMAND: "rpciod/14"
     #0 [ffff8810343bf2f0] schedule at ffffffff814dabd9
     #1 [ffff8810343bf3b8] nfs_wait_bit_killable at ffffffffa038fc04 [nfs]
     #2 [ffff8810343bf3c8] __wait_on_bit at ffffffff814dbc2f
     #3 [ffff8810343bf418] out_of_line_wait_on_bit at ffffffff814dbcd8
     #4 [ffff8810343bf488] nfs_commit_inode at ffffffffa039e0c1 [nfs]
     #5 [ffff8810343bf4f8] nfs_release_page at ffffffffa038bef6 [nfs]
     #6 [ffff8810343bf528] try_to_release_page at ffffffff8110c670
     #7 [ffff8810343bf538] shrink_page_list.clone.0 at ffffffff81126271
     #8 [ffff8810343bf668] shrink_inactive_list at ffffffff81126638
     #9 [ffff8810343bf818] shrink_zone at ffffffff8112788f
    torvalds#10 [ffff8810343bf8c8] do_try_to_free_pages at ffffffff81127b1e
    torvalds#11 [ffff8810343bf958] try_to_free_pages at ffffffff8112812f
    torvalds#12 [ffff8810343bfa08] __alloc_pages_nodemask at ffffffff8111fdad
    torvalds#13 [ffff8810343bfb28] kmem_getpages at ffffffff81159942
    torvalds#14 [ffff8810343bfb58] fallback_alloc at ffffffff8115a55a
    torvalds#15 [ffff8810343bfbd8] ____cache_alloc_node at ffffffff8115a2d9
    torvalds#16 [ffff8810343bfc38] kmem_cache_alloc at ffffffff8115b09b
    torvalds#17 [ffff8810343bfc78] sk_prot_alloc at ffffffff81411808
    torvalds#18 [ffff8810343bfcb8] sk_alloc at ffffffff8141197c
    torvalds#19 [ffff8810343bfce8] inet_create at ffffffff81483ba6
    torvalds#20 [ffff8810343bfd38] __sock_create at ffffffff8140b4a7
    torvalds#21 [ffff8810343bfd98] xs_create_sock at ffffffffa01f649b [sunrpc]
    torvalds#22 [ffff8810343bfdd8] xs_tcp_setup_socket at ffffffffa01f6965 [sunrpc]
    torvalds#23 [ffff8810343bfe38] worker_thread at ffffffff810887d0
    torvalds#24 [ffff8810343bfee8] kthread at ffffffff8108dd96
    torvalds#25 [ffff8810343bff48] kernel_thread at ffffffff8100c1ca

rpciod is trying to allocate memory for a new socket to talk to the
server. The VM ends up calling ->releasepage to get more memory, and it
tries to do a blocking commit. That commit can't succeed however without
a connected socket, so we deadlock.

Fix this by setting PF_FSTRANS on the workqueue task prior to doing the
socket allocation, and having nfs_release_page check for that flag when
deciding whether to do a commit call. Also, set PF_FSTRANS
unconditionally in rpc_async_schedule since that function can also do
allocations sometimes.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
vineetgarc referenced this pull request in foss-for-synopsys-dwc-arc-processors/linux Dec 31, 2012
1. IPI was not going from cpu1 -> cpu0. Turns out that the IRQ mode
   programmed IDU_IRQ_MOD_TCPU_ALLRECP mismatched in value in kernel and
   ISS. kernel (per IDU specs) defines it as '3' whereas ISS believes it
   shd be 7 (which is clearly a BUG).

2. the IDU IRQs are hardwaired hence need to be setup as IRQF_PERCPU to
    disallow any irq-affinity changes.

Signed-off-by: Vineet Gupta <vgupta@synopsys.com>
cianmcgovern pushed a commit to cianmcgovern/linux that referenced this pull request Mar 10, 2013
…d reasons

commit 5cf02d0 upstream.

We've had some reports of a deadlock where rpciod ends up with a stack
trace like this:

    PID: 2507   TASK: ffff88103691ab40  CPU: 14  COMMAND: "rpciod/14"
     #0 [ffff8810343bf2f0] schedule at ffffffff814dabd9
     #1 [ffff8810343bf3b8] nfs_wait_bit_killable at ffffffffa038fc04 [nfs]
     #2 [ffff8810343bf3c8] __wait_on_bit at ffffffff814dbc2f
     #3 [ffff8810343bf418] out_of_line_wait_on_bit at ffffffff814dbcd8
     #4 [ffff8810343bf488] nfs_commit_inode at ffffffffa039e0c1 [nfs]
     #5 [ffff8810343bf4f8] nfs_release_page at ffffffffa038bef6 [nfs]
     torvalds#6 [ffff8810343bf528] try_to_release_page at ffffffff8110c670
     torvalds#7 [ffff8810343bf538] shrink_page_list.clone.0 at ffffffff81126271
     torvalds#8 [ffff8810343bf668] shrink_inactive_list at ffffffff81126638
     torvalds#9 [ffff8810343bf818] shrink_zone at ffffffff8112788f
    torvalds#10 [ffff8810343bf8c8] do_try_to_free_pages at ffffffff81127b1e
    torvalds#11 [ffff8810343bf958] try_to_free_pages at ffffffff8112812f
    torvalds#12 [ffff8810343bfa08] __alloc_pages_nodemask at ffffffff8111fdad
    torvalds#13 [ffff8810343bfb28] kmem_getpages at ffffffff81159942
    torvalds#14 [ffff8810343bfb58] fallback_alloc at ffffffff8115a55a
    torvalds#15 [ffff8810343bfbd8] ____cache_alloc_node at ffffffff8115a2d9
    torvalds#16 [ffff8810343bfc38] kmem_cache_alloc at ffffffff8115b09b
    torvalds#17 [ffff8810343bfc78] sk_prot_alloc at ffffffff81411808
    torvalds#18 [ffff8810343bfcb8] sk_alloc at ffffffff8141197c
    torvalds#19 [ffff8810343bfce8] inet_create at ffffffff81483ba6
    torvalds#20 [ffff8810343bfd38] __sock_create at ffffffff8140b4a7
    torvalds#21 [ffff8810343bfd98] xs_create_sock at ffffffffa01f649b [sunrpc]
    torvalds#22 [ffff8810343bfdd8] xs_tcp_setup_socket at ffffffffa01f6965 [sunrpc]
    torvalds#23 [ffff8810343bfe38] worker_thread at ffffffff810887d0
    torvalds#24 [ffff8810343bfee8] kthread at ffffffff8108dd96
    torvalds#25 [ffff8810343bff48] kernel_thread at ffffffff8100c1ca

rpciod is trying to allocate memory for a new socket to talk to the
server. The VM ends up calling ->releasepage to get more memory, and it
tries to do a blocking commit. That commit can't succeed however without
a connected socket, so we deadlock.

Fix this by setting PF_FSTRANS on the workqueue task prior to doing the
socket allocation, and having nfs_release_page check for that flag when
deciding whether to do a commit call. Also, set PF_FSTRANS
unconditionally in rpc_async_schedule since that function can also do
allocations sometimes.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
@torvalds torvalds merged commit 9045893 into torvalds:master Jul 2, 2013
torvalds pushed a commit that referenced this pull request Jul 10, 2013
Several people reported the warning: "kernel BUG at kernel/timer.c:729!"
and the stack trace is:

	#7 [ffff880214d25c10] mod_timer+501 at ffffffff8106d905
	#8 [ffff880214d25c50] br_multicast_del_pg.isra.20+261 at ffffffffa0731d25 [bridge]
	#9 [ffff880214d25c80] br_multicast_disable_port+88 at ffffffffa0732948 [bridge]
	#10 [ffff880214d25cb0] br_stp_disable_port+154 at ffffffffa072bcca [bridge]
	#11 [ffff880214d25ce8] br_device_event+520 at ffffffffa072a4e8 [bridge]
	#12 [ffff880214d25d18] notifier_call_chain+76 at ffffffff8164aafc
	#13 [ffff880214d25d50] raw_notifier_call_chain+22 at ffffffff810858f6
	#14 [ffff880214d25d60] call_netdevice_notifiers+45 at ffffffff81536aad
	#15 [ffff880214d25d80] dev_close_many+183 at ffffffff81536d17
	#16 [ffff880214d25dc0] rollback_registered_many+168 at ffffffff81537f68
	#17 [ffff880214d25de8] rollback_registered+49 at ffffffff81538101
	#18 [ffff880214d25e10] unregister_netdevice_queue+72 at ffffffff815390d8
	#19 [ffff880214d25e30] __tun_detach+272 at ffffffffa074c2f0 [tun]
	#20 [ffff880214d25e88] tun_chr_close+45 at ffffffffa074c4bd [tun]
	#21 [ffff880214d25ea8] __fput+225 at ffffffff8119b1f1
	#22 [ffff880214d25ef0] ____fput+14 at ffffffff8119b3fe
	#23 [ffff880214d25f00] task_work_run+159 at ffffffff8107cf7f
	#24 [ffff880214d25f30] do_notify_resume+97 at ffffffff810139e1
	#25 [ffff880214d25f50] int_signal+18 at ffffffff8164f292

this is due to I forgot to check if mp->timer is armed in
br_multicast_del_pg(). This bug is introduced by
commit 9f00b2e (bridge: only expire the mdb entry
when query is received).

Same for __br_mdb_del().

Tested-by: poma <pomidorabelisima@gmail.com>
Reported-by: LiYonghua <809674045@qq.com>
Reported-by: Robert Hancock <hancockrwd@gmail.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Stephen Hemminger <stephen@networkplumber.org>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Cong Wang <amwang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
swarren pushed a commit to swarren/linux-tegra that referenced this pull request Sep 11, 2013
When booting secondary CPUs, announce_cpu() is called to show which cpu has
been brought up. For example:

[    0.402751] smpboot: Booting Node   0, Processors  #1 #2 #3 #4 #5 OK
[    0.525667] smpboot: Booting Node   1, Processors  torvalds#6 torvalds#7 torvalds#8 torvalds#9 torvalds#10 torvalds#11 OK
[    0.755592] smpboot: Booting Node   0, Processors  torvalds#12 torvalds#13 torvalds#14 torvalds#15 torvalds#16 torvalds#17 OK
[    0.890495] smpboot: Booting Node   1, Processors  torvalds#18 torvalds#19 torvalds#20 torvalds#21 torvalds#22 torvalds#23

But the last "OK" is lost, because 'nr_cpu_ids-1' represents the maximum
possible cpu id. It should use the maximum present cpu id in case not all
CPUs booted up.

Signed-off-by: Libin <huawei.libin@huawei.com>
Cc: <guohanjun@huawei.com>
Cc: <wangyijing@huawei.com>
Cc: <fenghua.yu@intel.com>
Cc: <paul.gortmaker@windriver.com>
Link: http://lkml.kernel.org/r/1378378676-18276-1-git-send-email-huawei.libin@huawei.com
[ tweaked the changelog, removed unnecessary line break, tweaked the format to align the fields vertically. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
swarren pushed a commit to swarren/linux-tegra that referenced this pull request Oct 1, 2013
If memory allocation of in pcpu_embed_first_chunk() fails, the
allocated memory is not released correctly. In the release loop also
the non-allocated elements are released which leads to the following
kernel BUG on systems with very little memory:

[    0.000000] kernel BUG at mm/bootmem.c:307!
[    0.000000] illegal operation: 0001 [#1] PREEMPT SMP DEBUG_PAGEALLOC
[    0.000000] Modules linked in:
[    0.000000] CPU: 0 PID: 0 Comm: swapper Not tainted 3.10.0 torvalds#22
[    0.000000] task: 0000000000a20ae0 ti: 0000000000a08000 task.ti: 0000000000a08000
[    0.000000] Krnl PSW : 0400000180000000 0000000000abda7a (__free+0x116/0x154)
[    0.000000]            R:0 T:1 IO:0 EX:0 Key:0 M:0 W:0 P:0 AS:0 CC:0 PM:0 EA:3
...
[    0.000000]  [<0000000000abdce2>] mark_bootmem_node+0xde/0xf0
[    0.000000]  [<0000000000abdd9c>] mark_bootmem+0xa8/0x118
[    0.000000]  [<0000000000abcbba>] pcpu_embed_first_chunk+0xe7a/0xf0c
[    0.000000]  [<0000000000abcc96>] setup_per_cpu_areas+0x4a/0x28c

To fix the problem now only allocated elements are released. This then
leads to the correct kernel panic:

[    0.000000] Kernel panic - not syncing: Failed to initialize percpu areas.
...
[    0.000000] Call Trace:
[    0.000000] ([<000000000011307e>] show_trace+0x132/0x150)
[    0.000000]  [<0000000000113160>] show_stack+0xc4/0xd4
[    0.000000]  [<00000000007127dc>] dump_stack+0x74/0xd8
[    0.000000]  [<00000000007123fe>] panic+0xea/0x264
[    0.000000]  [<0000000000b14814>] setup_per_cpu_areas+0x5c/0x28c

tj: Flipped if conditional so that it doesn't need "continue".

Signed-off-by: Michael Holzheu <holzheu@linux.vnet.ibm.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
swarren pushed a commit to swarren/linux-tegra that referenced this pull request Oct 14, 2013
As the new x86 CPU bootup printout format code maintainer, I am
taking immediate action to improve and clean (and thus indulge
my OCD) the reporting of the cores when coming up online.

Fix padding to a right-hand alignment, cleanup code and bind
reporting width to the max number of supported CPUs on the
system, like this:

 [    0.074509] smpboot: Booting Node   0, Processors:      #1  #2  #3  #4  #5  torvalds#6  torvalds#7 OK
 [    0.644008] smpboot: Booting Node   1, Processors:  torvalds#8  torvalds#9 torvalds#10 torvalds#11 torvalds#12 torvalds#13 torvalds#14 torvalds#15 OK
 [    1.245006] smpboot: Booting Node   2, Processors: torvalds#16 torvalds#17 torvalds#18 torvalds#19 torvalds#20 torvalds#21 torvalds#22 torvalds#23 OK
 [    1.864005] smpboot: Booting Node   3, Processors: torvalds#24 torvalds#25 torvalds#26 torvalds#27 torvalds#28 torvalds#29 torvalds#30 torvalds#31 OK
 [    2.489005] smpboot: Booting Node   4, Processors: torvalds#32 torvalds#33 torvalds#34 torvalds#35 torvalds#36 torvalds#37 torvalds#38 torvalds#39 OK
 [    3.093005] smpboot: Booting Node   5, Processors: torvalds#40 torvalds#41 torvalds#42 torvalds#43 torvalds#44 torvalds#45 torvalds#46 torvalds#47 OK
 [    3.698005] smpboot: Booting Node   6, Processors: torvalds#48 torvalds#49 torvalds#50 torvalds#51 #52 #53 torvalds#54 torvalds#55 OK
 [    4.304005] smpboot: Booting Node   7, Processors: torvalds#56 torvalds#57 #58 torvalds#59 torvalds#60 torvalds#61 torvalds#62 torvalds#63 OK
 [    4.961413] Brought up 64 CPUs

and this:

 [    0.072367] smpboot: Booting Node   0, Processors:    #1 #2 #3 #4 #5 torvalds#6 torvalds#7 OK
 [    0.686329] Brought up 8 CPUs

Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Libin <huawei.libin@huawei.com>
Cc: wangyijing@huawei.com
Cc: fenghua.yu@intel.com
Cc: guohanjun@huawei.com
Cc: paul.gortmaker@windriver.com
Link: http://lkml.kernel.org/r/20130927143554.GF4422@pd.tnic
Signed-off-by: Ingo Molnar <mingo@kernel.org>
swarren pushed a commit to swarren/linux-tegra that referenced this pull request Oct 14, 2013
Turn it into (for example):

[    0.073380] x86: Booting SMP configuration:
[    0.074005] .... node   #0, CPUs:          #1   #2   #3   #4   #5   torvalds#6   torvalds#7
[    0.603005] .... node   #1, CPUs:     torvalds#8   torvalds#9  torvalds#10  torvalds#11  torvalds#12  torvalds#13  torvalds#14  torvalds#15
[    1.200005] .... node   #2, CPUs:    torvalds#16  torvalds#17  torvalds#18  torvalds#19  torvalds#20  torvalds#21  torvalds#22  torvalds#23
[    1.796005] .... node   #3, CPUs:    torvalds#24  torvalds#25  torvalds#26  torvalds#27  torvalds#28  torvalds#29  torvalds#30  torvalds#31
[    2.393005] .... node   #4, CPUs:    torvalds#32  torvalds#33  torvalds#34  torvalds#35  torvalds#36  torvalds#37  torvalds#38  torvalds#39
[    2.996005] .... node   #5, CPUs:    torvalds#40  torvalds#41  torvalds#42  torvalds#43  torvalds#44  torvalds#45  torvalds#46  torvalds#47
[    3.600005] .... node   torvalds#6, CPUs:    torvalds#48  torvalds#49  torvalds#50  torvalds#51  #52  #53  torvalds#54  torvalds#55
[    4.202005] .... node   torvalds#7, CPUs:    torvalds#56  torvalds#57  #58  torvalds#59  torvalds#60  torvalds#61  torvalds#62  torvalds#63
[    4.811005] .... node   torvalds#8, CPUs:    torvalds#64  torvalds#65  torvalds#66  torvalds#67  torvalds#68  torvalds#69  #70  torvalds#71
[    5.421006] .... node   torvalds#9, CPUs:    torvalds#72  torvalds#73  torvalds#74  torvalds#75  torvalds#76  torvalds#77  torvalds#78  torvalds#79
[    6.032005] .... node  torvalds#10, CPUs:    torvalds#80  torvalds#81  torvalds#82  torvalds#83  torvalds#84  torvalds#85  torvalds#86  torvalds#87
[    6.648006] .... node  torvalds#11, CPUs:    torvalds#88  torvalds#89  torvalds#90  torvalds#91  torvalds#92  torvalds#93  torvalds#94  torvalds#95
[    7.262005] .... node  torvalds#12, CPUs:    torvalds#96  torvalds#97  torvalds#98  torvalds#99 torvalds#100 torvalds#101 torvalds#102 torvalds#103
[    7.865005] .... node  torvalds#13, CPUs:   torvalds#104 torvalds#105 torvalds#106 torvalds#107 torvalds#108 torvalds#109 torvalds#110 torvalds#111
[    8.466005] .... node  torvalds#14, CPUs:   torvalds#112 torvalds#113 torvalds#114 torvalds#115 torvalds#116 torvalds#117 torvalds#118 torvalds#119
[    9.073006] .... node  torvalds#15, CPUs:   torvalds#120 torvalds#121 torvalds#122 torvalds#123 torvalds#124 torvalds#125 torvalds#126 torvalds#127
[    9.679901] x86: Booted up 16 nodes, 128 CPUs

and drop useless elements.

Change num_digits() to hpa's division-avoiding, cell-phone-typed
version which he went at great lengths and pains to submit on a
Saturday evening.

Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: huawei.libin@huawei.com
Cc: wangyijing@huawei.com
Cc: fenghua.yu@intel.com
Cc: guohanjun@huawei.com
Cc: paul.gortmaker@windriver.com
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20130930095624.GB16383@pd.tnic
Signed-off-by: Ingo Molnar <mingo@kernel.org>
torvalds pushed a commit that referenced this pull request Dec 2, 2013
…culation

Currently mx53 (CortexA8) running at 1GHz reports:
Calibrating delay loop... 663.55 BogoMIPS (lpj=3317760)

Tom Evans verified that alignments of 0x0 and 0x8 run the two instructions of __loop_delay in one clock cycle (1 clock/loop), while alignments of 0x4 and 0xc take 3 clocks to run the loop twice. (1.5 clock/loop)

The original object code looks like this:

00000010 <__loop_const_udelay>:
  10:	e3e01000 	mvn	r1, #0
  14:	e51f201c 	ldr	r2, [pc, #-28]	; 0 <__loop_udelay-0x8>
  18:	e5922000 	ldr	r2, [r2]
  1c:	e0800921 	add	r0, r0, r1, lsr #18
  20:	e1a00720 	lsr	r0, r0, #14
  24:	e0822b21 	add	r2, r2, r1, lsr #22
  28:	e1a02522 	lsr	r2, r2, #10
  2c:	e0000092 	mul	r0, r2, r0
  30:	e0800d21 	add	r0, r0, r1, lsr #26
  34:	e1b00320 	lsrs	r0, r0, #6
  38:	01a0f00e 	moveq	pc, lr

0000003c <__loop_delay>:
  3c:	e2500001 	subs	r0, r0, #1
  40:	8afffffe 	bhi	3c <__loop_delay>
  44:	e1a0f00e 	mov	pc, lr

After adding the 'align 3' directive to __loop_delay (align to 8 bytes):

00000010 <__loop_const_udelay>:
  10:	e3e01000 	mvn	r1, #0
  14:	e51f201c 	ldr	r2, [pc, #-28]	; 0 <__loop_udelay-0x8>
  18:	e5922000 	ldr	r2, [r2]
  1c:	e0800921 	add	r0, r0, r1, lsr #18
  20:	e1a00720 	lsr	r0, r0, #14
  24:	e0822b21 	add	r2, r2, r1, lsr #22
  28:	e1a02522 	lsr	r2, r2, #10
  2c:	e0000092 	mul	r0, r2, r0
  30:	e0800d21 	add	r0, r0, r1, lsr #26
  34:	e1b00320 	lsrs	r0, r0, #6
  38:	01a0f00e 	moveq	pc, lr
  3c:	e320f000 	nop	{0}

00000040 <__loop_delay>:
  40:	e2500001 	subs	r0, r0, #1
  44:	8afffffe 	bhi	40 <__loop_delay>
  48:	e1a0f00e 	mov	pc, lr
  4c:	e320f000 	nop	{0}

, which now reports:
Calibrating delay loop... 996.14 BogoMIPS (lpj=4980736)

Some more test results:

On mx31 (ARM1136) running at 532 MHz, before the patch:
Calibrating delay loop... 351.43 BogoMIPS (lpj=1757184)

On mx31 (ARM1136) running at 532 MHz after the patch:
Calibrating delay loop... 528.79 BogoMIPS (lpj=2643968)

Also tested on mx6 (CortexA9) and on mx27 (ARM926), which shows the same
BogoMIPS value before and after this patch.

Reported-by: Tom Evans <tom_usenet@optusnet.com.au>
Suggested-by: Tom Evans <tom_usenet@optusnet.com.au>
Signed-off-by: Fabio Estevam <fabio.estevam@freescale.com>
Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
gregnietsky pushed a commit to Distrotech/linux that referenced this pull request Apr 9, 2014
commit 2cab86b upstream.

Sometime the ASCONF_ACK parameters can equal to the fourfold of
ASCONF parameters, this only happend in some special case:

  ASCONF parameter is :
    Unrecognized Parameter (4 bytes)
  ASCONF_ACK parameter should be:
    Error Cause Indication parameter (8 bytes header)
     + Error Cause (4 bytes header)
       + Unrecognized Parameter (4bytes)

Four 4bytes Unrecognized Parameters in ASCONF chunk will cause panic.

Pid: 0, comm: swapper Not tainted 2.6.38-next+ torvalds#22 Bochs Bochs
EIP: 0060:[<c0717eae>] EFLAGS: 00010246 CPU: 0
EIP is at skb_put+0x60/0x70
EAX: 00000077 EBX: c09060e2 ECX: dec1dc30 EDX: c09469c0
ESI: 00000000 EDI: de3c8d4 EBP: dec1dc58 ESP: dec1dc2c
 DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
Process swapper (pid: 0, ti=dec1c000 task=c09aef20 task.ti=c0980000)
Stack:
 c09469c0 e1894fa4 00000044 00000004 de3c8d00 de3c8d00 de3c8d44 de3c8d4
 c09060e2 de25dd80 de3c8d4 dec1dc7c e1894fa4 dec1dcb0 00000040 00000004
 00000000 00000800 00000004 00000004 dec1dce0 e1895a2b dec1dcb4 de25d960
Call Trace:
 [<e1894fa4>] ? sctp_addto_chunk+0x4e/0x89 [sctp]
 [<e1894fa4>] sctp_addto_chunk+0x4e/0x89 [sctp]
 [<e1895a2b>] sctp_process_asconf+0x32f/0x3d1 [sctp]
 [<e188d554>] sctp_sf_do_asconf+0xf8/0x173 [sctp]
 [<e1890b02>] sctp_do_sm+0xb8/0x159 [sctp]
 [<e18a2248>] ? sctp_cname+0x0/0x52 [sctp]
 [<e189392d>] sctp_assoc_bh_rcv+0xac/0xe3 [sctp]
 [<e1897d76>] sctp_inq_push+0x2d/0x30 [sctp]
 [<e18a21b2>] sctp_rcv+0x7a7/0x83d [sctp]
 [<c077a95c>] ? ipv4_confirm+0x118/0x125
 [<c073a970>] ? nf_iterate+0x34/0x62
 [<c074789d>] ? ip_local_deliver_finish+0x0/0x194
 [<c074789d>] ? ip_local_deliver_finish+0x0/0x194
 [<c0747992>] ip_local_deliver_finish+0xf5/0x194
 [<c074789d>] ? ip_local_deliver_finish+0x0/0x194
 [<c0747a6e>] NF_HOOK.clone.1+0x3d/0x44
 [<c0747ab3>] ip_local_deliver+0x3e/0x44
 [<c074789d>] ? ip_local_deliver_finish+0x0/0x194
 [<c074775c>] ip_rcv_finish+0x29f/0x2c7
 [<c07474bd>] ? ip_rcv_finish+0x0/0x2c7
 [<c0747a6e>] NF_HOOK.clone.1+0x3d/0x44
 [<c0747cae>] ip_rcv+0x1f5/0x233
 [<c07474bd>] ? ip_rcv_finish+0x0/0x2c7
 [<c071dce3>] __netif_receive_skb+0x310/0x336
 [<c07221f3>] netif_receive_skb+0x4b/0x51
 [<e0a4ed3d>] cp_rx_poll+0x1e7/0x29c [8139cp]
 [<c072275e>] net_rx_action+0x65/0x13a
 [<c0445a54>] __do_softirq+0xa1/0x149
 [<c04459b3>] ? __do_softirq+0x0/0x149
 <IRQ>
 [<c0445891>] ? irq_exit+0x37/0x72
 [<c040a7e9>] ? do_IRQ+0x81/0x95
 [<c07b3670>] ? common_interrupt+0x30/0x38
 [<c0428058>] ? native_safe_halt+0xa/0xc
 [<c040f5d7>] ? default_idle+0x58/0x92
 [<c0408fb0>] ? cpu_idle+0x96/0xb2
 [<c0797989>] ? rest_init+0x5d/0x5f
 [<c09fd90c>] ? start_kernel+0x34b/0x350
 [<c09fd0cb>] ? i386_start_kernel+0xba/0xc1

Signed-off-by: Wei Yongjun <yjwei@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
ystk pushed a commit to ystk/linux-ltsi-work that referenced this pull request Apr 29, 2014
Mike Galbraith captered the following:
| >torvalds#11 [ffff88017b243e90] _raw_spin_lock at ffffffff815d2596
| >torvalds#12 [ffff88017b243e90] rt_mutex_trylock at ffffffff815d15be
| >torvalds#13 [ffff88017b243eb0] get_next_timer_interrupt at ffffffff81063b42
| >torvalds#14 [ffff88017b243f00] tick_nohz_stop_sched_tick at ffffffff810bd1fd
| >torvalds#15 [ffff88017b243f70] tick_nohz_irq_exit at ffffffff810bd7d2
| >torvalds#16 [ffff88017b243f90] irq_exit at ffffffff8105b02d
| >torvalds#17 [ffff88017b243fb0] reschedule_interrupt at ffffffff815db3dd
| >--- <IRQ stack> ---
| >torvalds#18 [ffff88017a2a9bc8] reschedule_interrupt at ffffffff815db3dd
| >    [exception RIP: task_blocks_on_rt_mutex+51]
| >torvalds#19 [ffff88017a2a9ce0] rt_spin_lock_slowlock at ffffffff815d183c
| >torvalds#20 [ffff88017a2a9da0] lock_timer_base.isra.35 at ffffffff81061cbf
| >torvalds#21 [ffff88017a2a9dd0] schedule_timeout at ffffffff815cf1ce
| >torvalds#22 [ffff88017a2a9e50] rcu_gp_kthread at ffffffff810f9bbb
| >torvalds#23 [ffff88017a2a9ed0] kthread at ffffffff810796d5
| >torvalds#24 [ffff88017a2a9f50] ret_from_fork at ffffffff815da04c

lock_timer_base() does a try_lock() which deadlocks on the waiter lock
not the lock itself.
This patch takes the waiter_lock with trylock so it should work from interrupt
context as well. If the fastpath doesn't work and the waiter_lock itself is
taken then it seems that the lock itself taken.
This patch also adds a "rt_spin_try_unlock" to keep lockdep happy. If we
managed to take the wait_lock in the first place we should also be able
to take it in the unlock path.

Cc: stable-rt@vger.kernel.org
Reported-by: Mike Galbraith <bitbucket@online.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
ystk pushed a commit to ystk/linux-ltsi-work that referenced this pull request May 23, 2014
Mike Galbraith captered the following:
| >torvalds#11 [ffff88017b243e90] _raw_spin_lock at ffffffff815d2596
| >torvalds#12 [ffff88017b243e90] rt_mutex_trylock at ffffffff815d15be
| >torvalds#13 [ffff88017b243eb0] get_next_timer_interrupt at ffffffff81063b42
| >torvalds#14 [ffff88017b243f00] tick_nohz_stop_sched_tick at ffffffff810bd1fd
| >torvalds#15 [ffff88017b243f70] tick_nohz_irq_exit at ffffffff810bd7d2
| >torvalds#16 [ffff88017b243f90] irq_exit at ffffffff8105b02d
| >torvalds#17 [ffff88017b243fb0] reschedule_interrupt at ffffffff815db3dd
| >--- <IRQ stack> ---
| >torvalds#18 [ffff88017a2a9bc8] reschedule_interrupt at ffffffff815db3dd
| >    [exception RIP: task_blocks_on_rt_mutex+51]
| >torvalds#19 [ffff88017a2a9ce0] rt_spin_lock_slowlock at ffffffff815d183c
| >torvalds#20 [ffff88017a2a9da0] lock_timer_base.isra.35 at ffffffff81061cbf
| >torvalds#21 [ffff88017a2a9dd0] schedule_timeout at ffffffff815cf1ce
| >torvalds#22 [ffff88017a2a9e50] rcu_gp_kthread at ffffffff810f9bbb
| >torvalds#23 [ffff88017a2a9ed0] kthread at ffffffff810796d5
| >torvalds#24 [ffff88017a2a9f50] ret_from_fork at ffffffff815da04c

lock_timer_base() does a try_lock() which deadlocks on the waiter lock
not the lock itself.
This patch takes the waiter_lock with trylock so it should work from interrupt
context as well. If the fastpath doesn't work and the waiter_lock itself is
taken then it seems that the lock itself taken.
This patch also adds a "rt_spin_try_unlock" to keep lockdep happy. If we
managed to take the wait_lock in the first place we should also be able
to take it in the unlock path.

Cc: stable-rt@vger.kernel.org
Reported-by: Mike Galbraith <bitbucket@online.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
damentz referenced this pull request in zen-kernel/zen-kernel May 29, 2014
Mike Galbraith captered the following:
| >#11 [ffff88017b243e90] _raw_spin_lock at ffffffff815d2596
| >#12 [ffff88017b243e90] rt_mutex_trylock at ffffffff815d15be
| >#13 [ffff88017b243eb0] get_next_timer_interrupt at ffffffff81063b42
| >#14 [ffff88017b243f00] tick_nohz_stop_sched_tick at ffffffff810bd1fd
| >#15 [ffff88017b243f70] tick_nohz_irq_exit at ffffffff810bd7d2
| >#16 [ffff88017b243f90] irq_exit at ffffffff8105b02d
| >#17 [ffff88017b243fb0] reschedule_interrupt at ffffffff815db3dd
| >--- <IRQ stack> ---
| >#18 [ffff88017a2a9bc8] reschedule_interrupt at ffffffff815db3dd
| >    [exception RIP: task_blocks_on_rt_mutex+51]
| >#19 [ffff88017a2a9ce0] rt_spin_lock_slowlock at ffffffff815d183c
| >#20 [ffff88017a2a9da0] lock_timer_base.isra.35 at ffffffff81061cbf
| >#21 [ffff88017a2a9dd0] schedule_timeout at ffffffff815cf1ce
| >#22 [ffff88017a2a9e50] rcu_gp_kthread at ffffffff810f9bbb
| >#23 [ffff88017a2a9ed0] kthread at ffffffff810796d5
| >#24 [ffff88017a2a9f50] ret_from_fork at ffffffff815da04c

lock_timer_base() does a try_lock() which deadlocks on the waiter lock
not the lock itself.
This patch takes the waiter_lock with trylock so it should work from interrupt
context as well. If the fastpath doesn't work and the waiter_lock itself is
taken then it seems that the lock itself taken.
This patch also adds "rt_spin_unlock_after_trylock_in_irq" to keep lockdep
happy. If we managed to take the wait_lock in the first place we should also
be able to take it in the unlock path.

Cc: stable-rt@vger.kernel.org
Reported-by: Mike Galbraith <bitbucket@online.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
ioworker0 pushed a commit to ioworker0/linux that referenced this pull request Sep 5, 2025
Patch series "mm: remove nth_page()", v2.

As discussed recently with Linus, nth_page() is just nasty and we would
like to remove it.

To recap, the reason we currently need nth_page() within a folio is
because on some kernel configs (SPARSEMEM without SPARSEMEM_VMEMMAP), the
memmap is allocated per memory section.

While buddy allocations cannot cross memory section boundaries, hugetlb
and dax folios can.

So crossing a memory section means that "page++" could do the wrong thing.
Instead, nth_page() on these problematic configs always goes from
page->pfn, to the go from (++pfn)->page, which is rather nasty.

Likely, many people have no idea when nth_page() is required and when it
might be dropped.

We refer to such problematic PFN ranges and "non-contiguous pages".  If we
only deal with "contiguous pages", there is not need for nth_page().

Besides that "obvious" folio case, we might end up using nth_page() within
CMA allocations (again, could span memory sections), and in one corner
case (kfence) when processing memblock allocations (again, could span
memory sections).

So let's handle all that, add sanity checks, and remove nth_page().

Patch #1 -> #5   : stop making SPARSEMEM_VMEMMAP user-selectable + cleanups
Patch torvalds#6 -> torvalds#13  : disallow folios to have non-contiguous pages
Patch torvalds#14 -> torvalds#20 : remove nth_page() usage within folios
Patch torvalds#22        : disallow CMA allocations of non-contiguous pages
Patch torvalds#23 -> torvalds#33 : sanity+check + remove nth_page() usage within SG entry
Patch torvalds#34        : sanity-check + remove nth_page() usage in
                   unpin_user_page_range_dirty_lock()
Patch torvalds#35        : remove nth_page() in kfence
Patch torvalds#36        : adjust stale comment regarding nth_page
Patch torvalds#37        : mm: remove nth_page()

A lot of this is inspired from the discussion at [1] between Linus, Jason
and me, so cudos to them.


This patch (of 37):

In an ideal world, we wouldn't have to deal with SPARSEMEM without
SPARSEMEM_VMEMMAP, but in particular for 32bit SPARSEMEM_VMEMMAP is
considered too costly and consequently not supported.

However, if an architecture does support SPARSEMEM with SPARSEMEM_VMEMMAP,
let's forbid the user to disable VMEMMAP: just like we already do for
arm64, s390 and x86.

So if SPARSEMEM_VMEMMAP is supported, don't allow to use SPARSEMEM without
SPARSEMEM_VMEMMAP.

This implies that the option to not use SPARSEMEM_VMEMMAP will now be gone
for loongarch, powerpc, riscv and sparc.  All architectures only enable
SPARSEMEM_VMEMMAP with 64bit support, so there should not really be a big
downside to using the VMEMMAP (quite the contrary).

This is a preparation for not supporting

(1) folio sizes that exceed a single memory section

(2) CMA allocations of non-contiguous page ranges

in SPARSEMEM without SPARSEMEM_VMEMMAP configs, whereby we want to limit
possible impact as much as possible (e.g., gigantic hugetlb page
allocations suddenly fails).

Link: https://lkml.kernel.org/r/20250901150359.867252-1-david@redhat.com
Link: https://lkml.kernel.org/r/20250901150359.867252-2-david@redhat.com
Link: https://lore.kernel.org/all/CAHk-=wiCYfNp4AJLBORU-c7ZyRBUp66W2-Et6cdQ4REx-GyQ_A@mail.gmail.com/T/#u [1]
Signed-off-by: David Hildenbrand <david@redhat.com>
Acked-by: Zi Yan <ziy@nvidia.com>
Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Acked-by: SeongJae Park <sj@kernel.org>
Reviewed-by: Wei Yang <richard.weiyang@gmail.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Huacai Chen <chenhuacai@kernel.org>
Cc: WANG Xuerui <kernel@xen0n.name>
Cc: Madhavan Srinivasan <maddy@linux.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Alexandre Ghiti <alex@ghiti.fr>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Andreas Larsson <andreas@gaisler.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Alexandru Elisei <alexandru.elisei@arm.com>
Cc: Alex Dubov <oakad@yahoo.com>
Cc: Alex Willamson <alex.williamson@redhat.com>
Cc: Bart van Assche <bvanassche@acm.org>
Cc: Borislav Betkov <bp@alien8.de>
Cc: Brendan Jackman <jackmanb@google.com>
Cc: Brett Creeley <brett.creeley@amd.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Christoph Lameter (Ampere) <cl@gentwo.org>
Cc: Damien Le Maol <dlemoal@kernel.org>
Cc: Dave Airlie <airlied@gmail.com>
Cc: Dennis Zhou <dennis@kernel.org>
Cc: Dmitriy Vyukov <dvyukov@google.com>
Cc: Doug Gilbert <dgilbert@interlog.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Inki Dae <m.szyprowski@samsung.com>
Cc: James Bottomley <james.bottomley@HansenPartnership.com>
Cc: Jani Nikula <jani.nikula@linux.intel.com>
Cc: Jason A. Donenfeld <jason@zx2c4.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jesper Nilsson <jesper.nilsson@axis.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Jonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Kevin Tian <kevin.tian@intel.com>
Cc: Lars Persson <lars.persson@axis.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Marco Elver <elver@google.com>
Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
Cc: Maxim Levitky <maximlevitsky@gmail.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Niklas Cassel <cassel@kernel.org>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Pavel Begunkov <asml.silence@gmail.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Robin Murohy <robin.murphy@arm.com>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Thomas Gleinxer <tglx@linutronix.de>
Cc: Tvrtko Ursulin <tursulin@ursulin.net>
Cc: Ulf Hansson <ulf.hansson@linaro.org>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Will Deacon <will@kernel.org>
Cc: Yishai Hadas <yishaih@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
intel-lab-lkp pushed a commit to intel-lab-lkp/linux that referenced this pull request Sep 8, 2025
If we have 2 instances of sbs-charger in the DTS, the driver probe for the second instance will fail:

[    8.012874] sbs-battery 18-000b: sbs-battery: battery gas gauge device registered
[    8.039094] sbs-charger 18-0009: ltc4100: smart charger device registered
[    8.112911] sbs-battery 20-000b: sbs-battery: battery gas gauge device registered
[    8.134533] sysfs: cannot create duplicate filename '/class/power_supply/sbs-charger'
[    8.143871] CPU: 3 PID: 295 Comm: systemd-udevd Tainted: G           O      5.10.147 torvalds#22
[    8.151974] Hardware name: ALE AMB (DT)
[    8.155828] Call trace:
[    8.158292]  dump_backtrace+0x0/0x1d4
[    8.161960]  show_stack+0x18/0x6c
[    8.165280]  dump_stack+0xcc/0x128
[    8.168687]  sysfs_warn_dup+0x60/0x7c
[    8.172353]  sysfs_do_create_link_sd+0xf0/0x100
[    8.176886]  sysfs_create_link+0x20/0x40
[    8.180816]  device_add+0x270/0x7a4
[    8.184311]  __power_supply_register+0x304/0x560
[    8.188930]  devm_power_supply_register+0x54/0xa0
[    8.193644]  sbs_probe+0xc0/0x214 [sbs_charger]
[    8.198183]  i2c_device_probe+0x2dc/0x2f4
[    8.202196]  really_probe+0xf0/0x510
[    8.205774]  driver_probe_device+0xfc/0x160
[    8.209960]  device_driver_attach+0xc0/0xcc
[    8.214146]  __driver_attach+0xc0/0x170
[    8.218002]  bus_for_each_dev+0x74/0xd4
[    8.221862]  driver_attach+0x24/0x30
[    8.225444]  bus_add_driver+0x148/0x250
[    8.229283]  driver_register+0x78/0x130
[    8.233140]  i2c_register_driver+0x4c/0xe0
[    8.237250]  sbs_driver_init+0x20/0x1000 [sbs_charger]
[    8.242424]  do_one_initcall+0x50/0x1b0
[    8.242434]  do_init_module+0x44/0x230
[    8.242438]  load_module+0x2200/0x27c0
[    8.242442]  __do_sys_finit_module+0xa8/0x11c
[    8.242447]  __arm64_sys_finit_module+0x20/0x30
[    8.242457]  el0_svc_common.constprop.0+0x64/0x154
[    8.242464]  do_el0_svc+0x24/0x8c
[    8.242474]  el0_svc+0x10/0x20
[    8.242481]  el0_sync_handler+0x108/0x114
[    8.242485]  el0_sync+0x180/0x1c0
[    8.243847] sbs-charger 20-0009: Failed to register power supply
[    8.287934] sbs-charger: probe of 20-0009 failed with error -17

This is mainly because the "name" field of power_supply_desc is a constant.
This patch fixes the issue by reusing the same approach as sbs-battery.
With this patch, the result is:
[    7.819532] sbs-charger 18-0009: ltc4100: smart charger device registered
[    7.825305] sbs-battery 18-000b: sbs-battery: battery gas gauge device registered
[    7.887423] sbs-battery 20-000b: sbs-battery: battery gas gauge device registered
[    7.893501] sbs-charger 20-0009: ltc4100: smart charger device registered

Signed-off-by: Fabien Proriol <fabien.proriol@viavisolutions.com>
Signed-off-by: Sebastian Reichel <sebastian.reichel@collabora.com>
intel-lab-lkp pushed a commit to intel-lab-lkp/linux that referenced this pull request Sep 8, 2025
If we have 2 instances of sbs-charger in the DTS, the driver probe for the second instance will fail:

[    8.012874] sbs-battery 18-000b: sbs-battery: battery gas gauge device registered
[    8.039094] sbs-charger 18-0009: ltc4100: smart charger device registered
[    8.112911] sbs-battery 20-000b: sbs-battery: battery gas gauge device registered
[    8.134533] sysfs: cannot create duplicate filename '/class/power_supply/sbs-charger'
[    8.143871] CPU: 3 PID: 295 Comm: systemd-udevd Tainted: G           O      5.10.147 torvalds#22
[    8.151974] Hardware name: ALE AMB (DT)
[    8.155828] Call trace:
[    8.158292]  dump_backtrace+0x0/0x1d4
[    8.161960]  show_stack+0x18/0x6c
[    8.165280]  dump_stack+0xcc/0x128
[    8.168687]  sysfs_warn_dup+0x60/0x7c
[    8.172353]  sysfs_do_create_link_sd+0xf0/0x100
[    8.176886]  sysfs_create_link+0x20/0x40
[    8.180816]  device_add+0x270/0x7a4
[    8.184311]  __power_supply_register+0x304/0x560
[    8.188930]  devm_power_supply_register+0x54/0xa0
[    8.193644]  sbs_probe+0xc0/0x214 [sbs_charger]
[    8.198183]  i2c_device_probe+0x2dc/0x2f4
[    8.202196]  really_probe+0xf0/0x510
[    8.205774]  driver_probe_device+0xfc/0x160
[    8.209960]  device_driver_attach+0xc0/0xcc
[    8.214146]  __driver_attach+0xc0/0x170
[    8.218002]  bus_for_each_dev+0x74/0xd4
[    8.221862]  driver_attach+0x24/0x30
[    8.225444]  bus_add_driver+0x148/0x250
[    8.229283]  driver_register+0x78/0x130
[    8.233140]  i2c_register_driver+0x4c/0xe0
[    8.237250]  sbs_driver_init+0x20/0x1000 [sbs_charger]
[    8.242424]  do_one_initcall+0x50/0x1b0
[    8.242434]  do_init_module+0x44/0x230
[    8.242438]  load_module+0x2200/0x27c0
[    8.242442]  __do_sys_finit_module+0xa8/0x11c
[    8.242447]  __arm64_sys_finit_module+0x20/0x30
[    8.242457]  el0_svc_common.constprop.0+0x64/0x154
[    8.242464]  do_el0_svc+0x24/0x8c
[    8.242474]  el0_svc+0x10/0x20
[    8.242481]  el0_sync_handler+0x108/0x114
[    8.242485]  el0_sync+0x180/0x1c0
[    8.243847] sbs-charger 20-0009: Failed to register power supply
[    8.287934] sbs-charger: probe of 20-0009 failed with error -17

This is mainly because the "name" field of power_supply_desc is a constant.
This patch fixes the issue by reusing the same approach as sbs-battery.
With this patch, the result is:
[    7.819532] sbs-charger 18-0009: ltc4100: smart charger device registered
[    7.825305] sbs-battery 18-000b: sbs-battery: battery gas gauge device registered
[    7.887423] sbs-battery 20-000b: sbs-battery: battery gas gauge device registered
[    7.893501] sbs-charger 20-0009: ltc4100: smart charger device registered

Signed-off-by: Fabien Proriol <fabien.proriol@viavisolutions.com>
Signed-off-by: Sebastian Reichel <sebastian.reichel@collabora.com>
intel-lab-lkp pushed a commit to intel-lab-lkp/linux that referenced this pull request Sep 9, 2025
Patch series "mm: remove nth_page()", v2.

As discussed recently with Linus, nth_page() is just nasty and we would
like to remove it.

To recap, the reason we currently need nth_page() within a folio is
because on some kernel configs (SPARSEMEM without SPARSEMEM_VMEMMAP), the
memmap is allocated per memory section.

While buddy allocations cannot cross memory section boundaries, hugetlb
and dax folios can.

So crossing a memory section means that "page++" could do the wrong thing.
Instead, nth_page() on these problematic configs always goes from
page->pfn, to the go from (++pfn)->page, which is rather nasty.

Likely, many people have no idea when nth_page() is required and when it
might be dropped.

We refer to such problematic PFN ranges and "non-contiguous pages".  If we
only deal with "contiguous pages", there is not need for nth_page().

Besides that "obvious" folio case, we might end up using nth_page() within
CMA allocations (again, could span memory sections), and in one corner
case (kfence) when processing memblock allocations (again, could span
memory sections).

So let's handle all that, add sanity checks, and remove nth_page().

Patch #1 -> #5   : stop making SPARSEMEM_VMEMMAP user-selectable + cleanups
Patch torvalds#6 -> torvalds#13  : disallow folios to have non-contiguous pages
Patch torvalds#14 -> torvalds#20 : remove nth_page() usage within folios
Patch torvalds#22        : disallow CMA allocations of non-contiguous pages
Patch torvalds#23 -> torvalds#33 : sanity+check + remove nth_page() usage within SG entry
Patch torvalds#34        : sanity-check + remove nth_page() usage in
                   unpin_user_page_range_dirty_lock()
Patch torvalds#35        : remove nth_page() in kfence
Patch torvalds#36        : adjust stale comment regarding nth_page
Patch torvalds#37        : mm: remove nth_page()

A lot of this is inspired from the discussion at [1] between Linus, Jason
and me, so cudos to them.


This patch (of 37):

In an ideal world, we wouldn't have to deal with SPARSEMEM without
SPARSEMEM_VMEMMAP, but in particular for 32bit SPARSEMEM_VMEMMAP is
considered too costly and consequently not supported.

However, if an architecture does support SPARSEMEM with SPARSEMEM_VMEMMAP,
let's forbid the user to disable VMEMMAP: just like we already do for
arm64, s390 and x86.

So if SPARSEMEM_VMEMMAP is supported, don't allow to use SPARSEMEM without
SPARSEMEM_VMEMMAP.

This implies that the option to not use SPARSEMEM_VMEMMAP will now be gone
for loongarch, powerpc, riscv and sparc.  All architectures only enable
SPARSEMEM_VMEMMAP with 64bit support, so there should not really be a big
downside to using the VMEMMAP (quite the contrary).

This is a preparation for not supporting

(1) folio sizes that exceed a single memory section

(2) CMA allocations of non-contiguous page ranges

in SPARSEMEM without SPARSEMEM_VMEMMAP configs, whereby we want to limit
possible impact as much as possible (e.g., gigantic hugetlb page
allocations suddenly fails).

Link: https://lkml.kernel.org/r/20250901150359.867252-1-david@redhat.com
Link: https://lkml.kernel.org/r/20250901150359.867252-2-david@redhat.com
Link: https://lore.kernel.org/all/CAHk-=wiCYfNp4AJLBORU-c7ZyRBUp66W2-Et6cdQ4REx-GyQ_A@mail.gmail.com/T/#u [1]
Signed-off-by: David Hildenbrand <david@redhat.com>
Acked-by: Zi Yan <ziy@nvidia.com>
Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Acked-by: SeongJae Park <sj@kernel.org>
Reviewed-by: Wei Yang <richard.weiyang@gmail.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Huacai Chen <chenhuacai@kernel.org>
Cc: WANG Xuerui <kernel@xen0n.name>
Cc: Madhavan Srinivasan <maddy@linux.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Alexandre Ghiti <alex@ghiti.fr>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Andreas Larsson <andreas@gaisler.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Alexandru Elisei <alexandru.elisei@arm.com>
Cc: Alex Dubov <oakad@yahoo.com>
Cc: Alex Willamson <alex.williamson@redhat.com>
Cc: Bart van Assche <bvanassche@acm.org>
Cc: Borislav Betkov <bp@alien8.de>
Cc: Brendan Jackman <jackmanb@google.com>
Cc: Brett Creeley <brett.creeley@amd.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Christoph Lameter (Ampere) <cl@gentwo.org>
Cc: Damien Le Maol <dlemoal@kernel.org>
Cc: Dave Airlie <airlied@gmail.com>
Cc: Dennis Zhou <dennis@kernel.org>
Cc: Dmitriy Vyukov <dvyukov@google.com>
Cc: Doug Gilbert <dgilbert@interlog.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Inki Dae <m.szyprowski@samsung.com>
Cc: James Bottomley <james.bottomley@HansenPartnership.com>
Cc: Jani Nikula <jani.nikula@linux.intel.com>
Cc: Jason A. Donenfeld <jason@zx2c4.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jesper Nilsson <jesper.nilsson@axis.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Jonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Kevin Tian <kevin.tian@intel.com>
Cc: Lars Persson <lars.persson@axis.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Marco Elver <elver@google.com>
Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
Cc: Maxim Levitky <maximlevitsky@gmail.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Niklas Cassel <cassel@kernel.org>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Pavel Begunkov <asml.silence@gmail.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Robin Murohy <robin.murphy@arm.com>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Thomas Gleinxer <tglx@linutronix.de>
Cc: Tvrtko Ursulin <tursulin@ursulin.net>
Cc: Ulf Hansson <ulf.hansson@linaro.org>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Will Deacon <will@kernel.org>
Cc: Yishai Hadas <yishaih@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
intel-lab-lkp pushed a commit to intel-lab-lkp/linux that referenced this pull request Sep 10, 2025
Patch series "mm: remove nth_page()", v2.

As discussed recently with Linus, nth_page() is just nasty and we would
like to remove it.

To recap, the reason we currently need nth_page() within a folio is
because on some kernel configs (SPARSEMEM without SPARSEMEM_VMEMMAP), the
memmap is allocated per memory section.

While buddy allocations cannot cross memory section boundaries, hugetlb
and dax folios can.

So crossing a memory section means that "page++" could do the wrong thing.
Instead, nth_page() on these problematic configs always goes from
page->pfn, to the go from (++pfn)->page, which is rather nasty.

Likely, many people have no idea when nth_page() is required and when it
might be dropped.

We refer to such problematic PFN ranges and "non-contiguous pages".  If we
only deal with "contiguous pages", there is not need for nth_page().

Besides that "obvious" folio case, we might end up using nth_page() within
CMA allocations (again, could span memory sections), and in one corner
case (kfence) when processing memblock allocations (again, could span
memory sections).

So let's handle all that, add sanity checks, and remove nth_page().

Patch #1 -> #5   : stop making SPARSEMEM_VMEMMAP user-selectable + cleanups
Patch torvalds#6 -> torvalds#13  : disallow folios to have non-contiguous pages
Patch torvalds#14 -> torvalds#20 : remove nth_page() usage within folios
Patch torvalds#22        : disallow CMA allocations of non-contiguous pages
Patch torvalds#23 -> torvalds#33 : sanity+check + remove nth_page() usage within SG entry
Patch torvalds#34        : sanity-check + remove nth_page() usage in
                   unpin_user_page_range_dirty_lock()
Patch torvalds#35        : remove nth_page() in kfence
Patch torvalds#36        : adjust stale comment regarding nth_page
Patch torvalds#37        : mm: remove nth_page()

A lot of this is inspired from the discussion at [1] between Linus, Jason
and me, so cudos to them.


This patch (of 37):

In an ideal world, we wouldn't have to deal with SPARSEMEM without
SPARSEMEM_VMEMMAP, but in particular for 32bit SPARSEMEM_VMEMMAP is
considered too costly and consequently not supported.

However, if an architecture does support SPARSEMEM with SPARSEMEM_VMEMMAP,
let's forbid the user to disable VMEMMAP: just like we already do for
arm64, s390 and x86.

So if SPARSEMEM_VMEMMAP is supported, don't allow to use SPARSEMEM without
SPARSEMEM_VMEMMAP.

This implies that the option to not use SPARSEMEM_VMEMMAP will now be gone
for loongarch, powerpc, riscv and sparc.  All architectures only enable
SPARSEMEM_VMEMMAP with 64bit support, so there should not really be a big
downside to using the VMEMMAP (quite the contrary).

This is a preparation for not supporting

(1) folio sizes that exceed a single memory section

(2) CMA allocations of non-contiguous page ranges

in SPARSEMEM without SPARSEMEM_VMEMMAP configs, whereby we want to limit
possible impact as much as possible (e.g., gigantic hugetlb page
allocations suddenly fails).

Link: https://lkml.kernel.org/r/20250901150359.867252-1-david@redhat.com
Link: https://lkml.kernel.org/r/20250901150359.867252-2-david@redhat.com
Link: https://lore.kernel.org/all/CAHk-=wiCYfNp4AJLBORU-c7ZyRBUp66W2-Et6cdQ4REx-GyQ_A@mail.gmail.com/T/#u [1]
Signed-off-by: David Hildenbrand <david@redhat.com>
Acked-by: Zi Yan <ziy@nvidia.com>
Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Acked-by: SeongJae Park <sj@kernel.org>
Reviewed-by: Wei Yang <richard.weiyang@gmail.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Huacai Chen <chenhuacai@kernel.org>
Cc: WANG Xuerui <kernel@xen0n.name>
Cc: Madhavan Srinivasan <maddy@linux.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Alexandre Ghiti <alex@ghiti.fr>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Andreas Larsson <andreas@gaisler.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Alexandru Elisei <alexandru.elisei@arm.com>
Cc: Alex Dubov <oakad@yahoo.com>
Cc: Alex Willamson <alex.williamson@redhat.com>
Cc: Bart van Assche <bvanassche@acm.org>
Cc: Borislav Betkov <bp@alien8.de>
Cc: Brendan Jackman <jackmanb@google.com>
Cc: Brett Creeley <brett.creeley@amd.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Christoph Lameter (Ampere) <cl@gentwo.org>
Cc: Damien Le Maol <dlemoal@kernel.org>
Cc: Dave Airlie <airlied@gmail.com>
Cc: Dennis Zhou <dennis@kernel.org>
Cc: Dmitriy Vyukov <dvyukov@google.com>
Cc: Doug Gilbert <dgilbert@interlog.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Inki Dae <m.szyprowski@samsung.com>
Cc: James Bottomley <james.bottomley@HansenPartnership.com>
Cc: Jani Nikula <jani.nikula@linux.intel.com>
Cc: Jason A. Donenfeld <jason@zx2c4.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jesper Nilsson <jesper.nilsson@axis.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Jonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Kevin Tian <kevin.tian@intel.com>
Cc: Lars Persson <lars.persson@axis.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Marco Elver <elver@google.com>
Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
Cc: Maxim Levitky <maximlevitsky@gmail.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Niklas Cassel <cassel@kernel.org>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Pavel Begunkov <asml.silence@gmail.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Robin Murohy <robin.murphy@arm.com>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Thomas Gleinxer <tglx@linutronix.de>
Cc: Tvrtko Ursulin <tursulin@ursulin.net>
Cc: Ulf Hansson <ulf.hansson@linaro.org>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Will Deacon <will@kernel.org>
Cc: Yishai Hadas <yishaih@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
ioworker0 pushed a commit to ioworker0/linux that referenced this pull request Sep 11, 2025
Patch series "mm: remove nth_page()", v2.

As discussed recently with Linus, nth_page() is just nasty and we would
like to remove it.

To recap, the reason we currently need nth_page() within a folio is
because on some kernel configs (SPARSEMEM without SPARSEMEM_VMEMMAP), the
memmap is allocated per memory section.

While buddy allocations cannot cross memory section boundaries, hugetlb
and dax folios can.

So crossing a memory section means that "page++" could do the wrong thing.
Instead, nth_page() on these problematic configs always goes from
page->pfn, to the go from (++pfn)->page, which is rather nasty.

Likely, many people have no idea when nth_page() is required and when it
might be dropped.

We refer to such problematic PFN ranges and "non-contiguous pages".  If we
only deal with "contiguous pages", there is not need for nth_page().

Besides that "obvious" folio case, we might end up using nth_page() within
CMA allocations (again, could span memory sections), and in one corner
case (kfence) when processing memblock allocations (again, could span
memory sections).

So let's handle all that, add sanity checks, and remove nth_page().

Patch #1 -> #5   : stop making SPARSEMEM_VMEMMAP user-selectable + cleanups
Patch torvalds#6 -> torvalds#13  : disallow folios to have non-contiguous pages
Patch torvalds#14 -> torvalds#20 : remove nth_page() usage within folios
Patch torvalds#22        : disallow CMA allocations of non-contiguous pages
Patch torvalds#23 -> torvalds#33 : sanity+check + remove nth_page() usage within SG entry
Patch torvalds#34        : sanity-check + remove nth_page() usage in
                   unpin_user_page_range_dirty_lock()
Patch torvalds#35        : remove nth_page() in kfence
Patch torvalds#36        : adjust stale comment regarding nth_page
Patch torvalds#37        : mm: remove nth_page()

A lot of this is inspired from the discussion at [1] between Linus, Jason
and me, so cudos to them.


This patch (of 37):

In an ideal world, we wouldn't have to deal with SPARSEMEM without
SPARSEMEM_VMEMMAP, but in particular for 32bit SPARSEMEM_VMEMMAP is
considered too costly and consequently not supported.

However, if an architecture does support SPARSEMEM with SPARSEMEM_VMEMMAP,
let's forbid the user to disable VMEMMAP: just like we already do for
arm64, s390 and x86.

So if SPARSEMEM_VMEMMAP is supported, don't allow to use SPARSEMEM without
SPARSEMEM_VMEMMAP.

This implies that the option to not use SPARSEMEM_VMEMMAP will now be gone
for loongarch, powerpc, riscv and sparc.  All architectures only enable
SPARSEMEM_VMEMMAP with 64bit support, so there should not really be a big
downside to using the VMEMMAP (quite the contrary).

This is a preparation for not supporting

(1) folio sizes that exceed a single memory section

(2) CMA allocations of non-contiguous page ranges

in SPARSEMEM without SPARSEMEM_VMEMMAP configs, whereby we want to limit
possible impact as much as possible (e.g., gigantic hugetlb page
allocations suddenly fails).

Link: https://lkml.kernel.org/r/20250901150359.867252-1-david@redhat.com
Link: https://lkml.kernel.org/r/20250901150359.867252-2-david@redhat.com
Link: https://lore.kernel.org/all/CAHk-=wiCYfNp4AJLBORU-c7ZyRBUp66W2-Et6cdQ4REx-GyQ_A@mail.gmail.com/T/#u [1]
Signed-off-by: David Hildenbrand <david@redhat.com>
Acked-by: Zi Yan <ziy@nvidia.com>
Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Acked-by: SeongJae Park <sj@kernel.org>
Reviewed-by: Wei Yang <richard.weiyang@gmail.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Huacai Chen <chenhuacai@kernel.org>
Cc: WANG Xuerui <kernel@xen0n.name>
Cc: Madhavan Srinivasan <maddy@linux.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Alexandre Ghiti <alex@ghiti.fr>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Andreas Larsson <andreas@gaisler.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Alexandru Elisei <alexandru.elisei@arm.com>
Cc: Alex Dubov <oakad@yahoo.com>
Cc: Alex Willamson <alex.williamson@redhat.com>
Cc: Bart van Assche <bvanassche@acm.org>
Cc: Borislav Betkov <bp@alien8.de>
Cc: Brendan Jackman <jackmanb@google.com>
Cc: Brett Creeley <brett.creeley@amd.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Christoph Lameter (Ampere) <cl@gentwo.org>
Cc: Damien Le Maol <dlemoal@kernel.org>
Cc: Dave Airlie <airlied@gmail.com>
Cc: Dennis Zhou <dennis@kernel.org>
Cc: Dmitriy Vyukov <dvyukov@google.com>
Cc: Doug Gilbert <dgilbert@interlog.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Inki Dae <m.szyprowski@samsung.com>
Cc: James Bottomley <james.bottomley@HansenPartnership.com>
Cc: Jani Nikula <jani.nikula@linux.intel.com>
Cc: Jason A. Donenfeld <jason@zx2c4.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jesper Nilsson <jesper.nilsson@axis.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Jonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Kevin Tian <kevin.tian@intel.com>
Cc: Lars Persson <lars.persson@axis.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Marco Elver <elver@google.com>
Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
Cc: Maxim Levitky <maximlevitsky@gmail.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Niklas Cassel <cassel@kernel.org>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Pavel Begunkov <asml.silence@gmail.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Robin Murohy <robin.murphy@arm.com>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Thomas Gleinxer <tglx@linutronix.de>
Cc: Tvrtko Ursulin <tursulin@ursulin.net>
Cc: Ulf Hansson <ulf.hansson@linaro.org>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Will Deacon <will@kernel.org>
Cc: Yishai Hadas <yishaih@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
ioworker0 pushed a commit to ioworker0/linux that referenced this pull request Sep 12, 2025
Patch series "mm: remove nth_page()", v2.

As discussed recently with Linus, nth_page() is just nasty and we would
like to remove it.

To recap, the reason we currently need nth_page() within a folio is
because on some kernel configs (SPARSEMEM without SPARSEMEM_VMEMMAP), the
memmap is allocated per memory section.

While buddy allocations cannot cross memory section boundaries, hugetlb
and dax folios can.

So crossing a memory section means that "page++" could do the wrong thing.
Instead, nth_page() on these problematic configs always goes from
page->pfn, to the go from (++pfn)->page, which is rather nasty.

Likely, many people have no idea when nth_page() is required and when it
might be dropped.

We refer to such problematic PFN ranges and "non-contiguous pages".  If we
only deal with "contiguous pages", there is not need for nth_page().

Besides that "obvious" folio case, we might end up using nth_page() within
CMA allocations (again, could span memory sections), and in one corner
case (kfence) when processing memblock allocations (again, could span
memory sections).

So let's handle all that, add sanity checks, and remove nth_page().

Patch #1 -> #5   : stop making SPARSEMEM_VMEMMAP user-selectable + cleanups
Patch torvalds#6 -> torvalds#13  : disallow folios to have non-contiguous pages
Patch torvalds#14 -> torvalds#20 : remove nth_page() usage within folios
Patch torvalds#22        : disallow CMA allocations of non-contiguous pages
Patch torvalds#23 -> torvalds#33 : sanity+check + remove nth_page() usage within SG entry
Patch torvalds#34        : sanity-check + remove nth_page() usage in
                   unpin_user_page_range_dirty_lock()
Patch torvalds#35        : remove nth_page() in kfence
Patch torvalds#36        : adjust stale comment regarding nth_page
Patch torvalds#37        : mm: remove nth_page()

A lot of this is inspired from the discussion at [1] between Linus, Jason
and me, so cudos to them.


This patch (of 37):

In an ideal world, we wouldn't have to deal with SPARSEMEM without
SPARSEMEM_VMEMMAP, but in particular for 32bit SPARSEMEM_VMEMMAP is
considered too costly and consequently not supported.

However, if an architecture does support SPARSEMEM with SPARSEMEM_VMEMMAP,
let's forbid the user to disable VMEMMAP: just like we already do for
arm64, s390 and x86.

So if SPARSEMEM_VMEMMAP is supported, don't allow to use SPARSEMEM without
SPARSEMEM_VMEMMAP.

This implies that the option to not use SPARSEMEM_VMEMMAP will now be gone
for loongarch, powerpc, riscv and sparc.  All architectures only enable
SPARSEMEM_VMEMMAP with 64bit support, so there should not really be a big
downside to using the VMEMMAP (quite the contrary).

This is a preparation for not supporting

(1) folio sizes that exceed a single memory section

(2) CMA allocations of non-contiguous page ranges

in SPARSEMEM without SPARSEMEM_VMEMMAP configs, whereby we want to limit
possible impact as much as possible (e.g., gigantic hugetlb page
allocations suddenly fails).

Link: https://lkml.kernel.org/r/20250901150359.867252-1-david@redhat.com
Link: https://lkml.kernel.org/r/20250901150359.867252-2-david@redhat.com
Link: https://lore.kernel.org/all/CAHk-=wiCYfNp4AJLBORU-c7ZyRBUp66W2-Et6cdQ4REx-GyQ_A@mail.gmail.com/T/#u [1]
Signed-off-by: David Hildenbrand <david@redhat.com>
Acked-by: Zi Yan <ziy@nvidia.com>
Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Acked-by: SeongJae Park <sj@kernel.org>
Reviewed-by: Wei Yang <richard.weiyang@gmail.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Huacai Chen <chenhuacai@kernel.org>
Cc: WANG Xuerui <kernel@xen0n.name>
Cc: Madhavan Srinivasan <maddy@linux.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Alexandre Ghiti <alex@ghiti.fr>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Andreas Larsson <andreas@gaisler.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Alexandru Elisei <alexandru.elisei@arm.com>
Cc: Alex Dubov <oakad@yahoo.com>
Cc: Alex Willamson <alex.williamson@redhat.com>
Cc: Bart van Assche <bvanassche@acm.org>
Cc: Borislav Betkov <bp@alien8.de>
Cc: Brendan Jackman <jackmanb@google.com>
Cc: Brett Creeley <brett.creeley@amd.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Christoph Lameter (Ampere) <cl@gentwo.org>
Cc: Damien Le Maol <dlemoal@kernel.org>
Cc: Dave Airlie <airlied@gmail.com>
Cc: Dennis Zhou <dennis@kernel.org>
Cc: Dmitriy Vyukov <dvyukov@google.com>
Cc: Doug Gilbert <dgilbert@interlog.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Inki Dae <m.szyprowski@samsung.com>
Cc: James Bottomley <james.bottomley@HansenPartnership.com>
Cc: Jani Nikula <jani.nikula@linux.intel.com>
Cc: Jason A. Donenfeld <jason@zx2c4.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jesper Nilsson <jesper.nilsson@axis.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Jonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Kevin Tian <kevin.tian@intel.com>
Cc: Lars Persson <lars.persson@axis.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Marco Elver <elver@google.com>
Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
Cc: Maxim Levitky <maximlevitsky@gmail.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Niklas Cassel <cassel@kernel.org>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Pavel Begunkov <asml.silence@gmail.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Robin Murohy <robin.murphy@arm.com>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Thomas Gleinxer <tglx@linutronix.de>
Cc: Tvrtko Ursulin <tursulin@ursulin.net>
Cc: Ulf Hansson <ulf.hansson@linaro.org>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Will Deacon <will@kernel.org>
Cc: Yishai Hadas <yishaih@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
bjackman pushed a commit to bjackman/linux that referenced this pull request Sep 14, 2025
Patch series "mm: remove nth_page()", v2.

As discussed recently with Linus, nth_page() is just nasty and we would
like to remove it.

To recap, the reason we currently need nth_page() within a folio is
because on some kernel configs (SPARSEMEM without SPARSEMEM_VMEMMAP), the
memmap is allocated per memory section.

While buddy allocations cannot cross memory section boundaries, hugetlb
and dax folios can.

So crossing a memory section means that "page++" could do the wrong thing.
Instead, nth_page() on these problematic configs always goes from
page->pfn, to the go from (++pfn)->page, which is rather nasty.

Likely, many people have no idea when nth_page() is required and when it
might be dropped.

We refer to such problematic PFN ranges and "non-contiguous pages".  If we
only deal with "contiguous pages", there is not need for nth_page().

Besides that "obvious" folio case, we might end up using nth_page() within
CMA allocations (again, could span memory sections), and in one corner
case (kfence) when processing memblock allocations (again, could span
memory sections).

So let's handle all that, add sanity checks, and remove nth_page().

Patch #1 -> #5   : stop making SPARSEMEM_VMEMMAP user-selectable + cleanups
Patch torvalds#6 -> torvalds#13  : disallow folios to have non-contiguous pages
Patch torvalds#14 -> torvalds#20 : remove nth_page() usage within folios
Patch torvalds#22        : disallow CMA allocations of non-contiguous pages
Patch torvalds#23 -> torvalds#33 : sanity+check + remove nth_page() usage within SG entry
Patch torvalds#34        : sanity-check + remove nth_page() usage in
                   unpin_user_page_range_dirty_lock()
Patch torvalds#35        : remove nth_page() in kfence
Patch torvalds#36        : adjust stale comment regarding nth_page
Patch torvalds#37        : mm: remove nth_page()

A lot of this is inspired from the discussion at [1] between Linus, Jason
and me, so cudos to them.


This patch (of 37):

In an ideal world, we wouldn't have to deal with SPARSEMEM without
SPARSEMEM_VMEMMAP, but in particular for 32bit SPARSEMEM_VMEMMAP is
considered too costly and consequently not supported.

However, if an architecture does support SPARSEMEM with SPARSEMEM_VMEMMAP,
let's forbid the user to disable VMEMMAP: just like we already do for
arm64, s390 and x86.

So if SPARSEMEM_VMEMMAP is supported, don't allow to use SPARSEMEM without
SPARSEMEM_VMEMMAP.

This implies that the option to not use SPARSEMEM_VMEMMAP will now be gone
for loongarch, powerpc, riscv and sparc.  All architectures only enable
SPARSEMEM_VMEMMAP with 64bit support, so there should not really be a big
downside to using the VMEMMAP (quite the contrary).

This is a preparation for not supporting

(1) folio sizes that exceed a single memory section

(2) CMA allocations of non-contiguous page ranges

in SPARSEMEM without SPARSEMEM_VMEMMAP configs, whereby we want to limit
possible impact as much as possible (e.g., gigantic hugetlb page
allocations suddenly fails).

Link: https://lkml.kernel.org/r/20250901150359.867252-1-david@redhat.com
Link: https://lkml.kernel.org/r/20250901150359.867252-2-david@redhat.com
Link: https://lore.kernel.org/all/CAHk-=wiCYfNp4AJLBORU-c7ZyRBUp66W2-Et6cdQ4REx-GyQ_A@mail.gmail.com/T/#u [1]
Signed-off-by: David Hildenbrand <david@redhat.com>
Acked-by: Zi Yan <ziy@nvidia.com>
Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Acked-by: SeongJae Park <sj@kernel.org>
Reviewed-by: Wei Yang <richard.weiyang@gmail.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Huacai Chen <chenhuacai@kernel.org>
Cc: WANG Xuerui <kernel@xen0n.name>
Cc: Madhavan Srinivasan <maddy@linux.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Alexandre Ghiti <alex@ghiti.fr>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Andreas Larsson <andreas@gaisler.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Alexandru Elisei <alexandru.elisei@arm.com>
Cc: Alex Dubov <oakad@yahoo.com>
Cc: Alex Willamson <alex.williamson@redhat.com>
Cc: Bart van Assche <bvanassche@acm.org>
Cc: Borislav Betkov <bp@alien8.de>
Cc: Brendan Jackman <jackmanb@google.com>
Cc: Brett Creeley <brett.creeley@amd.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Christoph Lameter (Ampere) <cl@gentwo.org>
Cc: Damien Le Maol <dlemoal@kernel.org>
Cc: Dave Airlie <airlied@gmail.com>
Cc: Dennis Zhou <dennis@kernel.org>
Cc: Dmitriy Vyukov <dvyukov@google.com>
Cc: Doug Gilbert <dgilbert@interlog.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Inki Dae <m.szyprowski@samsung.com>
Cc: James Bottomley <james.bottomley@HansenPartnership.com>
Cc: Jani Nikula <jani.nikula@linux.intel.com>
Cc: Jason A. Donenfeld <jason@zx2c4.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jesper Nilsson <jesper.nilsson@axis.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Jonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Kevin Tian <kevin.tian@intel.com>
Cc: Lars Persson <lars.persson@axis.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Marco Elver <elver@google.com>
Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
Cc: Maxim Levitky <maximlevitsky@gmail.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Niklas Cassel <cassel@kernel.org>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Pavel Begunkov <asml.silence@gmail.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Robin Murohy <robin.murphy@arm.com>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Thomas Gleinxer <tglx@linutronix.de>
Cc: Tvrtko Ursulin <tursulin@ursulin.net>
Cc: Ulf Hansson <ulf.hansson@linaro.org>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Will Deacon <will@kernel.org>
Cc: Yishai Hadas <yishaih@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
bjackman pushed a commit to bjackman/linux that referenced this pull request Sep 15, 2025
Patch series "mm: remove nth_page()", v2.

As discussed recently with Linus, nth_page() is just nasty and we would
like to remove it.

To recap, the reason we currently need nth_page() within a folio is
because on some kernel configs (SPARSEMEM without SPARSEMEM_VMEMMAP), the
memmap is allocated per memory section.

While buddy allocations cannot cross memory section boundaries, hugetlb
and dax folios can.

So crossing a memory section means that "page++" could do the wrong thing.
Instead, nth_page() on these problematic configs always goes from
page->pfn, to the go from (++pfn)->page, which is rather nasty.

Likely, many people have no idea when nth_page() is required and when it
might be dropped.

We refer to such problematic PFN ranges and "non-contiguous pages".  If we
only deal with "contiguous pages", there is not need for nth_page().

Besides that "obvious" folio case, we might end up using nth_page() within
CMA allocations (again, could span memory sections), and in one corner
case (kfence) when processing memblock allocations (again, could span
memory sections).

So let's handle all that, add sanity checks, and remove nth_page().

Patch #1 -> #5   : stop making SPARSEMEM_VMEMMAP user-selectable + cleanups
Patch torvalds#6 -> torvalds#13  : disallow folios to have non-contiguous pages
Patch torvalds#14 -> torvalds#20 : remove nth_page() usage within folios
Patch torvalds#22        : disallow CMA allocations of non-contiguous pages
Patch torvalds#23 -> torvalds#33 : sanity+check + remove nth_page() usage within SG entry
Patch torvalds#34        : sanity-check + remove nth_page() usage in
                   unpin_user_page_range_dirty_lock()
Patch torvalds#35        : remove nth_page() in kfence
Patch torvalds#36        : adjust stale comment regarding nth_page
Patch torvalds#37        : mm: remove nth_page()

A lot of this is inspired from the discussion at [1] between Linus, Jason
and me, so cudos to them.


This patch (of 37):

In an ideal world, we wouldn't have to deal with SPARSEMEM without
SPARSEMEM_VMEMMAP, but in particular for 32bit SPARSEMEM_VMEMMAP is
considered too costly and consequently not supported.

However, if an architecture does support SPARSEMEM with SPARSEMEM_VMEMMAP,
let's forbid the user to disable VMEMMAP: just like we already do for
arm64, s390 and x86.

So if SPARSEMEM_VMEMMAP is supported, don't allow to use SPARSEMEM without
SPARSEMEM_VMEMMAP.

This implies that the option to not use SPARSEMEM_VMEMMAP will now be gone
for loongarch, powerpc, riscv and sparc.  All architectures only enable
SPARSEMEM_VMEMMAP with 64bit support, so there should not really be a big
downside to using the VMEMMAP (quite the contrary).

This is a preparation for not supporting

(1) folio sizes that exceed a single memory section

(2) CMA allocations of non-contiguous page ranges

in SPARSEMEM without SPARSEMEM_VMEMMAP configs, whereby we want to limit
possible impact as much as possible (e.g., gigantic hugetlb page
allocations suddenly fails).

Link: https://lkml.kernel.org/r/20250901150359.867252-1-david@redhat.com
Link: https://lkml.kernel.org/r/20250901150359.867252-2-david@redhat.com
Link: https://lore.kernel.org/all/CAHk-=wiCYfNp4AJLBORU-c7ZyRBUp66W2-Et6cdQ4REx-GyQ_A@mail.gmail.com/T/#u [1]
Signed-off-by: David Hildenbrand <david@redhat.com>
Acked-by: Zi Yan <ziy@nvidia.com>
Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Acked-by: SeongJae Park <sj@kernel.org>
Reviewed-by: Wei Yang <richard.weiyang@gmail.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Huacai Chen <chenhuacai@kernel.org>
Cc: WANG Xuerui <kernel@xen0n.name>
Cc: Madhavan Srinivasan <maddy@linux.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Alexandre Ghiti <alex@ghiti.fr>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Andreas Larsson <andreas@gaisler.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Alexandru Elisei <alexandru.elisei@arm.com>
Cc: Alex Dubov <oakad@yahoo.com>
Cc: Alex Willamson <alex.williamson@redhat.com>
Cc: Bart van Assche <bvanassche@acm.org>
Cc: Borislav Betkov <bp@alien8.de>
Cc: Brendan Jackman <jackmanb@google.com>
Cc: Brett Creeley <brett.creeley@amd.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Christoph Lameter (Ampere) <cl@gentwo.org>
Cc: Damien Le Maol <dlemoal@kernel.org>
Cc: Dave Airlie <airlied@gmail.com>
Cc: Dennis Zhou <dennis@kernel.org>
Cc: Dmitriy Vyukov <dvyukov@google.com>
Cc: Doug Gilbert <dgilbert@interlog.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Inki Dae <m.szyprowski@samsung.com>
Cc: James Bottomley <james.bottomley@HansenPartnership.com>
Cc: Jani Nikula <jani.nikula@linux.intel.com>
Cc: Jason A. Donenfeld <jason@zx2c4.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jesper Nilsson <jesper.nilsson@axis.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Jonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Kevin Tian <kevin.tian@intel.com>
Cc: Lars Persson <lars.persson@axis.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Marco Elver <elver@google.com>
Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
Cc: Maxim Levitky <maximlevitsky@gmail.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Niklas Cassel <cassel@kernel.org>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Pavel Begunkov <asml.silence@gmail.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Robin Murohy <robin.murphy@arm.com>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Thomas Gleinxer <tglx@linutronix.de>
Cc: Tvrtko Ursulin <tursulin@ursulin.net>
Cc: Ulf Hansson <ulf.hansson@linaro.org>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Will Deacon <will@kernel.org>
Cc: Yishai Hadas <yishaih@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
intel-lab-lkp pushed a commit to intel-lab-lkp/linux that referenced this pull request Sep 15, 2025
Calling intotify_show_fdinfo() on fd watching an overlayfs inode, while
the overlayfs is being unmounted, can lead to dereferencing NULL ptr.

This issue was found by syzkaller.

Race Condition Diagram:

Thread 1                           Thread 2
--------                           --------

generic_shutdown_super()
 shrink_dcache_for_umount
  sb->s_root = NULL

                    |
                    |             vfs_read()
                    |              inotify_fdinfo()
                    |               * inode get from mark *
                    |               show_mark_fhandle(m, inode)
                    |                exportfs_encode_fid(inode, ..)
                    |                 ovl_encode_fh(inode, ..)
                    |                  ovl_check_encode_origin(inode)
                    |                   * deref i_sb->s_root *
                    |
                    |
                    v
 fsnotify_sb_delete(sb)

Which then leads to:

[   32.133461] Oops: general protection fault, probably for non-canonical address 0xdffffc0000000006: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN NOPTI
[   32.134438] KASAN: null-ptr-deref in range [0x0000000000000030-0x0000000000000037]
[   32.135032] CPU: 1 UID: 0 PID: 4468 Comm: systemd-coredum Not tainted 6.17.0-rc6 torvalds#22 PREEMPT(none)

<snip registers, unreliable trace>

[   32.143353] Call Trace:
[   32.143732]  ovl_encode_fh+0xd5/0x170
[   32.144031]  exportfs_encode_inode_fh+0x12f/0x300
[   32.144425]  show_mark_fhandle+0xbe/0x1f0
[   32.145805]  inotify_fdinfo+0x226/0x2d0
[   32.146442]  inotify_show_fdinfo+0x1c5/0x350
[   32.147168]  seq_show+0x530/0x6f0
[   32.147449]  seq_read_iter+0x503/0x12a0
[   32.148419]  seq_read+0x31f/0x410
[   32.150714]  vfs_read+0x1f0/0x9e0
[   32.152297]  ksys_read+0x125/0x240

IOW ovl_check_encode_origin derefs inode->i_sb->s_root, after it was set
to NULL in the unmount path.

Minimize the window of opportunity by adding explicit check.

Fixes: c45beeb ("ovl: support encoding fid from inode with no alias")
Signed-off-by: Jakub Acs <acsjakub@amazon.de>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: Amir Goldstein <amir73il@gmail.com>
Cc: linux-unionfs@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: stable@vger.kernel.org
intel-lab-lkp pushed a commit to intel-lab-lkp/linux that referenced this pull request Sep 16, 2025
Patch series "mm: remove nth_page()", v2.

As discussed recently with Linus, nth_page() is just nasty and we would
like to remove it.

To recap, the reason we currently need nth_page() within a folio is
because on some kernel configs (SPARSEMEM without SPARSEMEM_VMEMMAP), the
memmap is allocated per memory section.

While buddy allocations cannot cross memory section boundaries, hugetlb
and dax folios can.

So crossing a memory section means that "page++" could do the wrong thing.
Instead, nth_page() on these problematic configs always goes from
page->pfn, to the go from (++pfn)->page, which is rather nasty.

Likely, many people have no idea when nth_page() is required and when it
might be dropped.

We refer to such problematic PFN ranges and "non-contiguous pages".  If we
only deal with "contiguous pages", there is not need for nth_page().

Besides that "obvious" folio case, we might end up using nth_page() within
CMA allocations (again, could span memory sections), and in one corner
case (kfence) when processing memblock allocations (again, could span
memory sections).

So let's handle all that, add sanity checks, and remove nth_page().

Patch #1 -> #5   : stop making SPARSEMEM_VMEMMAP user-selectable + cleanups
Patch torvalds#6 -> torvalds#13  : disallow folios to have non-contiguous pages
Patch torvalds#14 -> torvalds#20 : remove nth_page() usage within folios
Patch torvalds#22        : disallow CMA allocations of non-contiguous pages
Patch torvalds#23 -> torvalds#33 : sanity+check + remove nth_page() usage within SG entry
Patch torvalds#34        : sanity-check + remove nth_page() usage in
                   unpin_user_page_range_dirty_lock()
Patch torvalds#35        : remove nth_page() in kfence
Patch torvalds#36        : adjust stale comment regarding nth_page
Patch torvalds#37        : mm: remove nth_page()

A lot of this is inspired from the discussion at [1] between Linus, Jason
and me, so cudos to them.


This patch (of 37):

In an ideal world, we wouldn't have to deal with SPARSEMEM without
SPARSEMEM_VMEMMAP, but in particular for 32bit SPARSEMEM_VMEMMAP is
considered too costly and consequently not supported.

However, if an architecture does support SPARSEMEM with SPARSEMEM_VMEMMAP,
let's forbid the user to disable VMEMMAP: just like we already do for
arm64, s390 and x86.

So if SPARSEMEM_VMEMMAP is supported, don't allow to use SPARSEMEM without
SPARSEMEM_VMEMMAP.

This implies that the option to not use SPARSEMEM_VMEMMAP will now be gone
for loongarch, powerpc, riscv and sparc.  All architectures only enable
SPARSEMEM_VMEMMAP with 64bit support, so there should not really be a big
downside to using the VMEMMAP (quite the contrary).

This is a preparation for not supporting

(1) folio sizes that exceed a single memory section

(2) CMA allocations of non-contiguous page ranges

in SPARSEMEM without SPARSEMEM_VMEMMAP configs, whereby we want to limit
possible impact as much as possible (e.g., gigantic hugetlb page
allocations suddenly fails).

Link: https://lkml.kernel.org/r/20250901150359.867252-1-david@redhat.com
Link: https://lkml.kernel.org/r/20250901150359.867252-2-david@redhat.com
Link: https://lore.kernel.org/all/CAHk-=wiCYfNp4AJLBORU-c7ZyRBUp66W2-Et6cdQ4REx-GyQ_A@mail.gmail.com/T/#u [1]
Signed-off-by: David Hildenbrand <david@redhat.com>
Acked-by: Zi Yan <ziy@nvidia.com>
Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Acked-by: SeongJae Park <sj@kernel.org>
Reviewed-by: Wei Yang <richard.weiyang@gmail.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Huacai Chen <chenhuacai@kernel.org>
Cc: WANG Xuerui <kernel@xen0n.name>
Cc: Madhavan Srinivasan <maddy@linux.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Alexandre Ghiti <alex@ghiti.fr>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Andreas Larsson <andreas@gaisler.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Alexandru Elisei <alexandru.elisei@arm.com>
Cc: Alex Dubov <oakad@yahoo.com>
Cc: Alex Willamson <alex.williamson@redhat.com>
Cc: Bart van Assche <bvanassche@acm.org>
Cc: Borislav Betkov <bp@alien8.de>
Cc: Brendan Jackman <jackmanb@google.com>
Cc: Brett Creeley <brett.creeley@amd.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Christoph Lameter (Ampere) <cl@gentwo.org>
Cc: Damien Le Maol <dlemoal@kernel.org>
Cc: Dave Airlie <airlied@gmail.com>
Cc: Dennis Zhou <dennis@kernel.org>
Cc: Dmitriy Vyukov <dvyukov@google.com>
Cc: Doug Gilbert <dgilbert@interlog.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Inki Dae <m.szyprowski@samsung.com>
Cc: James Bottomley <james.bottomley@HansenPartnership.com>
Cc: Jani Nikula <jani.nikula@linux.intel.com>
Cc: Jason A. Donenfeld <jason@zx2c4.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jesper Nilsson <jesper.nilsson@axis.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Jonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Kevin Tian <kevin.tian@intel.com>
Cc: Lars Persson <lars.persson@axis.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Marco Elver <elver@google.com>
Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
Cc: Maxim Levitky <maximlevitsky@gmail.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Niklas Cassel <cassel@kernel.org>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Pavel Begunkov <asml.silence@gmail.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Robin Murohy <robin.murphy@arm.com>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Thomas Gleinxer <tglx@linutronix.de>
Cc: Tvrtko Ursulin <tursulin@ursulin.net>
Cc: Ulf Hansson <ulf.hansson@linaro.org>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Will Deacon <will@kernel.org>
Cc: Yishai Hadas <yishaih@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
intel-lab-lkp pushed a commit to intel-lab-lkp/linux that referenced this pull request Sep 17, 2025
Patch series "mm: remove nth_page()", v2.

As discussed recently with Linus, nth_page() is just nasty and we would
like to remove it.

To recap, the reason we currently need nth_page() within a folio is
because on some kernel configs (SPARSEMEM without SPARSEMEM_VMEMMAP), the
memmap is allocated per memory section.

While buddy allocations cannot cross memory section boundaries, hugetlb
and dax folios can.

So crossing a memory section means that "page++" could do the wrong thing.
Instead, nth_page() on these problematic configs always goes from
page->pfn, to the go from (++pfn)->page, which is rather nasty.

Likely, many people have no idea when nth_page() is required and when it
might be dropped.

We refer to such problematic PFN ranges and "non-contiguous pages".  If we
only deal with "contiguous pages", there is not need for nth_page().

Besides that "obvious" folio case, we might end up using nth_page() within
CMA allocations (again, could span memory sections), and in one corner
case (kfence) when processing memblock allocations (again, could span
memory sections).

So let's handle all that, add sanity checks, and remove nth_page().

Patch #1 -> #5   : stop making SPARSEMEM_VMEMMAP user-selectable + cleanups
Patch torvalds#6 -> torvalds#13  : disallow folios to have non-contiguous pages
Patch torvalds#14 -> torvalds#20 : remove nth_page() usage within folios
Patch torvalds#22        : disallow CMA allocations of non-contiguous pages
Patch torvalds#23 -> torvalds#33 : sanity+check + remove nth_page() usage within SG entry
Patch torvalds#34        : sanity-check + remove nth_page() usage in
                   unpin_user_page_range_dirty_lock()
Patch torvalds#35        : remove nth_page() in kfence
Patch torvalds#36        : adjust stale comment regarding nth_page
Patch torvalds#37        : mm: remove nth_page()

A lot of this is inspired from the discussion at [1] between Linus, Jason
and me, so cudos to them.


This patch (of 37):

In an ideal world, we wouldn't have to deal with SPARSEMEM without
SPARSEMEM_VMEMMAP, but in particular for 32bit SPARSEMEM_VMEMMAP is
considered too costly and consequently not supported.

However, if an architecture does support SPARSEMEM with SPARSEMEM_VMEMMAP,
let's forbid the user to disable VMEMMAP: just like we already do for
arm64, s390 and x86.

So if SPARSEMEM_VMEMMAP is supported, don't allow to use SPARSEMEM without
SPARSEMEM_VMEMMAP.

This implies that the option to not use SPARSEMEM_VMEMMAP will now be gone
for loongarch, powerpc, riscv and sparc.  All architectures only enable
SPARSEMEM_VMEMMAP with 64bit support, so there should not really be a big
downside to using the VMEMMAP (quite the contrary).

This is a preparation for not supporting

(1) folio sizes that exceed a single memory section

(2) CMA allocations of non-contiguous page ranges

in SPARSEMEM without SPARSEMEM_VMEMMAP configs, whereby we want to limit
possible impact as much as possible (e.g., gigantic hugetlb page
allocations suddenly fails).

Link: https://lkml.kernel.org/r/20250901150359.867252-1-david@redhat.com
Link: https://lkml.kernel.org/r/20250901150359.867252-2-david@redhat.com
Link: https://lore.kernel.org/all/CAHk-=wiCYfNp4AJLBORU-c7ZyRBUp66W2-Et6cdQ4REx-GyQ_A@mail.gmail.com/T/#u [1]
Signed-off-by: David Hildenbrand <david@redhat.com>
Acked-by: Zi Yan <ziy@nvidia.com>
Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Acked-by: SeongJae Park <sj@kernel.org>
Reviewed-by: Wei Yang <richard.weiyang@gmail.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Huacai Chen <chenhuacai@kernel.org>
Cc: WANG Xuerui <kernel@xen0n.name>
Cc: Madhavan Srinivasan <maddy@linux.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Alexandre Ghiti <alex@ghiti.fr>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Andreas Larsson <andreas@gaisler.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Alexandru Elisei <alexandru.elisei@arm.com>
Cc: Alex Dubov <oakad@yahoo.com>
Cc: Alex Willamson <alex.williamson@redhat.com>
Cc: Bart van Assche <bvanassche@acm.org>
Cc: Borislav Betkov <bp@alien8.de>
Cc: Brendan Jackman <jackmanb@google.com>
Cc: Brett Creeley <brett.creeley@amd.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Christoph Lameter (Ampere) <cl@gentwo.org>
Cc: Damien Le Maol <dlemoal@kernel.org>
Cc: Dave Airlie <airlied@gmail.com>
Cc: Dennis Zhou <dennis@kernel.org>
Cc: Dmitriy Vyukov <dvyukov@google.com>
Cc: Doug Gilbert <dgilbert@interlog.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Inki Dae <m.szyprowski@samsung.com>
Cc: James Bottomley <james.bottomley@HansenPartnership.com>
Cc: Jani Nikula <jani.nikula@linux.intel.com>
Cc: Jason A. Donenfeld <jason@zx2c4.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jesper Nilsson <jesper.nilsson@axis.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Jonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Kevin Tian <kevin.tian@intel.com>
Cc: Lars Persson <lars.persson@axis.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Marco Elver <elver@google.com>
Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
Cc: Maxim Levitky <maximlevitsky@gmail.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Niklas Cassel <cassel@kernel.org>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Pavel Begunkov <asml.silence@gmail.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Robin Murohy <robin.murphy@arm.com>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Thomas Gleinxer <tglx@linutronix.de>
Cc: Tvrtko Ursulin <tursulin@ursulin.net>
Cc: Ulf Hansson <ulf.hansson@linaro.org>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Will Deacon <will@kernel.org>
Cc: Yishai Hadas <yishaih@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
intel-lab-lkp pushed a commit to intel-lab-lkp/linux that referenced this pull request Sep 18, 2025
Patch series "mm: remove nth_page()", v2.

As discussed recently with Linus, nth_page() is just nasty and we would
like to remove it.

To recap, the reason we currently need nth_page() within a folio is
because on some kernel configs (SPARSEMEM without SPARSEMEM_VMEMMAP), the
memmap is allocated per memory section.

While buddy allocations cannot cross memory section boundaries, hugetlb
and dax folios can.

So crossing a memory section means that "page++" could do the wrong thing.
Instead, nth_page() on these problematic configs always goes from
page->pfn, to the go from (++pfn)->page, which is rather nasty.

Likely, many people have no idea when nth_page() is required and when it
might be dropped.

We refer to such problematic PFN ranges and "non-contiguous pages".  If we
only deal with "contiguous pages", there is not need for nth_page().

Besides that "obvious" folio case, we might end up using nth_page() within
CMA allocations (again, could span memory sections), and in one corner
case (kfence) when processing memblock allocations (again, could span
memory sections).

So let's handle all that, add sanity checks, and remove nth_page().

Patch #1 -> #5   : stop making SPARSEMEM_VMEMMAP user-selectable + cleanups
Patch torvalds#6 -> torvalds#13  : disallow folios to have non-contiguous pages
Patch torvalds#14 -> torvalds#20 : remove nth_page() usage within folios
Patch torvalds#22        : disallow CMA allocations of non-contiguous pages
Patch torvalds#23 -> torvalds#33 : sanity+check + remove nth_page() usage within SG entry
Patch torvalds#34        : sanity-check + remove nth_page() usage in
                   unpin_user_page_range_dirty_lock()
Patch torvalds#35        : remove nth_page() in kfence
Patch torvalds#36        : adjust stale comment regarding nth_page
Patch torvalds#37        : mm: remove nth_page()

A lot of this is inspired from the discussion at [1] between Linus, Jason
and me, so cudos to them.


This patch (of 37):

In an ideal world, we wouldn't have to deal with SPARSEMEM without
SPARSEMEM_VMEMMAP, but in particular for 32bit SPARSEMEM_VMEMMAP is
considered too costly and consequently not supported.

However, if an architecture does support SPARSEMEM with SPARSEMEM_VMEMMAP,
let's forbid the user to disable VMEMMAP: just like we already do for
arm64, s390 and x86.

So if SPARSEMEM_VMEMMAP is supported, don't allow to use SPARSEMEM without
SPARSEMEM_VMEMMAP.

This implies that the option to not use SPARSEMEM_VMEMMAP will now be gone
for loongarch, powerpc, riscv and sparc.  All architectures only enable
SPARSEMEM_VMEMMAP with 64bit support, so there should not really be a big
downside to using the VMEMMAP (quite the contrary).

This is a preparation for not supporting

(1) folio sizes that exceed a single memory section

(2) CMA allocations of non-contiguous page ranges

in SPARSEMEM without SPARSEMEM_VMEMMAP configs, whereby we want to limit
possible impact as much as possible (e.g., gigantic hugetlb page
allocations suddenly fails).

Link: https://lkml.kernel.org/r/20250901150359.867252-1-david@redhat.com
Link: https://lkml.kernel.org/r/20250901150359.867252-2-david@redhat.com
Link: https://lore.kernel.org/all/CAHk-=wiCYfNp4AJLBORU-c7ZyRBUp66W2-Et6cdQ4REx-GyQ_A@mail.gmail.com/T/#u [1]
Signed-off-by: David Hildenbrand <david@redhat.com>
Acked-by: Zi Yan <ziy@nvidia.com>
Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Acked-by: SeongJae Park <sj@kernel.org>
Reviewed-by: Wei Yang <richard.weiyang@gmail.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Huacai Chen <chenhuacai@kernel.org>
Cc: WANG Xuerui <kernel@xen0n.name>
Cc: Madhavan Srinivasan <maddy@linux.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Alexandre Ghiti <alex@ghiti.fr>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Andreas Larsson <andreas@gaisler.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Alexandru Elisei <alexandru.elisei@arm.com>
Cc: Alex Dubov <oakad@yahoo.com>
Cc: Alex Willamson <alex.williamson@redhat.com>
Cc: Bart van Assche <bvanassche@acm.org>
Cc: Borislav Betkov <bp@alien8.de>
Cc: Brendan Jackman <jackmanb@google.com>
Cc: Brett Creeley <brett.creeley@amd.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Christoph Lameter (Ampere) <cl@gentwo.org>
Cc: Damien Le Maol <dlemoal@kernel.org>
Cc: Dave Airlie <airlied@gmail.com>
Cc: Dennis Zhou <dennis@kernel.org>
Cc: Dmitriy Vyukov <dvyukov@google.com>
Cc: Doug Gilbert <dgilbert@interlog.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Inki Dae <m.szyprowski@samsung.com>
Cc: James Bottomley <james.bottomley@HansenPartnership.com>
Cc: Jani Nikula <jani.nikula@linux.intel.com>
Cc: Jason A. Donenfeld <jason@zx2c4.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jesper Nilsson <jesper.nilsson@axis.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Jonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Kevin Tian <kevin.tian@intel.com>
Cc: Lars Persson <lars.persson@axis.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Marco Elver <elver@google.com>
Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
Cc: Maxim Levitky <maximlevitsky@gmail.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Niklas Cassel <cassel@kernel.org>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Pavel Begunkov <asml.silence@gmail.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Robin Murohy <robin.murphy@arm.com>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Thomas Gleinxer <tglx@linutronix.de>
Cc: Tvrtko Ursulin <tursulin@ursulin.net>
Cc: Ulf Hansson <ulf.hansson@linaro.org>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Will Deacon <will@kernel.org>
Cc: Yishai Hadas <yishaih@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
bjackman pushed a commit to bjackman/linux that referenced this pull request Sep 19, 2025
Patch series "mm: remove nth_page()", v2.

As discussed recently with Linus, nth_page() is just nasty and we would
like to remove it.

To recap, the reason we currently need nth_page() within a folio is
because on some kernel configs (SPARSEMEM without SPARSEMEM_VMEMMAP), the
memmap is allocated per memory section.

While buddy allocations cannot cross memory section boundaries, hugetlb
and dax folios can.

So crossing a memory section means that "page++" could do the wrong thing.
Instead, nth_page() on these problematic configs always goes from
page->pfn, to the go from (++pfn)->page, which is rather nasty.

Likely, many people have no idea when nth_page() is required and when it
might be dropped.

We refer to such problematic PFN ranges and "non-contiguous pages".  If we
only deal with "contiguous pages", there is not need for nth_page().

Besides that "obvious" folio case, we might end up using nth_page() within
CMA allocations (again, could span memory sections), and in one corner
case (kfence) when processing memblock allocations (again, could span
memory sections).

So let's handle all that, add sanity checks, and remove nth_page().

Patch #1 -> #5   : stop making SPARSEMEM_VMEMMAP user-selectable + cleanups
Patch torvalds#6 -> torvalds#13  : disallow folios to have non-contiguous pages
Patch torvalds#14 -> torvalds#20 : remove nth_page() usage within folios
Patch torvalds#22        : disallow CMA allocations of non-contiguous pages
Patch torvalds#23 -> torvalds#33 : sanity+check + remove nth_page() usage within SG entry
Patch torvalds#34        : sanity-check + remove nth_page() usage in
                   unpin_user_page_range_dirty_lock()
Patch torvalds#35        : remove nth_page() in kfence
Patch torvalds#36        : adjust stale comment regarding nth_page
Patch torvalds#37        : mm: remove nth_page()

A lot of this is inspired from the discussion at [1] between Linus, Jason
and me, so cudos to them.


This patch (of 37):

In an ideal world, we wouldn't have to deal with SPARSEMEM without
SPARSEMEM_VMEMMAP, but in particular for 32bit SPARSEMEM_VMEMMAP is
considered too costly and consequently not supported.

However, if an architecture does support SPARSEMEM with SPARSEMEM_VMEMMAP,
let's forbid the user to disable VMEMMAP: just like we already do for
arm64, s390 and x86.

So if SPARSEMEM_VMEMMAP is supported, don't allow to use SPARSEMEM without
SPARSEMEM_VMEMMAP.

This implies that the option to not use SPARSEMEM_VMEMMAP will now be gone
for loongarch, powerpc, riscv and sparc.  All architectures only enable
SPARSEMEM_VMEMMAP with 64bit support, so there should not really be a big
downside to using the VMEMMAP (quite the contrary).

This is a preparation for not supporting

(1) folio sizes that exceed a single memory section

(2) CMA allocations of non-contiguous page ranges

in SPARSEMEM without SPARSEMEM_VMEMMAP configs, whereby we want to limit
possible impact as much as possible (e.g., gigantic hugetlb page
allocations suddenly fails).

Link: https://lkml.kernel.org/r/20250901150359.867252-1-david@redhat.com
Link: https://lkml.kernel.org/r/20250901150359.867252-2-david@redhat.com
Link: https://lore.kernel.org/all/CAHk-=wiCYfNp4AJLBORU-c7ZyRBUp66W2-Et6cdQ4REx-GyQ_A@mail.gmail.com/T/#u [1]
Signed-off-by: David Hildenbrand <david@redhat.com>
Acked-by: Zi Yan <ziy@nvidia.com>
Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Acked-by: SeongJae Park <sj@kernel.org>
Reviewed-by: Wei Yang <richard.weiyang@gmail.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Huacai Chen <chenhuacai@kernel.org>
Cc: WANG Xuerui <kernel@xen0n.name>
Cc: Madhavan Srinivasan <maddy@linux.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Alexandre Ghiti <alex@ghiti.fr>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Andreas Larsson <andreas@gaisler.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Alexandru Elisei <alexandru.elisei@arm.com>
Cc: Alex Dubov <oakad@yahoo.com>
Cc: Alex Willamson <alex.williamson@redhat.com>
Cc: Bart van Assche <bvanassche@acm.org>
Cc: Borislav Betkov <bp@alien8.de>
Cc: Brendan Jackman <jackmanb@google.com>
Cc: Brett Creeley <brett.creeley@amd.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Christoph Lameter (Ampere) <cl@gentwo.org>
Cc: Damien Le Maol <dlemoal@kernel.org>
Cc: Dave Airlie <airlied@gmail.com>
Cc: Dennis Zhou <dennis@kernel.org>
Cc: Dmitriy Vyukov <dvyukov@google.com>
Cc: Doug Gilbert <dgilbert@interlog.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Inki Dae <m.szyprowski@samsung.com>
Cc: James Bottomley <james.bottomley@HansenPartnership.com>
Cc: Jani Nikula <jani.nikula@linux.intel.com>
Cc: Jason A. Donenfeld <jason@zx2c4.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jesper Nilsson <jesper.nilsson@axis.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Jonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Kevin Tian <kevin.tian@intel.com>
Cc: Lars Persson <lars.persson@axis.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Marco Elver <elver@google.com>
Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
Cc: Maxim Levitky <maximlevitsky@gmail.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Niklas Cassel <cassel@kernel.org>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Pavel Begunkov <asml.silence@gmail.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Robin Murohy <robin.murphy@arm.com>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Thomas Gleinxer <tglx@linutronix.de>
Cc: Tvrtko Ursulin <tursulin@ursulin.net>
Cc: Ulf Hansson <ulf.hansson@linaro.org>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Will Deacon <will@kernel.org>
Cc: Yishai Hadas <yishaih@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
intel-lab-lkp pushed a commit to intel-lab-lkp/linux that referenced this pull request Sep 19, 2025
When PAGEMAP_SCAN ioctl invoked with vec_len = 0 reaches
pagemap_scan_backout_range(), kernel panics with null-ptr-deref:

[   44.936808] Oops: general protection fault, probably for non-canonical address 0xdffffc0000000000: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN NOPTI
[   44.937797] KASAN: null-ptr-deref in range [0x0000000000000000-0x0000000000000007]
[   44.938391] CPU: 1 UID: 0 PID: 2480 Comm: reproducer Not tainted 6.17.0-rc6 torvalds#22 PREEMPT(none)
[   44.939062] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
[   44.939935] RIP: 0010:pagemap_scan_thp_entry.isra.0+0x741/0xa80

<snip registers, unreliable trace>

[   44.946828] Call Trace:
[   44.947030]  <TASK>
[   44.949219]  pagemap_scan_pmd_entry+0xec/0xfa0
[   44.952593]  walk_pmd_range.isra.0+0x302/0x910
[   44.954069]  walk_pud_range.isra.0+0x419/0x790
[   44.954427]  walk_p4d_range+0x41e/0x620
[   44.954743]  walk_pgd_range+0x31e/0x630
[   44.955057]  __walk_page_range+0x160/0x670
[   44.956883]  walk_page_range_mm+0x408/0x980
[   44.958677]  walk_page_range+0x66/0x90
[   44.958984]  do_pagemap_scan+0x28d/0x9c0
[   44.961833]  do_pagemap_cmd+0x59/0x80
[   44.962484]  __x64_sys_ioctl+0x18d/0x210
[   44.962804]  do_syscall_64+0x5b/0x290
[   44.963111]  entry_SYSCALL_64_after_hwframe+0x76/0x7e

vec_len = 0 in pagemap_scan_init_bounce_buffer() means no buffers are
allocated and p->vec_buf remains set to NULL.

This breaks an assumption made later in pagemap_scan_backout_range(),
that page_region is always allocated for p->vec_buf_index.

Fix it by explicitly checking cur_buf for NULL before dereferencing.

Other sites that might run into same deref-issue are already (directly
or transitively) protected by checking p->vec_buf.

Note:
From PAGEMAP_SCAN man page, it seems vec_len = 0 is valid when no output
is requested and it's only the side effects caller is interested in,
hence it passes check in pagemap_scan_get_args().

This issue was found by syzkaller.

Fixes: 52526ca ("fs/proc/task_mmu: implement IOCTL to get and optionally clear info about PTEs")
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Jinjiang Tu <tujinjiang@huawei.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Penglei Jiang <superman.xpt@gmail.com>
Cc: Mark Brown <broonie@kernel.org>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Andrei Vagin <avagin@gmail.com>
Cc: "Michał Mirosław" <mirq-linux@rere.qmqm.pl>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Muhammad Usama Anjum <usama.anjum@collabora.com>
linux-kernel@vger.kernel.org
linux-fsdevel@vger.kernel.org
Cc: stable@vger.kernel.org
Signed-off-by: Jakub Acs <acsjakub@amazon.de>
bjackman pushed a commit to bjackman/linux that referenced this pull request Sep 20, 2025
When the PAGEMAP_SCAN ioctl is invoked with vec_len = 0 reaches
pagemap_scan_backout_range(), kernel panics with null-ptr-deref:

[   44.936808] Oops: general protection fault, probably for non-canonical address 0xdffffc0000000000: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN NOPTI
[   44.937797] KASAN: null-ptr-deref in range [0x0000000000000000-0x0000000000000007]
[   44.938391] CPU: 1 UID: 0 PID: 2480 Comm: reproducer Not tainted 6.17.0-rc6 torvalds#22 PREEMPT(none)
[   44.939062] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
[   44.939935] RIP: 0010:pagemap_scan_thp_entry.isra.0+0x741/0xa80

<snip registers, unreliable trace>

[   44.946828] Call Trace:
[   44.947030]  <TASK>
[   44.949219]  pagemap_scan_pmd_entry+0xec/0xfa0
[   44.952593]  walk_pmd_range.isra.0+0x302/0x910
[   44.954069]  walk_pud_range.isra.0+0x419/0x790
[   44.954427]  walk_p4d_range+0x41e/0x620
[   44.954743]  walk_pgd_range+0x31e/0x630
[   44.955057]  __walk_page_range+0x160/0x670
[   44.956883]  walk_page_range_mm+0x408/0x980
[   44.958677]  walk_page_range+0x66/0x90
[   44.958984]  do_pagemap_scan+0x28d/0x9c0
[   44.961833]  do_pagemap_cmd+0x59/0x80
[   44.962484]  __x64_sys_ioctl+0x18d/0x210
[   44.962804]  do_syscall_64+0x5b/0x290
[   44.963111]  entry_SYSCALL_64_after_hwframe+0x76/0x7e

vec_len = 0 in pagemap_scan_init_bounce_buffer() means no buffers are
allocated and p->vec_buf remains set to NULL.

This breaks an assumption made later in pagemap_scan_backout_range(), that
page_region is always allocated for p->vec_buf_index.

Fix it by explicitly checking cur_buf for NULL before dereferencing.

Other sites that might run into same deref-issue are already (directly or
transitively) protected by checking p->vec_buf.

Note:
From PAGEMAP_SCAN man page, it seems vec_len = 0 is valid when no output
is requested and it's only the side effects caller is interested in, hence
it passes check in pagemap_scan_get_args().

This issue was found by syzkaller.

Link: https://lkml.kernel.org/r/20250919142106.43527-1-acsjakub@amazon.de
Fixes: 52526ca ("fs/proc/task_mmu: implement IOCTL to get and optionally clear info about PTEs")
Signed-off-by: Jakub Acs <acsjakub@amazon.de>
Cc: David Hildenbrand <david@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Jinjiang Tu <tujinjiang@huawei.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Penglei Jiang <superman.xpt@gmail.com>
Cc: Mark Brown <broonie@kernel.org>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Andrei Vagin <avagin@gmail.com>
Cc: "Michał Mirosław" <mirq-linux@rere.qmqm.pl>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Muhammad Usama Anjum <usama.anjum@collabora.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
bjackman pushed a commit to bjackman/linux that referenced this pull request Sep 20, 2025
Patch series "mm: remove nth_page()", v2.

As discussed recently with Linus, nth_page() is just nasty and we would
like to remove it.

To recap, the reason we currently need nth_page() within a folio is
because on some kernel configs (SPARSEMEM without SPARSEMEM_VMEMMAP), the
memmap is allocated per memory section.

While buddy allocations cannot cross memory section boundaries, hugetlb
and dax folios can.

So crossing a memory section means that "page++" could do the wrong thing.
Instead, nth_page() on these problematic configs always goes from
page->pfn, to the go from (++pfn)->page, which is rather nasty.

Likely, many people have no idea when nth_page() is required and when it
might be dropped.

We refer to such problematic PFN ranges and "non-contiguous pages".  If we
only deal with "contiguous pages", there is not need for nth_page().

Besides that "obvious" folio case, we might end up using nth_page() within
CMA allocations (again, could span memory sections), and in one corner
case (kfence) when processing memblock allocations (again, could span
memory sections).

So let's handle all that, add sanity checks, and remove nth_page().

Patch #1 -> #5   : stop making SPARSEMEM_VMEMMAP user-selectable + cleanups
Patch torvalds#6 -> torvalds#13  : disallow folios to have non-contiguous pages
Patch torvalds#14 -> torvalds#20 : remove nth_page() usage within folios
Patch torvalds#22        : disallow CMA allocations of non-contiguous pages
Patch torvalds#23 -> torvalds#33 : sanity+check + remove nth_page() usage within SG entry
Patch torvalds#34        : sanity-check + remove nth_page() usage in
                   unpin_user_page_range_dirty_lock()
Patch torvalds#35        : remove nth_page() in kfence
Patch torvalds#36        : adjust stale comment regarding nth_page
Patch torvalds#37        : mm: remove nth_page()

A lot of this is inspired from the discussion at [1] between Linus, Jason
and me, so cudos to them.


This patch (of 37):

In an ideal world, we wouldn't have to deal with SPARSEMEM without
SPARSEMEM_VMEMMAP, but in particular for 32bit SPARSEMEM_VMEMMAP is
considered too costly and consequently not supported.

However, if an architecture does support SPARSEMEM with SPARSEMEM_VMEMMAP,
let's forbid the user to disable VMEMMAP: just like we already do for
arm64, s390 and x86.

So if SPARSEMEM_VMEMMAP is supported, don't allow to use SPARSEMEM without
SPARSEMEM_VMEMMAP.

This implies that the option to not use SPARSEMEM_VMEMMAP will now be gone
for loongarch, powerpc, riscv and sparc.  All architectures only enable
SPARSEMEM_VMEMMAP with 64bit support, so there should not really be a big
downside to using the VMEMMAP (quite the contrary).

This is a preparation for not supporting

(1) folio sizes that exceed a single memory section

(2) CMA allocations of non-contiguous page ranges

in SPARSEMEM without SPARSEMEM_VMEMMAP configs, whereby we want to limit
possible impact as much as possible (e.g., gigantic hugetlb page
allocations suddenly fails).

Link: https://lkml.kernel.org/r/20250901150359.867252-1-david@redhat.com
Link: https://lkml.kernel.org/r/20250901150359.867252-2-david@redhat.com
Link: https://lore.kernel.org/all/CAHk-=wiCYfNp4AJLBORU-c7ZyRBUp66W2-Et6cdQ4REx-GyQ_A@mail.gmail.com/T/#u [1]
Signed-off-by: David Hildenbrand <david@redhat.com>
Acked-by: Zi Yan <ziy@nvidia.com>
Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Acked-by: SeongJae Park <sj@kernel.org>
Reviewed-by: Wei Yang <richard.weiyang@gmail.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Huacai Chen <chenhuacai@kernel.org>
Cc: WANG Xuerui <kernel@xen0n.name>
Cc: Madhavan Srinivasan <maddy@linux.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Alexandre Ghiti <alex@ghiti.fr>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Andreas Larsson <andreas@gaisler.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Alexandru Elisei <alexandru.elisei@arm.com>
Cc: Alex Dubov <oakad@yahoo.com>
Cc: Alex Willamson <alex.williamson@redhat.com>
Cc: Bart van Assche <bvanassche@acm.org>
Cc: Borislav Betkov <bp@alien8.de>
Cc: Brendan Jackman <jackmanb@google.com>
Cc: Brett Creeley <brett.creeley@amd.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Christoph Lameter (Ampere) <cl@gentwo.org>
Cc: Damien Le Maol <dlemoal@kernel.org>
Cc: Dave Airlie <airlied@gmail.com>
Cc: Dennis Zhou <dennis@kernel.org>
Cc: Dmitriy Vyukov <dvyukov@google.com>
Cc: Doug Gilbert <dgilbert@interlog.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Inki Dae <m.szyprowski@samsung.com>
Cc: James Bottomley <james.bottomley@HansenPartnership.com>
Cc: Jani Nikula <jani.nikula@linux.intel.com>
Cc: Jason A. Donenfeld <jason@zx2c4.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jesper Nilsson <jesper.nilsson@axis.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Jonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Kevin Tian <kevin.tian@intel.com>
Cc: Lars Persson <lars.persson@axis.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Marco Elver <elver@google.com>
Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
Cc: Maxim Levitky <maximlevitsky@gmail.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Niklas Cassel <cassel@kernel.org>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Pavel Begunkov <asml.silence@gmail.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Robin Murohy <robin.murphy@arm.com>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Thomas Gleinxer <tglx@linutronix.de>
Cc: Tvrtko Ursulin <tursulin@ursulin.net>
Cc: Ulf Hansson <ulf.hansson@linaro.org>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Will Deacon <will@kernel.org>
Cc: Yishai Hadas <yishaih@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
ioworker0 pushed a commit to ioworker0/linux that referenced this pull request Sep 21, 2025
Patch series "mm: remove nth_page()", v2.

As discussed recently with Linus, nth_page() is just nasty and we would
like to remove it.

To recap, the reason we currently need nth_page() within a folio is
because on some kernel configs (SPARSEMEM without SPARSEMEM_VMEMMAP), the
memmap is allocated per memory section.

While buddy allocations cannot cross memory section boundaries, hugetlb
and dax folios can.

So crossing a memory section means that "page++" could do the wrong thing.
Instead, nth_page() on these problematic configs always goes from
page->pfn, to the go from (++pfn)->page, which is rather nasty.

Likely, many people have no idea when nth_page() is required and when it
might be dropped.

We refer to such problematic PFN ranges and "non-contiguous pages".  If we
only deal with "contiguous pages", there is not need for nth_page().

Besides that "obvious" folio case, we might end up using nth_page() within
CMA allocations (again, could span memory sections), and in one corner
case (kfence) when processing memblock allocations (again, could span
memory sections).

So let's handle all that, add sanity checks, and remove nth_page().

Patch #1 -> #5   : stop making SPARSEMEM_VMEMMAP user-selectable + cleanups
Patch torvalds#6 -> torvalds#13  : disallow folios to have non-contiguous pages
Patch torvalds#14 -> torvalds#20 : remove nth_page() usage within folios
Patch torvalds#22        : disallow CMA allocations of non-contiguous pages
Patch torvalds#23 -> torvalds#33 : sanity+check + remove nth_page() usage within SG entry
Patch torvalds#34        : sanity-check + remove nth_page() usage in
                   unpin_user_page_range_dirty_lock()
Patch torvalds#35        : remove nth_page() in kfence
Patch torvalds#36        : adjust stale comment regarding nth_page
Patch torvalds#37        : mm: remove nth_page()

A lot of this is inspired from the discussion at [1] between Linus, Jason
and me, so cudos to them.


This patch (of 37):

In an ideal world, we wouldn't have to deal with SPARSEMEM without
SPARSEMEM_VMEMMAP, but in particular for 32bit SPARSEMEM_VMEMMAP is
considered too costly and consequently not supported.

However, if an architecture does support SPARSEMEM with SPARSEMEM_VMEMMAP,
let's forbid the user to disable VMEMMAP: just like we already do for
arm64, s390 and x86.

So if SPARSEMEM_VMEMMAP is supported, don't allow to use SPARSEMEM without
SPARSEMEM_VMEMMAP.

This implies that the option to not use SPARSEMEM_VMEMMAP will now be gone
for loongarch, powerpc, riscv and sparc.  All architectures only enable
SPARSEMEM_VMEMMAP with 64bit support, so there should not really be a big
downside to using the VMEMMAP (quite the contrary).

This is a preparation for not supporting

(1) folio sizes that exceed a single memory section

(2) CMA allocations of non-contiguous page ranges

in SPARSEMEM without SPARSEMEM_VMEMMAP configs, whereby we want to limit
possible impact as much as possible (e.g., gigantic hugetlb page
allocations suddenly fails).

Link: https://lkml.kernel.org/r/20250901150359.867252-1-david@redhat.com
Link: https://lkml.kernel.org/r/20250901150359.867252-2-david@redhat.com
Link: https://lore.kernel.org/all/CAHk-=wiCYfNp4AJLBORU-c7ZyRBUp66W2-Et6cdQ4REx-GyQ_A@mail.gmail.com/T/#u [1]
Signed-off-by: David Hildenbrand <david@redhat.com>
Acked-by: Zi Yan <ziy@nvidia.com>
Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Acked-by: SeongJae Park <sj@kernel.org>
Reviewed-by: Wei Yang <richard.weiyang@gmail.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Huacai Chen <chenhuacai@kernel.org>
Cc: WANG Xuerui <kernel@xen0n.name>
Cc: Madhavan Srinivasan <maddy@linux.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Alexandre Ghiti <alex@ghiti.fr>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Andreas Larsson <andreas@gaisler.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Alexandru Elisei <alexandru.elisei@arm.com>
Cc: Alex Dubov <oakad@yahoo.com>
Cc: Alex Willamson <alex.williamson@redhat.com>
Cc: Bart van Assche <bvanassche@acm.org>
Cc: Borislav Betkov <bp@alien8.de>
Cc: Brendan Jackman <jackmanb@google.com>
Cc: Brett Creeley <brett.creeley@amd.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Christoph Lameter (Ampere) <cl@gentwo.org>
Cc: Damien Le Maol <dlemoal@kernel.org>
Cc: Dave Airlie <airlied@gmail.com>
Cc: Dennis Zhou <dennis@kernel.org>
Cc: Dmitriy Vyukov <dvyukov@google.com>
Cc: Doug Gilbert <dgilbert@interlog.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Inki Dae <m.szyprowski@samsung.com>
Cc: James Bottomley <james.bottomley@HansenPartnership.com>
Cc: Jani Nikula <jani.nikula@linux.intel.com>
Cc: Jason A. Donenfeld <jason@zx2c4.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jesper Nilsson <jesper.nilsson@axis.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Jonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Kevin Tian <kevin.tian@intel.com>
Cc: Lars Persson <lars.persson@axis.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Marco Elver <elver@google.com>
Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
Cc: Maxim Levitky <maximlevitsky@gmail.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Niklas Cassel <cassel@kernel.org>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Pavel Begunkov <asml.silence@gmail.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Robin Murohy <robin.murphy@arm.com>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Thomas Gleinxer <tglx@linutronix.de>
Cc: Tvrtko Ursulin <tursulin@ursulin.net>
Cc: Ulf Hansson <ulf.hansson@linaro.org>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Will Deacon <will@kernel.org>
Cc: Yishai Hadas <yishaih@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
bjackman pushed a commit to bjackman/linux that referenced this pull request Sep 22, 2025
When the PAGEMAP_SCAN ioctl is invoked with vec_len = 0 reaches
pagemap_scan_backout_range(), kernel panics with null-ptr-deref:

[   44.936808] Oops: general protection fault, probably for non-canonical address 0xdffffc0000000000: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN NOPTI
[   44.937797] KASAN: null-ptr-deref in range [0x0000000000000000-0x0000000000000007]
[   44.938391] CPU: 1 UID: 0 PID: 2480 Comm: reproducer Not tainted 6.17.0-rc6 torvalds#22 PREEMPT(none)
[   44.939062] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
[   44.939935] RIP: 0010:pagemap_scan_thp_entry.isra.0+0x741/0xa80

<snip registers, unreliable trace>

[   44.946828] Call Trace:
[   44.947030]  <TASK>
[   44.949219]  pagemap_scan_pmd_entry+0xec/0xfa0
[   44.952593]  walk_pmd_range.isra.0+0x302/0x910
[   44.954069]  walk_pud_range.isra.0+0x419/0x790
[   44.954427]  walk_p4d_range+0x41e/0x620
[   44.954743]  walk_pgd_range+0x31e/0x630
[   44.955057]  __walk_page_range+0x160/0x670
[   44.956883]  walk_page_range_mm+0x408/0x980
[   44.958677]  walk_page_range+0x66/0x90
[   44.958984]  do_pagemap_scan+0x28d/0x9c0
[   44.961833]  do_pagemap_cmd+0x59/0x80
[   44.962484]  __x64_sys_ioctl+0x18d/0x210
[   44.962804]  do_syscall_64+0x5b/0x290
[   44.963111]  entry_SYSCALL_64_after_hwframe+0x76/0x7e

vec_len = 0 in pagemap_scan_init_bounce_buffer() means no buffers are
allocated and p->vec_buf remains set to NULL.

This breaks an assumption made later in pagemap_scan_backout_range(), that
page_region is always allocated for p->vec_buf_index.

Fix it by explicitly checking cur_buf for NULL before dereferencing.

Other sites that might run into same deref-issue are already (directly or
transitively) protected by checking p->vec_buf.

Note:
From PAGEMAP_SCAN man page, it seems vec_len = 0 is valid when no output
is requested and it's only the side effects caller is interested in, hence
it passes check in pagemap_scan_get_args().

This issue was found by syzkaller.

Link: https://lkml.kernel.org/r/20250919142106.43527-1-acsjakub@amazon.de
Fixes: 52526ca ("fs/proc/task_mmu: implement IOCTL to get and optionally clear info about PTEs")
Signed-off-by: Jakub Acs <acsjakub@amazon.de>
Cc: David Hildenbrand <david@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Jinjiang Tu <tujinjiang@huawei.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Penglei Jiang <superman.xpt@gmail.com>
Cc: Mark Brown <broonie@kernel.org>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Andrei Vagin <avagin@gmail.com>
Cc: "Michał Mirosław" <mirq-linux@rere.qmqm.pl>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Muhammad Usama Anjum <usama.anjum@collabora.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
intel-lab-lkp pushed a commit to intel-lab-lkp/linux that referenced this pull request Sep 22, 2025
When PAGEMAP_SCAN ioctl invoked with vec_len = 0 reaches
pagemap_scan_backout_range(), kernel panics with null-ptr-deref:

[   44.936808] Oops: general protection fault, probably for non-canonical address 0xdffffc0000000000: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN NOPTI
[   44.937797] KASAN: null-ptr-deref in range [0x0000000000000000-0x0000000000000007]
[   44.938391] CPU: 1 UID: 0 PID: 2480 Comm: reproducer Not tainted 6.17.0-rc6 torvalds#22 PREEMPT(none)
[   44.939062] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
[   44.939935] RIP: 0010:pagemap_scan_thp_entry.isra.0+0x741/0xa80

<snip registers, unreliable trace>

[   44.946828] Call Trace:
[   44.947030]  <TASK>
[   44.949219]  pagemap_scan_pmd_entry+0xec/0xfa0
[   44.952593]  walk_pmd_range.isra.0+0x302/0x910
[   44.954069]  walk_pud_range.isra.0+0x419/0x790
[   44.954427]  walk_p4d_range+0x41e/0x620
[   44.954743]  walk_pgd_range+0x31e/0x630
[   44.955057]  __walk_page_range+0x160/0x670
[   44.956883]  walk_page_range_mm+0x408/0x980
[   44.958677]  walk_page_range+0x66/0x90
[   44.958984]  do_pagemap_scan+0x28d/0x9c0
[   44.961833]  do_pagemap_cmd+0x59/0x80
[   44.962484]  __x64_sys_ioctl+0x18d/0x210
[   44.962804]  do_syscall_64+0x5b/0x290
[   44.963111]  entry_SYSCALL_64_after_hwframe+0x76/0x7e

vec_len = 0 in pagemap_scan_init_bounce_buffer() means no buffers are
allocated and p->vec_buf remains set to NULL.

This breaks an assumption made later in pagemap_scan_backout_range(),
that page_region is always allocated for p->vec_buf_index.

Fix it by explicitly checking p->vec_buf for NULL before dereferencing.

Other sites that might run into same deref-issue are already (directly
or transitively) protected by checking p->vec_buf.

Note:
From PAGEMAP_SCAN man page, it seems vec_len = 0 is valid when no output
is requested and it's only the side effects caller is interested in,
hence it passes check in pagemap_scan_get_args().

This issue was found by syzkaller.

Fixes: 52526ca ("fs/proc/task_mmu: implement IOCTL to get and optionally clear info about PTEs")
Signed-off-by: Jakub Acs <acsjakub@amazon.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Jinjiang Tu <tujinjiang@huawei.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Penglei Jiang <superman.xpt@gmail.com>
Cc: Mark Brown <broonie@kernel.org>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Andrei Vagin <avagin@gmail.com>
Cc: "Michał Mirosław" <mirq-linux@rere.qmqm.pl>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Muhammad Usama Anjum <usama.anjum@collabora.com>
Cc: linux-kernel@vger.kernel.org
Cc: linux-fsdevel@vger.kernel.org
Cc: stable@vger.kernel.org
intel-lab-lkp pushed a commit to intel-lab-lkp/linux that referenced this pull request Sep 22, 2025
When PAGEMAP_SCAN ioctl invoked with vec_len = 0 reaches
pagemap_scan_backout_range(), kernel panics with null-ptr-deref:

[   44.936808] Oops: general protection fault, probably for non-canonical address 0xdffffc0000000000: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN NOPTI
[   44.937797] KASAN: null-ptr-deref in range [0x0000000000000000-0x0000000000000007]
[   44.938391] CPU: 1 UID: 0 PID: 2480 Comm: reproducer Not tainted 6.17.0-rc6 torvalds#22 PREEMPT(none)
[   44.939062] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
[   44.939935] RIP: 0010:pagemap_scan_thp_entry.isra.0+0x741/0xa80

<snip registers, unreliable trace>

[   44.946828] Call Trace:
[   44.947030]  <TASK>
[   44.949219]  pagemap_scan_pmd_entry+0xec/0xfa0
[   44.952593]  walk_pmd_range.isra.0+0x302/0x910
[   44.954069]  walk_pud_range.isra.0+0x419/0x790
[   44.954427]  walk_p4d_range+0x41e/0x620
[   44.954743]  walk_pgd_range+0x31e/0x630
[   44.955057]  __walk_page_range+0x160/0x670
[   44.956883]  walk_page_range_mm+0x408/0x980
[   44.958677]  walk_page_range+0x66/0x90
[   44.958984]  do_pagemap_scan+0x28d/0x9c0
[   44.961833]  do_pagemap_cmd+0x59/0x80
[   44.962484]  __x64_sys_ioctl+0x18d/0x210
[   44.962804]  do_syscall_64+0x5b/0x290
[   44.963111]  entry_SYSCALL_64_after_hwframe+0x76/0x7e

vec_len = 0 in pagemap_scan_init_bounce_buffer() means no buffers are
allocated and p->vec_buf remains set to NULL.

This breaks an assumption made later in pagemap_scan_backout_range(),
that page_region is always allocated for p->vec_buf_index.

Fix it by explicitly checking p->vec_buf for NULL before dereferencing.

Other sites that might run into same deref-issue are already (directly
or transitively) protected by checking p->vec_buf.

Note:
From PAGEMAP_SCAN man page, it seems vec_len = 0 is valid when no output
is requested and it's only the side effects caller is interested in,
hence it passes check in pagemap_scan_get_args().

This issue was found by syzkaller.

Fixes: 52526ca ("fs/proc/task_mmu: implement IOCTL to get and optionally clear info about PTEs")
Signed-off-by: Jakub Acs <acsjakub@amazon.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Jinjiang Tu <tujinjiang@huawei.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Penglei Jiang <superman.xpt@gmail.com>
Cc: Mark Brown <broonie@kernel.org>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Andrei Vagin <avagin@gmail.com>
Cc: "Michał Mirosław" <mirq-linux@rere.qmqm.pl>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Muhammad Usama Anjum <usama.anjum@collabora.com>
Cc: linux-kernel@vger.kernel.org
Cc: linux-fsdevel@vger.kernel.org
Cc: stable@vger.kernel.org
bjackman pushed a commit to bjackman/linux that referenced this pull request Sep 23, 2025
When the PAGEMAP_SCAN ioctl is invoked with vec_len = 0 reaches
pagemap_scan_backout_range(), kernel panics with null-ptr-deref:

[   44.936808] Oops: general protection fault, probably for non-canonical address 0xdffffc0000000000: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN NOPTI
[   44.937797] KASAN: null-ptr-deref in range [0x0000000000000000-0x0000000000000007]
[   44.938391] CPU: 1 UID: 0 PID: 2480 Comm: reproducer Not tainted 6.17.0-rc6 torvalds#22 PREEMPT(none)
[   44.939062] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
[   44.939935] RIP: 0010:pagemap_scan_thp_entry.isra.0+0x741/0xa80

<snip registers, unreliable trace>

[   44.946828] Call Trace:
[   44.947030]  <TASK>
[   44.949219]  pagemap_scan_pmd_entry+0xec/0xfa0
[   44.952593]  walk_pmd_range.isra.0+0x302/0x910
[   44.954069]  walk_pud_range.isra.0+0x419/0x790
[   44.954427]  walk_p4d_range+0x41e/0x620
[   44.954743]  walk_pgd_range+0x31e/0x630
[   44.955057]  __walk_page_range+0x160/0x670
[   44.956883]  walk_page_range_mm+0x408/0x980
[   44.958677]  walk_page_range+0x66/0x90
[   44.958984]  do_pagemap_scan+0x28d/0x9c0
[   44.961833]  do_pagemap_cmd+0x59/0x80
[   44.962484]  __x64_sys_ioctl+0x18d/0x210
[   44.962804]  do_syscall_64+0x5b/0x290
[   44.963111]  entry_SYSCALL_64_after_hwframe+0x76/0x7e

vec_len = 0 in pagemap_scan_init_bounce_buffer() means no buffers are
allocated and p->vec_buf remains set to NULL.

This breaks an assumption made later in pagemap_scan_backout_range(), that
page_region is always allocated for p->vec_buf_index.

Fix it by explicitly checking p->vec_buf for NULL before dereferencing.

Other sites that might run into same deref-issue are already (directly or
transitively) protected by checking p->vec_buf.

Note:
From PAGEMAP_SCAN man page, it seems vec_len = 0 is valid when no output
is requested and it's only the side effects caller is interested in,
hence it passes check in pagemap_scan_get_args().

This issue was found by syzkaller.

Link: https://lkml.kernel.org/r/20250922082206.6889-1-acsjakub@amazon.de
Fixes: 52526ca ("fs/proc/task_mmu: implement IOCTL to get and optionally clear info about PTEs")
Signed-off-by: Jakub Acs <acsjakub@amazon.de>
Reviewed-by: Muhammad Usama Anjum <usama.anjum@collabora.com>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Jinjiang Tu <tujinjiang@huawei.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Penglei Jiang <superman.xpt@gmail.com>
Cc: Mark Brown <broonie@kernel.org>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Andrei Vagin <avagin@gmail.com>
Cc: "Michał Mirosław" <mirq-linux@rere.qmqm.pl>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
bjackman pushed a commit to bjackman/linux that referenced this pull request Sep 25, 2025
When the PAGEMAP_SCAN ioctl is invoked with vec_len = 0 reaches
pagemap_scan_backout_range(), kernel panics with null-ptr-deref:

[   44.936808] Oops: general protection fault, probably for non-canonical address 0xdffffc0000000000: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN NOPTI
[   44.937797] KASAN: null-ptr-deref in range [0x0000000000000000-0x0000000000000007]
[   44.938391] CPU: 1 UID: 0 PID: 2480 Comm: reproducer Not tainted 6.17.0-rc6 torvalds#22 PREEMPT(none)
[   44.939062] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
[   44.939935] RIP: 0010:pagemap_scan_thp_entry.isra.0+0x741/0xa80

<snip registers, unreliable trace>

[   44.946828] Call Trace:
[   44.947030]  <TASK>
[   44.949219]  pagemap_scan_pmd_entry+0xec/0xfa0
[   44.952593]  walk_pmd_range.isra.0+0x302/0x910
[   44.954069]  walk_pud_range.isra.0+0x419/0x790
[   44.954427]  walk_p4d_range+0x41e/0x620
[   44.954743]  walk_pgd_range+0x31e/0x630
[   44.955057]  __walk_page_range+0x160/0x670
[   44.956883]  walk_page_range_mm+0x408/0x980
[   44.958677]  walk_page_range+0x66/0x90
[   44.958984]  do_pagemap_scan+0x28d/0x9c0
[   44.961833]  do_pagemap_cmd+0x59/0x80
[   44.962484]  __x64_sys_ioctl+0x18d/0x210
[   44.962804]  do_syscall_64+0x5b/0x290
[   44.963111]  entry_SYSCALL_64_after_hwframe+0x76/0x7e

vec_len = 0 in pagemap_scan_init_bounce_buffer() means no buffers are
allocated and p->vec_buf remains set to NULL.

This breaks an assumption made later in pagemap_scan_backout_range(), that
page_region is always allocated for p->vec_buf_index.

Fix it by explicitly checking p->vec_buf for NULL before dereferencing.

Other sites that might run into same deref-issue are already (directly or
transitively) protected by checking p->vec_buf.

Note:
From PAGEMAP_SCAN man page, it seems vec_len = 0 is valid when no output
is requested and it's only the side effects caller is interested in,
hence it passes check in pagemap_scan_get_args().

This issue was found by syzkaller.

Link: https://lkml.kernel.org/r/20250922082206.6889-1-acsjakub@amazon.de
Fixes: 52526ca ("fs/proc/task_mmu: implement IOCTL to get and optionally clear info about PTEs")
Signed-off-by: Jakub Acs <acsjakub@amazon.de>
Reviewed-by: Muhammad Usama Anjum <usama.anjum@collabora.com>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Jinjiang Tu <tujinjiang@huawei.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Penglei Jiang <superman.xpt@gmail.com>
Cc: Mark Brown <broonie@kernel.org>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Andrei Vagin <avagin@gmail.com>
Cc: "Michał Mirosław" <mirq-linux@rere.qmqm.pl>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
bjackman pushed a commit to bjackman/linux that referenced this pull request Sep 26, 2025
When the PAGEMAP_SCAN ioctl is invoked with vec_len = 0 reaches
pagemap_scan_backout_range(), kernel panics with null-ptr-deref:

[   44.936808] Oops: general protection fault, probably for non-canonical address 0xdffffc0000000000: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN NOPTI
[   44.937797] KASAN: null-ptr-deref in range [0x0000000000000000-0x0000000000000007]
[   44.938391] CPU: 1 UID: 0 PID: 2480 Comm: reproducer Not tainted 6.17.0-rc6 torvalds#22 PREEMPT(none)
[   44.939062] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
[   44.939935] RIP: 0010:pagemap_scan_thp_entry.isra.0+0x741/0xa80

<snip registers, unreliable trace>

[   44.946828] Call Trace:
[   44.947030]  <TASK>
[   44.949219]  pagemap_scan_pmd_entry+0xec/0xfa0
[   44.952593]  walk_pmd_range.isra.0+0x302/0x910
[   44.954069]  walk_pud_range.isra.0+0x419/0x790
[   44.954427]  walk_p4d_range+0x41e/0x620
[   44.954743]  walk_pgd_range+0x31e/0x630
[   44.955057]  __walk_page_range+0x160/0x670
[   44.956883]  walk_page_range_mm+0x408/0x980
[   44.958677]  walk_page_range+0x66/0x90
[   44.958984]  do_pagemap_scan+0x28d/0x9c0
[   44.961833]  do_pagemap_cmd+0x59/0x80
[   44.962484]  __x64_sys_ioctl+0x18d/0x210
[   44.962804]  do_syscall_64+0x5b/0x290
[   44.963111]  entry_SYSCALL_64_after_hwframe+0x76/0x7e

vec_len = 0 in pagemap_scan_init_bounce_buffer() means no buffers are
allocated and p->vec_buf remains set to NULL.

This breaks an assumption made later in pagemap_scan_backout_range(), that
page_region is always allocated for p->vec_buf_index.

Fix it by explicitly checking p->vec_buf for NULL before dereferencing.

Other sites that might run into same deref-issue are already (directly or
transitively) protected by checking p->vec_buf.

Note:
From PAGEMAP_SCAN man page, it seems vec_len = 0 is valid when no output
is requested and it's only the side effects caller is interested in,
hence it passes check in pagemap_scan_get_args().

This issue was found by syzkaller.

Link: https://lkml.kernel.org/r/20250922082206.6889-1-acsjakub@amazon.de
Fixes: 52526ca ("fs/proc/task_mmu: implement IOCTL to get and optionally clear info about PTEs")
Signed-off-by: Jakub Acs <acsjakub@amazon.de>
Reviewed-by: Muhammad Usama Anjum <usama.anjum@collabora.com>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Jinjiang Tu <tujinjiang@huawei.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Penglei Jiang <superman.xpt@gmail.com>
Cc: Mark Brown <broonie@kernel.org>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Andrei Vagin <avagin@gmail.com>
Cc: "Michał Mirosław" <mirq-linux@rere.qmqm.pl>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
mj22226 pushed a commit to mj22226/linux that referenced this pull request Sep 29, 2025
commit 28aa299 upstream.

When the PAGEMAP_SCAN ioctl is invoked with vec_len = 0 reaches
pagemap_scan_backout_range(), kernel panics with null-ptr-deref:

[   44.936808] Oops: general protection fault, probably for non-canonical address 0xdffffc0000000000: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN NOPTI
[   44.937797] KASAN: null-ptr-deref in range [0x0000000000000000-0x0000000000000007]
[   44.938391] CPU: 1 UID: 0 PID: 2480 Comm: reproducer Not tainted 6.17.0-rc6 torvalds#22 PREEMPT(none)
[   44.939062] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
[   44.939935] RIP: 0010:pagemap_scan_thp_entry.isra.0+0x741/0xa80

<snip registers, unreliable trace>

[   44.946828] Call Trace:
[   44.947030]  <TASK>
[   44.949219]  pagemap_scan_pmd_entry+0xec/0xfa0
[   44.952593]  walk_pmd_range.isra.0+0x302/0x910
[   44.954069]  walk_pud_range.isra.0+0x419/0x790
[   44.954427]  walk_p4d_range+0x41e/0x620
[   44.954743]  walk_pgd_range+0x31e/0x630
[   44.955057]  __walk_page_range+0x160/0x670
[   44.956883]  walk_page_range_mm+0x408/0x980
[   44.958677]  walk_page_range+0x66/0x90
[   44.958984]  do_pagemap_scan+0x28d/0x9c0
[   44.961833]  do_pagemap_cmd+0x59/0x80
[   44.962484]  __x64_sys_ioctl+0x18d/0x210
[   44.962804]  do_syscall_64+0x5b/0x290
[   44.963111]  entry_SYSCALL_64_after_hwframe+0x76/0x7e

vec_len = 0 in pagemap_scan_init_bounce_buffer() means no buffers are
allocated and p->vec_buf remains set to NULL.

This breaks an assumption made later in pagemap_scan_backout_range(), that
page_region is always allocated for p->vec_buf_index.

Fix it by explicitly checking p->vec_buf for NULL before dereferencing.

Other sites that might run into same deref-issue are already (directly or
transitively) protected by checking p->vec_buf.

Note:
From PAGEMAP_SCAN man page, it seems vec_len = 0 is valid when no output
is requested and it's only the side effects caller is interested in,
hence it passes check in pagemap_scan_get_args().

This issue was found by syzkaller.

Link: https://lkml.kernel.org/r/20250922082206.6889-1-acsjakub@amazon.de
Fixes: 52526ca ("fs/proc/task_mmu: implement IOCTL to get and optionally clear info about PTEs")
Signed-off-by: Jakub Acs <acsjakub@amazon.de>
Reviewed-by: Muhammad Usama Anjum <usama.anjum@collabora.com>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Jinjiang Tu <tujinjiang@huawei.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Penglei Jiang <superman.xpt@gmail.com>
Cc: Mark Brown <broonie@kernel.org>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Andrei Vagin <avagin@gmail.com>
Cc: "Michał Mirosław" <mirq-linux@rere.qmqm.pl>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
mj22226 pushed a commit to mj22226/linux that referenced this pull request Sep 30, 2025
commit 28aa299 upstream.

When the PAGEMAP_SCAN ioctl is invoked with vec_len = 0 reaches
pagemap_scan_backout_range(), kernel panics with null-ptr-deref:

[   44.936808] Oops: general protection fault, probably for non-canonical address 0xdffffc0000000000: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN NOPTI
[   44.937797] KASAN: null-ptr-deref in range [0x0000000000000000-0x0000000000000007]
[   44.938391] CPU: 1 UID: 0 PID: 2480 Comm: reproducer Not tainted 6.17.0-rc6 torvalds#22 PREEMPT(none)
[   44.939062] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
[   44.939935] RIP: 0010:pagemap_scan_thp_entry.isra.0+0x741/0xa80

<snip registers, unreliable trace>

[   44.946828] Call Trace:
[   44.947030]  <TASK>
[   44.949219]  pagemap_scan_pmd_entry+0xec/0xfa0
[   44.952593]  walk_pmd_range.isra.0+0x302/0x910
[   44.954069]  walk_pud_range.isra.0+0x419/0x790
[   44.954427]  walk_p4d_range+0x41e/0x620
[   44.954743]  walk_pgd_range+0x31e/0x630
[   44.955057]  __walk_page_range+0x160/0x670
[   44.956883]  walk_page_range_mm+0x408/0x980
[   44.958677]  walk_page_range+0x66/0x90
[   44.958984]  do_pagemap_scan+0x28d/0x9c0
[   44.961833]  do_pagemap_cmd+0x59/0x80
[   44.962484]  __x64_sys_ioctl+0x18d/0x210
[   44.962804]  do_syscall_64+0x5b/0x290
[   44.963111]  entry_SYSCALL_64_after_hwframe+0x76/0x7e

vec_len = 0 in pagemap_scan_init_bounce_buffer() means no buffers are
allocated and p->vec_buf remains set to NULL.

This breaks an assumption made later in pagemap_scan_backout_range(), that
page_region is always allocated for p->vec_buf_index.

Fix it by explicitly checking p->vec_buf for NULL before dereferencing.

Other sites that might run into same deref-issue are already (directly or
transitively) protected by checking p->vec_buf.

Note:
From PAGEMAP_SCAN man page, it seems vec_len = 0 is valid when no output
is requested and it's only the side effects caller is interested in,
hence it passes check in pagemap_scan_get_args().

This issue was found by syzkaller.

Link: https://lkml.kernel.org/r/20250922082206.6889-1-acsjakub@amazon.de
Fixes: 52526ca ("fs/proc/task_mmu: implement IOCTL to get and optionally clear info about PTEs")
Signed-off-by: Jakub Acs <acsjakub@amazon.de>
Reviewed-by: Muhammad Usama Anjum <usama.anjum@collabora.com>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Jinjiang Tu <tujinjiang@huawei.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Penglei Jiang <superman.xpt@gmail.com>
Cc: Mark Brown <broonie@kernel.org>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Andrei Vagin <avagin@gmail.com>
Cc: "Michał Mirosław" <mirq-linux@rere.qmqm.pl>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
mj22226 pushed a commit to mj22226/linux that referenced this pull request Sep 30, 2025
commit 28aa299 upstream.

When the PAGEMAP_SCAN ioctl is invoked with vec_len = 0 reaches
pagemap_scan_backout_range(), kernel panics with null-ptr-deref:

[   44.936808] Oops: general protection fault, probably for non-canonical address 0xdffffc0000000000: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN NOPTI
[   44.937797] KASAN: null-ptr-deref in range [0x0000000000000000-0x0000000000000007]
[   44.938391] CPU: 1 UID: 0 PID: 2480 Comm: reproducer Not tainted 6.17.0-rc6 torvalds#22 PREEMPT(none)
[   44.939062] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
[   44.939935] RIP: 0010:pagemap_scan_thp_entry.isra.0+0x741/0xa80

<snip registers, unreliable trace>

[   44.946828] Call Trace:
[   44.947030]  <TASK>
[   44.949219]  pagemap_scan_pmd_entry+0xec/0xfa0
[   44.952593]  walk_pmd_range.isra.0+0x302/0x910
[   44.954069]  walk_pud_range.isra.0+0x419/0x790
[   44.954427]  walk_p4d_range+0x41e/0x620
[   44.954743]  walk_pgd_range+0x31e/0x630
[   44.955057]  __walk_page_range+0x160/0x670
[   44.956883]  walk_page_range_mm+0x408/0x980
[   44.958677]  walk_page_range+0x66/0x90
[   44.958984]  do_pagemap_scan+0x28d/0x9c0
[   44.961833]  do_pagemap_cmd+0x59/0x80
[   44.962484]  __x64_sys_ioctl+0x18d/0x210
[   44.962804]  do_syscall_64+0x5b/0x290
[   44.963111]  entry_SYSCALL_64_after_hwframe+0x76/0x7e

vec_len = 0 in pagemap_scan_init_bounce_buffer() means no buffers are
allocated and p->vec_buf remains set to NULL.

This breaks an assumption made later in pagemap_scan_backout_range(), that
page_region is always allocated for p->vec_buf_index.

Fix it by explicitly checking p->vec_buf for NULL before dereferencing.

Other sites that might run into same deref-issue are already (directly or
transitively) protected by checking p->vec_buf.

Note:
From PAGEMAP_SCAN man page, it seems vec_len = 0 is valid when no output
is requested and it's only the side effects caller is interested in,
hence it passes check in pagemap_scan_get_args().

This issue was found by syzkaller.

Link: https://lkml.kernel.org/r/20250922082206.6889-1-acsjakub@amazon.de
Fixes: 52526ca ("fs/proc/task_mmu: implement IOCTL to get and optionally clear info about PTEs")
Signed-off-by: Jakub Acs <acsjakub@amazon.de>
Reviewed-by: Muhammad Usama Anjum <usama.anjum@collabora.com>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Jinjiang Tu <tujinjiang@huawei.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Penglei Jiang <superman.xpt@gmail.com>
Cc: Mark Brown <broonie@kernel.org>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Andrei Vagin <avagin@gmail.com>
Cc: "Michał Mirosław" <mirq-linux@rere.qmqm.pl>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
intel-lab-lkp pushed a commit to intel-lab-lkp/linux that referenced this pull request Oct 1, 2025
Calling intotify_show_fdinfo() on fd watching an overlayfs inode, while
the overlayfs is being unmounted, can lead to dereferencing NULL ptr.

This issue was found by syzkaller.

Race Condition Diagram:

Thread 1                           Thread 2
--------                           --------

generic_shutdown_super()
 shrink_dcache_for_umount
  sb->s_root = NULL

                    |
                    |             vfs_read()
                    |              inotify_fdinfo()
                    |               * inode get from mark *
                    |               show_mark_fhandle(m, inode)
                    |                exportfs_encode_fid(inode, ..)
                    |                 ovl_encode_fh(inode, ..)
                    |                  ovl_check_encode_origin(inode)
                    |                   * deref i_sb->s_root *
                    |
                    |
                    v
 fsnotify_sb_delete(sb)

Which then leads to:

[   32.133461] Oops: general protection fault, probably for non-canonical address 0xdffffc0000000006: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN NOPTI
[   32.134438] KASAN: null-ptr-deref in range [0x0000000000000030-0x0000000000000037]
[   32.135032] CPU: 1 UID: 0 PID: 4468 Comm: systemd-coredum Not tainted 6.17.0-rc6 torvalds#22 PREEMPT(none)

<snip registers, unreliable trace>

[   32.143353] Call Trace:
[   32.143732]  ovl_encode_fh+0xd5/0x170
[   32.144031]  exportfs_encode_inode_fh+0x12f/0x300
[   32.144425]  show_mark_fhandle+0xbe/0x1f0
[   32.145805]  inotify_fdinfo+0x226/0x2d0
[   32.146442]  inotify_show_fdinfo+0x1c5/0x350
[   32.147168]  seq_show+0x530/0x6f0
[   32.147449]  seq_read_iter+0x503/0x12a0
[   32.148419]  seq_read+0x31f/0x410
[   32.150714]  vfs_read+0x1f0/0x9e0
[   32.152297]  ksys_read+0x125/0x240

IOW ovl_check_encode_origin derefs inode->i_sb->s_root, after it was set
to NULL in the unmount path.

Fix it by protecting calling exportfs_encode_fid() from
show_mark_fhandle() with s_umount lock.

This form of fix was suggested by Amir in [1].

[1]: https://lore.kernel.org/all/CAOQ4uxhbDwhb+2Brs1UdkoF0a3NSdBAOQPNfEHjahrgoKJpLEw@mail.gmail.com/

Fixes: c45beeb ("ovl: support encoding fid from inode with no alias")
Signed-off-by: Jakub Acs <acsjakub@amazon.de>
Cc: Jan Kara <jack@suse.cz>
Cc: Amir Goldstein <amir73il@gmail.com>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: Christian Brauner <brauner@kernel.org>
Cc: linux-unionfs@vger.kernel.org
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: stable@vger.kernel.org
mj22226 pushed a commit to mj22226/linux that referenced this pull request Oct 2, 2025
commit 28aa299 upstream.

When the PAGEMAP_SCAN ioctl is invoked with vec_len = 0 reaches
pagemap_scan_backout_range(), kernel panics with null-ptr-deref:

[   44.936808] Oops: general protection fault, probably for non-canonical address 0xdffffc0000000000: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN NOPTI
[   44.937797] KASAN: null-ptr-deref in range [0x0000000000000000-0x0000000000000007]
[   44.938391] CPU: 1 UID: 0 PID: 2480 Comm: reproducer Not tainted 6.17.0-rc6 torvalds#22 PREEMPT(none)
[   44.939062] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
[   44.939935] RIP: 0010:pagemap_scan_thp_entry.isra.0+0x741/0xa80

<snip registers, unreliable trace>

[   44.946828] Call Trace:
[   44.947030]  <TASK>
[   44.949219]  pagemap_scan_pmd_entry+0xec/0xfa0
[   44.952593]  walk_pmd_range.isra.0+0x302/0x910
[   44.954069]  walk_pud_range.isra.0+0x419/0x790
[   44.954427]  walk_p4d_range+0x41e/0x620
[   44.954743]  walk_pgd_range+0x31e/0x630
[   44.955057]  __walk_page_range+0x160/0x670
[   44.956883]  walk_page_range_mm+0x408/0x980
[   44.958677]  walk_page_range+0x66/0x90
[   44.958984]  do_pagemap_scan+0x28d/0x9c0
[   44.961833]  do_pagemap_cmd+0x59/0x80
[   44.962484]  __x64_sys_ioctl+0x18d/0x210
[   44.962804]  do_syscall_64+0x5b/0x290
[   44.963111]  entry_SYSCALL_64_after_hwframe+0x76/0x7e

vec_len = 0 in pagemap_scan_init_bounce_buffer() means no buffers are
allocated and p->vec_buf remains set to NULL.

This breaks an assumption made later in pagemap_scan_backout_range(), that
page_region is always allocated for p->vec_buf_index.

Fix it by explicitly checking p->vec_buf for NULL before dereferencing.

Other sites that might run into same deref-issue are already (directly or
transitively) protected by checking p->vec_buf.

Note:
From PAGEMAP_SCAN man page, it seems vec_len = 0 is valid when no output
is requested and it's only the side effects caller is interested in,
hence it passes check in pagemap_scan_get_args().

This issue was found by syzkaller.

Link: https://lkml.kernel.org/r/20250922082206.6889-1-acsjakub@amazon.de
Fixes: 52526ca ("fs/proc/task_mmu: implement IOCTL to get and optionally clear info about PTEs")
Signed-off-by: Jakub Acs <acsjakub@amazon.de>
Reviewed-by: Muhammad Usama Anjum <usama.anjum@collabora.com>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Jinjiang Tu <tujinjiang@huawei.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Penglei Jiang <superman.xpt@gmail.com>
Cc: Mark Brown <broonie@kernel.org>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Andrei Vagin <avagin@gmail.com>
Cc: "Michał Mirosław" <mirq-linux@rere.qmqm.pl>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
mj22226 pushed a commit to mj22226/linux that referenced this pull request Oct 2, 2025
commit 28aa299 upstream.

When the PAGEMAP_SCAN ioctl is invoked with vec_len = 0 reaches
pagemap_scan_backout_range(), kernel panics with null-ptr-deref:

[   44.936808] Oops: general protection fault, probably for non-canonical address 0xdffffc0000000000: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN NOPTI
[   44.937797] KASAN: null-ptr-deref in range [0x0000000000000000-0x0000000000000007]
[   44.938391] CPU: 1 UID: 0 PID: 2480 Comm: reproducer Not tainted 6.17.0-rc6 torvalds#22 PREEMPT(none)
[   44.939062] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
[   44.939935] RIP: 0010:pagemap_scan_thp_entry.isra.0+0x741/0xa80

<snip registers, unreliable trace>

[   44.946828] Call Trace:
[   44.947030]  <TASK>
[   44.949219]  pagemap_scan_pmd_entry+0xec/0xfa0
[   44.952593]  walk_pmd_range.isra.0+0x302/0x910
[   44.954069]  walk_pud_range.isra.0+0x419/0x790
[   44.954427]  walk_p4d_range+0x41e/0x620
[   44.954743]  walk_pgd_range+0x31e/0x630
[   44.955057]  __walk_page_range+0x160/0x670
[   44.956883]  walk_page_range_mm+0x408/0x980
[   44.958677]  walk_page_range+0x66/0x90
[   44.958984]  do_pagemap_scan+0x28d/0x9c0
[   44.961833]  do_pagemap_cmd+0x59/0x80
[   44.962484]  __x64_sys_ioctl+0x18d/0x210
[   44.962804]  do_syscall_64+0x5b/0x290
[   44.963111]  entry_SYSCALL_64_after_hwframe+0x76/0x7e

vec_len = 0 in pagemap_scan_init_bounce_buffer() means no buffers are
allocated and p->vec_buf remains set to NULL.

This breaks an assumption made later in pagemap_scan_backout_range(), that
page_region is always allocated for p->vec_buf_index.

Fix it by explicitly checking p->vec_buf for NULL before dereferencing.

Other sites that might run into same deref-issue are already (directly or
transitively) protected by checking p->vec_buf.

Note:
From PAGEMAP_SCAN man page, it seems vec_len = 0 is valid when no output
is requested and it's only the side effects caller is interested in,
hence it passes check in pagemap_scan_get_args().

This issue was found by syzkaller.

Link: https://lkml.kernel.org/r/20250922082206.6889-1-acsjakub@amazon.de
Fixes: 52526ca ("fs/proc/task_mmu: implement IOCTL to get and optionally clear info about PTEs")
Signed-off-by: Jakub Acs <acsjakub@amazon.de>
Reviewed-by: Muhammad Usama Anjum <usama.anjum@collabora.com>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Jinjiang Tu <tujinjiang@huawei.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Penglei Jiang <superman.xpt@gmail.com>
Cc: Mark Brown <broonie@kernel.org>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Andrei Vagin <avagin@gmail.com>
Cc: "Michał Mirosław" <mirq-linux@rere.qmqm.pl>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants