
Commit d61ea1c

xzpeter authored and akpm00 committed
userfaultfd: UFFD_FEATURE_WP_ASYNC
Patch series "Implement IOCTL to get and optionally clear info about PTEs", v33.

*Motivation*

The real motivation for adding the PAGEMAP_SCAN IOCTL is to emulate the Windows GetWriteWatch() and ResetWriteWatch() syscalls [1]. GetWriteWatch() retrieves the addresses of the pages that have been written to in a region of virtual memory. This syscall is used by Windows applications, games, etc., and is currently emulated in userspace in a rather slow manner. Our purpose is to enhance the kernel so that this syscall can be translated efficiently. Some out-of-tree hack patches are currently used to emulate it efficiently in some kernels; we intend to replace those with these patches. Gaming on Linux as a whole can benefit from this, so this code is expected to have many users.

The CRIU use case [2] was mentioned by Andrei and Danylo:

> Use cases for migrating sparse VMAs are binaries sanitized with ASAN,
> MSAN or TSAN [3]. All of these sanitizers produce sparse mappings of
> shadow memory [4]. Being able to migrate such binaries allows to highly
> reduce the amount of work needed to identify and fix post-migration
> crashes, which happen constantly.

Andrei defines the following uses of this code:

* It is more granular and allows us to track changed pages more effectively. The current interface can clear dirty bits only for the entire process. In addition, reading info about pages is a separate operation, so we must freeze the process to read information about all its pages and reset the dirty bits; only then can we start dumping pages. The information about pages becomes more and more outdated while we are processing them. The new interface solves both of these downsides. First, it allows us to read PTE bits and clear the soft-dirty bit atomically, so CRIU will not need to freeze processes to pre-dump their memory. Second, it clears soft-dirty bits for a specified region of memory, so CRIU will have up-to-date info about pages at the moment it dumps them.
* The new interface should be much faster because basic page filtering happens in the kernel. With the old interface, we have to read pagemap for each page.

*Implementation Evolution (Short Summary)*

From the definition of GetWriteWatch(), we felt that the kernel's soft-dirty feature could be used under the hood with some additions:

* reset the soft-dirty flag for only a specific region of memory instead of clearing the flag for the entire process
* get and clear the soft-dirty flag for a specific region atomically

So we decided to use an ioctl on the pagemap file to read and/or reset the soft-dirty flag. But with the soft-dirty flag we sometimes get extra pages that were never written. They had become soft-dirty because of VMA merging and the VM_SOFTDIRTY flag, which breaks the definition of GetWriteWatch(). We were able to bypass this shortcoming by ignoring VM_SOFTDIRTY, until David reported that mprotect() etc. mess up the soft-dirty flag while VM_SOFTDIRTY is ignored [5]. This did not happen until [6] was introduced. We discussed whether these patches could be reverted, but could not reach any conclusion. So at this point, I made a couple of attempts to solve the whole VM_SOFTDIRTY issue by correcting the soft-dirty implementation:

* [7] Correct the bug that was fixed wrongly back in 2014. It had the potential to cause regressions, so we left it behind.
* [8] Keep a list of the soft-dirty parts of a VMA across splits and merges. The reply was not to increase the size of the VMA by 8 bytes.

At this point we left soft-dirty behind, considering it too delicate, and userfaultfd [9] seemed like the only way forward. From then on, we have based the soft-dirty emulation on the userfaultfd WP feature, where the kernel resolves the faults itself when the WP_ASYNC feature is used. It was straightforward to add the WP_ASYNC feature to userfaultfd. Now only the pages that are actually written to are reported as dirty/written.
(PS: Another userfaultfd feature, WP_UNPOPULATED, is required to avoid pre-faulting memory before write-protecting [9].)

All the different masks were added at the request of the CRIU devs to make the interface more generic and better.

[1] https://learn.microsoft.com/en-us/windows/win32/api/memoryapi/nf-memoryapi-getwritewatch
[2] https://lore.kernel.org/all/20221014134802.1361436-1-mdanylo@google.com
[3] https://github.com/google/sanitizers
[4] https://github.com/google/sanitizers/wiki/AddressSanitizerAlgorithm#64-bit
[5] https://lore.kernel.org/all/bfcae708-db21-04b4-0bbe-712badd03071@redhat.com
[6] https://lore.kernel.org/all/20220725142048.30450-1-peterx@redhat.com/
[7] https://lore.kernel.org/all/20221122115007.2787017-1-usama.anjum@collabora.com
[8] https://lore.kernel.org/all/20221220162606.1595355-1-usama.anjum@collabora.com
[9] https://lore.kernel.org/all/20230306213925.617814-1-peterx@redhat.com
[10] https://lore.kernel.org/all/20230125144529.1630917-1-mdanylo@google.com

This patch (of 6):

Add a new userfaultfd-wp feature, UFFD_FEATURE_WP_ASYNC, which allows userfaultfd wr-protect faults to be resolved by the kernel directly.

It can be used like a high-accuracy version of soft-dirty, without VMA modifications during tracking, and with ranged support by default (rather than for a whole mm) when resetting the protections, thanks to the existence of ioctl(UFFDIO_WRITEPROTECT).

Several goals of such a dirty tracking interface:

1. All types of memory should be supported and trackable. This is natural for soft-dirty, but it is worth mentioning in the userfaultfd context, because userfaultfd used to support only anon/shmem/hugetlb. For dirty tracking purposes these three types may not be enough, and it is legal to track anything, e.g. any page cache writes from mmap.

2. Protections can be applied to part of a memory range, without VMA split/merge fuss. The hope is that the tracking itself should not cause any VMA layout change. It also helps when a reset happens, because the reset will not need the mmap write lock, which can block the tracee.

3. Accuracy needs to be maintained. This means we need PTE markers to work on any type of VMA.

One could argue that the whole concept of async dirty tracking is not fundamentally close to what userfaultfd used to be: it is no longer "a fault to be serviced by userspace". However, using userfaultfd-wp as a framework here is convenient for us in at least these ways:

1. The VM_UFFD_WP VMA flag has a very good name to suit something like this, so we don't need VM_YET_ANOTHER_SOFT_DIRTY. We just use a new feature bit to distinguish it from a sync uffd-wp registration.

2. The PTE marker logic can be leveraged across the whole kernel to maintain the uffd-wp bit as long as an arch supports it. This also applies here, where the uffd-wp bit is a hint for dirty information and will not get lost easily (e.g. when some page cache PTEs get zapped).

3. The ioctl(UFFDIO_WRITEPROTECT) interface can be reused for either starting or resetting tracking on a range of memory. There is no counterpart in the old soft-dirty world, so if this were wanted in a new design we would need a new interface.

This commonality makes sense, because uffd-wp is fundamentally a similar idea to soft-dirty: write-protecting pages.

This implementation allows WP_ASYNC to imply WP_UNPOPULATED, because so far WP_ASYNC does not seem usable without WP_UNPOPULATED. This also gives us a chance to change the WP_ASYNC implementation in case it no longer depends on WP_UNPOPULATED in future kernels. Implying it is also fine because both features rely on the PTE_MARKER_UFFD_WP config option, so they will show up together (or both be missing) in a UFFDIO_API probe.

vma_can_userfault() now allows any VMA if the userfaultfd registration is only about async uffd-wp. So we can track dirty pages for all kinds of memory, including generic file systems (like XFS, EXT4 or BTRFS).
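A minimal userspace sketch of such a UFFDIO_API probe follows, assuming a Linux host whose `<linux/userfaultfd.h>` provides `struct uffdio_api` and `UFFDIO_API`. `probe_uffd_wp_features()` and `struct uffd_wp_probe` are hypothetical names. The fallback values for `UFFD_FEATURE_WP_ASYNC` and `UFFD_FEATURE_WP_UNPOPULATED` are copied from the uapi hunk of this commit, and `UFFD_USER_MODE_ONLY` is used only so that an unprivileged process can open a userfaultfd when `vm.unprivileged_userfaultfd` is 0. Passing `features = 0` requests nothing and simply reads back what the kernel advertises.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <linux/userfaultfd.h>
#include <stdbool.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Fallbacks for older uapi headers (bit values as in the uapi hunk). */
#ifndef UFFD_FEATURE_WP_UNPOPULATED
#define UFFD_FEATURE_WP_UNPOPULATED (1 << 13)
#endif
#ifndef UFFD_FEATURE_WP_ASYNC
#define UFFD_FEATURE_WP_ASYNC (1 << 15)
#endif
#ifndef UFFD_USER_MODE_ONLY
#define UFFD_USER_MODE_ONLY 1
#endif

/* Hypothetical result type for this sketch. */
struct uffd_wp_probe {
	bool wp_async;
	bool wp_unpopulated;
};

/*
 * Ask the running kernel which features it advertises.
 * Returns 0 on success, or -errno if userfaultfd is unavailable
 * (old kernel, seccomp filter, sysctl, CONFIG_USERFAULTFD=n, ...).
 */
static int probe_uffd_wp_features(struct uffd_wp_probe *out)
{
	struct uffdio_api api = { .api = UFFD_API, .features = 0 };
	int ret = 0;
	int fd = (int)syscall(__NR_userfaultfd,
			      O_CLOEXEC | O_NONBLOCK | UFFD_USER_MODE_ONLY);

	if (fd < 0)
		return -errno;

	/* features = 0: request nothing, just read back what is supported. */
	if (ioctl(fd, UFFDIO_API, &api) < 0) {
		ret = -errno;
	} else {
		out->wp_async = (api.features & UFFD_FEATURE_WP_ASYNC) != 0;
		out->wp_unpopulated = (api.features & UFFD_FEATURE_WP_UNPOPULATED) != 0;
	}
	close(fd);
	return ret;
}
```

If the probe succeeds, `wp_async` should never be true while `wp_unpopulated` is false, because both bits are masked out together when `CONFIG_PTE_MARKER_UFFD_WP` is not set.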
One trick worth mentioning in do_wp_page() is that we need to manually update vmf->orig_pte here, because it can be used later with a pte_same() check; this path always has FAULT_FLAG_ORIG_PTE_VALID set in the flags.

The major defect of this approach to dirty tracking is that we need to populate the page tables when tracking starts. Soft-dirty does not work like that. This is unwanted when the range of memory to track is huge and unpopulated (e.g., tracking updates on a 10G file with mmap() on top, without any page cache installed yet). One way to improve this is to allow PTE markers to exist at levels larger than PTE, i.e. PMD and above. That would not change the interface if implemented, so we can leave it for later.

Link: https://lkml.kernel.org/r/20230821141518.870589-1-usama.anjum@collabora.com
Link: https://lkml.kernel.org/r/20230821141518.870589-2-usama.anjum@collabora.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Co-developed-by: Muhammad Usama Anjum <usama.anjum@collabora.com>
Signed-off-by: Muhammad Usama Anjum <usama.anjum@collabora.com>
Cc: Alex Sierra <alex.sierra@amd.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrei Vagin <avagin@gmail.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Cyrill Gorcunov <gorcunov@gmail.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Gustavo A. R. Silva <gustavoars@kernel.org>
Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michal Miroslaw <emmir@google.com>
Cc: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Nadav Amit <namit@vmware.com>
Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
Cc: Paul Gofman <pgofman@codeweavers.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Yun Zhou <yun.zhou@windriver.com>
Cc: Michał Mirosław <mirq-linux@rere.qmqm.pl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
1 parent 7bd5bc3 commit d61ea1c


6 files changed (+129 / -22 lines)


Documentation/admin-guide/mm/userfaultfd.rst

Lines changed: 35 additions & 0 deletions
```diff
@@ -244,6 +244,41 @@ write-protected (so future writes will also result in a WP fault). These ioctls
 support a mode flag (``UFFDIO_COPY_MODE_WP`` or ``UFFDIO_CONTINUE_MODE_WP``
 respectively) to configure the mapping this way.
 
+If the userfaultfd context has ``UFFD_FEATURE_WP_ASYNC`` feature bit set,
+any vma registered with write-protection will work in async mode rather
+than the default sync mode.
+
+In async mode, there will be no message generated when a write operation
+happens, meanwhile the write-protection will be resolved automatically by
+the kernel.  It can be seen as a more accurate version of soft-dirty
+tracking and it can be different in a few ways:
+
+  - The dirty result will not be affected by vma changes (e.g. vma
+    merging) because the dirty is only tracked by the pte.
+
+  - It supports range operations by default, so one can enable tracking on
+    any range of memory as long as page aligned.
+
+  - Dirty information will not get lost if the pte was zapped due to
+    various reasons (e.g. during split of a shmem transparent huge page).
+
+  - Due to a reverted meaning of soft-dirty (page clean when uffd-wp bit
+    set; dirty when uffd-wp bit cleared), it has different semantics on
+    some of the memory operations.  For example: ``MADV_DONTNEED`` on
+    anonymous (or ``MADV_REMOVE`` on a file mapping) will be treated as
+    dirtying of memory by dropping uffd-wp bit during the procedure.
+
+The user app can collect the "written/dirty" status by looking up the
+uffd-wp bit for the pages being interested in /proc/pagemap.
+
+The page will not be under track of uffd-wp async mode until the page is
+explicitly write-protected by ``ioctl(UFFDIO_WRITEPROTECT)`` with the mode
+flag ``UFFDIO_WRITEPROTECT_MODE_WP`` set.  Trying to resolve a page fault
+that was tracked by async mode userfaultfd-wp is invalid.
+
+When userfaultfd-wp async mode is used alone, it can be applied to all
+kinds of memory.
+
 Memory Poisioning Emulation
 ---------------------------
 
```

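The workflow described in this documentation (enable async mode, register with write-protect mode, protect the range, then read the uffd-wp bit from /proc/pagemap) can be sketched in userspace as follows. This is a minimal sketch under several assumptions: a Linux host whose `<linux/userfaultfd.h>` provides the `UFFDIO_*` ioctls; fallback values for `UFFD_FEATURE_WP_ASYNC` (from this commit's uapi hunk) and `UFFD_USER_MODE_ONLY` (from the upstream uapi header); and bit 57 as the pagemap "pte is uffd-wp write-protected" bit, as documented in the kernel's pagemap documentation. `demo_wp_async_tracking()`, `read_pagemap()`, `pagemap_entry_is_uffd_wp()` and `PAGEMAP_UFFD_WP` are hypothetical names. The page is populated before protection only to keep the example simple.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <linux/userfaultfd.h>
#include <stdbool.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Fallbacks for uapi headers older than this commit. */
#ifndef UFFD_FEATURE_WP_ASYNC
#define UFFD_FEATURE_WP_ASYNC (1 << 15)
#endif
#ifndef UFFD_USER_MODE_ONLY
#define UFFD_USER_MODE_ONLY 1
#endif

/* Bit 57 of a /proc/<pid>/pagemap entry: PTE is uffd-wp write-protected. */
#define PAGEMAP_UFFD_WP (1ULL << 57)

static bool pagemap_entry_is_uffd_wp(uint64_t entry)
{
	return (entry & PAGEMAP_UFFD_WP) != 0;
}

/* Read the 64-bit pagemap entry covering addr; returns 0 or -errno. */
static int read_pagemap(int pm_fd, const void *addr, long psz, uint64_t *entry)
{
	off_t off = (off_t)((uintptr_t)addr / (uintptr_t)psz * sizeof(*entry));
	ssize_t n = pread(pm_fd, entry, sizeof(*entry), off);

	if (n < 0)
		return -errno;
	return n == (ssize_t)sizeof(*entry) ? 0 : -EIO;
}

/*
 * Track one anonymous page with async uffd-wp. Returns 1 if the write was
 * observed (uffd-wp set after protecting, cleared after writing), 0 if not,
 * or -errno when the kernel or sandbox does not support this setup.
 */
static int demo_wp_async_tracking(void)
{
	const long psz = sysconf(_SC_PAGESIZE);
	struct uffdio_api api = { .api = UFFD_API, .features = UFFD_FEATURE_WP_ASYNC };
	struct uffdio_register reg = { .mode = UFFDIO_REGISTER_MODE_WP };
	struct uffdio_writeprotect wp = { .mode = UFFDIO_WRITEPROTECT_MODE_WP };
	uint64_t before = 0, after = 0;
	char *page = MAP_FAILED;
	int pm_fd = -1, ret;
	int uffd = (int)syscall(__NR_userfaultfd,
				O_CLOEXEC | O_NONBLOCK | UFFD_USER_MODE_ONLY);

	if (uffd < 0)
		return -errno;
	/* 1. Enable async mode; the kernel also turns on WP_UNPOPULATED. */
	if (ioctl(uffd, UFFDIO_API, &api) < 0)
		goto fail;
	page = mmap(NULL, (size_t)psz, PROT_READ | PROT_WRITE,
		    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (page == MAP_FAILED)
		goto fail;
	*(volatile char *)page = 0; /* populate the page so its PTE is present */

	/* 2. Register the range with write-protect mode. */
	reg.range.start = (uintptr_t)page;
	reg.range.len = (uint64_t)psz;
	if (ioctl(uffd, UFFDIO_REGISTER, &reg) < 0)
		goto fail;
	/* 3. Start (or reset) tracking: the page now reads as "clean". */
	wp.range = reg.range;
	if (ioctl(uffd, UFFDIO_WRITEPROTECT, &wp) < 0)
		goto fail;
	pm_fd = open("/proc/self/pagemap", O_RDONLY | O_CLOEXEC);
	if (pm_fd < 0)
		goto fail;
	if ((ret = read_pagemap(pm_fd, page, psz, &before)) < 0)
		goto out;

	/* 4. Write: the kernel resolves the WP fault; no message is queued. */
	*(volatile char *)page = 1;

	/* 5. A cleared uffd-wp bit means "written since the last reset". */
	if ((ret = read_pagemap(pm_fd, page, psz, &after)) < 0)
		goto out;
	ret = pagemap_entry_is_uffd_wp(before) && !pagemap_entry_is_uffd_wp(after);
	goto out;
fail:
	ret = -errno;
out:
	if (pm_fd >= 0)
		close(pm_fd);
	if (page != MAP_FAILED)
		munmap(page, (size_t)psz);
	close(uffd);
	return ret;
}
```

A negative return usually means the environment (kernel version, a seccomp profile, `vm.unprivileged_userfaultfd`, or a missing `/proc`) refused part of the setup, which is common inside containers.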
fs/userfaultfd.c

Lines changed: 22 additions & 4 deletions
```diff
@@ -123,6 +123,11 @@ static bool userfaultfd_is_initialized(struct userfaultfd_ctx *ctx)
 	return ctx->features & UFFD_FEATURE_INITIALIZED;
 }
 
+static bool userfaultfd_wp_async_ctx(struct userfaultfd_ctx *ctx)
+{
+	return ctx && (ctx->features & UFFD_FEATURE_WP_ASYNC);
+}
+
 /*
  * Whether WP_UNPOPULATED is enabled on the uffd context.  It is only
  * meaningful when userfaultfd_wp()==true on the vma and when it's
@@ -1325,6 +1330,7 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx,
 	bool basic_ioctls;
 	unsigned long start, end, vma_end;
 	struct vma_iterator vmi;
+	bool wp_async = userfaultfd_wp_async_ctx(ctx);
 	pgoff_t pgoff;
 
 	user_uffdio_register = (struct uffdio_register __user *) arg;
@@ -1399,7 +1405,7 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx,
 
 		/* check not compatible vmas */
 		ret = -EINVAL;
-		if (!vma_can_userfault(cur, vm_flags))
+		if (!vma_can_userfault(cur, vm_flags, wp_async))
 			goto out_unlock;
 
 		/*
@@ -1460,7 +1466,7 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx,
 	for_each_vma_range(vmi, vma, end) {
 		cond_resched();
 
-		BUG_ON(!vma_can_userfault(vma, vm_flags));
+		BUG_ON(!vma_can_userfault(vma, vm_flags, wp_async));
 		BUG_ON(vma->vm_userfaultfd_ctx.ctx &&
 		       vma->vm_userfaultfd_ctx.ctx != ctx);
 		WARN_ON(!(vma->vm_flags & VM_MAYWRITE));
@@ -1561,6 +1567,7 @@ static int userfaultfd_unregister(struct userfaultfd_ctx *ctx,
 	unsigned long start, end, vma_end;
 	const void __user *buf = (void __user *)arg;
 	struct vma_iterator vmi;
+	bool wp_async = userfaultfd_wp_async_ctx(ctx);
 	pgoff_t pgoff;
 
 	ret = -EFAULT;
@@ -1615,7 +1622,7 @@ static int userfaultfd_unregister(struct userfaultfd_ctx *ctx,
 		 * provides for more strict behavior to notice
 		 * unregistration errors.
 		 */
-		if (!vma_can_userfault(cur, cur->vm_flags))
+		if (!vma_can_userfault(cur, cur->vm_flags, wp_async))
 			goto out_unlock;
 
 		found = true;
@@ -1631,7 +1638,7 @@ static int userfaultfd_unregister(struct userfaultfd_ctx *ctx,
 	for_each_vma_range(vmi, vma, end) {
 		cond_resched();
 
-		BUG_ON(!vma_can_userfault(vma, vma->vm_flags));
+		BUG_ON(!vma_can_userfault(vma, vma->vm_flags, wp_async));
 
 		/*
 		 * Nothing to do: this vma is already registered into this
@@ -2018,6 +2025,11 @@ static inline int userfaultfd_poison(struct userfaultfd_ctx *ctx, unsigned long
 	return ret;
 }
 
+bool userfaultfd_wp_async(struct vm_area_struct *vma)
+{
+	return userfaultfd_wp_async_ctx(vma->vm_userfaultfd_ctx.ctx);
+}
+
 static inline unsigned int uffd_ctx_features(__u64 user_features)
 {
 	/*
@@ -2051,6 +2063,11 @@ static int userfaultfd_api(struct userfaultfd_ctx *ctx,
 	ret = -EPERM;
 	if ((features & UFFD_FEATURE_EVENT_FORK) && !capable(CAP_SYS_PTRACE))
 		goto err_out;
+
+	/* WP_ASYNC relies on WP_UNPOPULATED, choose it unconditionally */
+	if (features & UFFD_FEATURE_WP_ASYNC)
+		features |= UFFD_FEATURE_WP_UNPOPULATED;
+
 	/* report all available features and ioctls to userland */
 	uffdio_api.features = UFFD_API_FEATURES;
 #ifndef CONFIG_HAVE_ARCH_USERFAULTFD_MINOR
@@ -2063,6 +2080,7 @@ static int userfaultfd_api(struct userfaultfd_ctx *ctx,
 #ifndef CONFIG_PTE_MARKER_UFFD_WP
 	uffdio_api.features &= ~UFFD_FEATURE_WP_HUGETLBFS_SHMEM;
 	uffdio_api.features &= ~UFFD_FEATURE_WP_UNPOPULATED;
+	uffdio_api.features &= ~UFFD_FEATURE_WP_ASYNC;
 #endif
 	uffdio_api.ioctls = UFFD_API_IOCTLS;
 	ret = -EFAULT;
```

include/linux/userfaultfd_k.h

Lines changed: 20 additions & 1 deletion
```diff
@@ -161,11 +161,22 @@ static inline bool userfaultfd_armed(struct vm_area_struct *vma)
 }
 
 static inline bool vma_can_userfault(struct vm_area_struct *vma,
-				     unsigned long vm_flags)
+				     unsigned long vm_flags,
+				     bool wp_async)
 {
+	vm_flags &= __VM_UFFD_FLAGS;
+
 	if ((vm_flags & VM_UFFD_MINOR) &&
 	    (!is_vm_hugetlb_page(vma) && !vma_is_shmem(vma)))
 		return false;
+
+	/*
+	 * If wp async enabled, and WP is the only mode enabled, allow any
+	 * memory type.
+	 */
+	if (wp_async && (vm_flags == VM_UFFD_WP))
+		return true;
+
 #ifndef CONFIG_PTE_MARKER_UFFD_WP
 	/*
 	 * If user requested uffd-wp but not enabled pte markers for
@@ -175,6 +186,8 @@ static inline bool vma_can_userfault(struct vm_area_struct *vma,
 	if ((vm_flags & VM_UFFD_WP) && !vma_is_anonymous(vma))
 		return false;
 #endif
+
+	/* By default, allow any of anon|shmem|hugetlb */
 	return vma_is_anonymous(vma) || is_vm_hugetlb_page(vma) ||
 	    vma_is_shmem(vma);
 }
@@ -197,6 +210,7 @@ extern int userfaultfd_unmap_prep(struct vm_area_struct *vma,
 extern void userfaultfd_unmap_complete(struct mm_struct *mm,
 				       struct list_head *uf);
 extern bool userfaultfd_wp_unpopulated(struct vm_area_struct *vma);
+extern bool userfaultfd_wp_async(struct vm_area_struct *vma);
 
 #else /* CONFIG_USERFAULTFD */
 
@@ -297,6 +311,11 @@ static inline bool userfaultfd_wp_unpopulated(struct vm_area_struct *vma)
 	return false;
 }
 
+static inline bool userfaultfd_wp_async(struct vm_area_struct *vma)
+{
+	return false;
+}
+
 #endif /* CONFIG_USERFAULTFD */
 
 static inline bool userfaultfd_wp_use_markers(struct vm_area_struct *vma)
```
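To make the new decision order in `vma_can_userfault()` easier to check, here is a small user-space model of it. Everything in it is hypothetical and it is not kernel code: the `M_*` flag values, `enum m_vma_kind` and `model_vma_can_userfault()` stand in for the kernel's `VM_UFFD_*` bits, its VMA type helpers, and `CONFIG_PTE_MARKER_UFFD_WP` (the `pte_markers` parameter). It also illustrates why the new `vm_flags &= __VM_UFFD_FLAGS` line matters: the unregister path passes the VMA's full `cur->vm_flags`, so without the mask an unrelated bit would make the `vm_flags == VM_UFFD_WP` test fail.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag values for this sketch; the kernel's VM_* bits differ. */
#define M_VM_READ         0x01UL /* stands in for any unrelated VMA flag */
#define M_VM_UFFD_MISSING 0x10UL
#define M_VM_UFFD_WP      0x20UL
#define M_VM_UFFD_MINOR   0x40UL
#define M_UFFD_FLAGS (M_VM_UFFD_MISSING | M_VM_UFFD_WP | M_VM_UFFD_MINOR)

/* M_FILE models a generic file-backed VMA, e.g. ext4/xfs page cache. */
enum m_vma_kind { M_ANON, M_SHMEM, M_HUGETLB, M_FILE };

/* Same check order as vma_can_userfault() in the hunk above. */
static bool model_vma_can_userfault(enum m_vma_kind kind, unsigned long vm_flags,
				    bool wp_async, bool pte_markers)
{
	/* Keep only uffd bits so that "WP only" compares equal below. */
	vm_flags &= M_UFFD_FLAGS;

	/* MINOR mode still requires shmem or hugetlb. */
	if ((vm_flags & M_VM_UFFD_MINOR) && kind != M_HUGETLB && kind != M_SHMEM)
		return false;

	/* Async WP as the only mode: any memory type is accepted. */
	if (wp_async && vm_flags == M_VM_UFFD_WP)
		return true;

	/* Without PTE markers, WP is limited to anonymous memory. */
	if (!pte_markers && (vm_flags & M_VM_UFFD_WP) && kind != M_ANON)
		return false;

	/* Default: anon, shmem or hugetlb only. */
	return kind == M_ANON || kind == M_SHMEM || kind == M_HUGETLB;
}
```

With these definitions, a file-backed VMA registered only for WP in async mode passes (even when unrelated VMA bits are present), while the same VMA in sync mode, or with MISSING mode added, is rejected.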

include/uapi/linux/userfaultfd.h

Lines changed: 8 additions & 1 deletion
```diff
@@ -40,7 +40,8 @@
 			   UFFD_FEATURE_EXACT_ADDRESS |		\
 			   UFFD_FEATURE_WP_HUGETLBFS_SHMEM |	\
 			   UFFD_FEATURE_WP_UNPOPULATED |	\
-			   UFFD_FEATURE_POISON)
+			   UFFD_FEATURE_POISON |		\
+			   UFFD_FEATURE_WP_ASYNC)
 #define UFFD_API_IOCTLS				\
 	((__u64)1 << _UFFDIO_REGISTER |		\
 	 (__u64)1 << _UFFDIO_UNREGISTER |	\
@@ -216,6 +217,11 @@ struct uffdio_api {
 	 * (i.e. empty ptes).  This will be the default behavior for shmem
 	 * & hugetlbfs, so this flag only affects anonymous memory behavior
 	 * when userfault write-protection mode is registered.
+	 *
+	 * UFFD_FEATURE_WP_ASYNC indicates that userfaultfd write-protection
+	 * asynchronous mode is supported in which the write fault is
+	 * automatically resolved and write-protection is un-set.
+	 * It implies UFFD_FEATURE_WP_UNPOPULATED.
 	 */
 #define UFFD_FEATURE_PAGEFAULT_FLAG_WP		(1<<0)
 #define UFFD_FEATURE_EVENT_FORK			(1<<1)
@@ -232,6 +238,7 @@ struct uffdio_api {
 #define UFFD_FEATURE_WP_HUGETLBFS_SHMEM		(1<<12)
 #define UFFD_FEATURE_WP_UNPOPULATED		(1<<13)
 #define UFFD_FEATURE_POISON			(1<<14)
+#define UFFD_FEATURE_WP_ASYNC			(1<<15)
 	__u64 features;
 
 	__u64 ioctls;
```

mm/hugetlb.c

Lines changed: 19 additions & 13 deletions
```diff
@@ -6247,21 +6247,27 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	/* Handle userfault-wp first, before trying to lock more pages */
 	if (userfaultfd_wp(vma) && huge_pte_uffd_wp(huge_ptep_get(ptep)) &&
 	    (flags & FAULT_FLAG_WRITE) && !huge_pte_write(entry)) {
-		struct vm_fault vmf = {
-			.vma = vma,
-			.address = haddr,
-			.real_address = address,
-			.flags = flags,
-		};
+		if (!userfaultfd_wp_async(vma)) {
+			struct vm_fault vmf = {
+				.vma = vma,
+				.address = haddr,
+				.real_address = address,
+				.flags = flags,
+			};
 
-		spin_unlock(ptl);
-		if (pagecache_folio) {
-			folio_unlock(pagecache_folio);
-			folio_put(pagecache_folio);
+			spin_unlock(ptl);
+			if (pagecache_folio) {
+				folio_unlock(pagecache_folio);
+				folio_put(pagecache_folio);
+			}
+			hugetlb_vma_unlock_read(vma);
+			mutex_unlock(&hugetlb_fault_mutex_table[hash]);
+			return handle_userfault(&vmf, VM_UFFD_WP);
 		}
-		hugetlb_vma_unlock_read(vma);
-		mutex_unlock(&hugetlb_fault_mutex_table[hash]);
-		return handle_userfault(&vmf, VM_UFFD_WP);
+
+		entry = huge_pte_clear_uffd_wp(entry);
+		set_huge_pte_at(mm, haddr, ptep, entry);
+		/* Fallthrough to CoW */
 	}
 
 	/*
```

mm/memory.c

Lines changed: 25 additions & 3 deletions
```diff
@@ -1,3 +1,4 @@
+
 // SPDX-License-Identifier: GPL-2.0-only
 /*
  *  linux/mm/memory.c
@@ -3349,11 +3350,28 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf)
 	const bool unshare = vmf->flags & FAULT_FLAG_UNSHARE;
 	struct vm_area_struct *vma = vmf->vma;
 	struct folio *folio = NULL;
+	pte_t pte;
 
 	if (likely(!unshare)) {
 		if (userfaultfd_pte_wp(vma, ptep_get(vmf->pte))) {
-			pte_unmap_unlock(vmf->pte, vmf->ptl);
-			return handle_userfault(vmf, VM_UFFD_WP);
+			if (!userfaultfd_wp_async(vma)) {
+				pte_unmap_unlock(vmf->pte, vmf->ptl);
+				return handle_userfault(vmf, VM_UFFD_WP);
+			}
+
+			/*
+			 * Nothing needed (cache flush, TLB invalidations,
+			 * etc.) because we're only removing the uffd-wp bit,
+			 * which is completely invisible to the user.
+			 */
+			pte = pte_clear_uffd_wp(ptep_get(vmf->pte));
+
+			set_pte_at(vma->vm_mm, vmf->address, vmf->pte, pte);
+			/*
+			 * Update this to be prepared for following up CoW
+			 * handling
+			 */
+			vmf->orig_pte = pte;
 		}
 
 		/*
@@ -4879,8 +4897,11 @@ static inline vm_fault_t wp_huge_pmd(struct vm_fault *vmf)
 
 	if (vma_is_anonymous(vma)) {
 		if (likely(!unshare) &&
-		    userfaultfd_huge_pmd_wp(vma, vmf->orig_pmd))
+		    userfaultfd_huge_pmd_wp(vma, vmf->orig_pmd)) {
+			if (userfaultfd_wp_async(vmf->vma))
+				goto split;
 			return handle_userfault(vmf, VM_UFFD_WP);
+		}
 		return do_huge_pmd_wp_page(vmf);
 	}
 
@@ -4892,6 +4913,7 @@ static inline vm_fault_t wp_huge_pmd(struct vm_fault *vmf)
 		}
 	}
 
+split:
 	/* COW or write-notify handled on pte level: split pmd. */
 	__split_huge_pmd(vma, vmf->pmd, vmf->address, false, NULL);
 
```
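The commit message notes that `do_wp_page()` must refresh `vmf->orig_pte` after clearing the uffd-wp bit. The following user-space model (not kernel code; `struct model_pte`, `enum model_result` and `model_write_fault()` are hypothetical) illustrates one reason. Roughly like `wp_page_reuse()`, which derives the new writable PTE from `vmf->orig_pte`, the model builds the new PTE from the snapshot taken at fault entry, so a stale snapshot would reinstall the uffd-wp bit on a now-writable PTE. The `pte_same()` recheck and the THP `split:` path are not modeled; the hugetlb hunk follows the same idea by updating `entry` before falling through to CoW.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified PTE holding only the bits this sketch needs. */
struct model_pte {
	bool writable;
	bool uffd_wp;
};

enum model_result {
	MODEL_TO_USERSPACE, /* sync mode: handle_userfault(vmf, VM_UFFD_WP) */
	MODEL_RESOLVED,     /* the kernel finished the write fault itself */
};

/*
 * Model of a write fault on a PTE after this patch. 'refresh_orig'
 * toggles the "vmf->orig_pte = pte" step so its effect can be observed.
 */
static enum model_result model_write_fault(struct model_pte *ptep,
					   bool wp_async, bool refresh_orig)
{
	/* Snapshot taken when the fault started, like vmf->orig_pte. */
	struct model_pte orig = *ptep;

	if (ptep->uffd_wp) {
		if (!wp_async)
			return MODEL_TO_USERSPACE;
		/* Async: drop only the uffd-wp bit (pte_clear_uffd_wp + set_pte_at). */
		ptep->uffd_wp = false;
		if (refresh_orig)
			orig = *ptep;
	}

	/* Reuse step: build the new PTE from the snapshot, make it writable. */
	struct model_pte next = orig;
	next.writable = true;
	*ptep = next;
	return MODEL_RESOLVED;
}
```

Without the refresh, the model ends with a PTE that is writable yet still carries uffd-wp, so pagemap would keep reporting the page as clean even though it was written.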