@@ -23,177 +23,166 @@ applications can additionally seal security critical data at runtime.
2323A similar feature already exists in the XNU kernel with the
2424VM_FLAGS_PERMANENT flag [1] and on OpenBSD with the mimmutable syscall [2].
2525
26- User API
27- ========
28- mseal()
29- -----------
30- The mseal() syscall has the following signature:
31-
32- ``int mseal(void addr, size_t len, unsigned long flags) ``
33-
34- **addr/len **: virtual memory address range.
35-
36- The address range set by ``addr ``/``len `` must meet:
37- - The start address must be in an allocated VMA.
38- - The start address must be page aligned.
39- - The end address (``addr `` + ``len ``) must be in an allocated VMA.
40- - no gap (unallocated memory) between start and end address.
41-
42- The ``len `` will be paged aligned implicitly by the kernel.
43-
44- **flags **: reserved for future use.
45-
46- **return values **:
47-
48- - ``0 ``: Success.
49-
50- - ``-EINVAL ``:
51- - Invalid input ``flags ``.
52- - The start address (``addr ``) is not page aligned.
53- - Address range (``addr `` + ``len ``) overflow.
54-
55- - ``-ENOMEM ``:
56- - The start address (``addr ``) is not allocated.
57- - The end address (``addr `` + ``len ``) is not allocated.
58- - A gap (unallocated memory) between start and end address.
59-
60- - ``-EPERM ``:
61- - sealing is supported only on 64-bit CPUs, 32-bit is not supported.
62-
63- - For above error cases, users can expect the given memory range is
64- unmodified, i.e. no partial update.
65-
66- - There might be other internal errors/cases not listed here, e.g.
67- error during merging/splitting VMAs, or the process reaching the max
68- number of supported VMAs. In those cases, partial updates to the given
69- memory range could happen. However, those cases should be rare.
70-
71- **Blocked operations after sealing **:
72- Unmapping, moving to another location, and shrinking the size,
73- via munmap() and mremap(), can leave an empty space, therefore
74- can be replaced with a VMA with a new set of attributes.
75-
76- Moving or expanding a different VMA into the current location,
77- via mremap().
78-
79- Modifying a VMA via mmap(MAP_FIXED).
80-
81- Size expansion, via mremap(), does not appear to pose any
82- specific risks to sealed VMAs. It is included anyway because
83- the use case is unclear. In any case, users can rely on
84- merging to expand a sealed VMA.
85-
86- mprotect() and pkey_mprotect().
87-
88- Some destructive madvice() behaviors (e.g. MADV_DONTNEED)
89- for anonymous memory, when users don't have write permission to the
90- memory. Those behaviors can alter region contents by discarding pages,
91- effectively a memset(0) for anonymous memory.
92-
93- Kernel will return -EPERM for blocked operations.
94-
95- For blocked operations, one can expect the given address is unmodified,
96- i.e. no partial update. Note, this is different from existing mm
97- system call behaviors, where partial updates are made till an error is
98- found and returned to userspace. To give an example:
99-
100- Assume following code sequence:
101-
102- - ptr = mmap(null, 8192, PROT_NONE);
103- - munmap(ptr + 4096, 4096);
104- - ret1 = mprotect(ptr, 8192, PROT_READ);
105- - mseal(ptr, 4096);
106- - ret2 = mprotect(ptr, 8192, PROT_NONE);
107-
108- ret1 will be -ENOMEM, the page from ptr is updated to PROT_READ.
109-
110- ret2 will be -EPERM, the page remains to be PROT_READ.
111-
112- **Note **:
113-
114- - mseal() only works on 64-bit CPUs, not 32-bit CPU.
115-
116- - users can call mseal() multiple times, mseal() on an already sealed memory
117- is a no-action (not error).
118-
119- - munseal() is not supported.
120-
121- Use cases:
122- ==========
26+ SYSCALL
27+ =======
28+ mseal syscall signature
29+ -----------------------
30+ ``int mseal(void \*addr, size_t len, unsigned long flags)``
31+
32+ **addr**/**len**: virtual memory address range.
33+ The address range set by **addr**/**len** must meet:
34+ - The start address must be in an allocated VMA.
35+ - The start address must be page aligned.
36+ - The end address (**addr** + **len**) must be in an allocated VMA.
37+ - No gap (unallocated memory) between start and end address.
38+
39+ The ``len`` will be page aligned implicitly by the kernel.
40+
41+ **flags**: reserved for future use.
42+
43+ **Return values**:
44+ - **0**: Success.
45+ - **-EINVAL**:
46+ * Invalid input ``flags``.
47+ * The start address (``addr``) is not page aligned.
48+ * Address range (``addr`` + ``len``) overflow.
49+ - **-ENOMEM**:
50+ * The start address (``addr``) is not allocated.
51+ * The end address (``addr`` + ``len``) is not allocated.
52+ * A gap (unallocated memory) between start and end address.
53+ - **-EPERM**:
54+ * Sealing is supported only on 64-bit CPUs; 32-bit is not supported.
55+
56+ **Note about error return**:
57+ - For the above error cases, users can expect the given memory range to
58+ be unmodified, i.e. no partial update.
59+ - There might be other internal errors/cases not listed here, e.g.
60+ error during merging/splitting VMAs, or the process reaching the maximum
61+ number of supported VMAs. In those cases, partial updates to the given
62+ memory range could happen. However, those cases should be rare.
63+
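The documented ``-EINVAL`` conditions can be sketched as a userspace pre-check. This is only an illustration of the rules listed above, not the kernel's implementation: the names ``page_align_len`` and ``mseal_precheck`` are hypothetical, and the page size is passed in as a parameter. It cannot detect the ``-ENOMEM`` cases, which depend on the process's actual mappings.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: round len up to a multiple of page_size.
 * page_size must be a power of two. The result may wrap to 0. */
size_t page_align_len(size_t len, size_t page_size)
{
    assert(page_size != 0 && (page_size & (page_size - 1)) == 0);
    return (len + page_size - 1) & ~(page_size - 1);
}

/* Hypothetical pre-check mirroring the documented -EINVAL cases.
 * Returns 0 if the arguments look acceptable, -EINVAL otherwise. */
int mseal_precheck(uintptr_t addr, size_t len, unsigned long flags,
                   size_t page_size)
{
    size_t aligned;

    if (flags != 0)                       /* flags are reserved */
        return -EINVAL;
    if (addr & (page_size - 1))           /* start not page aligned */
        return -EINVAL;

    aligned = page_align_len(len, page_size);
    if (len != 0 && aligned == 0)         /* rounding up wrapped around */
        return -EINVAL;
    if (addr + aligned < addr)            /* addr + len overflows */
        return -EINVAL;
    return 0;
}
```

A caller might run ``mseal_precheck((uintptr_t)ptr, len, 0, sysconf(_SC_PAGESIZE))`` before issuing the syscall, purely to produce a clearer diagnostic; the kernel performs its own validation regardless.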
64+ **Architecture support**:
65+ mseal only works on 64-bit CPUs, not 32-bit CPUs.
66+
67+ **Idempotent**:
68+ Users can call mseal multiple times. Calling mseal on already sealed
69+ memory is a no-op (not an error).
70+
71+ **No munseal**:
72+ Once a mapping is sealed, it can't be unsealed. The kernel should never
73+ have munseal; this is consistent with other sealing features, e.g.
74+ F_SEAL_SEAL for files.
75+
76+ Blocked mm syscall for sealed mapping
77+ -------------------------------------
78+ It might be important to note: **once the mapping is sealed, it will
79+ stay in the process's memory until the process terminates**.
80+
81+ Example::
82+
83+ void *ptr = mmap(NULL, 4096, PROT_READ, MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
84+ int rc = mseal(ptr, 4096, 0);
85+ /* munmap will fail */
86+ rc = munmap(ptr, 4096);
87+ assert(rc < 0);
88+
89+ Blocked mm syscalls:
90+ - munmap
91+ - mmap
92+ - mremap
93+ - mprotect and pkey_mprotect
94+ - some destructive madvise behaviors: MADV_DONTNEED, MADV_FREE,
95+ MADV_DONTNEED_LOCKED, MADV_DONTFORK, MADV_WIPEONFORK
96+
97+ The first set of syscalls to block is munmap, mremap, and mmap. They can
98+ either leave an empty space in the address space, therefore allowing
99+ replacement with a new mapping with a new set of attributes, or can
100+ overwrite the existing mapping with another mapping.
101+
102+ mprotect and pkey_mprotect are blocked because they change the
103+ protection bits (RWX) of the mapping.
104+
105+ Certain destructive madvise behaviors, specifically MADV_DONTNEED,
106+ MADV_FREE, MADV_DONTNEED_LOCKED, and MADV_WIPEONFORK, can introduce
107+ risks when applied to anonymous memory by threads lacking write
108+ permissions. Consequently, these operations are prohibited under such
109+ conditions. The aforementioned behaviors have the potential to modify
110+ region contents by discarding pages, effectively performing a memset(0)
111+ operation on the anonymous memory.
112+
113+ The kernel will return -EPERM for blocked syscalls.
114+
115+ When a blocked syscall returns -EPERM due to sealing, the memory regions
116+ may or may not be changed, depending on the syscall being blocked:
117+
118+ - munmap: munmap is atomic. If one of VMAs in the given range is
119+ sealed, none of VMAs are updated.
120+ - mprotect, pkey_mprotect, madvise: partial update might happen, e.g.
121+ when mprotect is applied over multiple VMAs, it might update the
122+ beginning VMAs before reaching the sealed VMA and returning -EPERM.
123+ - mmap and mremap: undefined behavior.
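The blocked-munmap behavior described above can be exercised with a small program. The following is a minimal sketch, assuming a Linux system; the function name ``demo_sealed_munmap`` is hypothetical, and the fallback syscall number 462 is an assumption used only when the libc headers do not define ``SYS_mseal``. Because a libc wrapper for mseal may not be available, the sketch goes through ``syscall()`` and treats any mseal failure as "not supported here".

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Assumption: 462 is the mseal syscall number on common 64-bit
 * architectures; used only if the libc headers do not define it. */
#ifndef SYS_mseal
#define SYS_mseal 462
#endif

/*
 * Hypothetical demo helper. Returns:
 *   1  mseal succeeded and munmap was refused with EPERM,
 *   0  mseal is unavailable here (old kernel, 32-bit, sandbox),
 *  -1  anything unexpected.
 */
int demo_sealed_munmap(void)
{
    long page = sysconf(_SC_PAGESIZE);
    void *ptr;

    assert(page > 0);
    ptr = mmap(NULL, (size_t)page, PROT_READ,
               MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
    if (ptr == MAP_FAILED)
        return -1;

    if (syscall(SYS_mseal, ptr, (size_t)page, 0UL) != 0) {
        munmap(ptr, (size_t)page);   /* not sealed, so clean up */
        return 0;
    }

    /* The mapping is sealed: munmap should now fail with EPERM. */
    if (munmap(ptr, (size_t)page) == 0)
        return -1;
    return errno == EPERM ? 1 : -1;
}
```

Note that when the seal succeeds, the page stays mapped until the process exits, which is exactly the lifetime consequence this section describes.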
124+
125+ Use cases
126+ =========
123127- glibc:
124128 The dynamic linker, during loading ELF executables, can apply sealing to
125- non-writable memory segments.
126-
127- - Chrome browser: protect some security sensitive data-structures.
129+ mapping segments.
128130
129- Notes on which memory to seal:
130- ==============================
131+ - Chrome browser: protect some security sensitive data structures.
131132
132- It might be important to note that sealing changes the lifetime of a mapping,
133- i.e. the sealed mapping won’t be unmapped till the process terminates or the
134- exec system call is invoked. Applications can apply sealing to any virtual
135- memory region from userspace, but it is crucial to thoroughly analyze the
136- mapping's lifetime prior to apply the sealing.
133+ When not to use mseal
134+ =====================
135+ Applications can apply sealing to any virtual memory region from userspace,
136+ but it is *crucial to thoroughly analyze the mapping's lifetime* prior to
137+ applying the sealing. This is because the sealed mapping *won’t be unmapped*
138+ until the process terminates or the exec system call is invoked.
137139
138140For example:
141+ - aio/shm
142+ aio/shm can call mmap and munmap on behalf of userspace, e.g.
143+ ksys_shmdt() in shm.c. The lifetimes of those mappings are not tied to
144+ the lifetime of the process. If those mappings are sealed from userspace,
145+ then munmap will fail, causing leaks in VMA address space during the
146+ lifetime of the process.
147+
148+ - ptr allocated by malloc (heap)
149+ Don't use mseal on the memory ptr returned from malloc().
150+ malloc() is implemented by an allocator, e.g. by glibc. The heap manager
151+ might allocate a ptr from brk or from a mapping created by mmap.
152+ If an app calls mseal on a ptr returned from malloc(), this can affect
153+ the heap manager's ability to manage the mappings; the outcome is
154+ non-deterministic.
155+
156+ Example::
157+
158+ ptr = malloc(size);
159+ /* don't call mseal on a ptr returned from malloc. */
160+ mseal(ptr, size, 0);
161+ /* free will succeed, but the allocator can't shrink the heap below ptr */
162+ free(ptr);
163+
164+ mseal doesn't block
165+ ===================
166+ In a nutshell, mseal blocks certain mm syscalls from modifying some of a
167+ VMA's attributes, such as protection bits (RWX). Sealing a mapping doesn't
168+ mean the memory is immutable.
139169
140- - aio/shm
141-
142- aio/shm can call mmap()/munmap() on behalf of userspace, e.g. ksys_shmdt() in
143- shm.c. The lifetime of those mapping are not tied to the lifetime of the
144- process. If those memories are sealed from userspace, then munmap() will fail,
145- causing leaks in VMA address space during the lifetime of the process.
146-
147- - Brk (heap)
148-
149- Currently, userspace applications can seal parts of the heap by calling
150- malloc() and mseal().
151- let's assume following calls from user space:
152-
153- - ptr = malloc(size);
154- - mprotect(ptr, size, RO);
155- - mseal(ptr, size);
156- - free(ptr);
157-
158- Technically, before mseal() is added, the user can change the protection of
159- the heap by calling mprotect(RO). As long as the user changes the protection
160- back to RW before free(), the memory range can be reused.
161-
162- Adding mseal() into the picture, however, the heap is then sealed partially,
163- the user can still free it, but the memory remains to be RO. If the address
164- is re-used by the heap manager for another malloc, the process might crash
165- soon after. Therefore, it is important not to apply sealing to any memory
166- that might get recycled.
167-
168- Furthermore, even if the application never calls the free() for the ptr,
169- the heap manager may invoke the brk system call to shrink the size of the
170- heap. In the kernel, the brk-shrink will call munmap(). Consequently,
171- depending on the location of the ptr, the outcome of brk-shrink is
172- nondeterministic.
173-
174-
175- Additional notes:
176- =================
177170As Jann Horn pointed out in [3], there are still a few ways to write
178- to RO memory, which is, in a way, by design. Those cases are not covered
179- by mseal(). If applications want to block such cases, sandbox tools (such as
180- seccomp, LSM, etc) might be considered.
171+ to RO memory, which is, in a way, by design. Those cases could be blocked
172+ by other security measures.
181173
182174Those cases are:
183175
184- - Write to read-only memory through /proc/self/mem interface.
185- - Write to read-only memory through ptrace (such as PTRACE_POKETEXT).
186- - userfaultfd.
176+ - Write to read-only memory through /proc/self/mem interface (FOLL_FORCE).
177+ - Write to read-only memory through ptrace (such as PTRACE_POKETEXT).
178+ - userfaultfd.
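The first case in the list can be illustrated with a short sketch that writes one byte into a ``PROT_READ`` mapping through ``/proc/self/mem``. The function name ``demo_proc_mem_write`` is hypothetical, and the mapping here is not sealed; the point is only that the write bypasses the protection bits, which mseal does not change. On systems where ``/proc`` is not mounted or forced writes are restricted, the function simply reports that the write did not happen.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical demo helper. Returns:
 *   1  a byte in a PROT_READ mapping was changed via /proc/self/mem,
 *   0  the interface is unavailable or the write was refused,
 *  -1  the mapping could not be created.
 */
int demo_proc_mem_write(void)
{
    long page = sysconf(_SC_PAGESIZE);
    volatile unsigned char *p;
    void *map;
    int fd, changed = 0;

    assert(page > 0);
    map = mmap(NULL, (size_t)page, PROT_READ,
               MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
    if (map == MAP_FAILED)
        return -1;
    p = map;

    fd = open("/proc/self/mem", O_RDWR);
    if (fd >= 0) {
        /* The file offset is the virtual address to write to. */
        if (pwrite(fd, "X", 1, (off_t)(uintptr_t)map) == 1 && p[0] == 'X')
            changed = 1;
        close(fd);
    }
    munmap(map, (size_t)page);
    return changed;
}
```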
187179
188180The idea that inspired this patch comes from Stephen Röttger’s work in V8
189181CFI [4]. Chrome browser in ChromeOS will be the first user of this API.
190182
191- Reference:
192- ==========
193- [1] https://github.com/apple-oss-distributions/xnu/blob/1031c584a5e37aff177559b9f69dbd3c8c3fd30a/osfmk/mach/vm_statistics.h#L274
194-
195- [2] https://man.openbsd.org/mimmutable.2
196-
197- [3] https://lore.kernel.org/lkml/CAG48ez3ShUYey+ZAFsU2i1RpQn0a5eOs2hzQ426FkcgnfUGLvA@mail.gmail.com
198-
199- [4] https://docs.google.com/document/d/1O2jwK4dxI3nRcOJuPYkonhTkNQfbmwdvxQMyXgeaRHo/edit#heading=h.bvaojj9fu6hc
183+ Reference
184+ =========
185+ - [1] https://github.com/apple-oss-distributions/xnu/blob/1031c584a5e37aff177559b9f69dbd3c8c3fd30a/osfmk/mach/vm_statistics.h#L274
186+ - [2] https://man.openbsd.org/mimmutable.2
187+ - [3] https://lore.kernel.org/lkml/CAG48ez3ShUYey+ZAFsU2i1RpQn0a5eOs2hzQ426FkcgnfUGLvA@mail.gmail.com
188+ - [4] https://docs.google.com/document/d/1O2jwK4dxI3nRcOJuPYkonhTkNQfbmwdvxQMyXgeaRHo/edit#heading=h.bvaojj9fu6hc