aarch64: userspace (part 1) #32854
Conversation
|
On Fri, 5 Mar 2021, jharris-intel wrote:
Interesting. This is a very different approach than the one I'm used to.
The same scheme is heavily used in the Linux kernel. However, the linker
is leveraged to gather into a table all the addresses where exceptions may
occur, along with where to branch if one does. The table is then sorted by
the linker so that the exception code can binary-search it. Here we have
only two cases so far, so maybe it isn't worth going through all that
complexity just yet.
(I'm used to instead having a spinlock-like approach, where before
doing something that may abort you record in thread context that an
exception is allowed and the handler thereof, and after you remove
said record.)
The advantage with the approach here is that you have zero overhead at
run time.
One word of warning: this approach does not play well with anything
that splits / inlines / duplicates functions. Should be fine here
currently because everything is assembly, but worth noting.
This should always be used with assembly code, and ideally with the
smallest possible range. In fact, in Linux you flag only individual
instructions, not a range.
You probably want an ISB in here somewhere.
As-is, an SError may not be detected inside this region.
Are you sure? Access faults are synchronous aborts.
+ ldtrb w3, [x0]
+ cbz w2, arch_buffer_validate_fault_end
+ sttrb w3, [x0]
I see what this is intended to do; I am rather hesitant about "null" writebacks, as there's a lot that can go wrong.
Perhaps instead use the `AT` instructions? They will trigger a synchronous abort if it's not allowed.
Ah! Interesting. I hadn't noticed those instructions before.
|
Right, I was more thinking of External Aborts, which can be async... but this is purely just checking for MMU faults, which are sync.
Not "zero". "Less" in the non-faulting case, absolutely, but you still can't e.g. inline or const-prop these calls, and you often end up needing to go over memory twice. And at the expense of more in the faulting case. (Which may explain the difference. I'm used to this mechanism more for testing purposes... where faults are much more common.) I am also still rather concerned here about race conditions, especially in SMP. There are a fair number of TOCTTOU vulnerabilities here.
BTW, there's a wrinkle with |
|
On Fri, 5 Mar 2021, jharris-intel wrote:
> The advantage with the approach here is that you have zero overhead at
> run time.
Not "zero". "Less" in the non-faulting case, absolutely, but you still
can't e.g. inline or const-prop these calls, and you often end up
needing to go over memory twice. And at the expense of more in the
faulting case. (Which may explain the difference. I'm used to this
mechanism more for testing purposes... where faults are much more
common.)
Here, the normal case is to never fault. If you fault that's because you
attempted to access memory to which you're not entitled.
In Linux you may inline those as you wish. What they do is similar to
this (for ARM32):
```
#define put_user_word(x, addr, err) \
__asm__ __volatile__( \
"1: strt %1, [%2]\n" \
"2:\n" \
" .pushsection .text.fixup,\"ax\"\n" \
" .align 2\n" \
"3: mov %0, %3\n" \
" b 2b\n" \
" .popsection\n" \
" .pushsection __ex_table,\"a\"\n" \
" .align 3\n" \
" .long 1b, 3b\n" \
" .popsection" \
: "+r" (err) \
: "r" (x), "r" (addr), "i" (-EFAULT) \
: "cc")
```
So this writes to user space memory from kernel space, and if a fault
occurs then -EFAULT is stored in the variable err. Here only the strt is
covered: its address is put into the table gathered in the __ex_table
section, along with the address of the fixup code which is also out of
line. The good thing with this approach is that you truly have zero
overhead at run time when the access is granted, and this is subject to
dead code elimination by the compiler if that piece of code is unneeded
due to constant prop, etc.
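(Side note: a minimal C sketch of the lookup side of this scheme, assuming a linker-sorted table; struct layout and symbol names are illustrative, not actual Linux or Zephyr code. The fault handler binary-searches the table with the faulting PC and, on a hit, resumes at the fixup address.)
```c
struct exception_table_entry {
	unsigned long insn;   /* address of the instruction allowed to fault */
	unsigned long fixup;  /* address to resume at when it does fault */
};

/* Bounds of the __ex_table section, as emitted by the linker script. */
extern struct exception_table_entry __start___ex_table[];
extern struct exception_table_entry __stop___ex_table[];

static unsigned long search_exception_table(unsigned long fault_pc)
{
	struct exception_table_entry *lo = __start___ex_table;
	struct exception_table_entry *hi = __stop___ex_table;

	/* The table is sorted by insn address, so a binary search works. */
	while (lo < hi) {
		struct exception_table_entry *mid = lo + (hi - lo) / 2;

		if (mid->insn == fault_pc) {
			return mid->fixup;	/* redirect the PC to the fixup */
		} else if (mid->insn < fault_pc) {
			lo = mid + 1;
		} else {
			hi = mid;
		}
	}
	return 0;	/* no entry: the fault is a genuine error */
}
```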
I am also still rather concerned here about race conditions, especially in SMP. There are a fair number of TOCTTOU vulnerabilities here.
Do you have a precise example in mind?
Let's remember that here the kernel is testing memory access on a user
thread's behalf before consuming, or worse, writing back user data.
It's also the kernel that sets those access permissions in the first
place. No other thread has any business changing another thread's access
rights, except maybe to kill it. So there is no opportunity for the
access permission to change between the test and the actual access
(kernel memory won't suddenly become user memory).
|
...exactly my point? Let me elaborate a little. This approach means that you can often leak a limited amount of otherwise-inaccessible data by winning the race condition. I took a quick look, and lo and behold (reformatted to be a bit smaller):
```c
static inline int z_vrfy_k_thread_name_set(struct k_thread *t, const char *str) {
[...]
int err;
size_t len = z_user_string_nlen(str, CONFIG_THREAD_MAX_NAME_LEN, &err);
if (err != 0 || Z_SYSCALL_MEMORY_READ(str, len) != 0) {
return -EFAULT;
}
/* RACE HERE */
return z_impl_k_thread_name_set(t, str);
}
```
I am a mostly-unprivileged thread that can at least spawn a thread, and have at least some shared memory with threads I spawn. I point `str` near the end of a shared accessible region, and I am hoping to have a write of a non-zero byte over the terminator land between the length check and the copy. In this case, the resulting value of `len` is computed from the old, short string and the checks pass, ...then `z_impl_k_thread_name_set` does a `strncpy` and copies up to `CONFIG_THREAD_MAX_NAME_LEN-2` bytes worth of kernel memory into the thread name, which we can retrieve easily enough using `k_thread_name_copy`.
In general, `z_user_string_nlen` as basically anything other than a part of `z_user_string_(alloc_)?copy` is rather unsafe. Yes, if used absolutely perfectly it's fine - that's what `z_user_string_*` does after all - but it's far too easy to subtly misuse.
That being said, this particular leak is relatively limited, both because it stops copying at a null byte and because it can only leak directly after a shared accessible region. (And it copies a relatively small amount on top of that.)
Hopefully this clarifies what I meant... |
|
On Fri, 5 Mar 2021, jharris-intel wrote:
```c
static inline int z_vrfy_k_thread_name_set(struct k_thread *t, const char *str) {
[...]
int err;
size_t len = z_user_string_nlen(str, CONFIG_THREAD_MAX_NAME_LEN, &err);
if (err != 0 || Z_SYSCALL_MEMORY_READ(str, len) != 0) {
return -EFAULT;
}
/* RACE HERE */
return z_impl_k_thread_name_set(t, str);
}
```
...then `z_impl_k_thread_name_set` does a `strncpy` and copies up to `CONFIG_THREAD_MAX_NAME_LEN-2` bytes worth of kernel memory into the thread name, which we can retrieve easily enough using `k_thread_name_copy`.
Absolutely. That code above is simply broken.
`z_impl_k_thread_name_set()` must take `len` as parameter and copy only
the number of bytes that was vetted by `z_user_string_nlen()`.
And BTW this is broken even without SMP in the picture. The second
thread could be preemptively scheduled, flip the byte and yield. Much
harder to get the timing right but not impossible.
Please create a separate issue for this as this is a generic bug
unrelated to this PR.
In general, `z_user_string_nlen` as basically anything other than a
part of `z_user_string_(alloc_)?copy` is rather unsafe. Yes, if used
absolutely perfectly it's fine - that's what `z_user_string_*` does
after all - but it's far too easy to subtly misuse.
I wouldn't say so. When a validation function tells you that n bytes
from address p are fine, then you must copy n bytes from address p. Not
forget about n along the way! That's a very basic rule. It is pointless
performing another EOS detection anyway.
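To make that rule concrete, here is a rough sketch of what a fixed handler could look like (illustrative only, not the actual fix PR): copy exactly the vetted length into a kernel-owned buffer, then never touch the user pointer again.
```c
/* Sketch only: "copy the n vetted bytes, then forget the user pointer".
 * Error handling is simplified and this is not the actual Zephyr fix.
 */
static inline int z_vrfy_k_thread_name_set(struct k_thread *t,
					   const char *str)
{
	char kname[CONFIG_THREAD_MAX_NAME_LEN];
	int err;
	size_t len = z_user_string_nlen(str, sizeof(kname) - 1, &err);

	if (err != 0 || Z_SYSCALL_MEMORY_READ(str, len) != 0) {
		return -EFAULT;
	}

	/* Bounded copy of exactly the vetted length: a racing writer can
	 * still change the bytes, but can no longer extend the copy past
	 * what was validated.
	 */
	memcpy(kname, str, len);
	kname[len] = '\0';

	return z_impl_k_thread_name_set(t, kname);
}
```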
That being said, this particular leak is relatively limited, both because it stops copying at a null byte and because it can only leak directly after a shared accessible region. (And it copies a relatively small amount on top of that.)
Hopefully this clarifies what I meant...
Absolutely.
|
|
Perhaps instead use the `AT` instructions? They will trigger a synchronous abort if it's not allowed.
Well, that doesn't work for me. Maybe qemu isn't fully emulating those
instructions, or I don't understand how to use them. In fact I tried
most variations and the test suite always fails (it doesn't fault when
expected).
So I suggest that the code be merged as is for now. Testing for writes
happens in the area that is most likely going to be overwritten anyway.
|
Note the additional branches, no ability to split loads/stores (or combine them on AArch64; not as much of an issue on AArch32 because there is no STRDT instruction), no ability to remove redundant stores, and less ability to use the more complex addressing modes. (...also, that assembly isn't marked as affecting memory, which seems like an issue.) To be clear, when I'm talking about a lack of ability to inline I'm not so much talking about the "avoiding caller-save registers / stack frame" portion as I am about the additional knock-on optimizations that you can generally do via inlining.
...Sorry about that. It's the low (IIRC?) bit of PAR_EL1 that indicates a failure, not an actual abort.
Ah, I wasn't sure about that. I didn't know if this particular case was potentially in preemptable context or not, and mentioned the SMP case because I know it's broken there.
This is my point, yes. The safe way to use this API is to do A, then B. Only... we have an API to do A, then B. It's called `z_user_string_copy`. I'll open a PR to fix this particular case, but my stance remains that `z_user_string_nlen` on its own is far too easy to subtly misuse. |
|
On Mon, 8 Mar 2021, jharris-intel wrote:
That's... precisely what I'm disagreeing with you on? The definition
and behavior are, seemingly, clear, and/but this code is _not_
adhering to both.
This is further evidenced by the fact that this kernel API is
currently being used in ways (atomics being the most obvious example)
that work **if the function behaves as documented**, but not if you
allow the function to actually write to the region.
My apologies. I'm a goof. I completely misread your other reply.
You are right about atomic CAS being screwed by the write in
arch_buffer_validate of course. I did my homework and reimplemented it
around the AT instruction.
|
|
@npitre I force pushed your |
something went wrong |
I think that the merge of #33145 is confusing the CI. |
|
Ok, I rebased on latest master and now the only failure on CI is not related to this commit. |
|
Try to kick CI again. |
arch/arm/core/aarch64/userspace.S
This ISB should only be necessary at the end of the loop, not before reading PAR.
arch/arm/core/aarch64/userspace.S
Hm. The documentation (and most of the other code) treats Z_SYSCALL_MEMORY_WRITE as a superset of Z_SYSCALL_MEMORY_READ, in that write implies read. Do we need to check read also on write?
arch/arm/core/aarch64/userspace.S
This loop doesn't appear to check non-page-aligned addresses/sizes properly.
E.g. if I am checking 4KiB starting halfway through a page it only checks the first of the two pages, assuming I'm parsing this correctly.
This may be worth tossing in a test for.
Suggestion:
- At the start of the loop, do `and x0, x0, ~#(CONFIG_MMU_PAGE_SIZE - 1)` to reset the address to the start of the page.
- Change the increment to `add x0, x0, #(CONFIG_MMU_PAGE_SIZE)`.
(This won't work if x0 is 2**64-CONFIG_MMU_PAGE_SIZE, but that can't happen in practice.)
|
On Tue, 9 Mar 2021, jharris-intel wrote:
This ISB should only be necessary at the end of the loop, not before reading PAR.
Quoting the ARM ARM:
```
Where an instruction results in an update to a system register, as is
the case with the AT * address translation instructions, explicit
synchronization must be performed before the result is guaranteed to be
visible to subsequent direct reads of the PAR_EL1.
```
So my interpretation is that the ISB is necessary before each read of
the PAR. Once we're outside of the loop, no ISB is necessary because
we've already made up our mind and nothing we've done before has
pending side effects that we need to wait for.
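For reference, the sequence being described boils down to the following C sketch (the PR implements this in assembly; the helper name here is made up):
```c
#include <stdbool.h>
#include <stdint.h>

/* Sketch: probe whether EL0 may read 'addr'.  AT S1E0R performs the
 * stage-1 EL0 read translation (use AT S1E0W for a write check); per
 * the ARM ARM quote above, an ISB is required before PAR_EL1 is
 * guaranteed to reflect the result.
 */
static inline bool user_can_read(const void *addr)
{
	uint64_t par;

	__asm__ volatile("at s1e0r, %1\n"
			 "isb\n"
			 "mrs %0, par_el1\n"
			 : "=r" (par)
			 : "r" (addr));

	/* PAR_EL1.F (bit 0) is set when the translation faulted. */
	return (par & 1) == 0;
}
```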
Hm. The documentation (and most of the other code) treats
`Z_SYSCALL_MEMORY_WRITE` as a superset of `Z_SYSCALL_MEMORY_READ`, in
that write implies read. Do we need to check read also on write?
Don't think so. Especially since we never map anything write-only.
> +z_arm64_user_string_nlen_fixup:
+ mov x4, #-1
+ mov x0, #0
+
+strlen_done:
+ str w4, [x2]
+ ret
+
+/*
+ * int arch_buffer_validate(void *addr, size_t size, int write)
+ */
+
+GTEXT(arch_buffer_validate)
+SECTION_FUNC(TEXT, arch_buffer_validate)
+
+ add x1, x1, x0
This loop doesn't appear to check non-page-aligned addresses/sizes properly.
E.g. if I am checking 4KiB starting halfway through a page it only checks the first of the two pages, assuming I'm parsing this correctly.
It does check both pages. Note the `orr`.
|
|
Let me quote that one properly.
```
+ add x1, x1, x0
[...]
+ orr x0, x0, #(CONFIG_MMU_PAGE_SIZE - 1)
+ add x0, x0, #1
+ cmp x0, x1
+ blo abv_loop
```
This loop doesn't appear to check non-page-aligned addresses/sizes properly.
E.g. if I am checking 4KiB starting halfway through a page it only checks the first of the two pages, assuming I'm parsing this correctly.
As I said, look again. ;-)
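For what it's worth, a C rendition of that loop structure (sketch only; `page_access_ok` and the fallback page-size define are made up for illustration) shows why a buffer straddling a page boundary does get both pages probed:
```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#ifndef CONFIG_MMU_PAGE_SIZE
#define CONFIG_MMU_PAGE_SIZE 4096	/* for illustration only */
#endif

/* Hypothetical per-page probe standing in for the AT-based check. */
extern bool page_access_ok(uintptr_t addr, bool write);

static bool validate_buffer(uintptr_t addr, size_t size, bool write)
{
	uintptr_t end = addr + size;		/* add x1, x1, x0 */

	while (addr < end) {
		if (!page_access_ok(addr, write)) {
			return false;
		}
		/* orr x0, x0, #(CONFIG_MMU_PAGE_SIZE - 1); add x0, x0, #1:
		 * round up to the start of the next page.  E.g. checking
		 * 4 KiB starting halfway through a page probes both pages.
		 */
		addr |= (uintptr_t)(CONFIG_MMU_PAGE_SIZE - 1);
		addr += 1;
	}
	return true;
}
```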
|
...BRB, need to fix some code. Also, wow, that's misleading... The AT* instruction descriptions simply say "Input address for translation. The resulting address can be read from the PAR_EL1". Ditto the top-level description: "If the address translation is successful, the resulting output address is returned in PAR_EL1.PA, and PAR_EL1.F is set to 0 to indicate that the translation was successful."
I'm mainly wondering about e.g. compilers being amusing and deciding "hey, you know what would be great here at this write? A read-modify-write".
Fair.
Oh, I see. |
|
Kicking CI again. |
A fix for the build issue has been merged. Please rebase. |
Introduce the necessary macros and defines to have the stack regions correctly aligned and sized. Signed-off-by: Carlo Caione <ccaione@baylibre.com>
Add the arch_syscall_oops hook for the AArch64. Signed-off-by: Carlo Caione <ccaione@baylibre.com>
Introduce the first pieces needed to schedule user threads by defining two different code paths for kernel and user threads. Signed-off-by: Carlo Caione <ccaione@baylibre.com>
The arch_is_user_context() function is relying on the content of the tpidrro_el0 register to determine whether we are in user context or not. This register is set to '1' when in EL1 and set back to '0' when user threads are running in userspace. Signed-off-by: Carlo Caione <ccaione@baylibre.com>
User mode is only allowed to induce oopses and stack check failures via software-triggered system fatal exceptions. Signed-off-by: Carlo Caione <ccaione@baylibre.com>
Introduce the arch_user_string_nlen() assembly routine and the necessary C code bits. Signed-off-by: Carlo Caione <ccaione@baylibre.com> Signed-off-by: Nicolas Pitre <npitre@baylibre.com>
This leverages the AT (address translation) instruction to test for given access permission. The result is then provided in the PAR_EL1 register. Thanks to @jharris-intel for the suggestion. Signed-off-by: Nicolas Pitre <npitre@baylibre.com>
This patch adds the code managing the syscalls. The privileged stack is setup before jumping into the real syscall. Signed-off-by: Carlo Caione <ccaione@baylibre.com>
This is the first batch of patches implementing userspace for ARM64.
This is split in several chunks to ease reviewing and merging.