aarch64: userspace (part 1) #32854
Conversation
|
On Fri, 5 Mar 2021, jharris-intel wrote:
Interesting. This is a very different approach than the one I'm used to.
The same scheme is heavily used in the Linux kernel. However, the linker
is leveraged to gather into a table all the addresses where exceptions may
occur, along with where to branch if one does. The table is then sorted by
the linker so that the exception code can binary-search it. Here we have
only two cases so far, so maybe it isn't worth going through all that
complexity just yet.
(I'm used to instead having a spinlock-like approach, where before
doing something that may abort you record in thread context that an
exception is allowed and the handler thereof, and after you remove
said record.)
The advantage with the approach here is that you have zero overhead at
run time.
One word of warning: this approach does not play well with anything
that splits / inlines / duplicates functions. Should be fine here
currently because everything is assembly, but worth noting.
This should always be used with assembly code, and ideally with the
smallest possible range. In fact, in Linux you flag only individual
instructions, not a range.
You probably want an ISB in here somewhere.
As-is, an SError may not be detected inside this region.
Are you sure? Access faults are synchronous aborts.
+ ldtrb w3, [x0]
+ cbz w2, arch_buffer_validate_fault_end
+ sttrb w3, [x0]
I see what this is intended to do; I am rather hesitant about "null" writebacks, as there's a lot that can go wrong.
Perhaps instead use the `AT` instructions? They will trigger a synchronous abort if it's not allowed.
Ah! Interesting. I hadn't noticed those instructions before.
|
Right, I was more thinking of External Aborts, which can be async... but this is purely just checking for MMU faults, which are sync.
Not "zero". "Less" in the non-faulting case, absolutely, but you still can't e.g. inline or const-prop these calls, and you often end up needing to go over memory twice. And at the expense of more in the faulting case. (Which may explain the difference. I'm used to this mechanism more for testing purposes... where faults are much more common.) I am also still rather concerned here about race conditions, especially in SMP. There are a fair number of TOCTTOU vulnerabilities here.
BTW, there's a wrinkle with |
|
On Fri, 5 Mar 2021, jharris-intel wrote:
> The advantage with the approach here is that you have zero overhead at
> run time.
Not "zero". "Less" in the non-faulting case, absolutely, but you still
can't e.g. inline or const-prop these calls, and you often end up
needing to go over memory twice. And at the expense of more in the
faulting case. (Which may explain the difference. I'm used to this
mechanism more for testing purposes... where faults are much more
common.)
Here, the normal case is to never fault. If you fault that's because you
attempted to access memory to which you're not entitled.
In Linux you may inline those as you wish. What they do is similar to
this (for ARM32):
```
#define put_user_word(x, addr, err) \
__asm__ __volatile__( \
"1: strt %1, [%2]\n" \
"2:\n" \
" .pushsection .text.fixup,\"ax\"\n" \
" .align 2\n" \
"3: mov %0, %3\n" \
" b 2b\n" \
" .popsection\n" \
" .pushsection __ex_table,\"a\"\n" \
" .align 3\n" \
" .long 1b, 3b\n" \
" .popsection" \
: "+r" (err) \
: "r" (x), "r" (addr), "i" (-EFAULT) \
: "cc")
```
So this writes to user space memory from kernel space, and if a fault
occurs then -EFAULT is stored in the variable err. Here only the strt is
covered: its address is put into the table gathered in the __ex_table
section, along with the address of the fixup code which is also out of
line. The good thing with this approach is that you truly have zero
overhead at run time when the access is granted, and this is subject to
dead code elimination by the compiler if that piece of code is unneeded
due to constant prop, etc.
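(Side note: a minimal C sketch of the lookup side of this scheme, assuming a linker-sorted table; struct layout and symbol names are illustrative, not actual Linux or Zephyr code. The fault handler binary-searches the table with the faulting PC and, on a hit, resumes at the fixup address.)
```c
struct exception_table_entry {
	unsigned long insn;   /* address of the instruction allowed to fault */
	unsigned long fixup;  /* address to resume at when it does fault */
};

/* Bounds of the __ex_table section, as emitted by the linker script. */
extern struct exception_table_entry __start___ex_table[];
extern struct exception_table_entry __stop___ex_table[];

static unsigned long search_exception_table(unsigned long fault_pc)
{
	struct exception_table_entry *lo = __start___ex_table;
	struct exception_table_entry *hi = __stop___ex_table;

	/* The table is sorted by insn address, so a binary search works. */
	while (lo < hi) {
		struct exception_table_entry *mid = lo + (hi - lo) / 2;

		if (mid->insn == fault_pc) {
			return mid->fixup;	/* redirect the PC to the fixup */
		} else if (mid->insn < fault_pc) {
			lo = mid + 1;
		} else {
			hi = mid;
		}
	}
	return 0;	/* no entry: the fault is a genuine error */
}
```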
I am also still rather concerned here about race conditions, especially in SMP. There are a fair number of TOCTTOU vulnerabilities here.
Do you have a precise example in mind?
Let's remember that here the kernel is testing memory access on a user
thread's behalf before consuming, or worse, writing back user data.
It's also the kernel that sets those access permissions in the first
place. No other thread has any business changing another thread's access
rights, except maybe to kill it. So there is no opportunity for the
access permission to change between the test and the actual access
(kernel memory won't suddenly become user memory).
|
...exactly my point? Let me elaborate a little. This approach means that you can often leak a limited amount of otherwise-inaccessible data by winning the race condition. I took a quick look, and lo and behold (reformatted to be a bit smaller):
```c
static inline int z_vrfy_k_thread_name_set(struct k_thread *t, const char *str) {
[...]
int err;
size_t len = z_user_string_nlen(str, CONFIG_THREAD_MAX_NAME_LEN, &err);
if (err != 0 || Z_SYSCALL_MEMORY_READ(str, len) != 0) {
return -EFAULT;
}
/* RACE HERE */
return z_impl_k_thread_name_set(t, str);
}
```
I am a mostly-unprivileged thread that can at least spawn a thread, and have at least some shared memory with threads I spawn. I point `str` near the end of a shared accessible region, and I am hoping to have a write of a non-zero byte over the terminator land between the length check and the copy. In this case, the resulting value of `len` is computed from the old, short string and the checks pass, ...then `z_impl_k_thread_name_set` does a `strncpy` and copies up to `CONFIG_THREAD_MAX_NAME_LEN-2` bytes worth of kernel memory into the thread name, which we can retrieve easily enough using `k_thread_name_copy`.
In general, `z_user_string_nlen` as basically anything other than a part of `z_user_string_(alloc_)?copy` is rather unsafe. Yes, if used absolutely perfectly it's fine - that's what `z_user_string_*` does after all - but it's far too easy to subtly misuse.
That being said, this particular leak is relatively limited, both because it stops copying at a null byte and because it can only leak directly after a shared accessible region. (And it copies a relatively small amount on top of that.)
Hopefully this clarifies what I meant... |
|
On Fri, 5 Mar 2021, jharris-intel wrote:
```c
static inline int z_vrfy_k_thread_name_set(struct k_thread *t, const char *str) {
[...]
int err;
size_t len = z_user_string_nlen(str, CONFIG_THREAD_MAX_NAME_LEN, &err);
if (err != 0 || Z_SYSCALL_MEMORY_READ(str, len) != 0) {
return -EFAULT;
}
/* RACE HERE */
return z_impl_k_thread_name_set(t, str);
}
```
...then `z_impl_k_thread_name_set` does a `strncpy` and copies up to `CONFIG_THREAD_MAX_NAME_LEN-2` bytes worth of kernel memory into the thread name, which we can retrieve easily enough using `k_thread_name_copy`.
Absolutely. That code above is simply broken.
`z_impl_k_thread_name_set()` must take `len` as parameter and copy only
the number of bytes that was vetted by `z_user_string_nlen()`.
And BTW this is broken even without SMP in the picture. The second
thread could be preemptively scheduled, flip the byte and yield. Much
harder to get the timing right but not impossible.
Please create a separate issue for this as this is a generic bug
unrelated to this PR.
In general, `z_user_string_nlen` as basically anything other than a
part of `z_user_string_(alloc_)?copy` is rather unsafe. Yes, if used
absolutely perfectly it's fine - that's what `z_user_string_*` does
after all - but it's far too easy to subtly misuse.
I wouldn't say so. When a validation function tells you that n bytes
from address p are fine, then you must copy n bytes from address p. Not
forget about n along the way! That's a very basic rule. It is pointless
performing another EOS detection anyway.
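To make that rule concrete, here is a rough sketch of what a fixed handler could look like (illustrative only, not the actual fix PR): copy exactly the vetted length into a kernel-owned buffer, then never touch the user pointer again.
```c
/* Sketch only: "copy the n vetted bytes, then forget the user pointer".
 * Error handling is simplified and this is not the actual Zephyr fix.
 */
static inline int z_vrfy_k_thread_name_set(struct k_thread *t,
					   const char *str)
{
	char kname[CONFIG_THREAD_MAX_NAME_LEN];
	int err;
	size_t len = z_user_string_nlen(str, sizeof(kname) - 1, &err);

	if (err != 0 || Z_SYSCALL_MEMORY_READ(str, len) != 0) {
		return -EFAULT;
	}

	/* Bounded copy of exactly the vetted length: a racing writer can
	 * still change the bytes, but can no longer extend the copy past
	 * what was validated.
	 */
	memcpy(kname, str, len);
	kname[len] = '\0';

	return z_impl_k_thread_name_set(t, kname);
}
```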
That being said, this particular leak is relatively limited, both because it stops copying at a null byte and because it can only leak directly after a shared accessible region. (And it copies a relatively small amount on top of that.)
Hopefully this clarifies what I meant...
Absolutely.
|
|
Perhaps instead use the `AT` instructions? They will trigger a synchronous abort if it's not allowed.
Well, that doesn't work for me. Maybe qemu isn't fully emulating those
instructions, or I don't understand how to use them. In fact I tried
most variations and the test suite always fails (it doesn't fault when
expected).
So I suggest that the code be merged as is for now. Testing for writes
happens in the area that is most likely going to be overwritten anyway.
|
Note the additional branches, no ability to split loads/stores (or combine them on AArch64; not as much of an issue on AArch32 because there is no STRDT instruction), no ability to remove redundant stores, and less ability to use the more complex addressing modes. (...also, that assembly isn't marked as affecting memory, which seems like an issue.) To be clear, when I'm talking about a lack of ability to inline I'm not so much talking about the "avoiding caller-save registers / stack frame" portion as I am about the additional knock-on optimizations that you can generally do via inlining.
...Sorry about that. It's the low (IIRC?) bit of PAR_EL1 that indicates a failure, not an actual abort.
Ah, I wasn't sure about that. I didn't know if this particular case was potentially in preemptable context or not, and mentioned the SMP case because I know it's broken there.
This is my point, yes. The safe way to use this API is to do A, then B. Only... we have an API to do A, then B. It's called `z_user_string_copy`. I'll open a PR to fix this particular case, but my stance remains that `z_user_string_nlen` on its own is far too easy to subtly misuse. |
|
On Mon, 8 Mar 2021, jharris-intel wrote:
That's... precisely what I'm disagreeing with you on? The definition
and behavior are, seemingly, clear, and/but this code is _not_
adhering to both.
This is further evidenced by the fact that this kernel API is
currently being used in ways (atomics being the most obvious example)
that work **if the function behaves as documented**, but not if you
allow the function to actually write to the region.
My apologies. I'm a goof. I completely misread your other reply.
You are right about atomic CAS being screwed by the write in
arch_buffer_validate of course. I did my homework and reimplemented it
around the AT instruction.
|
|
@npitre I force pushed your |
something went wrong |
I think that the merge of #33145 is confusing the CI. |
|
Ok, I rebased on latest master and now the only failure on CI is not related to this commit. |
|
Try to kick CI again. |
arch/arm/core/aarch64/userspace.S
This ISB should only be necessary at the end of the loop, not before reading PAR.
arch/arm/core/aarch64/userspace.S
Hm. The documentation (and most of the other code) treats Z_SYSCALL_MEMORY_WRITE as a superset of Z_SYSCALL_MEMORY_READ, in that write implies read. Do we need to check read also on write?
arch/arm/core/aarch64/userspace.S
This loop doesn't appear to check non-page-aligned addresses/sizes properly.
E.g. if I am checking 4KiB starting halfway through a page it only checks the first of the two pages, assuming I'm parsing this correctly.
This may be worth tossing in a test for.
Suggestion:
- At the start of the loop, do `and x0, x0, ~#(CONFIG_MMU_PAGE_SIZE - 1)` to reset the address to the start of the page.
- Change the increment to `add x0, x0, #(CONFIG_MMU_PAGE_SIZE)`.
(This won't work if x0 is 2**64-CONFIG_MMU_PAGE_SIZE, but that can't happen in practice.)
|
On Tue, 9 Mar 2021, jharris-intel wrote:
This ISB should only be necessary at the end of the loop, not before reading PAR.
Quoting the ARM ARM:
```
Where an instruction results in an update to a system register, as is
the case with the AT * address translation instructions, explicit
synchronization must be performed before the result is guaranteed to be
visible to subsequent direct reads of the PAR_EL1.
```
So my interpretation is that the ISB is necessary before each read of
the PAR. Once we're outside of the loop, no ISB is necessary because
we've already made up our mind and nothing we've done before has
pending side effects that we need to wait for.
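For reference, the sequence being described boils down to the following C sketch (the PR implements this in assembly; the helper name here is made up):
```c
#include <stdbool.h>
#include <stdint.h>

/* Sketch: probe whether EL0 may read 'addr'.  AT S1E0R performs the
 * stage-1 EL0 read translation (use AT S1E0W for a write check); per
 * the ARM ARM quote above, an ISB is required before PAR_EL1 is
 * guaranteed to reflect the result.
 */
static inline bool user_can_read(const void *addr)
{
	uint64_t par;

	__asm__ volatile("at s1e0r, %1\n"
			 "isb\n"
			 "mrs %0, par_el1\n"
			 : "=r" (par)
			 : "r" (addr));

	/* PAR_EL1.F (bit 0) is set when the translation faulted. */
	return (par & 1) == 0;
}
```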
Hm. The documentation (and most of the other code) treats
`Z_SYSCALL_MEMORY_WRITE` as a superset of `Z_SYSCALL_MEMORY_READ`, in
that write implies read. Do we need to check read also on write?
Don't think so. Especially since we never map anything write-only.
> +z_arm64_user_string_nlen_fixup:
+ mov x4, #-1
+ mov x0, #0
+
+strlen_done:
+ str w4, [x2]
+ ret
+
+/*
+ * int arch_buffer_validate(void *addr, size_t size, int write)
+ */
+
+GTEXT(arch_buffer_validate)
+SECTION_FUNC(TEXT, arch_buffer_validate)
+
+ add x1, x1, x0
This loop doesn't appear to check non-page-aligned addresses/sizes properly.
E.g. if I am checking 4KiB starting halfway through a page it only checks the first of the two pages, assuming I'm parsing this correctly.
It does check both pages. Note the `orr`.
|
|
Let me quote that one properly.
```
+ add x1, x1, x0
[...]
+ orr x0, x0, #(CONFIG_MMU_PAGE_SIZE - 1)
+ add x0, x0, #1
+ cmp x0, x1
+ blo abv_loop
```
This loop doesn't appear to check non-page-aligned addresses/sizes properly.
E.g. if I am checking 4KiB starting halfway through a page it only checks the first of the two pages, assuming I'm parsing this correctly.
As I said, look again. ;-)
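For what it's worth, a C rendition of that loop structure (sketch only; `page_access_ok` and the fallback page-size define are made up for illustration) shows why a buffer straddling a page boundary does get both pages probed:
```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#ifndef CONFIG_MMU_PAGE_SIZE
#define CONFIG_MMU_PAGE_SIZE 4096	/* for illustration only */
#endif

/* Hypothetical per-page probe standing in for the AT-based check. */
extern bool page_access_ok(uintptr_t addr, bool write);

static bool validate_buffer(uintptr_t addr, size_t size, bool write)
{
	uintptr_t end = addr + size;		/* add x1, x1, x0 */

	while (addr < end) {
		if (!page_access_ok(addr, write)) {
			return false;
		}
		/* orr x0, x0, #(CONFIG_MMU_PAGE_SIZE - 1); add x0, x0, #1:
		 * round up to the start of the next page.  E.g. checking
		 * 4 KiB starting halfway through a page probes both pages.
		 */
		addr |= (uintptr_t)(CONFIG_MMU_PAGE_SIZE - 1);
		addr += 1;
	}
	return true;
}
```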
|
...BRB, need to fix some code. Also, wow, that's misleading... The AT* instruction descriptions simply say "Input address for translation. The resulting address can be read from the PAR_EL1". Ditto the top-level description: "If the address translation is successful, the resulting output address is returned in PAR_EL1.PA, and PAR_EL1.F is set to 0 to indicate that the translation was successful."
I'm mainly wondering about e.g. compilers being amusing and deciding "hey, you know what would be great here at this write? A read-modify-write".
Fair.
Oh, I see. |
|
Kicking CI again. |
A fix for the build issue has been merged. Please rebase. |
Introduce the necessary macros and defines to have the stack regions correctly aligned and sized. Signed-off-by: Carlo Caione <ccaione@baylibre.com>
Add the arch_syscall_oops hook for the AArch64. Signed-off-by: Carlo Caione <ccaione@baylibre.com>
Introduce the first pieces needed to schedule user threads by defining two different code paths for kernel and user threads. Signed-off-by: Carlo Caione <ccaione@baylibre.com>
The arch_is_user_context() function is relying on the content of the tpidrro_el0 register to determine whether we are in user context or not. This register is set to '1' when in EL1 and set back to '0' when user threads are running in userspace. Signed-off-by: Carlo Caione <ccaione@baylibre.com>
User mode is only allowed to induce oopses and stack check failures via software-triggered system fatal exceptions. Signed-off-by: Carlo Caione <ccaione@baylibre.com>
Introduce the arch_user_string_nlen() assembly routine and the necessary C code bits. Signed-off-by: Carlo Caione <ccaione@baylibre.com> Signed-off-by: Nicolas Pitre <npitre@baylibre.com>
This leverages the AT (address translation) instruction to test for given access permission. The result is then provided in the PAR_EL1 register. Thanks to @jharris-intel for the suggestion. Signed-off-by: Nicolas Pitre <npitre@baylibre.com>
This patch adds the code managing the syscalls. The privileged stack is setup before jumping into the real syscall. Signed-off-by: Carlo Caione <ccaione@baylibre.com>
This is the first batch of patches implementing userspace for ARM64.
This is split in several chunks to ease reviewing and merging.