
Conversation


@npitre npitre commented Mar 4, 2021

This is the first batch of patches implementing userspace for ARM64.
It is split into several chunks to ease reviewing and merging.

@github-actions bot added the area: API, area: ARM, and area: ARM_64 labels on Mar 4, 2021
@carlocaione carlocaione self-assigned this Mar 4, 2021
@nashif nashif requested a review from dcpleung March 4, 2021 11:50
@carlocaione
Contributor

@npitre I force pushed your branch to fix the conflicts caused by #32394

@npitre
Author

npitre commented Mar 5, 2021 via email

@jharris-intel
Contributor

Access faults are synchronous aborts.

Right, I was more thinking of External Aborts, which can be async... but this is purely just checking for MMU faults, which are sync.

The advantage with the approach here is that you have zero overhead at
run time.

Not "zero". "Less" in the non-faulting case, absolutely, but you still can't e.g. inline or const-prop these calls, and you often end up needing to go over memory twice. And it costs more in the faulting case. (Which may explain the difference in our perspectives: I'm used to this mechanism more for testing purposes, where faults are much more common.)

I am also still rather concerned here about race conditions, especially in SMP. There are a fair number of TOCTTOU vulnerabilities here.

Ah! Interesting. I hadn't noticed those instructions before.

BTW, there's a wrinkle with PSTATE.PAN... but I don't think we use PAN so it's not really relevant currently. Just something to keep in mind.

@npitre
Author

npitre commented Mar 5, 2021 via email

@jharris-intel
Contributor

Let's remember that here the kernel is testing memory access on a user thread's behalf before consuming, or worse, writing back user data.

...exactly my point? Let me elaborate a little.

This approach means that you can often leak a limited amount of otherwise-inaccessible data by winning the race condition.

I took a quick look, and lo and behold (reformatted to be a bit smaller):

static inline int z_vrfy_k_thread_name_set(struct k_thread *t, const char *str) {
	[...]
	int err;
	size_t len = z_user_string_nlen(str, CONFIG_THREAD_MAX_NAME_LEN, &err);
	if (err != 0 || Z_SYSCALL_MEMORY_READ(str, len) != 0) {
		return -EFAULT;
	}
	/* RACE HERE */
	return z_impl_k_thread_name_set(t, str);
}

I am a mostly-unprivileged thread that can at least spawn a thread, and have at least some shared memory with threads I spawn.

I point str to the last accessible byte of one of my accessible shared memory regions. I spawn a thread that repeatedly calls k_thread_name_set with str. I repeatedly toggle said byte between 0 and a non-zero value.

I am hoping to have a write of str[0] = nonzero hit at RACE HERE, such that the z_user_string_nlen call saw the byte as zero. If it hits before, no big deal: I see the -EFAULT and retry. Ditto if it hits after; a later k_thread_name_copy will make that evident and I can retry.

In this case, the resulting value of len is zero, so the checks pass. That byte is accessible for z_user_string_nlen, and zero bytes worth of data is also accessible for Z_SYSCALL_MEMORY_READ.

...then z_impl_k_thread_name_set does a strncpy and copies up to CONFIG_THREAD_MAX_NAME_LEN-2 bytes worth of kernel memory into the thread name, which we can retrieve easily enough using k_thread_name_copy.

In general, using z_user_string_nlen as anything other than a building block of z_user_string_(alloc_)?copy is rather unsafe. Yes, if used absolutely perfectly it's fine - that's what z_user_string_* does after all - but it's far too easy to subtly misuse.

That being said, this particular leak is relatively limited, both because it stops copying at a null byte and because it can only leak data sitting directly after a shared accessible region. (And it copies a relatively small amount on top of that.)

Hopefully this clarifies what I meant...

@npitre
Author

npitre commented Mar 6, 2021 via email

@npitre
Author

npitre commented Mar 6, 2021 via email

@jharris-intel
Contributor

jharris-intel commented Mar 6, 2021

In Linux you may inline those as you wish. What they do is similar to
this (for ARM32):

Note the additional branches, no ability to split loads/stores (or combine them in AArch64; not as much of an issue in AArch32 because there is no STRDT instruction), no ability to remove redundant stores, and less ability to use the more complex addressing modes.

(...also, that assembly isn't marked as affecting memory, which seems like an issue.)

To be clear, when I talk about a lack of ability to inline I'm not so much talking about the "avoiding caller-save registers / stack frame" portion as about the additional knock-on optimizations that inlining generally enables.
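For readers without the quoted assembly in front of them, the Linux-style mechanism being contrasted here pairs an unprivileged access instruction with an exception-table entry that redirects a fault to a fixup stub. A simplified AArch32 sketch of the idea (from memory; not the actual Linux __get_user_asm, and the section names are illustrative):

/* Sketch of the exception-table pattern: label "1:" is the access that may
 * fault; label "3:" is the fixup code the fault handler jumps to after
 * finding the matching __ex_table entry for the faulting PC.
 */
static inline int sketch_get_user(unsigned int *x, const void *addr)
{
	int err = 0;
	unsigned int val;

	__asm__ volatile(
	"1:	ldrt	%1, [%2]\n"		/* unprivileged load */
	"2:\n"
	"	.pushsection .text.fixup, \"ax\"\n"
	"3:	mov	%0, #-14\n"		/* -EFAULT */
	"	mov	%1, #0\n"
	"	b	2b\n"
	"	.popsection\n"
	"	.pushsection __ex_table, \"a\"\n"
	"	.long	1b, 3b\n"		/* fault PC -> fixup address */
	"	.popsection\n"
	: "+r" (err), "=&r" (val)
	: "r" (addr));

	*x = val;
	return err;
}

Because the access is hidden inside an asm block the compiler cannot look through it, which is exactly the knock-on-optimization point above; on the other hand there is no separate probe pass over memory in the non-faulting case.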

Well, that doesn't work for me. Maybe qemu isn't fully emulating those
instructions, or I don't understand how to use them. In fact I tried
most variations and the test suite always fails (it doesn't fault when
expected).

...Sorry about that. It's the low (IIRC?) bit of PAR_EL1 that indicates a failure, not an actual abort.

And BTW this is broken even without SMP in the picture. The second
thread could be preemptively scheduled, flip the byte and yield. Much
harder to get the timing right but not impossible.

Ah, I wasn't sure about that. I didn't know if this particular case was potentially in preemptable context or not, and mentioned the SMP case because I know it's broken there.

When a validation function tells you that n bytes
from address p are fine, then you must copy n bytes from address p. Not
forget about n along the way! That's a very basic rule.

This is my point, yes. The safe way to use this API is to do A, then B. Only... we have an API to do A, then B. It's called
z_user_string_(alloc_)?copy. Doing A without B is (almost) always a bug waiting to happen, because now we're passing around something that's ostensibly a string, but isn't "really".

I'll open a PR to fix this particular case, but my stance remains that z_user_string_nlen as an API is rather unsafe.
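For illustration, a minimal sketch of the copy-first pattern being advocated here. The helper name and its exact signature/return convention are from memory, and this is not the actual fix PR:

static inline int z_vrfy_k_thread_name_set(struct k_thread *t, const char *str)
{
	/* Kernel-side buffer: the user string is copied exactly once, and
	 * everything past this point operates only on the copy, so there is
	 * no window for user code to change the data underneath us.
	 */
	char name[CONFIG_THREAD_MAX_NAME_LEN];

	if (z_user_string_copy(name, str, sizeof(name)) != 0) {
		return -EFAULT;
	}

	return z_impl_k_thread_name_set(t, name);
}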

@npitre
Author

npitre commented Mar 9, 2021 via email

@carlocaione
Contributor

@npitre I force pushed your userspace1 branch to fix a checkpatch error that was making the CI fail.

@nashif
Member

nashif commented Mar 9, 2021

@npitre I force pushed your userspace1 branch to fix a checkpatch error that was making the CI fail.

something went wrong

@carlocaione
Contributor

something went wrong

I think that the merge of #33145 is confusing the CI.

@carlocaione
Contributor

Ok, I rebased on latest master and now the only failure on CI is not related to this commit.

@npitre
Author

npitre commented Mar 9, 2021

Try to kick CI again.

Contributor


This ISB should only be necessary at the end of the loop, not before reading PAR.

Comment on lines +64 to +67
Contributor


Hm. The documentation (and most of the other code) treats Z_SYSCALL_MEMORY_WRITE as a superset of Z_SYSCALL_MEMORY_READ, in that write implies read. Do we need to check read also on write?

Contributor


This loop doesn't appear to check non-page-aligned addresses/sizes properly.

E.g. if I am checking 4KiB starting halfway through a page it only checks the first of the two pages, assuming I'm parsing this correctly.

This may be worth tossing in a test for.

Suggestion:

  1. At the start of the loop, do and x0, x0, #~(CONFIG_MMU_PAGE_SIZE - 1) to reset the address to the start of the page.
  2. Change the increment to add x0, x0, #(CONFIG_MMU_PAGE_SIZE).

(This won't work if x0 is 2**64-CONFIG_MMU_PAGE_SIZE, but that can't happen in practice.)

@npitre
Author

npitre commented Mar 9, 2021 via email

@npitre
Author

npitre commented Mar 9, 2021 via email

@jharris-intel
Contributor

Quoting the ARM ARM: "Where an instruction results in an update to a system register, as is the case with the AT* address translation instructions, explicit synchronization must be performed before the result is guaranteed to be visible to subsequent direct reads of the PAR_EL1." So my interpretation is that the ISB is necessary before each read of the PAR. Once we're outside of the loop, no ISB is necessary because we've already made up our mind and none of what we've done before has pending side effects that we need to wait for.

...BRB need to fix some code.

Also, wow, that's misleading... The AT* instruction descriptions simply say "Input address for translation. The resulting address can be read from the PAR_EL1". Ditto the top-level description: "If the address translation is successful, the resulting output address is returned in PAR_EL1.PA, and PAR_EL1.F is set to 0 to indicate that the translation was successful."
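Putting that interpretation together, the probe for a single page looks roughly like this. A C sketch only, since the actual patch does this in assembly inside a loop:

/* Sketch: ask the MMU whether EL0 may read the page containing addr.
 * AT S1E0R performs the stage-1 EL0 read translation; per the ARM ARM
 * passage quoted above, an ISB is needed before PAR_EL1 reliably holds
 * the result. PAR_EL1.F (bit 0) is 0 on success, 1 on a failed translation.
 */
static bool page_user_readable(uintptr_t addr)
{
	uint64_t par;

	__asm__ volatile("at s1e0r, %0" : : "r" (addr));
	__asm__ volatile("isb");
	__asm__ volatile("mrs %0, par_el1" : "=r" (par));

	return (par & 1UL) == 0;
}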

Don't think so.

I'm mainly wondering about e.g. compilers being amusing and deciding "hey, you know what would be great here at this write? A read-modify-write".

Especially since we never map anything write-only.

Fair.

Let me quote that one properly.

    add x1, x1, x0
    [...]
    orr x0, x0, #(CONFIG_MMU_PAGE_SIZE - 1)
    add x0, x0, #1
    cmp x0, x1
    blo abv_loop

    This loop doesn't appear to check non-page-aligned addresses/sizes properly. E.g. if I am checking 4KiB starting halfway through a page it only checks the first of the two pages, assuming I'm parsing this correctly.

    As I said, look again. ;-)

Oh, I see.
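For anyone else parsing the assembly, the quoted loop does roughly the following in C terms (my reading of it, reusing the hypothetical page_user_readable() probe sketched earlier):

/* Sketch: probe every page overlapping [addr, addr + size). The orr/add
 * pair rounds the current address up to the start of the next page after
 * each probe, so a range starting mid-page still visits every page it
 * touches.
 */
static bool range_user_readable(uintptr_t addr, size_t size)
{
	uintptr_t end = addr + size;			/* add x1, x1, x0 */

	do {
		if (!page_user_readable(addr)) {
			return false;
		}
		addr |= CONFIG_MMU_PAGE_SIZE - 1;	/* orr: last byte of this page */
		addr += 1;				/* add: first byte of the next page */
	} while (addr < end);				/* cmp + blo */

	return true;
}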

@npitre
Author

npitre commented Mar 10, 2021

Kicking CI again.

@carlescufi
Member

Kicking CI again.

A fix for the build issue has been merged. Please rebase.

carlocaione and others added 8 commits March 10, 2021 10:34
Introduce the necessary macros and defines to have the stack regions
correctly aligned and sized.

Signed-off-by: Carlo Caione <ccaione@baylibre.com>
Add the arch_syscall_oops hook for the AArch64.

Signed-off-by: Carlo Caione <ccaione@baylibre.com>
Introduce the first pieces needed to schedule user threads by defining
two different code paths for kernel and user threads.

Signed-off-by: Carlo Caione <ccaione@baylibre.com>
The arch_is_user_context() function relies on the content of the
tpidrro_el0 register to determine whether we are in user context or not.

This register is set to '1' when in EL1 and set back to '0' when user
threads are running in userspace.

Signed-off-by: Carlo Caione <ccaione@baylibre.com>
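In code terms, the check described above boils down to something like the following sketch. It follows the convention stated in this commit message; the actual implementation may differ in detail:

/* Sketch: tpidrro_el0 is readable from EL0 but writable only from EL1, so
 * user code cannot forge the value. Per the convention described above it
 * holds 1 while executing in EL1 and 0 while a user thread runs in EL0.
 */
static inline bool arch_is_user_context(void)
{
	uint64_t tpidrro_el0;

	__asm__ volatile("mrs %0, tpidrro_el0" : "=r" (tpidrro_el0));

	return tpidrro_el0 == 0;
}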
User mode is only allowed to induce oopses and stack check failures via
software-triggered system fatal exceptions.

Signed-off-by: Carlo Caione <ccaione@baylibre.com>
Introduce the arch_user_string_nlen() assembly routine and the necessary
C code bits.

Signed-off-by: Carlo Caione <ccaione@baylibre.com>
Signed-off-by: Nicolas Pitre <npitre@baylibre.com>
This leverages the AT (address translation) instruction to test for a
given access permission. The result is then provided in the PAR_EL1
register.

Thanks to @jharris-intel for the suggestion.

Signed-off-by: Nicolas Pitre <npitre@baylibre.com>
This patch adds the code managing the syscalls. The privileged stack
is set up before jumping into the real syscall.

Signed-off-by: Carlo Caione <ccaione@baylibre.com>
@nashif nashif merged commit dacd176 into zephyrproject-rtos:master Mar 10, 2021
@npitre npitre deleted the userspace1 branch March 10, 2021 21:56