Skip to content

std: stop using TLS in signal handler #140628

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

joboet
Copy link
Member

@joboet joboet commented May 3, 2025

TLS is not async-signal-safe, making its use in the signal handler used to detect stack overflows unsound (c.f. #133698). POSIX however lists two thread-specific identifiers that can be obtained in a signal handler: the current pthread_t and the address of errno. Since pthread_equal is not AS-safe, pthread_t should be considered opaque, so for our purposes, &errno is the only option. This however works nicely: we can use the address as a key into a map that stores information for each thread. This PR uses a BTreeMap protected by a spin lock to hold the guard page address and thread name and thus fixes #133698.

@joboet joboet added A-thread-locals Area: Thread local storage (TLS) T-libs Relevant to the library team, which will review and decide on the PR/issue. labels May 3, 2025
@rustbot
Copy link
Collaborator

rustbot commented May 3, 2025

r? @Mark-Simulacrum

rustbot has assigned @Mark-Simulacrum.
They will have a look at your PR within the next two weeks and either review your PR or reassign to another reviewer.

Use r? to explicitly pick a reviewer

@rustbot rustbot added O-unix Operating system: Unix-like S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels May 3, 2025
@rust-log-analyzer

This comment has been minimized.

@workingjubilee
Copy link
Member

thread 'main' panicked at tests/pass/weak_memory/weak.rs:145:5:
assertion failed: std::iter::repeat_with(|| f()).take(100).any(|x| x)

well, that doesn't seem good

@rustbot
Copy link
Collaborator

rustbot commented May 4, 2025

The Miri subtree was changed

cc @rust-lang/miri

@joboet
Copy link
Member Author

joboet commented May 4, 2025

well, that doesn't seem good

That's no cause for concern, these miri tests just assumed that there is no synchronisation between threads created independently from each other. That's no longer true however as the threads all acquire the thread info lock. I've fixed this now by moving the operations around a bit and by using yield_now.

Copy link
Member

@RalfJung RalfJung left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the Miri test changes are right, but I find these tests a lot harder to read now. We should stop using thread::spawn and instead directly call pthread_create or so to avoid the spurious synchronization that std incurs.

Copy link
Member

@Mark-Simulacrum Mark-Simulacrum left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

General approach seems fine to me.

//! Unfortunately, because thread local storage isn't async-signal-safe, we
//! cannot soundly use it in our stack overflow handler. While this works
//! without problems on most platforms, it can lead to undefined behaviour
//! on others (such as GNU/Linux). Luckily, the POSIX specification documents
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess we expect that all cfg(unix) platforms satisfy POSIX? Is that a safe assumption to make?

It seems a little odd to me that errno is somehow special (vs arbitrary thread-locals)...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems a little odd to me that errno is somehow special (vs arbitrary thread-locals)...

The C runtime can rely on the fact that it's always loaded and use e.g. the initial-exec TLS model – which is async-signal-safe – for errno (that's what glibc does). Or they embed errno into the thread control block, which is accessible through the thread pointer (that's what musl does).

I guess we expect that all cfg(unix) platforms satisfy POSIX? Is that a safe assumption to make?

The expectation must only hold for platforms that enable the stack overflow handler.

Copy link
Member

@workingjubilee workingjubilee May 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is not a safe assumption to make on various niche cases, however if they implement any threading, have signals, yet do not implement POSIX correctly-enough for this, they will have a lot of problems that we will immediately notice. In particular, this is a very mild demand: "implement errno in a POSIX-conforming way". If you have threads and signals AND want to run C code, you pretty much have to do this. Your only choice with being non-conformant in that case is to deliberately pick a non-conforming implementation that still somehow doesn't get stomped by C code running concurrently.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In practice the only other implementation of errno I would expect to see is the global version that gets stomped by running concurrently and thus cannot support threading.

// This uses a `BTreeMap` instead of a hashmap since it supports constant
// initialization and automatically reduces the amount of memory used when
// items are removed.
static mut THREAD_INFO: BTreeMap<usize, ThreadInfo> = BTreeMap::new();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe s/static mut/ with a SyncUnsafeCell or something like that?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't feel strongly about this, but I didn't think it any clearer...

// items are removed.
static mut THREAD_INFO: BTreeMap<usize, ThreadInfo> = BTreeMap::new();

struct UnlockOnDrop;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we make this be SpinGuard<'a, BTreeMap<usize, ThreadInfo>> or so?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That would necessitate Deref and DerefMut implementations that aren't really adding any clarity.

}
}

pub fn set_current_info(guard_page_range: Range<usize>, thread_name: Option<Box<str>>) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: why not take the ThreadInfo struct as the argument?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought the struct creation to be unnecessary clutter at the call site.

@Mark-Simulacrum
Copy link
Member

That's no longer true however as the threads all acquire the thread info lock

Are we OK with that guarantee propagating out to "userspace"? I guess it's sort of unavoidable if we want some kind of lock at thread creation/death but it feels a bit like that should be a more intentional decision?

I guess it's undocumented but might still be a bit hard to claw back if we found a way to eliminate the lock later.

@joboet
Copy link
Member Author

joboet commented May 5, 2025

I think the Miri test changes are right, but I find these tests a lot harder to read now.

I've added a helper function that abstracts all of the yield_now logic away, so the tests are hopefully clearer now.

@rust-log-analyzer

This comment has been minimized.

@joboet joboet force-pushed the async_signal_safe branch from e2e4fda to 3280a1f Compare May 5, 2025 09:06
@RalfJung
Copy link
Member

RalfJung commented May 5, 2025

disable the stack overflow handler on miri

Hm, I guess that is the other option. It'd also let us remove some other hacks from Miri probably, since we don't really support the functions that code uses. OTOH it does leave some code without our usual UB test coverage.

joboet added 2 commits May 5, 2025 15:18
TLS is not async-signal-safe, making its use in the signal handler used to detect stack overflows unsound (c.f. rust-lang#133698). POSIX however lists two thread-specific identifiers that can be obtained in a signal handler: the current `pthread_t` and the address of `errno`. Since `pthread_equal` is not AS-safe, `pthread_t` should be considered opaque, so for our purposes, `&errno` is the only option. This however works nicely: we can use the address as a key into a map that stores information for each thread. This PR uses a `BTreeMap` protected by a spin lock to hold the guard page address and thread name and thus fixes rust-lang#133698.
@joboet joboet force-pushed the async_signal_safe branch from 3280a1f to fb91fdb Compare May 5, 2025 13:19
mod thread_info;

// miri doesn't model signals, and this code has some synchronization properties
// that we don't want to expose to user code.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// that we don't want to expose to user code.
// that we don't want to expose to user code. Also, Miri fundamentally cannot even have a stack overflow.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-thread-locals Area: Thread local storage (TLS) O-unix Operating system: Unix-like S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. T-libs Relevant to the library team, which will review and decide on the PR/issue.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

std assumes that accessing an already-initialized TLS variable is async-signal-safe
7 participants