Improve thread scheduler memory ordering #43418
Conversation
bump! some of this looks very good to have.
Pickup TSAN/ASAN and LLVM assertion build fixes from #43418
Force-pushed from 2a6a8ab to c3f7396
src/partr.c (outdated diff)
```c
if (!multiq_check_empty()) {
    // acquire sleep-check lock
    jl_atomic_store_relaxed(&ptls->sleep_check_state, sleeping);
    jl_fence(); // add a sequenced-before edge between the sleep_check_state modify and get_next_task checks
```
Why not

```diff
- jl_fence(); // add a sequenced-before edge between the sleep_check_state modify and get_next_task checks
+ jl_fence(); // ensures sleep_check_state is sequentially-consistent with enqueuing the task
```

as you wrote in `jl_wakeup_thread`? Since sequenced-before is a single-thread property (and we get it "for free"), we don't need a fence for that, and I don't think the current comment is very informative here. Rather, we need the fence for the global acyclicity of SC. (I usually informally reason about this as "avoiding store-load reordering", because it's a Dekker-like situation.)
Yeah, this is annoying to describe properly. I probably want seq-cst happens-before here. But it is not with the enqueuing of the task, since that might only be atomic release. And anyway, seq-cst stores only create a global ordering with other seq-cst operations and fences; IIUC, they do not necessarily enforce an ordering on other nearby operations.
IANAL but I think a somewhat concrete argument would be something like the following. Using the terminology by Lahav et al. (2017) (as standardese is hard...) we see that we can have two reads-before edges:
- rb1: `multiq_check_empty` -> `multiq_insert`
- rb2: `jl_atomic_load_relaxed(&ptls->sleep_check_state)` in `jl_wakeup_thread` -> `jl_atomic_store_relaxed(&ptls->sleep_check_state, sleeping)`
(Since they are inter-thread edges that start from a read and end with a write, they do not contribute to the usual happens-before edges.)
(Since rb1 touches multiple memory locations, I guess it's more like a "bundle of edges".)
Let us also name the two fences sandwiching these edges:
- F1: the first `jl_fence` in `jl_wakeup_thread`
- F2: the `jl_fence` in `jl_task_get_next` that I quoted above
These fences do double duty, promoting the two reads-before edges to edges in the partial SC relation:
- sc1: F2 -> rb1 -> F1
- sc2: F1 -> rb2 -> F2
Since both sc1 and sc2 are part of the partial SC relation, which is acyclic under the C++20 memory model, they can't form a cycle. We can use this to show that the following unwanted execution is impossible:
- Initial state:
  - queue is empty
  - `sleep_check_state` is `not_sleeping`
- Thread 1 (dequeuer) does:
  - 1a: `jl_atomic_store_relaxed(&ptls->sleep_check_state, sleeping)`
  - 1b: `multiq_check_empty` returns true
- Thread 2 (enqueuer) does:
  - 2a: `multiq_insert`
  - 2b: `jl_atomic_load_relaxed(&ptls->sleep_check_state)` in `jl_wakeup_thread` returns `not_sleeping`
i.e., the dequeuer misses the enqueue and the enqueuer misses the sleep state transition.
This is impossible because it forms a cycle:
- 2b -> 1a (due to the value read in 2b)
- 1a -> 1b (sequenced-before)
- 1b -> 2a (due to the value read in 1b)
- 2a -> 2b (sequenced-before)
which is forbidden because both 1b -> 2a (rb1) and 2b -> 1a (rb2) are part of the partial SC relation, thanks to the SC fences.
But this argument is essentially a copy of the argument about the standard "store buffering litmus test" so I don't think it's helpful to expand the explanation this much.
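For reference, here is a minimal C11 sketch of that store-buffering litmus test, with the two fences playing the roles of F1 and F2 above; `x`, `y`, `r1`, and `r2` are illustrative names, not anything from partr.c. Without the fences, the outcome `*r1 == 0 && *r2 == 0` is allowed; with them it is forbidden, which is exactly the property the cycle argument above relies on.

```c
#include <stdatomic.h>

static atomic_int x, y;  /* both initially 0 */

void thread1(int *r1)    /* dequeuer analog */
{
    atomic_store_explicit(&x, 1, memory_order_relaxed);   /* like 1a */
    atomic_thread_fence(memory_order_seq_cst);            /* like F2 */
    *r1 = atomic_load_explicit(&y, memory_order_relaxed); /* like 1b */
}

void thread2(int *r2)    /* enqueuer analog */
{
    atomic_store_explicit(&y, 1, memory_order_relaxed);   /* like 2a */
    atomic_thread_fence(memory_order_seq_cst);            /* like F1 */
    *r2 = atomic_load_explicit(&x, memory_order_relaxed); /* like 2b */
}
```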
But I think it's very helpful to document which fences "pair up". Maybe we can use a markdown footnote-like comment like this

```c
jl_fence(); // [^store_buffering_1]
```

in the first fence in `jl_wakeup_thread` and the fence in `jl_task_get_next` that I quoted above? We can then document which stores and loads are taken care of by this fence outside the function body (so that it's OK to write a somewhat lengthy comment).
```c
...
}
// [^store_buffering_1]: These fences are used to avoid the cycle 2b -> 1a -> 1b -> 2a -> 2b where
// * Dequeuer:
//   * 1a: `jl_atomic_store_relaxed(&ptls->sleep_check_state, sleeping)`
//   * 1b: `multiq_check_empty` returns true
// * Enqueuer:
//   * 2a: `multiq_insert`
//   * 2b: `jl_atomic_load_relaxed(&ptls->sleep_check_state)` in `jl_wakeup_thread` returns `not_sleeping`
// i.e., the dequeuer misses the enqueue and the enqueuer misses the sleep state transition.
```
```c
uv_mutex_unlock(&sleep_locks[tid]);
// ...
if (jl_atomic_load_relaxed(&other->sleep_check_state) == sleeping) {
    if (jl_atomic_cmpswap_relaxed(&other->sleep_check_state, &state, not_sleeping)) {
```
I wonder if you technically need `acq_rel` ordering or a manual fence on the success path, for a happens-before edge to `jl_atomic_store_relaxed(&ptls->sleep_check_state, sleeping)` in `jl_task_get_next`. But it would be very counter-intuitive if you need one...

(I also wondered if it can disrupt the Dekker-like pattern in `jl_task_get_next` and `jl_wakeup_thread`. But IIUC this is fine, since the RMW extends the release sequence.)
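As a hedged illustration of that last point (assuming I read the release-sequence rules correctly): an RMW that reads from a release store continues that store's release sequence even when the RMW itself is relaxed, so a later acquire load that reads from the RMW still synchronizes with the original release store. All names below are made up for the sketch.

```c
#include <stdatomic.h>

static int payload;     /* plain, non-atomic data */
static atomic_int flag; /* initially 0 */

void releaser(void)     /* thread A */
{
    payload = 42;
    atomic_store_explicit(&flag, 1, memory_order_release);
}

void modifier(void)     /* thread B: relaxed RMW, like the cmpswap above */
{
    atomic_fetch_add_explicit(&flag, 1, memory_order_relaxed);
}

int observer(void)      /* thread C */
{
    /* Reading 2 means the RMW read from A's release store, so the RMW is
     * part of A's release sequence and this acquire load synchronizes
     * with A's store: payload is guaranteed to be 42 here. */
    if (atomic_load_explicit(&flag, memory_order_acquire) == 2)
        return payload;
    return -1;
}
```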
We do not release this elsewhere, so I think that ordering would be invalid here. In fact, I think the only valid ordering for this operation is relaxed, because we use relaxed stores elsewhere.
The relaxed swap is not permitted to fail spuriously, so if we get here, we know at least one thread will complete the transition from sleeping->not_sleeping, and that thread will also make sure to wake the condition signal. (We might get more than one thread completing this transition, if we were very slow here and the thread was already awake for other reasons, finished processing the work, and then went back to sleep. But we are not worried about spurious wake-ups.)
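A minimal sketch of that argument, with illustrative names rather than the actual partr.c code: atomic RMWs always operate on the latest value in the modification order, so for a single sleeping -> not_sleeping transition at most one strong (non-spuriously-failing) CAS can succeed, even with relaxed ordering.

```c
#include <stdatomic.h>

enum { not_sleeping = 0, sleeping = 1 };
static _Atomic int sleep_check_state = sleeping;

int try_wake(void)
{
    int expected = sleeping;
    /* strong CAS: does not fail spuriously, so exactly one caller wins
     * the sleeping -> not_sleeping transition */
    if (atomic_compare_exchange_strong_explicit(
            &sleep_check_state, &expected, not_sleeping,
            memory_order_relaxed, memory_order_relaxed)) {
        return 1; /* this thread owns the wake-up; signal the condition */
    }
    return 0;     /* someone else already completed the transition */
}
```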
Right... I just realized that the fence for `jl_atomic_store_relaxed(&ptls->sleep_check_state, sleeping)` is on the opposite side, so it can't start a synchronizes-with edge (unless we put a fence before `jl_atomic_store_relaxed(&ptls->sleep_check_state, sleeping)` as well, or strengthen its ordering).
> I think the only valid ordering for this operation is relaxed, because we use relaxed stores elsewhere.
I believe it's OK. We can get happens-before edges even with mixed memory orderings as long as they are at least relaxed (not non-atomic) and we put fences before the write and after the read. See fence-atomic and atomic-fence synchronizations in https://en.cppreference.com/w/cpp/atomic/atomic_thread_fence
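For concreteness, a small sketch of the fence-atomic / atomic-fence pattern described there, using C11 atomics with made-up names: the release fence before the relaxed store pairs with the acquire fence after the relaxed load, establishing a happens-before edge even though both accesses are relaxed.

```c
#include <stdatomic.h>

static int data;         /* plain, non-atomic data */
static atomic_int ready; /* initially 0 */

void producer(void)
{
    data = 1;
    atomic_thread_fence(memory_order_release);              /* fence-atomic */
    atomic_store_explicit(&ready, 1, memory_order_relaxed);
}

int consumer(void)
{
    if (atomic_load_explicit(&ready, memory_order_relaxed) == 1) {
        atomic_thread_fence(memory_order_acquire);          /* atomic-fence */
        return data; /* happens-before established: reads 1 */
    }
    return -1;
}
```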
> The relaxed swap is not permitted to fail spuriously,
I think the difference between strong and weak CASes is orthogonal to the memory ordering, at least for the memory model of the abstract machine. `compare_exchange_weak` and `compare_exchange_strong` both have memory ordering arguments. Also, in the formal declarative memory model (by Lahav et al. 2017, which is the basis of the C++20 memory model), the distinction between strong and weak CAS does not exist; IIUC, failed weak CASes just appear in the execution graph as repeated reads immediately followed by a write (i.e., failed writes are ignored).
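To illustrate (with hypothetical names): both CAS variants take the same pair of ordering arguments; the weak form may fail spuriously, which is why it is normally wrapped in a retry loop.

```c
#include <stdatomic.h>

static atomic_int counter;

void increment(void)
{
    int old = atomic_load_explicit(&counter, memory_order_relaxed);
    /* weak CAS in a loop: a spurious failure simply retries */
    while (!atomic_compare_exchange_weak_explicit(
               &counter, &old, old + 1,
               memory_order_relaxed, memory_order_relaxed)) {
        /* on failure, `old` is reloaded with the current value; retry */
    }
}
```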
> But we are not worried about spurious wake-ups.
I was more worried about the case that the wake-up is somehow missed. Intuitively, I thought you need to send the signal "after" you know that `jl_atomic_store_relaxed(&ptls->sleep_check_state, sleeping)` has happened. But since relaxed memory reads/writes don't enforce any ordering, we technically can't assume what the other task (which is supposedly about to sleep) is actually doing. I can't help wondering if I'm taking the C/C++ committee's FUD way too seriously and/or I'm still misunderstanding how the C++20 memory model works. It's really bizarre that we can't rely on "causality" through relaxed memory accesses alone. IIRC, Hans Boehm has a very intricate example involving GVN and speculative execution, but I don't know if it applies here.
That page on fences doesn't appear to mention seq_cst, and that is what we require here. Merely knowing acquire and release is insufficient to provide any interesting constraint here on the behavior of the program. What we need to prove is that the release on thread 1 (the store to sleep_check_state) has a globally consistent ordering with the release on thread 2 (the store to the workqueue), relative to the subsequent checks on those threads of the contents of the workqueue and the sleep_check_state, respectively.
For the seq_cst properties, we have the description of the fence from https://en.cppreference.com/w/cpp/atomic/memory_order:

> 2) the sequentially-consistent fences are only establishing total ordering for the fences themselves, not for the atomic operations in the general case (sequenced-before is not a cross-thread relationship, unlike happens-before)
> I was more worried about the case that the wake-up is somehow missed
Once we reach this point, we are into the land of locks and condition objects, which are hopefully much simpler to analyze and declare safe in this case.
> https://en.cppreference.com/w/cpp/atomic/memory_order:
> 2) the sequentially-consistent fences are only establishing total ordering for the fences themselves, not for the atomic operations in the general case (sequenced-before is not a cross-thread relationship, unlike happens-before)
I think that's a bit stale info on the cppreference side. That's actually one of the major improvements in the C++20 memory model. P0668R4: Revising the C++ memory model includes a patch that removes similar sentences from the standard. In particular, the first paragraphs of the section "Strengthen SC fences" provide nice background information.
[Notes for people who want to dig into the original resource; e.g., future me] The relevant definition in Lahav et al. is (roughly):

psc_base = ([E^sc] ∪ [F^sc]; hb^?); scb; ([E^sc] ∪ hb^?; [F^sc])

The `[E^sc]` prefix and suffix in `psc_base` indicate that arbitrary SC events, including reads and writes, can start/end the partial SC relation, provided that the read/write starts or ends the SC-before relation `scb`. On the other hand, fences (`[F^sc]`) work in a more general setting where `scb` is prefixed/suffixed by the happens-before relation `hb` (which is indicated by `[F^sc]; hb^?` and `hb^?; [F^sc]`). For example, this means that a reads-before edge `rb` that starts with an SC read and ends with an SC fence (`[E^sc]; rb; [F^sc]`) is part of `psc_base` and hence `psc`.
> Once we reach this point, we are into the land of locks and condition objects, which are hopefully much simpler to analyze and declare safe in this case.
Yes, I agree. I was just curious how I would reason about this if I wanted to do it rigorously.
The svec set method needs to do a little bit more work (for asserts and write barriers), which is not necessary if we know it was just allocated.
These were previously implied (on TSO platforms, such as x86) by the atomic_store to sleeping and the sleep_locks acquire before the wake_check loop, but this makes it more explicit. We might want to consider in the future if it would be better (faster) to acquire each possible lock on the sleeping path instead, so that we do each operation with seq_cst, instead of using a fence to only order the operations we care about directly.
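A sketch of the two alternatives being weighed here, with illustrative names rather than the actual partr.c code: a single seq_cst fence ordering a pair of relaxed accesses, versus making each access seq_cst on its own. Both forbid the store-load reordering in question; which is faster depends on the target architecture.

```c
#include <stdatomic.h>
#include <stdbool.h>

static atomic_int sleep_state, queue_len;

/* variant 1: one explicit fence orders only the operations we care about */
bool sleep_path_with_fence(void)
{
    atomic_store_explicit(&sleep_state, 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);
    return atomic_load_explicit(&queue_len, memory_order_relaxed) == 0;
}

/* variant 2: every access is seq_cst, joining the single total order */
bool sleep_path_with_sc_ops(void)
{
    atomic_store_explicit(&sleep_state, 1, memory_order_seq_cst);
    return atomic_load_explicit(&queue_len, memory_order_seq_cst) == 0;
}
```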
This better ensures that we support nesting of threaded regions and access to this variable from any thread. Previously, it appears that we might toggle this variable on another thread (via thread migration) but would then fail to wake up thread 0, causing it to resume sleeping on the event loop.
Force-pushed from c3f7396 to 0eaf35a
Some of these should probably land separately, but I am collecting them as I go while trying to improve thread-safety and similar support.