-
Notifications
You must be signed in to change notification settings - Fork 8.3k
Description
Describe the bug
The implementation of z_unpend1_no_timeout is non-atomic, and can result in threads in an inconsistent state.
To Reproduce
Inspect
Expected behavior
z_unpend1_no_timeout should leave threads in a consistent state.
Impact
Usages of z_unpend1_no_timeout in preemptive threads, threads and interrupts, or in SMP mode can result in catastrophic system-level behavior (chasing stale/null pointers, threads not being scheduled, threads being scheduled on multiple cores at once, etc.)
Logs and console output
n/a
Environment (please complete the following information):
- OS: Ubuntu 18.04.5 LTS
- Toolchain (e.g Zephyr SDK, ...)
- Commit SHA or Version used: based off of /u/MrVan's AArch64 SMP branch (ARM64 SMP support #30676) with some additional fixes that I'm in the process of upstreaming. That being said, the references below are from trunk.
Additional context
The current implementation of z_unpend1_no_timeout is as follows.
struct k_thread *thread = z_find_first_thread_to_unpend(wait_q, NULL);
if (thread != NULL) {
z_unpend_thread_no_timeout(thread);
}Unfortunately, although both of these individual calls are protected internally by sched_spinlock, there is no guarantee that the combination of the two calls are atomic.
This can result in catastrophic system-level behavior. For one (rather contrived) example:
(some_timer currently is expired, with thread D waiting on it)
A: starts k_timer_stop(&some_timer)
B: starts k_timer_stop(&some_timer)
A: calls into z_abort_timeout(&timer->timeout) -> success, so keep going
C: starts k_timer_start(&some_timer, very_small_duration)
C: runs all the way through k_timer_start, including adding a new timeout
B: calls into z_abort_timeout(&timer->timeout) -> success, so keep going
A: starts z_unpend1_no_timeout
A: calls z_find_first_thread_to_unpend -> returns D
B: starts z_unpend1_no_timeout
B: calls z_find_first_thread_to_unpend -> returns D
A: thread != NULL, so calls _priq_wait_remove which calls z_priq_dumb_remove
A: calls sys_dlist_remove(&thread->base.qnode_dlist);
A: sys_dlist_remove follows thread D.next and thread D.prev
A: sys_dlist_remove sets thread D.next == thread D.prev == NULL
B: thread != NULL, so calls _priq_wait_remove which calls z_priq_dumb_remove
B: calls sys_dlist_remove(&thread->base.qnode_dlist);
B: sys_dlist_remove follows thread D.next and thread D.prev
B: null pointer dereference
The simplest option here that I can see is to change z_unpend1_no_timeout from a function calling two functions which each lock into a function that locks sched_lock and then call two functions that each don't lock (akin to e.g. z_unpend_thread_no_timeout and unpend_thread_no_timeout, where e.g. z_thread_timeout locks then calls unpend_thread_no_timeout and does other things); I haven't examined if the non-atomicity in e.g. z_impl_k_timer_stop is a problem; if so this interface may need to be rethought.
...of course, it's also entirely possible I'm missing something obvious here.
(P.S. there is also a rather suspect comment in z_timer_expiration_handler:
/*
* Interrupts _DO NOT_ have to be locked in this specific
* instance of thread unpending because a) this is the only
* place a thread can be taken off this pend queue, and b) the
* only place a thread can be put on the pend queue is at
* thread level, which of course cannot interrupt the current
* context.
*/...is this actually the case in SMP mode? On a single core, sure, but this comment's logic does not apply if there are multiple cores so far as I can tell. At least not without additional considerations (e.g. if there's a lock in this chain that I'm missing that would prevent a concurrent call to z_impl_k_timer_start). In addition, z_impl_k_timer_stop also potentially takes items off of the pend queue, contradicting a), if I understand the code correctly.)