Skip to content

z_unpend1_no_timeout non-atomic #32136

@jharris-intel

Description

@jharris-intel

Describe the bug

The implementation of z_unpend1_no_timeout is non-atomic, and can result in threads in an inconsistent state.

To Reproduce

Inspect

Expected behavior

z_unpend1_no_timeout should leave threads in a consistent state.

Impact

Usages of z_unpend1_no_timeout in preemptive threads, threads and interrupts, or in SMP mode can result in catastrophic system-level behavior (chasing stale/null pointers, threads not being scheduled, threads being scheduled on multiple cores at once, etc.)

Logs and console output

n/a

Environment (please complete the following information):

  • OS: Ubuntu 18.04.5 LTS
  • Toolchain (e.g Zephyr SDK, ...)
  • Commit SHA or Version used: based off of /u/MrVan's AArch64 SMP branch (ARM64 SMP support #30676) with some additional fixes that I'm in the process of upstreaming. That being said, the references below are from trunk.

Additional context

The current implementation of z_unpend1_no_timeout is as follows.

	struct k_thread *thread = z_find_first_thread_to_unpend(wait_q, NULL);
	if (thread != NULL) {
		z_unpend_thread_no_timeout(thread);
	}

Unfortunately, although both of these individual calls are protected internally by sched_spinlock, there is no guarantee that the combination of the two calls are atomic.

This can result in catastrophic system-level behavior. For one (rather contrived) example:

(some_timer currently is expired, with thread D waiting on it)
A: starts k_timer_stop(&some_timer)
B: starts k_timer_stop(&some_timer)
A: calls into z_abort_timeout(&timer->timeout) -> success, so keep going
C: starts k_timer_start(&some_timer, very_small_duration)
C: runs all the way through k_timer_start, including adding a new timeout
B: calls into z_abort_timeout(&timer->timeout) -> success, so keep going
A: starts z_unpend1_no_timeout
A: calls z_find_first_thread_to_unpend -> returns D
B: starts z_unpend1_no_timeout
B: calls z_find_first_thread_to_unpend -> returns D
A: thread != NULL, so calls _priq_wait_remove which calls z_priq_dumb_remove
A: calls sys_dlist_remove(&thread->base.qnode_dlist);
A: sys_dlist_remove follows thread D.next and thread D.prev
A: sys_dlist_remove sets thread D.next == thread D.prev == NULL
B: thread != NULL, so calls _priq_wait_remove which calls z_priq_dumb_remove
B: calls sys_dlist_remove(&thread->base.qnode_dlist);
B: sys_dlist_remove follows thread D.next and thread D.prev
B: null pointer dereference

The simplest option here that I can see is to change z_unpend1_no_timeout from a function calling two functions which each lock into a function that locks sched_lock and then call two functions that each don't lock (akin to e.g. z_unpend_thread_no_timeout and unpend_thread_no_timeout, where e.g. z_thread_timeout locks then calls unpend_thread_no_timeout and does other things); I haven't examined if the non-atomicity in e.g. z_impl_k_timer_stop is a problem; if so this interface may need to be rethought.

...of course, it's also entirely possible I'm missing something obvious here.

(P.S. there is also a rather suspect comment in z_timer_expiration_handler:

	/*
	 * Interrupts _DO NOT_ have to be locked in this specific
	 * instance of thread unpending because a) this is the only
	 * place a thread can be taken off this pend queue, and b) the
	 * only place a thread can be put on the pend queue is at
	 * thread level, which of course cannot interrupt the current
	 * context.
	 */

...is this actually the case in SMP mode? On a single core, sure, but this comment's logic does not apply if there are multiple cores so far as I can tell. At least not without additional considerations (e.g. if there's a lock in this chain that I'm missing that would prevent a concurrent call to z_impl_k_timer_start). In addition, z_impl_k_timer_stop also potentially takes items off of the pend queue, contradicting a), if I understand the code correctly.)

Metadata

Metadata

Assignees

Labels

area: KernelbugThe issue is a bug, or the PR is fixing a bugpriority: highHigh impact/importance bug

Type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions