z_unpend1_no_timeout non-atomic

**Describe the bug**

The implementation of `z_unpend1_no_timeout` is non-atomic, and can result in threads in an inconsistent state.

**To Reproduce**

Inspect 

**Expected behavior**

`z_unpend1_no_timeout` should leave threads in a consistent state.

**Impact**

Usages of `z_unpend1_no_timeout` in preemptive threads, threads and interrupts, or in SMP mode can result in catastrophic system-level behavior (chasing stale/null pointers, threads not being scheduled, threads being scheduled on multiple cores at once, etc.)

**Logs and console output**

n/a

**Environment (please complete the following information):**

- OS: Ubuntu 18.04.5 LTS
- Toolchain (e.g Zephyr SDK, ...)
- Commit SHA or Version used: based off of /u/MrVan's AArch64 SMP branch (#30676) with some additional fixes that I'm in the process of upstreaming. That being said, the references below are from trunk.

**Additional context**

The current implementation of `z_unpend1_no_timeout` is [as follows](https://github.com/zephyrproject-rtos/zephyr/blob/d2c71796afc17ae10763b16ad59de9e18831821c/kernel/include/ksched.h#L305).

```c
	struct k_thread *thread = z_find_first_thread_to_unpend(wait_q, NULL);
	if (thread != NULL) {
		z_unpend_thread_no_timeout(thread);
	}
```

Unfortunately, although both of these individual calls are protected internally by `sched_spinlock`, there is no guarantee that the combination of the two calls are atomic.

This can result in catastrophic system-level behavior. For one (rather contrived) example:


> (some_timer currently is expired, with thread D waiting on it)
> A: starts [k_timer_stop(&some_timer)](https://github.com/zephyrproject-rtos/zephyr/blob/0ab314f705948db9053904f20c355da8c711ccf4/kernel/timer.c#L155)
> B: starts [k_timer_stop(&some_timer)](https://github.com/zephyrproject-rtos/zephyr/blob/0ab314f705948db9053904f20c355da8c711ccf4/kernel/timer.c#L155)
> A: calls into [z_abort_timeout(&timer->timeout)](https://github.com/zephyrproject-rtos/zephyr/blob/b5b9f2c458d590e15b5f8302dc2fc430467d0875/kernel/timeout.c#L153) -> success, so keep going
> C: starts [k_timer_start(&some_timer, very_small_duration)](https://github.com/zephyrproject-rtos/zephyr/blob/0ab314f705948db9053904f20c355da8c711ccf4/kernel/timer.c#L140)
> C: runs all the way through [k_timer_start](https://github.com/zephyrproject-rtos/zephyr/blob/0ab314f705948db9053904f20c355da8c711ccf4/kernel/timer.c#L140), including adding a new timeout
> B: calls into [z_abort_timeout(&timer->timeout)](https://github.com/zephyrproject-rtos/zephyr/blob/b5b9f2c458d590e15b5f8302dc2fc430467d0875/kernel/timeout.c#L153) -> success, so keep going
> A: starts [z_unpend1_no_timeout](https://github.com/zephyrproject-rtos/zephyr/blob/d2c71796afc17ae10763b16ad59de9e18831821c/kernel/include/ksched.h#L307)
> A: calls [z_find_first_thread_to_unpend ](https://github.com/zephyrproject-rtos/zephyr/blob/b5b9f2c458d590e15b5f8302dc2fc430467d0875/kernel/sched.c#L702)-> returns D
> B: starts [z_unpend1_no_timeout](https://github.com/zephyrproject-rtos/zephyr/blob/d2c71796afc17ae10763b16ad59de9e18831821c/kernel/include/ksched.h#L307)
> B: calls [z_find_first_thread_to_unpend ](https://github.com/zephyrproject-rtos/zephyr/blob/b5b9f2c458d590e15b5f8302dc2fc430467d0875/kernel/sched.c#L702)-> returns D
> A: [thread != NULL](https://github.com/zephyrproject-rtos/zephyr/blob/b5b9f2c458d590e15b5f8302dc2fc430467d0875/kernel/sched.c#L710), so calls _priq_wait_remove which calls z_priq_dumb_remove
> A: calls sys_dlist_remove(&thread->base.qnode_dlist);
> A: sys_dlist_remove follows thread D.next and thread D.prev
> A: sys_dlist_remove sets thread D.next == thread D.prev == NULL
> B: [thread != NULL](https://github.com/zephyrproject-rtos/zephyr/blob/b5b9f2c458d590e15b5f8302dc2fc430467d0875/kernel/sched.c#L710), so calls _priq_wait_remove which calls z_priq_dumb_remove
> B: calls sys_dlist_remove(&thread->base.qnode_dlist);
> B: sys_dlist_remove follows thread D.next and thread D.prev
> B: null pointer dereference

The simplest option here that I can see is to change `z_unpend1_no_timeout` from a function calling two functions which each lock into a function that locks `sched_lock` and then call two functions that each don't lock (akin to e.g. `z_unpend_thread_no_timeout` and `unpend_thread_no_timeout`, where e.g. `z_thread_timeout` locks then calls `unpend_thread_no_timeout` and does other things); I haven't examined if the non-atomicity in e.g. `z_impl_k_timer_stop` is a problem; if so this interface may need to be rethought.

...of course, it's also _entirely_ possible I'm missing something obvious here.

(P.S. there is also a rather suspect comment in `z_timer_expiration_handler`:

```c
	/*
	 * Interrupts _DO NOT_ have to be locked in this specific
	 * instance of thread unpending because a) this is the only
	 * place a thread can be taken off this pend queue, and b) the
	 * only place a thread can be put on the pend queue is at
	 * thread level, which of course cannot interrupt the current
	 * context.
	 */
```
...is this actually the case in SMP mode? On a single core, sure, but this comment's logic does not apply if there are multiple cores so far as I can tell. At least not without additional considerations (e.g. if there's a lock in this chain that I'm missing that would prevent a concurrent call to `z_impl_k_timer_start`). In addition, `z_impl_k_timer_stop` also potentially takes items off of the pend queue, contradicting a), if I understand the code correctly.)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

z_unpend1_no_timeout non-atomic #32136

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

z_unpend1_no_timeout non-atomic #32136

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions