Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

core/sync: add wait queues #21123

Open
wants to merge 19 commits into
base: master
Choose a base branch
from

Conversation

derMihai
Copy link
Contributor

@derMihai derMihai commented Jan 8, 2025

Contribution description

Adds Linux-kernel-like wait queue for mutex-less condition signaling.

Motivation

Copy-paste from core/include/wait_queue.h:

Wait queues enable lock-free, IRQ-safe condition signaling. This implementation is inspired from the Linux Kernel.

Wait queues have similar semantics to condition variables, but don't require setting the condition + signaling to be atomic, hence no mutex is needed. In turn, one may safely call QUEUE_WAKE() from an ISR. Note, while cond_signal() and cond_broadcast() are safe to call from an ISR context too, doing so will very probably cause a race condition elsewhere. Consider the following scenario using condition variables:

static uint64_t measurement;
mutex_t cond_lock = MUTEX_INIT;
cond_t cond = COND_INIT;

void measurement_irq(void)
{
    measurement = measure();
    cond_broadcast(&cond);
}

void wait_for_critical_value(void)
{
    mutex_lock(&cond_lock);
    while (atomic_load_u64(&measurement) < THRESHOLD) {
        cond_wait(&cond, cond_lock);
    }
    mutex_unlock(&cond_lock);
}

Note, the mutex is there only for the cond_wait() API call, as we're not allowed to call mutex_lock() inside the ISR. This alone is a hint that something isn't right and indeed, the following sequence of events is possible:

  1. thread sees measurement < THRESHOLD, and is about to call cond_wait()
  2. ISR fires, sets measurement = THRESHOLD, and signals the condition
  3. thread calls cond_wait() and goes to sleep, possibly forever

Using a wait queue, we can do this:

static uint64_t measurement;
wait_queue_t wq = WAIT_QUEUE_INIT;

void measurement_irq(void)
{
    measurement = measure();
    queue_wake(&wq);
}

void wait_for_critical_value(void)
{
    QUEUE_WAIT(&wq, atomic_load_u64(&measurement) >= THRESHOLD);
}

This is free of the race condition above because QUEUE_WAIT() is a macro that checks the condition AFTER queueing the current thread to be waken up.

When to use?

QUEUE_WAIT() is a macro and might come with some additional code size cost due to inlining. If you're not synchronizing with an ISR and care very much about code size then go for condition variables, otherwise there is no reason not to use the wait queue.

Can't I just use a mutex?

You can abuse a mutex by locking it in a loop in the thread context and unlocking from the ISR context. But this will only work for a single waiter and makes improper use of the mutex semantics.

Correctness

Below is a state-diagram of a thread calling QUEUE_WAIT(). I'll argue about its correctness and assume that my code implements it. Note: for simplicty I dropped the early exit optimization from the state diagram as it's trivially clear that it doesn't affect correctness.
Untitled Diagram drawio

events
  • cond true - user's the condition expression evaluated true
  • cond false - user's the condition expression evaluated false
  • wake up - another thread/ISR called queue_wake(), which woke up this thread
  • block - user's condition expression blocks on some other locking primitive (including some other wait queue)
  • unblock - the locking primitive the user's condition expression blocks on got signaled
state sub-states

Each state is a combination of two sub-states: the queuing status and the thread state. A thread is only then enqueued in the wait-queue if it the state is labeled enqueued. Regarding the scheduler states:

  • RUNNABLE - on the scheduler's runqueue (e.g. running or pending)
  • WAITING - blocked, waiting on this wait queue
  • BLOCKED - blocked on some other locking primitive during the user condition expression evaluation, including some other wait queue
transitions

We assume transitions between states to be atomic.

  • enqueued RUNNABLE - the initial state

    • cond true -> EXIT: the condition evaluated true. The thread is then de-queued and QUEUE_WAIT() exits.
    • cond false -> enqueued WAITING: the condition evaluated false. The thread sleeps on the wait queue.
    • wake up -> RUNNABLE: the wait queue got signaled and removed current thread from the wait queue, but the condition expression is still being evaluated.
    • block -> enqueued BLOCKED: the condition expression blocks on some other locking primitive
  • RUNNABLE

    • cond true -> EXIT: the condition evaluated true. QUEUE_WAIT() exits
    • cond false -> enqueued RUNNABLE: the condition evaluated false. Since the thread is not enqueued anymore, it has to queue and do another condition check.
    • block -> BLOCKED: the condition expression blocks on some other locking primitve
  • BLOCKED

    • unblock -> RUNNABLE: the locking primitve is signaled and the thread can continue evaluating the condition expression
  • enqueued BLOCKED

    • unblock -> enqueued RUNNABLE: the locking primitive is signaled and the thread can continue evaluating the condition expression. The thread is still enqueued.
    • wake up -> BLOCKED: the wait queue is signaled. The thread is removed from the wait queue, but it may not wake up yet, as the thread is blocked on something else.
  • enqueued WAITING

    • wake up-> enqueued RUNNABLE: the queue is signaled. We don't know the state of the condition, so we have to queue again and run the condition check again.

Code size

COND_WAIT() is a macro and comes with the expected costs associated with inlining. I haven't run any code-size test comparing it with the condition variable, but I expect it to require more. But not much more, and here's why:

Inlining QUEUE_WAIT(&wq, atomic_load_u64(&cond_val) > THRESHOLD); looks like this:

  do {
    if (atomic_load_u64(&cond_val) > THRESHOLD) {
      break;
    };

    wait_queue_entry_t me;
    _prepare_to_wait(&wq, &me);
    while (!(atomic_load_u64(&cond_val) > THRESHOLD)) {
      _maybe_yield_and_enqueue(&wq, &me);
    }
    _wait_dequeue(&wq, &me);
  } while (0);

A typical condition variable counterpart looks as follows:

mutex_lock(&cond_lock);
while (!(cond_val > THRESHOLD)) {
      cond_wait(&cond, &cond_lock);
}
mutex_unlock(&cond_lock);

Apart from the early exit optimization (which can be turned off with a compile flag), it looks pretty much the same. I don't have any hard numbers but I think the benefits (leaner code, IRQ compatibility, lower dead-lock risk) outweigh the code size penalty s.t. it make sense to use it in all cases where a condition variable is suitable.

Future work

A timed variant of QUEUE_WAIT() will be a nice addition at some point. I've already scketched out a solution but it's not trivial and adds complexity to the already not-so-simple implementation.

Testing procedure

There are two unit tests: tests/core/wait_queue and tests/core/wait_queue_optimized. The former is the extensive one, but it disables the early exit optimization in order to better test some edge cases. I think that's sensible, given that the early exit is not really part of the wait queue logic. But, for the sake of completeness, the second test does enable the optimization. It's just a symlink to the sources in test/core/wait_queue, but with some parts of the test disabled.

I've run the unit tests on the following boards:

  • native
  • nucleo-f401re
  • samd20-xpro

I've copied the board blacklist form the condition variable tests, but tbh. it seems a bit restrictive. I tried to compile for bluepill-stm32f030c8 (which is part of the blacklist) and it worked, but I know the CI builds differently. Should I try removing some boards from the blacklist? And if yes, which?

@github-actions github-actions bot added Area: doc Area: Documentation Area: tests Area: tests and testing framework Area: core Area: RIOT kernel. Handle PRs marked with this with care! labels Jan 8, 2025
@derMihai derMihai force-pushed the mir/wait_queue_head branch 2 times, most recently from a8091c5 to 4b66ba8 Compare January 8, 2025 19:32
@derMihai derMihai force-pushed the mir/wait_queue_head branch from 4b66ba8 to ca2557c Compare January 8, 2025 19:41
@derMihai derMihai marked this pull request as ready for review January 8, 2025 20:12
@benpicco benpicco requested a review from maribu January 9, 2025 14:16
@maribu
Copy link
Member

maribu commented Jan 9, 2025

Just to be sure I understand the motivation: This does cater the same use case as cond, but is faster, correct?

If so, is there a concrete use case where cond actually is too slow?

To be honest, we current already have issues with the mechanisms for thread synchronization and inter-thread communication to work reliably due to premature optimizations (see e.g. #20878). I'm not a fan of adding to that complexity without a very compelling reason. I'd much rather have effort spent on fixing and cleaning up the existing ones.

@derMihai
Copy link
Contributor Author

derMihai commented Jan 9, 2025

Speed is not the scope of this, please read the motivation. RIOT currently doesn't have a way for signaling conditions from ISR, or at least not a way for doing so that covers all the requirements. TLDR:

  • condition variables will almost always lead to some race condition
  • unlocking a mutex from ISR won't work for multiple waiters
  • thread flags are global and shouldn't be used by applications/drivers etc.

With the wait queue you can write context-agnostic code and it's simpler to use.

@maribu
Copy link
Member

maribu commented Jan 9, 2025

What is the issue you have with thread_flags? Why can't unlock more than one mutex from ISR? Why do you need to signal more than one waiter from ISR in the first place?

@derMihai
Copy link
Contributor Author

derMihai commented Jan 9, 2025

Thread flags are fine for building core stuff with them because they are fast and simple. But they are global. If drivers, apllications, libraries and so on use them extensively, they will conflict at some point. With wait queues, just as with condition variables, you can have as many as you want.

You may signal multiple mutexes but you can't have multiple threads waiting on a single mutex, because it will wake just one of them. Mutexes just aren't meant for condition signaling. And as of why multiple threads: think about a simple buffer, with multiple consumers waiting for an ISR to fill it.

@derMihai
Copy link
Contributor Author

derMihai commented Jan 9, 2025

Also, with thread flags, you have to know which thread to wake up in the first place. So you need to manually queue them in some way, which is what the wait queue does for you.

@maribu
Copy link
Member

maribu commented Jan 9, 2025

If drivers, apllications, libraries and so on use them extensively, they will conflict at some point.

There are right now only 2 thread flags in THREAD_FLAG_PREDEFINED_MASK reserved for RIOT. All other thread flags can be used for drivers and apps.

Note that flag BIT3 on thread A may have a completely different meaning as BIT3 for thread B; the flags do not need to be unique across different flags.

You may signal multiple mutexes but you can't have multiple threads waiting on a single mutex, because it will wake just one of them.

Yeah, unlock one mutex per thread to wake up.

And as of why multiple threads: think about a simple buffer, with multiple consumers waiting for an ISR to fill it.

Why would there multiple consumers on a single buffer? RIOT has not SMP support. You would waste CPU cycles on context switching and RAM for stacks by splitting processing of a single buffer into multiple threads.

Even if there really would be the need for wake up multiple threads from ISR (which I'm not convinced of), unlocking multiple mutexes would work fine. After all, the number of threads running with RIOT is quite limited, so there is no need for a mechanism that can efficiently wake up an unbounded amount of threads.

@derMihai
Copy link
Contributor Author

derMihai commented Jan 9, 2025

You can have a library, a driver, and user application using BIT3, each of them meaning something different. You then signal BIT3 meaning the one from the application, while a the thread is waiting for the same BIT3, but in the driver, meaning something completely different. Pray and hope that the driver is hardened against spurious wakeups.

I agree with you with the buffer example, in embedded context you will probably have one consumer thread. But generally, I noticed that threads are spawned way more often than they should be (simply for the reason that, if you're not syncing a lot, it's so much easier than using an e.g. event queue). And then you want these threads to sync on some event.

And in both the thread flag and mutex case, if you have multiple threads you have to register them somehow, manually. If a thread is waiting for a flag, you have to know which. If you have multiple mutexes from multiple threads, you have to know them.

And, just to play devil's advocate: you can actually use one mutex with multiple threads, it's just not intuitive and hacky (and still doesn't cover all cases), which leads to my main point: even if one can improvize with existing mechanisms, it's always some aspect that is overseen, lots of boilerplate, and opportunity for bugs. The main scope of the wait queue is to have one powerful mechanism to cover all these use cases and make code simpler, safer and more portable across contexes (ISR/thread).

maribu
maribu previously approved these changes Jan 9, 2025
Copy link
Member

@maribu maribu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NACK.

Let's not add thread synchronization mechanisms without a tangible use case. Thread synchronization mechanisms and inter thread communication mechanisms in particular and core in general are crucial to RIOT, so they really need to be well maintained. Adding code and complexity here without a tangible use case is not a good idea. In addition RIOT already has so many synchronization mechanisms that new users can easily be overwhelmed by it and often choose a poor fit for their task. Adding a mechanism that has no tangible use case to the mix will only add to that problem.

This PR also has a number of issues:

  1. It uses a custom linked list scheme, rather than thread_add_to_list() and container_of(), as the other mechanisms do
  2. It breaks OpenOCD's RIOT support (e.g. info threads). Adding a new thread state would require coordination with OpenOCD's source code and adding versioning to RIOT's threading data structures
  3. It probably breaks assumptions in the FreeRTOS threading compatibility layer used in the ESP port to run the binary blob WiFI driver

But to avoid confusion: The main issue is the lack of a tangible use case here.

@maribu maribu dismissed their stale review January 9, 2025 23:10

Wrong button

Copy link
Member

@maribu maribu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now the right button, see above for the reasoning.

@derMihai
Copy link
Contributor Author

Let's not make simplistic assumptions about usage, because this is what nudges users to "creative" and often wrong workarounds. We shouldn't punish users for sub-optimal or even bad design decisions, if what they're doing is actually corrrect. If that were the case, RIOT shouldn't have recursive mutexes and even condition variables in the first place. Recursive mutexes are the red flags of concurret design, but people still use them because they make life easier; and condition variables solve solely the problem that you dismiss: multiple waiters. Otherwise one can just use a mutex and call it a day.

I do understand that this is an embedded operating system, and especially regarding core stuff, we can't just bloat it on a whim. But the wait queue is such a simple thing (it's really just condition variables "in reverse") and it covers pretty much every signaling scenario. I think it's worth considering.

In any case, I'm not going to provide any other use cases for you to take apart and provide work-arounds, as the whole point of this is to avoid hacks. You started this disussion with a dismissive stance and without trying to understand what this is about. I'm not willing to commit to any more convincing work. It should also be noted that I didn't invent the wait queues, they're in the Linux kernel for decades and there's plenty of resources about them that back my arguments.

However, you also noted some techincal difficulties that admittedly weren't known to me (OpenOCD, Free RTOS). I'm willing to look into these and try to solve them but I need to know If I get some support here, otherwise I'm just going to close this. While I'd really love to have and use wait queues for my daily work, I have meanwhile enough understanding of RIOT's core mechanics to just keep hacking my way around.

Regarding reliablity and correctness of core stuff: we should resume work on #20940 and #20941.

@kaspar030
Copy link
Contributor

kaspar030 commented Jan 10, 2025

I stumbled over the difficulties of cond_var myself, so I see some need here.
Thanks @derMihai for trying to tackle this!

... but I also think we shouldn't introduce another IPC mechanismn. (The implementation here wouldn't pass our guidelines, at least b/c QUEUE_WAIT() is a macro, and there's a free coded linked list.)

I think the argument that the thread flags namespace is global is overrated. Unless the exact amount of wakeups is required, even multi-use of the same thread flag doesn't harm much, and that can be avoided.

Anyhow, I played with waking up multiple waiters using thread flags, here's some code: https://github.com/kaspar030/RIOT/tree/wq_thread_flags

edit fixed link

@derMihai
Copy link
Contributor Author

derMihai commented Jan 10, 2025

Thanks @kaspar030 for the productive feedback. The reason I coded the linked list myself is to keep the exclusive wake functionality of the condition variable, which requires entries to be inserted at specific location based on thread priority. But here I would argue similarly as @maribu, which is: I don't actually see the need for that exclusive waking, I just wanted to keep features. Exclusive waiting is no guarantee of exclusivity, but rather a performance optimization. Without that requirement in the way, I can use the standard linked list manipulation code.

I really like your variant based on thread flags, and I even considered doing something similar at some point. I went for a new sync mechanism at the end because it's better optimized, allowed me to keep the exclusive wake feature, and mainly because I didn't know of the additional technical difficulties that @maribu pointed out (OpenOCD, FreeRTOS), which definitely are a deal breaker. If that's fine for you, I'd like to continue on top of your sketch. That would also allow the wait queue to be included optionally, as a module.

One more question: what is wrong with wrapping the waiter loop boiler plate within a macro?

@maribu
Copy link
Member

maribu commented Jan 10, 2025

One more question: what is wrong with wrapping the waiter loop boiler plate within a macro?

We generally try to avoid the use of the preprocessor if the same can be done with C code instead.

If that's fine for you, I'd like to continue on top of your sketch.

Implementing the API proposed here on top of existing mechanisms is very welcome.

@kaspar030
Copy link
Contributor

If that's fine for you, I'd like to continue on top of your sketch.

Sure! While writing it, I thought this is so close to thread_flags, its name should reflect that. Maybe thread_flag_group_t/attach()/detach()/...? Expect some bike shedding there.
And, maybe instead hard-coding 0x1 as thread flags, just add the mask as parameter to the wake() function?

One more question: what is wrong with wrapping the waiter loop boiler plate within a macro?

In general, style, coding conventions (somewhat), and just look at how clean the RIOT code base is without all those macros. Also, macros are often not understood by tooling.

In this specific case, nothing specific, maybe that the condition is evaluated twice. And that it just might not be needed.

@derMihai
Copy link
Contributor Author

We generally try to avoid the use of the preprocessor if the same can be done with C code instead.

I see. The issue here is that the user will always have to write the same boilerplate, something like:

enqueue()
while (!(condition_expression)):
   wait()

dequeue()

I wanted all this to be something like queue_wait(condition_expression). This cannot be wrapped by a function because condition expression is well... an expression and we can't pass that to a function.

What I can do instead is to provide this functions als public API, and a macro as a shorthand.

Sure! While writing it, I thought this is so close to thread_flags, its name should reflect that. Maybe thread_flag_group_t/attach()/detach()/...? Expect some bike shedding there.

@kaspar030 I have no preference on where to put the code and how to name the API, I'll follow your suggestion.

[..] maybe that the condition is evaluated twice. And that it just might not be needed.

Probably not needed, which is also why I made that early check a macro that can be turned off. I'll leave it out.

@kaspar030
Copy link
Contributor

I see. The issue here is that the user will always have to write the same boilerplate, something like:

Yeah, that's life for C programmers.

Do you have an actual use case? Is that more like one-off functions as in wait_for_critical_value(){} in the PR description? Or do you have multiple classic waiting-in-a-loop threads?

I'd much prefer if this would turn into a helper around thread flags (basically just adding the attach()/detach() functionality), as it would then compose with msg, mbox, events and timers (timeouts).

@derMihai
Copy link
Contributor Author

derMihai commented Jan 14, 2025

The use-case is generic condition signaling, just as with the condition variable (i.e. the condition is actually defined by the user).

Is that more like one-off functions as in wait_for_critical_value(){} in the PR description? Or do you have multiple classic waiting-in-a-loop threads?

wait_for_critical_value(), although minimal, is the classic wait-in-a-loop example, or do you mean here something different?

In any case, I'm currently stuck because of #20941, this makes waking up impossible:

  • if we're waking threads one by one with interrupts enabled, we might wake up the lower prio first because the threads in the list are not sorted by priority. And I can't sort them because that would mean custom LL code...
  • if we're waking up all of them at once (with interrupts disabled), thread_flags_set() will call thread_yield_higher() and crash

Since we decided there's no value sorting the threads, I will stall this until some progress on #20941 is made.

EDIT:
Actually, I see no way around sorting that list. Even if, say, we fix treaad_yield_higher() by making in interrupt-enabling, we still have the issue of waking up the wrong threads. The only other fix would be to defer the context switch until interrupts are re-enabled, but I don't see how and what this implies.

We need something like list_add_ordered(list, entry, odered_fn).

@kaspar030
Copy link
Contributor

  • if we're waking up all of them at once (with interrupts disabled), thread_flags_set() will call thread_yield_higher() and crash

hasn't this been sorted out? as in, on arm thread_yield_higher() just pends PendSV?

@derMihai
Copy link
Contributor Author

derMihai commented Jan 24, 2025

  • if we're waking up all of them at once (with interrupts disabled), thread_flags_set() will call thread_yield_higher() and crash

hasn't this been sorted out? as in, on arm thread_yield_higher() just pends PendSV?

Not that I know, there aren't any changes in thread_yield_higher() (at least for ARM) in years and when I opened #20941, calling that with interrupts disabled caused an exception once the interrupts were re-enabled. Is this fixed somewhere else?

EDIT: I might be confused about why it was crashing. I it is not thread_yield_higher() that was causing this, but rather core locking logic expecting it to suspend (i.e. mutex_lock()), which is not happening with IRQ disabled.

In that sense, yes, I can just iterate the list and call thread_yield_higher() with interrupts disabled.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Area: core Area: RIOT kernel. Handle PRs marked with this with care! Area: doc Area: Documentation Area: tests Area: tests and testing framework
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants