Race condition in workqueue can lead to work items being lost #51872

Closed
palchak-google opened this issue Nov 1, 2022 · 7 comments · Fixed by #54677
Labels: area: Kernel · bug · priority: high · Release Blocker
Milestone: v3.3.0

Describe the bug

TL;DR

The unschedule_locked function in the work queue implementation fails to confirm that a previously scheduled item has been fully canceled (it does not check the return code of z_abort_timeout). This leads to a race condition that, in the worst case, results in work items getting "lost": some submitted work items never execute and instead remain in the submitted state indefinitely.

Detailed Description

When rescheduling a k_work_delayable item, the k_work_reschedule_for_queue function relies on the unschedule_locked function to clear any existing schedule for the target item.

zephyr/kernel/work.c, lines 983 to 1005 in 836f582:

int k_work_reschedule_for_queue(struct k_work_q *queue,
                                struct k_work_delayable *dwork,
                                k_timeout_t delay)
{
    __ASSERT_NO_MSG(dwork != NULL);

    SYS_PORT_TRACING_OBJ_FUNC_ENTER(k_work, reschedule_for_queue, queue, dwork, delay);

    int ret = 0;
    k_spinlock_key_t key = k_spin_lock(&lock);

    /* Remove any active scheduling. */
    (void)unschedule_locked(dwork);

    /* Schedule the work item with the new parameters. */
    ret = schedule_for_queue_locked(&queue, dwork, delay);

    k_spin_unlock(&lock, key);

    SYS_PORT_TRACING_OBJ_FUNC_EXIT(k_work, reschedule_for_queue, queue, dwork, delay, ret);

    return ret;
}

The unschedule_locked function in turn relies on the z_abort_timeout function to actually remove any existing timeout.

zephyr/kernel/work.c, lines 915 to 927 in 836f582:

static inline bool unschedule_locked(struct k_work_delayable *dwork)
{
    bool ret = false;
    struct k_work *work = &dwork->work;

    /* If scheduled, try to cancel. */
    if (flag_test_and_clear(&work->flags, K_WORK_DELAYED_BIT)) {
        z_abort_timeout(&dwork->timeout);
        ret = true;
    }

    return ret;
}

However, unschedule_locked incorrectly ignores the return value of z_abort_timeout, leading to a race condition.

zephyr/kernel/timeout.c, lines 150 to 162 in 836f582:

int z_abort_timeout(struct _timeout *to)
{
    int ret = -EINVAL;

    LOCKED(&timeout_lock) {
        if (sys_dnode_is_linked(&to->node)) {
            remove_timeout(to);
            ret = 0;
        }
    }

    return ret;
}

The z_abort_timeout function will return -EINVAL if the specified timeout is not contained in the list of pending timeouts. This condition occurs in exactly two situations:

  1. The specified timeout does not exist or is no longer pending
  2. The specified timeout is still pending, but has been removed from the list in preparation for calling the timeout callback

Situation #2 is possible because the sys_clock_announce function releases its spinlock before calling the timeout callback:

zephyr/kernel/timeout.c, lines 267 to 271 in 836f582:

    remove_timeout(t);
    k_spin_unlock(&timeout_lock, key);
    t->fn(t);
    key = k_spin_lock(&timeout_lock);

The race condition occurs when the sys_clock_announce function is interrupted (or preempted) at any point after remove_timeout returns on line 267 but before the callback grabs whatever lock it needs in the call on line 270. During this span of time the expired timeout is both pending (callback has not executed) and not in the list. We'll call this span of time the "window of opportunity".

Going back to the k_work_reschedule_for_queue function, consider the sequence of events if k_work_reschedule_for_queue is called during the "window of opportunity" for a previously scheduled item, and the new schedule has a non-zero delay:

  1. The callback for the item is just about to run, but hasn't yet
  2. k_work_reschedule_for_queue believes that it has canceled the previously scheduled callback, but the cancellation wasn't successful.
  3. k_work_reschedule_for_queue proceeds to schedule the item again by calling schedule_for_queue_locked
  4. Given that a non-zero delay was specified, schedule_for_queue_locked adds a new timeout and sets the K_WORK_DELAYED_BIT flag

Now let control return to the sys_clock_announce function:

  1. The sys_clock_announce function calls the callback associated with the expired timeout. For delayed work items this callback function is work_timeout.
  2. The work_timeout function runs, confirms the K_WORK_DELAYED_BIT flag is set, and immediately submits the work item to the queue (see the sketch after this list).
  3. The work item is left in both the work queue (at the tail of the list) and in the timeout queue.
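
For reference, here is a simplified sketch of the timeout callback for delayable work, reconstructed from the behavior described above; the actual work_timeout() in kernel/work.c may differ in detail, but the key point is that it only consults K_WORK_DELAYED_BIT and never checks whether the item is already sitting in the queue:

/* Simplified sketch of the delayable-work timeout callback, reconstructed
 * from the description above; the real work_timeout() may differ in detail.
 */
static void work_timeout(struct _timeout *to)
{
    struct k_work_delayable *dw = CONTAINER_OF(to, struct k_work_delayable, timeout);
    struct k_work *work = &dw->work;
    k_spinlock_key_t key = k_spin_lock(&lock);

    /* Only the DELAYED flag is consulted: if it is set (as it is after the
     * racing reschedule), the item is submitted to its target queue with no
     * check for whether it is already pending there.
     */
    if (flag_test_and_clear(&work->flags, K_WORK_DELAYED_BIT)) {
        struct k_work_q *queue = dw->queue;

        (void)submit_to_queue_locked(work, &queue);
    }

    k_spin_unlock(&lock, key);
}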

At this point, the work item is in an inconsistent state, and one of two possible faulty outcomes can occur:

Outcome A) The work item is executed for the first time before the second timeout expires. In this scenario, the work item will then execute a second time when the second timeout expires, resulting in duplicate execution of the work.

Outcome B) The second timeout expires before the work queue executes the work item for the first time. This can happen if the work queue thread is low priority or the timeout was very short.

Outcome B is the most problematic. When Outcome B occurs, the work_timeout function runs a second time and submits the work to the queue a second time. However, the work was already pending (i.e., already in the queue), and work_timeout does not account for this: it blindly moves the work to the end of the queue, so every work item that was added to the queue after the racing item's first submission is "lost".

Here's a visualization of what happens in Outcome B. Denote the work item affected by the race as (X), and assume the work queue (stored as a forward list) contains the following before work_timeout executes for the second time:

(V) -> (W) -> (X) -> (Y) -> (Z)

After work_timeout has finished executing for the second time the contents of the queue will be:

(V) -> (W) ----------------> (X)
                 (Y) -> (Z) --^

Work item (X) has been submitted to the end of the queue. However, because (X) was previously in the middle of the queue, the work items (Y) and (Z) get "lost". Items (Y) and (Z) will never get executed, as they are no longer reachable.
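
To make the loss mechanism concrete, here is a small standalone illustration (a hypothetical minimal forward list, not Zephyr's sys_slist implementation) of what happens when a node that is already linked is blindly appended again:

/* Standalone illustration of the failure mode: appending a node that is
 * already linked strands every node that followed it.
 */
#include <stdio.h>

struct node {
    const char *name;
    struct node *next;
};

static void append(struct node **head, struct node **tail, struct node *n)
{
    /* Blind append, like resubmitting already-queued work: no check for
     * whether n is already in the list.
     */
    n->next = NULL;
    if (*tail != NULL) {
        (*tail)->next = n;
    } else {
        *head = n;
    }
    *tail = n;
}

static void dump(struct node *head)
{
    for (struct node *n = head; n != NULL; n = n->next) {
        printf("(%s) ", n->name);
    }
    printf("\n");
}

int main(void)
{
    struct node v = {"V"}, w = {"W"}, x = {"X"}, y = {"Y"}, z = {"Z"};
    struct node *head = NULL, *tail = NULL;

    append(&head, &tail, &v);
    append(&head, &tail, &w);
    append(&head, &tail, &x);
    append(&head, &tail, &y);
    append(&head, &tail, &z);
    dump(head);    /* prints: (V) (W) (X) (Y) (Z) */

    /* Second work_timeout run: (X) is appended again while still linked. */
    append(&head, &tail, &x);
    dump(head);    /* prints: (V) (W) (X) -- (Y) and (Z) are now unreachable */

    return 0;
}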

Expected behavior

  • Return codes should be checked
  • The work queue implementation should be interrupt-safe (i.e. no such "window of opportunity" for race conditions)
palchak-google added the bug label Nov 1, 2022
andyross (Contributor) commented Nov 1, 2022

I buy that. Stated (IMHO) more simply: if z_abort_timeout() fails for a known-submitted object, then the timeout will be called in finite time[1]. It's up to the user of the timeout subsystem to make sure they wait for this to happen[2] before doing anything else with the object.

[1] It's worth pointing out that the announce call happens in interrupt context, so the only way to exercise this race is either from another CPU in SMP contexts or from within a nested interrupt that preempted the timer ISR.

[2] Which, if the context really is a nested interrupt, is actually a very difficult app bug to untangle. You can't wait in an ISR in general, and even if you tried (e.g. by spinning in an SMP system) you've preempted the code you're trying to wait for!
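
For thread-context callers (not the nested-interrupt case described in [2]), the waiting can be done with the public API. Below is a hedged sketch, using hypothetical application objects my_queue/my_dwork, of cancelling synchronously before rescheduling; it sidesteps the caller-side race and is not the kernel fix itself:

/* Caller-side "cancel and wait, then reschedule" pattern.  Thread context
 * only; k_work_cancel_delayable_sync() must not be called from an ISR.
 * my_queue, my_dwork and restart_my_work are hypothetical names.
 */
#include <zephyr/kernel.h>

static struct k_work_q my_queue;            /* application work queue */
static struct k_work_delayable my_dwork;    /* application delayable item */

static void restart_my_work(k_timeout_t delay)
{
    struct k_work_sync sync;

    /* Block until any in-flight instance of the item has been cancelled or
     * has finished running, so no stale timeout/handler is left racing with
     * the new schedule.
     */
    (void)k_work_cancel_delayable_sync(&my_dwork, &sync);

    /* Now schedule the item fresh with the new delay. */
    (void)k_work_schedule_for_queue(&my_queue, &my_dwork, delay);
}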

andyross (Contributor) commented Nov 1, 2022

In this case, though, it's probably OK to just correctly propagate the error out to the application code that tried to cancel the item. The whole idea is inherently racy (i.e. "oops you didn't get there in time and it ran" is always a possibility even if there's no race in the subsystem code).

nashif (Member) commented Dec 14, 2022

@andyross can you please take a look at this issue?

stephanosio (Member) commented

@andyross Are you able to look into this issue for v3.3? If not, could someone else take a look? @peter-mitsis @dcpleung

andyross (Contributor) commented Feb 9, 2023

Sorry, I swear I saw a PR come by on this? If not I can come up with something this week.

stephanosio added the Release Blocker label Feb 9, 2023
stephanosio (Member) commented

I have marked this issue as a release blocker for now since it is a quite serious kernel bug.

andyross (Contributor) commented Feb 9, 2023

Fix is up, see link.

Unfortunately I'm not able to generate a test case to actually exercise the race in question. The only spot I can see that is sensitive to this (all other code just ignores the return value of unschedule_locked()) is in k_work_flush_delayable(), which in theory could end up submitting an in-progress callback. That race is real for sure, I just haven't been able to exercise it with a test.

But the text above implies deeper problems with general operation, and on close inspection I'm not sure I agree those are present? Is there a test case? The other spots where we're doing an unschedule operation are all idempotent; they just want to know it's "not currently queued".

Basically my considered take is that this is fixed and that most of the concerns above are false positives. Please review and tell me if I've missed something.

stephanosio added this to the v3.3.0 milestone Feb 11, 2023
stephanosio pushed a commit that referenced this issue Feb 11, 2023
The call to unschedule_locked() would return true ("successfully
unscheduled") even in the case where the underlying z_abort_timeout()
failed (because the callback was already unpended and
in-progress/complete/about-to-be-run, remember that timeout callbacks
are unsynchronized), leading to state bugs and races against the
callback behavior.

Correctly detect that case and propagate the error to the caller.

Fixes #51872

Signed-off-by: Andy Ross <andyross@google.com>
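
For readers following along, here is a minimal sketch of the approach the commit message describes, i.e. propagating the z_abort_timeout() result instead of discarding it; the merged patch in #54677 may differ in detail:

/* Minimal sketch of the fix described above: report failure when the
 * timeout had already been unpended, so callers such as
 * k_work_flush_delayable() can react.  The merged patch may differ.
 */
static inline bool unschedule_locked(struct k_work_delayable *dwork)
{
    bool ret = false;
    struct k_work *work = &dwork->work;

    /* If scheduled, try to cancel; if z_abort_timeout() fails, the callback
     * has already been unpended and is about to run (or is running), so the
     * unschedule did not really succeed.
     */
    if (flag_test_and_clear(&work->flags, K_WORK_DELAYED_BIT)) {
        ret = (z_abort_timeout(&dwork->timeout) == 0);
    }

    return ret;
}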
asemjonovs pushed a commit to asemjonovs/zephyr that referenced this issue Feb 13, 2023