Race condition in workqueue can lead to work items being lost #51872
I buy that. Stated (IMHO) more simply: if z_abort_timeout() fails for a known-submitted object, then the timeout will be called in finite time[1]. It's up to the user of the timeout subsystem to make sure they wait for this to happen[2] before doing anything else with the object. [1] It's worth pointing out that that announce call happens in interrupt context, so the only way to exercise this race is either from another CPU in SMP contexts or from within a nested interrupt that preempted the timer ISR. [2] Which, if the context really is a nested interrupt, is actually a very difficult app bug to untangle. You can't wait in an ISR in general, and even if you tried (e.g. by spinning in a SMP system) you preempted the code you're trying to wait for!
In this case, though, it's probably OK to just correctly propagate the error out to the application code that tried to cancel the item. The whole idea is inherently racy (i.e. "oops you didn't get there in time and it ran" is always a possibility even if there's no race in the subsystem code).
@andyross can you please take a look at this issue?
@andyross Are you able to look into this issue for v3.3? If not, could someone else take a look? @peter-mitsis @dcpleung
Sorry, I swear I saw a PR come by on this? If not I can come up with something this week.
I have marked this issue as a release blocker for now since it is a quite serious kernel bug.
Fix up, see link. Unfortunately I'm not able to generate a test case to actually exercise the race in question. The only spot I can see that is sensitive to this (all other code just ignores the return value of unschedule_locked()) is in k_work_flush_delayable(), which in theory could end up submitting an in-progress callback. That race is real for sure, I just haven't successfully exercised it with a test. But the text above implies deeper problems with general operation, and on close inspection I'm not sure I agree those are present? Is there a test case? The other spots where we're doing an unschedule operation are all idempotent; they just want to know it's "not currently queued". Basically my considered take is that this is fixed and that most of the concerns above are false positives. Please review and tell me if I've missed something.
The call to unschedule_locked() would return true ("successfully unscheduled") even in the case where the underlying z_abort_timeout() failed (because the callback was already unpended and in-progress/complete/about-to-be-run, remember that timeout callbacks are unsynchronized), leading to state bugs and races against the callback behavior. Correctly detect that case and propagate the error to the caller. Fixes #51872 Signed-off-by: Andy Ross <andyross@google.com>
Describe the bug

TL;DR

The `unschedule_locked` function in the work queue implementation fails to confirm that a previously scheduled item is fully canceled (it fails to check the return code of `z_abort_timeout`). This leads to a race condition that, in the worst case, results in work items getting "lost". That is, some submitted work items never execute and instead just remain in the submitted state.

Detailed Description
When rescheduling a `k_work_delayable` item, the `k_work_reschedule_for_queue` function relies on the `unschedule_locked` function to clear any existing schedule for the target item.

zephyr/kernel/work.c, lines 983 to 1005 in 836f582
The `unschedule_locked` function in turn relies on the `z_abort_timeout` function to actually remove any existing timeout.

zephyr/kernel/work.c, lines 915 to 927 in 836f582
However, the `unschedule_locked` function incorrectly ignores the return value of `z_abort_timeout`, leading to a race condition.

zephyr/kernel/timeout.c, lines 150 to 162 in 836f582
The `z_abort_timeout` function will return `-EINVAL` if the specified timeout is not contained in the list of pending timeouts. This condition occurs in exactly two situations:

1. The timeout was never scheduled (or was already successfully aborted), so there is nothing to remove.
2. The timeout has already expired and been removed from the list by `sys_clock_announce`, but its callback has not yet run to completion.

Situation #2 is possible because the `sys_clock_announce` function releases its spinlock before calling the timeout callback:

zephyr/kernel/timeout.c, lines 267 to 271 in 836f582
The race condition occurs when the `sys_clock_announce` function is interrupted (or preempted) at any point after `remove_timeout` returns on line 267 but before the callback grabs whatever lock it needs in the call on line 270. During this span of time the expired timeout is both pending (its callback has not executed) and not in the list. We'll call this span of time the "window of opportunity".

Going back to the `k_work_reschedule_for_queue` function, consider the sequence of events if `k_work_reschedule_for_queue` is called during the "window of opportunity" for a previously scheduled item, and the new schedule has a non-zero delay:

1. `k_work_reschedule_for_queue` believes that it has canceled the previously scheduled callback, but the cancellation wasn't successful.
2. `k_work_reschedule_for_queue` proceeds to schedule the item again by calling `schedule_for_queue_locked`.
3. `schedule_for_queue_locked` adds a new timeout and sets the `K_WORK_DELAYED_BIT` flag.

Now let control return to the `sys_clock_announce` function:

4. The `sys_clock_announce` function calls the callback associated with the expired timeout. For delayed work items this callback function is `work_timeout`.
5. The `work_timeout` function runs, confirms the `K_WORK_DELAYED_BIT` flag is set, and immediately submits the work item to the queue.

At this point, the work item is in an inconsistent state, and one of two possible faulty outcomes can occur:
Outcome A) The work item is executed for the first time before the second timeout expires. In this scenario, the work item will then execute a second time when the second timeout expires, resulting in duplicate execution of the work.
Outcome B) The second timeout expires before the work queue executes the work item for the first time. This can happen if the work queue thread is low priority or the timeout was very short.
Outcome B is the most problematic. When Outcome B occurs, the `work_timeout` function runs a second time and submits the work to the queue a second time. However, the work was already pending (i.e., already in the queue), and the `work_timeout` function does not account for this. The `work_timeout` function blindly moves the work to the end of the queue, causing all of the work items that had been added to the queue after the work item was added the first time to be "lost".

Here's a visualization of what happens in Outcome B. Let's assume that the work item with the race condition is denoted as `(X)`, and that this is the contents of the work queue (it's stored as a forward list) before `work_timeout` executes for the second time:

```
... -> (X) -> (Y) -> (Z)
```

After `work_timeout` has finished executing for the second time, the contents of the queue will be:

```
... -> (X)
```
`(X)` has been submitted to the end of the queue. However, because `(X)` was previously in the middle of the queue, the work items `(Y)` and `(Z)` get "lost". Items `(Y)` and `(Z)` will never get executed, as they are no longer reachable.

Expected behavior