Fix most remaining task cancellation race conditions #382
Conversation
lib/pal/WorkerThread.cpp (Outdated)
// Default value: 500ms - sufficient for upload scheduler/batcher task to finish.
// Alternate value: UINT64_MAX - for infinite wait until the task is completed.
#ifndef TASK_CANCEL_TIME_MS
#define TASK_CANCEL_TIME_MS 500
I'm not wild about value macros like this, nor are the Core Guidelines. If this macro is intended as a reconfigurable point, then at the very least, let's put the value of this macro into a constexpr std::chrono::milliseconds variable.
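For illustration, a minimal sketch of that suggestion (the constant name kTaskCancelTime is made up here; the macro stays as the build-time override point):

```cpp
#include <chrono>

// Build-time override point remains available for consumers that need it.
#ifndef TASK_CANCEL_TIME_MS
#define TASK_CANCEL_TIME_MS 500
#endif

// Typed constant the implementation would use instead of the raw macro value.
constexpr std::chrono::milliseconds kTaskCancelTime { TASK_CANCEL_TIME_MS };
```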
(*item)();
self->m_itemInProgress = nullptr;
{
std::lock_guard<std::timed_mutex> lock(self->m_execution_mutex);
Please clarify why we need this m_execution_mutex scoped lock, since it appears the LOCKGUARD(self->m_lock); above already takes care of the m_itemInProgress assignment.
As mentioned above, it exists solely as a way to get signaled about the completion of an item in progress (in a cleaner manner than manually sleeping the thread). Extraordinarily rarely, it may also prevent the item from starting to execute if a cancel request comes in between setting the item in progress and that item actually executing.
This will impact performance... Since I can't immediately think of a scenario where we need to rapidly process many incoming work items just yet, I am OK with that. But if you can log a future work item to reduce the number of locks in threadFunc, that'd be wonderful. I believe we can eventually consolidate the two locks into one again for better performance with fewer locks.
{
PAL::sleep(TASK_CANCEL_WAIT_MS);
waitTime -= TASK_CANCEL_WAIT_MS;
m_itemInProgress = nullptr;
I'm trying to understand why m_itemInProgress (which is an attribute of the task scheduler) is being assigned by an arbitrary thread trying to cancel a task.
OK, I guess your intention is to cover the window between the assignment of m_itemInProgress outside the lock and the moment the lock is taken:
cpp_client_telemetry/lib/pal/WorkerThread.cpp, line 218 in e921995:
self->m_itemInProgress = item.get();
and
cpp_client_telemetry/lib/pal/WorkerThread.cpp, line 235 in e921995:
std::lock_guard<std::timed_mutex> lock(self->m_execution_mutex);
?
Yes. Since there's a very small chance this happens before the work starts to execute, I cleared it so that it wouldn't start executing later.
if (self->m_itemInProgress != nullptr) {
LOG_TRACE("%10llu Execute item=%p type=%s\n", wakeupCount, item.get(), item.get()->TypeName.c_str() );
(*item)();
self->m_itemInProgress = nullptr;
I think that's the only legitimate place where m_itemInProgress should be assigned nullptr on item completion: here, inside the worker thread's processing function. The other place above, in Cancel, looks suspicious. Only the worker thread function really knows what item is in progress.
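To make the handshake being debated concrete, here is a purely illustrative, self-contained model (invented names, and an atomic pointer instead of the SDK's own locking; this is not the PR's actual WorkerThread code) of how the worker publishes and re-checks the in-progress item, and how Cancel clears it and then waits on the execution mutex:

```cpp
#include <atomic>
#include <chrono>
#include <functional>
#include <mutex>

// Simplified stand-in for the worker thread's bookkeeping.
struct MiniScheduler
{
    std::timed_mutex executionMutex;   // held for the full duration of a task body
    std::atomic<const std::function<void()>*> itemInProgress { nullptr };

    // Worker-thread side: publish the item, then re-check it under the
    // execution mutex, so a cancel that raced in just before the lock was
    // taken still stops the task from starting.
    void runOne(const std::function<void()>& item)
    {
        itemInProgress = &item;
        std::lock_guard<std::timed_mutex> exec(executionMutex);
        if (itemInProgress != nullptr)
        {
            item();                    // execute the task
            itemInProgress = nullptr;  // completion becomes observable to cancel()
        }
    }

    // Cancel side: clear the pointer so a not-yet-started item is skipped,
    // then wait (bounded) on the execution mutex in case it already started.
    bool cancel(const std::function<void()>& item, std::chrono::milliseconds waitTime)
    {
        const std::function<void()>* expected = &item;
        itemInProgress.compare_exchange_strong(expected, nullptr);  // prevents a late start
        if (executionMutex.try_lock_for(waitTime))  // acquired => no task body is running
        {
            executionMutex.unlock();
            return true;               // item finished or never started
        }
        return false;                  // still running after the timeout
    }
};
```

The atomic pointer here is only to keep the sample free of data races; the real code guards m_itemInProgress with its own locks, which is exactly the window the comments above are probing.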
@@ -134,12 +130,19 @@ namespace PAL_NS_BEGIN {
    /* Can't recursively wait on completion of our own thread */
    if (m_hThread.get_id() != std::this_thread::get_id())
    {
-       while ((waitTime > TASK_CANCEL_WAIT_MS) && (m_itemInProgress == item))
+       if (waitTime > 0 && m_execution_mutex.try_lock_for(std::chrono::milliseconds(waitTime)))
nit: there is a minor issue with try_lock_for - the standard does not specify whether it yields/sleeps (near-0% CPU) or busy-waits (100% of one core). Granted, since the chance of a 'collision' (cancelling the very task that is executing at that same moment) is low, this may work fine. I'm a bit afraid of a scenario where, on a dual-core system, we run some longer-running/slow task, and in another thread we busy-wait for that task to cancel/complete.
For example:
- Flush to disk running in the worker
- another thread trying to reschedule/cancel that existing Flush, waiting on it indefinitely rather than with a timeout, because Flush is the sort of task we may need to wait for reliably with an infinite timeout to avoid data loss.
This may cause a stutter, plus elevated power consumption, with one core busy with CPU+IO (the thread writing to sqlite) while another core busy-waits (the thread waiting for the first task's cancellation). The chance of this happening appears to be rather low, though.
The previous sleep (yield) logic was guaranteed to wait in 50ms increments: wait briefly, peek, and sleep again. Granted, the code could have run slower from an execution perspective. The new logic is less predictable (since try_lock_for is not fully specified with respect to yield vs. busy-wait) and may result in elevated power consumption.
Reference that suggests that you can't really tell whether try_lock_for yields or busy-waits:
https://stackoverflow.com/questions/31750036/does-stdunique-locktry-lock-until-busy-wait
That resource claims the behavior is partially OS-dependent, which is what I was going for here. My concern with the previous implementation is that it gave no hint to the OS that we were waiting on the execution of a specific item. Later in that same conversation, it's mentioned that this is probably the best platform-agnostic approach. I'd hope this doesn't end up blocking progress of the item being waited on, because one of the properties of a properly implemented mutex is that it makes progress.
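One possible middle ground, sketched below purely as an illustration (this is not code from the PR and was not proposed by either reviewer): keep the execution mutex as the completion signal, but probe it without blocking and sleep in small slices between probes, which preserves the old guaranteed-yield behaviour while keeping the total wait bounded.

```cpp
#include <chrono>
#include <mutex>
#include <thread>

// Hypothetical helper: wait up to 'total' for 'exec' to become free,
// sleeping in 50 ms slices between non-blocking probes.
inline bool waitForExecution(std::timed_mutex& exec, std::chrono::milliseconds total)
{
    constexpr std::chrono::milliseconds slice { 50 };
    for (;;)
    {
        if (exec.try_lock())      // free => no task body currently holds it
        {
            exec.unlock();
            return true;
        }
        if (total <= std::chrono::milliseconds::zero())
        {
            return false;         // timed out; the task is still running
        }
        const auto step = (total < slice) ? total : slice;
        std::this_thread::sleep_for(step);   // guaranteed to yield, near-0% CPU
        total -= step;
    }
}
```

The trade-off is the same one discussed above: probing plus sleeping bounds CPU usage but can over-wait by up to one slice, whereas a single try_lock_for may return sooner but leaves the yield-vs-busy-wait behaviour to the implementation.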
Approved with suggestions:
- please log a work item (Enhancement) to reduce the number of locks in the worker thread function.
- the m_itemInProgress = nullptr assignment in Cancel looks suspicious.
Fix most remaining task cancellation race conditions
Improves worker thread task cancellation by:
It is "mostly fixed" because there's still the possibility of multiple uploads being scheduled at one time, if the first one is currently executing when a new one tries to cancel/override it. Conceivably, we could be unlucky enough to shut down while that first event is still executing -- resulting in the second upload task getting quickly canceled, and the first continuing to run (and cause access violations) while the SDK is shutting down.
It's possible to fully fix by waiting for tasks to finish during all pause calls, but this might negatively impact SDK performance, and doesn't seem to be a likely scenario (even on many millions of devices)
I feel like there are still significant improvements that could be made here, especially as they might relate to #271, but those would require a larger refactor.
I think this should fix the rest of the access-violation-on-shutdown issues we're observing in Edge, which were originally partially fixed by #266. This is related to issue #258.