Collection of fixes for various shutdown scenario race conditions #266
Conversation
- possible race condition in WinInet on request cancellation
- possible race condition in TPM on task cancellation
Add a stress test that inspects the PAL ref-count. Rework the safety checks in LogManagerBase.hpp. Add a multiple-log-manager stress test with concurrent UploadNow.
@@ -112,7 +119,8 @@ namespace ARIASDK_NS_BEGIN {
/// </summary>
bool cancelUploadTask()
{
    bool result = m_scheduledUpload.Cancel();
    uint64_t cancelWaitTimeMs = (m_scheduledUploadAborted) ? UINT64_MAX : 0;
Should there be a check on m_isUploadScheduled first? Either here or in handleStop?
I believe we do perform an atomic check outside. DeferredCallbackHandle itself also covers both cases:
- no uploading Task ptr associated with that handle, Task ptr is nullptr -> noop
- uploading task ptr is invalid (not found in the running task list) -> noop
I don't see any check for m_isUploadScheduled on any of the calls to cancelUploadTask, which seems a little weird even if it's going to get handled properly by the callback handle.
What I meant to say is that the Cancel(...) routine may safely be called even if the DeferredCallbackHandle m_task is nullptr. It's a no-op if no upload is scheduled: when m_task is nullptr, no cancellation is attempted. It's equivalent to having another check outside, except that the method itself performs the state check inside. The cancellation routine resets m_task to nullptr after task cancellation here:
m_task = nullptr;
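To illustrate the point, a minimal sketch of that behavior (this is not the SDK's actual DeferredCallbackHandle; the dispatcher interface and every member other than m_task are assumptions):

```cpp
#include <cstdint>
#include <mutex>

struct Task { /* deferred work item */ };

struct ITaskDispatcher
{
    // waitTimeMs = 0: request cancellation without waiting;
    // a larger value: wait up to that long for a running task to finish.
    virtual bool Cancel(Task* task, uint64_t waitTimeMs) = 0;
    virtual ~ITaskDispatcher() = default;
};

class DeferredCallbackHandleSketch
{
public:
    bool Cancel(uint64_t waitTimeMs = 0)
    {
        std::lock_guard<std::mutex> lock(m_lock);
        if (m_task == nullptr || m_dispatcher == nullptr)
            return true; // no upload scheduled: the call is a no-op
        bool result = m_dispatcher->Cancel(m_task, waitTimeMs);
        m_task = nullptr; // handle no longer references the cancelled task
        return result;
    }

private:
    std::mutex       m_lock;
    Task*            m_task       = nullptr;
    ITaskDispatcher* m_dispatcher = nullptr;
};
```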
Let me send something regarding assigning the task dispatcher pointer. This is existing code from before my PR, but I see one possible issue. In general, the runtime calls to cancellation are all under one mutex.
Please check the latest. There are two checks in the latest version (sketched below):
- upload task cancellation is done under a mutex
- even if the task handle is invalid, Cancel now checks for that in a thread-safe manner
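Roughly sketched below; only m_scheduledUpload, m_scheduledUploadAborted, and the UINT64_MAX wait visible in the diff hunk come from the PR, the other names are assumptions:

```cpp
#include <atomic>
#include <cstdint>
#include <mutex>

struct ScheduledUploadHandle
{
    // Thread-safe: a no-op if no task is currently associated with the handle.
    bool Cancel(uint64_t /*waitTimeMs*/) { return true; }
};

class UploadSchedulerSketch
{
public:
    bool cancelUploadTask()
    {
        // Check 1: the whole cancellation runs under the mutex.
        std::lock_guard<std::mutex> lock(m_scheduledUploadMutex);
        // On abort (shutdown) wait indefinitely for a running upload task;
        // otherwise only cancel a pending one without waiting.
        uint64_t cancelWaitTimeMs = m_scheduledUploadAborted ? UINT64_MAX : 0;
        // Check 2: Cancel() itself validates the task handle in a thread-safe way.
        bool result = m_scheduledUpload.Cancel(cancelWaitTimeMs);
        m_isUploadScheduled = false;
        return result;
    }

private:
    std::mutex            m_scheduledUploadMutex;
    ScheduledUploadHandle m_scheduledUpload;
    std::atomic<bool>     m_scheduledUploadAborted {false};
    std::atomic<bool>     m_isUploadScheduled {false};
};
```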
I'm trying to understand your question about the leak... How it works today:
There are two possible flows:
Subsequently, there are two places in the code where this is all managed:
As to why we don't leak: there is an infinite wait inside CancelAllRequests that ensures all requests are destroyed. So far we have not hit the stuck-request case in PROD, but we saw it in Functional Test runs (although very infrequently, in a stress test). With my fix we never hit the issue of a functional test being stuck, so I believe we properly destroy all the request objects - they get removed from their parent map. The stress test no longer trips any failures. I can increase the number of reruns for the stress test that used to exhibit this issue.
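For readers following along, a rough illustration of the "infinite wait" idea (all names here are assumed, not the actual WinInet HTTP client code): cancellation does not return until the completion callbacks have removed every request from the parent map.

```cpp
#include <condition_variable>
#include <map>
#include <memory>
#include <mutex>

struct RequestSketch
{
    void Cancel() { /* close the handle; the completion callback follows asynchronously */ }
};

class RequestPoolSketch
{
public:
    void CancelAllRequests()
    {
        std::unique_lock<std::mutex> lock(m_lock);
        for (auto& kv : m_requests)
            kv.second->Cancel();
        // Wait without a timeout until the completion callbacks have erased
        // every request, i.e. all request objects have been destroyed.
        m_allDone.wait(lock, [this]() { return m_requests.empty(); });
    }

    // Called from each request's completion callback.
    void OnRequestComplete(int id)
    {
        std::lock_guard<std::mutex> lock(m_lock);
        m_requests.erase(id);
        if (m_requests.empty())
            m_allDone.notify_all();
    }

private:
    std::mutex m_lock;
    std::condition_variable m_allDone;
    std::map<int, std::unique_ptr<RequestSketch>> m_requests;
};
```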
@maxgolov thanks for the detailed response about requests leaking (not sure why GitHub doesn't let there be a comment thread there). My confusion was actually a lot simpler - I'd been looking at the PR for a while and somehow missed the … . I remembered seeing some other places in the SDK where …
….com/Microsoft/cpp_client_telemetry into maxgolov/shutdown_fixes
Even though we don't currently use tid anywhere, the intent is that it has to be static - sequentially incrementing across all tasks running under different log managers.
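A minimal sketch of that intent, assuming a plain std::atomic counter (the real implementation may differ):

```cpp
#include <atomic>
#include <cstdint>

/// Returns a sequentially incrementing, process-wide unique Task ID.
static uint64_t GetNewTid()
{
    // One shared counter for the whole process, not one per log manager.
    static std::atomic<uint64_t> s_nextTid {0};
    return s_nextTid.fetch_add(1);
}
```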
Reworked this to address the following issues:
Please review. On multiple reruns with the new rigorous concurrency stress test, including intentional attempts to break UploadNow vs. FlushAndTeardown, plus the multiple-log-manager stress, all reruns passed 100% with no hangs.
{
    if ((m_itemInProgress == item)||(item==nullptr))
Shouldn't m_itemInProgress be protected by a mutex, since it gets set on the worker thread but gets checked in Cancel?
My understanding of the logic is that it should be something like the sketch below:
- Acquire the lock and check whether a task is in progress
- If the task is in progress, release the lock and wait for task completion; re-acquire the lock to re-check the task's progress
- If the task isn't in progress, keep holding the lock while clearing it from the queue
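A sketch of that flow, with assumed types and names rather than the actual WorkerThread code:

```cpp
#include <chrono>
#include <list>
#include <mutex>
#include <thread>

struct Item { /* work item */ };

class TaskQueueSketch
{
public:
    bool Cancel(Item* item)
    {
        std::unique_lock<std::mutex> lock(m_lock);
        if (m_itemInProgress == item)
        {
            // Task is in progress: release the lock, wait, then re-check under the lock.
            while (m_itemInProgress == item)
            {
                lock.unlock();
                std::this_thread::sleep_for(std::chrono::milliseconds(50));
                lock.lock();
            }
            return true;
        }
        // Task isn't in progress: keep the lock while removing it from the queue.
        m_queue.remove(item);
        return true;
    }

private:
    std::mutex       m_lock;
    Item*            m_itemInProgress = nullptr;
    std::list<Item*> m_queue;
};
```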
I'll write a separate note on why it's fine :) I was again thinking that you may be right, as it looks racy, but in fact it's not. I'd like to avoid unnecessary locking here since it might impact perf within the worker thread queue. Deletion of items is done atomically.
I think you're right: there was a corner case where we did not wait if the item was already popped for execution but m_itemInProgress hadn't been assigned yet. I made this part of the code thread-safe in the next commit.
lib/pal/WorkerThread.cpp
/* Can't recursively wait on completion of our own thread */
if (m_hThread.get_id() != std::this_thread::get_id())
{
    LOCKGUARD(m_lock);
Why is this lockguard here? It makes me nervous to hold a lock while sleeping. Won't this also prevent the worker thread from getting the shutdown item off the queue until this completes? (Probably not a big deal, but a little scary in case that code changes.)
At first I thought you were correct. But on second thought, I think it's probably a good thing, because it totally eliminates my other concern about another task being assigned the same raw ptr:
Task is already executing in another thread here:
self->m_itemInProgress = item.get();
(*item)();
self->m_itemInProgress = nullptr;
That section of code does not hold the lock. When we're done, we assign nullptr to the item in progress. Cool. That unblocks the waiter.
Since our idle-waiter holds the lock, we prevent the worker thread from scheduling a new item, because in order to schedule another item it has to acquire the lock. This will cause up to a 50ms hiccup in the scheduler, but since cancellation itself is a very infrequent operation and the worker runs in a background thread, it won't affect performance in any way.
Since I hope we never cancel the shutdown item :D :D :D ... we should be fine. Yes, it would prevent the shutdown item from getting scheduled if we're waiting on the cancellation of a task that precedes the shutdown item.
Let me know if it makes sense. If it doesn't, please tell me where I'm mistaken. I'm honestly willing to adjust this if it doesn't seem right to you.
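For reference, a sketch of the worker-loop side being discussed (names assumed; m_itemInProgress is left non-atomic here to mirror the quoted snippet). Because popping an item requires m_lock, a Cancel() that keeps the lock during its idle-wait prevents this loop from starting a new item - possibly one reusing the same raw pointer - until the wait is over.

```cpp
#include <memory>
#include <mutex>
#include <queue>

struct Item { void operator()() { /* do the work */ } };

class WorkerThreadSketch
{
public:
    void threadFunc()
    {
        for (;;)
        {
            std::unique_ptr<Item> item;
            {
                // Popping (i.e. scheduling) an item requires the lock.
                std::lock_guard<std::mutex> lock(m_lock);
                if (m_queue.empty())
                    return; // shutdown handling elided
                item = std::move(m_queue.front());
                m_queue.pop();
            }
            // Executed without holding the lock:
            m_itemInProgress = item.get();
            (*item)();
            m_itemInProgress = nullptr; // unblocks a waiter polling in Cancel()
        }
    }

private:
    std::mutex m_lock;
    std::queue<std::unique_ptr<Item>> m_queue;
    Item* m_itemInProgress = nullptr;
};
```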
I reworked this slightly, but the comment remains the same: I kept the lock for the duration of the idle-wait to prevent another item from being scheduled. That fully addresses my previous reliability concern - what if a new task gets allocated with the same raw pointer. Now we won't run into that condition at all. This may result in a hiccup of up to 50ms for worker thread scheduling, but only when the currently executing work item is being cancelled. In other cases we do not block. My reasoning here:
- cancellation is not a frequent operation
- cancellation of a currently executing item is an even less frequent operation
- background worker tasks generally get scheduled at 1s intervals (the minimum granularity of the upload scheduler), and we do not provide any realtime or latency guarantees there; even the UploadNow API itself is merely a hint, not a guarantee

Thus, this hypothetical 50ms 'excessive' wait on task cancellation does not impact perf in any measurable way.
Shouldn't m_itemInProgress be set within the context of the previous lockguard, so we can guarantee that if the item isn't in the queue then it was either completed or is in progress? The current logic leaves a small gap between popping the item from the queue, doing a little bit of logic outside the mutex, setting the item in progress, and waiting on other threads to see that memory update (if it happens to be on a different core and in cache).

Refers to: lib/pal/WorkerThread.cpp:216 in 38b22f1.
Isn't it bad to set … ?

Refers to: lib/tpm/TransmissionPolicyManager.cpp:142 in 38b22f1.
It's fine:
I don't think this is an issue. But to make it appear less concerning, I moved this under m_scheduledUploadMutex for peace of mind:
I think we might want to make it non-atomic, but I'd rather keep it that way. Basically …
Thanks, looks good to me!
// Ownership semantics: send(...) method self-destroys *this* upon
// receiving WinInet callback. There must be absolutely no methods
// that attempt to use the object after triggering send on it.
// Send operation on request may be issued no more than once.
"may" seems confusing here
I meant to say that it could be sent at most once. I will try to commit reworded docs in a follow-up PR, something that just reformats this file with no code change.
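A small illustrative sketch of that ownership rule (names and threading here are assumptions, not the actual WinInet request implementation):

```cpp
#include <thread>

class SelfOwningRequestSketch
{
public:
    // send() may be issued at most once; after this call the caller must not
    // touch the object again, because it deletes itself in the callback.
    void send()
    {
        std::thread([this]() { onCallback(this); }).detach();
    }

private:
    static void onCallback(SelfOwningRequestSketch* self)
    {
        // ... handle response or cancellation ...
        delete self; // self-destruction: ownership ends with the callback
    }
};

// Usage: auto* req = new SelfOwningRequestSketch(); req->send(); // never use req after this
```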
/// <summary>
/// Atomic counter that returns sequentially incrementing unique Task ID
/// </summary>
static uint64_t GetNewTid()
Not sure if this is an established name; tid is normally used to represent thread ID.
Since this is an internal detail right now, I will refactor it in a follow-up PR that implements an API surface for numbered and maybe named tasks.
Looks great in general. I've left some minor suggestions.
Co-Authored-By: Reiley Yang <reyang@microsoft.com>
Collection of fixes for various shutdown scenario race conditions
Fix for shutdown-related scenarios:
#161 #183 #254 #258 #228
I also had to fix issue #195, because FlushAndTeardown wasn't working well in the new stress test that I added. It wasn't the lack of ref-counting; it was LogManagerImpl calling into shutdown twice, which made the ref-count ineffective.
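A minimal sketch of the double-shutdown guard idea (assumed names, not the actual LogManagerImpl code): teardown becomes idempotent, so a repeated call is a harmless no-op and the ref-count stays meaningful.

```cpp
#include <atomic>

class LogManagerImplSketch
{
public:
    void FlushAndTeardown()
    {
        // Only the first caller performs the actual teardown; a repeated call
        // (the problem described above) becomes a no-op.
        if (m_isDead.exchange(true))
            return;
        // ... cancel the upload task, stop the worker thread, flush storage ...
    }

private:
    std::atomic<bool> m_isDead {false};
};
```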
Fixes:
TPM task cancellation fix:
WinInet request clean-up: