
Collection of fixes for various shutdown scenario race conditions #266

Merged: 19 commits merged into master from maxgolov/shutdown_fixes on Mar 10, 2020

Conversation

@maxgolov (Contributor) commented Feb 3, 2020

Fix for shutdown-related scenarios:
#161 #183 #254 #258 #228

I also had to fix issue #195 because FlushAndTeardown wasn't working well in the new stress test that I added. It wasn't the lack of ref-count; it was LogManagerImpl calling into shutdown twice, which made the ref-count ineffective.

Fixes:

  • possible race condition in WinInet on request cancellation (hang)
  • possible race condition in TPM on task cancellation (AV when two LogManager instances are running)
  • FlushAndTeardown wasn't doing a good job: it decremented the ref-count twice, once in FlushAndTeardown called explicitly and again in FlushAndTeardown called from the LogManagerImpl destructor. That led to issues in the new stress test, so I had to fix it. We had a corresponding issue in our backlog (PAL does not work well with multiple log managers #195).

TPM task cancellation fix:

  • if the current task being cancelled is pending, make sure we wait for its completion. Before this fix we ran into a scenario where the task was still running in a worker thread while its LogManager had already been destroyed, leading to an AV when the system is running with multiple LogManagers.
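The wait-on-completion pattern described above can be sketched as follows. This is a minimal illustration, not the SDK's actual WorkerThread code; `TaskGuard`, `run`, and `waitIfRunning` are hypothetical names.

```cpp
#include <condition_variable>
#include <functional>
#include <mutex>

// Hypothetical sketch: before destroying the owner of a task (e.g. a
// LogManager), the canceller blocks until the in-flight task completes,
// instead of returning while the worker thread may still be running it.
class TaskGuard {
public:
    // Worker thread: mark the task as running, execute it, notify waiters.
    void run(const std::function<void()>& task) {
        {
            std::lock_guard<std::mutex> lk(m_lock);
            m_running = true;
        }
        task();
        {
            std::lock_guard<std::mutex> lk(m_lock);
            m_running = false;
        }
        m_done.notify_all();
    }

    // Cancelling thread: wait until any in-flight task has finished, so the
    // task's owner can be destroyed safely afterwards.
    void waitIfRunning() {
        std::unique_lock<std::mutex> lk(m_lock);
        m_done.wait(lk, [this] { return !m_running; });
    }

private:
    std::mutex m_lock;
    std::condition_variable m_done;
    bool m_running = false;
};
```

After `waitIfRunning()` returns, no task is executing, so tearing down the owning object cannot race with the worker thread.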

WinInet request clean-up:

  • if we immediately cancel a request that is being sent in another thread, before it is fully created and sent, we were not getting a cancellation callback, so requests were "getting stuck". That manifested as a hang on SDK FlushAndTeardown.

- possible race condition in WinInet on request cancellation
- possible race condition in TPM on task cancellation
@maxgolov maxgolov requested review from bliptec, 0ooops and trevorlacey-msft and removed request for 0ooops February 3, 2020 22:38
Review thread on lib/pal/WorkerThread.cpp (outdated, resolved)
Add stress test that inspects PAL ref-count.
Rework the safety checks in LogManagerBase.hpp.
Add multiple log managers stress-test with concurrent UploadNow.
Review thread on lib/pal/WorkerThread.cpp (outdated, resolved)
Review thread on lib/pal/WorkerThread.cpp (outdated, resolved)
@bliptec (Contributor) commented Feb 7, 2020

    // delete WinInetRequestWrapper

Does this mean that each HTTP request leaks?


Refers to: lib/http/HttpClient_WinInet.cpp:472 in d8e4a89. [](commit_id = d8e4a89, deletion_comment = False)

@@ -112,7 +119,8 @@ namespace ARIASDK_NS_BEGIN {
/// </summary>
bool cancelUploadTask()
{
bool result = m_scheduledUpload.Cancel();
uint64_t cancelWaitTimeMs = (m_scheduledUploadAborted) ? UINT64_MAX : 0;
Contributor:
Should there be a check on m_isUploadScheduled first? Either here or in handleStop?

Contributor Author:
I believe we do perform an atomic check outside. In addition, DeferredCallbackHandle itself covers both cases:

  • no uploading Task ptr is associated with that handle (Task ptr is nullptr) -> no-op
  • the uploading task ptr is invalid (not found in the running task list) -> no-op

Contributor:
I don't see any check for m_isUploadScheduled on any of the calls to cancelUploadTask, which seems a little weird even if it's going to get handled properly by the callback handle.

Contributor Author:
What I meant to say is that the Cancel(...) routine may safely be called even if the DeferredCallbackHandle m_task is nullptr. It's a no-op when no upload is scheduled: if m_task is nullptr, no cancellation is attempted. It's equivalent to having another check outside, except that the method performs the state check internally. The cancellation routine resets m_task to nullptr after task cancellation.
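The contract described above can be sketched as follows. This is a hypothetical illustration of the no-op-when-null behavior, not the SDK's actual DeferredCallbackHandle; the names `DeferredHandle`, `set`, and `cancel` are assumptions.

```cpp
#include <mutex>

struct Task { bool cancelled = false; };

// Hypothetical sketch: cancel() performs its own state check under a lock,
// so callers may invoke it even when no upload is scheduled.
class DeferredHandle {
public:
    void set(Task* t) {
        std::lock_guard<std::mutex> lk(m_lock);
        m_task = t;
    }

    // Safe to call at any time. A null m_task means no upload is scheduled,
    // so cancellation is a no-op; on success, m_task is reset to nullptr.
    bool cancel() {
        std::lock_guard<std::mutex> lk(m_lock);
        if (m_task == nullptr)
            return false;        // nothing scheduled -> no-op
        m_task->cancelled = true;
        m_task = nullptr;        // reset after cancellation
        return true;
    }

private:
    std::mutex m_lock;
    Task* m_task = nullptr;
};
```

Because the check lives inside `cancel()`, callers do not need their own "is anything scheduled?" test before calling it.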

Contributor Author:
Let me send something regarding assigning the task dispatcher ptr. This is existing code that predates my PR, but I see one possible issue. In general, the calls to cancellation at runtime are all under one mutex.

Contributor Author:
Please check the latest. There are two checks in the latest version:

  • upload task cancellation is done under mutex
  • even if task handle is invalid, Cancel now checks for that in a thread-safe manner

@maxgolov (Contributor, Author) commented Feb 8, 2020

delete WinInetRequestWrapper

I'm trying to understand your question about the leak...

How it works today:

  • the WinInet client maintains a thread-safe map of request ID (string) -> request object
  • the WinInet client's erase(std::string const& id) is invoked only from the onRequestComplete callback, which typically happens upon successful invocation of that callback by a WinInet worker thread

There are two possible flows:

  • we could not kick off the upload task for some reason early on
  • or we kicked off a task in the WinInet worker thread pool that is doing the actual HTTP POST

Subsequently, there are two places of code where this is all managed:

  • the first section is protected by m_connectingMutex; it explicitly calls into onRequestComplete to destroy the request if it has not been sent
  • the second section is not protected by m_connectingMutex (although maybe we should do that to simplify the code); it registers the callback and expects a pending response from WinInet on that callback. The contract with WinInet is that if it returns IO_PENDING, it must invoke the callback at least once. A piece of logic in the callback filters subsequent IO_PENDING notifications from the end state, which is either OK or ERROR, so there's only one path that removes the callback and destroys the request. I also added an assert for debug builds in case that callback gets invoked twice.

As for why we don't leak: there's an infinite wait inside CancelAllRequests that ensures all requests are destroyed. So far you have not been hitting the stuck-request case in PROD, but we saw it in Functional Test runs (although very infrequently, in a stress test). With my fix we never hit the issue of a functional test being stuck, so I believe we properly destroy all the request objects: they get removed from their parent map. The stress test is no longer tripping any failures. I can increase the number of reruns for the stress test that used to exhibit this issue.
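The ownership model described above can be sketched as follows. This is a hypothetical simplification, not the SDK's HttpClient_WinInet code; `RequestMap`, `add`, and `cancelAllRequests` are assumed names, and the real Request wraps WinInet handles.

```cpp
#include <condition_variable>
#include <map>
#include <memory>
#include <mutex>
#include <string>

struct Request { /* wraps a WinInet request handle in the real SDK */ };

// Hypothetical sketch: a thread-safe map of request ID -> request object.
// erase() runs from the completion callback and destroys the request;
// cancelAllRequests() blocks until every request has been destroyed,
// which is why nothing leaks as long as every request completes.
class RequestMap {
public:
    void add(const std::string& id) {
        std::lock_guard<std::mutex> lk(m_lock);
        m_requests[id] = std::make_unique<Request>();
    }

    // Invoked from onRequestComplete: remove (and destroy) the request.
    void erase(const std::string& id) {
        {
            std::lock_guard<std::mutex> lk(m_lock);
            m_requests.erase(id);
        }
        m_empty.notify_all();
    }

    // Wait until all pending requests are completed or cancelled.
    void cancelAllRequests() {
        std::unique_lock<std::mutex> lk(m_lock);
        m_empty.wait(lk, [this] { return m_requests.empty(); });
    }

    size_t size() {
        std::lock_guard<std::mutex> lk(m_lock);
        return m_requests.size();
    }

private:
    std::mutex m_lock;
    std::condition_variable m_empty;
    std::map<std::string, std::unique_ptr<Request>> m_requests;
};
```

The infinite wait in `cancelAllRequests()` is what made the missed-cancellation-callback bug visible: a stuck request kept the map non-empty and the teardown hung.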

@bliptec (Contributor) commented Feb 10, 2020

@maxgolov thanks for the detailed response about requests leaking (not sure why github doesn't let there be a comment thread there). My confusion was actually a lot simpler - I'd been looking at the PR for a while and somehow missed the delete req below that line 🙃

I remembered seeing some other places in the SDK where deletes were commented out, so my brain jumped to it being a commented-out line of code instead of just a comment for the line below.

Even though we don't currently use tid anywhere, the intent is that it has to be static, sequentially incrementing across all tasks running under different log managers.
@maxgolov (Contributor, Author) commented Mar 6, 2020

Reworked this to address the following issues:

  • turned off clang formatting to minimize the diff
  • implemented a shorter poll interval (50ms) for task cancellation
  • set a time-boxed cancellation for the uploader batching task at 500ms, with an option to reconfigure at compile time using a macro
  • reused the HTTP client-level mutex for a lockguard that surrounds request creation, before the request is sent down to WinInet for async processing. This is a short-lived scoped lock that prevents modification of the request map; the mutex also outlives the request object, so there's no need to manually lock/unlock, and the scoped lock looks cleaner.
  • added comments that explain in greater detail the intricate logic that interops with the WinInet callback function
  • added PAL friend class tests
  • added a rough-in of an atomic sequence number tracker for Task ID, but we can't use it yet. A major rework is needed at the WorkerThread level to provide thread-safe handling of the work item queue using Task ID. This is out of scope for this PR; I logged a separate enhancement issue for it.

Please review. On multiple reruns with the new rigorous concurrency stress test, including intentional attempts to break UploadNow vs. FlushAndTeardown and the multiple-log-manager stress test, all reruns passed 100%, with no hangs.

{
if ((m_itemInProgress == item)||(item==nullptr))
Contributor:
Re: m_itemInProgress

Shouldn't m_itemInProgress be protected by a mutex, since it gets set on the worker thread, but gets checked in Cancel?

Contributor:
My understanding of the logic is that it should be something like:

  • Acquire the lock and check whether the task is in progress
  • If the task is in progress, release the lock and wait for task completion; re-acquire the lock to re-check the task's progress
  • If the task isn't in progress, hold the lock while clearing it from the queue
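The flow outlined in this suggestion could be sketched as follows. This is a hypothetical simplification of the pattern, not the SDK's WorkerThread.cpp; `WorkerQueue` and its members are assumed names, and the 50ms interval matches the poll interval mentioned elsewhere in this PR.

```cpp
#include <chrono>
#include <list>
#include <mutex>
#include <thread>

// Hypothetical sketch: cancel() checks under the lock whether the item is
// currently executing. If it is, the lock is released while waiting, then
// re-acquired to re-check; otherwise the item is removed from the queue
// while the lock is still held.
class WorkerQueue {
public:
    void push(int* item) {
        std::lock_guard<std::mutex> lk(m_lock);
        m_queue.push_back(item);
    }

    void cancel(int* item) {
        std::unique_lock<std::mutex> lk(m_lock);
        while (m_itemInProgress == item) {
            // Executing right now: drop the lock while waiting, re-check.
            lk.unlock();
            std::this_thread::sleep_for(std::chrono::milliseconds(50));
            lk.lock();
        }
        // Not in progress: safe to clear it from the queue under the lock.
        m_queue.remove(item);
    }

    size_t pending() {
        std::lock_guard<std::mutex> lk(m_lock);
        return m_queue.size();
    }

    int* m_itemInProgress = nullptr;  // set by the worker thread

private:
    std::mutex m_lock;
    std::list<int*> m_queue;
};
```

Note that the author's final version (discussed below in this thread) deliberately keeps the lock held during the wait instead of releasing it.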

Contributor Author:

I'll write a separate note on why it's fine :) I was again thinking that you may be right, since it looks racy, but in fact it's not. I'd like to avoid unnecessary locking here since it might impact perf within the worker thread queue. Deletion of items is done atomically.

Contributor Author:
I think you're right: there was a corner case where we did not wait if the item had already been popped for execution but m_itemInProgress hadn't been assigned yet. I made this part of the code thread-safe in the next commit.

/* Can't recursively wait on completion of our own thread */
if (m_hThread.get_id() != std::this_thread::get_id())
{
LOCKGUARD(m_lock);
@bliptec (Contributor), Mar 7, 2020:
Why is this lockguard here? It makes me nervous to hold a lock while sleeping. Won't this also prevent the worker thread from getting the shutdown item off the queue until this completes? (Probably not a big deal, but a little scary in case that code changes.)

Contributor Author:
At first I thought you were correct. But on second thought, I think it's probably a good thing, because it totally eliminates my other concern about another task being assigned the same raw ptr:

Task is already executing in another thread here:

    self->m_itemInProgress = item.get();
    (*item)();
    self->m_itemInProgress = nullptr;

That section of code does not hold the lock. When we're done, we assign nullptr to the item in progress. That unblocks the waiter.

Since our idle-waiter holds the lock, we prevent the worker thread from scheduling a new item, because in order to schedule another item it has to acquire the lock. This causes up to a 50ms hiccup for the scheduler, but since cancellation itself is a very infrequent operation and the worker runs in a background thread, it won't affect performance in any way.

Since I hope we never cancel the shutdown item :D :D :D ... we should be fine. Yes, it would prevent the shutdown item from getting scheduled if we're waiting on cancellation of a current task preceding the shutdown item.

Let me know if it makes sense. If it doesn't, please let me know if I'm mistaken. I'm honestly willing to adjust this if this doesn't seem right to you.

@maxgolov (Contributor, Author), Mar 9, 2020:
I reworked this slightly, but the comment remains the same: I kept the lock for the duration of the idle-wait, to prevent another item from being scheduled. That fully addresses my previous reliability concern about a new task getting allocated with the same raw ptr; now we won't run into that condition at all. This may result in a hiccup of up to 50ms for worker thread scheduling, but only when the work item currently executing is the one being cancelled. In other cases we do not block. My reasoning here:

  • cancellation is not a frequent operation
  • cancellation of a currently executing item is an even less frequent operation
  • background worker tasks are generally scheduled at 1s intervals (the minimum granularity interval for the upload scheduler), and we do not provide any realtime or latency guarantees there. Even the UploadNow API itself is merely a hint, not a guarantee.

Thus, this hypothetical 50ms 'excessive' wait on task cancellation does not impact perf in any measurable way.
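The hold-the-lock-while-waiting design described above could be sketched as follows. This is a hypothetical illustration, not the SDK's code; `Scheduler`, `tryScheduleNext`, and `cancelWait` are assumed names.

```cpp
#include <atomic>
#include <chrono>
#include <mutex>
#include <thread>

// Hypothetical sketch: the canceller holds the queue lock for the whole
// idle-wait, so the worker cannot schedule a new item (which could reuse
// the same raw pointer) until the cancellation wait completes.
class Scheduler {
public:
    // Worker: must acquire the lock to pick up the next item; it is blocked
    // for as long as a cancellation wait holds the lock.
    bool tryScheduleNext() {
        std::unique_lock<std::mutex> lk(m_lock, std::try_to_lock);
        return lk.owns_lock();
    }

    // Canceller: poll at 50ms intervals until the item is no longer the
    // one in progress, keeping the lock held across the entire wait.
    void cancelWait(void* item) {
        std::lock_guard<std::mutex> lk(m_lock);
        while (m_itemInProgress.load() == item) {
            std::this_thread::sleep_for(std::chrono::milliseconds(50));
        }
    }

    std::atomic<void*> m_itemInProgress{nullptr};  // set by the worker

private:
    std::mutex m_lock;
};
```

The trade-off is exactly as argued in the thread: scheduling can stall for up to one poll interval, but only while a currently executing item is being cancelled, which is rare.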

@bliptec (Contributor) commented Mar 7, 2020

            self->m_itemInProgress = item.get();

Shouldn't m_itemInProgress be set within the context of the previous lockguard, so we can guarantee that if the item isn't in the queue, then it was either completed or in progress? The current logic leaves a small gap between popping the item from the queue, doing a little bit of logic outside the mutex, setting the item in progress, and waiting for other threads to see that write (if it happens to be on a different core and in cache).


Refers to: lib/pal/WorkerThread.cpp:216 in 38b22f1. [](commit_id = 38b22f1, deletion_comment = False)

@bliptec (Contributor) commented Mar 7, 2020

    m_isUploadScheduled = false;    // Allow to schedule another uploadAsync

Isn't it bad to set m_isUploadScheduled to false before potentially calling cancelUploadTask, which would indiscriminately cancel any current upload task potentially set by the non-worker-thread?


Refers to: lib/tpm/TransmissionPolicyManager.cpp:142 in 38b22f1. [](commit_id = 38b22f1, deletion_comment = False)

@maxgolov (Contributor, Author) commented Mar 9, 2020

Isn't it bad to set m_isUploadScheduled to false before potentially calling cancelUploadTask

It's fine:

  • we set it to false because we are the task currently executing in the worker thread pool, and we immediately allow this task to be scheduled again.

  • the cancelUploadTask call below: if we hit a case where we need to cancel (shutdown, abort, or pause), we call Cancel(..) to invalidate the handle. Cancel has a check for recursive cancellation: if we are the thread being executed by the worker and attempt to wait on our own completion, we return right away.

I don't think this is an issue. But to make it appear less concerning, I moved this under m_scheduledUploadMutex for peace of mind:

            LOCKGUARD(m_scheduledUploadMutex);
            m_isUploadScheduled = false;  // Allow to schedule another uploadAsync
            if ((m_isPaused) || (m_scheduledUploadAborted))
            {
                LOG_TRACE("Paused or upload aborted: cancel pending upload task.");
                cancelUploadTask();  // If there is a pending upload task, kill it
                return;
            }

I think we might want to make it non-atomic, but I'd rather keep it this way. Basically,
!m_isUploadScheduled.exchange(true)
is now probably redundant, because everywhere it is now altered under m_scheduledUploadMutex. That was the piece of code that guaranteed we have at most one upload task. I don't want to change that, though, since some unit tests and mocks might be making less restrictive assumptions.
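The at-most-one-upload invariant that the atomic exchange provides can be sketched as follows. This is a minimal illustration of the idiom, assuming hypothetical free-function names (`scheduleUploadOnce`, `onUploadTaskStarted`) rather than the SDK's actual TransmissionPolicyManager methods.

```cpp
#include <atomic>

// Hypothetical sketch: the atomic exchange guarantees at most one scheduled
// upload task at a time, even when multiple threads race to schedule one.
std::atomic<bool> m_isUploadScheduled{false};

bool scheduleUploadOnce() {
    // exchange(true) returns the previous value: only the first caller
    // observes false and wins the right to schedule the upload task.
    return !m_isUploadScheduled.exchange(true);
}

void onUploadTaskStarted() {
    // The running task clears the flag, allowing the next uploadAsync
    // to be scheduled.
    m_isUploadScheduled = false;
}
```

Once every write to the flag happens under `m_scheduledUploadMutex`, the exchange is indeed redundant as a synchronization mechanism, but keeping it costs nothing and preserves the old guarantee for callers outside the lock.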

@bliptec (Contributor) left a review comment:
Thanks, looks good to me!

// Ownership semantics: send(...) method self-destroys *this* upon
// receiving WinInet callback. There must be absolutely no methods
// that attempt to use the object after triggering send on it.
// Send operation on request may be issued no more than once.
Member:
"may" seems confusing here

Contributor Author:
I meant to say that it could be sent at most once. I will try to commit reworded docs in a follow-up PR, something that just reformats this file with no code change.

/// <summary>
/// Atomic counter that returns sequentially incrementing unique Task ID
/// </summary>
static uint64_t GetNewTid()
Member:
Not sure if this is an established name; tid is normally used to represent a thread ID.

Contributor Author:
Since this is an internal detail right now, I will refactor it in a follow-up PR that implements an API surface for numbered and maybe named tasks.

@reyang (Member) left a review comment:
Looks great in general. I've left some minor suggestions.

maxgolov and others added 2 commits March 9, 2020 19:51
Co-Authored-By: Reiley Yang <reyang@microsoft.com>
Co-Authored-By: Reiley Yang <reyang@microsoft.com>
@maxgolov maxgolov merged commit 550cc6e into master Mar 10, 2020
@maxgolov maxgolov deleted the maxgolov/shutdown_fixes branch October 6, 2020 16:48
maxgolov added a commit that referenced this pull request Oct 19, 2020
Collection of fixes for various shutdown scenario race conditions