Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Metrics: Use dedicated thread instead of the thread pool for the defa… #1028

Merged
merged 5 commits into from
Dec 28, 2018

Conversation

macrogreg
Copy link
Contributor

This is in context of #1026.

Work in progress for now.

…ult aggregation cycle (code only, does not build not NetStandard).
@@ -139,11 +145,13 @@ private void Run()
DateTimeOffset now = DateTimeOffset.Now;
TimeSpan waitPeriod = GetNextCycleTargetTime(now) - now;

Task.Delay(waitPeriod).ConfigureAwait(continueOnCapturedContext: false).GetAwaiter().GetResult();
Thread.Sleep(waitPeriod);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you want a cancelable wait. One way is to create a CancellationToken, get its WaitHandle and call WaitOne with the TimeSpan. The CancellationToken takes the place of the shouldBeRunning bool.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why would we want a cancelable wait at this point?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For fast and graceful shutdown. Maybe I misunderstood the point of StopAsync, but I assume it's to be called at shutdown and the expectation is that shutdown is fast.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it. Right now we do not do another data dump on a cancelled cycle, but you are right: it would be a good improvement. On the other hand, dealing with WaitHandles here would mean dealing with Disposables. If we go there, I would prefer to do it as a separate change from addressing the issue at hand. Before the proposed change, this wait was not cancelable either.
Thoughts?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok

@@ -139,11 +145,13 @@ private void Run()
DateTimeOffset now = DateTimeOffset.Now;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not part of this change, I know, but please use UtcNow instead of Now (also in GetNextCycleTargetTime)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right. If I recall correctly, there may have been purpose to it - take a local TS so that the serializer can choose to UTC it or not. But thinking about it today - I agree that using UTC would be better - and if the serializer wanted to really do something local, it would do that consistently for all telemetry. Please feel free to file an issue.

@pharring
Copy link
Member

pharring commented Dec 9, 2018

Perhaps a Timer (System.Threading.Timer) would work? It doesn't burn a thread. Callbacks happen on threadpool threads.

@pharring
Copy link
Member

Another idea. Go back to the original code (using Task.Run) and change private void Run() into private async Task RunAsync(CancellationToken cancellationToken) then you can use an asynchronous (and cancellable) wait:

private async Task RunAsync(CancellationToken cancellationToken)
{
    while (true)
    {
        DateTimeOffset now = DateTimeOffset.Now;
        TimeSpan waitPeriod = GetNextCycleTargetTime(now) - now;
        await Task.Delay(waitPeriod, cancellationToken).ConfigureAwait(false);
        this.FetchAndTrackMetrics();
    }
}

Start becomes:

public bool Start()
{
    if (this.workerTask != null)
    {
        return false; // Was already running or stopped.
    }

    this.workerTask = Task.Run(() => RunAsync(this.cancellationTokenSource.Token));
    return true;
}

StopAsync becomes:

public Task StopAsync()
{
    this.cancellationTokenSource.Cancel();
    return this.workerTask;
}

You don't need the runningState (by my reading it's impossible to re-start once StopAsync has been called) although you might want to protect Start against races with a sync object. The CancellationTokenSource should be created in the constructor.

@macrogreg
Copy link
Contributor Author

@pharring , Thanks for the feedback. Code comments added with this change describe how and why this is specifically targeted to not use the thread pool and thus the most obvious Task.Delay(..) approach. Do you have concerns with that reasoning?

In that context, if we wanted to make this wait cancelable, we would probably create a ManualResetEventSlim object in a try-finally block right before we actually perform the wait, and we would dispose the object in the same finally block after the wait.
We would use the waiting period as the wait timeout and we would signal the event if a previous cancellation is requested, after appropriately making sure we have a valid, undisposed reference.

However, as mentioned, this change aims to address a customer issue in production. As such, I do not want not introduce any additional logic, however useful it might be in a separate change. This wait was not cancellable before and there is no need to make it so in the context of this fix.

However the signal logic described above would be a good improvement. Let's do it as a separate change.

Thanks! :)

private int runningState;
private Task workerTask;
private Thread aggregationThread;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any value in making this a private static Thread?

I think there is a scenario where a customer is creating multiple TelemetryConfigurations, which would create multiple of these internal classes each with a private thread. Please consider if there is any value in making this thread globally unique.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, you could end up with several instances of DefaultAggregationPeriodCycle. Not only if you have several TelemetryConfigurations, but also if you have explicitly opted into aggregating at the scope of a telemetry client.
However, a shared thread would require shared cancellation and waiting logic, which is non-trivial. For a bug fix we should always err on the side of a minimal change. Here such change is to get off the thread pool.

But yes, now that we create threads explicitly, we can consider using one shared thread for all instances of DefaultAggregationPeriodCycle in the future. It would lighten the resource load.

@pharring
Copy link
Member

I have no concerns if the fix addresses the customer issue.

// pool had run user code before. Such user code may be doing an asynchronous wait scheduled to
// continue on the same thread(e.g. this can occur when using a custom synchronization context or a
// custom task scheduler). If such case the waiting user code will never continue.
// By creating our own thread, we guarantee no interactions with potentially incorrectly written async user code.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can I suggest we just say:

// We create a dedicated thread for aggregation. This is to prevent aggregation
// from being affected by potential thread pool starvation.

The rest is about the old implementation that is (IMO) best left unsaid because that implementation contributed to the very thread pool starvation we're trying to avoid! In addition, I think the "had run user code before" and "may be doing an asynchronous wait" parts are misleading and irrelevant to the crux of the problem which was the .GetAwaiter().GetResult() piece.

@lmolkova
Copy link
Member

@macrogreg I've added missing reference. Please update changelog and we'll be good to go.

@lmolkova lmolkova closed this Dec 28, 2018
@lmolkova lmolkova reopened this Dec 28, 2018
@lmolkova
Copy link
Member

I believe several improvements were discussed for this code. @macrogreg could you please create issues describing these improvements? We'll prioritize them accordingly.

I'd like to merge this PR and ship the fix in beta3.

@lmolkova lmolkova merged commit 2f22410 into develop Dec 28, 2018
lmolkova pushed a commit that referenced this pull request Dec 28, 2018
#1028)

* Metrics: Use dedicated thread instead of the thread pool for the default aggregation cycle (code only, does not build not NetStandard).

* Add missing reference to System.Threading.Thread on netstandard1.3

* changelog
@macrogreg
Copy link
Contributor Author

@lmolkova : The improvements discussed here are good suggestions for the future, but we should approach them separately from this fix. I fee you merged it today, so i am assuming we are on the same page. :)

@lmolkova
Copy link
Member

@macrogreg, please create a separate issues for these improvements so we don't forget about them

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants