
Conversation

ToddGrun (Contributor) commented Jan 26, 2025

This is in response to allocations seen when the CA process attempts to operate on a source text version that has not yet been synchronized over. In those cases, the OOP process requests a full deserialization of the source text, causing full buffer allocations in both the VS and CA processes.

There are several points of asynchronicity in the current system around sending over buffer changes; reducing any of them would reduce the likelihood of needing this full serialization.

This PR removes one of those points of asynchronicity, specifically in the workspace eventing layer. Previously, all notifications were sent on a delayed basis by wrapping each notification in a Workspace.ScheduleTask call from the RaiseWorkspaceChangedEventAsync method. The new event added here allows handlers to indicate that they need to be invoked immediately upon the workspace change. As noted by a doc comment in the PR, these handlers should be very fast.
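
To illustrate the shape of the change, here is a minimal sketch with hypothetical names; the real Roslyn implementation goes through the workspace's event map and RaiseWorkspaceChangedEventAsync, so treat this only as an outline of the delayed vs. immediate paths:

```csharp
using System;
using System.Threading.Tasks;
using Microsoft.CodeAnalysis;

// Sketch only: regular handlers stay on the delayed ScheduleTask path, while
// "immediate" handlers run synchronously as part of applying the workspace
// change and therefore must be very fast.
public abstract class SketchWorkspace
{
    // Existing behavior: handlers are queued and observe the change some time later.
    public event EventHandler<WorkspaceChangeEventArgs>? WorkspaceChanged;

    // New in this PR (name illustrative): handlers invoked inline, before control
    // returns to the code that applied the change.
    public event EventHandler<WorkspaceChangeEventArgs>? WorkspaceChangedImmediate;

    protected void RaiseWorkspaceChangedEvents(WorkspaceChangeEventArgs args)
    {
        // Immediate handlers: run on the current thread; they must not block.
        WorkspaceChangedImmediate?.Invoke(this, args);

        // Delayed handlers: the pre-existing behavior is preserved.
        ScheduleTask(() => WorkspaceChanged?.Invoke(this, args));
    }

    // Stand-in for Workspace.ScheduleTask (a serialized task queue in the real workspace).
    protected virtual void ScheduleTask(Action action) => Task.Run(action);
}
```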

The numbers from speedometer came back looking pretty good (below). Another point of asynchronicity could be investigated around the usage of the _textChangeQueue ABWQ (AsyncBatchingWorkQueue), but it might not be worth it, as this change appears to remove the majority of the serialization during typing. Other alternatives within the text syncing system, such as allowing OOP to request a text change instead of the full buffer contents, are also possible, but previous investigations into that ended up complicated and incomplete.

*** before allocations during typing in VS process ***
(screenshot)

*** commit 3 allocations during typing in VS process ***
(screenshot)

*** commit 17 allocations during typing in VS process ***
(screenshot)

*** before allocations during typing in CA process ***
(screenshot)

*** commit 3 allocations during typing in CA process ***
(screenshot)

*** commit 17 allocations during typing in CA process ***
(screenshot)

…e processed immediately

This is in response to allocations seen when the OOP process attempts to operate on a source text version that has not yet been synchronized over. In those cases, the OOP process requests a full deserialization of the source text, causing full buffer allocations in both the VS and CA processes.

There are several points of asynchronicity in the current system around sending over buffer changes, reducing any of which would reduce the likelihood of needing this full serialization.

This PR removes one of those points of asynchronicity, specifically in the workspace eventing layer. Previously, all notifications were sent on a delayed basis by wrapping each notification in a Workspace.ScheduleTask call from the RaiseWorkspaceChangedEventAsync method. The new event added here allows handlers to indicate that they need to be invoked immediately upon the workspace change. As noted by a doc comment in the PR, these handlers should be very fast.

Going to mark this as draft and get speedometer numbers off this to see if this helps enough, or if instead the other points of asynchronicity should be investigated (specifically, the usage of the _textChangeQueue ABWQ). There are other alternatives within the text syncing system such as allowing OOP to request a text change instead of the full buffer contents, but previous investigations into that ended up complicated and incomplete.
@ghost added the Area-IDE and untriaged (Issues and PRs which have not yet been triaged by a lead) labels Jan 26, 2025
ToddGrun (Contributor, Author) commented Jan 26, 2025

Moving out of draft mode as the numbers look pretty good. Still some serialization happening during typing, but it's a lot less than before and this change was pretty trivial. If code reviewers think adding this to the workspace level is not desirable, can look into other approaches.

@ToddGrun ToddGrun marked this pull request as ready for review January 27, 2025 03:36
@ToddGrun ToddGrun requested a review from a team as a code owner January 27, 2025 03:36
@ToddGrun ToddGrun changed the title WIP: Add new workspace event that gives handlers the opportunity to be processed immediately Add new workspace event that gives handlers the opportunity to be processed immediately Jan 27, 2025
…r reduce the amount of serialization and deserialization
ToddGrun (Contributor, Author) commented Jan 27, 2025

commit 5 PR validation insertion: https://dev.azure.com/devdiv/DevDiv/_git/VS/pullrequest/606025

…ositions to allow composition to include an IThreadingContext
…ecksumUpdater.DispatchSynchronizeTextChanges and RemoteAssetSynchronizationService.SynchronizeTextChangesAsync
using var _ = ArrayBuilder<(DocumentId id, Checksum textChecksum, ImmutableArray<TextChange> changes, Checksum newTextChecksum)>.GetInstance(out var builder);

foreach (var (oldDocument, newDocument) in values)
_ = _threadingContext.JoinableTaskFactory.RunAsync(async () =>
Member

we should def doc the design here. make it clear why we've written it this way.

note: you can also use IAsyncToken here to track the work, allowing you to properly be able to unit test as well.
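
For reference, a rough sketch of the IAsyncToken pattern being suggested, assuming Roslyn's asynchronous-operation-listener test hooks (the field and method names here are illustrative, and the exact helper signatures are from memory and may differ):

```csharp
// Sketch: track the fire-and-forget work with an IAsyncToken so tests can drain it
// through the IAsynchronousOperationListener infrastructure instead of racing it.
// _asyncListener and SynchronizeTextChangesForDocumentAsync are illustrative names.
var asyncToken = _asyncListener.BeginAsyncOperation(nameof(SolutionChecksumUpdater));

var joinableTask = _threadingContext.JoinableTaskFactory.RunAsync(
    () => SynchronizeTextChangesForDocumentAsync(oldDocument, newDocument, _shutdownToken));

// Complete (dispose) the token when the underlying task finishes, success or failure,
// so listener waiters in tests unblock once the work is done.
_ = joinableTask.Task.CompletesAsyncOperation(asyncToken);
```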

Member

Is the intent here that this is fire-and-forget (and will it still always yield the thread or not)? Because there's absolutely a chance in VS this might be on the UI thread, and we're calling those GetChangeRanges methods and such, which might (?) be expensive. Or maybe not.

This indeed might be worth documenting, since it's otherwise unclear to me why we're wrapping this in a JTF.RunAsync().

Contributor Author

The intent is to fire-and-forget, but only yield the current thread if an async operation necessitates it.

Is GetChangeRanges expensive? It doesn't look to be to me, but maybe I'm missing an override? If it is expensive, the two conditions that use it could probably be removed, as they are very minor and specialized optimizations.

Will add some doc here

Member

I had thought GetChangeRanges could potentially end up in a text diff, but maybe not.

using (Logger.LogBlock(FunctionId.Workspace_EventsImmediate, (s, p, d, k) => $"{s.Id} - {p} - {d} {kind.ToString()}", newSolution, projectId, documentId, kind, CancellationToken.None))
{
args = new WorkspaceChangeEventArgs(kind, oldSolution, newSolution, projectId, documentId);
ev.RaiseEvent(static (handler, arg) => handler(arg.self, arg.args), (self: this, args));
Member

aside, are these exception safe if the attached handler throws an exception? i ask because the old system might have been ok if the Task went into a faulted state. i'm not sure about the new system.

Member

Oops, asked this same question here: #76932 (comment). I think this might cause an issue now if this were to throw, so it's probably best to handle that better.

Contributor Author

Any concern with me just changing RaiseEvent from using FatalError.ReportAndPropagate to instead use FatalError.ReportAndCatch? (Also, would it make sense to have the try/catch inside the foreach in that method?)

Contributor Author

@jasonmalinowski -- any thoughts on changing RaiseEvent to instead use FatalError.ReportAndCatch (and on moving the try/catch inside the loop)?

Contributor Author

I went ahead and just made the change to switch to ReportAndCatch
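
For context, a minimal sketch of what that change looks like, assuming a RaiseEvent shape like the one quoted above (this is not the actual Roslyn event-map code; the point is that FatalError.ReportAndCatch in the exception filter reports the failure non-fatally and swallows it, so one broken handler can't leave the workspace half-notified):

```csharp
using System;
using System.Collections.Generic;
using Microsoft.CodeAnalysis.ErrorReporting; // FatalError (Roslyn-internal)

internal sealed class EventRegistrySketch<TEventArgs>
{
    private readonly List<EventHandler<TEventArgs>> _handlers = new();

    public void RaiseEvent<TArg>(Action<EventHandler<TEventArgs>, TArg> invoker, TArg arg)
    {
        // Each subscriber gets its own try/catch so the remaining subscribers are
        // still notified even if one of them throws.
        foreach (var handler in _handlers)
        {
            try
            {
                invoker(handler, arg);
            }
            catch (Exception ex) when (FatalError.ReportAndCatch(ex))
            {
                // Reported and swallowed; continue with the next handler.
            }
        }
    }
}
```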

VSCode_Project_Load_Started = 861,
VSCode_Projects_Load_Completed = 862,

// 900-999 for items that don't fit into other categories.
Member

Suggested change
// 900-999 for items that don't fit into other categories.
// 900-999 things related to workspace and OOP solution sync

We're not restricted to three digits, so no reason to make this the final catch-all group.

Contributor Author

I wanted a catch-all group as I was tired of scanning through trying to find the next available slot with all the other groups. Is the request here to change the starting range / size of the catch-all group, or just not to have it?

Member

Just don't have it. And if we think groups don't make sense in the first place, just get rid of all the groups. 😄

Member

i really like groups. just give things around 10-20 entries. it's def nicer when you're adding a bunch of new telemetry markers to a specific area.

using (Logger.LogBlock(FunctionId.Workspace_EventsImmediate, (s, p, d, k) => $"{s.Id} - {p} - {d} {kind.ToString()}", newSolution, projectId, documentId, kind, CancellationToken.None))
{
args = new WorkspaceChangeEventArgs(kind, oldSolution, newSolution, projectId, documentId);
ev.RaiseEvent(static (handler, arg) => handler(arg.self, arg.args), (self: this, args));
Member

It appears RaiseEvent will report any exception being thrown but will still propagate it out. That might be dangerous here, because it now means any subscriber's mistake will now completely break the workspace and put the user in a really broken state. I admit I'm not sure why we aren't catching exceptions in RaiseEvent, so maybe that should be changed (or we have a variant that does catch).

use some local functions for cleanup
Update some comments
@CyrusNajmabadi (Member) left a comment

Signing mostly off. I do want Jason to approve as well. And there look to be a few places we can clean up. I would like exception resilience thought about. Thanks!

Workspace workspace,
IAsynchronousOperationListenerProvider listenerProvider,
IThreadingContext threadingContext,
CancellationToken shutdownToken)
Member

I'm curious if the shutdown token is needed, instead of just using the disposal token on the threading context.

Contributor Author

I think they are different. VisualStudioWorkspaceServiceHubConnector has StartListening/StopListening methods, which drive the shutdownToken. However, I think the threadingContext's DisposalToken is driven by MEF and isn't disposed until VS shutdown.

Contributor Author

Just saw your other comment about VisualStudioWorkspaceServiceHubConnector. Let me think about that.

Contributor Author

It looks like right now, VisualStudioWorkspaceServiceHubConnector.StopListening is called on Workspace dispose. If I just send over the threadingContext and use its cancellation token, are we ok with essentially just using a cancellation token from MEF dispose and not workspace dispose?

Member

yeah. let's keep this as is. i've looked at the code and i'm not sure how to reconcile the different Disposes/CTs. So let's not change that now.


// only push solution snapshot from primary (VS) workspace:
_checksumUpdater = new SolutionChecksumUpdater(workspace, _listenerProvider, _disposalCancellationSource.Token);
_checksumUpdater = new SolutionChecksumUpdater(workspace, _listenerProvider, _threadingContext, _disposalCancellationSource.Token);
Member

If we have the threading context, we might be able to get rid of the disposal logic here and of passing that CT along.

Update comments
Use string constants instead of allocating
@ToddGrun (Contributor, Author)

@jasonmalinowski -- would love to get your feedback on the changes and any remaining concerns you may have that haven't been addressed as desired

TelemetryLogging.LogAggregatedCounter(FunctionId.ChecksumUpdater_SynchronizeTextChangesStatus, KeyValueLogMessage.Create(m =>
{
m[TelemetryLogging.KeyName] = nameof(SolutionChecksumUpdater) + "." + metricName;
m[TelemetryLogging.KeyValue] = 1L;
Member

is the idea that this will add 1 each time this happens?

Contributor Author

Yes, the LogAggregatedCounter will add the specified amount to the counter

{
m[TelemetryLogging.KeyName] = nameof(SolutionChecksumUpdater) + "." + metricName;
m[TelemetryLogging.KeyValue] = 1L;
m[TelemetryLogging.KeyMetricName] = metricName;
Member

why do we need this here, if it is encoded in KeyName above? is one of these redundant?

Contributor Author

This ends up down in AbstractAggregatingLog.Log. That code requires KeyName/KeyValue to be specified, and if KeyMetricName isn't specified, it falls back to using KeyName as the metric name.

I can look into whether there are callers that depend on that behavior and, if not, remove the parameter. However, I'd prefer not to: even if it could be removed, it would add an allocation to strip the metric name out of the key name.
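
A tiny sketch of the fallback described above (illustrative only, not the actual AbstractAggregatingLog.Log code; RecordCounter is a hypothetical aggregation sink):

```csharp
// KeyName and KeyValue are required by the aggregating log; KeyMetricName is optional
// and defaults to the key name, which is why callers that want a distinct metric name
// pass both.
static void LogAggregatedSketch(IReadOnlyDictionary<string, object> properties)
{
    var keyName = (string)properties[TelemetryLogging.KeyName];
    var value = (long)properties[TelemetryLogging.KeyValue];

    var metricName = properties.TryGetValue(TelemetryLogging.KeyMetricName, out var m)
        ? (string)m
        : keyName; // fallback: reuse the key name as the metric name

    RecordCounter(keyName, metricName, value); // hypothetical sink
}
```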

{
var client = await RemoteHostClient.TryGetClientAsync(_workspace, _shutdownToken).ConfigureAwait(false);
if (client == null)
return false;
Member

slightly odd that this would show up as 'failure'. this means a customer who has OOP off will show as massively failing this op. i think this should maybe count as 'success' or perhaps N/A so we don't get the wrong idea about being in a bad state.

{
m[TelemetryLogging.KeyName] = keyName;
m[TelemetryLogging.KeyValue] = 1L;
m[TelemetryLogging.KeyMetricName] = metricName;
Member

same comment on whether we need the keyname and metric name, esp. if the keyname contains the metric name.

@CyrusNajmabadi (Member)

with your latest changes, i feel fine with this going in (even without Jason signing off). he will eventually get around to looking at this.

…every invocation and to not log telemetry in the case the RemoteHostClient.TryGetClientAsync call doesn't return a client.
ToddGrun (Contributor, Author) commented Feb 3, 2025

@sharwell has an alternative approach which he's going to try out. I'll get perf numbers on his approach once it's ready and then we can evaluate how to proceed.


In reply to: 2628571096

ToddGrun (Contributor, Author) commented Feb 4, 2025

Latest test insertion: https://dev.azure.com/devdiv/DevDiv/_git/VS/pullrequest/607891 (based on commit 14)

ToddGrun (Contributor, Author) commented Feb 5, 2025

Alternative proposed implementation by Sam here: #77059

Running speedometer tests against that and will update once those are available

WorkspaceChangeEventArgs args = null;
if (ev.HasHandlers)
{
args = new WorkspaceChangeEventArgs(kind, oldSolution, newSolution, projectId, documentId);
Member

You could just pull this assignment out of the if() block here; practically any workspace where we would care about allocations will have at least one handler. 😄

Contributor Author

Done!
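
For reference, the shape of that change against the quoted code (a sketch; the surrounding method elides other work):

```csharp
// Allocate the args unconditionally: any workspace where allocations matter will
// have at least one WorkspaceChanged subscriber anyway, and it simplifies the flow.
var args = new WorkspaceChangeEventArgs(kind, oldSolution, newSolution, projectId, documentId);

if (ev.HasHandlers)
    ev.RaiseEvent(static (handler, arg) => handler(arg.self, arg.args), (self: this, args));
```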

// we won't notify all of the callers. The expectation is such an exception would be thrown to the SafeStartNew in the workspace's event queue that
// will raise that as a fatal exception, but OperationCancelledExceptions might be able to propagate through and fault the task we are using in the
// chain. I'm choosing to use ReportWithoutCrashAndPropagate, because if our theory here is correct, it seems the first exception isn't actually
// causing crashes, and so if it turns out this is a very common situation I don't want to make a often-benign situation fatal.
Member

This comment is stale now. This looks like it was written by me, but since the comment has a link to a query (rather than the actual bug), it's quite useless.

Contributor Author

Killed the comment, things are different enough now that it's really not applicable.

Comment on lines 153 to 156
var oldDocument = e.OldSolution.GetDocument(e.DocumentId);
var newDocument = e.NewSolution.GetDocument(e.DocumentId);

Debug.Assert(oldDocument != null && newDocument != null);
Member

Suggested change
var oldDocument = e.OldSolution.GetDocument(e.DocumentId);
var newDocument = e.NewSolution.GetDocument(e.DocumentId);
Debug.Assert(oldDocument != null && newDocument != null);
var oldDocument = e.OldSolution.GetRequiredDocument(e.DocumentId);
var newDocument = e.NewSolution.GetRequiredDocument(e.DocumentId);

If we're wrong, telemetry will tell us since you're adding that catch-and-report. 😄

Contributor Author

Done

Comment on lines +271 to +272
var state = await oldDocument.State.GetStateChecksumsAsync(_shutdownToken).ConfigureAwait(false);
var newState = await newDocument.State.GetStateChecksumsAsync(_shutdownToken).ConfigureAwait(false);
Member

I'm confused a bit by the semantics we're trying to have here for OnWorkspaceChangedImmediate. The statement was "it must be very fast", but now we're computing checksums? I imagine the async-ness here is we use an AsyncLazy<> under the covers, and so if somebody else is asking we won't block our thread. But this newDocument is very new (indeed, we're still raising events for it!) so would this be the first place it'll get called? And would that then be consuming CPU right here when we wanted this to be cheap?

Contributor Author

This is a very valid concern. I debugged into the hash computation for the newDocument, and it is being forced onto the calling (UI) thread. Need to think on this.

Contributor Author

Added the Task.Yield as Cyrus suggested and verified that all the calculations have moved to a background thread, and still saw pretty good allocation numbers, as outlined in the updated PR description.
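
A minimal sketch of the pattern discussed in this thread: get the checksum/diff work off the thread that raised the workspace event. The PR itself went with a Task.Yield (per the comment above); Task.Run is shown here only because its off-thread behavior doesn't depend on the ambient SynchronizationContext, and the body elides the actual batching and remote call.

```csharp
// Fire-and-forget from the immediate workspace-changed handler: return to the caller
// right away and compute the checksums on a ThreadPool thread rather than on the
// (possibly UI) thread that raised the event.
_ = Task.Run(async () =>
{
    var oldState = await oldDocument.State.GetStateChecksumsAsync(_shutdownToken).ConfigureAwait(false);
    var newState = await newDocument.State.GetStateChecksumsAsync(_shutdownToken).ConfigureAwait(false);

    // ... compute the text changes between the documents and push them to the
    // remote (OOP) process so it doesn't need to request the full buffer ...
}, _shutdownToken);
```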

// about the text changes between oldDocument and newDocument. By doing this, we can
// reduce the likelihood of the remote side encountering an unknown checksum and
// requiring a synchronization of the full document contents.
// This method uses JTF.RunAsync to create a fire-and-forget task. JTF.RunAsync will
Member

Comment now stale? RunAsync not being used anymore?

Contributor Author

Removed the stale bit.

@CyrusNajmabadi (Member)

could we just do the simple thing of moving off the UI thread (use a TPL.Run or Yield), and just see if we still get the good results of not needing to sync anything?

ToddGrun (Contributor, Author) commented Feb 6, 2025

Yeah, maybe a Task.Yield would be good enough. Can try that out and see what the numbers say.


In reply to: 2638638591

ToddGrun (Contributor, Author) commented Feb 6, 2025

And here's the commit 17 test insertion: https://dev.azure.com/devdiv/DevDiv/_git/VS/pullrequest/608770

Numbers looked pretty good, some serialization still happens, but it's a lot less. Will update the PR description

ToddGrun (Contributor, Author) commented Feb 9, 2025

@jasonmalinowski, @sharwell -- Will go ahead and merge on Monday unless either of you feels otherwise

@ToddGrun ToddGrun merged commit 1adf017 into dotnet:main Feb 10, 2025
25 checks passed
@dotnet-policy-service dotnet-policy-service bot added this to the Next milestone Feb 10, 2025
@akhera99 akhera99 modified the milestones: Next, 17.14 P2 Feb 25, 2025
@ToddGrun ToddGrun deleted the WorkspaceChangedImmediate branch March 5, 2025 17:16
Labels: Area-IDE, untriaged (Issues and PRs which have not yet been triaged by a lead), VSCode