Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve NuGet-based MSBuild project SDK resolver #4522

Closed
wants to merge 2 commits into from

Conversation

jeffkl
Copy link
Contributor

@jeffkl jeffkl commented Mar 28, 2022

Bug

Regression? Last working version:

Description

I've decided to make one big change to the NuGet-based MSBuild project SDK resolver in order to fix the major problems which was a lot easier to test all at once.

Concurrency NuGet/Home#11443, NuGet/Home#8652

The original design of the SDK resolver assumed it would only be called once per SDK by MSBuild but we've since discovered that its being called a lot more. Particularly, in Visual Studio, the SDK resolver is being called a lot for design-time builds which leads to thread starvation during solution load. This is because the locking happens very late in the APIs, when a restore is taking place. This means a lot of threads are all trying to restore the same package, one of them finishes, and the other notice there's nothing to do after they did quite a bit to get that far.

To resolve this, I've added a ConcurrentDictionary<LibraryIdentity, Lazy<SdkResult>> to ensure that the same SDK is never resolved more than once per process. For MSBuild.exe this works the same as before but for Visual Studio it greatly improves SDK resolution times as some overhead isn't needed (loading global.json and NuGet.Config/ISettings).

To further avoid thread starvation, I also added a SempahoreSlim around downloading of a package to ensure that the SDK resolver is never trying to do this work more than once at a time.

How packages are restored NuGet/Home#10935, NuGet/Home#7777

In the original implementation, I made a new API around generating a fake project on the fly so a DependencyGraphSpec could be passed to RestoreRunner. This is a lot of overhead to restore a single project.

The new API uses a RemoteWalkContext to resolve the package and an IPackageDownloader to download the package. Under the covers, these are the same classes used by RestoreRunner without the other stuff.

Logging NuGet/Home#11445

I added some low importance/verbose messages to help with debugging the SDK resolver. I also added an EventSource which lets you create traces with start/stop information. Here's an example of the events:

image

In a solution with 5 SDKs and 100 projects, I can see the SDK resolver called 204 times:

image

And with this change, the logic to resolve an SDK really only happens once per SDK:

image

PR Checklist

  • PR has a meaningful title

  • PR has a linked issue.

  • Described changes

  • Tests

    • Automated tests added
    • OR
    • Test exception
    • OR
    • N/A
  • Documentation

    • Documentation PR or issue filled
    • OR
    • N/A

@jeffkl
Copy link
Contributor Author

jeffkl commented Mar 28, 2022

/cc @davkean I'm accessing .Result on async methods since this MSBuild API is not async, should I made the top-level method in the resolver async and access .Result on it once rather than calling .Result in a few places? Any other suggestions? In my testing the NuGet SDK resolver only runs once per SDK in VS and combined with other changes in MSBuild, the whole process uses a lot less resources.

@erdembayar
Copy link
Contributor

@jeffkl
It looks like few tests are failing. 1 doesn't seems to be flaky one.

@jeffkl
Copy link
Contributor Author

jeffkl commented Mar 29, 2022

@erdembayar

It looks like few tests are failing. 1 doesn't seems to be flaky one.

I don't see how my changes would have affected those tests, I've re-run them yet again...

nkolev92
nkolev92 previously approved these changes Mar 29, 2022
Copy link
Member

@nkolev92 nkolev92 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

/// <summary>
/// Represents an <see cref="EventSource" /> for the NuGet-based MSBuild project SDK resolver.
/// </summary>
[EventSource(Name = "Microsoft-NuGet-SdkResolver")]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bear in mind for traces, Perf DDRITs, feedback bugs, targeted tracing etc to capture this, you'll need to opt them into this provider, or alternatively, reuse "Microsoft-VisualStudio-Common". The latter I'm less unclear of how that works to avoid clashes, etc.

@davkean
Copy link
Contributor

davkean commented Mar 29, 2022

@jeffkl Can you share out a zip/merged trace with the above changes? I want to see how much time is spent blocked on evaluation.

@jeffkl
Copy link
Contributor Author

jeffkl commented Mar 29, 2022

@davkean

Can you share out a zip/merged trace with the above changes? I want to see how much time is spent blocked on evaluation.

My test scenario is 100 projects that each randomly reference one of five MSBuild project SDKs. The packages that represent the SDKs are hosted on a local feed so its not technically hitting the network (NuGet does have to check compatibility and unzip them). I do the following:

  1. git clean the repo to remove output folders, intermediates, and the VS cache (.vs folder)
  2. Clear the NuGet global packages folder
  3. Load the solution
  4. Build the solution

WithPackageRestore.PerfViewData.etl.zip

I also did the steps again without clearing the NuGet global packages folder:

WithoutPackageRestore.PerfViewData.etl.zip

I see 500 total project evaluations and the NuGet SDK resolver is called 195 times but only does 5 unique look ups (the others receive a cached value and return almost immediately). global.json is only parsed once.

There is some waiting for package restore since I'm using a semaphore to ensure only one restore is happening at a time:

image

It looks like the first package restore is slower than the rest, I'm assuming because of JIT:

image

@jeffkl jeffkl added the Merge next release PRs that should not be merged until the dev branch targets the next release label Mar 29, 2022
/// <param name="logger">A <see cref="ILogger" /> to use when logging messages.</param>
/// <param name="sdkLogger">A <see cref="NuGetSdkLogger" /> to use when logging errors or warnings.</param>
/// <returns>An <see cref="SdkResult" /> representing the details of the package if it was found or errors if any occured.</returns>
private SdkResult ResolveSdk(string id, NuGetVersion nuGetVersion, SdkResolverContext context, SdkResultFactory factory, ISettings settings, VersionFolderPathResolver versionFolderPathResolver, ILogger logger, NuGetSdkLogger sdkLogger)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the new method that does package restore

@davkean
Copy link
Contributor

davkean commented Mar 30, 2022

@jeffkl Can you give me a trace with these options turned on after clearing the cache? That trace is missing CPU/thread time.

image

@jeffkl
Copy link
Contributor Author

jeffkl commented Mar 30, 2022

@davkean Okay, I ran it with the following:

image

The file is too big to upload to GitHub so I shared it with you on OneDrive, hopefully it doesn't take too long to download it. If so, I can try putting it on a VM that you can remote into.

@davkean
Copy link
Contributor

davkean commented Mar 31, 2022

@jeffkl I took a look:

image

  • I don't see any network I/O under the resolution - was this hitting the cache? I would really love to see a trace with network included.
  • JITing is high, I will mark this down to private bits and I've excluded in numbers below.
  • NuGet SDK resolution in total in the trace takes about 800ms seconds of thread time (which is great!), 85% of which is just blocked evaluations waiting for other thread to read settings & resolve the SDK
  • This leaves about 100ms of total CPU time, most of which is I/O related. This is pretty good!

This looks great, and happy for this to go in. I think we still need to solve a few problems before we call this done done:

  1. Network time is going to result in a lot more blocking time, we're going to have evaluations (and hence project loads) blocked on a single evaluation hitting the network and downloading the packages. If the user changes this version during a branch switch or by manually change the file, then the UI will hang if it needs anything that requires an evaluation.
  2. Wonder if we can pre-prime the pump to avoid evaluating projects together that are going to block on the same resolver, currently we evaluate only as many projects as there are CPUs (minus 1), if we're blocked its a waste of thread time.
  3. SDK resolution outside of the NuGet resolver is just plain expensive. Can we work together to get some actionable bugs filed on this:

image

@jeffkl
Copy link
Contributor Author

jeffkl commented Mar 31, 2022

I don't see any network I/O under the resolution - was this hitting the cache? I would really love to see a trace with network included.

The packages are on a local feed so NuGet is just finding the .nupkg and unzipping it. I'll work on another where packages are actually downloaded. Bear in mind my connection is fast...

JITing is high, I will mark this down to private bits and I've excluded in numbers below.

Yeah not surprising. There's another problem I was looking into a few years ago, the SDK resolvers are initialized via LoadFrom() context so NGen'd assemblies aren't used. I had experimented with having MSBuild load them with the assembly full name which would load NGen'd assemblies and should reduce JIT time. That said, the NuGet SDK resolver does load Newtonsoft.Json.dll which I don't think we NGen. But my dream was to have everything NGen'd and JIT time should go away right?

NuGet SDK resolution in total in the trace takes about 800ms seconds of thread time (which is great!), 85% of which is just blocked evaluations waiting for other thread to read settings & resolve the SDK

Woohoo, that is fast! I did optimize loading of global.json but NuGet.Config is being parsed in the same way. Beyond that its just the process of Path.Combine() for existing packages. Downloading and unzipping a package is always going to be the most expensive operation but SDK packages are very small and honestly we don't rev them that much so most of the time we shouldn't see network hits. But I am using the mindset of worst case scenario here so these improvements should still help that.

SDK resolution outside of the NuGet resolver is just plain expensive. Can we work together to get some actionable bugs filed on this

100% agree. I have a few ideas in this regard:

  • Profile the MSBuild side of things around SDK resolver initialization and see what's taking so long
  • Ship a single resolver that resolves .NET and NuGet-based SDKs instead of having two
  • Load SDK resolvers with a load context that gets us NGen'd assemblies

I'll send you a link to another perfview ZIP with a network download where I did the following:

  1. Git clone MSBuild repo (https://github.com/dotnet/msbuild)
  2. build once to populate the package cache with all packages
  3. Clear NuGet's HTTP cache
  4. Delete Microsoft.Build.CentralPackageVersions and Microsoft.DotNet.Arcade.Sdk
  5. PerfView devenv.exe loading the solution

Keep in mind Speedtest.net says I'm getting 180Mbps down with a 6ms ping, so NuGet reportedly took 2.1 sec to download Arcade and 631ms to download CentralPackageVersions.

@davkean
Copy link
Contributor

davkean commented Apr 2, 2022

This is a different project, right? So I can't compare evaluation time.

This trace shows what I'm worried about. Now you haven't introduced this, but this lock contention is just an insane amount of block time. This is a whole lot of threads sitting there not making progress both with the resolver and restore itself (both are intermixed so we would need to tease them apart). When the UI thread blocks on one of these evaluations, it will hang:

image

I'm away for next week, but when I get back this is a treasure trove of optimizations that we can make to this. However, due to the design of how SDK resolution works,. it is going to be completely unavoidable to some of these locks, and hence I cannot yet recommend that folks use this in VS scenarios.

@davkean
Copy link
Contributor

davkean commented Apr 2, 2022

One more thing in this trace that I've noticed, is the impact this is having on other VS features. Because we're seeing more blocked time, more thread pool threads are being fired up, this is resulting in more work items getting worked on at the same, which results in more blocked time in features that we don't see elsewhere:

image

For example, I never see the above in any real world solution (everything in set_HasAllInformation is blocked time).

@jeffkl
Copy link
Contributor Author

jeffkl commented Apr 4, 2022

@davkean I'll schedule some time we can discuss this!

@jeffkl jeffkl force-pushed the dev-jeffkl-better-nuget-sdk-resolver branch from 211e309 to 03f48de Compare April 4, 2022 16:12
@jeffkl jeffkl force-pushed the dev-jeffkl-better-nuget-sdk-resolver branch from 03f48de to 90442c4 Compare April 5, 2022 15:45
@jeffkl jeffkl removed the Merge next release PRs that should not be merged until the dev branch targets the next release label Apr 5, 2022
@ghost ghost added the Status:No recent activity PRs that have not had any recent activity and will be closed if the label is not removed label Apr 12, 2022
@ghost
Copy link

ghost commented Apr 12, 2022

This PR has been automatically marked as stale because it has no activity for 7 days. It will be closed if no further activity occurs within another 7 days of this comment. If it is closed, you may reopen it anytime when you're ready again, as long as you don't delete the branch.

@ghost ghost closed this Apr 19, 2022
@jeffkl jeffkl reopened this Apr 21, 2022
@ghost ghost removed the Status:No recent activity PRs that have not had any recent activity and will be closed if the label is not removed label Apr 21, 2022
@ghost ghost added the Status:No recent activity PRs that have not had any recent activity and will be closed if the label is not removed label Apr 28, 2022
@ghost
Copy link

ghost commented Apr 28, 2022

This PR has been automatically marked as stale because it has no activity for 7 days. It will be closed if no further activity occurs within another 7 days of this comment. If it is closed, you may reopen it anytime when you're ready again, as long as you don't delete the branch.

@ghost ghost closed this May 5, 2022
@nkolev92 nkolev92 deleted the dev-jeffkl-better-nuget-sdk-resolver branch August 30, 2022 01:43
This pull request was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Status:No recent activity PRs that have not had any recent activity and will be closed if the label is not removed
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants