Skip to content

Conversation

@AR-May
Copy link
Member

@AR-May AR-May commented Oct 5, 2022

Fixes #7993

Summary

In the MSBuild server, we identify the state of the server via mutexes. Sometimes, for reason yet unknown to us, mutex could throw the exception System.IO.IOException: Connection timed out. When this occurs, we fallback to old behavior building without server. We fixed some of those in #8000, but now found more situations when this happens.

Customer Impact

MSBuild non-Windows users could have intermittent error when building with dotnet build.
This does not affect Visual Studio.

Regression?

Yes, this is a regression.

Testing

Unit tests.

Risk

Low risk. The fix adds additional try-catch blocks to process this situation.

Code Reviewers

[TODO]

Description of the fix

  • Add a try-catch block to catch and process the IOException exception when mutexes are used.
  • Add a new client exit type for this kind of error.

@AR-May AR-May force-pushed the catch-mutex-errors-in-msbuild-server branch from 14f4f9f to 133463e Compare October 5, 2022 15:39
/// The client is not able to identify the server state.
/// </summary>
/// <remarks>
/// This may happen when mutex that is regulating the server state throws.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add link to related GH issue to this comment.

{
CommunicationsUtilities.Trace("Failed to obtain the current build server state: {0}", ex);
_exitResult.MSBuildClientExitType = MSBuildClientExitType.UnknownServerState;
return false;
Copy link
Member

@rokonec rokonec Oct 5, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caller of TryLaunchServer do override _exitResult.MSBuildClientExitType of false. I recommend caller does it only if MSBuildClientExitType has not been initialized yet.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice catch.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please also Trace IOException.HResult

{
// For unknown root cause, Mutex.TryOpenExisting can sometimes throw 'Connection timed out' exception preventing to obtain the build server state through it (Running or not, Busy or not).
// See: https://github.com/dotnet/msbuild/issues/7993
CommunicationsUtilities.Trace("Failed to obtain the current build server state: {0}", ex);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After communication with Jiri Vorlicek please also Trace IOException.HResult
It will help runtime team to narrow it down.

{
CommunicationsUtilities.Trace("Failed to obtain the current build server state: {0}", ex);
_exitResult.MSBuildClientExitType = MSBuildClientExitType.UnknownServerState;
return false;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please also Trace IOException.HResult

@Forgind
Copy link
Contributor

Forgind commented Oct 6, 2022

I'm confused about a few things for this PR.

First, is it intended to go into 17.4 or 17.5? You filled out the template as if it's aimed at 17.4, but I heard we were turning server off for 17.4, which makes this seem like it isn't high enough priority to meet the QB approval bar?

Second, is this its final form? It seems more like it's trying to gather information about what's going wrong when it crashes rather than actually fixing the underlying problem. Are you planning to have a follow-up PR to fix the real issue and remove these try/catches? If so, should we remove the "fixes" line?

Third, how should "UnknownServerState" be treated by the rest of MSBuild? I'm wondering if you should just shut down the server in that case and restart it with the next build rather than get into an unknown state.

@rokonec
Copy link
Member

rokonec commented Oct 7, 2022

@Forgind

Second, is this its final form?

  1. So far we have seen this bug only on CI and linux. Since Server is going to be be opt-in, it does not matter if we release it in 17.4 or 17.5 as the risk this bug will hunt someone whom opt-in is very small and IMHO acceptable.

  2. We believe it is bug in runtime and shall be fixed in runtime. Since we don't understand root cause of this bug we are backing to non-server as the best-effort. Once we realize this bug is no more, I'd vote for reverting this code.

  3. We can't connect to server so we can't instruct it to shut down. Proper recovery is unknown.

@Forgind Forgind merged commit a440ea9 into dotnet:main Oct 10, 2022
@DmitriyShepelev
Copy link
Contributor

DmitriyShepelev commented Dec 19, 2022

Looks like src\Build\PublicAPI\net\PublicAPI.Unshipped.txt is missing
Microsoft.Build.Experimental.MSBuildClientExitType.UnknownServerState = 5 -> Microsoft.Build.Experimental.MSBuildClientExitType...

Edit: Disregard given #8116 merge.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

MSBuild Server fallback mechanism doesn't work when Mutex throws exception

4 participants