Skip to content

HELIX failures investigations #41563

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 22 commits into from
May 13, 2022
Merged

HELIX failures investigations #41563

merged 22 commits into from
May 13, 2022

Conversation

DamianEdwards
Copy link
Member

Investigating helix failures.

@DamianEdwards DamianEdwards added the * NO MERGE * Do not merge this PR as long as this label is present. label May 6, 2022
@DamianEdwards DamianEdwards requested a review from Pilchie as a code owner May 6, 2022 21:36
@ghost ghost added the area-mvc Includes: MVC, Actions and Controllers, Localization, CORS, most templates label May 6, 2022
@DamianEdwards DamianEdwards requested review from a team, dougbu and wtgodbe as code owners May 9, 2022 16:30
@dougbu
Copy link
Contributor

dougbu commented May 9, 2022

Suggest adding CollectDump to the --blame option at https://github.com/dotnet/aspnetcore/blob/main/eng/tools/HelixTestRunner/TestRunner.cs#L224

May also be worthwhile adding VSTEST_ DISABLE_FASTER_JSON_SERIALIZATION=1 and DOTNET_CLI_VSTEST_TRACE=1 after `https://github.com/dotnet/aspnetcore/blob/main/eng/tools/HelixTestRunner/TestRunner.cs#L37

@dougbu dougbu requested a review from HaoK May 9, 2022 17:28
@dougbu
Copy link
Contributor

dougbu commented May 9, 2022

/fyi @MarcoRossignoli this PR incorporates your suggestions

@MarcoRossignoli
Copy link
Member

cc: @nohwnd

@nohwnd
Copy link
Member

nohwnd commented May 11, 2022

I had a look on your logs, and there is no indication of what is happening, other than that a template test always runs when crash happens. Testhost seems to abruptly die.

So there are few things you can do:

  • Try enabling --blame-crash to see if you can capture a crash dump. the datacollector.log will have useful info about why it failed to capture the dump. If you are targetting <net5 tfm you should have Procdump in your PATH. Azure has a Choco task that can install it easily.
  • If the crash happens on Windows, I would try to get few of the last Application Error logs from Event Log they have useful info.
get-eventlog -LogName application -entrytype error -Source "application error" | Select -First 10 | Format-List
  • Are you by any chance connecting the processes into a Job (group of processes), so when one process dies the others follow?

@HaoK
Copy link
Member

HaoK commented May 11, 2022

@nohwnd so what we are trying in this PR is:

var commonTestArgs = $"test {Options.Target} --diag:{diagLog} --logger:xunit --logger:\"console;verbosity=normal\" --blame \"CollectHangDump;CollectDump;TestTimeout=15m\"";

It looks like some new blame variants have been added like the one you suggested, can we drop our blame with CollectHangDump/Dump/TestTimeout and replace with blame-crash and blame-hang-timeout:

var commonTestArgs = $"test {Options.Target} --diag:{diagLog} --logger:xunit --logger:\"console;verbosity=normal\" --blame-crash --blame-hang-timeout \"15m\"";

@DamianEdwards
Copy link
Member Author

@HaoK @nohwnd I've updated it here 3af7352

@nohwnd
Copy link
Member

nohwnd commented May 12, 2022

@HaoK the newer --blame-crash syntax translates to what you already had. I did not see you are already using CollectDump in the code, I probably overlooked it or saw the code before it was added.


In the latest build with abort, the vstest logs look like processes are killed externally, and this log indicates that this is the case:

2022-05-12T00:27:28.8279660Z Non-quarantined tests exceeded configured timeout: 40m.

So I guess you have one more timeout to change? :)

@DamianEdwards DamianEdwards force-pushed the damianedwards-template-tests branch from 2f51c1b to e06b8d7 Compare May 12, 2022 16:55
Copy link
Member

@HaoK HaoK left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for grinding through this even after falling into all the pits of doom, things look way better!

@@ -62,6 +62,101 @@ Then, use one of:
previous step, it is NOT advised that you install templates created on your local machine using just
`dotnet new -i [nupkgPath]`.

#### Conditional tests & skipping test platforms
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How much of this is specific to template tests? Should it be documented in another location instead?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah @HaoK said the same thing in a comment on a commit before I rebased (so it got lost from here) but I'll do that shuffle once this is in main.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe docs/Helix.md for most of this❔

@DamianEdwards DamianEdwards enabled auto-merge (squash) May 12, 2022 20:32
Copy link
Contributor

@dougbu dougbu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One thing to clean up and a few questions…

[ConditionalTheory]
[InlineData("Individual", new string[] { ArgConstants.UseLocalDb })]
[InlineData("Individual", new string[] { ArgConstants.UseProgramMain, ArgConstants.UseLocalDb })]
[OSSkipCondition(OperatingSystems.Linux | OperatingSystems.MacOSX, SkipReason = "No LocalDb on non-Windows")]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any idea how these tests ever succeeded on Linux or macOS before❔

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The pages don't fail unless you try to actually register a user (i.e. hit the database).

@@ -62,6 +62,101 @@ Then, use one of:
previous step, it is NOT advised that you install templates created on your local machine using just
`dotnet new -i [nupkgPath]`.

#### Conditional tests & skipping test platforms
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe docs/Helix.md for most of this❔

@DamianEdwards DamianEdwards merged commit cab4c6e into main May 13, 2022
@DamianEdwards DamianEdwards deleted the damianedwards-template-tests branch May 13, 2022 00:36
@ghost ghost added this to the 7.0-preview5 milestone May 13, 2022
@dougbu dougbu removed the * NO MERGE * Do not merge this PR as long as this label is present. label May 13, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area-mvc Includes: MVC, Actions and Controllers, Localization, CORS, most templates
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants