Better handle resx scenarios #38012

ericstj · 2019-05-29T07:41:05Z

There were two resx scenarios that we weren't handling well.

BinaryFormatted data that is missing the type information

BinaryFormatter deserialization never takes the type into account; the
type is only used to check the deserialized data after it's read. The old
resx reader would deserialize the data in the build task, only to
reserialize it, recording the type information. Since we're eliminating
build-time deserialization we cannot do this, so we just permit the payload
to flow through without recording the type information. This is
effectively what happened before since the user never recorded the
type information in the resx so it isn't introducing any new opportunity
for inconsistencies. To implement this I used the existing ResX format
with a sentinel type to indicate that the BinaryFormatter payload type
was unknown.
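
A minimal sketch of the sentinel idea, written in Python with pickle standing in for BinaryFormatter. The sentinel string, the tuple layout, and both function names are hypothetical; the actual constant and record format used by the writer are not shown in this description.

```python
import pickle
from typing import Optional, Tuple

# Hypothetical sentinel; the real constant is not shown in this thread.
UNKNOWN_TYPE = "<unknown-binary-formatted-type>"


def write_entry(payload: bytes, type_name: Optional[str] = None) -> Tuple[str, bytes]:
    """Pass the payload through untouched; record the sentinel when no type was given."""
    return (type_name or UNKNOWN_TYPE, payload)


def read_entry(entry: Tuple[str, bytes]):
    """Deserialize first, then use the recorded type only as an after-the-fact check."""
    type_name, payload = entry
    value = pickle.loads(payload)
    if type_name != UNKNOWN_TYPE and type(value).__name__ != type_name:
        raise TypeError(f"expected {type_name}, got {type(value).__name__}")
    return value
```

The point of the sketch is that the writer never has to deserialize the payload: the type check is skipped at read time whenever the sentinel is present.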

Primitive types stored as string

The ResX reader deserialized all types during the build; we're trying to
eliminate this because it results in build-time / cross-framework type
loading. In doing so we lose the ability to handle primitive types,
since the only way we currently write primitive types is when they
are passed in as live objects. To fix this, we'll make the string-based
type converter method aware of primitive types, and permit it to
deserialize those primitive types (in other words, parse the string via
TypeConverter) so that we still write these as primitive resources.
We'll rename this method to AddResource to indicate it is more
generic than just handling pre-serialized data. To identify primitive types
we use a string comparer to match the type name written in the resx,
and map it to a known type (in the build framework).
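
As a rough, language-neutral illustration of the name-matching step, here is a Python sketch. The table contents, the case-insensitive comparison, the dropping of the assembly part of the name, and the function names are all assumptions made for the example; the change itself is C# and uses TypeConverter.

```python
# Hypothetical table: primitive type names as written in a resx, mapped to parsers.
PRIMITIVE_PARSERS = {
    "system.int32": int,
    "system.double": float,
    "system.boolean": lambda s: s.strip().lower() == "true",
    "system.string": str,
}


def simple_type_name(type_name):
    """Drop the assembly part of a name such as 'System.Int32, mscorlib'."""
    return type_name.split(",", 1)[0].strip().lower()


def add_resource(resources, name, value, type_name):
    """Store primitives as parsed values; keep everything else as (type name, string)."""
    parser = PRIMITIVE_PARSERS.get(simple_type_name(type_name))
    if parser is not None:
        resources[name] = parser(value)  # written as a primitive resource
    else:
        resources[name] = (type_name, value)  # conversion deferred to runtime


resources = {}
add_resource(resources, "Count", "42", "System.Int32, mscorlib, Version=4.0.0.0")
add_resource(resources, "Origin", "1, 2", "System.Drawing.Point, System.Drawing")
```

Only the names in the table are ever parsed during the build, so no type from the target framework needs to be loaded.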


ericstj · 2019-05-29T13:09:51Z

@ViktorHofer have a look at this: https://mc.dot.net/#/user/dotnet-bot/pr~2Fdotnet~2Fcorefx~2Frefs~2Fpull~2F38012~2Fmerge/test~2Ffunctional~2Fcli~2Finnerloop~2F/20190529.1/workItem/System.Runtime.WindowsRuntime.Tests/wilogs. Tests passed but the job was fire-balled because some script activity after the test timed out. What is it doing?

src/System.Resources.Extensions/ref/System.Resources.Extensions.cs

ericstj · 2019-05-29T13:24:16Z

Linux musl failures all appear to be https://github.com/dotnet/core-eng/issues/6327, which @MattGal is working on.

ViktorHofer · 2019-05-29T13:29:52Z

> @ViktorHofer have a look at this: https://mc.dot.net/#/user/dotnet-bot/pr~2Fdotnet~2Fcorefx~2Frefs~2Fpull~2F38012~2Fmerge/test~2Ffunctional~2Fcli~2Finnerloop~2F/20190529.1/workItem/System.Runtime.WindowsRuntime.Tests/wilogs. Tests passed but the job was fire-balled because some script activity after the test timed out. What is it doing?

@MattGal it seems you are running additional logic after the test invocation which modifies the ExitCode. I thought you were capturing the ExitCode and setting it at the end of the script, but apparently that's not the case.

MattGal · 2019-05-29T15:57:14Z

@ViktorHofer yes, this is what is called a "timeout". That means the work item ran longer than it was expected to, and was killed. As a timed-out work item has no exit code, I had to pick one, and years ago I arbitrarily picked -3 (as it doesn't match common Windows/Linux exit codes). We also send events so that a timeout can be told apart from a real exit code of -3, though either is a failure.

You can tell this from the logs:

2019-05-29 08:45:22,436: ERROR: job(46): kill: Job running for too long. Killing...
2019-05-29 08:45:22,436: ERROR: executor(561): _execute_command: Executor timed out after 900 seconds and was killed.
2019-05-29 08:45:22,436: INFO: event(44): send: Sending event type WorkItemTimeout
2019-05-29 08:45:22,577: INFO: saferequests(87): request_with_retry: Response complete with status code '201'
2019-05-29 08:45:22,577: INFO: executor(581): _execute_command: Finished _execute_command, exit code: -3

Common causes for this include variable-length work item timing, too short a timeout being set for a given work item, extreme network slowness, or issues with a service the work item is talking to. Guessing from the logs here, something took a long time in test-reporter.py but it didn't keep logging.
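
For illustration only, the timeout behavior described above can be sketched as follows. This is not the Helix client's code; the function name is hypothetical, and only the -3 value and the 900-second limit come from this thread.

```python
import subprocess
import sys

TIMEOUT_EXIT_CODE = -3  # the arbitrary value mentioned above


def run_work_item(command, timeout_seconds):
    """Run a command; report -3 if it exceeds the timeout and has to be killed."""
    try:
        completed = subprocess.run(command, timeout=timeout_seconds)
        return completed.returncode
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child and waits for it before raising,
        # so there is no real exit code to report at this point.
        return TIMEOUT_EXIT_CODE


if __name__ == "__main__":
    # A work item that sleeps past a one-second limit is reported as -3.
    code = run_work_item([sys.executable, "-c", "import time; time.sleep(5)"], 1)
    print(code)
```

Everything launched by the command counts against the limit, including any post-test scripts, which is why a passing test can still end with -3.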

ViktorHofer · 2019-05-29T16:00:05Z

OK I didn't look into the py script. Thanks!

ericstj · 2019-05-29T16:05:31Z

I recognize this is a timeout; however, the test itself finished with plenty of time left. The thing that timed out was a bunch of shady post-test activity involving installing python, which I don't think is any of ours...

Test log:

----- start  8:30:25.47
...
=== TEST EXECUTION SUMMARY ===
   System.Runtime.WindowsRuntime.Tests  Total: 292, Errors: 0, Failed: 0, Skipped: 0, Time: 2.440s
...
----- end  8:34:28.75 ----- exit code 0 ----------------------------------------------------------
 Wed 05/29/2019- 8:34:28.92
Using base prefix 'C:\\python3.7.0'
...

Then in the run_client.py log:

2019-05-29 08:45:22,436: ERROR: job(46): kill: Job running for too long. Killing...
2019-05-29 08:45:22,436: ERROR: executor(561): _execute_command: Executor timed out after 900 seconds and was killed.

So whatever started after the test completed ran for another 10 minutes and eventually caused the test to be killed.
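
The arithmetic behind that estimate can be checked with a short Python sketch. It assumes both logs use the same local clock, and it rewrites the comma in the run_client.py timestamp as a dot so that one format string fits both; the helper name is made up for the example.

```python
from datetime import datetime

FMT = "%H:%M:%S.%f"


def seconds_between(start, end):
    """Elapsed seconds between two same-day timestamps in H:M:S.fraction form."""
    return (datetime.strptime(end, FMT) - datetime.strptime(start, FMT)).total_seconds()


# Timestamps copied from the logs above.
test_run = seconds_between("8:30:25.47", "8:34:28.75")
post_test = seconds_between("8:34:28.75", "8:45:22.436")

print(f"test run:  {test_run:.1f} s")
print(f"post-test: {post_test:.1f} s")
```

Under those assumptions the test itself accounts for roughly four minutes, and the post-test activity for nearly eleven of the fifteen allowed.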

MattGal · 2019-05-29T16:28:20Z

@ericstj That's a good question for @alexperovich when he gets in; he wrote the test-reporter python script, which comes from dotnet/arcade. Usually that script logs more to the console, but my most likely guess as to what could have gone wrong would be network slowness in the VSTS API, which is what this reporter script spends most of its time talking to. From a Helix client's perspective it's currently a black box: you told it to do something and kill it if it ran > 15 minutes, and it took > 15 minutes.

I know there's an effort to make the VSTS API reporting occur outside your timeout, but I wouldn't expect this for about a month or so (as it will require a fair bit of plumbing and teaching the Helix client about VSTS).

ericstj · 2019-05-29T23:29:57Z

Looked to me like it was just spending 10 minutes installing python, but it could be that he just forgot to log anything from that script.

alexperovich · 2019-05-29T23:40:14Z

That test reporter script needs to use pip to install the vsts package so it can talk to Azure DevOps to report test results. If the network is having problems on that machine I can see that install taking 10 minutes. It is only done once per machine, though, so it won't happen again on the same machine.
The first line of the python script is a print() so it never actually got to running the reporter. It did some pip installing and then hit a timeout.

We could probably add some machine setup stuff to do this install when we set up the machines.
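
A rough sketch of what such a setup step might look like, assuming the package can be detected by a top-level module name. The helper name is hypothetical, and the module name to pass for the vsts package is an assumption not confirmed in this thread.

```python
import importlib.util
import subprocess
import sys


def ensure_package(module_name, pip_name=None):
    """Install a package only if its top-level module is missing.

    Returns True if pip was run, False if the module was already importable.
    """
    if importlib.util.find_spec(module_name) is not None:
        return False
    subprocess.check_call(
        [sys.executable, "-m", "pip", "install", pip_name or module_name]
    )
    return True


if __name__ == "__main__":
    # "json" ships with Python, so this call never reaches pip.
    print(ensure_package("json"))
```

Run at machine setup, a step like this would keep the slow pip install out of the work item's timeout window.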

ViktorHofer · 2019-05-29T23:51:08Z

> We could probably add some machine setup stuff to do this install when we set up the machines.

That would be a good idea.