
fix tensorflow test hanging issue #4997

Merged 4 commits on Apr 4, 2020

Conversation

frank-dong-ms-zz (Contributor)

Fixed 2 issues in the TensorFlow tests:

  1. Fixed an issue in ResourceManager: use ReadAsync instead of Read so that the thread can time out correctly.
  2. Moved the ImageSet download into the constructor of the test class to avoid duplicate downloads across tests.
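The first fix works because a blocking Stream.Read has no way to time out, while ReadAsync returns a Task whose completion can be waited on with a cancellation token. A minimal sketch of the pattern (not the actual ResourceManager code; the stream and timeout values are illustrative):

```csharp
using System;
using System.IO;
using System.Threading;

class ReadWithTimeoutSketch
{
    // Read one block from a stream, giving up if the read does not
    // complete before the cancellation token fires. A plain s.Read(...)
    // would block indefinitely on a hung connection.
    public static int ReadBlock(Stream s, byte[] buffer, CancellationToken ct)
    {
        var task = s.ReadAsync(buffer, 0, buffer.Length);
        // Wait throws OperationCanceledException if ct is canceled first,
        // so the caller can abandon the download and retry.
        task.Wait(ct);
        return task.Result;
    }

    static void Main()
    {
        using var cts = new CancellationTokenSource(TimeSpan.FromMinutes(3));
        using var s = new MemoryStream(new byte[] { 1, 2, 3 });
        int count = ReadBlock(s, new byte[8], cts.Token);
        Console.WriteLine(count); // 3 bytes read from the in-memory stream
    }
}
```

Note the token only abandons the wait; whether the underlying I/O is actually cancelled depends on the stream implementation.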

@frank-dong-ms-zz frank-dong-ms-zz requested a review from a team as a code owner April 3, 2020 01:10
string fullImagesetFolderPath = Path.Combine(
imagesDownloadFolderPath, finalImagesFolderName);
// set timeout to 3 minutes; downloads sometimes get stuck, so use a smaller timeout to retry the download sooner
string timoOutOldValue = Environment.GetEnvironmentVariable(ResourceManagerUtils.TimeoutEnvVariable);
Contributor

Why do we want to do this only for these tests and not for all tests? We could choose a smaller timeout and a higher retry count for all downloads.

Contributor Author

This setting (timeout) is hardcoded in our source code as 10 minutes and is not an option the user can set: https://github.com/dotnet/machinelearning/blob/master/src/Microsoft.ML.Vision/ImageClassificationTrainer.cs#L1321

The setting can be overridden by setting the env variable. I'm doing this because, if everything is fine, 3 minutes is enough; if something goes wrong, we want this to fail fast, since we are retrying 5 times.

We are not seeing similar issues in other tests, so I only do this in the affected tests.
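The override mechanism being described can be sketched as: the trainer keeps its hardcoded 10-minute default, but an environment variable, when present and valid, wins. The variable name and helper below are hypothetical; the real name lives in ResourceManagerUtils.TimeoutEnvVariable.

```csharp
using System;

class TimeoutOverrideSketch
{
    // Hypothetical env variable name standing in for
    // ResourceManagerUtils.TimeoutEnvVariable in ML.NET.
    public const string TimeoutEnvVariable = "MLNET_RESOURCE_TIMEOUT_SKETCH";

    // Return the download timeout in ms: the hardcoded default
    // (10 minutes) unless the env variable supplies a valid override.
    public static int GetTimeoutMs(int defaultMs = 10 * 60 * 1000)
    {
        string raw = Environment.GetEnvironmentVariable(TimeoutEnvVariable);
        if (int.TryParse(raw, out int overrideMs) && overrideMs > 0)
            return overrideMs;
        return defaultMs;
    }

    static void Main()
    {
        Console.WriteLine(GetTimeoutMs()); // 600000 when the variable is unset
        Environment.SetEnvironmentVariable(TimeoutEnvVariable, (3 * 60 * 1000).ToString());
        Console.WriteLine(GetTimeoutMs()); // 180000 after the override
    }
}
```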

Contributor

If we are already specifying a timeout in code, why not fix it there directly in one place instead of many places here? Would changing the timeout to 3 minutes in ImageClassificationTrainer have the same effect?



Contributor Author

In our CI environment 3 minutes is enough, but I don't want to change the 10 minutes to 3 minutes: a customer's network may not be that good, and then 3 minutes may not be enough.

Contributor

What happens if we start running tests in parallel and different tests are using different timeout values?



imagesDownloadFolderPath, finalImagesFolderName);
// set timeout to 3 minutes; downloads sometimes get stuck, so use a smaller timeout to retry the download sooner
string timoOutOldValue = Environment.GetEnvironmentVariable(ResourceManagerUtils.TimeoutEnvVariable);
Environment.SetEnvironmentVariable(ResourceManagerUtils.TimeoutEnvVariable, (3 * 60 * 1000).ToString());
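Because the variable is process-wide, the usual shape is to save the old value (as the snippet above does) and restore it afterwards, with a try/finally so the restore also runs on the failure path. A sketch with a hypothetical helper and variable name (the real name is ResourceManagerUtils.TimeoutEnvVariable):

```csharp
using System;

class EnvVarScopeSketch
{
    // Hypothetical variable name standing in for
    // ResourceManagerUtils.TimeoutEnvVariable.
    public const string TimeoutEnvVariable = "MLNET_RESOURCE_TIMEOUT_SKETCH";

    public static void RunWithTimeout(int timeoutMs, Action testBody)
    {
        string oldValue = Environment.GetEnvironmentVariable(TimeoutEnvVariable);
        Environment.SetEnvironmentVariable(TimeoutEnvVariable, timeoutMs.ToString());
        try
        {
            testBody();
        }
        finally
        {
            // Restore even if the test throws, so later (or parallel)
            // tests do not inherit the shortened timeout.
            Environment.SetEnvironmentVariable(TimeoutEnvVariable, oldValue);
        }
    }

    static void Main()
    {
        RunWithTimeout(3 * 60 * 1000, () =>
            Console.WriteLine(Environment.GetEnvironmentVariable(TimeoutEnvVariable)));
        Console.WriteLine(Environment.GetEnvironmentVariable(TimeoutEnvVariable) ?? "(unset)");
    }
}
```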

Contributor

Will setting the environment variable here and resetting it at the end of the test have any impact? It appears that you have moved the download code to the constructor, but the timeout is not being set there. So by the time the code gets here, wouldn't the download already have completed with a different timeout value?

Contributor Author (frank-dong-ms-zz, Apr 3, 2020)

They are downloading different things.

In the constructor we download an image set, and all tests use the same images, so there is no need to download duplicate images; that is why I moved it to the constructor. This improves test performance by removing unnecessary downloads.

This env variable sets the timeout for downloading some metadata for the test, and the metadata download is what is causing the hang.

Contributor

If this environment variable affects the download of metadata by ImageClassificationTrainer, can we fix this directly in the timeout that ImageClassificationTrainer uses?



Contributor Author

Same as previous: in the source code I want to make sure the timeout is sufficient, which is 10 minutes as before (this metadata file is nearly 100 MB). But in our test code, 3 minutes is enough in CI per our experience, so I set the timeout to 3 minutes to fail fast if we hit a download issue.

{
// ignore the exception and retry the download
ch.Warning($"{i + 1}-th try: Download {fileName} from {url} failed with exception {ex.Message}");
}
Contributor

We have lots of trouble with failed downloads. Can we make the exceptions more visible?

That is, we should log all the failures as you have done. But if in the end the download still fails after the retries, how about throwing an exception that download failed even after retries?
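What the reviewer is asking for can be sketched like this: log each failed attempt, and if all retries are exhausted, surface one exception carrying the collected messages instead of failing silently. Names and the delegate-based shape are hypothetical, not the actual ResourceManager signature:

```csharp
using System;
using System.Collections.Generic;

class RetryDownloadSketch
{
    // Try `download` up to maxRetries times. Each failure is logged;
    // if every attempt fails, throw so the caller cannot miss it.
    public static void DownloadWithRetry(Action download, int maxRetries,
                                         Action<string> logWarning)
    {
        var failures = new List<string>();
        for (int i = 0; i < maxRetries; i++)
        {
            try
            {
                download();
                return; // success, stop retrying
            }
            catch (Exception ex)
            {
                logWarning($"{i + 1}-th try: download failed with exception {ex.Message}");
                failures.Add(ex.Message);
            }
        }
        throw new InvalidOperationException(
            $"Download failed after {maxRetries} retries: {string.Join("; ", failures)}");
    }

    static void Main()
    {
        int attempts = 0;
        DownloadWithRetry(() =>
        {
            attempts++;
            if (attempts < 3) throw new Exception("timed out");
        }, maxRetries: 5, logWarning: Console.WriteLine);
        Console.WriteLine($"succeeded on attempt {attempts}"); // succeeded on attempt 3
    }
}
```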


Contributor

But this download code is used for things other than TensorFlow as well, right? I meant: should we throw here? If not, we have to make sure we are throwing wherever this is called from, so that download failures are propagated as errors.



Contributor Author (frank-dong-ms-zz, Apr 3, 2020)

Yes, I can see one other piece of code is also using this. We already return "downloadResult" if the download failed after the retries; ResourceDownloadResults treats the returned "downloadResult" as an error message, and the caller method takes care of the DownloadResults.

See https://github.com/dotnet/machinelearning/blob/master/src/Microsoft.ML.Transforms/Text/WordEmbeddingsExtractor.cs#L631-L638 for the other caller of this ensure-resource method.
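The contract being described, i.e. return an error message rather than throw, and let each caller convert it into an exception at its own boundary, can be sketched as follows. The class and method names here are hypothetical, modeled loosely on the ResourceDownloadResults pattern:

```csharp
using System;

class DownloadResultSketch
{
    // Null or empty error string means success; a non-empty string is
    // the accumulated failure message from the retries.
    public sealed class DownloadResult
    {
        public string ErrorMessage { get; }
        public DownloadResult(string errorMessage) => ErrorMessage = errorMessage;
        public bool Succeeded => string.IsNullOrEmpty(ErrorMessage);
    }

    // A caller (e.g. an embeddings loader) checks the result and turns
    // the message into an exception, so failures still surface as errors.
    public static void EnsureResource(DownloadResult result)
    {
        if (!result.Succeeded)
            throw new InvalidOperationException(
                $"Error downloading resource: {result.ErrorMessage}");
    }

    static void Main()
    {
        EnsureResource(new DownloadResult(null)); // success: no throw
        try
        {
            EnsureResource(new DownloadResult("download failed after retries"));
        }
        catch (InvalidOperationException ex)
        {
            Console.WriteLine(ex.Message);
        }
    }
}
```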


var task = s.ReadAsync(buffer, 0, blockSize);
task.Wait(ct);
int count = task.Result;
// REVIEW: use a progress channel instead.
Contributor

Since you need to check the result after the first call, you can change the loop to a do {} loop instead of a while () loop. If you still want a while loop, you can change it to a while (true) loop and move this inside it. In either case you won't need to repeat these three lines twice.
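Applied to the snippet above, the reviewer's do/while suggestion would look roughly like this (stream handling simplified; not the actual ResourceManager code):

```csharp
using System;
using System.IO;
using System.Threading;

class ReadLoopSketch
{
    // do/while version: the ReadAsync + Wait + Result triple appears
    // once, instead of being duplicated before and inside the loop.
    public static long CopyBlocks(Stream input, Stream output, int blockSize,
                                  CancellationToken ct)
    {
        var buffer = new byte[blockSize];
        long total = 0;
        int count;
        do
        {
            var task = input.ReadAsync(buffer, 0, blockSize);
            task.Wait(ct);
            count = task.Result;
            output.Write(buffer, 0, count);
            total += count;
        } while (count > 0);
        return total;
    }

    static void Main()
    {
        using var src = new MemoryStream(new byte[10]);
        using var dst = new MemoryStream();
        Console.WriteLine(CopyBlocks(src, dst, blockSize: 4,
                                     CancellationToken.None)); // 10
    }
}
```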

int count;

var task = s.ReadAsync(buffer, 0, blockSize);
task.Wait(ct);
Contributor

ReadAsync allows you to pass a CancellationToken directly. You can choose to use that.
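Stream.ReadAsync does have an overload that takes a CancellationToken, so the token can be passed straight through instead of going via Task.Wait; with await there is no blocking wait at all. Whether cancellation actually interrupts the underlying I/O still depends on the stream implementation. A sketch:

```csharp
using System;
using System.IO;
using System.Threading;
using System.Threading.Tasks;

class ReadAsyncTokenSketch
{
    // Pass the token to ReadAsync itself rather than to Task.Wait.
    public static async Task<int> ReadBlockAsync(Stream s, byte[] buffer,
                                                 CancellationToken ct)
    {
        return await s.ReadAsync(buffer, 0, buffer.Length, ct);
    }

    static void Main()
    {
        using var cts = new CancellationTokenSource(TimeSpan.FromMinutes(3));
        using var s = new MemoryStream(new byte[] { 1, 2, 3, 4 });
        int count = ReadBlockAsync(s, new byte[8], cts.Token).GetAwaiter().GetResult();
        Console.WriteLine(count); // 4
    }
}
```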

harishsk (Contributor) left a comment

:shipit:

@harishsk harishsk merged commit f94f359 into dotnet:master Apr 4, 2020
@frank-dong-ms-zz frank-dong-ms-zz deleted the fix-tensorflow-hanging branch April 15, 2020 20:36
@ghost ghost locked as resolved and limited conversation to collaborators Mar 18, 2022