Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixed & Reactivated AutoFitImageClassificationTrainTest hanging by freeing old Tensor objects #4978

Conversation

mstfbl
Copy link
Contributor

@mstfbl mstfbl commented Mar 27, 2020

AutoFitImageClassificationTrainTest is occasionally hanging, even after PR #4939 . The issue here is that Tensor objects saved in ITransformer model (now renamed as 'estimatorModel`) are not being automatically freed by C#'s Garbage Collector, as these Tensor objects are made in TensorFlow's C libraries.

Edited on 4/9/2020: This PR makes ExperimentResult and CrossValidationExperimentResult implements IDisposable to free remaining C Tensor objects in memory in a deterministic manner. This method of freeing these objects ensures the user will not face null-reference/use-after-free errors when trying to access the model, as this clean up is done after GC runs.

Edited on 4/9/2020: I confirmed that this fixes the hanging "out of memory" and/or "long running test" issues by running AutoFitImageClassificationTrainTest for 100 iterations, in addition to running all the other unit tests in this build. In all of these builds, none of the issues described occur. These builds all time-out because running 100 iterations of AutoFitImageClassificationTrainTest takes more than 1 hour.

I have also reactivated the AutoFitImageClassificationTrainTest unit test with this fix.

**Edit: **

@mstfbl mstfbl requested a review from a team as a code owner March 27, 2020 21:41
@mstfbl mstfbl changed the title Fixed AutoFitImageClassificationTrainTest hanging by freeing old Tensor objects Fixed & Reactivated AutoFitImageClassificationTrainTest hanging by freeing old Tensor objects Mar 30, 2020
@mstfbl mstfbl requested a review from a team as a code owner April 9, 2020 04:05
@mstfbl
Copy link
Contributor Author

mstfbl commented Apr 10, 2020

I am now disposing of Tensor models and their C-library objects in ExperimentResult.cs and CrossValidateExperimentResult.cs, and have added manual result.Dispose commands to the AutoML test cases. I have tested this approach in my debugging PR, and saw that no out-of-memory errors occured. Here's that test build of running AutoFitImageClassificationTrainTest for 100 iterations: https://dev.azure.com/dnceng/public/_build/results?buildId=595447&view=results #Resolved

@mstfbl mstfbl closed this Apr 12, 2020
@mstfbl mstfbl force-pushed the AutoFitImageClassificationTrainTest-memoryFix branch from 6e64d60 to 8660ecc Compare April 12, 2020 04:00
@mstfbl mstfbl reopened this Apr 12, 2020
@harishsk
Copy link
Contributor

harishsk commented Apr 16, 2020

        DataViewSchema modelInputSchema,

If you are not going to support modelFileInfo being null, you should throw an exception if it is null


Refers to: src/Microsoft.ML.AutoML/Experiment/Runners/RunnerUtil.cs:22 in 087c0d5. [](commit_id = 087c0d5, deletion_comment = False)

@mstfbl
Copy link
Contributor Author

mstfbl commented Apr 16, 2020

        DataViewSchema modelInputSchema,

With the current implementation, modelFileInfo is never null as in the case where the given root directory is null, a temp file in a temp path is returned.


In reply to: 614421392 [](ancestors = 614421392)


Refers to: src/Microsoft.ML.AutoML/Experiment/Runners/RunnerUtil.cs:22 in 087c0d5. [](commit_id = 087c0d5, deletion_comment = False)

@harishsk
Copy link
Contributor

        DataViewSchema modelInputSchema,

Yes, but that knowledge does not exist here in this function. modelFileInfo gets passed without any validation to ModelContainer. Tomorrow, someone else may make a code change without realizing that.
It is good practice to validate your function parameters. Please add either a debug assert explicitly throw if parameter is null.


In reply to: 614783589 [](ancestors = 614783589,614421392)


Refers to: src/Microsoft.ML.AutoML/Experiment/Runners/RunnerUtil.cs:22 in 087c0d5. [](commit_id = 087c0d5, deletion_comment = False)

@mstfbl mstfbl marked this pull request as draft April 28, 2020 06:53
@codecov
Copy link

codecov bot commented Jun 4, 2020

Codecov Report

Merging #4978 into master will decrease coverage by 2.48%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master    #4978      +/-   ##
==========================================
- Coverage   75.81%   73.32%   -2.49%     
==========================================
  Files         993     1007      +14     
  Lines      181224   188128    +6904     
  Branches    19510    20246     +736     
==========================================
+ Hits       137387   137937     +550     
- Misses      38538    44665    +6127     
- Partials     5299     5526     +227     
Flag Coverage Δ
#Debug 73.32% <100.00%> (-2.49%) ⬇️
#production 69.07% <100.00%> (-2.64%) ⬇️
#test 87.41% <ø> (-1.52%) ⬇️
Impacted Files Coverage Δ
...crosoft.ML.AutoML/Experiment/Runners/RunnerUtil.cs 60.00% <ø> (ø)
test/Microsoft.ML.AutoML.Tests/AutoFitTests.cs 86.77% <ø> (ø)
...oft.ML.AutoML/Experiment/Runners/CrossValRunner.cs 92.50% <100.00%> (ø)
...L.AutoML/Experiment/Runners/TrainValidateRunner.cs 97.29% <100.00%> (ø)
.../Microsoft.ML.Vision/ImageClassificationTrainer.cs 91.16% <100.00%> (-0.05%) ⬇️
...ML.Tests/Scenarios/IrisPlantClassificationTests.cs 0.00% <0.00%> (-100.00%) ⬇️
test/Microsoft.ML.AutoML.Tests/Util.cs 50.00% <0.00%> (-50.00%) ⬇️
...st/Microsoft.ML.Functional.Tests/Datasets/Adult.cs 58.82% <0.00%> (-41.18%) ⬇️
...rc/Microsoft.ML.AutoML/API/RunDetails/RunDetail.cs 77.27% <0.00%> (-18.19%) ⬇️
...enerator/CodeGenerator/CSharp/PipelineExtension.cs 67.34% <0.00%> (-16.33%) ⬇️
... and 226 more

@mstfbl mstfbl closed this Jul 28, 2020
@ghost ghost locked as resolved and limited conversation to collaborators Mar 18, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants