Fixed & Reactivated AutoFitImageClassificationTrainTest hanging by freeing old Tensor objects #4978

mstfbl · 2020-03-27T21:41:24Z

AutoFitImageClassificationTrainTest is occasionally hanging, even after PR #4939. The issue is that the Tensor objects saved in the ITransformer model (now renamed `estimatorModel`) are not freed automatically by C#'s garbage collector, because these Tensor objects are allocated in TensorFlow's C libraries.

Edited on 4/9/2020: This PR makes ExperimentResult and CrossValidationExperimentResult implement IDisposable so that the remaining C Tensor objects in memory are freed in a deterministic manner. Freeing the objects this way ensures the user will not face null-reference/use-after-free errors when trying to access the model, as this cleanup is done after GC runs.
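As a rough illustration of the idea, here is a minimal sketch of deterministic disposal for a result object that owns unmanaged memory. It is not the code in this PR: `NativeTensor` and `SketchExperimentResult` are hypothetical names, and `Marshal.AllocHGlobal` only stands in for memory allocated by TensorFlow's C libraries.

```csharp
using System;
using System.Runtime.InteropServices;

// Hypothetical stand-in for a tensor allocated by a native (C) library.
// Marshal.AllocHGlobal simulates unmanaged memory that the GC does not track.
public sealed class NativeTensor : IDisposable
{
    private IntPtr _handle;

    public NativeTensor(int byteCount)
    {
        _handle = Marshal.AllocHGlobal(byteCount);
    }

    public bool IsDisposed => _handle == IntPtr.Zero;

    public void Dispose()
    {
        // Free the unmanaged block once; later calls are no-ops.
        if (_handle != IntPtr.Zero)
        {
            Marshal.FreeHGlobal(_handle);
            _handle = IntPtr.Zero;
        }
    }
}

// Hypothetical result type that owns a model backed by native tensors.
public sealed class SketchExperimentResult : IDisposable
{
    private readonly NativeTensor _modelTensor;
    private bool _disposed;

    public SketchExperimentResult(NativeTensor modelTensor)
    {
        _modelTensor = modelTensor ?? throw new ArgumentNullException(nameof(modelTensor));
    }

    // Accessing the model after disposal throws instead of touching freed memory.
    public NativeTensor Model
    {
        get
        {
            if (_disposed)
                throw new ObjectDisposedException(nameof(SketchExperimentResult));
            return _modelTensor;
        }
    }

    public void Dispose()
    {
        if (_disposed)
            return;
        _modelTensor.Dispose();
        _disposed = true;
    }
}

public static class DisposeSketch
{
    public static void Main()
    {
        var result = new SketchExperimentResult(new NativeTensor(1024));
        try
        {
            // Use the model while the result is still alive.
            Console.WriteLine(result.Model.IsDisposed);
        }
        finally
        {
            // Release the native memory even if the code above throws.
            result.Dispose();
        }
    }
}
```

The try/finally in `Main` mirrors what a test would do with a manual `Dispose()` call; a `using` statement would have the same effect.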

Edited on 4/9/2020: I confirmed that this fixes the hanging "out of memory" and/or "long running test" issues by running AutoFitImageClassificationTrainTest for 100 iterations, in addition to running all the other unit tests in this build. None of the issues described occurred in any of these builds. These builds all time out because running 100 iterations of AutoFitImageClassificationTrainTest takes more than 1 hour.

I have also reactivated the AutoFitImageClassificationTrainTest unit test with this fix.

src/Microsoft.ML.AutoML/Experiment/Runners/RunnerUtil.cs

src/Microsoft.ML.Vision/ImageClassificationTrainer.cs

mstfbl · 2020-04-10T00:21:04Z

I am now disposing of Tensor models and their C-library objects in ExperimentResult.cs and CrossValidationExperimentResult.cs, and have added manual result.Dispose() calls to the AutoML test cases. I have tested this approach in my debugging PR and saw that no out-of-memory errors occurred. Here's that test build, which runs AutoFitImageClassificationTrainTest for 100 iterations: https://dev.azure.com/dnceng/public/_build/results?buildId=595447&view=results #Resolved

test/Microsoft.ML.AutoML.Tests/AutoFitTests.cs

src/Microsoft.ML.AutoML/Experiment/ModelContainer.cs

src/Microsoft.ML.AutoML/API/ExperimentResults/CrossValidationExperimentResult.cs

src/Microsoft.ML.AutoML/API/ExperimentResults/CrossValidationExperimentResult.cs

harishsk · 2020-04-16T05:19:26Z

        DataViewSchema modelInputSchema,

If you are not going to support a null modelFileInfo, you should throw an exception when it is null.

Refers to: src/Microsoft.ML.AutoML/Experiment/Runners/RunnerUtil.cs:22 in 087c0d5.

test/Microsoft.ML.AutoML.Tests/AutoFitTests.cs

…t .Dispose() calls

mstfbl · 2020-04-16T17:17:23Z

        DataViewSchema modelInputSchema,

With the current implementation, modelFileInfo is never null: when the given root directory is null, a temp file in a temp path is returned instead.
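A minimal sketch of that kind of fallback, assuming a helper shaped roughly like the one described. `ModelPathSketch.GetModelFileInfo` and its parameters are hypothetical and are not the actual RunnerUtil code.

```csharp
using System;
using System.IO;

public static class ModelPathSketch
{
    // Hypothetical helper: never returns null. When no root directory is
    // supplied, it points at a randomly named file in the system temp path.
    public static FileInfo GetModelFileInfo(DirectoryInfo rootDirectory, string fileName)
    {
        if (rootDirectory == null)
        {
            string tempFile = Path.Combine(Path.GetTempPath(), Path.GetRandomFileName());
            return new FileInfo(tempFile);
        }

        // Constructing a FileInfo does not create the file on disk.
        return new FileInfo(Path.Combine(rootDirectory.FullName, fileName));
    }
}
```

The guarantee lives entirely inside this helper, so a caller that receives the resulting `FileInfo` has no way to see it from the signature alone.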

In reply to: 614421392

Refers to: src/Microsoft.ML.AutoML/Experiment/Runners/RunnerUtil.cs:22 in 087c0d5.

src/Microsoft.ML.AutoML/API/ExperimentResults/ExperimentResult.cs

harishsk · 2020-04-16T18:41:49Z

        DataViewSchema modelInputSchema,

Yes, but that knowledge does not exist here in this function. modelFileInfo gets passed to ModelContainer without any validation. Tomorrow, someone else may make a code change without realizing that.
It is good practice to validate your function parameters. Please either add a debug assert or explicitly throw if the parameter is null.
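The two options could look roughly like this. This is a sketch only: `RunnerSketch.CreateContainer` and `ModelContainerSketch` are hypothetical names, not the signatures in RunnerUtil.cs or ModelContainer.cs.

```csharp
using System;
using System.Diagnostics;
using System.IO;

// Hypothetical container that stores where the model lives on disk.
public sealed class ModelContainerSketch
{
    private readonly FileInfo _modelFileInfo;

    public ModelContainerSketch(FileInfo modelFileInfo)
    {
        _modelFileInfo = modelFileInfo;
    }

    public string ModelPath => _modelFileInfo.FullName;
}

public static class RunnerSketch
{
    public static ModelContainerSketch CreateContainer(FileInfo modelFileInfo)
    {
        // Option 1 (debug builds only): document the assumption with an assert.
        // Debug.Assert(modelFileInfo != null, "modelFileInfo must not be null");

        // Option 2 (all builds): fail fast with a descriptive exception.
        if (modelFileInfo == null)
            throw new ArgumentNullException(nameof(modelFileInfo));

        return new ModelContainerSketch(modelFileInfo);
    }
}
```

The explicit throw is active in release builds as well, while `Debug.Assert` calls are compiled out when the DEBUG symbol is not defined.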

In reply to: 614783589

Refers to: src/Microsoft.ML.AutoML/Experiment/Runners/RunnerUtil.cs:22 in 087c0d5.

codecov · 2020-06-04T06:11:00Z

Codecov Report

Merging #4978 into master will decrease coverage by 2.48%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master    #4978      +/-   ##
==========================================
- Coverage   75.81%   73.32%   -2.49%     
==========================================
  Files         993     1007      +14     
  Lines      181224   188128    +6904     
  Branches    19510    20246     +736     
==========================================
+ Hits       137387   137937     +550     
- Misses      38538    44665    +6127     
- Partials     5299     5526     +227

| Flag | Coverage Δ | |
| --- | --- | --- |
| #Debug | `73.32% <100.00%> (-2.49%)` | ⬇️ |
| #production | `69.07% <100.00%> (-2.64%)` | ⬇️ |
| #test | `87.41% <ø> (-1.52%)` | ⬇️ |

| Impacted Files | Coverage Δ | |
| --- | --- | --- |
| ...crosoft.ML.AutoML/Experiment/Runners/RunnerUtil.cs | `60.00% <ø> (ø)` | |
| test/Microsoft.ML.AutoML.Tests/AutoFitTests.cs | `86.77% <ø> (ø)` | |
| ...oft.ML.AutoML/Experiment/Runners/CrossValRunner.cs | `92.50% <100.00%> (ø)` | |
| ...L.AutoML/Experiment/Runners/TrainValidateRunner.cs | `97.29% <100.00%> (ø)` | |
| .../Microsoft.ML.Vision/ImageClassificationTrainer.cs | `91.16% <100.00%> (-0.05%)` | ⬇️ |
| ...ML.Tests/Scenarios/IrisPlantClassificationTests.cs | `0.00% <0.00%> (-100.00%)` | ⬇️ |
| test/Microsoft.ML.AutoML.Tests/Util.cs | `50.00% <0.00%> (-50.00%)` | ⬇️ |
| ...st/Microsoft.ML.Functional.Tests/Datasets/Adult.cs | `58.82% <0.00%> (-41.18%)` | ⬇️ |
| ...rc/Microsoft.ML.AutoML/API/RunDetails/RunDetail.cs | `77.27% <0.00%> (-18.19%)` | ⬇️ |
| ...enerator/CodeGenerator/CSharp/PipelineExtension.cs | `67.34% <0.00%> (-16.33%)` | ⬇️ |

... and 226 more

Free Tensor objects in finally statement

f4a910e

mstfbl requested review from harishsk and frank-dong-ms-zz March 27, 2020 21:41

mstfbl requested a review from a team as a code owner March 27, 2020 21:41

mstfbl requested a review from justinormont March 27, 2020 21:59

harishsk suggested changes Mar 27, 2020

src/Microsoft.ML.AutoML/Experiment/Runners/RunnerUtil.cs Outdated

src/Microsoft.ML.AutoML/Experiment/Runners/RunnerUtil.cs Outdated

mstfbl added 2 commits March 27, 2020 16:13

Update RunnerUtil.cs

4f8475c

Re-enable AutoFitImageClassificationTrainTest after fix

f7e1337

mstfbl changed the title ~~Fixed AutoFitImageClassificationTrainTest hanging by freeing old Tensor objects~~ Fixed & Reactivated AutoFitImageClassificationTrainTest hanging by freeing old Tensor objects Mar 30, 2020

justinormont reviewed Mar 31, 2020

src/Microsoft.ML.AutoML/Experiment/Runners/RunnerUtil.cs Outdated

mstfbl and others added 3 commits April 6, 2020 23:23

Added IDisposable support to ModelContainer & corrected name of modelContainer in used cases

83ad312

Corrected name of modelContainer in used cases

b366fbe

Clean up Tensor objects through finalizer/destructor of ImageClassificationModelParameters

cb22b21

mstfbl requested a review from a team as a code owner April 9, 2020 04:05

justinormont reviewed Apr 9, 2020

src/Microsoft.ML.Vision/ImageClassificationTrainer.cs Outdated

mstfbl added 2 commits April 9, 2020 16:31

Dispose ExperimentResult objects at the end

eefa76f

Dispose only Tensor objects in models

45681b4

mstfbl requested review from justinormont and harishsk April 10, 2020 00:21

justinormont reviewed Apr 10, 2020

test/Microsoft.ML.AutoML.Tests/AutoFitTests.cs Outdated

mstfbl closed this Apr 12, 2020

mstfbl force-pushed the AutoFitImageClassificationTrainTest-memoryFix branch from 6e64d60 to 8660ecc Compare April 12, 2020 04:00

mstfbl added 2 commits April 11, 2020 21:11

Don't free BestModel models

fbd3fd9

Merge remote-tracking branch 'upstream/master' into AutoFitImageClassificationTrainTest-memoryFix

2816ced

mstfbl reopened this Apr 12, 2020

mstfbl added 2 commits April 13, 2020 17:29

Throw Exception if model is trying to be accessed after disposal

7dad242

Initialize IsModelDisposed inside constructors

1488d0c

mstfbl requested a review from justinormont April 14, 2020 04:34

harishsk reviewed Apr 14, 2020

src/Microsoft.ML.AutoML/Experiment/ModelContainer.cs Outdated

harishsk reviewed Apr 14, 2020

src/Microsoft.ML.AutoML/Experiment/ModelContainer.cs Outdated

harishsk reviewed Apr 14, 2020

src/Microsoft.ML.AutoML/API/ExperimentResults/CrossValidationExperimentResult.cs Outdated

mstfbl added 4 commits April 15, 2020 22:00

Model always written to disk, no longer stored in memory, simplify model disposal

78bba9c

Model always written to disk, no longer stored in memory, simplify model disposal

bf84823

Merge branch 'AutoFitImageClassificationTrainTest-memoryFix' of https://github.com/mstfbl/machinelearning into AutoFitImageClassificationTrainTest-memoryFix

15b6135

Update ModelContainer.cs

087c0d5