Fixes #4571. About memory leak when using FeaturizeText. #4576

antoniovs1029 · 2019-12-13T20:51:00Z

Updated first post:

Originally this PR included different proposals to solve issue #4571 , which are described below. From all of them, proposal 3 was chosen, and now this PR only includes the changes of that Proposal.

Original first post:

In this PR I include 3 proposals to solve issue #4571:

Modifying NormStr constructor

OR -

Modifying TextNormalizer

OR -

Modifying NormStr.Pool Get Method

(the proposals are mutually exclusive, in that we are supposed to choose one of those, but I've included the three in this PR so that we can discuss about them, since @harishsk and I are unsure about the overall impact of each one, please let us know your opinion, and if there's anything else I should test)

For reference: Offline, @daholste provided me with a dataset which I can't share in here because of privacy reasons. It had 100k rows and 2.5k columns. When featurizing 22 of those columns, he got 22 NormStr Pools which, together, had an inclusive size of 2.25 GB, because many entire rows where held in memory, mainly because of the reason I described in this comment. The text that follows assume experiments based on that dataset, finding n-grams of size 1.

When using Proposal 1 the inclusive size of the NormStr Pools went down from 2.25GB to 22MB. When using Proposal 2 it went down to 30MB. With Proposal 3 it went down to 22MB. With any of the proposals, or without them, the time for fitting was around 40 seconds on my machine.

All of the proposals are based in the idea of creating a ReadOnlyMemory<char> that doesn't have an _object member holding a reference to the string of the whole row.

Proposal 1 - Modifying NormStr constructor

Link to proposal

Each NormStr has _object that only contains the text that NormStr is representing, and it's saved as a char[]
With @daholste's dataset 260k NormStrs are created, and each one of those holds a reference to the corresponding char[]
When fitting with this proposal, this was the Process Memory, notice the oscillations:

@harishsk and I think the oscillations in memory usage are caused by the strings that are created with this implementation, but that are then garbage collected every once in a while, because no actual reference to them is kept.

All the references to entire rows were gone

Proposal 2 - Modifying Text Normalizer

Link to proposal

In general, each NormStr has a _object of type string, which holds the content of the entire column where the NormStr was found.
With the dataset, 260k NormStrs are created, and less than 120k strings are held in memory.
When fitting with this proposal, this was the Process Memory graph, notice no oscillations:
Some references to entire rows were still there (I am still figuring out why). But there were only a few of them, compared to when there was no fix; as it is shown in the decrease in size of the NormStr pools I mentioned (from 2.2GB to 30MB), this proposal is still useful, at least for @daholste case. All the references to strings containing entire rows are gone.

Proposal 3 - Modifying NormStr.Pool Get Method

Link to proposal

Each NormStr has _object that only contains the text of the string that NormStr is representing
With the dataset, 260k NormStr were created, each one holding a reference to their corresponding string.
When fitting with this proposal, this was the Process Memory, notice the oscillations:
All the references to entire rows were gone

src/Microsoft.ML.Core/Utilities/NormStr.cs

src/Microsoft.ML.Transforms/Text/TextNormalizing.cs

sharwell · 2019-12-13T22:42:33Z

I would offer proposal 3: only reconstruct strings when they are actually getting added to the pool. We can still use sub-arrays (spans of larger strings) to avoid allocations when looking things up in the pool. To implement proposal 3, you only need to change str to str.ToString().AsMemory() on this line:

machinelearning/src/Microsoft.ML.Core/Utilities/NormStr.cs

Line 139 in fa62c81

return add ? AddCore(str, hash) : null;

There are three calls to AddCore, but this is the only call that has the possibility of including a large set of characters than are necessary to represent the cached value.

antoniovs1029

I would offer proposal 3: only reconstruct strings when they are actually getting added to the pool.

Thanks for your suggestion, @sharwell . I've just talked to @justinormont , and he actually had the exact same suggestion as yours, and I agree. It would be functionally very similar to Proposal 1 (having similar pros and cons I enumerated in the first post for Proposal 1), because both force the creation of a new ReadOnlyMemory without a reference to the whole row, by temporarilly creating a new string.

The advantage of Proposal 3 is that no char arrays are created, and also that this would only be done for the strs that are ReadOnlyMemory<char> when calling NormStr.Pool's Get with such str.

So I think its better to replace Proposal 1, with Proposal 3 here. And so far, I think I am preferring to not follow Proposal 2.

sharwell · 2019-12-13T23:01:28Z

@antoniovs1029 I would suggest running the test again on proposal 3. Since it only allocates on a cache miss, it may substantially lower the allocation churn you observed when testing proposal 1.

antoniovs1029 · 2019-12-13T23:32:15Z

@antoniovs1029 I would suggest running the test again on proposal 3. Since it only allocates on a cache miss, it may substantially lower the allocation churn you observed when testing proposal 1.

I've done it. But it seems to show the same behavior as Proposal 1. Still I've updated the first post with those findings, and made some corrections to the results of Proposal 2.

src/Microsoft.ML.Core/Utilities/NormStr.cs

codecov · 2019-12-14T00:36:15Z

Codecov Report

Merging #4576 into master will increase coverage by <.01%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##           master    #4576      +/-   ##
==========================================
+ Coverage   75.13%   75.13%   +<.01%     
==========================================
  Files         909      913       +4     
  Lines      160262   160855     +593     
  Branches    17257    17303      +46     
==========================================
+ Hits       120409   120866     +457     
- Misses      35041    35143     +102     
- Partials     4812     4846      +34

Flag	Coverage Δ
#Debug	`75.13% <ø> (ø)`	⬆️
#production	`70.53% <ø> (ø)`	⬆️
#test	`90.29% <ø> (-0.01%)`	⬇️

Impacted Files	Coverage Δ
...osoft.ML.Tests/Transformers/TextFeaturizerTests.cs	`99.63% <ø> (ø)`	⬆️
...rc/Microsoft.ML.TensorFlow/TensorTypeExtensions.cs	`59.09% <0%> (-22.73%)`	⬇️
....ML.AutoML/PipelineSuggesters/PipelineSuggester.cs	`83.19% <0%> (-3.37%)`	⬇️
test/Microsoft.ML.AutoML.Tests/DatasetUtil.cs	`88.69% <0%> (-2.61%)`	⬇️
.../Microsoft.ML.Vision/ImageClassificationTrainer.cs	`91.06% <0%> (-2.58%)`	⬇️
src/Microsoft.ML.AutoML/Sweepers/Parameters.cs	`84.32% <0%> (-0.85%)`	⬇️
src/Microsoft.ML.TensorFlow/TensorflowTransform.cs	`73.67% <0%> (-0.4%)`	⬇️
...soft.ML.TestFramework/DataPipe/TestDataPipeBase.cs	`76.08% <0%> (-0.4%)`	⬇️
src/Microsoft.ML.Vision/DnnRetrainTransform.cs	`57.03% <0%> (-0.23%)`	⬇️
...soft.ML.Data/DataLoadSave/Text/TextLoaderCursor.cs	`85.11% <0%> (-0.21%)`	⬇️
... and 16 more

This reverts commit 1182b91.

This reverts commit 19a2d0a.

src/Microsoft.ML.Core/Utilities/NormStr.cs

justinormont · 2019-12-16T20:18:31Z

src/Microsoft.ML.Core/Utilities/NormStr.cs

@@ -116,7 +116,7 @@ public NormStr Get(string str, bool add = false)
                return add ? AddCore(str.AsMemory(), hash) : null;
            }

-            public NormStr Get(ReadOnlyMemory<char> str, bool add = false)
+            public NormStr Get(ReadOnlyMemory<char> str, bool add = false, bool createNewStr = true)


You'll want to expose the same flag in the corresponding Add() method, and any other callers of this function within the NormStr class.

Speaking of, you may want to investigate which outside components call this interface to see if their memory usage / runtime will increase when defaulting to true.

The only method inside the NormStr class that calls Get() with add = true is the Add() method you mention. So that's going to be the only other method where I will add the flag.

About your other suggestion, I will discuss with @harishsk about where to investigate. But it seems to me that the Add() method is only used in 2 other classes:

The ValueToKeyMapperTransformer.Builder.TextImpl , which is the class that holds the _pool that I've been working with in this issue. It might be worth exploring cases where the VTKMT would have a worse memory usage with the changes introduced in this PR. I've tried with a couple of different datasets and textloder options, but I haven't seen this changes to cause a problem with memory usage or runtime, and I don't know which other kind of cases to explore.

The ModelSaveContext, where strings are serialized to disk as indices, and they're added to a NormStr.Pool that is later serialized; furthermore, it seems the specific overload relevant to us is only called when serializing a StopWords list. In this case I don't see why the changes in this PR could cause a problem in its memory usage/runtime (because the changes of this PR would only execute once, when the str is not already in the pool).

Thanks for checking for other uses which would be affected.

src/Microsoft.ML.Core/Utilities/NormStr.cs

justinormont

LGTM; few suggestions above.

antoniovs1029 · 2019-12-16T23:27:54Z

test/Microsoft.ML.Benchmarks/FeaturizeTextBench.cs

@@ -0,0 +1,136 @@
+// Licensed to the .NET Foundation under one or more agreements.


@sharwell Please, review this benchmark test and tell me if this was what you had in mind when you requested to add one

antoniovs1029 · 2019-12-16T23:34:52Z

test/Microsoft.ML.Benchmarks/FeaturizeTextBench.cs

+            var model = pipeline.Fit(dataset);
+            var memoryUsage = GC.GetTotalMemory(true);
+            Console.WriteLine($"Memory Used: {memoryUsage/1000000:0,0.00}MB");
+            Assert.True(memoryUsage < 1000000000, $"This benchmark should use less than 1GB of memory, but it's using {memoryUsage/1000000:0,0.00}MB"); // Memory usage should be less than 1GB after PR #4576


@harishsk this is the assertion you asked me to add to the benchmark. In my machine, without the changes made in this PR, the value of memoryUsage is over 2.5GB... with the changes of this PR it's less than 200MB. So I didn't know where to put the threshold, and I simply chose 1GB. Don't know if that's ok to you. Thanks.

1GB maybe a bit high. How about 2x the memory usage you are seeing on your machine? If memory usage on a different machine exceeds two times what you are seeing on your machine, then it is reason to investigate what is happening.

Ok, so I will change it into 400MB. Thanks!

Sorry, when running the test on my machine memoryUsage is actually, at most 120MB, so I will fix the threshold into 240MB.

@adamsitnik Is there a way to make assertions about the stabilized outcome of the benchmark?

harishsk

sharwell

I had a few comments but nothing blocking from my side.

sharwell · 2019-12-17T18:58:14Z

test/Microsoft.ML.Benchmarks/FeaturizeTextBench.cs

+            var memoryUsage = GC.GetTotalMemory(true);
+            Console.WriteLine($"Memory Used: {memoryUsage/1000000:0,0.00}MB");
+            Assert.True(memoryUsage < 400000000, $"This benchmark should use less than 400MB of memory, but it's using {memoryUsage/1000000:0,0.00}MB"); // Memory usage should be less than 1GB after PR https://github.com/dotnet/machinelearning/pull/4576


💭 I find it's generally not necessary to have these assertions as part of the benchmark itself.

sharwell · 2019-12-17T18:58:20Z

test/Microsoft.ML.Benchmarks/FeaturizeTextBench.cs

+        }
+
+        [Benchmark]
+        public ITransformer TrainFeaturizeText()


💡 Consider adding the following:

A benchmark that sets up the cache, and then requests the same string from it a few times using both a string and a ReadOnlyMemory<char> that is a substring of another longer string. Both cases should reveal no allocations for the lookup.

sharwell · 2019-12-17T18:58:49Z

test/Microsoft.ML.Benchmarks/FeaturizeTextBench.cs

+        }
+
+        [Benchmark]
+        public ITransformer TrainFeaturizeText()


💡 Can you include a sample output from this test as a comment in the pull request, just so we have a point-in-time reference?

Not sure what the sample output should look like, since in this sample I am actually not generating an output, only a transformer that featurizes the text.

I could add a step where I actually use the transformer to transform some input data (perhaps a single row of the dataset). But then again, the output of the featurizer would be a big vector of floats representing random strings.

So I wouldn't think it's worth it to do this.

sharwell · 2019-12-17T18:59:37Z

test/Microsoft.ML.Benchmarks/FeaturizeTextBench.cs

+        }
+
+        [Benchmark]
+        public ITransformer TrainFeaturizeText()


❔ Is there any meaningful way to vary the "size" of this test, such that we can see how increases to the size impact the performance?

Well, varying the number of rows generated for the dataset is the first thing I can think of. Would you recommend adding more benchmarks for different sizes?

antoniovs1029 added 2 commits December 13, 2019 11:23

Proposal to change NormStr

1182b91

Proposal to change TextNormalizingTransformer.Mapper

19a2d0a

antoniovs1029 requested a review from a team as a code owner December 13, 2019 20:51

antoniovs1029 commented Dec 13, 2019

View reviewed changes

src/Microsoft.ML.Core/Utilities/NormStr.cs Outdated Show resolved Hide resolved

antoniovs1029 commented Dec 13, 2019

View reviewed changes

src/Microsoft.ML.Transforms/Text/TextNormalizing.cs Outdated Show resolved Hide resolved

antoniovs1029 requested review from yaeldekel, codemzs, harishsk, justinormont, daholste and eerhardt December 13, 2019 20:58

antoniovs1029 commented Dec 13, 2019

View reviewed changes

Added Proposal 3 - Modifying NormStr.Pool.Get

d2f6528

antoniovs1029 commented Dec 13, 2019

View reviewed changes

src/Microsoft.ML.Core/Utilities/NormStr.cs Outdated Show resolved Hide resolved

antoniovs1029 mentioned this pull request Dec 13, 2019

Memory leak when featurizing text with the default settings #4571

Closed

antoniovs1029 added 3 commits December 16, 2019 10:38

Revert "Proposal to change NormStr"

20cb8de

This reverts commit 1182b91.

Revert "Proposal to change TextNormalizingTransformer.Mapper"

c3a1174

This reverts commit 19a2d0a.

Added flag to control creation of a new str

d36ea8e

justinormont reviewed Dec 16, 2019

View reviewed changes

src/Microsoft.ML.Core/Utilities/NormStr.cs Outdated Show resolved Hide resolved

justinormont reviewed Dec 16, 2019

View reviewed changes

src/Microsoft.ML.Core/Utilities/NormStr.cs Outdated Show resolved Hide resolved

justinormont reviewed Dec 16, 2019

View reviewed changes

src/Microsoft.ML.Core/Utilities/NormStr.cs Outdated Show resolved Hide resolved

justinormont approved these changes Dec 16, 2019

View reviewed changes

Added Benchmark Test

eca5b13

antoniovs1029 commented Dec 16, 2019

View reviewed changes

Added minor comment to benchmark

62af85b

Nit - Fixed references to PRs and Issues

e82485b

antoniovs1029 commented Dec 16, 2019

View reviewed changes

antoniovs1029 added 4 commits December 16, 2019 15:37

Nit - renamed flag in NormStr.Pool.Get()

3dfa0b9

Added space

198caed

Added duplicateStr prameter to NormStr.Pool.Add method

48892ea

Changed threshold in FeaturizeText Benchmark to be 400MB instead of 1GB

ba5dadd

antoniovs1029 requested a review from sharwell December 17, 2019 00:21

Typo

94b2424

harishsk approved these changes Dec 17, 2019

View reviewed changes

Changed threshold to 240MB

a5eb6c8

sharwell reviewed Dec 17, 2019

View reviewed changes

antoniovs1029 closed this Dec 17, 2019

antoniovs1029 reopened this Dec 17, 2019

antoniovs1029 removed request for codemzs, eerhardt, yaeldekel and daholste December 17, 2019 22:25

antoniovs1029 added 6 commits December 17, 2019 14:32

Added Benchmark output

ce181b6

Removed Assertion from Benchmark

25d6250

Added test with memory usage assertion

70886fe

Updated test comment

b5766fb

Comment out test Console.WriteLine

c94a293

Removing test

2998b05

antoniovs1029 merged commit 2b5bd21 into dotnet:master Dec 18, 2019

ghost locked as resolved and limited conversation to collaborators Mar 19, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fixes #4571. About memory leak when using FeaturizeText. #4576

Fixes #4571. About memory leak when using FeaturizeText. #4576

antoniovs1029 commented Dec 13, 2019 •

edited

Loading

sharwell commented Dec 13, 2019 •

edited

Loading

antoniovs1029 left a comment •

edited

Loading

sharwell commented Dec 13, 2019

antoniovs1029 commented Dec 13, 2019

codecov bot commented Dec 14, 2019 •

edited

Loading

justinormont Dec 16, 2019 •

edited

Loading

antoniovs1029 Dec 17, 2019 •

edited

Loading

justinormont Dec 17, 2019

justinormont left a comment

antoniovs1029 Dec 16, 2019 •

edited

Loading

antoniovs1029 Dec 16, 2019

harishsk Dec 16, 2019

antoniovs1029 Dec 16, 2019

antoniovs1029 Dec 17, 2019

sharwell Dec 17, 2019

harishsk left a comment

sharwell left a comment

sharwell Dec 17, 2019

sharwell Dec 17, 2019

sharwell Dec 17, 2019

antoniovs1029 Dec 17, 2019 •

edited

Loading

sharwell Dec 17, 2019

antoniovs1029 Dec 17, 2019

		@@ -0,0 +1,136 @@
		// Licensed to the .NET Foundation under one or more agreements.

Fixes #4571. About memory leak when using FeaturizeText. #4576

Fixes #4571. About memory leak when using FeaturizeText. #4576

Conversation

antoniovs1029 commented Dec 13, 2019 • edited Loading

Updated first post:

Original first post:

Proposal 1 - Modifying NormStr constructor

Proposal 2 - Modifying Text Normalizer

Proposal 3 - Modifying NormStr.Pool Get Method

sharwell commented Dec 13, 2019 • edited Loading

antoniovs1029 left a comment • edited Loading

Choose a reason for hiding this comment

sharwell commented Dec 13, 2019

antoniovs1029 commented Dec 13, 2019

codecov bot commented Dec 14, 2019 • edited Loading

Codecov Report

justinormont Dec 16, 2019 • edited Loading

Choose a reason for hiding this comment

antoniovs1029 Dec 17, 2019 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

justinormont left a comment

Choose a reason for hiding this comment

antoniovs1029 Dec 16, 2019 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

harishsk left a comment

Choose a reason for hiding this comment

sharwell left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

antoniovs1029 Dec 17, 2019 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

antoniovs1029 commented Dec 13, 2019 •

edited

Loading

sharwell commented Dec 13, 2019 •

edited

Loading

antoniovs1029 left a comment •

edited

Loading

codecov bot commented Dec 14, 2019 •

edited

Loading

justinormont Dec 16, 2019 •

edited

Loading

antoniovs1029 Dec 17, 2019 •

edited

Loading

antoniovs1029 Dec 16, 2019 •

edited

Loading

antoniovs1029 Dec 17, 2019 •

edited

Loading