-
Notifications
You must be signed in to change notification settings - Fork 1.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fixes #4571. About memory leak when using FeaturizeText. #4576
Changes from 7 commits
1182b91
19a2d0a
d2f6528
20cb8de
c3a1174
d36ea8e
eca5b13
62af85b
e82485b
3dfa0b9
198caed
48892ea
ba5dadd
94b2424
a5eb6c8
ce181b6
25d6250
70886fe
b5766fb
c94a293
2998b05
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,136 @@ | ||
// Licensed to the .NET Foundation under one or more agreements. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @sharwell Please, review this benchmark test and tell me if this was what you had in mind when you requested to add one |
||
// The .NET Foundation licenses this file to you under the MIT license. | ||
// See the LICENSE file in the project root for more information. | ||
|
||
using System; | ||
using System.IO; | ||
using System.Collections.Generic; | ||
using System.Linq; | ||
using Microsoft.ML.Data; | ||
using BenchmarkDotNet.Attributes; | ||
using Microsoft.ML.Transforms.Text; | ||
using Xunit; | ||
|
||
namespace Microsoft.ML.Benchmarks | ||
{ | ||
[Config(typeof(TrainConfig))] | ||
public class FeaturizeTextBench | ||
{ | ||
private MLContext mlContext; | ||
private IDataView dataset; | ||
private static int numColumns = 1000; | ||
private static int numRows = 300; | ||
private static int maxWordLength = 15; | ||
|
||
[GlobalSetup] | ||
public void SetupData() | ||
{ | ||
Path.GetTempFileName(); | ||
mlContext = new MLContext(seed: 1); | ||
var path = Path.GetTempFileName(); | ||
Console.WriteLine($"Created dataset in temporary file:\n{path}\n"); | ||
path = CreateRandomFile(path); | ||
|
||
var columns = new List<TextLoader.Column>(); | ||
for(int i = 0; i < numColumns; i++) | ||
{ | ||
columns.Add(new TextLoader.Column($"Column{i}", DataKind.String, i)); | ||
} | ||
|
||
var textLoader = mlContext.Data.CreateTextLoader(new TextLoader.Options() | ||
{ | ||
Columns = columns.ToArray(), | ||
HasHeader = false, | ||
Separators = new char[] { ',' } | ||
}); | ||
|
||
dataset = textLoader.Load(path); | ||
} | ||
|
||
[Benchmark] | ||
public ITransformer TrainFeaturizeText() | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. 💡 Consider adding the following:
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. 💡 Can you include a sample output from this test as a comment in the pull request, just so we have a point-in-time reference? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Not sure what the sample output should look like, since in this sample I am actually not generating an output, only a transformer that featurizes the text. I could add a step where I actually use the transformer to transform some input data (perhaps a single row of the dataset). But then again, the output of the featurizer would be a big vector of floats representing random strings. So I wouldn't think it's worth it to do this. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. ❔ Is there any meaningful way to vary the "size" of this test, such that we can see how increases to the size impact the performance? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Well, varying the number of rows generated for the dataset is the first thing I can think of. Would you recommend adding more benchmarks for different sizes? |
||
{ | ||
var textColumns = new List<string>(); | ||
for (int i = 0; i < 20; i++) // Only load first 20 columns | ||
{ | ||
textColumns.Add($"Column{i}"); | ||
} | ||
|
||
var featurizers = new List<TextFeaturizingEstimator>(); | ||
foreach (var textColumn in textColumns) | ||
{ | ||
var featurizer = mlContext.Transforms.Text.FeaturizeText(textColumn, new TextFeaturizingEstimator.Options() | ||
{ | ||
CharFeatureExtractor = null, | ||
WordFeatureExtractor = new WordBagEstimator.Options() | ||
{ | ||
NgramLength = 2, | ||
MaximumNgramsCount = new int[] { 200000 } | ||
} | ||
}); | ||
featurizers.Add(featurizer); | ||
} | ||
|
||
IEstimator<ITransformer> pipeline = featurizers.First(); | ||
foreach (var featurizer in featurizers.Skip(1)) | ||
{ | ||
pipeline = pipeline.Append(featurizer); | ||
} | ||
|
||
var model = pipeline.Fit(dataset); | ||
var memoryUsage = GC.GetTotalMemory(true); | ||
Console.WriteLine($"Memory Used: {memoryUsage/1000000:0,0.00}MB"); | ||
Assert.True(memoryUsage < 1000000000, $"This benchmark should use less than 1GB of memory, but it's using {memoryUsage/1000000:0,0.00}MB"); // Memory usage should be less than 1GB after PR #4576 | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @harishsk this is the assertion you asked me to add to the benchmark. In my machine, without the changes made in this PR, the value of There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. 1GB maybe a bit high. How about 2x the memory usage you are seeing on your machine? If memory usage on a different machine exceeds two times what you are seeing on your machine, then it is reason to investigate what is happening. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Ok, so I will change it into 400MB. Thanks! There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Sorry, when running the test on my machine memoryUsage is actually, at most 120MB, so I will fix the threshold into 240MB. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @adamsitnik Is there a way to make assertions about the stabilized outcome of the benchmark? |
||
|
||
return model; | ||
} | ||
|
||
public static string CreateRandomFile(string path) | ||
{ | ||
Random random = new Random(1); | ||
|
||
using (StreamWriter file = new StreamWriter(path)) | ||
{ | ||
for(int i = 0; i < numRows; i++) | ||
file.WriteLine(CreateRandomLine(numColumns, random)); | ||
} | ||
return path; | ||
} | ||
|
||
public static string CreateRandomLine(int columns, Random random) | ||
{ | ||
var lineSB = new System.Text.StringBuilder(); | ||
for(int i = 0; i < columns; i++) | ||
{ | ||
lineSB.Append(CreateRandomColumn(random, random.Next(100))); | ||
lineSB.Append(","); | ||
} | ||
return lineSB.ToString(); | ||
} | ||
|
||
public static string CreateRandomColumn(Random random, int numwords) | ||
{ | ||
const string characters = | ||
"01234567890" + | ||
"abcdefghijklmnopqrstuvwxyz" + | ||
"ABCDEFGHIJKLMNOPQRSTUVWXYZ"; | ||
|
||
var columnSB = new System.Text.StringBuilder(); | ||
int wordLength; | ||
|
||
for(int i = 0; i < numwords; i++) | ||
{ | ||
wordLength = random.Next(1, maxWordLength); | ||
for(int j = 0; j < wordLength; j++) | ||
columnSB.Append(characters[random.Next(characters.Length)]); | ||
|
||
columnSB.Append(" "); | ||
} | ||
|
||
if (random.Next(2) == 0) // sometimes return the column as lowercase | ||
return columnSB.ToString().ToLower(); | ||
|
||
return columnSB.ToString(); | ||
} | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You'll want to expose the same flag in the corresponding
Add()
method, and any other callers of this function within the NormStr class.Speaking of, you may want to investigate which outside components call this interface to see if their memory usage / runtime will increase when defaulting to
true
.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The only method inside the
NormStr
class that callsGet()
withadd = true
is the Add() method you mention. So that's going to be the only other method where I will add the flag.About your other suggestion, I will discuss with @harishsk about where to investigate. But it seems to me that the Add() method is only used in 2 other classes:
ValueToKeyMapperTransformer.Builder.TextImpl
, which is the class that holds the _pool that I've been working with in this issue. It might be worth exploring cases where the VTKMT would have a worse memory usage with the changes introduced in this PR. I've tried with a couple of different datasets and textloder options, but I haven't seen this changes to cause a problem with memory usage or runtime, and I don't know which other kind of cases to explore.ModelSaveContext
, where strings are serialized to disk as indices, and they're added to a NormStr.Pool that is later serialized; furthermore, it seems the specific overload relevant to us is only called when serializing a StopWords list. In this case I don't see why the changes in this PR could cause a problem in its memory usage/runtime (because the changes of this PR would only execute once, when thestr
is not already in the pool).There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for checking for other uses which would be affected.