Open
opened on Nov 19, 2021
System information
- Windows 10 Home Single Language
- .NET Version 5.0.400
- Microsoft.ML 1.6.0
Issue
I'm trying to train a model on a dataset of about 60 MB (an example is in the attachments; I can't provide the full dataset for privacy reasons). Each row contains text descriptions of about 50-200 characters. There are 84 labels in total and about 100K rows in the training set. After 16-18 hours of training, the application consumes about 32 GB of RAM and terminates with a System.OutOfMemoryException (I have only 32 GB of free RAM on my PC). Is this level of RAM consumption normal for this kind of task, or am I doing something wrong?
Source code / logs
My data class:
using Microsoft.ML.Data; // provides the ColumnName attribute

public class SkuInfo
{
    // Label column: the category to predict.
    [ColumnName("Category")]
    public string CategoryCode { get; set; }

    [ColumnName("ManufacturerId")]
    public float ManufacturerId { get; set; }

    [ColumnName("ManufacturerPn")]
    public string ManufacturerPn { get; set; }

    [ColumnName("Description")]
    public string Description { get; set; }
}
My training pipeline:
private IEstimator<ITransformer> BuildPipeline(MLContext mlContext)
{
    var pipeline = mlContext.Transforms.ReplaceMissingValues(@"ManufacturerId", @"ManufacturerId")
        // Turn both text columns into numeric feature vectors.
        .Append(mlContext.Transforms.Text.FeaturizeText(@"ManufacturerPn", @"ManufacturerPn"))
        .Append(mlContext.Transforms.Text.FeaturizeText(@"Description", @"Description"))
        .Append(mlContext.Transforms.Concatenate(@"Features", new[] { @"ManufacturerId", @"ManufacturerPn", @"Description" }))
        // Convert the string label into a key type, as the trainer requires.
        .Append(mlContext.Transforms.Conversion.MapValueToKey(@"Category", @"Category"))
        .Append(mlContext.Transforms.NormalizeMinMax(@"Features", @"Features"))
        .Append(mlContext.MulticlassClassification.Trainers.LbfgsMaximumEntropy(
            l1Regularization: 0.455F,
            l2Regularization: 0.034F,
            labelColumnName: @"Category",
            featureColumnName: @"Features"))
        // Map the predicted key back to the original category string.
        .Append(mlContext.Transforms.Conversion.MapKeyToValue(@"PredictedLabel", @"PredictedLabel"));

    return pipeline;
}
Training method:
public void TrainFromCollection(IEnumerable<SkuInfo> trainData, string outputModelPath)
{
    var mlContext = new MLContext(seed: 1);
    var dataView = mlContext.Data.LoadFromEnumerable(trainData);
    var pipeline = BuildPipeline(mlContext);
    var model = pipeline.Fit(dataView);
    mlContext.Model.Save(model, dataView.Schema, outputModelPath);
}