
Multiclass text classification: training consumes a lot of RAM #6007

Open

Description

ds example.txt

System information

  • Windows 10 Home Single Language
  • .NET Version 5.0.400
  • Microsoft.ML 1.6.0

Issue

I'm trying to train a model on a dataset of about 60 MB (an example is attached; I can't provide the full dataset for privacy reasons). Each row contains a text description of about 50-200 characters. There are 84 labels in total and about 100K rows in the training set. After 16-18 hours of training, the application consumes about 32 GB of RAM and terminates with a System.OutOfMemoryException (I have only 32 GB of free RAM on my PC). Is this level of RAM consumption normal for this kind of task, or am I doing something wrong?

Source code / logs

My data class:

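using Microsoft.ML.Data; // provides ColumnNameAttribute

public class SkuInfo
{
  // Label column: the category to predict.
  [ColumnName("Category")]
  public string CategoryCode { get; set; }

  [ColumnName("ManufacturerId")]
  public float ManufacturerId { get; set; }

  [ColumnName("ManufacturerPn")]
  public string ManufacturerPn { get; set; }

  // Free-text description, about 50-200 characters per row.
  [ColumnName("Description")]
  public string Description { get; set; }
}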

My training pipeline:

// Requires: using Microsoft.ML;
private IEstimator<ITransformer> BuildPipeline(MLContext mlContext)
{
	// Fill in missing numeric values before they are used as a feature.
	var pipeline = mlContext.Transforms.ReplaceMissingValues(@"ManufacturerId", @"ManufacturerId")
		// Convert each text column into a numeric feature vector.
		.Append(mlContext.Transforms.Text.FeaturizeText(@"ManufacturerPn", @"ManufacturerPn"))
		.Append(mlContext.Transforms.Text.FeaturizeText(@"Description", @"Description"))
		// Combine all inputs into a single "Features" vector.
		.Append(mlContext.Transforms.Concatenate(@"Features", new[] { @"ManufacturerId", @"ManufacturerPn", @"Description" }))
		// Map the string label to a key type, as required by the trainer.
		.Append(mlContext.Transforms.Conversion.MapValueToKey(@"Category", @"Category"))
		.Append(mlContext.Transforms.NormalizeMinMax(@"Features", @"Features"))
		.Append(mlContext.MulticlassClassification.Trainers.LbfgsMaximumEntropy(l1Regularization: 0.455F, l2Regularization: 0.034F, labelColumnName: @"Category", featureColumnName: @"Features"))
		// Convert the predicted key back to the original category string.
		.Append(mlContext.Transforms.Conversion.MapKeyToValue(@"PredictedLabel", @"PredictedLabel"));

	return pipeline;
}
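
For a rough sense of scale, the sketch below estimates how much memory the dense weight vectors of a maximum-entropy model with an L-BFGS-style update history could need. It is a back-of-the-envelope calculation, not a measurement of this pipeline. The class count of 84 comes from the description above; `featureCount` is a purely hypothetical size for the concatenated "Features" vector, and `historySize` is an assumed number of stored update pairs. Both should be replaced with the real values before drawing any conclusion.

```csharp
using System;

// Back-of-the-envelope estimate only; it does not inspect a real model.
const int classCount = 84;           // number of labels, from the issue text
const long featureCount = 500_000;   // ASSUMPTION: hypothetical length of the "Features" vector
const int historySize = 20;          // ASSUMPTION: number of stored (s, y) update pairs
const int bytesPerWeight = sizeof(float);

// One weight per (class, feature) pair plus one bias per class.
long weightsPerVector = classCount * (featureCount + 1);
long bytesPerVector = weightsPerVector * bytesPerWeight;

// An L-BFGS history keeps two dense vectors (s and y) per stored step.
long historyBytes = 2L * historySize * bytesPerVector;

Console.WriteLine($"Weights per vector: {weightsPerVector:N0}");
Console.WriteLine($"One dense vector:   {bytesPerVector / (1024.0 * 1024.0):F0} MiB");
Console.WriteLine($"Update history:     {historyBytes / (1024.0 * 1024.0 * 1024.0):F1} GiB");
```

The estimate grows linearly with the class count, the feature count, and the history size, so a large n-gram vocabulary produced by the text featurization would scale the total directly.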

Training method:

public void TrainFromCollection(IEnumerable<SkuInfo> trainData, string outputModelPath)
{
	var mlContext = new MLContext(seed: 1);
	var dataView = mlContext.Data.LoadFromEnumerable(trainData);
	var pipeline = BuildPipeline(mlContext);
	var model = pipeline.Fit(dataView);
	mlContext.Model.Save(model, dataView.Schema, outputModelPath);
}

