
Dynamic number of features for the trainer / schema #4903

Closed
artemiusgreat opened this issue Feb 28, 2020 · 11 comments
Assignees
Labels
Awaiting User Input Awaiting author to supply further info (data, model, repro). Will close issue if no more info given. classification Bugs related classification tasks lightgbm Bugs related lightgbm loadsave Bugs related loading and saving data or models P3 Doc bugs, questions, minor issues, etc.

Comments


artemiusgreat commented Feb 28, 2020

System information

  • OS version/distro: Windows 10 Pro x64
  • .NET Version (eg., dotnet --info): .NET Core 3.0
  • ML.NET Version: 1.5.0-preview

Issue

Trying to use a variable number of properties (a dynamic schema) for the trainer via dataView.SelectColumns. This creates the correct trainer with only 2 features, but the prediction engine still requires the original input model and uses all 10+ features, even though all features except the selected 2 were set to 0.

What did you do?

  • use an input model with 10 features / properties
  • create a data view and select only 2 of these features
  • use LightGBM as the trainer
  • create 3 input items with labels Strategy1, Strategy2, Strategy3 and train the estimator
  • try to make a prediction, providing a test item identical to Strategy3

What happened?

  • the output schema in CreatePredictionEngine shows 10+ columns, even though I selected only 2 features when creating the data view for training
  • the prediction result is always the same (Strategy1), most probably because the trainer always compares all 10+ features instead of 2, even though all features except the selected 2 were set to 0

What did you expect?

  • if the estimator was trained on only 2 features / input properties, the prediction engine should use the provided data view schema and work with only the 2 selected properties
  • in the code below I'd like to make sure that the properties Contrast, Param1 ... Param5 are ignored by the prediction engine

Source code / logs

public class MyInputModel
{
  [ColumnName(nameof(PredictorLabelsEnum.Strategy)), LoadColumn(0)]
  public string Strategy { get; set; }

  [ColumnName(nameof(InputNamesEnum.Pitch)), LoadColumn(1)]
  public float Pitch { get; set; }

  [ColumnName(nameof(InputNamesEnum.Energy)), LoadColumn(2)]
  public float Energy { get; set; }

  [ColumnName(nameof(InputNamesEnum.Contrast)), LoadColumn(3, 8), VectorType(6)]
  public float[] Contrast { get; set; }
  
  [ColumnName(nameof(InputNamesEnum.Param1)), LoadColumn(9)]
  public float Param1 { get; set; }

  [ColumnName(nameof(InputNamesEnum.Param2)), LoadColumn(10)]
  public float Param2 { get; set; }

  [ColumnName(nameof(InputNamesEnum.Param3)), LoadColumn(11)]
  public float Param3 { get; set; }

  [ColumnName(nameof(InputNamesEnum.Param4)), LoadColumn(12)]
  public float Param4 { get; set; }

  [ColumnName(nameof(InputNamesEnum.Param5)), LoadColumn(13)]
  public float Param5 { get; set; }
}

public IEstimator<ITransformer> GetPipeline(IEnumerable<string> columns)
{
  var pipeline = Context
    .Transforms
    .Conversion
    .MapValueToKey(new[] { new InputOutputColumnPair("Label", "Strategy") })  // use property "strategy" as categorizable label
    .Append(Context.Transforms.Concatenate("Combination", columns.ToArray()))  // merge properties selected for analysis into "Combination"
    .Append(Context.Transforms.NormalizeMinMax(new[] { new InputOutputColumnPair("Features", "Combination") }));  // normalize selected properties as "Features"

  return pipeline;
}

public IEstimator<ITransformer> GetEstimator()
{
  var estimator = Context
    .MulticlassClassification
    .Trainers
    .LightGbm()
    .Append(Context.Transforms.Conversion.MapKeyToValue(new[] { new InputOutputColumnPair("Prediction", "PredictedLabel") }));

  return estimator;
}

public byte[] SaveModel(IEnumerable<MyInputModel> items)
{
  var columns = new [] { "Pitch", "Energy" };
  var estimator = GetEstimator();
  var pipeline = GetPipeline(columns);
  var sourceInputs = Context.Data.LoadFromEnumerable(items);
  var inputs = Context
    .Transforms
    .SelectColumns(columns.Concat(new List<string> { "Strategy" }).ToArray()) // model has ~10 properties, we select only 2 of them 
    .Fit(sourceInputs)
    .Transform(sourceInputs);

  var pipelineModel = pipeline.Fit(inputs);
  var pipelineView = pipelineModel.Transform(inputs);
  var estimatorModel = pipeline.Append(estimator).Fit(inputs);
  var model = new byte[0];

  using (var memoryStream = new MemoryStream())
  {
    Context.Model.Save(estimatorModel, pipelineView.Schema, memoryStream);
    model = memoryStream.ToArray();
  }

  return model;
}

public string LoadModelAndEstimate(byte[] predictor)
{
  var prediction = string.Empty;

  // let's make the input identical to Strategy3, but somehow the predicted result is still Strategy1

  var input = new MyInputModel 
  {
    Pitch = 50,
    Energy = 10,
    Contrast = new float[] { 0, 0, 0, 0, 0, 0 }, // note: "new []" alone would infer int[], which doesn't convert to float[]
    Param1 = 0,
    Param2 = 0,
    Param3 = 0,
    Param4 = 0,
    Param5 = 0
  };

  using (var stream = new MemoryStream(predictor))
  {
    var model = Context.Model.Load(stream, out var schema) as TransformerChain<ITransformer>;
    var chain = (model.LastTransformer as IEnumerable<ITransformer>).First() as MulticlassPredictionTransformer<OneVersusAllModelParameters>;
    var chainModel = chain.Model as OneVersusAllModelParameters; // here I see only 3 properties with weights - Pitch, Energy, Label
    var engine = Context.Model.CreatePredictionEngine<MyInputModel, MyOutputModel>(model); // here output schema shows 10+ columns, even though I expect 3
    
    // also tried to specify data view schema from the model explicitly for prediction engine
    // var engine = Context.Model.CreatePredictionEngine<MyInputModel, MyOutputModel>(model, schema); 
    
    prediction = engine.Predict(input);
  }

  return prediction;
}

Example

var testData = new[]
{
  new MyInputModel
  {
    Strategy = "Strategy1",
    Pitch = 115,
    Energy = 50,
    Contrast = new float[] { 0, 0, 0, 0, 0, 0 },
    Param1 = 0, Param2 = 0, Param3 = 0, Param4 = 0, Param5 = 0
  },
  new MyInputModel
  {
    Strategy = "Strategy2",
    Pitch = 90,
    Energy = 30,
    Contrast = new float[] { 0, 0, 0, 0, 0, 0 },
    Param1 = 0, Param2 = 0, Param3 = 0, Param4 = 0, Param5 = 0
  },
  new MyInputModel
  {
    Strategy = "Strategy3",
    Pitch = 50,
    Energy = 10,
    Contrast = new float[] { 0, 0, 0, 0, 0, 0 },
    Param1 = 0, Param2 = 0, Param3 = 0, Param4 = 0, Param5 = 0
  }
};

var trainData = new[]
{
  new MyInputModel
  {
    Strategy = "Strategy3",
    Pitch = 50,
    Energy = 10,
    Contrast = new float[] { 0, 0, 0, 0, 0, 0 },
    Param1 = 0, Param2 = 0, Param3 = 0, Param4 = 0, Param5 = 0
  }
};
@ganik ganik self-assigned this Feb 28, 2020

ganik commented Feb 28, 2020

I'll take a look.


artemiusgreat commented Mar 1, 2020

After creating a simplified example, I see that LightGBM always selects the first item in the training set as the prediction, no matter what is provided as test data.
I tested both approaches: prediction made by the model directly and by the prediction engine.
https://docs.microsoft.com/en-us/dotnet/machine-learning/how-to-guides/machine-learning-model-predictions-ml-net

public class StrategyInputModel
{
  [ColumnName("Strategy"), LoadColumn(0)]
  public string Strategy { get; set; }  // will be used as a label (classifier)

  [ColumnName("Pitch"), LoadColumn(1)]
  public float Pitch { get; set; }  // will be used as a part of dynamic Features

  [ColumnName("Energy"), LoadColumn(2)]
  public float Energy { get; set; }  // will be always 0

  [ColumnName("Contrast"), LoadColumn(3, 8), VectorType(6)]
  public float[] Contrast { get; set; }  // will be always [0, 0, 0, 0, 0, 0]
}

public class StrategyOutputModel
{
  [ColumnName("Prediction")]
  public string Prediction { get; set; }
  public float[] Score { get; set; }
}

public IEstimator<ITransformer> GetPipeline(IEnumerable<string> columns)
{
  var pipeline = Context
    .Transforms
    .Conversion
    .MapValueToKey(new[] { new InputOutputColumnPair("Label", "Strategy") })
    .Append(Context.Transforms.Concatenate("Combination", columns.ToArray())) // merge "dynamic" columns into a single property
    .Append(Context.Transforms.NormalizeMinMax(new[] { new InputOutputColumnPair("Features", "Combination") })) // normalize merged columns into Features
    .Append(Context.Transforms.SelectColumns(new string[] { "Label", "Features" })); // remove everything from data view, except transformed columns

  return pipeline;
}

public IEstimator<ITransformer> GetEstimator()
{
  var estimator = Context
    .MulticlassClassification
    .Trainers
    .LightGbm()
    .Append(Context.Transforms.Conversion.MapKeyToValue(new[]
    {
      new InputOutputColumnPair("Prediction", "PredictedLabel") // set trainer to use Prediction property as output
    }));

  return estimator;
}

public byte[] SaveModel(IEnumerable<string> columns, IEnumerable<StrategyInputModel> items)
{
  var estimator = GetEstimator();
  var pipeline = GetPipeline(columns);
  var inputs = Context.Data.LoadFromEnumerable(items);
  var estimatorModel = pipeline.Append(estimator).Fit(inputs);
  var model = new byte[0];

  using (var memoryStream = new MemoryStream())
  {
    Context.Model.Save(estimatorModel, inputs.Schema, memoryStream);
    model = memoryStream.ToArray();
  }

  return model;
}

Test method

public string Estimate()
{
  var aInput = new StrategyInputModel
  {
    Strategy = "A",
    Pitch = 130F,
    Energy = 0,
    Contrast = new float[] { 0, 0, 0, 0, 0, 0 }
  };

  var bInput = new StrategyInputModel
  {
    Strategy = "B",
    Pitch = 131F,
    Energy = 0,
    Contrast = new float[] { 0, 0, 0, 0, 0, 0 }
  };

  var columns = new[] { "Pitch" };
  var predictor = SaveModel(columns, new[] { aInput, bInput }); // train model on "A" and "B"

  using (var stream = new MemoryStream(predictor))
  {
    var model = Context.Model.Load(stream, out var schema);
    var inputs = Context.Data.LoadFromEnumerable(new[] { bInput }); // pass "B" as test data

    var predictions = model.Transform(inputs);
    var output = Context.MulticlassClassification.Evaluate(data: predictions); // Log Loss = 0.69, Micro / Macro Accuracy = 1
    var modelPrediction = predictions.GetColumn<string>("Prediction").ToArray().FirstOrDefault(); // get "A" as prediction [WRONG]

    var engine = Context.Model.CreatePredictionEngine<StrategyInputModel, StrategyOutputModel>(model);
    var enginePrediction = engine.Predict(bInput).Prediction; // get "A" as prediction [WRONG]

    return modelPrediction;
  }
}


artemiusgreat commented Mar 2, 2020

Now, shocking news :)
Tried 4 different architectures / model types, including combinations of them, e.g. creating a model with FastTreeOva and testing with LightGbmMulti.

  • FastTreeOva
  • LightGbmMulti
  • SdcaMaximumEntropy
  • AveragedPerceptronOva

Training set = 3 records. Test set = item #2 from the training set.

Results

A model created and trained with AveragedPerceptronOva always produces correct results. For example, if I create the model with AveragedPerceptronOva and then test an item using the same AveragedPerceptronOva or LightGbmMulti, the prediction is correct.

If I create and train the model using any other architecture, then no matter which model I use with the test data, it always returns item #1 from the training set as the best match, which is wrong.

- [Train] AveragedPerceptronOva => [Test] FastTreeOva => Correct prediction
- [Train] AveragedPerceptronOva => [Test] LightGbmMulti => Correct prediction
- [Train] AveragedPerceptronOva => [Test] SdcaMaximumEntropy => Correct prediction
- [Train] AveragedPerceptronOva => [Test] AveragedPerceptronOva => Correct prediction
- [Train] FastTreeOva => [Test] Any model type => Wrong prediction
- [Train] LightGbmMulti => [Test] Any model type => Wrong prediction
- [Train] SdcaMaximumEntropy => [Test] Any model type => Wrong prediction

Conclusion

There is no issue with the Prediction Engine. Something is wrong with the Context.Model.Save method in the ModelOperationsCatalog class, because when the model is created correctly, any other model works fine and produces correct predictions.

@ganik ganik added the P0 Priority of the issue for triage purpose: IMPORTANT, needs to be fixed right away. label Mar 2, 2020

artemiusgreat commented Mar 5, 2020

Possibly related...
#4051
#3878


artemiusgreat commented Mar 10, 2020

Update

The algorithms AveragedPerceptronOva and SdcaMaximumEntropy work much better when training and prediction are based on a vector of values rather than a single value.
Below is an example with the real values I used for testing.

Train data that gives a CORRECT classification

var inputs = new InputModel[]
{
  new InputModel
  {
    Label = "Sample #1",
    Factors = new float[] { 163.22714f, 2.8778636f, 0.5324864f, 1.5412121f, 0.64363956f, 0.1371824f, -0.021679323f, 0.42805633f, -0.712864f, -0.2189847f, 0.12471165f, 0.07920727f, 0.47652832f }
  },
  new InputModel
  {
    Label = "Sample #2",
    Factors = new float[] { 148.25192f, 4.3155456f, 0.70223117f, 1.5649862f, 1.1754155f, 0.13773751f, 0.2579985f, -0.26886848f, -0.6455144f, -0.073765576f, -0.15425977f, 0.19466293f, 0.43180266f }
  },
  new InputModel
  {
    Label = "Sample #3",
    Factors = new float[] { 164.9029f, 4.810955f, 0.87685776f, 1.4808261f, 0.9378684f, 0.13101591f, -0.06908134f, -0.067622736f, -0.8588759f, -0.038343582f, 0.36045787f, -0.25861377f, 0.63997686f }
  }
};

Train data that gives a WRONG classification

var inputs = new InputModel[]
{
  new InputModel
  {
    Label = "Sample #1",
    Factor = 154.1958F
  },
  new InputModel
  {
    Label = "Sample #2",
    Factor = 130.47337F
  },
  new InputModel
  {
    Label = "Sample #3",
    Factor = 135.6923F
  }
};

Results

When I use the "wrong" data set for training and then try each of its items as test data, AveragedPerceptronOva can successfully identify "Sample #1" with a value of 154, but fails to distinguish the values 130 and 135.

When I use the "correct" data set, AveragedPerceptronOva and SdcaMaximumEntropy can correctly identify each item when it is used as test data.

Tree-based algorithms always fail and return incorrect results on small data sets, no matter what training set is provided and which of its items is used as test data. At the same time, trees work reasonably well on 500+ records. Perhaps a tree cannot be built from only 2-3 items?

Question

Is there an activation function, a threshold, or an algorithm parameter that can make these algorithms more sensitive, so they correctly separate the values 130 and 135?
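
One knob that might be worth trying (an assumption on my part, not a confirmed fix): LightGBM gates tree growth with a per-leaf minimum example count, and the default is larger than a 3-record training set can satisfy, so no split ever happens. A minimal sketch lowering it via the trainer options:

```csharp
// Sketch only: assumes Context is the MLContext from the snippets above.
// MinimumExampleCountPerLeaf = 1 lets splits happen on tiny training sets;
// the small NumberOfLeaves keeps the tree trivial for 3 records.
using Microsoft.ML;
using Microsoft.ML.Trainers.LightGbm;

var trainer = Context.MulticlassClassification.Trainers.LightGbm(
  new LightGbmMulticlassTrainer.Options
  {
    LabelColumnName = "Label",
    FeatureColumnName = "Features",
    MinimumExampleCountPerLeaf = 1,
    NumberOfLeaves = 2,
    NumberOfIterations = 100,
    LearningRate = 0.2
  });
```

Whether this actually separates 130 from 135 would need to be verified against the data set above.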

@artemiusgreat

Update

Tried providing various options to the LGBM trainer.
https://docs.microsoft.com/en-us/dotnet/api/microsoft.ml.trainers.lightgbm.lightgbmmulticlasstrainer.options?view=ml-dotnet

  • Seed = from 1 to 10
  • LearningRate, Sigmoid = 0.1...0.9
  • UnbalancedSets, UseSoftmax, UseCategoricalSplit = true
  • Number of leaves, iterations, splits and other counts = 1...10...100

Results

No changes. LGBM (and possibly all tree-based algorithms) gives incorrect predictions on small data sets.

@artemiusgreat

Tried the XGBoost implementation from this library.
https://github.com/mdabros/SharpLearning/wiki/Using-SharpLearning.XGBoost
Same result: an incorrect prediction.
Apparently, tree-based algorithms just can't work with small datasets.

@harishsk harishsk added P1 Priority of the issue for triage purpose: Needs to be fixed soon. and removed P0 Priority of the issue for triage purpose: IMPORTANT, needs to be fixed right away. labels Apr 21, 2020
@harishsk harishsk added loadsave Bugs related loading and saving data or models classification Bugs related classification tasks lightgbm Bugs related lightgbm labels Apr 29, 2020
@wangyems wangyems self-assigned this Jul 1, 2020

wangyems commented Jul 8, 2020

Hi @artemiusgreat,

Sorry for the late response. Regarding your first comment: you found that the output schema has more than 10 columns and the prediction result is always the same. I'm not 100% sure, but it's likely that when you call CreatePredictionEngine(), all the columns defined in your MyInputModel are considered, which ought to be the correct behavior of ML.NET. As a workaround, you can manually define an input model containing only the columns you actually use. Meanwhile, can you please provide the full pipeline so that I can investigate further and make sure I give you the right answer?
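
The workaround described above could look something like this (a sketch; the type name TrimmedInputModel is hypothetical, and it assumes the pipeline only consumes Strategy, Pitch, and Energy):

```csharp
// Hypothetical trimmed input model: declare only the columns the pipeline
// actually consumes, so the prediction engine's input schema matches what
// the model was trained on.
public class TrimmedInputModel
{
  [ColumnName("Strategy"), LoadColumn(0)]
  public string Strategy { get; set; }

  [ColumnName("Pitch"), LoadColumn(1)]
  public float Pitch { get; set; }

  [ColumnName("Energy"), LoadColumn(2)]
  public float Energy { get; set; }
}

// Used in place of MyInputModel when creating the engine:
// var engine = Context.Model.CreatePredictionEngine<TrimmedInputModel, MyOutputModel>(model);
```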

@wangyems wangyems added Awaiting User Input Awaiting author to supply further info (data, model, repro). Will close issue if no more info given. P3 Doc bugs, questions, minor issues, etc. and removed P1 Priority of the issue for triage purpose: Needs to be fixed soon. labels Jul 8, 2020

wangyems commented Jul 9, 2020

Regarding this comment, I think the network structure of the model you implemented does not have any obvious problem.
(screenshot of the model graph)
The features (Features.output in the screenshot) obtained from the original data form a 4-dimensional vector: aInput -> [0.99237, 0, 0, 0] and bInput -> [1, 0, 0, 0]. They are distinguishable. The reason the trained model cannot correctly distinguish them likely falls under multiclass-classification algorithm limitations. As in your example, the amount of training data is small, which indirectly limits the number of iterations the algorithm runs; the result is that the model does not actually learn anything. Also, some of your examples show certain classifiers predicting correctly; it's likely those predictions just happen to be "correct".


wangyems commented Jul 9, 2020

As for this comment: yes, I think it's likely that tree-based algorithms do not work well on small datasets, and that is probably the expected behavior. In general practice, it's recommended to train tree-based algorithms on larger datasets.

@wangyems wangyems added enhancement New feature or request P3 Doc bugs, questions, minor issues, etc. and removed P3 Doc bugs, questions, minor issues, etc. enhancement New feature or request labels Jul 9, 2020
@wangyems

Closing this issue for now. Please feel free to reopen it if you need more help!

@ghost ghost locked as resolved and limited conversation to collaborators Mar 19, 2022