AutoML Regression Experiment fails after 67 iterations #4906

Closed
francescomazzurco opened this issue Mar 2, 2020 · 16 comments · Fixed by #5163
Labels
AutoML.NET Automating various steps of the machine learning process bug Something isn't working P2 Priority of the issue for triage purpose: Needs to be fixed at some point.

Comments

@francescomazzurco

Hi,

When running a Regression Experiment, AutoML systematically fails after 67 iterations, raising the exception "All instances skipped due to missing features". From other issues, I got the idea that the SmacSweeper could be the cause. The stack trace suggests the same:

in Microsoft.ML.Trainers.FastTree.DataConverter.MemImpl.MakeBoundariesAndCheckLabels(Int64& missingInstances, Int64& totalInstances)
   in Microsoft.ML.Trainers.FastTree.DataConverter.MemImpl..ctor(RoleMappedData data, IHost host, Double[][] binUpperBounds, Single maxLabel, Boolean dummy, Boolean noFlocks, PredictionKind kind, Int32[] categoricalFeatureIndices, Boolean categoricalSplit)
   in Microsoft.ML.Trainers.FastTree.DataConverter.Create(RoleMappedData data, IHost host, Int32 maxBins, Single maxLabel, Boolean diskTranspose, Boolean noFlocks, Int32 minDocsPerLeaf, PredictionKind kind, IParallelTraining parallelTraining, Int32[] categoricalFeatureIndices, Boolean categoricalSplit)
   in Microsoft.ML.Trainers.FastTree.ExamplesToFastTreeBins.FindBinsAndReturnDataset(RoleMappedData data, PredictionKind kind, IParallelTraining parallelTraining, Int32[] categoricalFeaturIndices, Boolean categoricalSplit)
   in Microsoft.ML.Trainers.FastTree.FastTreeTrainerBase`3.ConvertData(RoleMappedData trainData)
   in Microsoft.ML.Trainers.FastTree.FastForestRegressionTrainer.TrainModelCore(TrainContext context)
   in Microsoft.ML.Trainers.TrainerEstimatorBase`2.TrainTransformer(IDataView trainSet, IDataView validationSet, IPredictor initPredictor)
   in Microsoft.ML.AutoML.SmacSweeper.FitModel(IEnumerable`1 previousRuns)
   in Microsoft.ML.AutoML.SmacSweeper.ProposeSweeps(Int32 maxSweeps, IEnumerable`1 previousRuns)
   in Microsoft.ML.AutoML.PipelineSuggester.SampleHyperparameters(MLContext context, SuggestedTrainer trainer, IEnumerable`1 history, Boolean isMaximizingMetric)
   in Microsoft.ML.AutoML.PipelineSuggester.GetNextInferredPipeline(MLContext context, IEnumerable`1 history, DatasetColumnInfo[] columns, TaskKind task, Boolean isMaximizingMetric, CacheBeforeTrainer cacheBeforeTrainer, IEnumerable`1 trainerWhitelist)
   in Microsoft.ML.AutoML.Experiment`2.Execute()
   in Microsoft.ML.AutoML.ExperimentBase`2.Execute(ColumnInformation columnInfo, DatasetColumnInfo[] columns, IEstimator`1 preFeaturizer, IProgress`1 progressHandler, IRunner`1 runner)
   in Microsoft.ML.AutoML.ExperimentBase`2.Execute(IDataView trainData, ColumnInformation columnInformation, IEstimator`1 preFeaturizer, IProgress`1 progressHandler)

However, unlike the other issues, I'm running a console application and loading data from a database with no missing values, and I believe I have the right NuGet dependencies:

  • Microsoft.ML.AutoML and Microsoft.ML.Recommender: 0.16.0
  • Microsoft.ML and all the other ML packages: 1.4.0

I understand that the problem might be caused by one of the third-party libraries ML.NET depends on, but isn't it at least possible to ignore the exception thrown by a single trainer without compromising the whole regression experiment? I would like to be able to access the BestRun object and choose the best of the first 67 runs without having to look back at the CacheDirectory.

If necessary, I can generate a csv with all the data used for training.

Thanks

@mstfbl mstfbl added Azure AutoML https://docs.microsoft.com/en-us/azure/machine-learning/service/concept-automated-ml bug Something isn't working P2 Priority of the issue for triage purpose: Needs to be fixed at some point. labels Mar 2, 2020
@mstfbl
Contributor

mstfbl commented Mar 2, 2020

Hi @francescomazzurco, please send along an example .csv with which we can reproduce this issue.

@francescomazzurco
Author

Hi @mstfbl, I'm now creating a small working example along with the .csv, but I'm having difficulty reproducing the issue. I'll dig into it and give you an update by the end of the day.

@francescomazzurco
Author

OK, I found the problem. I could reproduce the exception on only one of our computers, so I finally realised that the issue is related to culture, even when data is loaded from memory and no parsing is involved. In the project I attached, data is parsed and loaded using the invariant culture. Then a non-English culture is set just before running the experiment.

   var mlContext = new MLContext();
   List<Model> models = ReadCsv(@"data\data.csv");
   var dataView = BuildDataView(mlContext, models);
   var experimentSettings = new RegressionExperimentSettings
   {
       MaxExperimentTimeInSeconds = 600,
       CacheDirectory = new DirectoryInfo(@".\cache"),
   };
   var experiment = mlContext.Auto().CreateRegressionExperiment(experimentSettings);
   // Data has already been parsed using invariant culture 
   CultureInfo.DefaultThreadCurrentCulture = CultureInfo.CreateSpecificCulture("it-IT");
   var bestRun = experiment.Execute(dataView).BestRun;

The exception is thrown after the 67th iteration.
TestML.zip

I have now seen other issues related to culture. I'm not sure whether they report the same problem, but if they do, feel free to close this issue. Thanks
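To illustrate the general kind of bug that can make in-memory data culture-sensitive, here is a minimal sketch, assuming a number is formatted to text under one culture and parsed back under another. The thread does not show the sweeper code, so this is not the ML.NET implementation. The class name `CultureRoundTrip` and the method `FormatThenParse` are hypothetical names made up for this example.

```csharp
using System;
using System.Globalization;

// Hypothetical helper for illustration only; not part of ML.NET.
public static class CultureRoundTrip
{
    // Formats `value` as text with one format provider and parses that text
    // back with another, mimicking code that serializes a parameter value
    // under one culture and reads it under a different one.
    public static double FormatThenParse(
        double value, IFormatProvider formatProvider, IFormatProvider parseProvider)
    {
        // Under it-IT, 1.5 is written with a decimal comma: "1,5".
        string text = value.ToString(formatProvider);

        // These are the default styles used by double.Parse. With the
        // invariant culture, ',' is the group separator, so "1,5" is read
        // as fifteen rather than one and a half.
        return double.Parse(
            text, NumberStyles.Float | NumberStyles.AllowThousands, parseProvider);
    }
}
```

Calling `CultureRoundTrip.FormatThenParse(1.5, CultureInfo.GetCultureInfo("it-IT"), CultureInfo.InvariantCulture)` does not give back 1.5, whereas using the same culture for both steps does. Passing an explicit `CultureInfo.InvariantCulture` to every format and parse call is the usual way to rule this out.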

@justinormont justinormont added AutoML.NET Automating various steps of the machine learning process and removed Azure AutoML https://docs.microsoft.com/en-us/azure/machine-learning/service/concept-automated-ml labels Mar 3, 2020
@justinormont
Contributor

@francescomazzurco: This should be fixed in the next release (v1.5.0-preview2). A fix was added in January to use the invariant culture when sweeping parameter values (#4635).

You can test against the nightly NuGet feed by adding https://pkgs.dev.azure.com/dnceng/public/_packaging/MachineLearning/nuget/v3/index.json as a NuGet source in Visual Studio. Feed details: https://dev.azure.com/dnceng/public/_packaging?_a=connect&feed=MachineLearning.

@francescomazzurco
Author

francescomazzurco commented Mar 3, 2020

Hi @justinormont, thanks for your reply.
I tested against the nightly build. No exception is thrown anymore, but the regression experiment hangs forever and never completes the 68th training run. Nothing happens even after MaxExperimentTimeInSeconds has elapsed (I expected the experiment to abort at that point).
Interestingly, this behaviour only occurs when a non-English culture is set, so it seems that culture still affects the SmacSweeper.

I published the working example here: https://github.com/francescomazzurco/TestML

@justinormont
Contributor

@LittleLittleCloud: Do you have time to investigate?

@LittleLittleCloud
Contributor

I will take a look.

@DiegoStefanon

DiegoStefanon commented Apr 28, 2020

Hi, I am Diego S., from Italy.
I have the same issue.
CreateBinaryClassificationExperiment works, but CreateRegressionExperiment fails.
It works only if I set
CultureInfo.DefaultThreadCurrentCulture = CultureInfo.CreateSpecificCulture("en-EN");
The data is good and contains no nulls, so I think this is a bug.
I load the data from a database.
Package: ML.AutoML 0.16
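For anyone stuck on an affected version, the workaround above can be scoped so that the culture is changed only while the experiment runs. This is a minimal sketch, assuming that switching to the invariant culture has the same effect as the English culture reported above, which is not confirmed in this thread. `CultureScope` and `RunWithInvariantCulture` are hypothetical names, not ML.NET APIs.

```csharp
using System;
using System.Globalization;

// Hypothetical helper for illustration only; not part of ML.NET.
public static class CultureScope
{
    // Runs `action` with the invariant culture set both on the calling thread
    // and as the default for newly created threads, then restores the
    // previous values even if `action` throws.
    public static T RunWithInvariantCulture<T>(Func<T> action)
    {
        CultureInfo previousCurrent = CultureInfo.CurrentCulture;
        // This is null when no default has been set; assigning null back
        // restores that state.
        CultureInfo previousDefault = CultureInfo.DefaultThreadCurrentCulture;
        try
        {
            CultureInfo.CurrentCulture = CultureInfo.InvariantCulture;
            CultureInfo.DefaultThreadCurrentCulture = CultureInfo.InvariantCulture;
            return action();
        }
        finally
        {
            CultureInfo.CurrentCulture = previousCurrent;
            CultureInfo.DefaultThreadCurrentCulture = previousDefault;
        }
    }
}
```

With the snippet from earlier in the thread, usage would look like `var bestRun = CultureScope.RunWithInvariantCulture(() => experiment.Execute(dataView).BestRun);`. Restoring the culture in `finally` keeps the rest of the application (UI, reports) on the user's own culture.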

@francescomazzurco
Author

Quick update: I just tested against v0.17.1 and the bug is still there. Same behavior: the 68th iteration hangs forever and never completes.

@justinormont
Contributor

@francescomazzurco: I believe this is fixed now. The fix will be available in the next release, or you can run against the nightly build, as outlined above.

@francescomazzurco
Author

@justinormont I just tested against v0.17.3-29420-1 from October 20th, but the bug is still there. I see there are newer builds, but I am not able to install them because NuGet cannot find the package MlNetMklDepsCode.

@justinormont
Contributor

@francescomazzurco: You'll need a nightly build or release from after 2020-10-30, as that is when the fix went in.

@harishsk: Any guess why the nightly won't install for @francescomazzurco?

@antoniovs1029
Member

@justinormont @francescomazzurco

As part of moving to Arcade, we published some NuGet packages with a bug that makes them require the MlNetMklDepsCode package. We're working on fixing it; those packages should be ignored for the time being.

There have also been problems publishing NuGet packages from master (which are the ones @francescomazzurco needs), so I believe no package has been published correctly from master since October 20th. I therefore don't think any public package includes the October 30 change Justin is referring to. The problem was on the Azure DevOps side and should be fixed now, so I'll run a manual build to publish packages from the master branch, and hopefully it will work. I'll update this thread once that's done. Thanks.

@antoniovs1029
Member

There are some problems with our NuGet publishing pipeline. I'm working on that now and will update this thread once the package is published.

@antoniovs1029
Member

The NuGet packages have just been published to the public feed.
@francescomazzurco, please try version 0.17.3-29530-4 from the feed; it should work now.
Thanks.

@francescomazzurco
Author

I was able to install the most recent build from today (0.17.3-29602-5), which does indeed fix the bug. Feel free to close the issue. Thanks for the support.

@ghost ghost locked as resolved and limited conversation to collaborators Mar 19, 2022

6 participants