
Enabling Ranking Cross Validation #5263

Merged: 12 commits into dotnet:master, Jul 10, 2020
Conversation

@Lynx1820 (Contributor) commented Jun 26, 2020

  • Enabled Ranking Cross Validation
  • Fixed some syntax errors in the ranking CodeGen generated code

Issue: Cross Validation is needed in order to integrate the AutoML Ranking Experiment with ModelBuilder.
Resolves: #2685

User Scenarios:

  • If the user doesn't provide a GroupIdColumnName, the default becomes "GroupId". This column is then used as the SamplingKeyColumn to split the data.

  • If the user provides both a GroupIdColumnName and a SamplingKeyColumnName, both must be the same or an exception is thrown.

Review: If the user only provides a SamplingKeyColumnName, should we throw an error? Since we use the GroupId to split the CV data, the user should not be populating the SamplingKeyColumnName. Currently, an error is only thrown when the GroupIdColumnName and the SamplingKeyColumnName differ.
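
For illustration, a minimal sketch of running the ranking experiment with these defaults through the AutoML API; the file path, time budget, and column names are assumptions for the example, not part of this PR:

var context = new MLContext();
// Hypothetical dataset containing "Label" and "GroupId" columns.
var columnInference = context.Auto().InferColumns("ranking-sample.tsv", labelColumnName: "Label");
var loader = context.Data.CreateTextLoader(columnInference.TextLoaderOptions);
var trainData = loader.Load("ranking-sample.tsv");

// If no GroupIdColumnName is provided it defaults to "GroupId", which is then
// used as the SamplingKeyColumn when splitting the data for cross validation.
var experiment = context.Auto().CreateRankingExperiment(maxExperimentTimeInSeconds: 60);
var result = experiment.Execute(trainData, columnInference.ColumnInformation);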

@@ -156,6 +157,31 @@ public void AutoFitRankingTest()
Assert.True(col.Name == expectedOutputNames[col.Index]);
}

[LightGBMFact]
public void AutoFitRankingCVTest()
@Lynx1820 (Author) commented Jun 26, 2020

This is the way experiments are used within codegen.
Review: Should I add cross validation tests to all other experiments?

A contributor replied:

> Should I add cross validation tests to all other experiments?

If I recall correctly, if your dataset has fewer than 15,000 rows of data, AutoML will run cross validation automatically; if you have more than 15,000 rows, it will use a train-test split instead. So the rest of the tests in AutoFitTests should all be CV runs, considering that the dataset they use is really small. (@justinormont correct me if I'm wrong)

Tests starting with AutoFit should test the AutoML ranking experiment API, so you shouldn't have to create your pipeline from scratch in this test. If you just want to test the Ranking.CrossValidation command, consider renaming it more specifically.

@antoniovs1029 (Member) commented Jul 10, 2020

It's the other way around. If it has less than 15,000 rows it runs a train-test split automatically on one of the Execute overloads; if it has more it runs CV. This only happens on one overload, but I believe Keren isn't using that overload in her tests.


@Lynx1820 (Author) replied:

I have added CV testing for ranking only. I think it would be good to add testing for other tasks as well in the future.

@codecov bot commented Jun 26, 2020

Codecov Report

Merging #5263 into master will increase coverage by 0.01%.
The diff coverage is 89.07%.

@@            Coverage Diff             @@
##           master    #5263      +/-   ##
==========================================
+ Coverage   73.68%   73.70%   +0.01%     
==========================================
  Files        1022     1022              
  Lines      190320   190490     +170     
  Branches    20470    20484      +14     
==========================================
+ Hits       140238   140393     +155     
- Misses      44553    44560       +7     
- Partials     5529     5537       +8     
Flag Coverage Δ:
#Debug 73.70% <89.07%> (+0.01%) ⬆️
#production 69.43% <72.97%> (+<0.01%) ⬆️
#test 87.64% <100.00%> (+0.02%) ⬆️

Impacted Files Coverage Δ:
src/Microsoft.ML.AutoML/API/ColumnInference.cs 100.00% <ø> (ø)
...oML/Experiment/MetricsAgents/BinaryMetricsAgent.cs 74.35% <ø> (ø)
...toML/Experiment/MetricsAgents/MultiMetricsAgent.cs 69.69% <ø> (ø)
...Experiment/MetricsAgents/RegressionMetricsAgent.cs 66.66% <ø> (ø)
src/Microsoft.ML.AutoML/Utils/BestResultUtil.cs 53.84% <0.00%> (ø)
...soft.ML.Data/DataLoadSave/DataOperationsCatalog.cs 85.50% <ø> (ø)
...c/Microsoft.ML.Data/Evaluators/RankingEvaluator.cs 78.55% <ø> (ø)
src/Microsoft.ML.AutoML/API/RankingExperiment.cs 67.74% <33.33%> (+7.13%) ⬆️
...crosoft.ML.AutoML/Utils/UserInputValidationUtil.cs 90.50% <40.00%> (-1.46%) ⬇️
src/Microsoft.ML.AutoML/API/ExperimentBase.cs 76.02% <65.85%> (+9.36%) ⬆️
... and 56 more

Comment on lines 194 to 198:

  IProgress<CrossValidationRunDetail<TMetrics>> progressHandler = null)
  {
      UserInputValidationUtil.ValidateNumberOfCVFoldsArg(numberOfCVFolds);
-     var splitResult = SplitUtil.CrossValSplit(Context, trainData, numberOfCVFolds, columnInformation?.SamplingKeyColumnName);
+     UserInputValidationUtil.ValidateSamplingKey(columnInformation?.SamplingKeyColumnName, columnInformation?.GroupIdColumnName, _task);
+     var splitResult = SplitUtil.CrossValSplit(Context, trainData, numberOfCVFolds, columnInformation?.GroupIdColumnName);
@antoniovs1029 (Member) commented Jul 7, 2020

Potential Bug: Should we add the new ValidateSamplingKey() method also to the other Execute overloads that use columnInformation.SamplingKeyColumn to split data?

For example this overload:

public ExperimentResult<TMetrics> Execute(IDataView trainData, ColumnInformation columnInformation,
    IEstimator<ITransformer> preFeaturizer = null, IProgress<RunDetail<TMetrics>> progressHandler = null)
{
    // Cross val threshold for # of dataset rows --
    // If dataset has < threshold # of rows, use cross val.
    // Else, run experiment using train-validate split.
    const int crossValRowCountThreshold = 15000;
    var rowCount = DatasetDimensionsUtil.CountRows(trainData, crossValRowCountThreshold);
    if (rowCount < crossValRowCountThreshold)
    {
        const int numCrossValFolds = 10;
        var splitResult = SplitUtil.CrossValSplit(Context, trainData, numCrossValFolds, columnInformation?.SamplingKeyColumnName);
        return ExecuteCrossValSummary(splitResult.trainDatasets, columnInformation, splitResult.validationDatasets, preFeaturizer, progressHandler);
    }
    else
    {
        var splitResult = SplitUtil.TrainValidateSplit(Context, trainData, columnInformation?.SamplingKeyColumnName);
        return ExecuteTrainValidate(splitResult.trainData, columnInformation, splitResult.validationData, preFeaturizer, progressHandler);
    }
}

That overload uses CVSplit if the data is over 15,000 rows, using SamplingKeyColumnName. But notice that the overload also splits the data using TrainTestSplit if it is under that number of rows, and TrainTestSplit also uses a samplingKeyColumn; the samplingKeyColumn for the TT split should also be the GroupId column if it's done for ranking, for the same reasons we've already discussed offline. I.e., in both cases we need samplingKeyColumnName to be the same as GroupIdColumnName, thus we need to call ValidateSamplingKey() there too.
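
To make the request concrete, a minimal sketch of how that overload could call ValidateSamplingKey() before either split, reusing the helper signatures shown in the diff above. The samplingKey fallback variable is an illustrative assumption, not the merged code:

public ExperimentResult<TMetrics> Execute(IDataView trainData, ColumnInformation columnInformation,
    IEstimator<ITransformer> preFeaturizer = null, IProgress<RunDetail<TMetrics>> progressHandler = null)
{
    // Sketch: validate once up front so both branches split on a consistent key.
    UserInputValidationUtil.ValidateSamplingKey(columnInformation?.SamplingKeyColumnName,
        columnInformation?.GroupIdColumnName, _task);
    // Assumed fallback: for ranking, GroupIdColumnName doubles as the sampling key.
    var samplingKey = columnInformation?.SamplingKeyColumnName ?? columnInformation?.GroupIdColumnName;

    const int crossValRowCountThreshold = 15000;
    var rowCount = DatasetDimensionsUtil.CountRows(trainData, crossValRowCountThreshold);
    if (rowCount < crossValRowCountThreshold)
    {
        const int numCrossValFolds = 10;
        var splitResult = SplitUtil.CrossValSplit(Context, trainData, numCrossValFolds, samplingKey);
        return ExecuteCrossValSummary(splitResult.trainDatasets, columnInformation, splitResult.validationDatasets, preFeaturizer, progressHandler);
    }
    else
    {
        var splitResult = SplitUtil.TrainValidateSplit(Context, trainData, samplingKey);
        return ExecuteTrainValidate(splitResult.trainData, columnInformation, splitResult.validationData, preFeaturizer, progressHandler);
    }
}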

A member replied:

(unresolving) I see you've updated other Execute overloads addressing this comment, but I think you missed one overload? The one at:
https://github.com/antoniovs1029/machinelearning/blob/bb13d629000c218136e741b643767cf45ae12fc4/src/Microsoft.ML.AutoML/API/ExperimentBase.cs#L221-L232


A member replied:

BTW, just curious, do you happen to know which Execute method ModelBuilder calls? /cc: @LittleLittleCloud



Comment on lines 73 to 77:

columnInformation = new ColumnInformation()
{
    LabelColumnName = labelColumnName,
    GroupIdColumnName = samplingKeyColumn ?? DefaultColumnNames.GroupId
};
A member commented:

Suggestion: As per the feedback we got from @justinormont today, I think it would be better to set both GroupIdColumnName and SamplingKeyColumnName here. Something like:

columnInformation = new ColumnInformation()
{
    LabelColumnName = labelColumnName,
    SamplingKeyColumnName = samplingKeyColumn ?? DefaultColumnNames.GroupId,
    GroupIdColumnName = samplingKeyColumn ?? DefaultColumnNames.GroupId // For ranking, we want to enforce having the same column as samplingKeyColumn and GroupIdColumn
};

With your current implementation it won't make any difference to do this, but I do think this might be clearer for future AutoML.NET developers.

A similar change would need to take place in the other overload that receives a samplingKeyColumnName but no columnInformation.

@Lynx1820 (Author) replied:

I would lean towards deferring this to the next update. I will take a quick look, but mapping one column to two column-info roles seems to be causing issues.

A member replied:

It's just a two-line change (adding the line here and in the other overload), and it's just to make it clear in the columnInformation object that we'll be using the samplingKeyColumn provided by the user both as SamplingKeyColumnName and GroupIdColumnName (which is actually what we're doing). So I think it's clearer this way. But whatever you decide is fine 😉



@Lynx1820 (Author) replied:

Just to be clear, mapping a groupId column to both SamplingKeyColumnName and GroupIdColumnName doesn't work with the current implementation. The current implementation uses GroupIdColumnName as the SamplingKeyColumnName, so if the user provides a SamplingKeyColumnName, we throw an error (unless they are both the same).
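
For readers following along, a minimal sketch of the validation being described; the method shape follows the call site in the diff above, while the exception type and message text are assumptions:

// Sketch of the described check: for ranking, a user-supplied SamplingKeyColumnName
// must match the GroupIdColumnName, since GroupId is what actually drives the split.
private static void ValidateSamplingKey(string samplingKeyColumnName, string groupIdColumnName, TaskKind task)
{
    if (task != TaskKind.Ranking || samplingKeyColumnName == null)
        return;
    // If no GroupIdColumnName was given, the default "GroupId" is assumed.
    var groupId = groupIdColumnName ?? DefaultColumnNames.GroupId;
    if (samplingKeyColumnName != groupId)
        throw new ArgumentException(
            "For ranking, SamplingKeyColumnName must be the same as GroupIdColumnName, " +
            "because the GroupId column is used to split the data for cross validation.");
}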

A member replied:

Yeah, I know the current implementation throws if they're not the same. That's why I suggested using the samplingKeyColumn to set both SamplingKeyColumnName and GroupIdColumnName.

In general, if the user provides a ColumnInformation object containing SamplingKeyColumnName and GroupIdColumnName, then we should accept it if both are the same (and in the current implementation this is doable). So I'm just not sure what the problem is here.




@Lynx1820 merged commit f87a3bb into dotnet:master on Jul 10, 2020
@ghost locked as resolved and limited conversation to collaborators on Mar 18, 2022