Enabling Ranking Cross Validation #5263
Conversation
@@ -156,6 +157,31 @@ public void AutoFitRankingTest()
         Assert.True(col.Name == expectedOutputNames[col.Index]);
     }

+    [LightGBMFact]
+    public void AutoFitRankingCVTest()
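The diff above shows only the new test's signature. As a point of reference, here is a minimal sketch of what such a cross-validated ranking test could look like, assuming the Execute(trainData, numberOfCVFolds, columnInformation) overload added in this PR; RankingData, the sample file name, and the assertions are illustrative placeholders, not the PR's actual test body:

    [LightGBMFact]
    public void AutoFitRankingCVTest()
    {
        var mlContext = new MLContext(seed: 0);
        // Placeholder dataset with "Label", "GroupId", and feature columns.
        var trainData = mlContext.Data.LoadFromTextFile<RankingData>("ranking-sample.tsv", hasHeader: true);

        // Run a cross-validated AutoML ranking experiment over the full dataset.
        var experimentResult = mlContext.Auto()
            .CreateRankingExperiment(new RankingExperimentSettings { MaxExperimentTimeInSeconds = 5 })
            .Execute(trainData, numberOfCVFolds: 5,
                new ColumnInformation { LabelColumnName = "Label", GroupIdColumnName = "GroupId" });

        // Expect a best run plus per-fold run details with ranking metrics.
        Assert.NotNull(experimentResult.BestRun);
        Assert.True(experimentResult.RunDetails.Any());
    }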
This is the way experiments are used within codegen.
Review: Should I add cross validation tests to all other experiments?
> Should I add cross validation tests to all other experiments?
If I recall correctly, if your dataset has fewer than 15,000 rows of data, AutoML will run cross validation automatically; if you have more than 15,000 rows, it will use a train-test split instead. So the rest of the tests in AutoFitTests should all be CV runs, considering that the dataset they use is really small. (@justinormont, correct me if I'm wrong.)
Tests starting with AutoFit should test the AutoML ranking experiment API, so you shouldn't have to create your pipeline from scratch in this test. If you just want to test the Ranking.CrossValidation command, consider renaming it to something more specific.
It's the other way around: if it has fewer than 15,000 rows it runs a train-test split automatically in one of the Execute overloads; if it has more, it runs CV. This only happens in one overload, but I believe Keren isn't using that overload in her tests.
I have added CV testing for ranking only. I think it would be good to add testing for the other tasks as well in the future.
Codecov Report
    @@            Coverage Diff             @@
    ##           master    #5263      +/-   ##
    ==========================================
    + Coverage   73.68%   73.70%   +0.01%
    ==========================================
      Files        1022     1022
      Lines      190320   190490     +170
      Branches    20470    20484      +14
    ==========================================
    + Hits       140238   140393     +155
    - Misses      44553    44560       +7
    - Partials     5529     5537       +8
     IProgress<CrossValidationRunDetail<TMetrics>> progressHandler = null)
 {
     UserInputValidationUtil.ValidateNumberOfCVFoldsArg(numberOfCVFolds);
-    var splitResult = SplitUtil.CrossValSplit(Context, trainData, numberOfCVFolds, columnInformation?.SamplingKeyColumnName);
+    UserInputValidationUtil.ValidateSamplingKey(columnInformation?.SamplingKeyColumnName, columnInformation?.GroupIdColumnName, _task);
+    var splitResult = SplitUtil.CrossValSplit(Context, trainData, numberOfCVFolds, columnInformation?.GroupIdColumnName);
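For readers following the thread, here is a hedged sketch of the rule this new ValidateSamplingKey call enforces, per the discussion below. The body is an illustration, not the PR's actual implementation; TaskKind is AutoML's internal task enum, and the exception type is an assumption:

    internal static void ValidateSamplingKey(string samplingKeyColumnName, string groupIdColumnName, TaskKind task)
    {
        // Only ranking constrains the sampling key, and only if the user set one.
        if (task != TaskKind.Ranking || samplingKeyColumnName == null)
            return;

        // Splits must keep each group's rows together, so the sampling key has to
        // be the group id column (or be left unset so the group id is used instead).
        if (samplingKeyColumnName != groupIdColumnName)
            throw new ArgumentException(
                $"For ranking, SamplingKeyColumnName ('{samplingKeyColumnName}') must match GroupIdColumnName ('{groupIdColumnName}').");
    }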
Potential Bug: Should we add the new ValidateSamplingKey() method also to the other Execute overloads that use columnInformation.SamplingKeyColumn to split data? For example, this overload (src/Microsoft.ML.AutoML/API/ExperimentBase.cs, lines 96 to 116 at commit e8fa731):
    public ExperimentResult<TMetrics> Execute(IDataView trainData, ColumnInformation columnInformation,
        IEstimator<ITransformer> preFeaturizer = null, IProgress<RunDetail<TMetrics>> progressHandler = null)
    {
        // Cross val threshold for # of dataset rows --
        // If dataset has < threshold # of rows, use cross val.
        // Else, run experiment using train-validate split.
        const int crossValRowCountThreshold = 15000;
        var rowCount = DatasetDimensionsUtil.CountRows(trainData, crossValRowCountThreshold);
        if (rowCount < crossValRowCountThreshold)
        {
            const int numCrossValFolds = 10;
            var splitResult = SplitUtil.CrossValSplit(Context, trainData, numCrossValFolds, columnInformation?.SamplingKeyColumnName);
            return ExecuteCrossValSummary(splitResult.trainDatasets, columnInformation, splitResult.validationDatasets, preFeaturizer, progressHandler);
        }
        else
        {
            var splitResult = SplitUtil.TrainValidateSplit(Context, trainData, columnInformation?.SamplingKeyColumnName);
            return ExecuteTrainValidate(splitResult.trainData, columnInformation, splitResult.validationData, preFeaturizer, progressHandler);
        }
    }
That overload uses CrossValSplit, with SamplingKeyColumnName, if the data is under 15,000 rows. But notice that the overload also splits the data using TrainValidateSplit when it is over that number of rows, and that split also uses samplingKeyColumn; the samplingKeyColumn for the train-validate split should also be the GroupId column when it's done for ranking, for the same reasons we've already discussed offline. I.e., in both cases we need samplingKeyColumnName to be the same as GroupIdColumnName, and thus we need to call ValidateSamplingKey() there too.
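To make the grouping concern concrete, here is a small example against the public MLContext API (the data variable and column names are placeholders); this is the same guarantee the AutoML splits need for ranking:

    var mlContext = new MLContext(seed: 0);

    // With samplingKeyColumnName set, rows sharing a "GroupId" value are guaranteed
    // to land entirely on one side of the split, so a query's documents are never
    // divided between the train and test partitions.
    var split = mlContext.Data.TrainTestSplit(data, testFraction: 0.2, samplingKeyColumnName: "GroupId");
    IDataView trainSet = split.TrainSet;
    IDataView testSet = split.TestSet;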
(Unresolving) I see you've updated other Execute overloads addressing this comment, but I think you missed one overload, the one at:
https://github.com/antoniovs1029/machinelearning/blob/bb13d629000c218136e741b643767cf45ae12fc4/src/Microsoft.ML.AutoML/API/ExperimentBase.cs#L221-L232
BTW, just curious: do you happen to know which Execute method ModelBuilder calls? /cc: @LittleLittleCloud
    columnInformation = new ColumnInformation()
    {
        LabelColumnName = labelColumnName,
        GroupIdColumnName = samplingKeyColumn ?? DefaultColumnNames.GroupId
    };
Suggestion: As per the feedback we got from @justinormont today, I think it would be better to set both GroupIdColumnName and SamplingKeyColumnName in here. Something like:

    columnInformation = new ColumnInformation()
    {
        LabelColumnName = labelColumnName,
        SamplingKeyColumnName = samplingKeyColumn ?? DefaultColumnNames.GroupId,
        // For ranking, we want to enforce having the same column as SamplingKeyColumn and GroupIdColumn
        GroupIdColumnName = samplingKeyColumn ?? DefaultColumnNames.GroupId
    };
With your current implementation it won't make any difference to do this, but I do think this might be clearer for future AutoML.NET developers. A similar change would need to take place in the other overload that receives a samplingKeyColumnName but no columnInformation.
I would lean towards deferring this to the next update. I will take a quick look, but mapping one column to two column-info roles seems to be causing issues.
It's just a two-line change (adding the line in here, and in the other overload), and it's just to make it clear in the columnInformation object that we'll be using the samplingKeyColumn provided by the user both as SamplingKeyColumnName and GroupIdColumnName (which is actually what we're doing). So I think it's clearer this way. But whatever you decide is fine 😉
Just to be clear, mapping a groupId column to both SamplingKeyColumnName and GroupIdColumnName doesn't work with the current implementation. The current implementation uses GroupIdColumnName as the SamplingKeyColumnName, so if the user provides a SamplingKeyColumnName, we throw an error (unless they are both the same).
Yeah, I know the current implementation throws if they're not the same. That's why I suggested using the samplingKeyColumn to set both SamplingKeyColumnName and GroupIdColumnName.
In general, if the user provides a ColumnInformation object containing SamplingKeyColumnName and GroupIdColumnName, then we should accept it if both are the same (and in the current implementation this is doable). So I'm just not sure what the problem is here.
Issue: Cross Validation is needed in order to integrate the AutoML Ranking Experiment with ModelBuilder.
Resolves: #2685
User Scenarios:

- If the user doesn't provide a GroupIdColumnName, the default becomes "GroupId". This is used as the SamplingKeyColumn to split the data.
- If the user provides both a GroupIdColumnName and a SamplingKeyColumnName, both must be the same or an exception is thrown.

Review: If the user only provides a SamplingKeyColumnName, should we throw an error? Since we use the groupId to split the CV data, the user should not be populating the SamplingKeyColumnName. As of right now, an error is only thrown when the GroupIdColumnName and the SamplingKeyColumnName differ.
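To illustrate the scenarios above, a hedged usage sketch against the ranking experiment API as shaped by this PR (the column names, settings, and trainData variable are placeholders):

    var mlContext = new MLContext(seed: 0);
    var experiment = mlContext.Auto()
        .CreateRankingExperiment(new RankingExperimentSettings { MaxExperimentTimeInSeconds = 10 });

    // Scenario 1: no GroupIdColumnName provided -- "GroupId" is assumed and is
    // also used as the sampling key when building the CV folds.
    var result1 = experiment.Execute(trainData, numberOfCVFolds: 5,
        new ColumnInformation { LabelColumnName = "Label" });

    // Scenario 2: both columns provided -- they must be the same column,
    // otherwise an exception is thrown.
    var result2 = experiment.Execute(trainData, numberOfCVFolds: 5,
        new ColumnInformation
        {
            LabelColumnName = "Label",
            GroupIdColumnName = "GroupId",
            SamplingKeyColumnName = "GroupId" // must equal GroupIdColumnName
        });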