Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Combined methods related to splitting data into one single method. Also fixed related issues. #5227

Merged
merged 14 commits into from
Jun 30, 2020

Conversation

antoniovs1029
Copy link
Member

@antoniovs1029 antoniovs1029 commented Jun 10, 2020

This PR combines CreateStratificationColumn and EnsureGroupPreservationColumn, and some code on GetSplitColumn, into a new method called CreateSplitColumn. This method receives a samplingKeyColumnName parameter and always creates a new column named SplitColumn based on the samplingKeyColumn... the new column is elsewhere in the code used for splitting datasets for Train-Test splits or Cross-Validation Splits, and it's always safe to drop the new column as well. Note: the samplingKeyColumn that CreateSplitColumn receives as parameter, or the column it creates, are called samplingKeyColumn, stratificationColumn, groupPreservationColumn, or splitColumn (or maybe even rowGroupColumn ?) elsewhere in the code or docs, depending on where on the code you're / which API you're calling (ML.NET's TrainTestSplit, CrossValidationSplit, Evaluate or CrossValidate... AutoML, MAML, or NimbusML/Entrypoints). The inconsistency of the naming for this column is a long-standing problem, but since they're names used both in public APIs and for historical reasons, I guess there's not much we can do now (see #2536, and related issues and PR's).

This PR also solves multiple issues related with the "samplingKeyColumn" when splitting data:

Issue: When samplingKeyColumn is KeyType, dropping the original samplingKeyColumn when calling from entrypoints, makes it impossible to reuse that column
Solution: Use a copy column transformer to copy the original column, so that it's safe to drop the new column elsewhere without loosing the original column
Fixes #5221 : Issue with NimbusML test

Issue : a "samplingKeyColumn" is created, but not dropped, when users split data with ML.NET's API before training a model with AutoML, causing some problems when reusing the AutoML model
Solution: Always drop the new column when calling from ML.NET's APIs. Note: to distinguish between the "samplingKeyColumn" provided by the user, and the one we internally create, the new column is now called "SplitColumn".
Fixes #5256
Fixes #4048
Fixes dotnet/machinelearning-modelbuilder#694

Issue: If the "samplingKeyColumn" is float/double type, and includes negative values, or is normalized and outside the (0,1) range, then it might not work as expected.
Solution: Use ensureZeroUntouched = false in the normalizing estimator, and always normalize if it's float/double
Fixes #5227 (comment)

@antoniovs1029 antoniovs1029 requested a review from a team as a code owner June 10, 2020 23:57
@codecov
Copy link

codecov bot commented Jun 11, 2020

Codecov Report

Merging #5227 into master will increase coverage by 0.10%.
The diff coverage is 99.73%.

@@            Coverage Diff             @@
##           master    #5227      +/-   ##
==========================================
+ Coverage   73.49%   73.59%   +0.10%     
==========================================
  Files        1014     1014              
  Lines      188680   188963     +283     
  Branches    20330    20326       -4     
==========================================
+ Hits       138677   139074     +397     
+ Misses      44493    44396      -97     
+ Partials     5510     5493      -17     
Flag Coverage Δ
#Debug 73.59% <99.73%> (+0.10%) ⬆️
#production 69.37% <98.27%> (+0.07%) ⬆️
#test 87.52% <100.00%> (+0.08%) ⬆️
Impacted Files Coverage Δ
...soft.ML.Data/DataLoadSave/DataOperationsCatalog.cs 85.50% <97.72%> (+3.68%) ⬆️
...crosoft.ML.Data/Commands/CrossValidationCommand.cs 84.04% <100.00%> (+0.52%) ⬆️
src/Microsoft.ML.Data/TrainCatalog.cs 83.94% <100.00%> (ø)
src/Microsoft.ML.EntryPoints/CVSplit.cs 100.00% <100.00%> (ø)
src/Microsoft.ML.EntryPoints/TrainTestSplit.cs 94.44% <100.00%> (+17.52%) ⬆️
test/Microsoft.ML.AutoML.Tests/AutoFitTests.cs 92.27% <100.00%> (+1.96%) ⬆️
...crosoft.ML.Core.Tests/UnitTests/TestEntryPoints.cs 98.43% <100.00%> (+0.05%) ⬆️
test/Microsoft.ML.Tests/Scenarios/Api/TestApi.cs 94.16% <100.00%> (+1.31%) ⬆️
...c/Microsoft.ML.FastTree/Utils/ThreadTaskManager.cs 79.48% <0.00%> (-20.52%) ⬇️
src/Microsoft.ML.FastTree/RegressionTree.cs 75.51% <0.00%> (-8.17%) ⬇️
... and 21 more

new NormalizingEstimator.MinMaxColumnOptions(stratCol, stratificationColumn, ensureZeroUntouched: true))
.Fit(data).Transform(data);
{
data = new ColumnCopyingEstimator(host,(stratCol,stratificationColumn)).Fit(data).Transform(data);
Copy link

@yaeldekel yaeldekel Jun 11, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ColumnCopyingEstimator [](start = 35, length = 22)

Should this be added to EnsureGroupPreservationColumn as well?

And more generally, would it make sense to unify the two methods? There is also a method called GetSplitColumn in line 285 in CrossValidationCommand.cs that does more or less the same thing. I think the same utility method should be called for maml, entry points and the C# API. #Resolved

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I took a quick look into the EnsureGroupPreservationColumn callers, and it doesn't seem that they drop the created column, so this wouldn't be an issue there. I'll look into it more later to confirm this.

About merging these methods, I think it would be a good idea to maintain consistency across the codebase, although it wouldn't be necessary to fix the issue opened by @ganik to unblock NimbusML update. I'll look into this idea.


In reply to: 438552643 [](ancestors = 438552643)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I confirmed that when EnsureGroupPreservationColumn was called, the samplingKeyColumn was never dropped, so the issue didn't apply on that case.

Also, I've merged the methods you've mentioned, so now only CreateGroupPreservationColumn exists


In reply to: 438969558 [](ancestors = 438969558,438552643)

Copy link
Member Author

@antoniovs1029 antoniovs1029 Jun 27, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Update: I've renamed CreateGroupPreservationColumn to CreateSplitColumn and made similar renames elsewhere to make the code easier to read. Now "splitColumn" is the name of the temporary column we create for splitting, regardless if it's based-off a "samplingKeyColumn" (ML.NET) or a "stratificationColumn" (NimbusML, Maml, legacy TLC naming conventions...)

Also now I do drop the new "splitColumn" when it's created while splitting through ML.NET's API, because not dropping it was causing other issues... see comment:
#5227 (comment)
Added 2 tests to cheked this is dropped, and updated my PR description to explain all the issues it's now fixing.


In reply to: 441992076 [](ancestors = 441992076,438969558,438552643)

@yaeldekel
Copy link

yaeldekel commented Jun 11, 2020

Can you add a unit test that has a similar case to the NimbusML test that exposed this bug? #Resolved

@antoniovs1029 antoniovs1029 changed the title Fixed issue when running CrossValidation from an entrypoint Combined methods related to splitting data into one single method. Also fixed related issues. Jun 18, 2020
var split = mlContext.Data.TrainTestSplit(input, 0.5, nameof(Input.TextStrat));
var ids = split.TestSet.GetColumn<int>(split.TestSet.Schema[nameof(Input.Id)]);
Assert.Contains(1, ids);
Assert.Contains(5, ids);
split = mlContext.Data.TrainTestSplit(input, 0.5, nameof(Input.FloatStrat));
ids = split.TrainSet.GetColumn<int>(split.TrainSet.Schema[nameof(Input.Id)]);
ids = split.TestSet.GetColumn<int>(split.TestSet.Schema[nameof(Input.Id)]);
Copy link

@yaeldekel yaeldekel Jun 22, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TestSet [](start = 24, length = 7)

why? #Resolved

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because of the change to ensureZeroUntouched = false


In reply to: 443437139 [](ancestors = 443437139)

** Fix issue with AutoML
** Enforce having the same schema before and after splitting, avoiding future issues
*  Added tests for these
*  Enforce that samplingKeyColumnName shouldn't be null by the time we split stuff, since anyway it will throw in the RangeFilter if its null. Also change its name to tempSamplingKeyColumnName since it's going to be dropped.
@antoniovs1029 antoniovs1029 requested a review from a team as a code owner June 27, 2020 04:49
Comment on lines +206 to +210
public void AutoFitWithPresplittedData()
{
// Models created in AutoML should work over the same data,
// no matter how that data is splitted before passing it to the experiment execution
// or to the model for prediction
Copy link
Member Author

@antoniovs1029 antoniovs1029 Jun 27, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test abstracts the issue in #5256 (which is also the issue in #4048 and dotnet/machinelearning-modelbuilder#694 )

This test fails with a
System.ArgumentOutOfRangeException : Could not find input column 'SamplingKeyColumn' without the changes on this PR.

The particular change that fixed this issue is dropping the (temporary version of) SamplingKeyColumn (now called SplitColumn), on DataOperationsCatalog.CrossValidationSplit and DataOperationsCatalog.TrainTestSplit in ML.NET (no change was needed in AutoML, but I added this AutoML test since this issue was found repeatedly by AutoML users).

…r renames for consistency. Internally we'll call the new column SplitColumn, regardless if it's based off a "samplingKeyColumn" or a "stratificationColumn"
Copy link
Contributor

@LittleLittleCloud LittleLittleCloud left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approve on behalf of AutoML test change

{
env.CheckValue(splitColumn, nameof(splitColumn));

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

splitColumn [](start = 27, length = 11)

Should we check/assert that this column exists in the data?

Copy link
Member Author

@antoniovs1029 antoniovs1029 Jun 30, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did it because if the column doesn't exist (or if splitColumn was null) it is going to throw an exception anyway inside the RangeFilter. Since it's necessary that the column exists for the RangeFilter to work, I thought it was sensible to check for this here and don't let splitColumn to be null.


In reply to: 447414598 [](ancestors = 447414598)

Min = (double)fold / numberOfFolds,
Max = (double)(fold + 1) / numberOfFolds,
Complement = false,
IncludeMin = true,
IncludeMax = true
}, data);

yield return new TrainTestData(trainFilter, testFilter);
var trainDV = ColumnSelectingTransformer.CreateDrop(env, trainFilter, splitColumn);

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CreateDrop [](start = 57, length = 10)

Is this a new behavior? Or is it needed now because of the other changes?
If it is a new behavior, why is it needed?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dropping the SplitColumn is something new to fix the situation I mentioned here:
#5227 (comment)

Not dropping it was causing an issue with AutoML. And also it didn't make sense that we didn't drop it, because if we didn't do it then the schema of a DataView was changed by splitting it, which I think should be considered unexpected behavior.


In reply to: 447415004 [](ancestors = 447415004)

Copy link

@yaeldekel yaeldekel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:shipit:

@antoniovs1029 antoniovs1029 merged commit 709ec02 into dotnet:master Jun 30, 2020
@ghost ghost locked as resolved and limited conversation to collaborators Mar 18, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
3 participants