Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ignore hidden columns in AutoML schema checks of validation data #4490

Merged
merged 2 commits into from
Nov 21, 2019

Conversation

daholste
Copy link
Contributor

@daholste daholste commented Nov 20, 2019

Closes #4491

When the AutoML API consumes data, it validates schema consistency between the train and validation data.

There are two bugs in this logic:

  1. The API asserts that the count of columns in the train and validation data must be equal. This throws an exception if the two data views have the same number of active columns but a different number of hidden columns. This PR updates to assert that the # of active (not hidden) columns in the train and validation data are equal.

  2. If either the train or validation data has a hidden column with a type that differs from an active column of the same name, an exception is thrown. This PR restricts type consistency checks to active columns only.

@daholste daholste added the AutoML.NET Automating various steps of the machine learning process label Nov 20, 2019
@daholste daholste requested a review from a team as a code owner November 20, 2019 23:07
Copy link
Contributor

@justinormont justinormont left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

@@ -183,14 +183,19 @@ private static void ValidateValidationData(IDataView trainData, IDataView valida

const string schemaMismatchError = "Training data and validation data schemas do not match.";

if (trainData.Schema.Count != validationData.Schema.Count)
if (trainData.Schema.Count(c => !c.IsHidden) != validationData.Schema.Count(c => !c.IsHidden))
{
throw new ArgumentException($"{schemaMismatchError} Train data has '{trainData.Schema.Count}' columns," +
$"and validation data has '{validationData.Schema.Count}' columns.", nameof(validationData));
}

foreach (var trainCol in trainData.Schema)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could use a comment...

Suggested change
foreach (var trainCol in trainData.Schema)
// Also indirectly checks for new columns in the validation datasets as we above enforce the column counts are equal
foreach (var trainCol in trainData.Schema)

I was otherwise going to suggest we check the reverse direction. A comment helps future readers to not have to think through it.

@codecov
Copy link

codecov bot commented Nov 21, 2019

Codecov Report

❗ No coverage uploaded for pull request base (master@b7db4fa). Click here to learn what that means.
The diff coverage is 100%.

@@            Coverage Diff            @@
##             master    #4490   +/-   ##
=========================================
  Coverage          ?   74.85%           
=========================================
  Files             ?      908           
  Lines             ?   159880           
  Branches          ?    17215           
=========================================
  Hits              ?   119678           
  Misses            ?    35384           
  Partials          ?     4818
Flag Coverage Δ
#Debug 74.85% <100%> (?)
#production 70.2% <100%> (?)
#test 90.19% <100%> (?)
Impacted Files Coverage Δ
...rosoft.ML.AutoML.Tests/UserInputValidationTests.cs 100% <100%> (ø)
...crosoft.ML.AutoML/Utils/UserInputValidationUtil.cs 91.81% <100%> (ø)

@daholste daholste merged commit 71e97ea into dotnet:master Nov 21, 2019
@ghost ghost locked as resolved and limited conversation to collaborators Mar 20, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
AutoML.NET Automating various steps of the machine learning process
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Ignore hidden columns in AutoML schema checks of validation data
2 participants