Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

AutoML.Net filter infinity value when calculate average score #5345

Merged
merged 2 commits into from
Aug 19, 2020

Conversation

LittleLittleCloud
Copy link
Contributor

@LittleLittleCloud LittleLittleCloud commented Aug 12, 2020

Fix issue:

This pr fixes #5339 by filtering out the infinity value when calculating average scores from the results of cross validation runner, so that the average score won't be infinity value in any situation, so that the return value of GetIndexCloestToAverage will not be -1.

In rare case, when all the scores from cross-validation run are infinitive, the average score will be designated to nan. and the return value of GetIndexCloestToAverage will be 0

Noted: the +/- infinite value is filtered before calculating average score because of the same reason nan value is filtered. By doing that the evaluation for cross-validation runner might be better than the real situation.

@LittleLittleCloud LittleLittleCloud requested a review from a team as a code owner August 12, 2020 23:05
if (newResults.Count() == 0)
return double.NaN;

// Return average of non-NaN scores otherwise
Copy link
Contributor

@justinormont justinormont Aug 13, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What are the cases being fixed by removing the +/-Infinity values? I'm not sure they should be filtered, as +/-Infinity can be a correct, and wanted, metric value. For instance, for log-loss reduction.

Instead of filtering you could add checks to GetIndexClosestToAverage() to allow it to handle the case of the average being +/-Infinity.

Perhaps coded as:

 private static int GetIndexClosestToAverage(IEnumerable<double> values, double average) 
 { 
     // Average will be NaN iff all values are NaN. 
     // Return the first index in this case. 
     if (double.IsNaN(average)) 
         return 0; 

    /* -- New checks -- */

    // Return the largest metric value when the average is +Infinity
    if (double.IsPositiveInfinity(average))
        return values.IndexOf(values.Max());

    // Return the smallest metric value when the average is -Infinity
    if (double.IsNegativeInfinity(average))
        return values.IndexOf(values.Min());

   /* ...function continues as before... */

Copy link
Contributor Author

@LittleLittleCloud LittleLittleCloud Aug 13, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm OK with both solution, actually what you suggest is the first solution coming in my mind, but on the second thought, filtering +/- inf score before calculating average might be a more suitable solution because it follows what AutoML.Net does. As AutoML.Net will filter out nan value when calculating average score.

It's the same reason why we filter nan value out before calculating average score, there is no mathematical meaning to include infinite value either, thinking about what's the average score of [+inf, -1, 0, 1, -inf] should be.

If we want to keep +/- inf when calculating average score, we'd better keep nan as well. In that way, the evaluation for cross-validation runner is closer to the real situation.

Copy link
Contributor

@justinormont justinormont Aug 13, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the right answer is to keep the +/-Infinity.

NaN and Infinities have different uses. NaN is the lack of a value for a metric, used when it's inherently incalculable. Infinities are generally actual values of the metric.

For instance, log-loss is Infinity when the model is fully confident (probability == 0 or probability == 1), and incorrect, about a prediction (some background).

AutoML․NET ignores NaN metric values because they are not actual values of the metric, and are often caused by a self-induced data issue in how the AutoML code creates its CV splits.

The returned metric value from cross-validation, among other uses, is meant to be representative of the metric as if trained/scored on a larger dataset (lowered variance though averaging the folds). In the case of log-loss == Infinity, somewhere in the CV scoring datasets, there is one or more rows giving an Infinity value, which would be part of this larger scoring dataset. To be representative of this larger scoring dataset, the output value of the cross-validation should also be Infinity.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, thanks for the explanantion

@codecov
Copy link

codecov bot commented Aug 13, 2020

Codecov Report

Merging #5345 into master will increase coverage by 0.16%.
The diff coverage is 0.00%.

@@            Coverage Diff             @@
##           master    #5345      +/-   ##
==========================================
+ Coverage   73.97%   74.13%   +0.16%     
==========================================
  Files        1019     1019              
  Lines      189949   190653     +704     
  Branches    20429    20670     +241     
==========================================
+ Hits       140506   141339     +833     
+ Misses      43914    43783     -131     
- Partials     5529     5531       +2     
Flag Coverage Δ
#Debug 74.13% <0.00%> (+0.16%) ⬆️
#production 69.87% <0.00%> (+0.13%) ⬆️
#test 87.77% <ø> (+0.09%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
...AutoML/Experiment/Runners/CrossValSummaryRunner.cs 64.51% <0.00%> (-2.16%) ⬇️
src/Microsoft.ML.Maml/MAML.cs 23.78% <0.00%> (-0.98%) ⬇️
test/Microsoft.ML.Tests/OnnxConversionTest.cs 96.01% <0.00%> (-0.68%) ⬇️
src/Microsoft.ML.OnnxTransformer/OnnxUtils.cs 86.38% <0.00%> (-0.03%) ⬇️
...L.AutoML/TrainerExtensions/TrainerExtensionUtil.cs 86.36% <0.00%> (ø)
...soft.ML.Data/DataLoadSave/Text/TextLoaderCursor.cs 89.45% <0.00%> (+0.15%) ⬆️
...osoft.ML.OnnxTransformerTest/OnnxTransformTests.cs 97.89% <0.00%> (+0.22%) ⬆️
...StandardTrainers/Standard/LinearModelParameters.cs 66.58% <0.00%> (+0.25%) ⬆️
...rosoft.ML.Data/Scorers/PredictedLabelScorerBase.cs 81.95% <0.00%> (+0.37%) ⬆️
...Microsoft.ML.Data/Scorers/PredictionTransformer.cs 91.00% <0.00%> (+0.50%) ⬆️
... and 13 more

Copy link
Contributor

@justinormont justinormont left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM; thanks.

@@ -157,13 +157,24 @@ private static double GetAverageOfNonNaNScores(IEnumerable<double> results)
return newResults.Average(r => r);
}

/// <summary>
/// return the index of value from <paramref name="values"/> that closest to <paramref name="average"/>. If <paramref name="average"/> is non, +/- inf, the first, max/min value's index will be return.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Mispelling "... if average is NaN*, +/- inf, ... "

if (double.IsPositiveInfinity(average))
return values.ToList().IndexOf(values.Max());

// Return the max value's index if average is positive inf.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: The comment is the same as in the max case, it should say min and "negative inf"

Copy link
Member

@antoniovs1029 antoniovs1029 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Just left a couple of nits regarding the comments you added.

@antoniovs1029
Copy link
Member

@LittleLittleCloud , gentle ping about this PR. Is there anything else you want to add to it, or is it ok if we merge it once it passes the 1 failing CI build? thanks

@LittleLittleCloud
Copy link
Contributor Author

@LittleLittleCloud , gentle ping about this PR. Is there anything else you want to add to it, or is it ok if we merge it once it passes the 1 failing CI build? thanks

No that's all, I will close this PR.

@LittleLittleCloud LittleLittleCloud merged commit ba1804e into dotnet:master Aug 19, 2020
@ghost ghost locked as resolved and limited conversation to collaborators Mar 18, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

AutoML.Net: CrossValSummaryRunner can't handle all-infinity metrics value.
3 participants