Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow TextLoader to load empty float/double fields as NaN instead of 0 #5198

Merged
merged 22 commits into from
Jun 9, 2020

Conversation

antoniovs1029
Copy link
Member

@antoniovs1029 antoniovs1029 commented Jun 3, 2020

Fixes #4132 by implementing a new option on TextLoader that will allow it to impute missing float/doubles as NaN instead of the default 0.
Builds on #5154

@antoniovs1029 antoniovs1029 requested a review from a team as a code owner June 3, 2020 05:02
Comment on lines 535 to 539
/// <summary>
/// If true, empty float fields will be loaded as NaN. If false, they'll be loaded as 0. Default is false.
/// </summary>
[Argument(ArgumentType.AtMostOnce, HelpText = "If true, empty float fields will be loaded as NaN. If false, they'll be loaded as 0. Default is false.", ShortName = "imputefloat")]
public bool ImputeEmptyFloats = Defaults.ImputeEmptyFloats;
Copy link
Member Author

@antoniovs1029 antoniovs1029 Jun 3, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not happy with the name I chose for this option, and for text I wrote as its doc. I think the name "field" can mean many things in the C# world (but inside the code of the TextLoader a "field" is the value found on a given column on a given row). Also, this option is supposed to be related to both Single and Doubles, so "float" is misleading. And "empty" would actually mean empty fields or fields with only whitespace. So I'm not sure how to name this and what to say on the public docs.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure 'Impute' fits the terminology here. A few options to consider: ParseEmptyFloatsAsNan, ParseEmptyRealsAsNan, EmptyFieldsToNan, EmptyValuesToNan, etc.


In reply to: 434309670 [](ancestors = 434309670)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I used the word "impute" as that was the original name @justinormont suggested here #4132 (comment)
I wouldn't know if that's a common term in which our users could understand the feature by reading it.

As for the names you provided, perhaps EmptyRealsAsNan would fit well, as it avoids the word Field, and "Reals" could imply Floats or Doubles.

I will leave unaddressed right now, and see if we get any suggestion on this. It's only a variable name change, and I'll be able to make the change before merging the PR.


In reply to: 434886717 [](ancestors = 434886717,434309670)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've decided to go for MissingRealsAsNan , since it's not only the Empty fields that get mapped as NaNs, but also the ones that only have whitespace, and now also the ones were there are missing columns in a given row.


In reply to: 435163479 [](ancestors = 435163479,434886717,434309670)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In that case you should fix the documentation above to read "If true, empty or missing float and double fields will be loaded as NaN"


In reply to: 437077933 [](ancestors = 437077933,435163479,434886717,434309670)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, the documentation reads:

            /// <summary>
            /// If true, missing real fields (i.e. double or single fields) will be loaded as NaN.
            /// If false, they'll be loaded as 0. Default is false.
            /// A field is considered "missing" if it's empty, if it only has whitespace, or if there are missing columns
            /// at the end of a given row.
            /// </summary>

In reply to: 437170658 [](ancestors = 437170658,437077933,435163479,434886717,434309670)

Comment on lines 10 to 13
// The next two rows map the missing columns as 0, as it was decided not to impute with NaN
// in this case
6,"this has no date, num3 or num4 (the separator corresponding to them is also missing)",6.6,66.66,1/1/0001,0,0
7,"this has no num4 (the separator corresponding to it is missing)",7.7,77.77,7/7/2007,777.777,0
Copy link
Member Author

@antoniovs1029 antoniovs1029 Jun 3, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(appended to example to allow for threaded discussion, this is how it will actually load missing columns at the end of the row)

The feature I implement on this PR means that "1,,dog" will be loaded as "1,NaN,dog" if the second column is a float/double. And to load "1,dog," as "1,dog,NaN" if the third column is a float/double. This was the behavior requested on #4132

What to do with incomplete rows was undefined in the request. I.e. what to do if there are supposed to be 6 columns but on one row there's only 2? (I.e. 1,dog, without the last separator)

In such a case, I took the decision to simply load the missing columns with default values (i.e. 0 for Float and Double) even when using the new feature of this PR.

The reason behind this is that the implementation required to load those missing columns as NaNs only for float/doubles would be somewhat hacky (and I didn't figure out how to make it work for vector types which included fields on missing columns on a given row). Besides, since it might be a fringe case. it's perhaps better not to support it.

Any thoughts on this, @justinormont @harishsk ? #Resolved

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the current behavior of the Textloader when it encounters incomplete rows? Does it thrown an error or does it load default values for missing columns?
If it is the latter, then I think we should fix it to set those columns to NaN because that is the spirit of this PR.


In reply to: 434313813 [](ancestors = 434313813)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the case of incomplete rows it returns default values, and doesn't throw, both with or without the new ImputeEmptyFloats option.

I'll update this PR to have it return NaNs, but I won't close this thread because I'd still want to leave this discussion on. Once I update this PR with that fix, we can decide if this case is worth it the changes that would need to be introduced to cover it.


In reply to: 434896173 [](ancestors = 434896173,434313813)

Copy link
Member Author

@antoniovs1029 antoniovs1029 Jun 4, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To fix this case I'll also need to do some checks like typeof(TValue) == typeof(float) and casts like (TValue)(obj) Single.NaN) similar to the ones described here:
#5198 (comment)

The same concerns I've mentioned there apply to this, although in here I think I can change the code to do the cast only when there are missing float/doubles columns for a given row and the ImputeEmptyFloats option is enabled.


In reply to: 435167058 [](ancestors = 435167058,434896173,434313813)

Copy link
Member Author

@antoniovs1029 antoniovs1029 Jun 9, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It turned out fixing this case was much simpler than what I had originally expected, and no need to the (TValue)(obj)Single.NaN cast I thought about doing. So I've updated mi changes to fix this case 😄


In reply to: 435219809 [](ancestors = 435219809,435167058,434896173,434313813)

@codecov
Copy link

codecov bot commented Jun 3, 2020

Codecov Report

Merging #5198 into master will increase coverage by 0.02%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master    #5198      +/-   ##
==========================================
+ Coverage   73.31%   73.33%   +0.02%     
==========================================
  Files        1007     1007              
  Lines      187969   188123     +154     
  Branches    20241    20245       +4     
==========================================
+ Hits       137812   137967     +155     
  Misses      44630    44630              
+ Partials     5527     5526       -1     
Flag Coverage Δ
#Debug 73.33% <100.00%> (+0.02%) ⬆️
#production 69.07% <100.00%> (+<0.01%) ⬆️
#test 87.49% <100.00%> (+0.04%) ⬆️
Impacted Files Coverage Δ
src/Microsoft.ML.Data/Data/Conversion.cs 79.06% <ø> (+0.37%) ⬆️
.../Microsoft.ML.Data/DataLoadSave/Text/TextLoader.cs 82.13% <100.00%> (+0.06%) ⬆️
...soft.ML.Data/DataLoadSave/Text/TextLoaderParser.cs 87.15% <100.00%> (+0.13%) ⬆️
test/Microsoft.ML.Tests/TextLoaderTests.cs 97.02% <100.00%> (+0.45%) ⬆️
...c/Microsoft.ML.FastTree/Utils/ThreadTaskManager.cs 79.48% <0.00%> (-20.52%) ⬇️
...soft.ML.Data/DataLoadSave/Text/TextLoaderCursor.cs 89.45% <0.00%> (+0.15%) ⬆️
....ML.AutoML/PipelineSuggesters/PipelineSuggester.cs 83.19% <0.00%> (+3.36%) ⬆️

@harishsk
Copy link
Contributor

harishsk commented Jun 3, 2020

    {

While you are here, is it possible to combine these two Parse functions with a generic implementation that covers float and double? The one obstacle I can see is the Single.NaN and Double.NaN. Can that be factored in by extending the Result enum to have a Result.NaN or some other similar technique? #Resolved


Refers to: src/Microsoft.ML.Core/Utilities/DoubleParser.cs:109 in 3dd66ff. [](commit_id = 3dd66ff, deletion_comment = False)

@antoniovs1029
Copy link
Member Author

    {

I'm not sure it's doable, but I'll explore the possibilities.


In reply to: 638494349 [](ancestors = 638494349)


Refers to: src/Microsoft.ML.Core/Utilities/DoubleParser.cs:109 in 3dd66ff. [](commit_id = 3dd66ff, deletion_comment = False)

@antoniovs1029
Copy link
Member Author

antoniovs1029 commented Jun 4, 2020

    {

So I found that this worked, and I guess I can use this to refactor as per your suggestion:

    static private void ReturnNaN<TValue>(out TValue val)
    {
        if(typeof(TValue) == typeof(Single))
        {
            Console.WriteLine("this was float!");
            // val = (TValue)Convert.ChangeType(Single.NaN, typeof(TValue)); // this also works
            val = (TValue)(object)Single.NaN; // this also works
            return;
        }
        else if(typeof(TValue) == typeof(Double))
        {
            Console.WriteLine("this was double!");
            // val = (TValue)Convert.ChangeType(Double.NaN, typeof(TValue)); // this also works
            val = (TValue)(object)Double.NaN; // this also works
            return;
        }

        throw new Exception();
    }

Still, I'm not sure if it's ok to do the above... In general it's commented that checking typeof(TValue) == typeof(othertype) and doing the `(TValue)(object) is considered bad design:
https://stackoverflow.com/questions/15958830/c-sharp-generics-cast-generic-type-to-value-type

But on the other hand, in our code base there are already multiple generic methods where we check for typeof(TValue) == typeof(othertype), and I found that we actually also use something in similar to the return (TValue)(object) trick in places like:

public static BufferBuilder<T> CreateDefault()
{
if (typeof(T) == typeof(ReadOnlyMemory<char>))
return (BufferBuilder<T>)(object)new BufferBuilder<ReadOnlyMemory<char>>(TextCombiner.Instance);
if (typeof(T) == typeof(float))
return (BufferBuilder<T>)(object)new BufferBuilder<float>(FloatAdder.Instance);
throw Contracts.Except($"Unrecognized type '{typeof(T)}' for default {nameof(BufferBuilder<T>)}");
}

private Tensor CastDataAndReturnAsTensor(T[] data)
{
if (typeof(T) == typeof(sbyte))
return new Tensor((sbyte[])(object)data, _dims, TF_DataType.TF_INT8);
else if (typeof(T) == typeof(long))
return new Tensor((long[])(object)data, _dims, TF_DataType.TF_INT64);
else if (typeof(T) == typeof(Int32))
return new Tensor((Int32[])(object)data, _dims, TF_DataType.TF_INT32);
else if (typeof(T) == typeof(Int16))
return new Tensor((Int16[])(object)data, _dims, TF_DataType.TF_INT16);

Although I notice that in those 2 examples, we're actually returning a reference type, whereas in the refactor needed to return NaN it would be returning a value type... so I'm not sure if there's another workaround, although in the stack overflow link I posted above (and other places I checked) it does look that the (object) cast would be necessary even when returning Float/Double value type. Also, regarding performance, I don't know if there's any cost in casting from value to object to value... but I guess it would negligible as it would only occur when returning NaN?


In reply to: 638777783 [](ancestors = 638777783,638494349)


Refers to: src/Microsoft.ML.Core/Utilities/DoubleParser.cs:109 in 3dd66ff. [](commit_id = 3dd66ff, deletion_comment = False)

@codecov
Copy link

codecov bot commented Jun 4, 2020

Codecov Report

Merging #5198 into master will increase coverage by 0.23%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master    #5198      +/-   ##
==========================================
+ Coverage   73.31%   73.55%   +0.23%     
==========================================
  Files        1007     1016       +9     
  Lines      187969   190040    +2071     
  Branches    20241    20438     +197     
==========================================
+ Hits       137812   139777    +1965     
- Misses      44630    44693      +63     
- Partials     5527     5570      +43     
Flag Coverage Δ
#Debug 73.55% <100.00%> (+0.23%) ⬆️
#production 69.35% <100.00%> (+0.28%) ⬆️
#test 87.52% <100.00%> (+0.07%) ⬆️
Impacted Files Coverage Δ
src/Microsoft.ML.Data/Data/Conversion.cs 79.06% <ø> (+0.37%) ⬆️
.../Microsoft.ML.Data/DataLoadSave/Text/TextLoader.cs 82.41% <100.00%> (+0.34%) ⬆️
...soft.ML.Data/DataLoadSave/Text/TextLoaderParser.cs 87.39% <100.00%> (+0.37%) ⬆️
test/Microsoft.ML.TestFramework/TestCommandBase.cs 44.96% <100.00%> (+0.39%) ⬆️
test/Microsoft.ML.Tests/TextLoaderTests.cs 97.02% <100.00%> (+0.45%) ⬆️
test/Microsoft.ML.Tests/OnnxConversionTest.cs 96.81% <0.00%> (-2.16%) ⬇️
...c/Microsoft.ML.SamplesUtils/SamplesDatasetUtils.cs 40.00% <0.00%> (-0.68%) ⬇️
src/Microsoft.ML.Data/Model/Onnx/OnnxContext.cs 100.00% <0.00%> (ø)
...Microsoft.ML.OnnxConverter/OnnxExportExtensions.cs 100.00% <0.00%> (ø)
.../Microsoft.ML.Transforms/Dracula/DictCountTable.cs 100.00% <0.00%> (ø)
... and 52 more

@harishsk
Copy link
Contributor

harishsk commented Jun 4, 2020

    {

Never mind. This seems a more invasive change. And even if we were to do this, this is a big enough change that it shouldn't be part of this PR.


In reply to: 638818549 [](ancestors = 638818549,638777783,638494349)


Refers to: src/Microsoft.ML.Core/Utilities/DoubleParser.cs:109 in 3dd66ff. [](commit_id = 3dd66ff, deletion_comment = False)

@yaeldekel
Copy link

yaeldekel commented Jun 8, 2020

        host.CheckDecode((_flags & ~OptionFlags.All) == 0);

Do we need to have a different check here based on the version of the model? #Resolved


Refers to: src/Microsoft.ML.Data/DataLoadSave/Text/TextLoader.cs:1419 in 3dd66ff. [](commit_id = 3dd66ff, deletion_comment = False)

@antoniovs1029
Copy link
Member Author

antoniovs1029 commented Jun 8, 2020

        host.CheckDecode((_flags & ~OptionFlags.All) == 0);

Yeah. You're right. If I don't check for version then the CheckDecode in here is pretty much being ignored. Will fix this. EDIT: DONE.


In reply to: 640419819 [](ancestors = 640419819)


Refers to: src/Microsoft.ML.Data/DataLoadSave/Text/TextLoader.cs:1419 in 3dd66ff. [](commit_id = 3dd66ff, deletion_comment = False)

@antoniovs1029
Copy link
Member Author

        host.CheckDecode((_flags & ~OptionFlags.All) == 0);

And I'll add a MAML test to test that this is being saved and loaded correctly (I believe saving and loading a textloader can only be tested through MAML).


In reply to: 640708134 [](ancestors = 640708134,640419819)


Refers to: src/Microsoft.ML.Data/DataLoadSave/Text/TextLoader.cs:1419 in 3dd66ff. [](commit_id = 3dd66ff, deletion_comment = False)

@antoniovs1029
Copy link
Member Author

        host.CheckDecode((_flags & ~OptionFlags.All) == 0);

Done.


In reply to: 640726768 [](ancestors = 640726768,640708134,640419819)


Refers to: src/Microsoft.ML.Data/DataLoadSave/Text/TextLoader.cs:1419 in 3dd66ff. [](commit_id = 3dd66ff, deletion_comment = False)

Copy link
Contributor

@harishsk harishsk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:shipit:

@antoniovs1029 antoniovs1029 merged commit c664748 into dotnet:master Jun 9, 2020
@ghost ghost locked as resolved and limited conversation to collaborators Mar 18, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Conversion.TryParse parses floats and doubles with empty/whitespace values as 0 instead of NaN
4 participants