Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Allow TextLoader to load empty float/double fields as NaN instead of 0 #5198
Allow TextLoader to load empty float/double fields as NaN instead of 0 #5198
Changes from 16 commits
a5ac022
e7e7da7
a7a454f
7d10358
4dd081c
b952e71
c0cdbec
66c6c24
684d359
b1cc176
65fcb59
a68da25
df32ba1
f6912b1
3dd66ff
280a35e
8e05dff
1cc7eff
b074412
19241ef
00a972b
a417228
File filter
Filter by extension
Conversations
Jump to
There are no files selected for viewing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not happy with the name I chose for this option, and for text I wrote as its doc. I think the name "field" can mean many things in the C# world (but inside the code of the TextLoader a "field" is the value found on a given column on a given row). Also, this option is supposed to be related to both Single and Doubles, so "float" is misleading. And "empty" would actually mean empty fields or fields with only whitespace. So I'm not sure how to name this and what to say on the public docs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure 'Impute' fits the terminology here. A few options to consider: ParseEmptyFloatsAsNan, ParseEmptyRealsAsNan, EmptyFieldsToNan, EmptyValuesToNan, etc.
In reply to: 434309670 [](ancestors = 434309670)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I used the word "impute" as that was the original name @justinormont suggested here #4132 (comment)
I wouldn't know if that's a common term in which our users could understand the feature by reading it.
As for the names you provided, perhaps EmptyRealsAsNan would fit well, as it avoids the word Field, and "Reals" could imply Floats or Doubles.
I will leave unaddressed right now, and see if we get any suggestion on this. It's only a variable name change, and I'll be able to make the change before merging the PR.
In reply to: 434886717 [](ancestors = 434886717,434309670)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've decided to go for MissingRealsAsNan , since it's not only the Empty fields that get mapped as NaNs, but also the ones that only have whitespace, and now also the ones were there are missing columns in a given row.
In reply to: 435163479 [](ancestors = 435163479,434886717,434309670)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In that case you should fix the documentation above to read "If true, empty or missing float and double fields will be loaded as NaN"
In reply to: 437077933 [](ancestors = 437077933,435163479,434886717,434309670)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, the documentation reads:
In reply to: 437170658 [](ancestors = 437170658,437077933,435163479,434886717,434309670)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(appended to example to allow for threaded discussion, this is how it will actually load missing columns at the end of the row)
The feature I implement on this PR means that "1,,dog" will be loaded as "1,NaN,dog" if the second column is a float/double. And to load "1,dog," as "1,dog,NaN" if the third column is a float/double. This was the behavior requested on #4132
What to do with incomplete rows was undefined in the request. I.e. what to do if there are supposed to be 6 columns but on one row there's only 2? (I.e.
1,dog
, without the last separator)In such a case, I took the decision to simply load the missing columns with default values (i.e. 0 for Float and Double) even when using the new feature of this PR.
The reason behind this is that the implementation required to load those missing columns as NaNs only for float/doubles would be somewhat hacky (and I didn't figure out how to make it work for vector types which included fields on missing columns on a given row). Besides, since it might be a fringe case. it's perhaps better not to support it.
Any thoughts on this, @justinormont @harishsk ? #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the current behavior of the Textloader when it encounters incomplete rows? Does it thrown an error or does it load default values for missing columns?
If it is the latter, then I think we should fix it to set those columns to NaN because that is the spirit of this PR.
In reply to: 434313813 [](ancestors = 434313813)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the case of incomplete rows it returns default values, and doesn't throw, both with or without the new ImputeEmptyFloats option.
I'll update this PR to have it return NaNs, but I won't close this thread because I'd still want to leave this discussion on. Once I update this PR with that fix, we can decide if this case is worth it the changes that would need to be introduced to cover it.
In reply to: 434896173 [](ancestors = 434896173,434313813)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To fix this case I'll also need to do some checks like
typeof(TValue) == typeof(float)
and casts like(TValue)(obj) Single.NaN)
similar to the ones described here:#5198 (comment)
The same concerns I've mentioned there apply to this, although in here I think I can change the code to do the cast only when there are missing float/doubles columns for a given row and the ImputeEmptyFloats option is enabled.
In reply to: 435167058 [](ancestors = 435167058,434896173,434313813)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It turned out fixing this case was much simpler than what I had originally expected, and no need to the
(TValue)(obj)Single.NaN
cast I thought about doing. So I've updated mi changes to fix this case 😄In reply to: 435219809 [](ancestors = 435219809,435167058,434896173,434313813)