Skip to content

Created DoubleParser.OptionFlags to be used by TextLoader #5154

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 9 commits into from
May 24, 2020

Conversation

antoniovs1029
Copy link
Member

@antoniovs1029 antoniovs1029 commented May 23, 2020

As mentioned on #5145 (comment) , PR #5145 opened the possibility of an issue to occur when running, at the same time, different cursors coming from different TextLoaders with different DecimalMarkers, as race conditions could occur. Although we originally agreed to accept that fringe scenario, this PR fixes that problem, and adds a test for that scenario.

This PR also addresses the general problem, that if we add new options to TextLoader that need to affect how we parse Single/Doubles, then there was no direct/thread-safe way to make these options affect how DoubleParser works... For example, issue #4132 would require adding a new TextLoader.Option "impute" that needs to affect how DoubleParser works. Other case would be the offline suggestion of also adding a ThousandsMarker option to be able to parse 10,332.05 or 10.332,05 into 10332.05 depending on that option.

General explanation of the PR

Without this PR the behavior was that TextLoader.Parser would use the singleton TextLoader.ValueCreatorCache.Instance to get the delegates to parse the fields loaded from a file. In turn, the ValueCreatorCache singleton would have gotten those delegates from the singleton Conversion.Instance, whose methods for parsing text to single/doubles would, then, call static methods in DoubleParser.

I am assuming that the only reason to have both ValueCreatorCache and Conversion follow the singleton pattern was for performance reasons, so that they didn't have to create the delegates every time somewhere in the code their instances were needed

With this PR:

  1. I created a new DoubleParser.OptionFlags enum that is used by TextLoader, and gets propagated through ValueCreatorCache up to Conversion in order to call the static DoubleParser methods with the correct options.
  2. I added code to ValueCreatorCache and Conversion to let them create other instances besides its default instance. Most of the codebase would still use their default instance, so this PR doesn't affect performance or behavior there. It's only in TextLoader, and only in the case when using custom options for DoubleParser, that a custom instance of ValueCreatorCache and Conversion gets created. So then performance would only be somewhat affected when creating those instances for that particular case. To have minimum impact in performance, I also added a ConcurrentDictionary to hold the _customInstances of ValueCreatorCache (i.e. those which were created using non-default DoubleParser.OptionFlags)... this way, when using custom options, the custom instance of ValueCreatorCache and Conversion would only be created once, and would be reused afterwards.

@antoniovs1029 antoniovs1029 requested a review from a team as a code owner May 23, 2020 06:54
@antoniovs1029 antoniovs1029 requested review from harishsk and mstfbl May 23, 2020 07:29
@codecov
Copy link

codecov bot commented May 23, 2020

Codecov Report

Merging #5154 into master will increase coverage by 0.01%.
The diff coverage is 92.35%.

@@            Coverage Diff             @@
##           master    #5154      +/-   ##
==========================================
+ Coverage   75.78%   75.79%   +0.01%     
==========================================
  Files         993      993              
  Lines      180862   180955      +93     
  Branches    19474    19486      +12     
==========================================
+ Hits       137060   137158      +98     
+ Misses      38512    38508       -4     
+ Partials     5290     5289       -1     
Flag Coverage Δ
#Debug 75.79% <92.35%> (+0.01%) ⬆️
#production 71.71% <80.80%> (+<0.01%) ⬆️
#test 88.89% <99.50%> (+0.01%) ⬆️
Impacted Files Coverage Δ
src/Microsoft.ML.Data/Commands/TypeInfoCommand.cs 0.00% <0.00%> (ø)
src/Microsoft.ML.Data/Data/DataViewUtils.cs 74.02% <0.00%> (ø)
...c/Microsoft.ML.Data/DataView/LambdaColumnMapper.cs 63.02% <0.00%> (ø)
src/Microsoft.ML.Data/DataView/LambdaFilter.cs 0.00% <0.00%> (ø)
src/Microsoft.ML.Data/Transforms/TypeConverting.cs 80.81% <0.00%> (ø)
...t.ML.Transforms/MissingValueDroppingTransformer.cs 72.89% <0.00%> (ø)
...t.ML.Transforms/MissingValueHandlingTransformer.cs 60.37% <0.00%> (ø)
.../Microsoft.ML.Data/DataLoadSave/Text/TextLoader.cs 82.19% <54.54%> (-0.05%) ⬇️
src/Microsoft.ML.Data/Transforms/ValueMapping.cs 84.37% <66.66%> (ø)
...c/Microsoft.ML.Transforms/MissingValueReplacing.cs 77.31% <66.66%> (ø)
... and 38 more

_instance;
return _defaultInstance ??
Interlocked.CompareExchange(ref _defaultInstance, new ValueCreatorCache(), null) ??
_defaultInstance;
}
Copy link
Contributor

@harishsk harishsk May 23, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion: If InterlockedCompareExchange fails, we are returning null. Explicitly returning null instead of _defaultInstance would make it slightly more readable. #Resolved

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By reading the docs on Interlocked.CompareExchange I think that the above could be even written as return _instance ?? Interlocked.CompareExchange(...) (i.e. without the last ??) since the method already returns the content of the variable to update, even if it didn't updated it.

In any case, I am hesitant to modify this, since it seems everywhere in the codebase where we return a variable updated by Interlocked.CompareExchange we follow exactly the same pattern that is above. Since my knowledge of multi-threaded programming is fairly low, I'll just assume that the pattern is correct for some reason, and it's better to leave it that way.


In reply to: 429557109 [](ancestors = 429557109)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right, you can just change it to return _instance ?? Interlocked.CompareExchange(...) . The pattern that is used throughout seems unnecessary. We can change it some other time.


In reply to: 429585239 [](ancestors = 429585239,429557109)

Copy link
Contributor

@harishsk harishsk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:shipit:

Copy link
Contributor

@mstfbl mstfbl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like there are many more options that can be added to DoubleParser.Options, such as:

  • the thousandth symbol (100,000.00 or 42.000,50)
  • having parenthesis surrounding a double to indicate negativeness (for example: (5.0) = -5.0)
  • having exponential numbers as doubles (42e+3 or 42e-3 or 42e3)
  • having both positive and negative signs in front of doubles (+4.2, -4.2). While +4.2 can also be specified as just 4.2, we should still be able to accept +4.2, no?
  • and so on...

If we want to think about future scalability, perhaps we should have a better way of calling TryParse, TryParseCore and Conversions without having to specify every single parsing option like decimalMarker.

For example, look at the NumberStyles enum enum here, which "determines the styles permitted in numeric string arguments that are passed to the Parse and TryParse methods of the integral and floating-point numeric types".

We could use one NumberStyles object to represent all of these options that can be added to DoubleParser.Options, which will control the number of parameters we'd need to add to TryParse, TryParseCore and Conversions. @antoniovs1029 What do you think?

@antoniovs1029 antoniovs1029 requested a review from a team as a code owner May 23, 2020 23:06
@antoniovs1029
Copy link
Member Author

antoniovs1029 commented May 23, 2020

That's a great suggestion, @mstfbl ! I have refactored the DoubleParser.Options into an OptionFlags enum, following the same pattern already used by TextLoader.OptionFlags.

EDIT: By the way, I think that most of the features you've mentioned are already implemented by default on DoubleParser... but as you've mentioned, it might be the case we'll want to add options to control this behavior.


In reply to: 417308429 [](ancestors = 417308429)


In reply to: 417308429 [](ancestors = 417308429)

@antoniovs1029 antoniovs1029 changed the title Created DoubleParser.Options to be used by TextLoader Created DoubleParser.OptionFlags to be used by TextLoader May 23, 2020
if ((flags & OptionFlags.UseCommaAsDecimalMarker) != 0)
decimalMarker = ',';
else
decimalMarker = '.';
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why as a flag OptionFlags.UseCommaAsDecimalMarker instead of just having a user enter a char directly? Having user enter a char directly lets them any character like _ in 1000_23.

Pandas lets users enter the char directly, see decimal parameter:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html

They also allow for entering the char for thousands (technically allow a string).

Excel also lets the user enter the chars directly:
https://www.officetooltips.com/excel_2016/tips/change_the_decimal_point_to_a_comma_or_vice_versa.html

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The public API to let users change the decimal marker was implemented and decided on #5145. We accept a character as the DecimalMarker option, but we actually only accept "," or "." and throw exceptions if the user sets other character.

On this PR I am not changing that behavior, but actually only changing how we connect the TextLoader to the DoubleParser (as the way we connected them wasn't thread safe, and would have unexpected behavior in some cases). Since we only accept "," or ".", then I used a Flag here. Since this flag is of internal use (i.e. it isn't determining a public API), it could later be replaced by an options object or something else if we get to accept other characters/strings as decimal separators. In any case, if we were to accept other characters different from "," and ".", then we'd actually need to change other things on the parser and on textloader to make this work correctly... so perhaps that could be left for future work, since I don't think we'd have time to correctly implement and test those other options before the next release. So, I think it's better to simply merge this PR to enable users to use "," and "." as decimal markers in a safe manner in the upcoming release, and then discuss about the possibility of enabling other characters.

As for the thousands parameter you mention, it is included inside a list of possible features that we might want to implement for TextLoader, but it isn't in our plans to implement it before the upcoming release.


In reply to: 429606655 [](ancestors = 429606655)

Copy link
Contributor

@justinormont justinormont left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Contributor

@mstfbl mstfbl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To parsing doubles! 🎉

@antoniovs1029 antoniovs1029 merged commit bc9abda into dotnet:master May 24, 2020
@ghost ghost locked as resolved and limited conversation to collaborators Mar 18, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants