Created DoubleParser.OptionFlags to be used by TextLoader #5154

antoniovs1029 · 2020-05-23T06:54:07Z

As mentioned on #5145 (comment) , PR #5145 opened the possibility of an issue to occur when running, at the same time, different cursors coming from different TextLoaders with different DecimalMarkers, as race conditions could occur. Although we originally agreed to accept that fringe scenario, this PR fixes that problem, and adds a test for that scenario.

This PR also addresses the general problem, that if we add new options to TextLoader that need to affect how we parse Single/Doubles, then there was no direct/thread-safe way to make these options affect how DoubleParser works... For example, issue #4132 would require adding a new TextLoader.Option "impute" that needs to affect how DoubleParser works. Other case would be the offline suggestion of also adding a ThousandsMarker option to be able to parse 10,332.05 or 10.332,05 into 10332.05 depending on that option.

General explanation of the PR

Without this PR the behavior was that TextLoader.Parser would use the singleton TextLoader.ValueCreatorCache.Instance to get the delegates to parse the fields loaded from a file. In turn, the ValueCreatorCache singleton would have gotten those delegates from the singleton Conversion.Instance, whose methods for parsing text to single/doubles would, then, call static methods in DoubleParser.

I am assuming that the only reason to have both ValueCreatorCache and Conversion follow the singleton pattern was for performance reasons, so that they didn't have to create the delegates every time somewhere in the code their instances were needed

With this PR:

I created a new DoubleParser.OptionFlags enum that is used by TextLoader, and gets propagated through ValueCreatorCache up to Conversion in order to call the static DoubleParser methods with the correct options.
I added code to ValueCreatorCache and Conversion to let them create other instances besides its default instance. Most of the codebase would still use their default instance, so this PR doesn't affect performance or behavior there. It's only in TextLoader, and only in the case when using custom options for DoubleParser, that a custom instance of ValueCreatorCache and Conversion gets created. So then performance would only be somewhat affected when creating those instances for that particular case. To have minimum impact in performance, I also added a ConcurrentDictionary to hold the _customInstances of ValueCreatorCache (i.e. those which were created using non-default DoubleParser.OptionFlags)... this way, when using custom options, the custom instance of ValueCreatorCache and Conversion would only be created once, and would be reused afterwards.

….cs and TextLoader so that it can be used by TextLoader.Parser

src/Microsoft.ML.Core/Utilities/DoubleParser.cs

src/Microsoft.ML.Data/DataLoadSave/Text/TextLoaderParser.cs

test/Microsoft.ML.Tests/TextLoaderTests.cs

src/Microsoft.ML.Data/Data/Conversion.cs

codecov · 2020-05-23T08:05:58Z

Codecov Report

Merging #5154 into master will increase coverage by 0.01%.
The diff coverage is 92.35%.

@@            Coverage Diff             @@
##           master    #5154      +/-   ##
==========================================
+ Coverage   75.78%   75.79%   +0.01%     
==========================================
  Files         993      993              
  Lines      180862   180955      +93     
  Branches    19474    19486      +12     
==========================================
+ Hits       137060   137158      +98     
+ Misses      38512    38508       -4     
+ Partials     5290     5289       -1

Flag	Coverage Δ
#Debug	`75.79% <92.35%> (+0.01%)`	⬆️
#production	`71.71% <80.80%> (+<0.01%)`	⬆️
#test	`88.89% <99.50%> (+0.01%)`	⬆️

Impacted Files	Coverage Δ
src/Microsoft.ML.Data/Commands/TypeInfoCommand.cs	`0.00% <0.00%> (ø)`
src/Microsoft.ML.Data/Data/DataViewUtils.cs	`74.02% <0.00%> (ø)`
...c/Microsoft.ML.Data/DataView/LambdaColumnMapper.cs	`63.02% <0.00%> (ø)`
src/Microsoft.ML.Data/DataView/LambdaFilter.cs	`0.00% <0.00%> (ø)`
src/Microsoft.ML.Data/Transforms/TypeConverting.cs	`80.81% <0.00%> (ø)`
...t.ML.Transforms/MissingValueDroppingTransformer.cs	`72.89% <0.00%> (ø)`
...t.ML.Transforms/MissingValueHandlingTransformer.cs	`60.37% <0.00%> (ø)`
.../Microsoft.ML.Data/DataLoadSave/Text/TextLoader.cs	`82.19% <54.54%> (-0.05%)`	⬇️
src/Microsoft.ML.Data/Transforms/ValueMapping.cs	`84.37% <66.66%> (ø)`
...c/Microsoft.ML.Transforms/MissingValueReplacing.cs	`77.31% <66.66%> (ø)`
... and 38 more

…where in the TextLoader.

…oncurrent dictionary

src/Microsoft.ML.Data/DataLoadSave/Text/TextLoaderParser.cs

harishsk · 2020-05-23T15:51:06Z

src/Microsoft.ML.Data/DataLoadSave/Text/TextLoaderParser.cs

-                        _instance;
+                    return _defaultInstance ??
+                        Interlocked.CompareExchange(ref _defaultInstance, new ValueCreatorCache(), null) ??
+                        _defaultInstance;
                }


suggestion: If InterlockedCompareExchange fails, we are returning null. Explicitly returning null instead of _defaultInstance would make it slightly more readable. #Resolved

By reading the docs on Interlocked.CompareExchange I think that the above could be even written as return _instance ?? Interlocked.CompareExchange(...) (i.e. without the last ??) since the method already returns the content of the variable to update, even if it didn't updated it.

In any case, I am hesitant to modify this, since it seems everywhere in the codebase where we return a variable updated by Interlocked.CompareExchange we follow exactly the same pattern that is above. Since my knowledge of multi-threaded programming is fairly low, I'll just assume that the pattern is correct for some reason, and it's better to leave it that way.

In reply to: 429557109 [](ancestors = 429557109)

You are right, you can just change it to return _instance ?? Interlocked.CompareExchange(...) . The pattern that is used throughout seems unnecessary. We can change it some other time.

In reply to: 429585239 [](ancestors = 429585239,429557109)

harishsk

mstfbl

I feel like there are many more options that can be added to DoubleParser.Options, such as:

the thousandth symbol (100,000.00 or 42.000,50)
having parenthesis surrounding a double to indicate negativeness (for example: (5.0) = -5.0)
having exponential numbers as doubles (42e+3 or 42e-3 or 42e3)
having both positive and negative signs in front of doubles (+4.2, -4.2). While +4.2 can also be specified as just 4.2, we should still be able to accept +4.2, no?
and so on...

If we want to think about future scalability, perhaps we should have a better way of calling TryParse, TryParseCore and Conversions without having to specify every single parsing option like decimalMarker.

For example, look at the NumberStyles enum enum here, which "determines the styles permitted in numeric string arguments that are passed to the Parse and TryParse methods of the integral and floating-point numeric types".

We could use one NumberStyles object to represent all of these options that can be added to DoubleParser.Options, which will control the number of parameters we'd need to add to TryParse, TryParseCore and Conversions. @antoniovs1029 What do you think?

antoniovs1029 · 2020-05-23T23:10:02Z

That's a great suggestion, @mstfbl ! I have refactored the DoubleParser.Options into an OptionFlags enum, following the same pattern already used by TextLoader.OptionFlags.

EDIT: By the way, I think that most of the features you've mentioned are already implemented by default on DoubleParser... but as you've mentioned, it might be the case we'll want to add options to control this behavior.

In reply to: 417308429 [](ancestors = 417308429)

src/Microsoft.ML.Core/Utilities/DoubleParser.cs

test/Microsoft.ML.Tests/TextLoaderTests.cs

justinormont · 2020-05-24T07:07:47Z

src/Microsoft.ML.Core/Utilities/DoubleParser.cs

+            if ((flags & OptionFlags.UseCommaAsDecimalMarker) != 0)
+                decimalMarker = ',';
+            else
+                decimalMarker = '.';


Why as a flag OptionFlags.UseCommaAsDecimalMarker instead of just having a user enter a char directly? Having user enter a char directly lets them any character like _ in 1000_23.

Pandas lets users enter the char directly, see decimal parameter:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html

They also allow for entering the char for thousands (technically allow a string).

Excel also lets the user enter the chars directly:
https://www.officetooltips.com/excel_2016/tips/change_the_decimal_point_to_a_comma_or_vice_versa.html

The public API to let users change the decimal marker was implemented and decided on #5145. We accept a character as the DecimalMarker option, but we actually only accept "," or "." and throw exceptions if the user sets other character.

On this PR I am not changing that behavior, but actually only changing how we connect the TextLoader to the DoubleParser (as the way we connected them wasn't thread safe, and would have unexpected behavior in some cases). Since we only accept "," or ".", then I used a Flag here. Since this flag is of internal use (i.e. it isn't determining a public API), it could later be replaced by an options object or something else if we get to accept other characters/strings as decimal separators. In any case, if we were to accept other characters different from "," and ".", then we'd actually need to change other things on the parser and on textloader to make this work correctly... so perhaps that could be left for future work, since I don't think we'd have time to correctly implement and test those other options before the next release. So, I think it's better to simply merge this PR to enable users to use "," and "." as decimal markers in a safe manner in the upcoming release, and then discuss about the possibility of enabling other characters.

As for the thousands parameter you mention, it is included inside a list of possible features that we might want to implement for TextLoader, but it isn't in our plans to implement it before the upcoming release.

In reply to: 429606655 [](ancestors = 429606655)

justinormont

LGTM

mstfbl

To parsing doubles! 🎉

antoniovs1029 added 2 commits May 22, 2020 23:50

Created the DoubleParserOptions class, and made changes on Conversion…

cce5ec4

….cs and TextLoader so that it can be used by TextLoader.Parser

Added test for concurrent cursors

c059e2c

antoniovs1029 requested a review from a team as a code owner May 23, 2020 06:54