-
Notifications
You must be signed in to change notification settings - Fork 1.9k
Created DoubleParser.OptionFlags to be used by TextLoader #5154
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Created DoubleParser.OptionFlags to be used by TextLoader #5154
Conversation
….cs and TextLoader so that it can be used by TextLoader.Parser
Codecov Report
@@ Coverage Diff @@
## master #5154 +/- ##
==========================================
+ Coverage 75.78% 75.79% +0.01%
==========================================
Files 993 993
Lines 180862 180955 +93
Branches 19474 19486 +12
==========================================
+ Hits 137060 137158 +98
+ Misses 38512 38508 -4
+ Partials 5290 5289 -1
|
…where in the TextLoader.
…oncurrent dictionary
_instance; | ||
return _defaultInstance ?? | ||
Interlocked.CompareExchange(ref _defaultInstance, new ValueCreatorCache(), null) ?? | ||
_defaultInstance; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
suggestion: If InterlockedCompareExchange fails, we are returning null. Explicitly returning null instead of _defaultInstance would make it slightly more readable. #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
By reading the docs on Interlocked.CompareExchange I think that the above could be even written as return _instance ?? Interlocked.CompareExchange(...)
(i.e. without the last ??
) since the method already returns the content of the variable to update, even if it didn't updated it.
In any case, I am hesitant to modify this, since it seems everywhere in the codebase where we return a variable updated by Interlocked.CompareExchange we follow exactly the same pattern that is above. Since my knowledge of multi-threaded programming is fairly low, I'll just assume that the pattern is correct for some reason, and it's better to leave it that way.
In reply to: 429557109 [](ancestors = 429557109)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are right, you can just change it to return _instance ?? Interlocked.CompareExchange(...)
. The pattern that is used throughout seems unnecessary. We can change it some other time.
In reply to: 429585239 [](ancestors = 429585239,429557109)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel like there are many more options that can be added to DoubleParser.Options
, such as:
- the thousandth symbol (100,000.00 or 42.000,50)
- having parenthesis surrounding a double to indicate negativeness (for example: (5.0) = -5.0)
- having exponential numbers as doubles (42e+3 or 42e-3 or 42e3)
- having both positive and negative signs in front of doubles (+4.2, -4.2). While +4.2 can also be specified as just 4.2, we should still be able to accept +4.2, no?
- and so on...
If we want to think about future scalability, perhaps we should have a better way of calling TryParse
, TryParseCore
and Conversions
without having to specify every single parsing option like decimalMarker
.
For example, look at the NumberStyles enum enum here, which "determines the styles permitted in numeric string arguments that are passed to the Parse and TryParse methods of the integral and floating-point numeric types".
We could use one NumberStyles
object to represent all of these options that can be added to DoubleParser.Options
, which will control the number of parameters we'd need to add to TryParse
, TryParseCore
and Conversions
. @antoniovs1029 What do you think?
That's a great suggestion, @mstfbl ! I have refactored the DoubleParser.Options into an OptionFlags enum, following the same pattern already used by TextLoader.OptionFlags. EDIT: By the way, I think that most of the features you've mentioned are already implemented by default on DoubleParser... but as you've mentioned, it might be the case we'll want to add options to control this behavior. In reply to: 417308429 [](ancestors = 417308429) In reply to: 417308429 [](ancestors = 417308429) |
if ((flags & OptionFlags.UseCommaAsDecimalMarker) != 0) | ||
decimalMarker = ','; | ||
else | ||
decimalMarker = '.'; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why as a flag OptionFlags.UseCommaAsDecimalMarker
instead of just having a user enter a char directly? Having user enter a char directly lets them any character like _
in 1000_23
.
Pandas lets users enter the char directly, see decimal
parameter:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html
They also allow for entering the char for thousands
(technically allow a string).
Excel also lets the user enter the chars directly:
https://www.officetooltips.com/excel_2016/tips/change_the_decimal_point_to_a_comma_or_vice_versa.html
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The public API to let users change the decimal marker was implemented and decided on #5145. We accept a character as the DecimalMarker option, but we actually only accept "," or "." and throw exceptions if the user sets other character.
On this PR I am not changing that behavior, but actually only changing how we connect the TextLoader to the DoubleParser (as the way we connected them wasn't thread safe, and would have unexpected behavior in some cases). Since we only accept "," or ".", then I used a Flag here. Since this flag is of internal use (i.e. it isn't determining a public API), it could later be replaced by an options object or something else if we get to accept other characters/strings as decimal separators. In any case, if we were to accept other characters different from "," and ".", then we'd actually need to change other things on the parser and on textloader to make this work correctly... so perhaps that could be left for future work, since I don't think we'd have time to correctly implement and test those other options before the next release. So, I think it's better to simply merge this PR to enable users to use "," and "." as decimal markers in a safe manner in the upcoming release, and then discuss about the possibility of enabling other characters.
As for the thousands parameter you mention, it is included inside a list of possible features that we might want to implement for TextLoader, but it isn't in our plans to implement it before the upcoming release.
In reply to: 429606655 [](ancestors = 429606655)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To parsing doubles! 🎉
As mentioned on #5145 (comment) , PR #5145 opened the possibility of an issue to occur when running, at the same time, different cursors coming from different
TextLoaders
with differentDecimalMarkers
, as race conditions could occur. Although we originally agreed to accept that fringe scenario, this PR fixes that problem, and adds a test for that scenario.This PR also addresses the general problem, that if we add new options to
TextLoader
that need to affect how we parse Single/Doubles, then there was no direct/thread-safe way to make these options affect howDoubleParser
works... For example, issue #4132 would require adding a newTextLoader.Option
"impute" that needs to affect howDoubleParser
works. Other case would be the offline suggestion of also adding aThousandsMarker
option to be able to parse10,332.05
or10.332,05
into10332.05
depending on that option.General explanation of the PR
Without this PR the behavior was that
TextLoader.Parser
would use the singletonTextLoader.ValueCreatorCache.Instance
to get the delegates to parse the fields loaded from a file. In turn, theValueCreatorCache
singleton would have gotten those delegates from the singletonConversion.Instance
, whose methods for parsing text to single/doubles would, then, call static methods inDoubleParser
.I am assuming that the only reason to have both
ValueCreatorCache
andConversion
follow the singleton pattern was for performance reasons, so that they didn't have to create the delegates every time somewhere in the code their instances were neededWith this PR:
DoubleParser.OptionFlags
enum that is used byTextLoader
, and gets propagated throughValueCreatorCache
up toConversion
in order to call the staticDoubleParser
methods with the correct options.ValueCreatorCache
andConversion
to let them create other instances besides its default instance. Most of the codebase would still use their default instance, so this PR doesn't affect performance or behavior there. It's only inTextLoader
, and only in the case when using custom options forDoubleParser
, that a custom instance ofValueCreatorCache
andConversion
gets created. So then performance would only be somewhat affected when creating those instances for that particular case. To have minimum impact in performance, I also added aConcurrentDictionary
to hold the_customInstances
ofValueCreatorCache
(i.e. those which were created using non-defaultDoubleParser.OptionFlags
)... this way, when using custom options, the custom instance ofValueCreatorCache
and Conversion would only be created once, and would be reused afterwards.