-
Notifications
You must be signed in to change notification settings - Fork 1.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add escapeChar support to TextLoader and added benchmark for TextLoader #5147
Conversation
/// <summary> | ||
/// Character to use to escape quotes inside quoted fields. | ||
/// </summary> | ||
[Argument(ArgumentType.AtMostOnce, HelpText = "Character to use to escape quotes inside quoted fields", ShortName = "escapechar")] | ||
public char EscapeChar = Defaults.EscapeChar; | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(pinned to a line of code to allow for threaded conversations)
So currently this PR uses the escapeChar to escape quote chars (") inside quoted fields; which I think should be enough for the main ask made by Justin of letting users escape with \"
. Still, there is room for expanding this feature in 2 directions:
-
Allow escapeChar to also escape quote chars inside non-quoted fields. The RFC 4180 would actually consider invalid any quote char (") outside of quoted fields. Still, ML.NET has always allowed unescaped quotes inside non-quoted fields... so if a field contained something like
1,these are 2 quotes "" and this is one " quote,true
it would actually read the second field asthese are 2 quotes "" and this is one " quote
(i.e., without throwing or logging errors, and without escaping the first ""). So if we allow escapeChar to escape quotes even outside quoted fields, it will break previous behavior of ML.NET (of not escaping double quotes) and it would actually be unnecessary (since escaping them was never necessary). -
Allow escapeChar to also escape separators (a.k.a delimitation characters) (outside and/or inside quoted fields). Justin mentioned on his ask that other libraries allow escaping separators, but actually it seems that Databricks only allows escaping quotes, whereas Pandas' documentation is more ambiguous, and it says escapechar is used to "escape other characters"... so I'd still need to actually test if pandas allow escaping separators. Justin has also shared with me this paper where (among many other things) they used escapechar to escape both separators, quotes and the escapeChar itself... but I haven't read it fully yet, and I'm not sure if they conclude anything specifically on the usefulness of escaping separators. Even more, I'm not sure if @justinormont considers this to be a must. Also, it would be ambiguous what to do if the
escapeChar = "
(default) and we encounter the common case of... "this is a field",5 ...
... should",
be interpreted as an escaped,
? I guess that ifescapeChar = "
then it should never escape separators... this tricky case might require more changes to the code, making it somewhat bigger and perhaps somewhat less efficient.
My opinion is we shouldn't allow those two cases, and that we should stick with only having escapeChar escape quotes inside quoted fields, as adding the other cases might require more code to maintain and have some impact on efficiency. But I'm eager to hear opinions from others. @harishsk @justinormont #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd recommend covering the first case. Though, as is, this is a large improvement to the code base.
One possible way to handle it is to only accept quotes when pared w/ a separator:
- Open quote with one of
- Preceded w/ a separator:
a,"open...close",c
(with whitespace optionala, "open...close", c
) - At the start of a line:
"open...close",b,c
- Preceded w/ a separator:
- Close quote with one of
- Followed by a separator:
a,"open...close",c
(no white space allowed) - At the end of a line:
a,b,"open...close"
- Followed by a separator:
bool isOpenQuote, isCloseQuote, escapedChar, isInsideQuotedRegion, fieldClosed;
for (i = 0; i < str.length; i++)
{
isOpenQuote = !isInsideQuotedRegion && !escapedChar && (
i == 0 && str[i] == quoteChar
|| i >= 1 && str[i] == quoteChar && str[i-1] == sepChar
|| i >= 2 && str[i] == quoteChar && str[i-2] == sepChar && IsWhiteSpace(str[i-1])
);
if (isOpenQuote) { isInsideQuotedRegion = true; }
isCloseQuote = isInsideQuotedRegion && !isOpenQuote && !escapedChar && (
str[i] == quoteChar && str[i+1] == sepChar && i <= str.length - 2
|| str[i] == quoteChar && i == str.length - 1
);
if (isCloseQuote) { isInsideQuotedRegion = false; }
fieldClosed = !isInsideQuotedRegion && (!escapedChar && str[i] == sepChar || i == str.length - 1);
/* if fieldClosed or escapedChar, append to string builder, checking for isOpenQuote/isCloseQuote to skip the start/current char */
escapedChar = !escapedChar && str[i] == escapeChar && IsEscapable(str[i+1]) && i <= str.length - 2;
}
Along with handling the cases of a,12",b
, this " quote was escaped
, and 1,these are 2 quotes "" and this is one " quote,true
, this also catches the weird though common enough case of unescaped quotes a,"open...unescaped"...close",c
.
It wouldn't handle 1,"abc",def"
as the middle quote would still need to be properly escaped. #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for taking the time to review this, @justinormont .
In general it would look to me that your logic is correct, and it might be a good (if not the best) to actually fully accept the first case I mentioned above. Although problems would arise if we also wanted to support all of the other current ML.NET behavior... for example, that we currently ignore all spaces before the opening quote and after the closing quote, whereas your logic only accepts 1 space before and none after (and this characteristic is what your logic rely on for distinguishing between a "real" closing quote that's supposed to be used to close the field, or a simple quote that just happens to be inside a field... so we would either have to choose between your logic or ML.NET current behavior). Also, with your logic I'm not sure if the whole QuotingError mechanism in ML.NET (where quoting errors are logged and empty fields are returned if they appear) would make sense.
In any case, as I mention below in another comment the code of the Parser already avoids as much as possible doing comparisons and calling functions. So it might not be good to add all of the checks that your logic needs to apply to every character of the input string. Because of this, I wouldn't recommend going on with your logic.
Specially, since all the valid cases (according to the RFC) are already accepted, and your logic seems to only be accepting some unescaped quotes (for which we would typically simply have a QuotingError), and being able to escape "escapable characters" inside or outside of quoted fields (which might or might not be the desired behavior of any given user, if we hard code which are these escapable characters).
Without your logic, particularly for case 1 you mentioned might be important, the behavior on this PR is to leave unquoted fields untouched... so "1,this is an \"unquoted\" field,true"
will read the second field as this is an \"unquoted\" field
(i.e. it won't escape the \"
as your logic does, but it won't throw errors or anything else, it will simply not remove the \
).
So I don't think it's worth it to add more logic to this PR to support escaping in case 1. If that's ok with you. #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good. Thanks for thinking through the impact. #Resolved
Codecov Report
@@ Coverage Diff @@
## master #5147 +/- ##
========================================
Coverage 75.67% 75.67%
========================================
Files 990 991 +1
Lines 179802 180024 +222
Branches 19341 19371 +30
========================================
+ Hits 136065 136238 +173
- Misses 38470 38517 +47
- Partials 5267 5269 +2
|
@@ -576,45 +582,76 @@ public bool LastFieldIncludesNewLine(string line, bool startsInsideQuoted = fals | |||
if (!startsInsideQuoted) | |||
ichCur++; | |||
|
|||
for (; ; ichCur++) | |||
if (_escapeCharIsDoubleQuote) | |||
{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it possible to refactor these two if/else blocks into separate functions for the sake of readability? (Or even better merge them into a single function/codepath?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it is possible to have this refactored into two functions or even into one path.
I didn't do this because the original writers of the Parser seemed to be very worried about efficiency and the way they addressed this is by avoiding creating / calling functions, and using bools such as the one I've used here for _escapeCharIsDoubleQuote.
So for example, we have the following code written by them, which can be obviously be refactored into a function, and perhaps even simplified to avoid the first 2 if branches, but it's stated that they didn't do it because it was "performance critical"
machinelearning/src/Microsoft.ML.Data/DataLoadSave/Text/TextLoaderParser.cs
Lines 1208 to 1241 in c023271
// Please note that these branched tight loops are intended and performance critical. | |
if (_seps.Length == 1) | |
{ | |
for (; ; ichCur++) | |
{ | |
Contracts.Assert(ichCur <= ichLim); | |
if (ichCur >= ichLim) | |
break; | |
if (_sep0 == span[ichCur]) | |
break; | |
} | |
} | |
else if (_seps.Length == 2) | |
{ | |
for (; ; ichCur++) | |
{ | |
Contracts.Assert(ichCur <= ichLim); | |
if (ichCur >= ichLim) | |
break; | |
if (_sep0 == span[ichCur] || _sep1 == span[ichCur]) | |
break; | |
} | |
} | |
else | |
{ | |
for (; ; ichCur++) | |
{ | |
Contracts.Assert(ichCur <= ichLim); | |
if (ichCur >= ichLim) | |
break; | |
if (IsSep(span[ichCur])) | |
break; | |
} | |
} |
Following that pattern I thought that the fastest way for the code I am writing to run is to branch earlier with _escapeCharIsDoubleQuote only requiring to check it once... because if I refactor it to call functions or even worse if I merge into a single path I'll have to continuously check if _escapeCharIsDoubleQuote for it to work correctly, then it would have to take more steps and probably have a toll on performance. Also, personally I think that a merged path would be harder to read.
Do you still want me to try to refactor it? If so, I can try to come up with a refactored version for us to compare.
In reply to: 428386984 [](ancestors = 428386984)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I couldn't find any benchmark tests for text loader in the code. Can you please look up and see if I have missed something?
In reply to: 428393614 [](ancestors = 428393614,428386984)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think there's any benchmark that only tests text loader. But I think it's a good idea to add one, so I'll do that.
In the meantime, I actually used the FeaturizeTextBench (which only loads a syntetic dataset and featurizes it) to run benchmarks on my previous readMultilines PR:
#5125 (comment)
As I mention there, I'm not sure how to interpret the results, so I will address this tomorrow on scrum.
In reply to: 428398494 [](ancestors = 428398494,428393614,428386984)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So I've added a new TextLoaderBench Benchmark test, based on TextFeaturizerBench and CacheDataViewBench. On my new benchmark I simply load a DataView using TextLoader from a syntetic file, get the getter of its first 20 columns and read its 3000 rows without saving them into memory.
The results are as follows. Notice that I still don't understand how to interpret them, as they're somewhat unexpected... it seems they indicate there's not much change across runs, but it might be just as well that perhaps my benchmark isn't conclusive on what it's measuring. I wouldn't know 😕
(notice that even when the mean is very similar across scenarios, the Error is consistently higher without my PRs)
Without this PR, and also without my readMultilines PR
| Method | Mean | Error | StdDev | Extra Metric |
|---------------------- |--------:|--------:|---------:|-------------:|
| TestTextLoaderGetters | 1.256 s | 3.273 s | 0.1794 s | - |
With this PR, with readMultilines = false and escapeChar ="
(i.e. with defaults)
| Method | Mean | Error | StdDev | Extra Metric |
|---------------------- |---------:|---------:|---------:|-------------:|
| TestTextLoaderGetters | 810.5 ms | 765.0 ms | 41.93 ms | - |
With this PR, readMultilines = true, escapeChar = "
| Method | Mean | Error | StdDev | Extra Metric |
|---------------------- |--------:|---------:|---------:|-------------:|
| TestTextLoaderGetters | 1.089 s | 0.7992 s | 0.0438 s | - |
With this PR, readMultilines = true, escapeChar = \
| Method | Mean | Error | StdDev | Extra Metric |
|---------------------- |---------:|---------:|---------:|-------------:|
| TestTextLoaderGetters | 857.9 ms | 696.7 ms | 38.19 ms | - |
// * Legends *
Mean : Arithmetic mean of all measurements
Error : Half of 99.9% confidence interval
StdDev : Standard deviation of all measurements
Extra Metric : Value of the provided extra metric
1 s : 1 Second (1 sec)
In reply to: 428493474 [](ancestors = 428493474,428398494,428393614,428386984)
AllowQuoting = true, | ||
//ReadMultilines = true, | ||
//EscapeChar = '\\', | ||
}); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we do separate benchmarks for different combination of options? If you think that is going to take time, you can back out this file and let the rest of this PR go through and add the benchmark in a separate PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As discussed offline, I'll keep the benchmark here but uncomment the ReadMultilines and EscapeChar options to actually test them on the benchmark.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Adds feature requested here: #5125 (comment)
And also a benchmark for TextLoader, whose results are reported here:
#5147 (comment)
The feature is exposing a new option called escapeChar to the users, which is used to escape quote characters inside quoted fields.
So if
escapeChar = \
, then the second field of the following row:1,"this \" quote was escaped",true
will be loaded as
this " quote was escaped
Default behavior (which was the behavior even before this PR) is that
escapeChar = "
since 2 double quotes ("") has always been used to escape quotes inside quoted fields by ML.NET.