Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add escapeChar support to TextLoader and added benchmark for TextLoader #5147

Merged
merged 12 commits into from
May 21, 2020

Conversation

antoniovs1029
Copy link
Member

@antoniovs1029 antoniovs1029 commented May 20, 2020

Adds feature requested here: #5125 (comment)

And also a benchmark for TextLoader, whose results are reported here:
#5147 (comment)

The feature is exposing a new option called escapeChar to the users, which is used to escape quote characters inside quoted fields.

So if escapeChar = \, then the second field of the following row:
1,"this \" quote was escaped",true

will be loaded as
this " quote was escaped

Default behavior (which was the behavior even before this PR) is that escapeChar = " since 2 double quotes ("") has always been used to escape quotes inside quoted fields by ML.NET.

@antoniovs1029 antoniovs1029 requested a review from a team as a code owner May 20, 2020 11:17
Comment on lines 524 to 529
/// <summary>
/// Character to use to escape quotes inside quoted fields.
/// </summary>
[Argument(ArgumentType.AtMostOnce, HelpText = "Character to use to escape quotes inside quoted fields", ShortName = "escapechar")]
public char EscapeChar = Defaults.EscapeChar;

Copy link
Member Author

@antoniovs1029 antoniovs1029 May 20, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(pinned to a line of code to allow for threaded conversations)

So currently this PR uses the escapeChar to escape quote chars (") inside quoted fields; which I think should be enough for the main ask made by Justin of letting users escape with \". Still, there is room for expanding this feature in 2 directions:

  1. Allow escapeChar to also escape quote chars inside non-quoted fields. The RFC 4180 would actually consider invalid any quote char (") outside of quoted fields. Still, ML.NET has always allowed unescaped quotes inside non-quoted fields... so if a field contained something like 1,these are 2 quotes "" and this is one " quote,true it would actually read the second field as these are 2 quotes "" and this is one " quote (i.e., without throwing or logging errors, and without escaping the first ""). So if we allow escapeChar to escape quotes even outside quoted fields, it will break previous behavior of ML.NET (of not escaping double quotes) and it would actually be unnecessary (since escaping them was never necessary).

  2. Allow escapeChar to also escape separators (a.k.a delimitation characters) (outside and/or inside quoted fields). Justin mentioned on his ask that other libraries allow escaping separators, but actually it seems that Databricks only allows escaping quotes, whereas Pandas' documentation is more ambiguous, and it says escapechar is used to "escape other characters"... so I'd still need to actually test if pandas allow escaping separators. Justin has also shared with me this paper where (among many other things) they used escapechar to escape both separators, quotes and the escapeChar itself... but I haven't read it fully yet, and I'm not sure if they conclude anything specifically on the usefulness of escaping separators. Even more, I'm not sure if @justinormont considers this to be a must. Also, it would be ambiguous what to do if the escapeChar = " (default) and we encounter the common case of ... "this is a field",5 ...... should ", be interpreted as an escaped ,? I guess that if escapeChar = " then it should never escape separators... this tricky case might require more changes to the code, making it somewhat bigger and perhaps somewhat less efficient.

My opinion is we shouldn't allow those two cases, and that we should stick with only having escapeChar escape quotes inside quoted fields, as adding the other cases might require more code to maintain and have some impact on efficiency. But I'm eager to hear opinions from others. @harishsk @justinormont #Resolved

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with the above assessment.


In reply to: 427946570 [](ancestors = 427946570)

Copy link
Contributor

@justinormont justinormont May 21, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd recommend covering the first case. Though, as is, this is a large improvement to the code base.

One possible way to handle it is to only accept quotes when pared w/ a separator:

  • Open quote with one of
    • Preceded w/ a separator: a,"open...close",c
      (with whitespace optional a, "open...close", c)
    • At the start of a line: "open...close",b,c
  • Close quote with one of
    • Followed by a separator: a,"open...close",c
      (no white space allowed)
    • At the end of a line: a,b,"open...close"
bool isOpenQuote, isCloseQuote, escapedChar, isInsideQuotedRegion, fieldClosed;
for (i = 0; i < str.length; i++)
{
  isOpenQuote = !isInsideQuotedRegion && !escapedChar && (
      i == 0 && str[i] == quoteChar
      || i >= 1 && str[i] == quoteChar && str[i-1] == sepChar
      || i >= 2 && str[i] == quoteChar && str[i-2] == sepChar && IsWhiteSpace(str[i-1])
    );

  if (isOpenQuote) { isInsideQuotedRegion = true; }

  isCloseQuote = isInsideQuotedRegion && !isOpenQuote && !escapedChar && (
      str[i] == quoteChar && str[i+1] == sepChar &&  i <= str.length - 2
      || str[i] == quoteChar && i == str.length - 1
    ); 

  if (isCloseQuote) { isInsideQuotedRegion = false; }

  fieldClosed = !isInsideQuotedRegion && (!escapedChar && str[i] == sepChar || i ==  str.length - 1);

  /* if fieldClosed or escapedChar, append to string builder, checking for isOpenQuote/isCloseQuote to skip the start/current char */

  escapedChar = !escapedChar && str[i] == escapeChar && IsEscapable(str[i+1]) && i <= str.length - 2;
}

Along with handling the cases of a,12",b, this " quote was escaped, and 1,these are 2 quotes "" and this is one " quote,true, this also catches the weird though common enough case of unescaped quotes a,"open...unescaped"...close",c.

It wouldn't handle 1,"abc",def" as the middle quote would still need to be properly escaped. #Resolved

Copy link
Member Author

@antoniovs1029 antoniovs1029 May 21, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for taking the time to review this, @justinormont .

In general it would look to me that your logic is correct, and it might be a good (if not the best) to actually fully accept the first case I mentioned above. Although problems would arise if we also wanted to support all of the other current ML.NET behavior... for example, that we currently ignore all spaces before the opening quote and after the closing quote, whereas your logic only accepts 1 space before and none after (and this characteristic is what your logic rely on for distinguishing between a "real" closing quote that's supposed to be used to close the field, or a simple quote that just happens to be inside a field... so we would either have to choose between your logic or ML.NET current behavior). Also, with your logic I'm not sure if the whole QuotingError mechanism in ML.NET (where quoting errors are logged and empty fields are returned if they appear) would make sense.

In any case, as I mention below in another comment the code of the Parser already avoids as much as possible doing comparisons and calling functions. So it might not be good to add all of the checks that your logic needs to apply to every character of the input string. Because of this, I wouldn't recommend going on with your logic.

Specially, since all the valid cases (according to the RFC) are already accepted, and your logic seems to only be accepting some unescaped quotes (for which we would typically simply have a QuotingError), and being able to escape "escapable characters" inside or outside of quoted fields (which might or might not be the desired behavior of any given user, if we hard code which are these escapable characters).

Without your logic, particularly for case 1 you mentioned might be important, the behavior on this PR is to leave unquoted fields untouched... so "1,this is an \"unquoted\" field,true" will read the second field as this is an \"unquoted\" field (i.e. it won't escape the \" as your logic does, but it won't throw errors or anything else, it will simply not remove the \).

So I don't think it's worth it to add more logic to this PR to support escaping in case 1. If that's ok with you. #Resolved

Copy link
Contributor

@justinormont justinormont May 21, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good. Thanks for thinking through the impact. #Resolved

@codecov
Copy link

codecov bot commented May 20, 2020

Codecov Report

Merging #5147 into master will increase coverage by 0.00%.
The diff coverage is 56.41%.

@@           Coverage Diff            @@
##           master    #5147    +/-   ##
========================================
  Coverage   75.67%   75.67%            
========================================
  Files         990      991     +1     
  Lines      179802   180024   +222     
  Branches    19341    19371    +30     
========================================
+ Hits       136065   136238   +173     
- Misses      38470    38517    +47     
- Partials     5267     5269     +2     
Flag Coverage Δ
#Debug 75.67% <56.41%> (+<0.01%) ⬆️
#production 71.58% <84.82%> (+0.01%) ⬆️
#test 88.83% <30.32%> (-0.09%) ⬇️
Impacted Files Coverage Δ
test/Microsoft.ML.Benchmarks/BenchmarkBase.cs 0.00% <0.00%> (ø)
test/Microsoft.ML.Benchmarks/FeaturizeTextBench.cs 0.00% <0.00%> (ø)
test/Microsoft.ML.Benchmarks/TextLoaderBench.cs 0.00% <0.00%> (ø)
.../Microsoft.ML.Data/DataLoadSave/Text/TextLoader.cs 82.34% <83.33%> (+0.05%) ⬆️
...soft.ML.Data/DataLoadSave/Text/TextLoaderCursor.cs 89.42% <84.44%> (+0.25%) ⬆️
...soft.ML.Data/DataLoadSave/Text/TextLoaderParser.cs 86.12% <85.71%> (-0.40%) ⬇️
test/Microsoft.ML.TestFramework/TestCommandBase.cs 44.56% <100.00%> (ø)
test/Microsoft.ML.Tests/TextLoaderTests.cs 95.75% <100.00%> (+0.19%) ⬆️
src/Microsoft.ML.Maml/MAML.cs 24.75% <0.00%> (-1.46%) ⬇️
src/Microsoft.ML.Sweeper/AsyncSweeper.cs 72.60% <0.00%> (-1.37%) ⬇️
... and 11 more

@@ -576,45 +582,76 @@ public bool LastFieldIncludesNewLine(string line, bool startsInsideQuoted = fals
if (!startsInsideQuoted)
ichCur++;

for (; ; ichCur++)
if (_escapeCharIsDoubleQuote)
{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to refactor these two if/else blocks into separate functions for the sake of readability? (Or even better merge them into a single function/codepath?

Copy link
Member Author

@antoniovs1029 antoniovs1029 May 21, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is possible to have this refactored into two functions or even into one path.

I didn't do this because the original writers of the Parser seemed to be very worried about efficiency and the way they addressed this is by avoiding creating / calling functions, and using bools such as the one I've used here for _escapeCharIsDoubleQuote.

So for example, we have the following code written by them, which can be obviously be refactored into a function, and perhaps even simplified to avoid the first 2 if branches, but it's stated that they didn't do it because it was "performance critical"

// Please note that these branched tight loops are intended and performance critical.
if (_seps.Length == 1)
{
for (; ; ichCur++)
{
Contracts.Assert(ichCur <= ichLim);
if (ichCur >= ichLim)
break;
if (_sep0 == span[ichCur])
break;
}
}
else if (_seps.Length == 2)
{
for (; ; ichCur++)
{
Contracts.Assert(ichCur <= ichLim);
if (ichCur >= ichLim)
break;
if (_sep0 == span[ichCur] || _sep1 == span[ichCur])
break;
}
}
else
{
for (; ; ichCur++)
{
Contracts.Assert(ichCur <= ichLim);
if (ichCur >= ichLim)
break;
if (IsSep(span[ichCur]))
break;
}
}

Following that pattern I thought that the fastest way for the code I am writing to run is to branch earlier with _escapeCharIsDoubleQuote only requiring to check it once... because if I refactor it to call functions or even worse if I merge into a single path I'll have to continuously check if _escapeCharIsDoubleQuote for it to work correctly, then it would have to take more steps and probably have a toll on performance. Also, personally I think that a merged path would be harder to read.

Do you still want me to try to refactor it? If so, I can try to come up with a refactored version for us to compare.


In reply to: 428386984 [](ancestors = 428386984)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I couldn't find any benchmark tests for text loader in the code. Can you please look up and see if I have missed something?


In reply to: 428393614 [](ancestors = 428393614,428386984)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think there's any benchmark that only tests text loader. But I think it's a good idea to add one, so I'll do that.

In the meantime, I actually used the FeaturizeTextBench (which only loads a syntetic dataset and featurizes it) to run benchmarks on my previous readMultilines PR:
#5125 (comment)

As I mention there, I'm not sure how to interpret the results, so I will address this tomorrow on scrum.


In reply to: 428398494 [](ancestors = 428398494,428393614,428386984)

Copy link
Member Author

@antoniovs1029 antoniovs1029 May 21, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I've added a new TextLoaderBench Benchmark test, based on TextFeaturizerBench and CacheDataViewBench. On my new benchmark I simply load a DataView using TextLoader from a syntetic file, get the getter of its first 20 columns and read its 3000 rows without saving them into memory.

The results are as follows. Notice that I still don't understand how to interpret them, as they're somewhat unexpected... it seems they indicate there's not much change across runs, but it might be just as well that perhaps my benchmark isn't conclusive on what it's measuring. I wouldn't know 😕

(notice that even when the mean is very similar across scenarios, the Error is consistently higher without my PRs)

Without this PR, and also without my readMultilines PR

|                Method |    Mean |   Error |   StdDev | Extra Metric |
|---------------------- |--------:|--------:|---------:|-------------:|
| TestTextLoaderGetters | 1.256 s | 3.273 s | 0.1794 s |            - |

With this PR, with readMultilines = false and escapeChar =" (i.e. with defaults)

|                Method |     Mean |    Error |   StdDev | Extra Metric |
|---------------------- |---------:|---------:|---------:|-------------:|
| TestTextLoaderGetters | 810.5 ms | 765.0 ms | 41.93 ms |            - |

With this PR, readMultilines = true, escapeChar = "

|                Method |    Mean |    Error |   StdDev | Extra Metric |
|---------------------- |--------:|---------:|---------:|-------------:|
| TestTextLoaderGetters | 1.089 s | 0.7992 s | 0.0438 s |            - |

With this PR, readMultilines = true, escapeChar = \

|                Method |     Mean |    Error |   StdDev | Extra Metric |
|---------------------- |---------:|---------:|---------:|-------------:|
| TestTextLoaderGetters | 857.9 ms | 696.7 ms | 38.19 ms |            - |
// * Legends *
  Mean         : Arithmetic mean of all measurements
  Error        : Half of 99.9% confidence interval
  StdDev       : Standard deviation of all measurements
  Extra Metric : Value of the provided extra metric
  1 s          : 1 Second (1 sec)


In reply to: 428493474 [](ancestors = 428493474,428398494,428393614,428386984)

AllowQuoting = true,
//ReadMultilines = true,
//EscapeChar = '\\',
});
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we do separate benchmarks for different combination of options? If you think that is going to take time, you can back out this file and let the rest of this PR go through and add the benchmark in a separate PR.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As discussed offline, I'll keep the benchmark here but uncomment the ReadMultilines and EscapeChar options to actually test them on the benchmark.

Copy link
Contributor

@harishsk harishsk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:shipit:

@antoniovs1029 antoniovs1029 changed the title Add escapeChar support to TextLoader Add escapeChar support to TextLoader and added benchmark for TextLoader May 21, 2020
@harishsk harishsk merged commit c576d5e into dotnet:master May 21, 2020
@ghost ghost locked as resolved and limited conversation to collaborators Mar 18, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants