Add escapeChar support to TextLoader and added benchmark for TextLoader #5147

antoniovs1029 · 2020-05-20T11:17:37Z

Adds feature requested here: #5125 (comment)

And also a benchmark for TextLoader, whose results are reported here:
#5147 (comment)

The feature is exposing a new option called escapeChar to the users, which is used to escape quote characters inside quoted fields.

So if escapeChar = \, then the second field of the following row:
1,"this \" quote was escaped",true

will be loaded as
this " quote was escaped

Default behavior (which was the behavior even before this PR) is that escapeChar = " since 2 double quotes ("") has always been used to escape quotes inside quoted fields by ML.NET.

antoniovs1029 · 2020-05-20T11:49:47Z

src/Microsoft.ML.Data/DataLoadSave/Text/TextLoader.cs

+            /// <summary>
+            /// Character to use to escape quotes inside quoted fields.
+            /// </summary>
+            [Argument(ArgumentType.AtMostOnce, HelpText = "Character to use to escape quotes inside quoted fields", ShortName = "escapechar")]
+            public char EscapeChar = Defaults.EscapeChar;
+


(pinned to a line of code to allow for threaded conversations)

So currently this PR uses the escapeChar to escape quote chars (") inside quoted fields; which I think should be enough for the main ask made by Justin of letting users escape with \". Still, there is room for expanding this feature in 2 directions:

Allow escapeChar to also escape quote chars inside non-quoted fields. The RFC 4180 would actually consider invalid any quote char (") outside of quoted fields. Still, ML.NET has always allowed unescaped quotes inside non-quoted fields... so if a field contained something like 1,these are 2 quotes "" and this is one " quote,true it would actually read the second field as these are 2 quotes "" and this is one " quote (i.e., without throwing or logging errors, and without escaping the first ""). So if we allow escapeChar to escape quotes even outside quoted fields, it will break previous behavior of ML.NET (of not escaping double quotes) and it would actually be unnecessary (since escaping them was never necessary).

Allow escapeChar to also escape separators (a.k.a delimitation characters) (outside and/or inside quoted fields). Justin mentioned on his ask that other libraries allow escaping separators, but actually it seems that Databricks only allows escaping quotes, whereas Pandas' documentation is more ambiguous, and it says escapechar is used to "escape other characters"... so I'd still need to actually test if pandas allow escaping separators. Justin has also shared with me this paper where (among many other things) they used escapechar to escape both separators, quotes and the escapeChar itself... but I haven't read it fully yet, and I'm not sure if they conclude anything specifically on the usefulness of escaping separators. Even more, I'm not sure if @justinormont considers this to be a must. Also, it would be ambiguous what to do if the escapeChar = " (default) and we encounter the common case of ... "this is a field",5 ...... should ", be interpreted as an escaped ,? I guess that if escapeChar = " then it should never escape separators... this tricky case might require more changes to the code, making it somewhat bigger and perhaps somewhat less efficient.

My opinion is we shouldn't allow those two cases, and that we should stick with only having escapeChar escape quotes inside quoted fields, as adding the other cases might require more code to maintain and have some impact on efficiency. But I'm eager to hear opinions from others. @harishsk @justinormont #Resolved

I agree with the above assessment.

In reply to: 427946570 [](ancestors = 427946570)

I'd recommend covering the first case. Though, as is, this is a large improvement to the code base.

One possible way to handle it is to only accept quotes when pared w/ a separator:

Open quote with one of

Preceded w/ a separator: a,"open...close",c
(with whitespace optional a, "open...close", c)

At the start of a line: "open...close",b,c

Close quote with one of

Followed by a separator: a,"open...close",c
(no white space allowed)

At the end of a line: a,b,"open...close"

bool isOpenQuote, isCloseQuote, escapedChar, isInsideQuotedRegion, fieldClosed; for (i = 0; i < str.length; i++) { isOpenQuote = !isInsideQuotedRegion && !escapedChar && ( i == 0 && str[i] == quoteChar || i >= 1 && str[i] == quoteChar && str[i-1] == sepChar || i >= 2 && str[i] == quoteChar && str[i-2] == sepChar && IsWhiteSpace(str[i-1]) ); if (isOpenQuote) { isInsideQuotedRegion = true; } isCloseQuote = isInsideQuotedRegion && !isOpenQuote && !escapedChar && ( str[i] == quoteChar && str[i+1] == sepChar && i <= str.length - 2 || str[i] == quoteChar && i == str.length - 1 ); if (isCloseQuote) { isInsideQuotedRegion = false; } fieldClosed = !isInsideQuotedRegion && (!escapedChar && str[i] == sepChar || i == str.length - 1); /* if fieldClosed or escapedChar, append to string builder, checking for isOpenQuote/isCloseQuote to skip the start/current char */ escapedChar = !escapedChar && str[i] == escapeChar && IsEscapable(str[i+1]) && i <= str.length - 2; }

Along with handling the cases of a,12",b, this " quote was escaped, and 1,these are 2 quotes "" and this is one " quote,true, this also catches the weird though common enough case of unescaped quotes a,"open...unescaped"...close",c.

It wouldn't handle 1,"abc",def" as the middle quote would still need to be properly escaped. #Resolved

Thanks for taking the time to review this, @justinormont .

In general it would look to me that your logic is correct, and it might be a good (if not the best) to actually fully accept the first case I mentioned above. Although problems would arise if we also wanted to support all of the other current ML.NET behavior... for example, that we currently ignore all spaces before the opening quote and after the closing quote, whereas your logic only accepts 1 space before and none after (and this characteristic is what your logic rely on for distinguishing between a "real" closing quote that's supposed to be used to close the field, or a simple quote that just happens to be inside a field... so we would either have to choose between your logic or ML.NET current behavior). Also, with your logic I'm not sure if the whole QuotingError mechanism in ML.NET (where quoting errors are logged and empty fields are returned if they appear) would make sense.

In any case, as I mention below in another comment the code of the Parser already avoids as much as possible doing comparisons and calling functions. So it might not be good to add all of the checks that your logic needs to apply to every character of the input string. Because of this, I wouldn't recommend going on with your logic.

Specially, since all the valid cases (according to the RFC) are already accepted, and your logic seems to only be accepting some unescaped quotes (for which we would typically simply have a QuotingError), and being able to escape "escapable characters" inside or outside of quoted fields (which might or might not be the desired behavior of any given user, if we hard code which are these escapable characters).

Without your logic, particularly for case 1 you mentioned might be important, the behavior on this PR is to leave unquoted fields untouched... so "1,this is an \"unquoted\" field,true" will read the second field as this is an \"unquoted\" field (i.e. it won't escape the \" as your logic does, but it won't throw errors or anything else, it will simply not remove the \).

So I don't think it's worth it to add more logic to this PR to support escaping in case 1. If that's ok with you. #Resolved

Sounds good. Thanks for thinking through the impact. #Resolved

test/Microsoft.ML.TestFramework/TestCommandBase.cs

codecov · 2020-05-20T12:38:19Z

Codecov Report

Merging #5147 into master will increase coverage by 0.00%.
The diff coverage is 56.41%.

@@           Coverage Diff            @@
##           master    #5147    +/-   ##
========================================
  Coverage   75.67%   75.67%            
========================================
  Files         990      991     +1     
  Lines      179802   180024   +222     
  Branches    19341    19371    +30     
========================================
+ Hits       136065   136238   +173     
- Misses      38470    38517    +47     
- Partials     5267     5269     +2

Flag	Coverage Δ
#Debug	`75.67% <56.41%> (+<0.01%)`	⬆️
#production	`71.58% <84.82%> (+0.01%)`	⬆️
#test	`88.83% <30.32%> (-0.09%)`	⬇️

Impacted Files	Coverage Δ
test/Microsoft.ML.Benchmarks/BenchmarkBase.cs	`0.00% <0.00%> (ø)`
test/Microsoft.ML.Benchmarks/FeaturizeTextBench.cs	`0.00% <0.00%> (ø)`
test/Microsoft.ML.Benchmarks/TextLoaderBench.cs	`0.00% <0.00%> (ø)`
.../Microsoft.ML.Data/DataLoadSave/Text/TextLoader.cs	`82.34% <83.33%> (+0.05%)`	⬆️
...soft.ML.Data/DataLoadSave/Text/TextLoaderCursor.cs	`89.42% <84.44%> (+0.25%)`	⬆️
...soft.ML.Data/DataLoadSave/Text/TextLoaderParser.cs	`86.12% <85.71%> (-0.40%)`	⬇️
test/Microsoft.ML.TestFramework/TestCommandBase.cs	`44.56% <100.00%> (ø)`
test/Microsoft.ML.Tests/TextLoaderTests.cs	`95.75% <100.00%> (+0.19%)`	⬆️
src/Microsoft.ML.Maml/MAML.cs	`24.75% <0.00%> (-1.46%)`	⬇️
src/Microsoft.ML.Sweeper/AsyncSweeper.cs	`72.60% <0.00%> (-1.37%)`	⬇️
... and 11 more

src/Microsoft.ML.Data/DataLoadSave/Text/TextLoader.cs

harishsk · 2020-05-21T00:50:07Z

src/Microsoft.ML.Data/DataLoadSave/Text/TextLoaderCursor.cs

@@ -576,45 +582,76 @@ public bool LastFieldIncludesNewLine(string line, bool startsInsideQuoted = fals
                            if (!startsInsideQuoted)
                                ichCur++;

-                            for (; ; ichCur++)
+                            if (_escapeCharIsDoubleQuote)
                            {


Is it possible to refactor these two if/else blocks into separate functions for the sake of readability? (Or even better merge them into a single function/codepath?

I think it is possible to have this refactored into two functions or even into one path.

I didn't do this because the original writers of the Parser seemed to be very worried about efficiency and the way they addressed this is by avoiding creating / calling functions, and using bools such as the one I've used here for _escapeCharIsDoubleQuote.

So for example, we have the following code written by them, which can be obviously be refactored into a function, and perhaps even simplified to avoid the first 2 if branches, but it's stated that they didn't do it because it was "performance critical"

machinelearning/src/Microsoft.ML.Data/DataLoadSave/Text/TextLoaderParser.cs

Lines 1208 to 1241 in c023271

// Please note that these branched tight loops are intended and performance critical.

if (_seps.Length == 1)

{

for (; ; ichCur++)

{

Contracts.Assert(ichCur <= ichLim);

if (ichCur >= ichLim)

break;

if (_sep0 == span[ichCur])

break;

}

}

else if (_seps.Length == 2)

{

for (; ; ichCur++)

{

Contracts.Assert(ichCur <= ichLim);

if (ichCur >= ichLim)

break;

if (_sep0 == span[ichCur] || _sep1 == span[ichCur])

break;

}

}

else

{

for (; ; ichCur++)

{

Contracts.Assert(ichCur <= ichLim);

if (ichCur >= ichLim)

break;

if (IsSep(span[ichCur]))

break;

}

}

Following that pattern I thought that the fastest way for the code I am writing to run is to branch earlier with _escapeCharIsDoubleQuote only requiring to check it once... because if I refactor it to call functions or even worse if I merge into a single path I'll have to continuously check if _escapeCharIsDoubleQuote for it to work correctly, then it would have to take more steps and probably have a toll on performance. Also, personally I think that a merged path would be harder to read.

Do you still want me to try to refactor it? If so, I can try to come up with a refactored version for us to compare.

In reply to: 428386984 [](ancestors = 428386984)

I couldn't find any benchmark tests for text loader in the code. Can you please look up and see if I have missed something?

In reply to: 428393614 [](ancestors = 428393614,428386984)

I don't think there's any benchmark that only tests text loader. But I think it's a good idea to add one, so I'll do that.

In the meantime, I actually used the FeaturizeTextBench (which only loads a syntetic dataset and featurizes it) to run benchmarks on my previous readMultilines PR:
#5125 (comment)

As I mention there, I'm not sure how to interpret the results, so I will address this tomorrow on scrum.

In reply to: 428398494 [](ancestors = 428398494,428393614,428386984)

So I've added a new TextLoaderBench Benchmark test, based on TextFeaturizerBench and CacheDataViewBench. On my new benchmark I simply load a DataView using TextLoader from a syntetic file, get the getter of its first 20 columns and read its 3000 rows without saving them into memory.

The results are as follows. Notice that I still don't understand how to interpret them, as they're somewhat unexpected... it seems they indicate there's not much change across runs, but it might be just as well that perhaps my benchmark isn't conclusive on what it's measuring. I wouldn't know 😕

(notice that even when the mean is very similar across scenarios, the Error is consistently higher without my PRs)

Without this PR, and also without my readMultilines PR

| Method | Mean | Error | StdDev | Extra Metric | |---------------------- |--------:|--------:|---------:|-------------:| | TestTextLoaderGetters | 1.256 s | 3.273 s | 0.1794 s | - |

With this PR, with readMultilines = false and escapeChar =" (i.e. with defaults)

| Method | Mean | Error | StdDev | Extra Metric | |---------------------- |---------:|---------:|---------:|-------------:| | TestTextLoaderGetters | 810.5 ms | 765.0 ms | 41.93 ms | - |

With this PR, readMultilines = true, escapeChar = "

| Method | Mean | Error | StdDev | Extra Metric | |---------------------- |--------:|---------:|---------:|-------------:| | TestTextLoaderGetters | 1.089 s | 0.7992 s | 0.0438 s | - |

With this PR, readMultilines = true, escapeChar = \

| Method | Mean | Error | StdDev | Extra Metric | |---------------------- |---------:|---------:|---------:|-------------:| | TestTextLoaderGetters | 857.9 ms | 696.7 ms | 38.19 ms | - |

// * Legends * Mean : Arithmetic mean of all measurements Error : Half of 99.9% confidence interval StdDev : Standard deviation of all measurements Extra Metric : Value of the provided extra metric 1 s : 1 Second (1 sec)

In reply to: 428493474 [](ancestors = 428493474,428398494,428393614,428386984)

src/Microsoft.ML.Data/DataLoadSave/Text/TextLoaderParser.cs

src/Microsoft.ML.Data/DataLoadSave/Text/TextLoader.cs

harishsk · 2020-05-21T18:24:35Z

test/Microsoft.ML.Benchmarks/TextLoaderBench.cs

+                AllowQuoting = true,
+                //ReadMultilines = true,
+                //EscapeChar = '\\',
+            });


Can we do separate benchmarks for different combination of options? If you think that is going to take time, you can back out this file and let the rest of this PR go through and add the benchmark in a separate PR.

As discussed offline, I'll keep the benchmark here but uncomment the ReadMultilines and EscapeChar options to actually test them on the benchmark.

harishsk

antoniovs1029 added 5 commits May 20, 2020 02:52

Added support for escapeChar to escape quotes inside quoted fields

ae99160

Fixed bug in logic when field is quoted and empty

22434c1

Added serialization logic

862de8b

Added tests including backcompat test

caa4dc8

Updated manifest

cd26dfb

antoniovs1029 requested a review from a team as a code owner May 20, 2020 11:17

antoniovs1029 requested review from justinormont, harishsk and mstfbl May 20, 2020 11:18

antoniovs1029 commented May 20, 2020

View reviewed changes

test/Microsoft.ML.TestFramework/TestCommandBase.cs Show resolved Hide resolved

antoniovs1029 mentioned this pull request May 20, 2020

Added decimal marker option in TextLoader #5145

Merged

antoniovs1029 mentioned this pull request May 20, 2020

support sweeping multiline option in AutoML #5148

Merged

antoniovs1029 commented May 20, 2020

View reviewed changes

src/Microsoft.ML.Data/DataLoadSave/Text/TextLoader.cs Outdated Show resolved Hide resolved

harishsk reviewed May 21, 2020

View reviewed changes

src/Microsoft.ML.Data/DataLoadSave/Text/TextLoaderParser.cs Show resolved Hide resolved

justinormont reviewed May 21, 2020

View reviewed changes

src/Microsoft.ML.Data/DataLoadSave/Text/TextLoader.cs Outdated Show resolved Hide resolved

antoniovs1029 added 6 commits May 21, 2020 00:43

Fix whitespace

32fffe2

Throw exceptions if escapeChar is also used as separator

b2d5025

Fixed small bug on multireader

5484d11

Updated manifest

fa16dd8

Fixed mistake on deserialization logic

20e07f4

Added Benchmark test

e331ebe

harishsk reviewed May 21, 2020

View reviewed changes

Updated benchmark

098a56e

harishsk approved these changes May 21, 2020

View reviewed changes

antoniovs1029 changed the title ~~Add escapeChar support to TextLoader~~ Add escapeChar support to TextLoader and added benchmark for TextLoader May 21, 2020

harishsk merged commit c576d5e into dotnet:master May 21, 2020

justinormont mentioned this pull request Jun 6, 2020

Error in ML.net training #4464

Closed

ghost locked as resolved and limited conversation to collaborators Mar 18, 2022

	// Please note that these branched tight loops are intended and performance critical.
	if (_seps.Length == 1)
	{
	for (; ; ichCur++)
	{
	Contracts.Assert(ichCur <= ichLim);
	if (ichCur >= ichLim)
	break;
	if (_sep0 == span[ichCur])
	break;
	}
	}
	else if (_seps.Length == 2)
	{
	for (; ; ichCur++)
	{
	Contracts.Assert(ichCur <= ichLim);
	if (ichCur >= ichLim)
	break;
	if (_sep0 == span[ichCur] \|\| _sep1 == span[ichCur])
	break;
	}
	}
	else
	{
	for (; ; ichCur++)
	{
	Contracts.Assert(ichCur <= ichLim);
	if (ichCur >= ichLim)
	break;
	if (IsSep(span[ichCur]))
	break;
	}
	}

Add escapeChar support to TextLoader and added benchmark for TextLoader #5147

Add escapeChar support to TextLoader and added benchmark for TextLoader #5147

Uh oh!

Conversation

antoniovs1029 commented May 20, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

antoniovs1029 May 20, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

harishsk May 20, 2020

Choose a reason for hiding this comment

Uh oh!

justinormont May 21, 2020 • edited by antoniovs1029 Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

antoniovs1029 May 21, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

justinormont May 21, 2020 • edited by antoniovs1029 Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

codecov bot commented May 20, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Uh oh!

harishsk May 21, 2020

Choose a reason for hiding this comment

Uh oh!

antoniovs1029 May 21, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

harishsk May 21, 2020

Choose a reason for hiding this comment

Uh oh!

antoniovs1029 May 21, 2020

Choose a reason for hiding this comment

Uh oh!

antoniovs1029 May 21, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

harishsk May 21, 2020

Choose a reason for hiding this comment

Uh oh!

antoniovs1029 May 21, 2020

Choose a reason for hiding this comment

Uh oh!

harishsk left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

antoniovs1029 commented May 20, 2020 •

edited

Loading

antoniovs1029 May 20, 2020 •

edited

Loading

justinormont May 21, 2020 •

edited by antoniovs1029

Loading

antoniovs1029 May 21, 2020 •

edited

Loading

justinormont May 21, 2020 •

edited by antoniovs1029

Loading

codecov bot commented May 20, 2020 •

edited

Loading

antoniovs1029 May 21, 2020 •

edited

Loading

antoniovs1029 May 21, 2020 •

edited

Loading