-
Notifications
You must be signed in to change notification settings - Fork 1.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add escapeChar support to TextLoader and added benchmark for TextLoader #5147
Merged
Merged
Changes from all commits
Commits
Show all changes
12 commits
Select commit
Hold shift + click to select a range
ae99160
Added support for escapeChar to escape quotes inside quoted fields
antoniovs1029 22434c1
Fixed bug in logic when field is quoted and empty
antoniovs1029 862de8b
Added serialization logic
antoniovs1029 caa4dc8
Added tests including backcompat test
antoniovs1029 cd26dfb
Updated manifest
antoniovs1029 32fffe2
Fix whitespace
antoniovs1029 b2d5025
Throw exceptions if escapeChar is also used as separator
antoniovs1029 5484d11
Fixed small bug on multireader
antoniovs1029 fa16dd8
Updated manifest
antoniovs1029 20e07f4
Fixed mistake on deserialization logic
antoniovs1029 e331ebe
Added Benchmark test
antoniovs1029 098a56e
Updated benchmark
antoniovs1029 File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it possible to refactor these two if/else blocks into separate functions for the sake of readability? (Or even better merge them into a single function/codepath?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it is possible to have this refactored into two functions or even into one path.
I didn't do this because the original writers of the Parser seemed to be very worried about efficiency and the way they addressed this is by avoiding creating / calling functions, and using bools such as the one I've used here for _escapeCharIsDoubleQuote.
So for example, we have the following code written by them, which can be obviously be refactored into a function, and perhaps even simplified to avoid the first 2 if branches, but it's stated that they didn't do it because it was "performance critical"
machinelearning/src/Microsoft.ML.Data/DataLoadSave/Text/TextLoaderParser.cs
Lines 1208 to 1241 in c023271
Following that pattern I thought that the fastest way for the code I am writing to run is to branch earlier with _escapeCharIsDoubleQuote only requiring to check it once... because if I refactor it to call functions or even worse if I merge into a single path I'll have to continuously check if _escapeCharIsDoubleQuote for it to work correctly, then it would have to take more steps and probably have a toll on performance. Also, personally I think that a merged path would be harder to read.
Do you still want me to try to refactor it? If so, I can try to come up with a refactored version for us to compare.
In reply to: 428386984 [](ancestors = 428386984)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I couldn't find any benchmark tests for text loader in the code. Can you please look up and see if I have missed something?
In reply to: 428393614 [](ancestors = 428393614,428386984)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think there's any benchmark that only tests text loader. But I think it's a good idea to add one, so I'll do that.
In the meantime, I actually used the FeaturizeTextBench (which only loads a syntetic dataset and featurizes it) to run benchmarks on my previous readMultilines PR:
#5125 (comment)
As I mention there, I'm not sure how to interpret the results, so I will address this tomorrow on scrum.
In reply to: 428398494 [](ancestors = 428398494,428393614,428386984)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So I've added a new TextLoaderBench Benchmark test, based on TextFeaturizerBench and CacheDataViewBench. On my new benchmark I simply load a DataView using TextLoader from a syntetic file, get the getter of its first 20 columns and read its 3000 rows without saving them into memory.
The results are as follows. Notice that I still don't understand how to interpret them, as they're somewhat unexpected... it seems they indicate there's not much change across runs, but it might be just as well that perhaps my benchmark isn't conclusive on what it's measuring. I wouldn't know 😕
(notice that even when the mean is very similar across scenarios, the Error is consistently higher without my PRs)
Without this PR, and also without my readMultilines PR
With this PR, with readMultilines = false and escapeChar =
"
(i.e. with defaults)With this PR, readMultilines = true, escapeChar =
"
With this PR, readMultilines = true, escapeChar =
\
In reply to: 428493474 [](ancestors = 428493474,428398494,428393614,428386984)