-
Couldn't load subscription status.
- Fork 176
Fix: CSV handling of embedded crlf #3515
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix: CSV handling of embedded crlf #3515
Conversation
According to RFC 4180 for the CSV file format,
section 2, item 6, if a CSV cell contains a
carriage return ('\r') or line feed ('\n') it
must be quoted.
Signed-off-by: Mike Swierczek <441523+Michael-S@users.noreply.github.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the contribution!
| // Pretty formatting raw or CSV data with embedded newlines and carriage | ||
| // returns will still look awful on output. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
question (non-blocking): Is anything stopping us from fixing this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure what would look better for pretty-printing with newlines in fields.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO escaped literals like "\\n" would be better, but it's no big deal. Can let someone open a feature request down the road if they have a specific solution they like.
| if (cell.contains(separator) | ||
| || cell.contains(quote) | ||
| || cell.contains("\r") | ||
| || cell.contains("\n")) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thought: We're now traversing the string 4 times instead of 2.
It might not have much of an impact in practice (benchmark?). But for formatting a large amount of data, I could see all these redundant checks adding up (imagining something like 1M rows of small cells with 0 special characters). Is there any built-in that can do this better?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think to see if something matters, we would have to benchmark it. My first thought would be to use a java.lang.String::matches with a regex like "[\",\r\n]" (but the first two entries would need to be dynamically added based on the input quote and separator characters). My second would be to do a single traversal like:
for (int i = 0; i < cell.length(); i++) {
int c = cell.codePointAt(c);
if (c == ... || c == ... || c == ... || c == ) {
return quote + cell.replaceAll(quote, quote + quote) + quote;
}
}
return cell;
(edit: fixed)
For right now, if I understand OpenSearch properly the plugin is restricted to 10,000 rows. I'm guessing on that scale, the extra string traversals are harmless. But that's a completely wild guess, I won't argue with anyone who suggests otherwise.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, checks out. Unless someone is outputting very large text fields the traversals shouldn't be noticeable. I think you could theoretically nitpick and say the "c == ... || c == ..." approach is better for CPU cache locality, but I don't think there's a realistic situation where that makes a real difference. Thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thought: We're now traversing the string 4 times instead of 2.
It might not have much of an impact in practice (benchmark?). But for formatting a large amount of data, I could see all these redundant checks adding up (imagining something like 1M rows of small cells with 0 special characters). Is there any built-in that can do this better?
You piqued my interest, so I wrote a simplistic benchmark. https://github.com/Michael-S/silly_java_string_search
Ten million random unicode strings, length 20-520, with the quotable characters sprinkled throughout. On my mid-range Intel laptop, in a single-threaded search: using the four contains calls took about 2.5 seconds; a manual loop to look at each character with four equality checks took about 3.8 seconds; and a regex took about 6.1 second. (And the regex didn't find the same number of matches as the others, which confused me because I thought the regex was simple.)
So it looks like contains is a decent default. Maybe some JVM wizard has a better solution.
|
Failing test tracked in #3516 |
Description
According to RFC 4180 for the CSV file format,
section 2, item 6, if a CSV cell contains a
carriage return ('\r') or line feed ('\n') it
must be quoted.
Related Issues
Resolves #3514
Check List
--signoff.By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.