-
Couldn't load subscription status.
- Fork 176
Fix: CSV handling of embedded crlf #3515
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -135,6 +135,42 @@ void quoteIfRequired() { | |
| assertEquals(format(expectedPretty), getRawFormatterPretty().format(response)); | ||
| } | ||
|
|
||
| @Test | ||
| void quoteIfRequiredCrlf() { | ||
| ExecutionEngine.Schema schema = | ||
| new ExecutionEngine.Schema( | ||
| ImmutableList.of( | ||
| new ExecutionEngine.Schema.Column("field1", "field1", STRING), | ||
| new ExecutionEngine.Schema.Column("num", "num", INTEGER))); | ||
| QueryResult response = | ||
| new QueryResult( | ||
| schema, | ||
| Arrays.asList( | ||
| tupleValue(ImmutableMap.of("field1", "abc", "num", "5")), | ||
| tupleValue(ImmutableMap.of("field1", "def\r", "num", "\n6")), | ||
| tupleValue(ImmutableMap.of("field1", "gh\ri", "num", "7\n")), | ||
| tupleValue(ImmutableMap.of("field1", "jk\r\nl", "num", "\n\r8")), | ||
| tupleValue(ImmutableMap.of("field1", "\r\nmno\r\n", "num", "9\n\r0")))); | ||
| String expected = | ||
| "field1|num%n" | ||
| + "abc|5%n" | ||
| + "\"def\r\"|\"\n6\"%n" | ||
| + "\"gh\ri\"|\"7\n\"%n" | ||
| + "\"jk\r\nl\"|\"\n\r8\"%n" | ||
| + "\"\r\nmno\r\n\"|\"9\n\r0\""; | ||
| assertEquals(format(expected), getRawFormatter().format(response)); | ||
| // Pretty formatting raw or CSV data with embedded newlines and carriage | ||
| // returns will still look awful on output. | ||
|
Comment on lines
+162
to
+163
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. question (non-blocking): Is anything stopping us from fixing this? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I'm not sure what would look better for pretty-printing with newlines in fields. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. IMO escaped literals like |
||
| String expectedPretty = | ||
| "field1 |num %n" | ||
| + "abc |5 %n" | ||
| + "\"def\r\" |\"\n6\" %n" | ||
| + "\"gh\ri\" |\"7\n\" %n" | ||
| + "\"jk\r\nl\" |\"\n\r8\" %n" | ||
| + "\"\r\nmno\r\n\"|\"9\n\r0\""; | ||
| assertEquals(format(expectedPretty), getRawFormatterPretty().format(response)); | ||
| } | ||
|
|
||
| @Test | ||
| void formatError() { | ||
| Throwable t = new RuntimeException("This is an exception"); | ||
|
|
||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thought: We're now traversing the string 4 times instead of 2.
It might not have much of an impact in practice (benchmark?). But for formatting a large amount of data, I could see all these redundant checks adding up (imagining something like 1M rows of small cells with 0 special characters). Is there any built-in that can do this better?
Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think to see if something matters, we would have to benchmark it. My first thought would be to use a
java.lang.String::matcheswith a regex like"[\",\r\n]"(but the first two entries would need to be dynamically added based on the input quote and separator characters). My second would be to do a single traversal like:(edit: fixed)
For right now, if I understand OpenSearch properly the plugin is restricted to 10,000 rows. I'm guessing on that scale, the extra string traversals are harmless. But that's a completely wild guess, I won't argue with anyone who suggests otherwise.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, checks out. Unless someone is outputting very large text fields the traversals shouldn't be noticeable. I think you could theoretically nitpick and say the "c == ... || c == ..." approach is better for CPU cache locality, but I don't think there's a realistic situation where that makes a real difference. Thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You piqued my interest, so I wrote a simplistic benchmark. https://github.com/Michael-S/silly_java_string_search
Ten million random unicode strings, length 20-520, with the quotable characters sprinkled throughout. On my mid-range Intel laptop, in a single-threaded search: using the four
containscalls took about 2.5 seconds; a manual loop to look at each character with four equality checks took about 3.8 seconds; and a regex took about 6.1 second. (And the regex didn't find the same number of matches as the others, which confused me because I thought the regex was simple.)So it looks like
containsis a decent default. Maybe some JVM wizard has a better solution.