Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Original file line number Diff line number Diff line change
Expand Up @@ -84,7 +84,10 @@ protected List<List<String>> formatData(List<List<String>> lines) {

protected String quoteIfRequired(String separator, String cell) {
final String quote = "\"";
if (cell.contains(separator) || cell.contains(quote)) {
if (cell.contains(separator)
|| cell.contains(quote)
|| cell.contains("\r")
|| cell.contains("\n")) {
Comment on lines +87 to +90
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thought: We're now traversing the string 4 times instead of 2.

It might not have much of an impact in practice (benchmark?). But for formatting a large amount of data, I could see all these redundant checks adding up (imagining something like 1M rows of small cells with 0 special characters). Is there any built-in that can do this better?

Copy link
Contributor Author

@Michael-S Michael-S Apr 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think to see if something matters, we would have to benchmark it. My first thought would be to use a java.lang.String::matches with a regex like "[\",\r\n]" (but the first two entries would need to be dynamically added based on the input quote and separator characters). My second would be to do a single traversal like:

for (int i = 0; i < cell.length(); i++) {
   int c = cell.codePointAt(c);
   if (c == ... || c == ... || c == ... || c == ) {
       return quote + cell.replaceAll(quote, quote + quote) + quote;
    }
}
return cell;

(edit: fixed)

For right now, if I understand OpenSearch properly the plugin is restricted to 10,000 rows. I'm guessing on that scale, the extra string traversals are harmless. But that's a completely wild guess, I won't argue with anyone who suggests otherwise.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, checks out. Unless someone is outputting very large text fields the traversals shouldn't be noticeable. I think you could theoretically nitpick and say the "c == ... || c == ..." approach is better for CPU cache locality, but I don't think there's a realistic situation where that makes a real difference. Thanks!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thought: We're now traversing the string 4 times instead of 2.

It might not have much of an impact in practice (benchmark?). But for formatting a large amount of data, I could see all these redundant checks adding up (imagining something like 1M rows of small cells with 0 special characters). Is there any built-in that can do this better?

You piqued my interest, so I wrote a simplistic benchmark. https://github.com/Michael-S/silly_java_string_search
Ten million random unicode strings, length 20-520, with the quotable characters sprinkled throughout. On my mid-range Intel laptop, in a single-threaded search: using the four contains calls took about 2.5 seconds; a manual loop to look at each character with four equality checks took about 3.8 seconds; and a regex took about 6.1 second. (And the regex didn't find the same number of matches as the others, which confused me because I thought the regex was simple.)

So it looks like contains is a decent default. Maybe some JVM wizard has a better solution.

return quote + cell.replaceAll(quote, quote + quote) + quote;
} else {
return cell;
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -110,9 +110,11 @@ void quoteIfRequired() {
schema,
Arrays.asList(
tupleValue(ImmutableMap.of("na,me", "John,Smith", ",,age", "30,,,")),
tupleValue(ImmutableMap.of("na,me", "Line\nBreak", ",,age", "28,,,")),
tupleValue(ImmutableMap.of("na,me", "\"Janice Jones", ",,age", "26\""))));
String expected =
"\"na,me\",\",,age\"%n\"John,Smith\",\"30,,,\"%n\"\"\"Janice Jones\",\"26\"\"\"";
"\"na,me\",\",,age\"%n\"John,Smith\",\"30,,,\"%n\"Line\nBreak\",\"28,,,\"%n"
+ "\"\"\"Janice Jones\",\"26\"\"\"";
assertEquals(format(expected), formatter.format(response));
}

Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -135,6 +135,42 @@ void quoteIfRequired() {
assertEquals(format(expectedPretty), getRawFormatterPretty().format(response));
}

@Test
void quoteIfRequiredCrlf() {
ExecutionEngine.Schema schema =
new ExecutionEngine.Schema(
ImmutableList.of(
new ExecutionEngine.Schema.Column("field1", "field1", STRING),
new ExecutionEngine.Schema.Column("num", "num", INTEGER)));
QueryResult response =
new QueryResult(
schema,
Arrays.asList(
tupleValue(ImmutableMap.of("field1", "abc", "num", "5")),
tupleValue(ImmutableMap.of("field1", "def\r", "num", "\n6")),
tupleValue(ImmutableMap.of("field1", "gh\ri", "num", "7\n")),
tupleValue(ImmutableMap.of("field1", "jk\r\nl", "num", "\n\r8")),
tupleValue(ImmutableMap.of("field1", "\r\nmno\r\n", "num", "9\n\r0"))));
String expected =
"field1|num%n"
+ "abc|5%n"
+ "\"def\r\"|\"\n6\"%n"
+ "\"gh\ri\"|\"7\n\"%n"
+ "\"jk\r\nl\"|\"\n\r8\"%n"
+ "\"\r\nmno\r\n\"|\"9\n\r0\"";
assertEquals(format(expected), getRawFormatter().format(response));
// Pretty formatting raw or CSV data with embedded newlines and carriage
// returns will still look awful on output.
Comment on lines +162 to +163
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question (non-blocking): Is anything stopping us from fixing this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure what would look better for pretty-printing with newlines in fields.

Copy link
Collaborator

@Swiddis Swiddis Apr 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO escaped literals like "\\n" would be better, but it's no big deal. Can let someone open a feature request down the road if they have a specific solution they like.

String expectedPretty =
"field1 |num %n"
+ "abc |5 %n"
+ "\"def\r\" |\"\n6\" %n"
+ "\"gh\ri\" |\"7\n\" %n"
+ "\"jk\r\nl\" |\"\n\r8\" %n"
+ "\"\r\nmno\r\n\"|\"9\n\r0\"";
assertEquals(format(expectedPretty), getRawFormatterPretty().format(response));
}

@Test
void formatError() {
Throwable t = new RuntimeException("This is an exception");
Expand Down
Loading