Skip to content

Conversation

@Michael-S
Copy link
Contributor

@Michael-S Michael-S commented Apr 3, 2025

Description

According to RFC 4180 for the CSV file format,
section 2, item 6, if a CSV cell contains a
carriage return ('\r') or line feed ('\n') it
must be quoted.

Related Issues

Resolves #3514

Check List

  • [x ] New functionality includes testing.
  • New functionality has been documented. (N/A, bug fix)
  • New functionality has javadoc added. (N/A, bug fix)
  • New functionality has a user manual doc added. (N/A, bug fix)
  • API changes companion pull request created. (N/A, bug fix)
  • [x ] Commits are signed per the DCO using --signoff.
  • Public documentation issue/PR created. (N/A, bug fix)

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

According to RFC 4180 for the CSV file format,
section 2, item 6, if a CSV cell contains a
carriage return ('\r') or line feed ('\n') it
must be quoted.

Signed-off-by: Mike Swierczek <441523+Michael-S@users.noreply.github.com>
Copy link
Collaborator

@Swiddis Swiddis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the contribution!

Comment on lines +162 to +163
// Pretty formatting raw or CSV data with embedded newlines and carriage
// returns will still look awful on output.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question (non-blocking): Is anything stopping us from fixing this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure what would look better for pretty-printing with newlines in fields.

Copy link
Collaborator

@Swiddis Swiddis Apr 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO escaped literals like "\\n" would be better, but it's no big deal. Can let someone open a feature request down the road if they have a specific solution they like.

Comment on lines +87 to +90
if (cell.contains(separator)
|| cell.contains(quote)
|| cell.contains("\r")
|| cell.contains("\n")) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thought: We're now traversing the string 4 times instead of 2.

It might not have much of an impact in practice (benchmark?). But for formatting a large amount of data, I could see all these redundant checks adding up (imagining something like 1M rows of small cells with 0 special characters). Is there any built-in that can do this better?

Copy link
Contributor Author

@Michael-S Michael-S Apr 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think to see if something matters, we would have to benchmark it. My first thought would be to use a java.lang.String::matches with a regex like "[\",\r\n]" (but the first two entries would need to be dynamically added based on the input quote and separator characters). My second would be to do a single traversal like:

for (int i = 0; i < cell.length(); i++) {
   int c = cell.codePointAt(c);
   if (c == ... || c == ... || c == ... || c == ) {
       return quote + cell.replaceAll(quote, quote + quote) + quote;
    }
}
return cell;

(edit: fixed)

For right now, if I understand OpenSearch properly the plugin is restricted to 10,000 rows. I'm guessing on that scale, the extra string traversals are harmless. But that's a completely wild guess, I won't argue with anyone who suggests otherwise.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, checks out. Unless someone is outputting very large text fields the traversals shouldn't be noticeable. I think you could theoretically nitpick and say the "c == ... || c == ..." approach is better for CPU cache locality, but I don't think there's a realistic situation where that makes a real difference. Thanks!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thought: We're now traversing the string 4 times instead of 2.

It might not have much of an impact in practice (benchmark?). But for formatting a large amount of data, I could see all these redundant checks adding up (imagining something like 1M rows of small cells with 0 special characters). Is there any built-in that can do this better?

You piqued my interest, so I wrote a simplistic benchmark. https://github.com/Michael-S/silly_java_string_search
Ten million random unicode strings, length 20-520, with the quotable characters sprinkled throughout. On my mid-range Intel laptop, in a single-threaded search: using the four contains calls took about 2.5 seconds; a manual loop to look at each character with four equality checks took about 3.8 seconds; and a regex took about 6.1 second. (And the regex didn't find the same number of matches as the others, which confused me because I thought the regex was simple.)

So it looks like contains is a decent default. Maybe some JVM wizard has a better solution.

@Swiddis Swiddis added the enhancement New feature or request label Apr 4, 2025
@Swiddis Swiddis added bug Something isn't working and removed enhancement New feature or request labels Apr 4, 2025
@Swiddis
Copy link
Collaborator

Swiddis commented Apr 4, 2025

Failing test tracked in #3516

@Swiddis Swiddis merged commit a039336 into opensearch-project:main Apr 5, 2025
23 of 24 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[BUG] CSV output does not handle entries with CRLF

3 participants