Fix: CSV handling of embedded crlf #3515

Michael-S · 2025-04-03T18:27:56Z

Description

According to RFC 4180 for the CSV file format,
section 2, item 6, if a CSV cell contains a
carriage return ('\r') or line feed ('\n') it
must be quoted.
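
For illustration, a minimal sketch of the quoting rule described above. The class and method names (`CsvQuoting`, `quoteIfNeeded`) are hypothetical and are not taken from the plugin; the sketch assumes the separator and quote are passed in as strings and that an embedded quote is escaped by doubling it, as RFC 4180 section 2, item 7 describes.

```java
// Hypothetical helper for illustration only; not the plugin's actual code.
final class CsvQuoting {
    private CsvQuoting() {}

    /** Quotes a cell if it contains the separator, the quote character, CR, or LF. */
    static String quoteIfNeeded(String cell, String separator, String quote) {
        boolean needsQuoting =
            cell.contains(separator)
                || cell.contains(quote)
                || cell.contains("\r")
                || cell.contains("\n");
        if (!needsQuoting) {
            return cell;
        }
        // String.replace is a literal (non-regex) replacement, so the quote
        // character needs no escaping here. Embedded quotes are doubled.
        return quote + cell.replace(quote, quote + quote) + quote;
    }
}
```

With a comma separator and a double-quote quote character, a cell such as `a` + line feed + `b` comes back wrapped in quotes, while a cell with none of the four characters is returned unchanged.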

Related Issues

Resolves #3514

Check List

[x] New functionality includes testing.
New functionality has been documented. (N/A, bug fix)
New functionality has javadoc added. (N/A, bug fix)
New functionality has a user manual doc added. (N/A, bug fix)
API changes companion pull request created. (N/A, bug fix)
[x] Commits are signed per the DCO using --signoff.
Public documentation issue/PR created. (N/A, bug fix)

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

According to RFC 4180 for the CSV file format, section 2, item 6, if a CSV cell contains a carriage return ('\r') or line feed ('\n') it must be quoted.

Signed-off-by: Mike Swierczek <441523+Michael-S@users.noreply.github.com>

Swiddis

Thanks for the contribution!

Swiddis · 2025-04-04T16:56:43Z

...ocol/src/test/java/org/opensearch/sql/protocol/response/format/RawResponseFormatterTest.java

+    // Pretty formatting raw or CSV data with embedded newlines and carriage
+    // returns will still look awful on output.


question (non-blocking): Is anything stopping us from fixing this?

I'm not sure what would look better for pretty-printing with newlines in fields.

IMO escaped literals like "\\n" would be better, but it's no big deal. Can let someone open a feature request down the road if they have a specific solution they like.
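
A rough sketch of what the escaped-literal idea could look like for pretty-printed output. The name `PrettyEscape.escapeLineBreaks` is hypothetical, and escaping the backslash itself first is an added assumption (it keeps a literal backslash followed by `n` distinguishable from an escaped line feed); nothing here reflects an agreed design.

```java
// Hypothetical helper for illustration only; not part of this pull request.
final class PrettyEscape {
    private PrettyEscape() {}

    /** Replaces CR and LF with the two-character literals \r and \n for display. */
    static String escapeLineBreaks(String cell) {
        return cell
            .replace("\\", "\\\\") // assumption: escape backslashes first to stay unambiguous
            .replace("\r", "\\r")
            .replace("\n", "\\n");
    }
}
```

This would apply only to the pretty-printed path; the CSV output would still carry the raw characters inside a quoted cell.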

Swiddis · 2025-04-04T17:02:18Z

protocol/src/main/java/org/opensearch/sql/protocol/response/format/FlatResponseBase.java

+    if (cell.contains(separator)
+        || cell.contains(quote)
+        || cell.contains("\r")
+        || cell.contains("\n")) {


thought: We're now traversing the string 4 times instead of 2.

It might not have much of an impact in practice (benchmark?). But for formatting a large amount of data, I could see all these redundant checks adding up (imagining something like 1M rows of small cells with 0 special characters). Is there any built-in that can do this better?

I think to see if something matters, we would have to benchmark it. My first thought would be to use a java.lang.String::matches with a regex like "[\",\r\n]" (but the first two entries would need to be dynamically added based on the input quote and separator characters). My second would be to do a single traversal like:

    for (int i = 0; i < cell.length(); i++) {
      int c = cell.codePointAt(i);
      if (c == ... || c == ... || c == ... || c == ...) {
        return quote + cell.replaceAll(quote, quote + quote) + quote;
      }
    }
    return cell;

(edit: fixed)
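
A hedged sketch of the regex variant mentioned above, with the separator and quote added dynamically. One detail worth noting: `String.matches` only returns true when the whole string matches the pattern, so a single-character class passed to it will not detect a special character inside a longer cell; `Matcher.find()` searches for a match anywhere. The class and method names below are hypothetical, and alternation is used instead of a character class so that `Pattern.quote` can safely escape whatever separator and quote are supplied.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not the plugin's actual code.
final class RegexCellCheck {
    private RegexCellCheck() {}

    /** Builds a pattern matching the separator, the quote, CR, or LF. */
    static Pattern specialChars(String separator, String quote) {
        // Pattern.quote makes the separator and quote match literally,
        // even if they are regex metacharacters such as '|' or '.'.
        return Pattern.compile(
            Pattern.quote(separator) + "|" + Pattern.quote(quote) + "|\\r|\\n");
    }

    /** True if the cell contains any special character anywhere. */
    static boolean needsQuoting(Pattern special, String cell) {
        return special.matcher(cell).find();
    }
}
```

The pattern would be compiled once per formatter and reused for every cell, since compiling it per call adds avoidable overhead.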

For right now, if I understand OpenSearch properly the plugin is restricted to 10,000 rows. I'm guessing on that scale, the extra string traversals are harmless. But that's a completely wild guess, I won't argue with anyone who suggests otherwise.

Yeah, checks out. Unless someone is outputting very large text fields the traversals shouldn't be noticeable. I think you could theoretically nitpick and say the "c == ... || c == ..." approach is better for CPU cache locality, but I don't think there's a realistic situation where that makes a real difference. Thanks!

You piqued my interest, so I wrote a simplistic benchmark. https://github.com/Michael-S/silly_java_string_search
Ten million random Unicode strings, length 20-520, with the quotable characters sprinkled throughout. On my mid-range Intel laptop, in a single-threaded search: using the four contains calls took about 2.5 seconds; a manual loop to look at each character with four equality checks took about 3.8 seconds; and a regex took about 6.1 seconds. (And the regex didn't find the same number of matches as the others, which confused me because I thought the regex was simple.)

So it looks like contains is a decent default. Maybe some JVM wizard has a better solution.
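
Before comparing timings, it can help to confirm that the three strategies agree on the same inputs. The sketch below is not the linked benchmark and makes no timing claims; the class name, the fixed alphabet, and the seed are all assumptions chosen for illustration. It uses `Matcher.find()` for the regex variant because `String.matches` requires the whole string to match, which is one possible (unverified) cause of differing match counts.

```java
import java.util.Random;
import java.util.regex.Pattern;

// Hypothetical agreement check for illustration; not the linked benchmark.
final class DetectionAgreement {
    private static final Pattern SPECIAL = Pattern.compile("[\",\\r\\n]");

    private DetectionAgreement() {}

    static boolean byContains(String cell) {
        return cell.contains(",") || cell.contains("\"")
            || cell.contains("\r") || cell.contains("\n");
    }

    static boolean byLoop(String cell) {
        for (int i = 0; i < cell.length(); i++) {
            char c = cell.charAt(i);
            if (c == ',' || c == '"' || c == '\r' || c == '\n') {
                return true;
            }
        }
        return false;
    }

    static boolean byRegex(String cell) {
        return SPECIAL.matcher(cell).find();
    }

    /** Returns match counts {contains, loop, regex} over n pseudo-random cells. */
    static int[] countMatches(int n, long seed) {
        Random random = new Random(seed);
        String alphabet = "abcdefghij ,\"\r\n"; // assumed test alphabet
        int[] counts = new int[3];
        for (int row = 0; row < n; row++) {
            int length = 1 + random.nextInt(8);
            StringBuilder cell = new StringBuilder(length);
            for (int i = 0; i < length; i++) {
                cell.append(alphabet.charAt(random.nextInt(alphabet.length())));
            }
            String s = cell.toString();
            if (byContains(s)) counts[0]++;
            if (byLoop(s)) counts[1]++;
            if (byRegex(s)) counts[2]++;
        }
        return counts;
    }
}
```

If the three counts differ for any seed, the detection logic disagrees and timing comparisons between the variants are not meaningful yet.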

Swiddis · 2025-04-04T18:28:35Z

Failing test tracked in #3516

Fix: CSV handling of embedded crlf

9732554

Michael-S requested review from GumpacG, LantaoJin, MaxKsyunz, Swiddis, YANG-DB, Yury-Fridlyand, acarbonetto, anirudha, dai-chen, derek-ho, forestmvey, joshuali925, kavithacm, mengweieric, noCharger, penghuo, ps48, qianheng-aws, seankao-az and ykmr1224 as code owners April 3, 2025 18:27

Swiddis approved these changes Apr 4, 2025

Swiddis added the enhancement New feature or request label Apr 4, 2025

Swiddis mentioned this pull request Apr 4, 2025

[BUG] Flaky test: CalcitePPLDateTimeBuiltinFunctionIT > testDateFormatAndDatetimeAndFromDays #3516

Closed

Swiddis added bug Something isn't working and removed enhancement New feature or request labels Apr 4, 2025

seankao-az approved these changes Apr 5, 2025

Swiddis mentioned this pull request Apr 5, 2025

Another method + improvements Michael-S/silly_java_string_search#1

Merged

Swiddis merged commit a039336 into opensearch-project:main Apr 5, 2025
23 of 24 checks passed

xinyual mentioned this pull request Jun 11, 2025

[backport to 2.19-dev] Backport calcite prs #3752

Merged

7 tasks
