
@LuciferYang
Contributor

What changes were proposed in this pull request?

There are some differences between Spark CSV, opencsv, and commons-csv. The typical case is described in SPARK-33566: when a value contains both unescaped quotes and an unescaped qualifier, the parsing results differ.

The difference arises because Spark uses STOP_AT_DELIMITER as the default UnescapedQuoteHandling when building the CsvParser, and this setting is not configurable.

opencsv and commons-csv, on the other hand, use a parsing mechanism similar to STOP_AT_CLOSING_QUOTE by default.

So this PR makes the unescapedQuoteHandling option configurable, so that users can get the same parsing result as opencsv and commons-csv.
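As a rough illustration of how the new option might be used from PySpark, here is a minimal sketch. The option name `unescapedQuoteHandling` comes from this PR, and the listed values are the constants of univocity's `UnescapedQuoteHandling` enum; the helper name `read_csv_with_quote_handling` and the `header` setting are assumptions made only for this example.

```python
# Values of univocity's UnescapedQuoteHandling enum, which the option accepts.
VALID_MODES = {
    "STOP_AT_CLOSING_QUOTE",
    "BACK_TO_DELIMITER",
    "STOP_AT_DELIMITER",  # Spark's default, per the description above
    "SKIP_VALUE",
    "RAISE_ERROR",
}


def read_csv_with_quote_handling(spark, path, mode="STOP_AT_CLOSING_QUOTE"):
    """Hypothetical helper: read a CSV file with an explicit quote-handling mode.

    `spark` is expected to be a SparkSession (anything exposing
    `.read.option(...).csv(...)` works for this sketch).
    """
    if mode not in VALID_MODES:
        raise ValueError(f"unknown unescapedQuoteHandling mode: {mode!r}")
    return (
        spark.read
        .option("header", "true")                # assumption: first row is a header
        .option("unescapedQuoteHandling", mode)  # the option added by this PR
        .csv(path)
    )


# Usage, assuming an existing SparkSession named `spark` and a hypothetical path:
# df = read_csv_with_quote_handling(spark, "/path/to/data.csv")
```

Validating the mode before calling Spark is a design choice of this sketch, not something the PR requires.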

Why are the changes needed?

Make the unescapedQuoteHandling option configurable when reading CSV, to make parsing more flexible.

Does this PR introduce any user-facing change?

No

How was this patch tested?

  • Pass the Jenkins or GitHub Actions checks

  • Add a new test case similar to the one described in SPARK-33566

@github-actions github-actions bot added the SQL label Nov 26, 2020
@LuciferYang
Contributor Author

LuciferYang commented Nov 26, 2020

The original case described in SPARK-33566 is as follows:

data:

"h1","h2","h3"
"one","two","three"
"abc","^@<b><i><span style=""font-family: tahoma,sans-serif;"">Referral from Joe Smith.<A0> Fred is hard working.<A0> Super smart, though you wouldn&#39;t know it at first.<A0> 6 months, and we sold this project.<A0> Phooey he said to me!<A0> What&#39;s up with you people.<A0> You&#39;ll say anything for a sale!<A0> Until he met me of course....haar haar!</span></i></b><br><A0><br><b><i><span style=""font-family: tahoma,sans-serif;"">Internet is spotty</span></i></b><br><b><i><span style=""font-family: tahoma,sans-serif;"">Working while at home so.<A0> Will be applied this weekend. <A0></span></i></b><br><A0><br><b><i><span style=""font-family: tahoma,sans-serif;"">On Bill Recovery and 20 yr warranty added.</span></i></b><br><A0><br><b><i><span style=""font-family: tahoma,sans-serif;"">Kindness made this deal happen!</span></i></b><br><A0>","xyz"

opencsv and commons-csv parse the h2 value of row 2 as follows:

^@<b><i><span style=""font-family: tahoma,sans-serif;"">Referral from Joe Smith.<A0> Fred is hard working.<A0> Super smart, though you wouldn&#39;t know it at first.<A0> 6 months, and we sold this project.<A0> Phooey he said to me!<A0> What&#39;s up with you people.<A0> You&#39;ll say anything for a sale!<A0> Until he met me of course....haar haar!</span></i></b><br><A0><br><b><i><span style=""font-family: tahoma,sans-serif;"">Internet is spotty</span></i></b><br><b><i><span style=""font-family: tahoma,sans-serif;"">Working while at home so.<A0> Will be applied this weekend. <A0></span></i></b><br><A0><br><b><i><span style=""font-family: tahoma,sans-serif;"">On Bill Recovery and 20 yr warranty added.</span></i></b><br><A0><br><b><i><span style=""font-family: tahoma,sans-serif;"">Kindness made this deal happen!</span></i></b><br><A0>

Without this PR, Spark parses the h2 value of row 2 as follows:

^@<b><i><span style=""font-family: tahoma
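To make the expected behaviour easier to reproduce without Spark, here is a small sketch that uses Python's standard `csv` module as a stand-in for a reader that treats doubled quotes as escaped quotes. It is not opencsv or commons-csv, and the row below is a shortened, made-up variant of the data above. Unlike the output quoted above, `csv` also collapses each doubled quote into a single one.

```python
import csv
import io

# Shortened, illustrative variant of the problematic row: the middle field
# contains doubled quotes and a comma inside the quoted value.
row = '"abc","<span style=""font-family: tahoma,sans-serif;"">hi</span>","xyz"\n'

# The default dialect uses ',' as delimiter, '"' as quote character and
# doublequote=True, so "" inside a quoted field is read as one literal quote.
fields = next(csv.reader(io.StringIO(row)))

for i, value in enumerate(fields):
    print(i, value)
```

The point is only that the comma after `tahoma` stays inside the second field, so the row still has three columns rather than being cut at the first delimiter.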

@LuciferYang
Contributor Author

cc @HyukjinKwon

@SparkQA

SparkQA commented Nov 26, 2020

Test build #131848 has finished for PR 30518 at commit b025271.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@LuciferYang LuciferYang changed the title [SPARK-33566][SQL] Make unescapedQuoteHandling option configurable when read CSV [SPARK-33566][CORE][SQL][SS][PYTHON] Make unescapedQuoteHandling option configurable when read CSV Nov 27, 2020
@HyukjinKwon
Member

Merged to master.

@HyukjinKwon
Member

Thanks @LuciferYang.

@SparkQA

SparkQA commented Nov 27, 2020

Test build #131857 has finished for PR 30518 at commit 1770c56.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@LuciferYang
Contributor Author

thx @HyukjinKwon

@SparkQA

SparkQA commented Nov 27, 2020

Test build #131858 has finished for PR 30518 at commit ca79a48.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 27, 2020

Test build #131860 has finished for PR 30518 at commit 1646adb.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 27, 2020

Test build #131863 has finished for PR 30518 at commit ca2900d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

HyukjinKwon added a commit that referenced this pull request May 3, 2021
…UE at CSV's unescapedQuoteHandling option documentation

### What changes were proposed in this pull request?

This is rather a followup of #30518 that should be ported back to `branch-3.1` too.
`STOP_AT_DELIMITER` was mistakenly used twice. The duplicated `STOP_AT_DELIMITER` should be `SKIP_VALUE` in the documentation.

### Why are the changes needed?

To correctly document.

### Does this PR introduce _any_ user-facing change?

Yes, it fixes the user-facing documentation.

### How was this patch tested?

I checked them via running linters.

Closes #32423 from HyukjinKwon/SPARK-35250.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
HyukjinKwon added a commit that referenced this pull request May 3, 2021
…UE at CSV's unescapedQuoteHandling option documentation

### What changes were proposed in this pull request?

This is rather a followup of #30518 that should be ported back to `branch-3.1` too.
`STOP_AT_DELIMITER` was mistakenly used twice. The duplicated `STOP_AT_DELIMITER` should be `SKIP_VALUE` in the documentation.

### Why are the changes needed?

To correctly document.

### Does this PR introduce _any_ user-facing change?

Yes, it fixes the user-facing documentation.

### How was this patch tested?

I checked them via running linters.

Closes #32423 from HyukjinKwon/SPARK-35250.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
(cherry picked from commit 8aaa9e8)
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
flyrain pushed a commit to flyrain/spark that referenced this pull request Sep 21, 2021
…UE at CSV's unescapedQuoteHandling option documentation

### What changes were proposed in this pull request?

This is rather a followup of apache#30518 that should be ported back to `branch-3.1` too.
`STOP_AT_DELIMITER` was mistakenly used twice. The duplicated `STOP_AT_DELIMITER` should be `SKIP_VALUE` in the documentation.

### Why are the changes needed?

To correctly document.

### Does this PR introduce _any_ user-facing change?

Yes, it fixes the user-facing documentation.

### How was this patch tested?

I checked them via running linters.

Closes apache#32423 from HyukjinKwon/SPARK-35250.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
(cherry picked from commit 8aaa9e8)
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
(cherry picked from commit 89f5ec7)
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
@LuciferYang LuciferYang deleted the SPARK-33566 branch June 6, 2022 03:45