[SPARK-33566][CORE][SQL][SS][PYTHON] Make unescapedQuoteHandling option configurable when read CSV #30518
|
The original case is described in SPARK-33566 as follows: data: opencsv and commons-csv parse row 2 of h2 as follows: Without this PR, Spark parses row 2 of h2 as follows: |
|
cc @HyukjinKwon |
|
Test build #131848 has finished for PR 30518 at commit
|
sql/core/src/test/resources/test-data/unescaped-quotes-unescaped-delimiter.csv
|
Merged to master. |
|
Thanks @LuciferYang. |
|
Test build #131857 has finished for PR 30518 at commit
|
|
thx @HyukjinKwon |
|
Test build #131858 has finished for PR 30518 at commit
|
|
Test build #131860 has finished for PR 30518 at commit
|
|
Test build #131863 has finished for PR 30518 at commit
|
…UE at CSV's unescapedQuoteHandling option documentation

### What changes were proposed in this pull request?

This is rather a followup of #30518 that should be ported back to `branch-3.1` too. `STOP_AT_DELIMITER` was mistakenly used twice. The duplicated `STOP_AT_DELIMITER` should be `SKIP_VALUE` in the documentation.

### Why are the changes needed?

To correctly document.

### Does this PR introduce _any_ user-facing change?

Yes, it fixes the user-facing documentation.

### How was this patch tested?

I checked them via running linters.

Closes #32423 from HyukjinKwon/SPARK-35250.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
…UE at CSV's unescapedQuoteHandling option documentation (cherry picked from commit 8aaa9e8) Signed-off-by: HyukjinKwon <gurwls223@apache.org>
…UE at CSV's unescapedQuoteHandling option documentation (cherry picked from commit 8aaa9e8) Signed-off-by: HyukjinKwon <gurwls223@apache.org> (cherry picked from commit 89f5ec7) Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
What changes were proposed in this pull request?
Spark CSV, opencsv, and commons-csv parse some inputs differently. The typical case is described in SPARK-33566: when a value contains both unescaped quotes and an unescaped qualifier, the parsing results differ.
The reason for the difference is that Spark uses `STOP_AT_DELIMITER` as the default `UnescapedQuoteHandling` to build `CsvParser`, and it is not configurable.
On the other hand, opencsv and commons-csv use a parsing mechanism similar to `STOP_AT_CLOSING_QUOTE` by default.
So this PR makes the `unescapedQuoteHandling` option configurable, so that users can get the same parsing result as opencsv and commons-csv.
Why are the changes needed?
Making the `unescapedQuoteHandling` option configurable when reading CSV makes parsing more flexible.
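For illustration, here is a minimal PySpark-side sketch of how a caller might pass the new option. The helper names `csv_options` and `read_csv` are hypothetical and not part of this PR. The set of mode names is taken from Spark's CSV option documentation, and the sketch assumes an existing `SparkSession` is supplied by the caller.

```python
from typing import Any, Dict

# Mode names as listed in Spark's CSV option documentation
# (they mirror the parser's UnescapedQuoteHandling values).
UNESCAPED_QUOTE_MODES = frozenset({
    "STOP_AT_CLOSING_QUOTE",
    "BACK_TO_DELIMITER",
    "STOP_AT_DELIMITER",
    "SKIP_VALUE",
    "RAISE_ERROR",
})


def csv_options(mode: str = "STOP_AT_DELIMITER") -> Dict[str, str]:
    """Hypothetical helper: build CSV reader options for the given mode.

    The default mirrors the default named in the PR description.
    """
    normalized = mode.upper()  # accept lower-case input in this helper
    if normalized not in UNESCAPED_QUOTE_MODES:
        raise ValueError(f"unsupported unescapedQuoteHandling: {mode!r}")
    return {"unescapedQuoteHandling": normalized}


def read_csv(spark: Any, path: str, mode: str = "STOP_AT_CLOSING_QUOTE"):
    """Hypothetical helper: read `path` using an existing SparkSession.

    `spark` is assumed to be a pyspark.sql.SparkSession created elsewhere.
    """
    return spark.read.options(**csv_options(mode)).csv(path)
```

Calling `read_csv(spark, path)` with the default argument of this sketch selects `STOP_AT_CLOSING_QUOTE`, the mode the description compares to the opencsv and commons-csv behavior. Omitting the option entirely keeps Spark's `STOP_AT_DELIMITER` default.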
Does this PR introduce any user-facing change?
No
How was this patch tested?
Passes Jenkins or GitHub Actions.
Added a new test case similar to the one described in SPARK-33566.