[SPARK-26108][SQL] Support custom lineSep in CSV datasource #23080
Conversation
@HyukjinKwon Could you look at the PR, please?
Test build #98979 has finished for PR 23080 at commit

jenkins, retest this, please

Test build #98980 has finished for PR 23080 at commit

jenkins, retest this, please

Test build #98982 has finished for PR 23080 at commit
```scala
 */
val lineSeparator: Option[String] = parameters.get("lineSep").map { sep =>
  require(sep.nonEmpty, "'lineSep' cannot be an empty string.")
  require(sep.length <= 2, "'lineSep' can contain 1 or 2 characters.")
```
@MaxGekk, might not be a super big deal, but I believe this should be counted after converting it into UTF-8.
We could say the line separator should be 1 or 2 bytes (UTF-8) in the read path, specifically when `multiLine` is enabled.
The uniVocity parser checks the number of chars, see https://github.com/uniVocity/univocity-parsers/blob/f616d151b48150bc9cb98943f9b6f8353b704359/src/main/java/com/univocity/parsers/common/Format.java#L120-L122, and those chars are in UTF-16, I guess.
Hm, I see.
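The chars-vs-bytes distinction discussed above can be seen directly in plain Python (an illustrative sketch, not Spark code): a separator that is a single UTF-16 char on the JVM may be several bytes once encoded as UTF-8.

```python
# Length in characters vs. length in UTF-8 bytes for candidate separators.
# "\u2028" (Unicode LINE SEPARATOR) is 1 char but 3 bytes in UTF-8.
for sep in ["\n", "\r\n", "\u2028"]:
    print(repr(sep), len(sep), len(sep.encode("utf-8")))
```

So a char-count check and a byte-count check only agree for ASCII separators like `\n` and `\r\n`.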
```scala
}

val lineSeparatorInRead: Option[Array[Byte]] = lineSeparator.map { lineSep =>
  lineSep.getBytes("UTF-8")
```
@MaxGekk, CSV's multiLine mode does not support encoding, but I think normal mode supports encoding. It should be okay to get bytes from it. We can just throw an exception when multiLine is enabled.
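Why the byte form of the separator depends on the file encoding can be sketched in plain Python (an analogy, not Spark code): the same logical `\n` is a different byte sequence in UTF-16, so hardcoding `getBytes("UTF-8")` only matches the file's bytes when the file really is UTF-8 (or an ASCII-compatible charset).

```python
# The same one-character separator encodes to different bytes per charset.
sep = "\n"
print(sep.encode("utf-8"))      # 1 byte
print(sep.encode("utf-16-le"))  # 2 bytes: the code unit plus a zero byte
```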
Ah, also,

I will try

@MaxGekk, let's rebase this one accordingly with the encoding support.
Test build #99105 has finished for PR 23080 at commit

jenkins, retest this, please

Test build #99117 has finished for PR 23080 at commit

Test build #99122 has finished for PR 23080 at commit
```diff
@@ -216,8 +232,13 @@ class CSVOptions(
     format.setDelimiter(delimiter)
     format.setQuote(quote)
     format.setQuoteEscape(escape)
+    lineSeparator.foreach { sep =>
+      format.setLineSeparator(sep)
+      format.setNormalizedNewline(0x00.toChar)
```
I know we have some problems here with setting newlines of more than 1 character, because `setNormalizedNewline` only supports one character. This is related to #18581 (comment) and uniVocity/univocity-parsers#170. That's why I thought we can only support a single character for now.
> That's why I thought we can only support this for single character for now.

OK, I will restrict line separators to one character.
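The restriction agreed on here can be sketched as a small validation function (a hypothetical Python analogue of the Scala `require` calls, not the actual Spark code):

```python
def validate_line_sep(sep: str) -> str:
    """Hypothetical analogue of the Scala require() checks for 'lineSep'."""
    if not sep:
        raise ValueError("'lineSep' cannot be an empty string.")
    if len(sep) != 1:
        raise ValueError("'lineSep' can contain only 1 character.")
    return sep

print(validate_line_sep("\n"))
```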
```diff
@@ -227,7 +248,10 @@ class CSVOptions(
     settings.setEmptyValue(emptyValueInRead)
     settings.setMaxCharsPerColumn(maxCharsPerColumn)
     settings.setUnescapedQuoteHandling(UnescapedQuoteHandling.STOP_AT_DELIMITER)
-    settings.setLineSeparatorDetectionEnabled(multiLine == true)
+    settings.setLineSeparatorDetectionEnabled(lineSeparatorInRead.isEmpty && multiLine)
+    lineSeparatorInRead.foreach { _ =>
```
nice!
Test build #99181 has finished for PR 23080 at commit
```diff
@@ -377,6 +377,8 @@ final class DataStreamReader private[sql](sparkSession: SparkSession) extends Lo
  * <li>`multiLine` (default `false`): parse one record, which may span multiple lines.</li>
  * <li>`locale` (default is `en-US`): sets a locale as language tag in IETF BCP 47 format.
  * For instance, this is used while parsing dates and timestamps.</li>
+ * <li>`lineSep` (default covers all `\r`, `\r\n` and `\n`): defines the line separator
+ * that should be used for parsing. Maximum length is 2.</li>
```
I'm sorry, can you fix `Maximum length is 2` as well? Should be good to go.
Test build #99210 has finished for PR 23080 at commit
retest this please

Last changes were only doc changes. Let me get this in.

Merged to master.

@MaxGekk, thanks for working on this one.

Test build #99215 has finished for PR 23080 at commit
I am testing lineSep with Spark 2.4. data.csv:

```
"a",1
"c",2
"d",3
```

Can you please suggest if I missed something or if the above fix has not been merged into the branch.
It's fixed in upcoming Spark. Spark 2.4 does not support it.
## What changes were proposed in this pull request?

In the PR, I propose a new option for the CSV datasource - `lineSep` - similar to the Text and JSON datasources. The option allows specifying a custom line separator of maximum length 2 characters (because of a restriction in the `uniVocity` parser). The new option can be used in reading and writing CSV files.

## How was this patch tested?

Added a few tests with custom `lineSep` for enabled/disabled `multiLine` in read, as well as tests in write. Also I added roundtrip tests.

Closes apache#23080 from MaxGekk/csv-line-sep.

Lead-authored-by: Maxim Gekk <max.gekk@gmail.com>
Co-authored-by: Maxim Gekk <maxim.gekk@databricks.com>
Signed-off-by: hyukjinkwon <gurwls223@apache.org>
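For comparison, a configurable line separator on the write path is a common CSV-library feature; here is a hedged sketch using Python's stdlib `csv` module (not Spark's writer) with its `lineterminator` parameter playing the role of `lineSep`:

```python
import csv
import io

# Write the same two rows with a CRLF separator, then with plain LF.
for terminator in ("\r\n", "\n"):
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator=terminator)
    writer.writerow(["a", 1])
    writer.writerow(["c", 2])
    print(repr(buf.getvalue()))
```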
```scala
 */
val lineSeparator: Option[String] = parameters.get("lineSep").map { sep =>
  require(sep.nonEmpty, "'lineSep' cannot be an empty string.")
  require(sep.length == 1, "'lineSep' can contain only 1 character.")
```
I currently have a project where we are importing Windows-newline (CRLF) CSV files.

I backported these changes but ran into an issue with this check, because to properly parse Windows CSV files I must be able to set "\r\n" for lineSep in the settings.

It appears the reason this require was added no longer applies, as the code for asReaderSettings/asWriterSettings never calls that function anymore.

I was able to remove this assert and am now able to import the Windows-newline CSV files into dataframes properly.

Another issue I had before this was that the very last column would always get a "\r" at the end of the column name, so something like "TEXT" would become "TEXT\r", and therefore we would be unable to query the TEXT column anymore. Setting lineSep to "\r\n" solved this issue as well.
> I must be able to set "\r\n" for lineSep in the settings.

You don't need to set `\r\n` as `lineSep` to split an input by lines, because the Hadoop Line Reader can detect `\r\n` itself. In which mode do you parse the CSV files - per-line (`multiLine = false`) or multiline?
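The "detect `\r\n` itself" behavior is similar in spirit to Python's universal-newline splitting (a rough analogy only, not the Hadoop implementation):

```python
# str.splitlines() recognizes \r, \n, and \r\n as line boundaries,
# so mixed-newline input still splits into clean records.
text = "name,age\r\nfred,30\njill,25\r"
print(text.splitlines())
```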
I am setting multiLine = "true".

The problem I am having with this is that the column name of the last column in the CSV header gets a \r added to the end of it. So if I have

```
name,age,text\r\nfred,30,"likes\r\npie,cookies,milk"\njill,30,"likes\ncake,cookies,milk"\r\n
```

I was getting a schema with

```
StringType("NAME")
IntegerType("AGE")
StringType("TEXT\r")
```

Could it be the mixed use of \r\n and \n, so it only wants to use \n for newlines?

Another issue is that the configuration for lineSep is controlled upstream by a different configuration, provided by users who have no knowledge of Spark but know how they formatted their CSV files, and without some re-architecture it is not possible to detect that this setting is \r\n and then set it to None for the CSVOptions.

`lineSeparator.foreach(format.setLineSeparator)` already handles 1 to 2 characters, so I figured this is a safe thing to support for the lineSep configuration, no?
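The stray `\r` in the last column name can be reproduced in a few lines of plain Python (a sketch of the failure mode, not Spark's parser): splitting CRLF input on `\n` alone leaves the carriage return attached to the last field of every record, including the header.

```python
data = 'name,age,text\r\nfred,30,"likes pie"\r\n'
header = data.split("\n")[0]           # splitting on '\n' only
print(header.split(","))               # last name keeps a trailing '\r'
print(header.rstrip("\r").split(","))  # stripping '\r' restores clean names
```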
For multiLine = true, we have fixed the automatic line-separator detection feature in CSV (see #22503). That will do the job.
That is taken care of in this by the following line that I backported, no?

```scala
settings.setLineSeparatorDetectionEnabled(lineSeparatorInRead.isEmpty && multiLine)
```

I am still having the issue that univocity keeps a \r in the column name with multiLine set to true and lineSeparatorInRead unset. The only way I seem to be able to get Spark to not put a \r in the column name is to specify the lineSep option explicitly as the two characters \r\n. Then I get a normal set of column names and everything else parses correctly.

I'm wondering if this is just some really pedantic CSV file that I'm working with? It's a CSV that is exported upstream by the python pandas.to_csv function with no extra arguments set.
Would you be able to file a JIRA after testing out against the master branch if the issue is persistent?
Is this feature in version 3.0? If not, when can we expect it?

@don4of4 It should be in 3.0.

@HyukjinKwon is there any plan to support a longer line separator?

That's blocked by the Univocity library's limitation. You should ask there first.
```diff
@@ -922,6 +925,8 @@ def csv(self, path, mode=None, compression=None, sep=None, quote=None, escape=No
         the default UTF-8 charset will be used.
     :param emptyValue: sets the string representation of an empty value. If None is set, it uses
         the default value, ``""``.
+    :param lineSep: defines the line separator that should be used for writing. If None is
+        set, it uses the default value, ``\\n``. Maximum length is 1 character.
```
Not sure if I'm missing something, but has this removed the ability to use `\r\n`?
Spark never supported `\r\n` in the writing path.
Revisiting this since I'd like to get rid of a local patch.

Why do you say it doesn't support this? Reverting to the 2-character restriction works in my testing, on both the read and write paths, using arbitrary two-character separators.
Sorry for the extra comments: I hadn't read deeply enough.

So the problem is Univocity's `normalizedNewline` handling? It fails in multiline cases? That's what I'm seeing in the tests, and it would explain why I don't see it in my use cases.

If that's the case, I'm wondering if it's okay to allow two characters for the non-multiline cases?
Please file a JIRA and go ahead if you can.
ack