[SPARK-23724][SPARK-23765][SQL] Line separator for the json datasource #20885

MaxGekk · 2018-03-22T22:31:07Z

What changes were proposed in this pull request?

Currently, TextInputJsonDataSource uses HadoopFileLinesReader to split a JSON file into separate lines. HadoopFileLinesReader in turn splits the lines with LineRecordReader without providing a recordDelimiter. As a consequence, the Hadoop library treats any of CR, LF, or CRLF as a line terminator. This change allows the line separator to be specified explicitly instead of relying on the Hadoop library's auto-detection. If the separator is not specified, Hadoop's line separation method is used by default.
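To make the difference between the two modes concrete, here is a minimal pure-Python sketch. It is not Spark or Hadoop code; the function name `split_records` and its `line_sep` parameter are hypothetical and only mimic the behaviour described above: with no separator, any of CRLF, CR, or LF ends a record, and with an explicit separator only that exact byte sequence does.

```python
import re
from typing import List, Optional


def split_records(data: bytes, line_sep: Optional[bytes] = None) -> List[bytes]:
    """Split raw bytes into records (illustration only, not Spark's reader).

    line_sep=None: any of CRLF, CR, or LF terminates a record.
    Otherwise: only the exact byte sequence line_sep terminates a record.
    """
    if line_sep is None:
        # CRLF is listed first so it is consumed as a single terminator.
        parts = re.split(rb"\r\n|\r|\n", data)
    else:
        if not line_sep:
            raise ValueError("line_sep must not be empty")
        parts = data.split(line_sep)
    # A trailing terminator leaves one empty final piece; drop it.
    if parts and parts[-1] == b"":
        parts.pop()
    return parts


mixed = b'{"a":1}\r\n{"a":2}\n{"a":3}\r'
auto = split_records(mixed)                          # three records
custom = split_records(b'{"a":1}\n|{"a":2}', b"|")   # "\n" stays in the record
```

With an explicit separator, a stray `\n` is no longer a record boundary, which is the behaviour the new option is meant to allow.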

How was this patch tested?

Added new tests for writing and reading JSON files with a custom line separator.

… sequence of bytes in hex like x0d 0a

SparkQA · 2018-03-22T22:35:56Z

Test build #88529 has finished for PR 20885 at commit f99c1e1.

This patch fails Python style tests.
This patch does not merge cleanly.
This patch adds no public classes.

# Conflicts:
#   python/pyspark/sql/tests.py
#   sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/HadoopFileLinesReader.scala
#   sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/text/TextFileFormat.scala
#   sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/text/TextOptions.scala

SparkQA · 2018-03-22T23:10:55Z

Test build #88531 has finished for PR 20885 at commit 6d13d00.

This patch fails Python style tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2018-03-22T23:15:53Z

Test build #88532 has finished for PR 20885 at commit 77112ef.

This patch fails Python style tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2018-03-23T08:57:06Z

Test build #88539 has finished for PR 20885 at commit d632706.

This patch fails to generate documentation.
This patch merges cleanly.
This patch adds no public classes.

HyukjinKwon · 2018-03-23T12:15:40Z

@cloud-fan and @hvanhovell, do you think we need this flexible option for the line separator?

spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JSONOptions.scala

Lines 91 to 98 in bbff402

   * A sequence of bytes between two consecutive json records. Format of the option is:
   *   selector (1 char) + delimiter body (any length) | sequence of chars
   * The following selectors are supported:
   * - 'x' + sequence of bytes in hexadecimal format. For example: "x0a 0d".
   *   Hex pairs can be separated by any chars different from 0-9,A-F,a-f
   * - '\' - reserved for a sequence of control chars like "\r\n"
   *         and unicode escape like "\u000D\u000A"
   * - 'r' and '/' - reserved for future use
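The option format quoted above can be sketched as a small decoder. This is a hedged Python illustration of the documented rules, not the PR's Scala implementation: the name `parse_line_separator` is hypothetical, raising an error for the reserved selectors is an assumption, and encoding a plain character sequence as UTF-8 is also an assumption.

```python
import re

# Selectors the quoted comment marks as reserved.
RESERVED_SELECTORS = ("\\", "r", "/")


def parse_line_separator(option: str) -> bytes:
    """Decode a separator option into bytes (illustrative sketch only)."""
    if not option:
        raise ValueError("separator option must not be empty")
    selector, body = option[0], option[1:]
    if selector == "x":
        # Hex pairs may be separated by any characters other than 0-9, A-F, a-f.
        tokens = [t for t in re.split(r"[^0-9A-Fa-f]+", body) if t]
        if not tokens or any(len(t) % 2 for t in tokens):
            raise ValueError(f"malformed hex separator: {option!r}")
        return bytes.fromhex("".join(tokens))
    if selector in RESERVED_SELECTORS:
        # Assumption: reserved selectors are rejected rather than interpreted.
        raise NotImplementedError(f"selector {selector!r} is reserved")
    # No selector matched: treat the whole option as a sequence of chars.
    # Assumption: the chars are encoded as UTF-8.
    return option.encode("utf-8")


sep = parse_line_separator("x0a 0d")  # the example from the comment
```

One consequence the sketch makes visible: under this format, a plain-character separator cannot begin with one of the selector characters, which is the kind of custom rule the format introduces.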

SparkQA · 2018-03-23T14:37:51Z

Test build #88540 has finished for PR 20885 at commit bbff402.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2018-03-23T18:06:23Z

python/pyspark/sql/readwriter.py

@@ -770,12 +773,15 @@ def json(self, path, mode=None, compression=None, dateFormat=None, timestampForm
                                formats follow the formats at ``java.text.SimpleDateFormat``.
                                This applies to timestamp type. If None is set, it uses the
                                default value, ``yyyy-MM-dd'T'HH:mm:ss.SSSXXX``.
+        :param lineSep: defines the line separator that should be used for writing. If None is
+                        set, it uses the default value, ``\\n``.


it covers all ``\\r``, ``\\r\\n`` and ``\\n``.

MaxGekk (in reply): This is a method of DataFrameWriter, so it writes exactly '\n'.
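The point of this exchange is that the two sides are asymmetric: a reader can accept several terminators, but a writer must emit exactly one. A minimal pure-Python sketch of the write side follows; it is not PySpark code, and the helper `to_json_lines` is hypothetical. Only the `lineSep` idea and its `\n` default come from the docstring under review, and emitting the separator after every record, including the last, is an assumption.

```python
import json
from typing import Any, Iterable, Mapping


def to_json_lines(records: Iterable[Mapping[str, Any]], line_sep: str = "\n") -> str:
    """Serialize each record as one JSON document followed by line_sep."""
    # Assumption: the separator follows every record, including the last one.
    return "".join(json.dumps(record) + line_sep for record in records)


default_output = to_json_lines([{"a": 1}, {"a": 2}])
crlf_output = to_json_lines([{"a": 1}, {"a": 2}], line_sep="\r\n")
```

Output produced with the default can be read back by a reader that accepts `\r`, `\r\n`, and `\n`, while the writer itself never has to choose among them.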

cloud-fan · 2018-03-23T18:09:26Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JSONOptions.scala

+
+  /**
+   * A sequence of bytes between two consecutive json records. Format of the option is:
+   *   selector (1 char) + delimiter body (any length) | sequence of chars


I'm wary of defining our own rule here. Is there any standard we can follow?

gatorsmile · 2018-03-25T23:20:58Z

python/pyspark/sql/readwriter.py

@@ -176,7 +176,7 @@ def json(self, path, schema=None, primitivesAsString=None, prefersDecimal=None,
             allowComments=None, allowUnquotedFieldNames=None, allowSingleQuotes=None,
             allowNumericLeadingZero=None, allowBackslashEscapingAnyCharacter=None,
             mode=None, columnNameOfCorruptRecord=None, dateFormat=None, timestampFormat=None,
-             multiLine=None, allowUnquotedControlChars=None):
+             multiLine=None, allowUnquotedControlChars=None, lineSep=None):


rename it to recordDelimiter

gatorsmile · 2018-03-25T23:26:17Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JSONOptions.scala

@@ -85,6 +85,38 @@ private[sql] class JSONOptions(

  val multiLine = parameters.get("multiLine").map(_.toBoolean).getOrElse(false)

+  val charset: Option[String] = Some("UTF-8")


It sounds like we need to review #20849 first

MaxGekk · 2018-03-29T14:57:36Z

Please look at #20937.

MaxGekk added 18 commits March 22, 2018 20:42

Adding the delimiter option encoded in base64

a794988

Separator encoded as a sequence of bytes in hex

dccdaa2

Refactoring: removed unused imports and renaming a parameter

d0abab7

The sep option is renamed to recordSeparator. The supported format is…

6741796

… sequence of bytes in hex like x0d 0a

Renaming recordSeparator to recordDelimiter

e4faae1

Comments for the recordDelimiter option

01f4ef5

Support other formats of recordDelimiter

24cedb9

Checking different charsets and record delimiters

d40dda2

Renaming test's method to make it more readable

ad6496c

Test of reading json in different charsets and delimiters

358863d

Fix inferring of csv schema for any charsets

7e5be5e

Fix errors of scalastyle check

d138d2d

Reserving format for regular expressions and concatenated json

c26ef5d

Fix recordDelimiter tests

5f0b069

Additional cases are added to the delimiter test

ef8248f

Renaming recordDelimiter to lineSeparator

2efac08

Adding HyukjinKwon changes

b2020fa

Revert lineSepInWrite back to lineSep

f99c1e1

MaxGekk added 2 commits March 22, 2018 23:58

Fix passing of the lineSeparator to HadoopFileLinesReader

77112ef

MaxGekk mentioned this pull request Mar 22, 2018

[SPARK-23765][SQL] Supports custom line separator for json datasource #20877

Closed

Fix python style checking

d632706

Fix text source tests and javadoc comments

bbff402

cloud-fan reviewed Mar 23, 2018


gatorsmile reviewed Mar 25, 2018


MaxGekk closed this Mar 29, 2018

MaxGekk deleted the json-line-sep branch August 17, 2019 13:34
