-
Notifications
You must be signed in to change notification settings - Fork 28.6k
[SPARK-23724][SPARK-23765][SQL] Line separator for the json datasource #20885
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
… sequence of bytes in hex like x0d 0a
Test build #88529 has finished for PR 20885 at commit
|
# Conflicts: # python/pyspark/sql/tests.py # sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/HadoopFileLinesReader.scala # sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/text/TextFileFormat.scala # sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/text/TextOptions.scala
Test build #88531 has finished for PR 20885 at commit
|
Test build #88532 has finished for PR 20885 at commit
|
Test build #88539 has finished for PR 20885 at commit
|
@cloud-fan and @hvanhovell. Do you think we need the flexible option for line separator? spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JSONOptions.scala Lines 91 to 98 in bbff402
|
Test build #88540 has finished for PR 20885 at commit
|
@@ -770,12 +773,15 @@ def json(self, path, mode=None, compression=None, dateFormat=None, timestampForm | |||
formats follow the formats at ``java.text.SimpleDateFormat``. | |||
This applies to timestamp type. If None is set, it uses the | |||
default value, ``yyyy-MM-dd'T'HH:mm:ss.SSSXXX``. | |||
:param lineSep: defines the line separator that should be used for writing. If None is | |||
set, it uses the default value, ``\\n``. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it covers all ``\\r``, ``\\r\\n`` and ``\\n``.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is a method of DataFrameWriter. It writes exactly '\n'
|
||
/** | ||
* A sequence of bytes between two consecutive json records. Format of the option is: | ||
* selector (1 char) + delimiter body (any length) | sequence of chars |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm afraid of defining our own rule here, is there any standard we can follow?
@@ -176,7 +176,7 @@ def json(self, path, schema=None, primitivesAsString=None, prefersDecimal=None, | |||
allowComments=None, allowUnquotedFieldNames=None, allowSingleQuotes=None, | |||
allowNumericLeadingZero=None, allowBackslashEscapingAnyCharacter=None, | |||
mode=None, columnNameOfCorruptRecord=None, dateFormat=None, timestampFormat=None, | |||
multiLine=None, allowUnquotedControlChars=None): | |||
multiLine=None, allowUnquotedControlChars=None, lineSep=None): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
rename it to recordDelimiter
@@ -85,6 +85,38 @@ private[sql] class JSONOptions( | |||
|
|||
val multiLine = parameters.get("multiLine").map(_.toBoolean).getOrElse(false) | |||
|
|||
val charset: Option[String] = Some("UTF-8") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It sounds like we need to review #20849 first
Please, look at #20937 |
What changes were proposed in this pull request?
Currently, TextInputJsonDataSource uses HadoopFileLinesReader to split json file to separate lines. The former one splits json lines by LineRecordReader without providing recordDelimiter. As a consequence of that, the hadoop library reads lines terminated by one of CR, LF, or CRLF. The changes allow to specify the line separator instead of using the auto detection method of hadoop library. If the separator is not specified, the line separation method of Hadoop is used by default.
How was this patch tested?
Added new tests for writing/reading json files with custom line separator