Skip to content

[SPARK-23724][SPARK-23765][SQL] Line separator for the json datasource #20885

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 22 commits into from

Conversation

MaxGekk
Copy link
Member

@MaxGekk MaxGekk commented Mar 22, 2018

What changes were proposed in this pull request?

Currently, TextInputJsonDataSource uses HadoopFileLinesReader to split json file to separate lines. The former one splits json lines by LineRecordReader without providing recordDelimiter. As a consequence of that, the hadoop library reads lines terminated by one of CR, LF, or CRLF. The changes allow to specify the line separator instead of using the auto detection method of hadoop library. If the separator is not specified, the line separation method of Hadoop is used by default.

How was this patch tested?

Added new tests for writing/reading json files with custom line separator

@SparkQA
Copy link

SparkQA commented Mar 22, 2018

Test build #88529 has finished for PR 20885 at commit f99c1e1.

  • This patch fails Python style tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

MaxGekk added 2 commits March 22, 2018 23:58
# Conflicts:
#	python/pyspark/sql/tests.py
#	sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/HadoopFileLinesReader.scala
#	sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/text/TextFileFormat.scala
#	sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/text/TextOptions.scala
@SparkQA
Copy link

SparkQA commented Mar 22, 2018

Test build #88531 has finished for PR 20885 at commit 6d13d00.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Mar 22, 2018

Test build #88532 has finished for PR 20885 at commit 77112ef.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Mar 23, 2018

Test build #88539 has finished for PR 20885 at commit d632706.

  • This patch fails to generate documentation.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Copy link
Member

@cloud-fan and @hvanhovell. Do you think we need the flexible option for line separator?

* A sequence of bytes between two consecutive json records. Format of the option is:
* selector (1 char) + delimiter body (any length) | sequence of chars
* The following selectors are supported:
* - 'x' + sequence of bytes in hexadecimal format. For example: "x0a 0d".
* Hex pairs can be separated by any chars different from 0-9,A-F,a-f
* - '\' - reserved for a sequence of control chars like "\r\n"
* and unicode escape like "\u000D\u000A"
* - 'r' and '/' - reserved for future use

@SparkQA
Copy link

SparkQA commented Mar 23, 2018

Test build #88540 has finished for PR 20885 at commit bbff402.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -770,12 +773,15 @@ def json(self, path, mode=None, compression=None, dateFormat=None, timestampForm
formats follow the formats at ``java.text.SimpleDateFormat``.
This applies to timestamp type. If None is set, it uses the
default value, ``yyyy-MM-dd'T'HH:mm:ss.SSSXXX``.
:param lineSep: defines the line separator that should be used for writing. If None is
set, it uses the default value, ``\\n``.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it covers all ``\\r``, ``\\r\\n`` and ``\\n``.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is a method of DataFrameWriter. It writes exactly '\n'


/**
* A sequence of bytes between two consecutive json records. Format of the option is:
* selector (1 char) + delimiter body (any length) | sequence of chars
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm afraid of defining our own rule here, is there any standard we can follow?

@@ -176,7 +176,7 @@ def json(self, path, schema=None, primitivesAsString=None, prefersDecimal=None,
allowComments=None, allowUnquotedFieldNames=None, allowSingleQuotes=None,
allowNumericLeadingZero=None, allowBackslashEscapingAnyCharacter=None,
mode=None, columnNameOfCorruptRecord=None, dateFormat=None, timestampFormat=None,
multiLine=None, allowUnquotedControlChars=None):
multiLine=None, allowUnquotedControlChars=None, lineSep=None):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rename it to recordDelimiter

@@ -85,6 +85,38 @@ private[sql] class JSONOptions(

val multiLine = parameters.get("multiLine").map(_.toBoolean).getOrElse(false)

val charset: Option[String] = Some("UTF-8")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It sounds like we need to review #20849 first

@MaxGekk
Copy link
Member Author

MaxGekk commented Mar 29, 2018

Please, look at #20937

@MaxGekk MaxGekk closed this Mar 29, 2018
@MaxGekk MaxGekk deleted the json-line-sep branch August 17, 2019 13:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants