[SPARK-24190][SQL] Allow saving of JSON files in UTF-16 and UTF-32 #21247

Conversation
Test build #90257 has finished for PR 21247 at commit
val isLineSepRequired = !(multiLine == false &&
  Charset.forName(enc) != StandardCharsets.UTF_8 && lineSeparator.isEmpty)
require(isLineSepRequired, s"The lineSep option must be specified for the $enc encoding")
@MaxGekk, how about we just try to remove this restriction? I thought that's your final goal in 2.4.0.
Do you mean rewriting Hadoop's LineReader to detect \n, \r and \r\n for any encoding? If so, I am working on it, but I think we shouldn't restrict the writer until we remove the restriction for the reader.
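The separator detection discussed here has to account for the fact that the same separator character encodes to different bytes in different charsets. A minimal Java sketch of that idea (the class and method names are hypothetical, not Spark's code); it computes the separator's bytes in a given charset while stripping any BOM the encoder would prepend:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class LineSepBytes {
    // Returns the bytes of `sep` as encoded in `cs`. Encoding the separator
    // alone may prepend a BOM for charsets like UTF-16, so we encode a
    // prefix character together with the separator and strip the prefix.
    static byte[] sepBytes(String sep, Charset cs) {
        byte[] prefixed = ("a" + sep).getBytes(cs);
        byte[] prefix = "a".getBytes(cs);
        return Arrays.copyOfRange(prefixed, prefix.length, prefixed.length);
    }

    public static void main(String[] args) {
        // "\n" is one byte in UTF-8 but two bytes, in different orders,
        // in UTF-16BE and UTF-16LE.
        System.out.println(Arrays.toString(sepBytes("\n", StandardCharsets.UTF_8)));    // [10]
        System.out.println(Arrays.toString(sepBytes("\n", StandardCharsets.UTF_16BE))); // [0, 10]
        System.out.println(Arrays.toString(sepBytes("\n", StandardCharsets.UTF_16LE))); // [10, 0]
    }
}
```

A byte-oriented reader then has to search for this per-charset byte sequence instead of the single byte 0x0A.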
Yeah, and I also thought you were working on getting rid of the blacklisted encodings. This PR roughly makes sense as an intermediate state, since we have different requirements in the write and read paths; however, I think we should first try to remove the restrictions entirely, and only fall back to this change close to the release if we fail to get rid of them.
I am a bit cautious about the current change since it is a pretty new approach for datasources.
Test build #90444 has finished for PR 21247 at commit
Test build #90448 has finished for PR 21247 at commit
Test build #90457 has finished for PR 21247 at commit
// encodings which can never be present between lines.
val blacklist = Seq(Charset.forName("UTF-16"), Charset.forName("UTF-32"))
val isBlacklisted = blacklist.contains(Charset.forName(enc))
require(multiLine || !isBlacklisted,
Do we need to check the blacklist in the write path?
There is no reason to blacklist UTF-16 and UTF-32 in write. I have checked the content of written JSON files on @gatorsmile's test. For example, for UTF-16:

$ hexdump -C ...c000.json
00000000 fe ff 00 7b 00 22 00 5f 00 31 00 22 00 3a 00 22 |...{."._.1.".:."|
00000010 00 61 00 22 00 2c 00 22 00 5f 00 32 00 22 00 3a |.a.".,."._.2.".:|
00000020 00 31 00 7d 00 0a 00 7b 00 22 00 5f 00 31 00 22 |.1.}...{."._.1."|
00000030 00 3a 00 22 00 63 00 22 00 2c 00 22 00 5f 00 32 |.:.".c.".,."._.2|
00000040 00 22 00 3a 00 33 00 7d 00 0a |.".:.3.}..|
0000004a

It contains the BOM fe ff at the beginning, as expected, and the written line separator doesn't contain a BOM (look at positions 0x24-0x25): 00 7d 00 0a 00 7b. So the JSON file in UTF-16 is correct, and I think we shouldn't blacklist the UTF-16 and UTF-32 encodings.
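The single-BOM behavior observed in the hexdump can be reproduced outside Spark: the JVM's UTF-16 charset encoder writes the BOM once at the start of the stream, and subsequent \n separators are plain 00 0a. A small Java sketch (hypothetical class name, not Spark code):

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.nio.charset.Charset;

public class Utf16BomOnce {
    // Encodes `text` in the named charset through one writer and returns the
    // resulting bytes as a hex string, so the BOM placement is visible.
    static String hexOf(String text, String charsetName) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (OutputStreamWriter w = new OutputStreamWriter(buf, Charset.forName(charsetName))) {
            w.write(text);
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : buf.toByteArray()) hex.append(String.format("%02x ", b));
        return hex.toString().trim();
    }

    public static void main(String[] args) throws IOException {
        // Two tiny "records" separated and terminated by \n, as in the hexdump above.
        System.out.println(hexOf("{}\n{}\n", "UTF-16"));
        // fe ff 00 7b 00 7d 00 0a 00 7b 00 7d 00 0a
    }
}
```

The BOM fe ff appears only once at offset 0, and each separator is just 00 0a with no per-line BOM, matching the file content quoted above.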
So I could be missing something, but it seems like we might allow folks to write data they can't read back in the same mode as the write. Would it make sense to have an equivalent checkedEncoding on the write side that just logs a warning? I could have also misunderstood.
Yup and I think the final goal within 2.4.0 is to get rid of the blacklists in both read and write. Shall we just focus on getting rid of it?
@HyukjinKwon I have already implemented lineSep detection for different encodings (for UTF-16LE/BE and UTF-32LE/BE in particular). At the moment I am writing tests for that. I will prepare a PR soon.
So in that case logging the warning would be less important if we can read them back. In case we can't, would having the warning for now be OK?
The written JSON in the blacklisted encodings can be read back if multiLine is enabled. We can output a warning with such a hint.
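The reason the non-multiLine read path struggles with UTF-16/UTF-32 is that a byte-oriented line reader (like Hadoop's LineReader) splits on the single byte 0x0A, while in those encodings the separator is a multi-byte sequence. A hedged Java sketch of the failure mode (hypothetical names, not Hadoop's actual code):

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class NaiveSplit {
    // Mimics a reader that treats the single byte 0x0A as the line separator
    // regardless of the file's charset: returns the bytes before the first 0x0A.
    static byte[] firstNaiveLine(byte[] data) {
        for (int i = 0; i < data.length; i++) {
            if (data[i] == 0x0a) return Arrays.copyOfRange(data, 0, i);
        }
        return data;
    }

    public static void main(String[] args) {
        byte[] data = "{\"a\":1}\n{\"a\":2}".getBytes(StandardCharsets.UTF_16BE);
        byte[] line = firstNaiveLine(data);
        // The encoded separator is 00 0a, so splitting at the 0x0a byte leaves
        // a stray 00 on the line: 15 bytes, not a whole number of UTF-16 code
        // units, and the "line" no longer decodes back to the original record.
        System.out.println(line.length);          // 15
        System.out.println(line.length % 2 == 0); // false
    }
}
```

With multiLine enabled, the whole file is decoded as one unit, so this byte-level splitting problem does not arise, which is why the written files remain readable in that mode.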
    .save(path.getCanonicalPath)
}.getMessage
assert(e.contains(
  s"$encoding encoding in the blacklist is not allowed when multiLine is disabled"))
We can still keep this test case, right? We can change this negative test case into a positive one.
Yes, sure. I converted the test to a positive one.
Test build #91344 has finished for PR 21247 at commit
I have two minor questions if you have time to fill me in :)
val exception = intercept[IllegalArgumentException] {
  spark.read
    .option("encoding", "UTF-16")
    .json(testFile("test-data/utf16LE.json"))
Super minor, but to more closely match the previous test, maybe set multiLine to false explicitly.
I will set multiLine explicitly. We also need to check both encodings, UTF-16 and UTF-32.
Jenkins, ok to test.
Another small suggestion for improvement.
 /**
  * Standard encoding (charset) name. For example UTF-8, UTF-16LE and UTF-32BE.
- * If the encoding is not specified (None), it will be detected automatically
+ * If the encoding is not specified (None) in read, it will be detected automatically
Since this comment mentions what happens in the read path, it would probably be good to mention what happens in the write path as well (e.g. based on JsonFileFormat.scala, it looks like the write path defaults to UTF-8).
Test build #91783 has finished for PR 21247 at commit
Test build #91916 has finished for PR 21247 at commit
Test build #92183 has finished for PR 21247 at commit
Looks really close, one small question with the imports would like to clarify.
@@ -26,7 +26,8 @@ import org.apache.hadoop.mapreduce.{Job, TaskAttemptContext}
 import org.apache.spark.internal.Logging
 import org.apache.spark.sql.{AnalysisException, SparkSession}
 import org.apache.spark.sql.catalyst.InternalRow
-import org.apache.spark.sql.catalyst.json.{JacksonGenerator, JacksonParser, JSONOptions}
+import org.apache.spark.sql.catalyst.json.{JacksonGenerator, JacksonParser}
+import org.apache.spark.sql.catalyst.json.{JSONOptions, JSONOptionsInRead}
question/nit: Why is this split on 2 import statements this way?
Just to make Scala's style checker happy and to fit within the 100-character line limit. I will check whether the style checker is happy if I fold the imports into one line.
Test build #92220 has finished for PR 21247 at commit
LGTM. Thanks! Merged to master.
What changes were proposed in this pull request?

Currently, the restrictions in JSONOptions for encoding and lineSep are the same for read and for write. For example, the requirement for lineSep in the code doesn't allow skipping lineSep and using its default value \n, because an exception is thrown. In this PR, I propose to separate JSONOptions for read and write, and to make JSONOptions in write less restrictive.
How was this patch tested?

Added a new test for blacklisted encodings in read. The lineSep option was also removed in write for some tests.