
[SPARK-24190][SQL] Allow saving of JSON files in UTF-16 and UTF-32 #21247

Closed
MaxGekk wants to merge 15 commits

Conversation

@MaxGekk (Member) commented May 5, 2018

What changes were proposed in this pull request?

Currently, the restrictions in JSONOptions on encoding and lineSep are the same for read and for write. For example, because of the lineSep requirement in the code, the call

df.write.option("encoding", "UTF-32BE").json(file)

does not allow skipping lineSep and using its default value \n; instead it throws:

java.lang.IllegalArgumentException: requirement failed: The lineSep option must be specified for the UTF-32BE encoding

In this PR, I propose to separate JSONOptions for the read and write paths, and to make the write-side JSONOptions less restrictive.
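For illustration, a minimal sketch (paths and data are made up, not from the PR) of the write call this change enables: lineSep is omitted and the default \n separator is used.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("json-utf32be-write").getOrCreate()
import spark.implicits._

// Write a tiny DataFrame in UTF-32BE without specifying lineSep;
// with the relaxed write-side options the default \n separator applies.
Seq(("a", 1), ("c", 3)).toDF("_1", "_2")
  .write
  .option("encoding", "UTF-32BE")
  .json("/tmp/json-utf32be")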

How was this patch tested?

Added a new test for blacklisted encodings on the read path. The lineSep option was also removed from the write calls in some existing tests.

@SparkQA commented May 5, 2018

Test build #90257 has finished for PR 21247 at commit 7ea7dec.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


val isLineSepRequired = !(multiLine == false &&
  Charset.forName(enc) != StandardCharsets.UTF_8 && lineSeparator.isEmpty)
require(isLineSepRequired, s"The lineSep option must be specified for the $enc encoding")
Member

@MaxGekk, how about we just try to remove this restriction? I thought that's your final goal in 2.4.0.

@MaxGekk (Author)

Do you mean rewriting Hadoop's LineReader to detect \n, \r and \r\n for any encoding? If so, I am working on it, but I think we shouldn't restrict the writer until we remove the restriction for the reader.
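Not the actual LineReader change, just a rough sketch of the idea under discussion: the byte patterns for \r\n, \r and \n have to be derived from the target charset rather than hard-coded as single ASCII bytes. The delimiterCandidates helper below is hypothetical.

import java.nio.charset.Charset

// Hypothetical helper: candidate line delimiters encoded in the charset
// of the file being read (longest pattern first, so \r\n wins over \r).
def delimiterCandidates(enc: String): Seq[Array[Byte]] = {
  val charset = Charset.forName(enc)
  Seq("\r\n", "\r", "\n").map(_.getBytes(charset))
}

// "\n" is a single byte 0a in UTF-8, but 00 00 00 0a in UTF-32BE.
delimiterCandidates("UTF-32BE").foreach { bytes =>
  println(bytes.map("%02x".format(_)).mkString(" "))
}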

Member

Yea, and I also thought you were working on getting rid of the blacklisted encodings. Roughly, this PR makes sense as an intermediate state since we have different requirements on the write and read paths; however, I think we should first try to remove the restrictions, and only make the current change at the last minute if we fail to get rid of them before the release gets close.

I am a bit cautious about the current change since it's a pretty new approach for datasources.

@SparkQA commented May 10, 2018

Test build #90444 has finished for PR 21247 at commit e786953.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented May 10, 2018

Test build #90448 has finished for PR 21247 at commit 9c366a0.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented May 10, 2018

Test build #90457 has finished for PR 21247 at commit 97c4af7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile (Member)

cc @gengliangwang

// encodings which can never be present between lines.
val blacklist = Seq(Charset.forName("UTF-16"), Charset.forName("UTF-32"))
val isBlacklisted = blacklist.contains(Charset.forName(enc))
require(multiLine || !isBlacklisted,
Member

Do we need to check the blacklist in the write path?

@MaxGekk (Author) commented May 12, 2018

There is no reason to blacklist UTF-16 and UTF-32 in write. I have checked the content of the written JSON files in @gatorsmile's test. For example, for UTF-16:

$ hexdump -C ...c000.json
00000000  fe ff 00 7b 00 22 00 5f  00 31 00 22 00 3a 00 22  |...{."._.1.".:."|
00000010  00 61 00 22 00 2c 00 22  00 5f 00 32 00 22 00 3a  |.a.".,."._.2.".:|
00000020  00 31 00 7d 00 0a 00 7b  00 22 00 5f 00 31 00 22  |.1.}...{."._.1."|
00000030  00 3a 00 22 00 63 00 22  00 2c 00 22 00 5f 00 32  |.:.".c.".,."._.2|
00000040  00 22 00 3a 00 33 00 7d  00 0a                    |.".:.3.}..|
0000004a

It contains the BOM fe ff at the beginning, as expected, and the written line separator does not contain a BOM (see positions 0x24-0x25 in the sequence 00 7d 00 0a 00 7b). So the JSON file in UTF-16 is correct, and I think we shouldn't blacklist the UTF-16 and UTF-32 encodings.
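For completeness, a hedged sketch of the same check done programmatically instead of with hexdump; the output path and the way the part file is located are illustrative.

import java.nio.file.{Files, Paths}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("utf16-bom-check").getOrCreate()
import spark.implicits._

val out = "/tmp/json-utf16"
Seq(("a", 1), ("c", 3)).toDF("_1", "_2")
  .coalesce(1)
  .write
  .mode("overwrite")
  .option("encoding", "UTF-16")
  .json(out)

// The first two bytes of the part file should be the UTF-16 BOM fe ff,
// while the bytes between records stay the plain separator 00 0a.
val part = new java.io.File(out).listFiles().map(_.getPath).filter(_.endsWith(".json")).head
val firstBytes = Files.readAllBytes(Paths.get(part)).take(2).map("%02x".format(_)).mkString(" ")
println(firstBytes)  // expected: fe ff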

Contributor

So I could be missing something, but it seems like we might allow folks to write data they can't read back in the same mode as the write. Would it make sense to have an equivalent checkedEncoding on the write side that just logs a warning for folks? I could have also misunderstood.

@HyukjinKwon (Member) commented Jun 8, 2018

Yup, and I think the final goal within 2.4.0 is to get rid of the blacklists in both read and write. Shall we just focus on getting rid of them?

@MaxGekk (Author) commented Jun 8, 2018

@HyukjinKwon I have already implemented lineSep detection for different encodings (for UTF-16LE/BE and UTF-32LE/BE in particular). At the moment I am writing tests for that. I will prepare a PR soon.

Contributor

So in that case logging the warning would be less important if we can read them back. In case we can't, would having the warning for now be OK?

@MaxGekk (Author)

The written JSON in the blacklisted encodings can be read back if multiLine is enabled. We can output a warning with such a hint.
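A minimal sketch of that round trip (path and data are illustrative): output written in a blacklisted encoding is read back with multiLine enabled.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("utf16-roundtrip").getOrCreate()
import spark.implicits._

val path = "/tmp/json-utf16-roundtrip"

// Write in UTF-16, which is blacklisted on the per-line read path ...
Seq(("a", 1)).toDF("_1", "_2")
  .write
  .mode("overwrite")
  .option("encoding", "UTF-16")
  .json(path)

// ... and read it back with multiLine enabled, which bypasses the
// per-line splitting that the blacklist protects.
spark.read
  .option("encoding", "UTF-16")
  .option("multiLine", true)
  .json(path)
  .show()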

.save(path.getCanonicalPath)
}.getMessage
assert(e.contains(
s"$encoding encoding in the blacklist is not allowed when multiLine is disabled"))
Member

We can still keep this test case, right? We can change this negative test case into a positive test case.

@MaxGekk (Author)

Yes, sure. I converted the test to a positive one.
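Not the exact test committed in the PR, but a sketch of the positive shape it could take, assuming JsonSuite-style helpers (withTempPath, checkAnswer): the write now succeeds and the data survives a multiLine read back.

Seq("UTF-16", "UTF-32").foreach { encoding =>
  withTempPath { path =>
    val df = Seq(("Dog", 42)).toDF()
    df.write.option("encoding", encoding).json(path.getCanonicalPath)

    // Read back with the same encoding; multiLine avoids per-line splitting.
    val readBack = spark.read
      .schema(df.schema)
      .option("encoding", encoding)
      .option("multiLine", true)
      .json(path.getCanonicalPath)

    checkAnswer(readBack, df)
  }
}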

@SparkQA commented May 31, 2018

Test build #91344 has finished for PR 21247 at commit 2285652.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@holdenk (Contributor) left a comment

I have two minor questions if you have time to fill me in :)

val exception = intercept[IllegalArgumentException] {
  spark.read
    .option("encoding", "UTF-16")
    .json(testFile("test-data/utf16LE.json"))
Contributor

Super minor, but to more closely match the previous test, maybe set multiLine to false explicitly.

@MaxGekk (Author) commented Jun 15, 2018

I will set multiLine explicitly. We also need to check both encodings, UTF-16 and UTF-32.
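A sketch of what the adjusted read test might look like, assuming JsonSuite-style helpers (testFile and ScalaTest's intercept): multiLine is set to false explicitly and both blacklisted encodings are covered; the exact error message may differ.

Seq("UTF-16", "UTF-32").foreach { encoding =>
  val e = intercept[IllegalArgumentException] {
    spark.read
      .option("encoding", encoding)
      .option("multiLine", false)
      .json(testFile("test-data/utf16LE.json"))
      .count()
  }
  // The blacklist check should mention the rejected encoding.
  assert(e.getMessage.contains(encoding))
}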

@holdenk (Contributor) commented Jun 13, 2018

Jenkins ok to test.

@holdenk (Contributor) left a comment

Another small suggestion for improvement.

 /**
  * Standard encoding (charset) name. For example UTF-8, UTF-16LE and UTF-32BE.
- * If the encoding is not specified (None), it will be detected automatically
+ * If the encoding is not specified (None) in read, it will be detected automatically
Contributor

Since this comment mentions what happens in the read path, it would probably be good to also mention what happens in the write path (e.g. based on JsonFileFormat.scala, it looks like the write path defaults to UTF-8).
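One possible wording (a sketch, not the committed doc text) that mentions both paths:

/**
 * Standard encoding (charset) name. For example UTF-8, UTF-16LE and UTF-32BE.
 * If the encoding is not specified (None) in read, it will be detected automatically.
 * In write, UTF-8 is used when the encoding is not specified.
 */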

@SparkQA commented Jun 13, 2018

Test build #91783 has finished for PR 21247 at commit 2285652.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MaxGekk changed the title from "[SPARK-24190] Separating JSONOptions for read" to "[SPARK-24190][SQL] Allow saving of JSON files in UTF-16 and UTF-32" on Jun 15, 2018
@SparkQA commented Jun 15, 2018

Test build #91916 has finished for PR 21247 at commit c1971a5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jun 22, 2018

Test build #92183 has finished for PR 21247 at commit ca1b243.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class ForeachBatchFunction(object):
  • case class ArrayDistinct(child: Expression)
  • class PythonForeachWriter(func: PythonFunction, schema: StructType)
  • class UnsafeRowBuffer(taskMemoryManager: TaskMemoryManager, tempDir: File, numFields: Int)
  • trait MemorySinkBase extends BaseStreamingSink with Logging
  • class MemorySink(val schema: StructType, outputMode: OutputMode, options: DataSourceOptions)
  • class ForeachBatchSink[T](batchWriter: (Dataset[T], Long) => Unit, encoder: ExpressionEncoder[T])
  • trait PythonForeachBatchFunction
  • case class ForeachWriterProvider[T](
  • case class ForeachWriterFactory[T](
  • class ForeachDataWriter[T](
  • class MemoryWriter(
  • class MemoryStreamWriter(

@holdenk (Contributor) left a comment

Looks really close; one small question about the imports I'd like to clarify.

@@ -26,7 +26,8 @@ import org.apache.hadoop.mapreduce.{Job, TaskAttemptContext}
 import org.apache.spark.internal.Logging
 import org.apache.spark.sql.{AnalysisException, SparkSession}
 import org.apache.spark.sql.catalyst.InternalRow
-import org.apache.spark.sql.catalyst.json.{JacksonGenerator, JacksonParser, JSONOptions}
+import org.apache.spark.sql.catalyst.json.{JacksonGenerator, JacksonParser}
+import org.apache.spark.sql.catalyst.json.{JSONOptions, JSONOptionsInRead}
Contributor

Question/nit: why is this split into two import statements this way?

@MaxGekk (Author)

Just to make Scala's style checker happy and to fit the 100-character line limit. I will check whether the style checker is happy if I fold the imports into one line.

@crafty-coder (Contributor)

Hi @holdenk, I'm working on a similar PR (#20949) to allow setting the encoding when writing CSV files.

It would be strange if you could set the encoding when saving a JSON file but couldn't do the same when saving a CSV.

It would be nice if you could take a look!

Thank you 🐱

@SparkQA commented Jun 22, 2018

Test build #92220 has finished for PR 21247 at commit 2ad308d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile (Member)

LGTM

Thanks! Merged to master.

@asfgit closed this in c7e2742 on Jun 24, 2018
@MaxGekk deleted the json-options-in-write branch on August 17, 2019