[SPARK-24190][SQL] Allow saving of JSON files in UTF-16 and UTF-32 #21247

Conversation
Test build #90257 has finished for PR 21247 at commit
val isLineSepRequired = !(multiLine == false &&
  Charset.forName(enc) != StandardCharsets.UTF_8 && lineSeparator.isEmpty)
require(isLineSepRequired, s"The lineSep option must be specified for the $enc encoding")
@MaxGekk, how about we just try to remove this restriction? I thought that's your final goal in 2.4.0.
Do you mean rewriting Hadoop's LineReader to detect \n, \r and \r\n for any encoding? If so, I am working on it, but I think we shouldn't restrict the writer until we remove the restriction for the reader.
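The separator detection discussed here has to account for the fact that the same separator character encodes to different bytes in different charsets. A minimal Java sketch of that idea (the class and method names are hypothetical, not Spark's code); it computes the separator's bytes in a given charset while stripping any BOM the encoder would prepend:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class LineSepBytes {
    // Returns the bytes of `sep` as encoded in `cs`. Encoding the separator
    // alone may prepend a BOM for charsets like UTF-16, so we encode a
    // prefix character together with the separator and strip the prefix.
    static byte[] sepBytes(String sep, Charset cs) {
        byte[] prefixed = ("a" + sep).getBytes(cs);
        byte[] prefix = "a".getBytes(cs);
        return Arrays.copyOfRange(prefixed, prefix.length, prefixed.length);
    }

    public static void main(String[] args) {
        // "\n" is one byte in UTF-8 but two bytes, in different orders,
        // in UTF-16BE and UTF-16LE.
        System.out.println(Arrays.toString(sepBytes("\n", StandardCharsets.UTF_8)));    // [10]
        System.out.println(Arrays.toString(sepBytes("\n", StandardCharsets.UTF_16BE))); // [0, 10]
        System.out.println(Arrays.toString(sepBytes("\n", StandardCharsets.UTF_16LE))); // [10, 0]
    }
}
```

A byte-oriented reader then has to search for this per-charset byte sequence instead of the single byte 0x0A.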
Yeah, and I also thought you were working on getting rid of the blacklisted encodings. This PR roughly makes sense as an intermediate state, since we have different requirements in the write and read paths; however, I think we should first try to remove the restrictions entirely, and only fall back to this change close to the release if we fail to get rid of them.
I am a bit cautious about the current change since it is a pretty new approach for datasources.
Test build #90444 has finished for PR 21247 at commit
Test build #90448 has finished for PR 21247 at commit
Test build #90457 has finished for PR 21247 at commit
// encodings which can never be present between lines.
val blacklist = Seq(Charset.forName("UTF-16"), Charset.forName("UTF-32"))
val isBlacklisted = blacklist.contains(Charset.forName(enc))
require(multiLine || !isBlacklisted,
Do we need to check the blacklist in the write path?
There is no reason to blacklist UTF-16 and UTF-32 in write. I have checked the content of written JSON files on @gatorsmile's test. For example, for UTF-16:

$ hexdump -C ...c000.json
00000000 fe ff 00 7b 00 22 00 5f 00 31 00 22 00 3a 00 22 |...{."._.1.".:."|
00000010 00 61 00 22 00 2c 00 22 00 5f 00 32 00 22 00 3a |.a.".,."._.2.".:|
00000020 00 31 00 7d 00 0a 00 7b 00 22 00 5f 00 31 00 22 |.1.}...{."._.1."|
00000030 00 3a 00 22 00 63 00 22 00 2c 00 22 00 5f 00 32 |.:.".c.".,."._.2|
00000040 00 22 00 3a 00 33 00 7d 00 0a |.".:.3.}..|
0000004a

It contains the BOM fe ff at the beginning, as expected, and the written line separator doesn't contain a BOM (look at positions 0x24-0x25): 00 7d 00 0a 00 7b. So the JSON file in UTF-16 is correct, and I think we shouldn't blacklist the UTF-16 and UTF-32 encodings.
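The single-BOM behavior observed in the hexdump can be reproduced outside Spark: the JVM's UTF-16 charset encoder writes the BOM once at the start of the stream, and subsequent \n separators are plain 00 0a. A small Java sketch (hypothetical class name, not Spark code):

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.nio.charset.Charset;

public class Utf16BomOnce {
    // Encodes `text` in the named charset through one writer and returns the
    // resulting bytes as a hex string, so the BOM placement is visible.
    static String hexOf(String text, String charsetName) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (OutputStreamWriter w = new OutputStreamWriter(buf, Charset.forName(charsetName))) {
            w.write(text);
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : buf.toByteArray()) hex.append(String.format("%02x ", b));
        return hex.toString().trim();
    }

    public static void main(String[] args) throws IOException {
        // Two tiny "records" separated and terminated by \n, as in the hexdump above.
        System.out.println(hexOf("{}\n{}\n", "UTF-16"));
        // fe ff 00 7b 00 7d 00 0a 00 7b 00 7d 00 0a
    }
}
```

The BOM fe ff appears only once at offset 0, and each separator is just 00 0a with no per-line BOM, matching the file content quoted above.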
So I could be missing something, but it seems like we might allow folks to write data they can't read back in the same mode as the write. Would it make sense to have an equivalent checkedEncoding on the write side that just logs a warning? I could have also misunderstood.
Yup and I think the final goal within 2.4.0 is to get rid of the blacklists in both read and write. Shall we just focus on getting rid of it?
@HyukjinKwon I have already implemented lineSep detection for different encodings (for UTF-16LE/BE and UTF-32LE/BE in particular). At the moment I am writing tests for that. I will prepare a PR soon.
So in that case logging the warning would be less important if we can read them back. In case we can't, would having the warning for now be OK?
The written JSON in the blacklisted encodings can be read back if multiLine is enabled. We can output a warning with such a hint.
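The reason the non-multiLine read path struggles with UTF-16/UTF-32 is that a byte-oriented line reader (like Hadoop's LineReader) splits on the single byte 0x0A, while in those encodings the separator is a multi-byte sequence. A hedged Java sketch of the failure mode (hypothetical names, not Hadoop's actual code):

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class NaiveSplit {
    // Mimics a reader that treats the single byte 0x0A as the line separator
    // regardless of the file's charset: returns the bytes before the first 0x0A.
    static byte[] firstNaiveLine(byte[] data) {
        for (int i = 0; i < data.length; i++) {
            if (data[i] == 0x0a) return Arrays.copyOfRange(data, 0, i);
        }
        return data;
    }

    public static void main(String[] args) {
        byte[] data = "{\"a\":1}\n{\"a\":2}".getBytes(StandardCharsets.UTF_16BE);
        byte[] line = firstNaiveLine(data);
        // The encoded separator is 00 0a, so splitting at the 0x0a byte leaves
        // a stray 00 on the line: 15 bytes, not a whole number of UTF-16 code
        // units, and the "line" no longer decodes back to the original record.
        System.out.println(line.length);          // 15
        System.out.println(line.length % 2 == 0); // false
    }
}
```

With multiLine enabled, the whole file is decoded as one unit, so this byte-level splitting problem does not arise, which is why the written files remain readable in that mode.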
    .save(path.getCanonicalPath)
}.getMessage
assert(e.contains(
  s"$encoding encoding in the blacklist is not allowed when multiLine is disabled"))
We can still keep this test case, right? We can change this negative test case into a positive one.
Yes, sure. I converted the test to a positive one.
Test build #91344 has finished for PR 21247 at commit
I have two minor questions if you have time to fill me in :)
val exception = intercept[IllegalArgumentException] {
  spark.read
    .option("encoding", "UTF-16")
    .json(testFile("test-data/utf16LE.json"))
Super minor, but to more closely match the previous test, maybe set multiLine to false explicitly.
I will set multiLine explicitly. We also need to check both encodings, UTF-16 and UTF-32.
Jenkins, ok to test.
Another small suggestion for improvement.
 /**
  * Standard encoding (charset) name. For example UTF-8, UTF-16LE and UTF-32BE.
- * If the encoding is not specified (None), it will be detected automatically
+ * If the encoding is not specified (None) in read, it will be detected automatically
Since this comment mentions what happens in the read path, it would probably be good to mention what happens in the write path as well (e.g. based on JsonFileFormat.scala, it looks like the write path defaults to UTF-8).
Test build #91783 has finished for PR 21247 at commit
Test build #91916 has finished for PR 21247 at commit
Test build #92183 has finished for PR 21247 at commit
Looks really close, one small question with the imports would like to clarify.
@@ -26,7 +26,8 @@ import org.apache.hadoop.mapreduce.{Job, TaskAttemptContext}
 import org.apache.spark.internal.Logging
 import org.apache.spark.sql.{AnalysisException, SparkSession}
 import org.apache.spark.sql.catalyst.InternalRow
-import org.apache.spark.sql.catalyst.json.{JacksonGenerator, JacksonParser, JSONOptions}
+import org.apache.spark.sql.catalyst.json.{JacksonGenerator, JacksonParser}
+import org.apache.spark.sql.catalyst.json.{JSONOptions, JSONOptionsInRead}
question/nit: Why is this split on 2 import statements this way?
Just to make Scala's style checker happy and to fit within the 100-character line limit. I will check whether the style checker is happy if I fold the imports into one line.
Test build #92220 has finished for PR 21247 at commit
LGTM. Thanks! Merged to master.
What changes were proposed in this pull request?

Currently, the restrictions in JSONOptions for encoding and lineSep are the same for read and for write. For example, the requirement for lineSep in the code doesn't allow skipping lineSep and using its default value \n, because an exception is thrown. In this PR, I propose to separate JSONOptions for read and write, and to make JSONOptions in write less restrictive.
How was this patch tested?

Added a new test for blacklisted encodings in read. The lineSep option was also removed in write for some tests.