[SPARK-21839][SQL] Support SQL config for ORC compression #19055
Conversation
Test build #81139 has finished for PR 19055 at commit
Hi, @cloud-fan and @gatorsmile .
Test build #81140 has finished for PR 19055 at commit
FYI, this PR can be reviewed before the other ORC PRs.
Hi, @gatorsmile .
/**
 * Compression codec to use. By default snappy compression.
 * Compression codec to use. By default use the value specified in SQLConf.
This is confusing. You can remove "By default use the value specified in SQLConf."
Sure, I will. Historically, I brought this from ParquetOptions.
 * Acceptable values are defined in [[shortOrcCompressionCodecNames]].
 */
val compressionCodec: String = {
val compressionCodecClassName: String = {
// `orc.compress` is a ORC configuration. So, here we respect this as an option but
// `compression` has higher precedence than `orc.compress`. It means if both are set,
// we will use `compression`.
Instead, update this paragraph to explain the priority.
Sure.
"uncompressed, snappy, zlib.")
  .stringConf
  .transform(_.toLowerCase(Locale.ROOT))
  .checkValues(Set("uncompressed", "snappy", "zlib"))
Why is this inconsistent with `shortOrcCompressionCodecNames`?
// The ORC compression short names
private val shortOrcCompressionCodecNames = Map(
  "none" -> "NONE",
  "uncompressed" -> "NONE",
  "snappy" -> "SNAPPY",
  "zlib" -> "ZLIB",
  "lzo" -> "LZO")
}
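The short-name map above can be sketched as a standalone lookup. This is a hypothetical mini-version for illustration, not the actual Spark source; `OrcCodecNames`, `resolve`, and the error message are invented names:

```scala
import java.util.Locale

object OrcCodecNames {
  // Mirror of the short-name map quoted above.
  private val shortOrcCompressionCodecNames = Map(
    "none" -> "NONE",
    "uncompressed" -> "NONE",
    "snappy" -> "SNAPPY",
    "zlib" -> "ZLIB",
    "lzo" -> "LZO")

  // Normalize a user-supplied codec name, case-insensitively; the
  // getOrElse default is by-name, so the throw only happens on a miss.
  def resolve(name: String): String =
    shortOrcCompressionCodecNames.getOrElse(
      name.toLowerCase(Locale.ROOT),
      throw new IllegalArgumentException(s"Codec [$name] is not available."))
}
```

Note how both `"none"` and `"uncompressed"` normalize to the same ORC value `"NONE"`, which is the asymmetry discussed in this thread.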
`lzo` will be added later. For `none`, I thought the usage in SQLConf was discouraged intentionally, like in Parquet. I wanted to show the same behavior as Parquet.
I added "none" here. Thanks!
I am not familiar with ORC. The above is just a quick look at the changes made in this PR.
 * Acceptable values are defined in [[shortOrcCompressionCodecNames]].
 */
val compressionCodec: String = {
val compressionCodecClassName: String = {
// `orc.compress` is a ORC configuration. So, here we respect this as an option but
// `compression` has higher precedence than `orc.compress`. It means if both are set,
// we will use `compression`.
I guess we should update here too:

* This will override `orc.compress`.</li>

and I think we should prioritise `compression` > `spark.sql.orc.compression.codec` > `orc.compress` to be consistent.
Oh, I see.
Ur, there is a technical issue. `spark.sql.orc.compression.codec` has a default value, `snappy`. So, `orc.compress` cannot be used in that order.
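To illustrate the issue: because the SQL config always resolves to a value (its default is `snappy`), any key placed after it in the lookup chain is unreachable, so the conf has to sit last. A minimal sketch with illustrative names (`resolveCodec` and `sessionConfCodec` are not the Spark internals):

```scala
// Sketch: the session conf always has a value, so it must come last in the
// chain; if it came before `orc.compress`, that option would never be read.
def resolveCodec(options: Map[String, String], sessionConfCodec: String): String =
  options.get("compression")             // highest precedence: `compression` option
    .orElse(options.get("orc.compress")) // then the native ORC option
    .getOrElse(sessionConfCodec)         // finally the conf (default "snappy")
```

With this ordering, `orc.compress` is still honored whenever the `compression` option is absent, and the conf only kicks in when neither option is set.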
"uncompressed, snappy, zlib.")
  .stringConf
  .transform(_.toLowerCase(Locale.ROOT))
  .checkValues(Set("uncompressed", "snappy", "zlib"))
Hm, but `lzo` still seeks `org.apache.hadoop.hive.ql.io.orc.LzoCodec` though. Wouldn't it be better to leave it as what it is intended for now, if I understood correctly? It should be quite rare, but disallowing `lzo` actually closes off the chance of a custom use allowed via the `.option()` API.
I think `none` was introduced as a compression key for `.option()` first for Parquet, and it was matched for consistency. I think we have kept `none` specific to the `.option()` API somehow.
Yep. "none" is added back. "lzo" is not supported; it fails in the current master branch, and we have an ignored test case for that LZO failure.
 * Acceptable values are defined in [[shortOrcCompressionCodecNames]].
 */
val compressionCodec: String = {
val compressionCodecClassName: String = {
I guess you matched this to `ParquetOptions`, but actually I think it's `ParquetOptions` that should be changed. I found this minor nit, but only after it got merged - #15580 (comment)
Oh, may I change Parquet at this time?
Anyway, I'll revert this for the other data sources.
Thank you for review, @gatorsmile and @HyukjinKwon . I'll update the PR.
Test build #81178 has finished for PR 19055 at commit
Thank you again, @gatorsmile and @HyukjinKwon . I updated the PR.
"uncompressed, snappy, zlib.")
  .stringConf
  .transform(_.toLowerCase(Locale.ROOT))
  .checkValues(Set("none", "uncompressed", "snappy", "zlib"))
@dongjoon-hyun, I think my only main concern is the inconsistency between the `compression` option and this config. If `lzo` were an unknown key and it directly threw an exception (or even if this were not there in the first place), I would have been okay, but it looks like it attempts to find `org.apache.hadoop.hive.ql.io.orc.LzoCodec`:
java.lang.IllegalArgumentException: LZO is not available.
at org.apache.hadoop.hive.ql.io.orc.WriterImpl.createCodec(WriterImpl.java:331)
at org.apache.hadoop.hive.ql.io.orc.WriterImpl.<init>(WriterImpl.java:201)
at org.apache.hadoop.hive.ql.io.orc.OrcFile.createWriter(OrcFile.java:464)
at org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat$OrcRecordWriter.write(OrcOutputFormat.java:74)
at org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat$OrcRecordWriter.write(OrcOutputFormat.java:55)
...
...
case LZO:
try {
Class<? extends CompressionCodec> lzo =
(Class<? extends CompressionCodec>)
JavaUtils.loadClass("org.apache.hadoop.hive.ql.io.orc.LzoCodec");
return lzo.newInstance();
} catch (ClassNotFoundException e) {
throw new IllegalArgumentException("LZO is not available.", e);
} catch (InstantiationException e) {
throw new IllegalArgumentException("Problem initializing LZO", e);
} catch (IllegalAccessException e) {
throw new IllegalArgumentException("Insufficient access to LZO", e);
}
...
It appears that if we somehow provide `org.apache.hadoop.hive.ql.io.orc.LzoCodec` in the classpath (as an extreme case, one implemented by a user), it should work, if I read this correctly. This should be quite rare though.
Does your set of ORC-related PRs eventually support `lzo` correctly?
I see. You mean user-provided Hive or libraries. It makes sense. I will add `lzo` here too. Thank you for pointing that out.
Test build #81199 has finished for PR 19055 at commit
Retest this please.
test("SPARK-21839: Add SQL config for ORC compression") {
  val conf = sqlContext.sessionState.conf
  assert(new OrcOptions(Map.empty[String, String], conf).compressionCodec == "SNAPPY")
Add a comment here to explain the test scenario:
// to test that the default of `spark.sql.orc.compression.codec` is `snappy`
Sure
  new OrcOptions(Map("orc.compress" -> "zlib"), conf).compressionCodec == "ZLIB")
}
Seq("NONE", "SNAPPY", "ZLIB", "LZO").foreach { c =>
// to test all the valid options of `spark.sql.orc.compression.codec`: "none", "uncompressed", "snappy", "zlib", "lzo". You missed one of them.
Ur, it's intentional. It's tested above; "UNCOMPRESSED" is replaced with "NONE", so I omitted it here.
To add that, we need an `if` condition. I'll add that.
withSQLConf(SQLConf.ORC_COMPRESSION.key -> "uncompressed") {
  assert(new OrcOptions(Map.empty[String, String], conf).compressionCodec == "NONE")
  assert(
    new OrcOptions(Map("orc.compress" -> "zlib"), conf).compressionCodec == "ZLIB")
Also, please add another scenario where users specify `compression`.
Thank you for review again, @gatorsmile . I fixed them.
// Test all the valid options of spark.sql.orc.compression.codec
Seq("NONE", "UNCOMPRESSED", "SNAPPY", "ZLIB", "LZO").foreach { c =>
  withSQLConf(SQLConf.ORC_COMPRESSION.key -> c) {
    if (c == "UNCOMPRESSED") {
val expected = if ... else c
assert(new OrcOptions(Map.empty[String, String], conf).compressionCodec == expected)
Yep! It's much better.
Thanks!
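The suggested refactor collapses the `if` branch into a single `expected` value. A runnable sketch of the pattern, with `normalizeCodec` as a hypothetical stand-in for the real `OrcOptions` codec normalization (not the Spark class itself):

```scala
import java.util.Locale

// Hypothetical stand-in for the OrcOptions normalization under test:
// "uncompressed" maps to "NONE"; every other codec keeps its own name.
def normalizeCodec(c: String): String =
  if (c.toLowerCase(Locale.ROOT) == "uncompressed") "NONE"
  else c.toUpperCase(Locale.ROOT)

// All valid options; only "UNCOMPRESSED" needs a different expected value,
// so the asymmetry lives in one `val expected` instead of an if/else block.
Seq("NONE", "UNCOMPRESSED", "SNAPPY", "ZLIB", "LZO").foreach { c =>
  val expected = if (c == "UNCOMPRESSED") "NONE" else c
  assert(normalizeCodec(c) == expected)
}
```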
Test build #81218 has finished for PR 19055 at commit
Test build #81219 has finished for PR 19055 at commit
Test build #81220 has finished for PR 19055 at commit
LGTM except for minor comments. @gatorsmile, do you maybe have more comments?
@@ -18,12 +18,13 @@
package org.apache.spark.sql.hive.orc

import java.io.File
import java.util.Locale
Looks unused in this test.
Oops. Thanks!
val orcCompressionConf = parameters.get(OrcRelation.ORC_COMPRESSION)
val codecName = parameters
  .get("compression")
  .orElse(orcCompressionConf)
  .getOrElse("snappy").toLowerCase(Locale.ROOT)
  .getOrElse(sqlConf.orcCompressionCodec)
Could we update the default values for consistency with the Parquet one?

* <li>`compression` (default `snappy`): compression codec to use when saving to file. This can be

spark/python/pyspark/sql/readwriter.py, line 855 in 51620e2:

default value, ``snappy``.
The default value is snappy, isn't it?
.createWithDefault("snappy")
I was thinking like:

spark/python/pyspark/sql/readwriter.py, lines 751 to 753 in 51620e2:

This will override ``spark.sql.parquet.compression.codec``. If None is set, it uses the value specified in ``spark.sql.parquet.compression.codec``.

Wouldn't we use the value set in `spark.sql.parquet.compression.codec` by default if `compression` is unset via the option API?
Actually, I thought the purpose of this configuration is rather for setting the default compression codec for ORC datasource ..
Yes. This is the priority. If `compression` and `orc.compression` are unset via option, we use SQLConf:

`compression` -> `orc.compression` -> `spark.sql.orc.compression.codec`
The main purpose of this PR is to support users to control ORC compression by using SQLConf, too.
The default codec is unchanged and the priority is the same. Also, all previous user-given options are respected.
Ah, it looks like I should have been clearer. I meant fixing the comment for the default value from `snappy` to `spark.sql.orc.compression.codec`.
Thank you! I'll fix that.
Thank you very much for review, @HyukjinKwon ! 👍
According to your review comments, I updated the comment, too.
Test build #81244 has finished for PR 19055 at commit
Test build #81246 has finished for PR 19055 at commit
Test build #81247 has finished for PR 19055 at commit
// `orc.compress` is a ORC configuration. So, here we respect this as an option but
// `compression` has higher precedence than `orc.compress`. It means if both are set,
// we will use `compression`.
// `compression`, `orc.compress`, and `spark.sql.orc.compression.codec` is used in order.
`is used in order` -> `are in order of precedence from highest to lowest`
LGTM except a minor comment.
Thank you, @gatorsmile . I fixed it, too.
Test build #81265 has finished for PR 19055 at commit
Merged to master.
Thank you for merging, @HyukjinKwon !
Also, thank you again, @gatorsmile !
What changes were proposed in this pull request?

This PR aims to support `spark.sql.orc.compression.codec` like Parquet's `spark.sql.parquet.compression.codec`. Users can use SQLConf to control ORC compression, too.

How was this patch tested?

Pass the Jenkins with new and updated test cases.