
[SPARK-15474][SQL] Write and read back non-empty schema with empty dataframe #19571


Closed · wants to merge 1 commit

Conversation

@dongjoon-hyun (Member) commented Oct 25, 2017

What changes were proposed in this pull request?

Previously, the ORC file format could not write a correct schema for an empty DataFrame. Instead, it created an empty ORC file with an empty schema, struct<>, so Spark users could not write and then read back an ORC file that has a non-empty schema but no rows. This PR uses the new Apache ORC 1.4.1 library to create an empty ORC file with the correct schema, and it also switches schema inference to always use ORC 1.4.1.

BEFORE

scala> val emptyDf = Seq((true, 1, "str")).toDF("a", "b", "c").limit(0)
scala> emptyDf.write.format("orc").mode("overwrite").save("/tmp/empty")
scala> spark.read.format("orc").load("/tmp/empty").printSchema
org.apache.spark.sql.AnalysisException: Unable to infer schema for ORC. It must be specified manually.;

AFTER

scala> spark.read.format("orc").load("/tmp/empty").printSchema
root
 |-- a: boolean (nullable = true)
 |-- b: integer (nullable = true)
 |-- c: string (nullable = true)
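
For reference, the footer-level schema of the written files can also be checked directly with the ORC 1.4 reader API, independently of Spark's schema inference. This is only an illustrative sketch; it assumes the /tmp/empty path from the example above and the standard ORC/Hadoop APIs (OrcFile.createReader, Reader.getSchema).

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.orc.OrcFile

val conf = new Configuration()
val dir = new Path("/tmp/empty")
val fs = dir.getFileSystem(conf)

// Print the footer schema of every ORC part file Spark wrote.
fs.listStatus(dir).map(_.getPath).filter(_.getName.endsWith(".orc")).foreach { p =>
  val reader = OrcFile.createReader(p, OrcFile.readerOptions(conf).filesystem(fs))
  println(s"$p -> ${reader.getSchema}") // expected: struct<a:boolean,b:int,c:string>
}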

How was this patch tested?

Pass Jenkins with the newly added test cases.

val fs = FileSystem.get(conf)
val options = OrcFile.readerOptions(conf).filesystem(fs)
files.map(_.getPath).flatMap(readSchema(_, options))
  .headOption.map { schema =>
Member:

It seems you just take the first available schema. It looks like we don't need to read the other files once we have found the first available schema.

Member Author:

Yes. This is based on the existing OrcFileOperator.readSchema.
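
As a side note on the reviewer's point, the lookup could be made lazy so that footers are only read until the first schema is found. A minimal sketch, assuming readSchema(Path, ReaderOptions) returns an Option as the flatMap above suggests:

// Lazily scan footers and stop at the first file that yields a schema.
val lazySchemas = files.iterator.map(_.getPath).flatMap(path => readSchema(path, options))
// hasNext stops reading further footers as soon as one file yields a schema.
val firstSchema = if (lazySchemas.hasNext) Some(lazySchemas.next()) else None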

@@ -252,6 +253,13 @@ private[orc] class OrcOutputWriter(
  override def close(): Unit = {
    if (recordWriterInstantiated) {
      recordWriter.close(Reporter.NULL)
    } else {
      // SPARK-15474 Write empty orc file with correct schema
      val conf = context.getConfiguration()
Member:

It looks like skipping the creation of an empty file when no rows are written was deliberate. Is there any impact on the current behavior?

Member Author:

Previously, only ORC skipped creating the file, and that caused more issues such as SPARK-22258 (#19477) and SPARK-21762. The new behavior is consistent with the other data sources such as Parquet.

@@ -252,6 +253,13 @@ private[orc] class OrcOutputWriter(
  override def close(): Unit = {
    if (recordWriterInstantiated) {
Member:

Btw, according to the existing comment, it seems we could simply remove recordWriterInstantiated to allow the empty file to be created.

Member Author:

In this PR, I'm focusing on the empty-file case. We will eventually replace the whole writer and reader with ORC 1.4.1. The test case newly added in this PR will help us make that transition safely.
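
For readers following the thread, recordWriterInstantiated refers to the lazy-instantiation pattern in the Hive-based OrcOutputWriter. The toy below is a self-contained paraphrase of that pattern, not the actual Spark source: the underlying writer is created only when the first row arrives, so a task with zero rows never opens a file.

class LazyRecordWriter(openUnderlying: () => java.io.Closeable) {
  private var instantiated = false

  // Created on first use; opening the underlying writer is what creates the output file.
  private lazy val underlying: java.io.Closeable = {
    instantiated = true
    openUnderlying()
  }

  def write(row: String): Unit = {
    underlying // the first write forces creation of the real writer
  }

  def close(): Unit = {
    // Closing only when something was written is why empty outputs left no file behind.
    if (instantiated) underlying.close()
  }
}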

@SparkQA commented Oct 25, 2017

Test build #83030 has finished for PR 19571 at commit be7ba9b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

val writer = org.apache.orc.OrcFile.createWriter(
  new Path(path), org.apache.orc.mapred.OrcOutputFormat.buildOptions(conf))
new org.apache.orc.mapreduce.OrcMapreduceRecordWriter(writer)
writer.close()
Member:

So, if I understood correctly, the output will be written by org.apache.orc.mapreduce.OrcMapreduceRecordWriter when it is empty but by org.apache.hadoop.hive.ql.io.orc.OrcRecordWriter when it is non-empty? I thought we should use the same writer for both paths if possible, and this looks rather like a band-aid fix. It won't block this PR, but I wonder if this is the only way we can do it for now.

Member Author:

Yep, that's the correct understanding. This PR intentionally focuses only on handling empty files and inferring the schema. This will help us transition safely from the old Hive ORC to the new Apache ORC 1.4.1.

@@ -73,6 +70,10 @@ class OrcFileFormat extends FileFormat with DataSourceRegister with Serializable

val configuration = job.getConfiguration

configuration.set(
  MAPRED_OUTPUT_SCHEMA.getAttribute,
  org.apache.spark.sql.execution.datasources.orc.OrcFileFormat.getSchemaString(dataSchema))
Member:

Do we always need to set this?

Member Author:

Yes. This is the correct schema to be written.
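
For illustration, MAPRED_OUTPUT_SCHEMA here is ORC's OrcConf.MAPRED_OUTPUT_SCHEMA key, and the value written into the configuration is the schema in ORC's type-description syntax. A small sketch of what that string looks like for the example DataFrame above; the actual getSchemaString implementation lives in the new sql/core OrcFileFormat.

import org.apache.orc.TypeDescription

// The example DataFrame's schema in ORC type-description form.
val orcSchema = TypeDescription.fromString("struct<a:boolean,b:int,c:string>")
println(orcSchema) // struct<a:boolean,b:int,c:string>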

withTempDir { dir =>
  val path = dir.getCanonicalPath
  val emptyDf = Seq((true, 1, "str")).toDF.limit(0)
  emptyDf.write.format(format).mode("overwrite").save(path)
Member:

Hm, why not use withTempPath { path => instead, without overwrite?

Member Author:

Thanks. No problem. I'll use withTempPath.
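
A sketch of what the adjusted test could look like, assuming the suite mixes in SQLTestUtils (which provides withTempPath) and that format names the data source under test; withTempPath hands over a path that does not exist yet, so mode("overwrite") becomes unnecessary:

withTempPath { path =>
  val emptyDf = Seq((true, 1, "str")).toDF.limit(0)
  emptyDf.write.format(format).save(path.getCanonicalPath)

  // Reading back should recover the non-empty schema even though there are no rows.
  val readBack = spark.read.format(format).load(path.getCanonicalPath)
  assert(readBack.schema === emptyDf.schema)
  assert(readBack.count() === 0)
}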

files.map(_.getPath.toString),
Some(sparkSession.sessionState.newHadoopConf())
)
org.apache.spark.sql.execution.datasources.orc.OrcFileFormat.readSchema(sparkSession, files)
Member:

I am not sure about this one either. This looks like a complete rewrite of org.apache.spark.sql.hive.orc.OrcFileOperator.readSchema. Is this change required to fix this issue?

Member Author:

Yes, it's intentional. OrcFileOperator will later be replaced completely. I kept this PR as small as possible for review.

@dongjoon-hyun (Member Author) commented Oct 25, 2017

Thank you for the review, @viirya and @HyukjinKwon.
As you know, I tried to introduce a new OrcFileFormat in sql/core before, but it was too big to review. Following @cloud-fan's advice, I'm trying to upgrade the existing OrcFileFormat piece by piece.

So far,

  • We introduced the new ORC 1.4.1 dependency.
  • We introduced new Spark SQL ORC parameters and replaced the Hive constants with the new ORC parameters.

This is the first PR that actually reads and writes using the ORC 1.4.1 library.

  • It reads ORC files only to infer the schema.
  • It writes only empty ORC files.

@dongjoon-hyun (Member Author):

I updated the PR.
Could you review this PR again, @viirya, @HyukjinKwon, @gatorsmile, @cloud-fan?

@gatorsmile (Member):

What is the backward compatibility story for ORC 1.4.1? Can we take ORC files created by previous versions and ensure they are not broken?

@cloud-fan (Contributor) commented Oct 25, 2017

I checked how we introduced the new Parquet reader before, and maybe we can follow that approach: #4308

Basically, we leave the old ORC data source as it is and implement a new ORC 1.4.1 data source in the sql/core module. Then we add an internal config to switch the implementation (preferring the new implementation by default) and remove the old implementation after one or two releases.
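
A rough sketch of what such an internal switch could look like as a SQLConf entry, assuming SQLConf's internal buildConf helper; the config name and values below are purely illustrative, not what was finally committed:

// Illustrative only: an internal flag to choose between the Hive-based ORC support
// and a new ORC 1.4.1-based implementation, preferring the new one by default.
val ORC_IMPLEMENTATION = buildConf("spark.sql.orc.impl")
  .internal()
  .doc("When 'native', use the ORC 1.4.1-based data source in sql/core; " +
    "when 'hive', use the existing ORC support in sql/hive.")
  .stringConf
  .checkValues(Set("native", "hive"))
  .createWithDefault("native")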

@SparkQA commented Oct 25, 2017

Test build #83055 has finished for PR 19571 at commit 8b4fc96.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan (Contributor):

What is the backward compatibility story for ORC 1.4.1? Can we take ORC files created by previous versions and ensure they are not broken?

That's a good point, and I think it's better to have these tests in the ORC project. If they don't have them, then we can take over and add those tests.

@dongjoon-hyun (Member Author):

Thank you for the review, @gatorsmile and @cloud-fan. In particular, @cloud-fan's suggestion matches my original approach in #17980 and #18953 (before Aug 16). I couldn't agree more.

Basically we leave the old orc data source as it is, and implement a new orc 1.4.1 data source in sql core module. Then we have an internal config to switch the implementation(by default prefer the new implementation), and remove the old implementation after one or two releases.

BTW, I'm wondering what has changed since you left the following comment on that PR on Aug 16.

Have the ORC APIs changed a lot in 1.4? I was expecting a small patch to upgrade the current ORC data source, without moving it to sql/core.

@SparkQA commented Oct 26, 2017

Test build #83072 has finished for PR 19571 at commit 8d212f0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun (Member Author):

@gatorsmile and @cloud-fan:
Regarding ORC compatibility, I checked the ORC code, but it is not clearly tested.
I'll try to add a test suite as a separate issue.

@dongjoon-hyun (Member Author):

To be clear, regarding ORC file versions, there are some ORC test cases against version 0.11, but that is out of scope for us because Spark (and Hive 1.2) uses 0.12 with HIVE-8732.

Within 0.12, there are the following writer versions:

0 = original
1 = HIVE-8732 fixed (fixed stripe/file maximum statistics & string statistics use utf8 for min/max)
2 = HIVE-4243 fixed (use real column names from Hive tables)
3 = HIVE-12055 fixed (vectorized writer implementation)
4 = HIVE-13083 fixed (decimals write present stream correctly)
5 = ORC-101 fixed (bloom filters use utf8 consistently)
6 = ORC-135 fixed (timestamp statistics use utc)
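
For completeness, the file format version and the writer version recorded in a file's footer can be inspected with the ORC reader API. A minimal sketch, with a hypothetical path:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.orc.OrcFile

val reader = OrcFile.createReader(new Path("/tmp/example.orc"), OrcFile.readerOptions(new Configuration()))
println(reader.getFileVersion)   // file format version, e.g. V_0_12 (format 0.12)
println(reader.getWriterVersion) // which of the fixes listed above the writer included, e.g. ORC_101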

@cloud-fan (Contributor):

Sorry, I misunderstood the problem at the beginning. I thought the new ORC version just changed the existing APIs a little, but it turns out it has a whole new set of read/write APIs.

@dongjoon-hyun (Member Author) commented Oct 29, 2017

I see. Then, can we continue with #17980 (Make ORCFileFormat configurable between sql/hive and sql/core)?

@cloud-fan (Contributor):

yes please

@dongjoon-hyun (Member Author):

This is resolved in #19651.

@dongjoon-hyun deleted the SPARK-15474 branch on January 7, 2019 07:04