[SPARK-16948][SQL] Use metastore schema instead of inferring schema for ORC in HiveMetastoreCatalog #14537
Conversation
Test build #63352 has finished for PR 14537 at commit
@@ -294,7 +294,9 @@ private[hive] class HiveMetastoreCatalog(sparkSession: SparkSession) extends Log
          ParquetFileFormat.mergeMetastoreParquetSchema(metastoreSchema, inferred)
        }.getOrElse(metastoreSchema)
      } else {
        defaultSource.inferSchema(sparkSession, options, fileCatalog.allFiles()).get
        val inferredSchema =
There's some code duplicated in both branches of this if expression. Can you refactor it to remove the duplication, please?
@rajeshbalamohan, the changes to HiveMetastoreCatalog look reasonable to me. @tejasapatil, can you help review this PR? I ask because you're the author of 1e88615, which is where the code in question comes from.
Test build #63441 has finished for PR 14537 at commit
@@ -287,14 +287,14 @@ private[hive] class HiveMetastoreCatalog(sparkSession: SparkSession) extends Log
        new Path(metastoreRelation.catalogTable.storage.locationUri.get),
        partitionSpec)

      val schema =
Thanks for refactoring this. I think it makes more sense if defaultSource.inferSchema(sparkSession, options, fileCatalog.allFiles()) is called inferredSchema, and the value of the if (fileType.equals("parquet")) expression is called schema.
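For reference, a minimal sketch of the shape being suggested, with the names proposed above. The helper, its parameters, and the merge function are hypothetical stand-ins, not the actual HiveMetastoreCatalog code:

```scala
import org.apache.spark.sql.types.StructType

// Hypothetical helper: run inference once, bind the result to `inferredSchema`,
// and let the value of the if expression be `schema`, removing the duplication
// between the two branches.
def chooseSchema(
    fileType: String,
    metastoreSchema: StructType,
    infer: () => Option[StructType],
    merge: (StructType, StructType) => StructType): StructType = {
  val inferredSchema = infer()
  val schema = if (fileType.equals("parquet")) {
    // For Parquet, reconcile the inferred schema with the metastore schema.
    inferredSchema.map(inferred => merge(metastoreSchema, inferred)).getOrElse(metastoreSchema)
  } else {
    inferredSchema.getOrElse(metastoreSchema)
  }
  schema
}
```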
Thanks @mallman. Fixed the review comments in the latest commit.
Test build #63474 has finished for PR 14537 at commit
@tejasapatil, @mallman - Can you please review when you find time?
LGTM
@rajeshbalamohan We'll need a committer to review your patch.
@rxin Can you please review when you find time?
Thank you @tejasapatil and @mallman
val schema = Try(OrcFileOperator.readSchema(
    files.map(_.getPath.toUri.toString),
    Some(sparkSession.sessionState.newHadoopConf())))
  .recover { case _: FileNotFoundException => None }
Why are we ignoring the FileNotFoundException here?
cc @cloud-fan @gatorsmile, can you also take a look at this?
Test build #64183 has finished for PR 14537 at commit
Thanks @rxin. Incorporated the review comments.
Why do we infer the schema for tables? The table schema should have been persisted to the metastore when the table was created.
Right. For Parquet, this could be part of the initial codebase (from SPARK-1251, I believe), which merges any metastore conflicts with the Parquet files. But in the case of ORC, this inference is still valid because the column names stored in the old ORC format can differ from those in the Hive metastore (e.g. HIVE-4243). There is a separate PR, #14471, which tracks the ORC compatibility issue.
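As a concrete illustration of that mismatch (the schemas below are hypothetical, not taken from the PR): an ORC file written by older Hive versions records positional field names in its footer, while the metastore keeps the user-visible names.

```scala
import org.apache.spark.sql.types._

// Column names as recorded in the Hive metastore.
val metastoreSchema = StructType(Seq(
  StructField("key", IntegerType),
  StructField("value", StringType)))

// Field names as found in the footer of an ORC file written by older Hive
// versions (the HIVE-4243 behavior): positional names instead of real names.
val orcFileSchema = StructType(Seq(
  StructField("_col0", IntegerType),
  StructField("_col1", StringType)))

// Matching by name fails, so columns must be resolved by position or the
// metastore schema must be used directly.
assert(!metastoreSchema.fieldNames.sameElements(orcFileSchema.fieldNames))
```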
@rajeshbalamohan So for ORC 2.x files, would schema inference be unnecessary?
For the latest ORC format, if the data was written out by Hive, it would have the same column mapping.
uh, I missed this ping. Will review it tonight. Thanks!
withTable("empty_text_partitioned") { | ||
spark.sql( | ||
s"""CREATE TABLE empty_text_partitioned(key INT, value STRING) | ||
| PARTITIONED BY (p INT) STORED AS TEXTFILE |
Testing the textfile format sounds useless; we do not convert it to LogicalRelation.
Fixed the test case name. I haven't changed the Parquet code path, as I wasn't sure whether it would break backward compatibility.
You might have forgotten this comment: #14537 (comment)
Thanks @gatorsmile. Removed the changes related to OrcFileFormat.
Test build #64446 has finished for PR 14537 at commit
@gatorsmile Thanks for cc'ing me. I've implemented two approaches to this issue: #14282 simply disables the ORC conversion when this case happens, and #14365 does a more complicated schema mapping. Once this is merged, I think we should fix the schema inconsistency soon.
BTW, @rajeshbalamohan, since you now use the metastore schema directly, the PR description no longer looks correct. Can you also update it? Thanks.
Test build #64449 has finished for PR 14537 at commit
@rajeshbalamohan Do you have time to update it? Thanks!
Sorry about the delay. Updated the PR.
Can you address this https://github.com/apache/spark/pull/14537/files#r76355262? Thanks!
What's the progress on this one?
The code for not throwing FileNotFoundException in OrcFileFormat.inferSchema was removed from this patch. I can create a separate JIRA for that; please let me know if that is blocking this patch.
LGTM, pending Jenkins.
Test build #65755 has finished for PR 14537 at commit
@cloud-fan The failure is related to the Parquet changes introduced for returning metastoreSchema (it has issues with complex types). I am not very comfortable with the Parquet code path, so for the time being I will revert the last change. We can create a follow-up JIRA if needed for the Parquet-related changes; alternatively, I am fine with someone who is comfortable with the Parquet code taking this over.
Test build #65765 has finished for PR 14537 at commit
    ParquetFileFormat.mergeMetastoreParquetSchema(metastoreSchema, inferred)
  }.getOrElse(metastoreSchema)
case "orc" =>
  metastoreSchema
I went through the code path again; it seems we must infer the schema here. In the metastore, we store the table schema and partition columns. HadoopFsRelation needs a dataSchema, which is the real schema of the data files. Normally that is just the table schema excluding partition columns; however, Spark SQL supports a special case: partition columns can also exist in the data files (see the doc for HadoopFsRelation.dataSchema). This information is not preserved in the metastore, so we have to infer the data schema from the data files here.
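A small sketch of the distinction being described; the column names are hypothetical:

```scala
import org.apache.spark.sql.types._

// Schema stored in the metastore: data columns plus the partition column `p`.
val tableSchema = StructType(Seq(
  StructField("key", IntegerType),
  StructField("value", StringType),
  StructField("p", IntegerType)))
val partitionColumns = Set("p")

// The usual dataSchema is the table schema without partition columns...
val dataSchema = StructType(tableSchema.filterNot(f => partitionColumns.contains(f.name)))

// ...but Spark SQL also allows partition columns to appear physically inside the
// data files, in which case the data schema is the full table schema. The
// metastore does not record which case applies, hence the inference step.
```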
@cloud-fan As we discussed offline yesterday, this is probably fine since ORC supports column pruning. Therefore, when reading an ORC file in a partitioned table, the reader always ignores partition columns stored inside the physical file and uses the values encoded in the partition directory path.
We already have a test case for this case here.
        |PARTITIONED BY (p INT) STORED AS ORC
      """.stripMargin)

val emptyDF = Seq.empty[(Int, String)].toDF("key", "value").coalesce(1)
You don't really need .coalesce(1), since the created DataFrame wraps a LocalRelation.
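A minimal sketch of that point, assuming an active SparkSession named spark:

```scala
import spark.implicits._

// The local Seq is planned as a LocalRelation, so an explicit .coalesce(1)
// adds nothing here.
val emptyDF = Seq.empty[(Int, String)].toDF("key", "value")
```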
LGTM. Thanks!
val inferredSchema =
  defaultSource.inferSchema(sparkSession, options, fileCatalog.allFiles())
inferredSchema.map { inferred =>
  ParquetFileFormat.mergeMetastoreParquetSchema(metastoreSchema, inferred)
I'm a little worried here. If the table is partitioned, metastoreSchema will always contain partition columns, and thus the merged schema will contain partition columns too. This means we always read Parquet files with partition columns; I think we may have a hidden bug.
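A hypothetical illustration of that worry; the schemas and the name-based merge below are stand-ins for the real mergeMetastoreParquetSchema:

```scala
import org.apache.spark.sql.types._

// Metastore schema of a partitioned table: it still carries the partition column `p`.
val metastoreSchema = StructType(Seq(
  StructField("key", IntegerType),
  StructField("value", StringType),
  StructField("p", IntegerType)))

// Schema actually present in the Parquet data files.
val parquetFileSchema = StructType(Seq(
  StructField("key", IntegerType),
  StructField("value", StringType)))

// Any name-based merge that starts from the metastore schema keeps `p`, so the
// resulting data schema would ask the Parquet reader for a column the files lack.
val merged = StructType(metastoreSchema.map { f =>
  parquetFileSchema.find(_.name == f.name).getOrElse(f)
})
assert(merged.fieldNames.contains("p"))
```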
Schema inference has been completely replaced with the metastore schema in #14690. I think we can close this now? cc @cloud-fan @liancheng
What changes were proposed in this pull request?
Querying empty partitioned ORC tables from spark-sql throws an exception when spark.sql.hive.convertMetastoreOrc=true. This PR fixes it by using the metastore schema for ORC files in HiveMetastoreCatalog.
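A minimal repro sketch of the reported failure; the table name is illustrative, and it assumes a Hive-enabled SparkSession named spark with spark.sql.hive.convertMetastoreOrc=true:

```scala
// Create an empty partitioned ORC table (no partitions, no data files yet).
spark.sql(
  """CREATE TABLE empty_orc_partitioned (key INT, value STRING)
    |PARTITIONED BY (p INT) STORED AS ORC""".stripMargin)

// Before this patch, this query failed because Spark tried to infer the ORC schema
// from data files that do not exist yet; with the metastore schema it simply
// returns an empty result.
spark.sql("SELECT * FROM empty_orc_partitioned").show()
```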
How was this patch tested?
Added unit tests and also tested on a small-scale cluster.