
[SPARK-5852] [SQL] Passdown the schema for Parquet File in HiveContext #4562


Closed
wants to merge 3 commits

Conversation

chenghao-intel
Contributor

Parquet does not allow the table directory to be empty; for example, the following query fails:

CREATE TABLE parquet_test (id int, str string) STORED AS PARQUET;
SELECT * FROM parquet_test;

It throws an exception like:

java.lang.UnsupportedOperationException: empty.reduceLeft
    at scala.collection.TraversableOnce$class.reduceLeft(TraversableOnce.scala:167)
    at scala.collection.mutable.ArrayBuffer.scala$collection$IndexedSeqOptimized$$super$reduceLeft(ArrayBuffer.scala:47)
    at scala.collection.IndexedSeqOptimized$class.reduceLeft(IndexedSeqOptimized.scala:68)
    at scala.collection.mutable.ArrayBuffer.reduceLeft(ArrayBuffer.scala:47)
    at scala.collection.TraversableOnce$class.reduce(TraversableOnce.scala:195)
    at scala.collection.AbstractTraversable.reduce(Traversable.scala:105)
    at org.apache.spark.sql.parquet.ParquetRelation2$.readSchema(newParquet.scala:633)
    at org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache.org$apache$spark$sql$parquet$ParquetRelation2$MetadataCache$$readSchema(newParquet.scala:349)
    at org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$refresh$8.apply(newParquet.scala:290)
    at org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$refresh$8.apply(newParquet.scala:290)
    at scala.Option.getOrElse(Option.scala:120)
    at org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache.refresh(newParquet.scala:290)
    at org.apache.spark.sql.parquet.ParquetRelation2.<init>(newParquet.scala:354)
    at org.apache.spark.sql.hive.HiveMetastoreCatalog.org$apache$spark$sql$hive$HiveMetastoreCatalog$$convertToParquetRelation(HiveMetastoreCatalog.scala:218)
    at org.apache.spark.sql.hive.HiveMetastoreCatalog$ParquetConversions$$anonfun$apply$4.apply(HiveMetastoreCatalog.scala:446)
    at org.apache.spark.sql.hive.HiveMetastoreCatalog$ParquetConversions$$anonfun$apply$4.apply(HiveMetastoreCatalog.scala:445)
    at scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:51)
    at scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:60)
    at scala.collection.mutable.ArrayBuffer.foldLeft(ArrayBuffer.scala:47)
    at org.apache.spark.sql.hive.HiveMetastoreCatalog$ParquetConversions$.apply(HiveMetastoreCatalog.scala:445)
    at org.apache.spark.sql.hive.HiveMetastoreCatalog$ParquetConversions$.apply(HiveMetastoreCatalog.scala:422)
    at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(RuleExecutor.scala:61)
    at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(RuleExecutor.scala:59)
    at scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:111)
    at scala.collection.immutable.List.foldLeft(List.scala:84)
    at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1.apply(RuleExecutor.scala:59)
    at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1.apply(RuleExecutor.scala:51)
    at scala.collection.immutable.List.foreach(List.scala:318)
    at org.apache.spark.sql.catalyst.rules.RuleExecutor.apply(RuleExecutor.scala:51)
    at org.apache.spark.sql.SQLContext$QueryExecution.analyzed$lzycompute(SQLContext.scala:917)
    at org.apache.spark.sql.SQLContext$QueryExecution.analyzed(SQLContext.scala:917)
    at org.apache.spark.sql.DataFrameImpl.<init>(DataFrameImpl.scala:61)
    at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:35)
    at org.apache.spark.sql.hive.HiveContext.sql(HiveContext.scala:77)
    at org.apache.spark.sql.hive.thriftserver.AbstractSparkSQLDriver.run(AbstractSparkSQLDriver.scala:57)
    at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:275)
    at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:423)
    at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:211)
    at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
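The root cause is visible at the top of the trace: `readSchema` calls `reduce` on the per-file schemas, and `reduce` throws on an empty collection. A minimal standalone illustration (not the Spark code itself; the `String` schemas are stand-ins):

```scala
// Stand-in for the per-file Parquet schemas collected from the table directory.
// When the directory is empty (a freshly created table), the collection is empty.
val schemas = Seq.empty[String]

// This is what effectively happens inside readSchema for an empty directory:
// schemas.reduce(_ + _)  // throws java.lang.UnsupportedOperationException: empty.reduceLeft

// A safe alternative returns an Option instead of throwing:
val merged: Option[String] = schemas.reduceOption(_ + _)
assert(merged.isEmpty)

// With at least one schema present, reduceOption behaves like reduce:
assert(Seq("a", "b").reduceOption(_ + _) == Some("ab"))
```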

@SparkQA

SparkQA commented Feb 12, 2015

Test build #27344 has started for PR 4562 at commit 33867c0.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Feb 12, 2015

Test build #27344 has finished for PR 4562 at commit 33867c0.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27344/
Test FAILed.

@SparkQA

SparkQA commented Feb 13, 2015

Test build #27414 has started for PR 4562 at commit cbb5460.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Feb 13, 2015

Test build #27414 has finished for PR 4562 at commit cbb5460.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27414/
Test PASSed.

@chenghao-intel
Contributor Author

@liancheng

@SparkQA

SparkQA commented Feb 17, 2015

Test build #27607 has started for PR 4562 at commit a04930b.

  • This patch merges cleanly.

@chenghao-intel chenghao-intel changed the title [SQL] [Minor] Passdown the schema for Parquet File in HiveContext [SPARK-5852] [SQL] Passdown the schema for Parquet File in HiveContext Feb 17, 2015
  paths,
- Map(ParquetRelation2.METASTORE_SCHEMA -> metastoreSchema.json))(hive))
+ Map(ParquetRelation2.METASTORE_SCHEMA -> metastoreSchema.json),
+ Some(metastoreSchema))(hive))
Contributor

OK, we can leave this file unchanged.

Contributor

Yeah, evil case insensitivity...

@SparkQA

SparkQA commented Feb 17, 2015

Test build #27607 has finished for PR 4562 at commit a04930b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27607/
Test PASSed.

parquetSchema = readSchema().getOrElse(maybeSchema.get)
} catch {
  case e: Throwable => throw new SparkException(s"Failed to find schema for ${paths.mkString(",")}", e)
}
Contributor

How about this:

parquetSchema = {
  if (maybeSchema.isDefined) {
    maybeSchema.get
  } else {
    (readSchema(), maybeMetastoreSchema) match {
      case (Some(dataSchema), _) => dataSchema
      case (None, Some(metastoreSchema)) => metastoreSchema
      case (None, None) =>
        throw new SparkException("Failed to get the schema.")
    }
  }
}

We first check whether maybeSchema is defined. If not, we read the schema from the existing data. If no data exists yet, we are dealing with a newly created empty table, so we use the maybeMetastoreSchema defined in the options.
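The fallback order described above can be sketched independently of Spark (a hypothetical helper; `String` stands in for the actual `StructType`, and `readSchema` is passed in as a function):

```scala
// Hypothetical sketch of the schema-resolution order discussed above:
// 1. an explicitly supplied schema wins;
// 2. otherwise the schema read from existing Parquet data;
// 3. otherwise the metastore schema (newly created, still-empty table);
// 4. otherwise fail.
def resolveSchema(
    maybeSchema: Option[String],
    readSchema: () => Option[String],
    maybeMetastoreSchema: Option[String]): String =
  maybeSchema
    .orElse(readSchema())          // orElse is by-name: only called when needed
    .orElse(maybeMetastoreSchema)
    .getOrElse(throw new RuntimeException("Failed to get the schema."))

// An empty table directory yields no data schema, so the metastore schema wins:
assert(resolveSchema(None, () => None, Some("id int, str string")) == "id int, str string")
// A data schema takes precedence over the metastore schema:
assert(resolveSchema(None, () => Some("id int"), Some("other")) == "id int")
```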

Contributor

Also, it seems we do not need the try ... catch here.

Contributor Author

After reading the source code, I am wondering whether maybeMetastoreSchema is redundant; should it not always be converted into maybeSchema when creating the ParquetRelation2 instance?

Contributor

Based on Cheng's comment at https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala#L194, I think that it is better to keep maybeMetastoreSchema and we just fix the bug for now.

@SparkQA

SparkQA commented Feb 17, 2015

Test build #27615 has started for PR 4562 at commit 36978d1.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Feb 17, 2015

Test build #27615 has finished for PR 4562 at commit 36978d1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27615/
Test PASSed.

@liancheng
Contributor

Hey @chenghao-intel @yhuai, sorry I didn't notice this PR earlier, and I believe this issue has been fixed in #4563 (here).

asfgit pushed a commit that referenced this pull request Feb 17, 2015
…uet table to a data source parquet table.

The problem is that after we create an empty Hive metastore Parquet table (e.g. `CREATE TABLE test (a int) STORED AS PARQUET`), Hive will create an empty dir for us, which causes our data source `ParquetRelation2` to fail to get the schema of the table. See JIRA for the case to reproduce the bug and the exception.

This PR is based on #4562 from chenghao-intel.

JIRA: https://issues.apache.org/jira/browse/SPARK-5852

Author: Yin Huai <yhuai@databricks.com>
Author: Cheng Hao <hao.cheng@intel.com>

Closes #4655 from yhuai/CTASParquet and squashes the following commits:

b8b3450 [Yin Huai] Update tests.
2ac94f7 [Yin Huai] Update tests.
3db3d20 [Yin Huai] Minor update.
d7e2308 [Yin Huai] Revert changes in HiveMetastoreCatalog.scala.
36978d1 [Cheng Hao] Update the code as feedback
a04930b [Cheng Hao] fix bug of scan an empty parquet based table
442ffe0 [Cheng Hao] passdown the schema for Parquet File in HiveContext

(cherry picked from commit 117121a)
Signed-off-by: Michael Armbrust <michael@databricks.com>
@yhuai
Contributor

yhuai commented Feb 18, 2015

@chenghao-intel can you close it? It has been fixed by #4655.

@chenghao-intel
Contributor Author

Thank you @yhuai , I am closing this PR. :)

@chenghao-intel chenghao-intel deleted the parquet_error branch July 2, 2015 08:38