
[SPARK-5852] [SQL] Passdown the schema for Parquet File in HiveContext #4562


Closed
wants to merge 3 commits

Conversation

chenghao-intel
Contributor

Parquet does not allow the table directory to be empty; for example, the following query fails:

CREATE TABLE parquet_test (id int, str string) STORED AS PARQUET;
SELECT * FROM parquet_test;

It throws an exception like:

java.lang.UnsupportedOperationException: empty.reduceLeft
    at scala.collection.TraversableOnce$class.reduceLeft(TraversableOnce.scala:167)
    at scala.collection.mutable.ArrayBuffer.scala$collection$IndexedSeqOptimized$$super$reduceLeft(ArrayBuffer.scala:47)
    at scala.collection.IndexedSeqOptimized$class.reduceLeft(IndexedSeqOptimized.scala:68)
    at scala.collection.mutable.ArrayBuffer.reduceLeft(ArrayBuffer.scala:47)
    at scala.collection.TraversableOnce$class.reduce(TraversableOnce.scala:195)
    at scala.collection.AbstractTraversable.reduce(Traversable.scala:105)
    at org.apache.spark.sql.parquet.ParquetRelation2$.readSchema(newParquet.scala:633)
    at org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache.org$apache$spark$sql$parquet$ParquetRelation2$MetadataCache$$readSchema(newParquet.scala:349)
    at org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$refresh$8.apply(newParquet.scala:290)
    at org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$refresh$8.apply(newParquet.scala:290)
    at scala.Option.getOrElse(Option.scala:120)
    at org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache.refresh(newParquet.scala:290)
    at org.apache.spark.sql.parquet.ParquetRelation2.<init>(newParquet.scala:354)
    at org.apache.spark.sql.hive.HiveMetastoreCatalog.org$apache$spark$sql$hive$HiveMetastoreCatalog$$convertToParquetRelation(HiveMetastoreCatalog.scala:218)
    at org.apache.spark.sql.hive.HiveMetastoreCatalog$ParquetConversions$$anonfun$apply$4.apply(HiveMetastoreCatalog.scala:446)
    at org.apache.spark.sql.hive.HiveMetastoreCatalog$ParquetConversions$$anonfun$apply$4.apply(HiveMetastoreCatalog.scala:445)
    at scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:51)
    at scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:60)
    at scala.collection.mutable.ArrayBuffer.foldLeft(ArrayBuffer.scala:47)
    at org.apache.spark.sql.hive.HiveMetastoreCatalog$ParquetConversions$.apply(HiveMetastoreCatalog.scala:445)
    at org.apache.spark.sql.hive.HiveMetastoreCatalog$ParquetConversions$.apply(HiveMetastoreCatalog.scala:422)
    at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(RuleExecutor.scala:61)
    at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(RuleExecutor.scala:59)
    at scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:111)
    at scala.collection.immutable.List.foldLeft(List.scala:84)
    at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1.apply(RuleExecutor.scala:59)
    at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1.apply(RuleExecutor.scala:51)
    at scala.collection.immutable.List.foreach(List.scala:318)
    at org.apache.spark.sql.catalyst.rules.RuleExecutor.apply(RuleExecutor.scala:51)
    at org.apache.spark.sql.SQLContext$QueryExecution.analyzed$lzycompute(SQLContext.scala:917)
    at org.apache.spark.sql.SQLContext$QueryExecution.analyzed(SQLContext.scala:917)
    at org.apache.spark.sql.DataFrameImpl.<init>(DataFrameImpl.scala:61)
    at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:35)
    at org.apache.spark.sql.hive.HiveContext.sql(HiveContext.scala:77)
    at org.apache.spark.sql.hive.thriftserver.AbstractSparkSQLDriver.run(AbstractSparkSQLDriver.scala:57)
    at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:275)
    at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:423)
    at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:211)
    at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
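The root cause is visible at the top of the trace: `readSchema` calls `reduce` on the per-file schemas, and `reduce` throws on an empty collection. A minimal standalone illustration (not the Spark code itself; the `String` schemas are stand-ins):

```scala
// Stand-in for the per-file Parquet schemas collected from the table directory.
// When the directory is empty (a freshly created table), the collection is empty.
val schemas = Seq.empty[String]

// This is what effectively happens inside readSchema for an empty directory:
// schemas.reduce(_ + _)  // throws java.lang.UnsupportedOperationException: empty.reduceLeft

// A safe alternative returns an Option instead of throwing:
val merged: Option[String] = schemas.reduceOption(_ + _)
assert(merged.isEmpty)

// With at least one schema present, reduceOption behaves like reduce:
assert(Seq("a", "b").reduceOption(_ + _) == Some("ab"))
```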

@SparkQA

SparkQA commented Feb 12, 2015

Test build #27344 has started for PR 4562 at commit 33867c0.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Feb 12, 2015

Test build #27344 has finished for PR 4562 at commit 33867c0.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27344/
Test FAILed.

@SparkQA

SparkQA commented Feb 13, 2015

Test build #27414 has started for PR 4562 at commit cbb5460.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Feb 13, 2015

Test build #27414 has finished for PR 4562 at commit cbb5460.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27414/
Test PASSed.

@chenghao-intel
Contributor Author

@liancheng

@SparkQA

SparkQA commented Feb 17, 2015

Test build #27607 has started for PR 4562 at commit a04930b.

  • This patch merges cleanly.

@chenghao-intel chenghao-intel changed the title [SQL] [Minor] Passdown the schema for Parquet File in HiveContext [SPARK-5852] [SQL] Passdown the schema for Parquet File in HiveContext Feb 17, 2015
  paths,
- Map(ParquetRelation2.METASTORE_SCHEMA -> metastoreSchema.json))(hive))
+ Map(ParquetRelation2.METASTORE_SCHEMA -> metastoreSchema.json),
+ Some(metastoreSchema))(hive))
Contributor

OK, we can leave this file unchanged.

Contributor

Yeah, evil case insensitivity...

@SparkQA

SparkQA commented Feb 17, 2015

Test build #27607 has finished for PR 4562 at commit a04930b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27607/
Test PASSed.

parquetSchema = readSchema().getOrElse(maybeSchema.get)
} catch {
  case e: Throwable => throw new SparkException(s"Failed to find schema for ${paths.mkString(",")}", e)
}
Contributor

How about this:

parquetSchema = {
  if (maybeSchema.isDefined) {
    maybeSchema.get
  } else {
    (readSchema(), maybeMetastoreSchema) match {
      case (Some(dataSchema), _) => dataSchema
      case (None, Some(metastoreSchema)) => metastoreSchema
      case (None, None) =>
        throw new SparkException("Failed to get the schema.")
    }
  }
}

We first check whether maybeSchema is defined. If not, we read the schema from the existing data. If no data exists yet, we are dealing with a newly created empty table, so we use the maybeMetastoreSchema defined in the options.
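The fallback order described above can be sketched independently of Spark (a hypothetical helper; `String` stands in for the actual `StructType`, and `readSchema` is passed in as a function):

```scala
// Hypothetical sketch of the schema-resolution order discussed above:
// 1. an explicitly supplied schema wins;
// 2. otherwise the schema read from existing Parquet data;
// 3. otherwise the metastore schema (newly created, still-empty table);
// 4. otherwise fail.
def resolveSchema(
    maybeSchema: Option[String],
    readSchema: () => Option[String],
    maybeMetastoreSchema: Option[String]): String =
  maybeSchema
    .orElse(readSchema())          // orElse is by-name: only called when needed
    .orElse(maybeMetastoreSchema)
    .getOrElse(throw new RuntimeException("Failed to get the schema."))

// An empty table directory yields no data schema, so the metastore schema wins:
assert(resolveSchema(None, () => None, Some("id int, str string")) == "id int, str string")
// A data schema takes precedence over the metastore schema:
assert(resolveSchema(None, () => Some("id int"), Some("other")) == "id int")
```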

Contributor

Also, it seems we do not need the try ... catch here.

Contributor Author

After reading the source code, I am wondering whether maybeMetastoreSchema is redundant; should it not always be converted into maybeSchema when creating the ParquetRelation2 instance?

Contributor

Based on Cheng's comment at https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala#L194, I think that it is better to keep maybeMetastoreSchema and we just fix the bug for now.

@SparkQA

SparkQA commented Feb 17, 2015

Test build #27615 has started for PR 4562 at commit 36978d1.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Feb 17, 2015

Test build #27615 has finished for PR 4562 at commit 36978d1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27615/
Test PASSed.

@liancheng
Contributor

Hey @chenghao-intel @yhuai, sorry I didn't notice this PR earlier, and I believe this issue has been fixed in #4563 (here).

asfgit pushed a commit that referenced this pull request Feb 17, 2015
…uet table to a data source parquet table.

The problem is that after we create an empty Hive metastore Parquet table (e.g. `CREATE TABLE test (a int) STORED AS PARQUET`), Hive will create an empty dir for us, which causes our data source `ParquetRelation2` to fail to get the schema of the table. See JIRA for the case to reproduce the bug and the exception.

This PR is based on #4562 from chenghao-intel.

JIRA: https://issues.apache.org/jira/browse/SPARK-5852

Author: Yin Huai <yhuai@databricks.com>
Author: Cheng Hao <hao.cheng@intel.com>

Closes #4655 from yhuai/CTASParquet and squashes the following commits:

b8b3450 [Yin Huai] Update tests.
2ac94f7 [Yin Huai] Update tests.
3db3d20 [Yin Huai] Minor update.
d7e2308 [Yin Huai] Revert changes in HiveMetastoreCatalog.scala.
36978d1 [Cheng Hao] Update the code as feedback
a04930b [Cheng Hao] fix bug of scan an empty parquet based table
442ffe0 [Cheng Hao] passdown the schema for Parquet File in HiveContext

(cherry picked from commit 117121a)
Signed-off-by: Michael Armbrust <michael@databricks.com>
@yhuai
Contributor

yhuai commented Feb 18, 2015

@chenghao-intel can you close it? It has been fixed by #4655.

@chenghao-intel
Contributor Author

Thank you @yhuai , I am closing this PR. :)

@chenghao-intel chenghao-intel deleted the parquet_error branch July 2, 2015 08:38