-
Notifications
You must be signed in to change notification settings - Fork 28.6k
[SPARK-5852] [SQL] Passdown the schema for Parquet File in HiveContext #4562
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Test build #27344 has started for PR 4562 at commit
|
Test build #27344 has finished for PR 4562 at commit
|
Test FAILed. |
Test build #27414 has started for PR 4562 at commit
|
Test build #27414 has finished for PR 4562 at commit
|
Test PASSed. |
cbb5460
to
a04930b
Compare
Test build #27607 has started for PR 4562 at commit
|
paths, | ||
Map(ParquetRelation2.METASTORE_SCHEMA -> metastoreSchema.json))(hive)) | ||
Map(ParquetRelation2.METASTORE_SCHEMA -> metastoreSchema.json), | ||
Some(metastoreSchema))(hive)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, we can leave this file unchanged.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, evil case insensitivity...
Test build #27607 has finished for PR 4562 at commit
|
Test PASSed. |
parquetSchema = readSchema().getOrElse(maybeSchema.get) | ||
} catch { | ||
case e => throw new SparkException(s"Failed to find schema for ${paths.mkString(",")}", e) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about this
parquetSchema = {
if (maybeSchema.isDefined) {
maybeSchema.get
} else {
(readSchema(), maybeMetastoreSchema) match {
case (Some(dataSchema), _) => dataSchema
case (None, Some(metastoreSchema)) => metastoreSchema
case (None, None) =>
throw new SparkException("Failed to get the schema.")
}
}
}
We first check if maybeSchema is defined. If not, we read the schema from existing data. If existing data does not exist, we are dealing with a newly created empty table and we will use maybeMetastoreSchema defined in the options.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, seems we do not need try ... catch
at here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After reading the source code, I am wondering if the maybeMetastoreSchema
is redundant, and it probably should be always converted into maybeSchema
when creating the ParquetRelation2
instance?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Based on Cheng's comment at https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala#L194, I think that it is better to keep maybeMetastoreSchema
and we just fix the bug for now.
Test build #27615 has started for PR 4562 at commit
|
Test build #27615 has finished for PR 4562 at commit
|
Test PASSed. |
Hey @chenghao-intel @yhuai, sorry I didn't notice this PR earlier, and I believe this issue has been fixed in #4563 (here). |
…uet table to a data source parquet table. The problem is that after we create an empty hive metastore parquet table (e.g. `CREATE TABLE test (a int) STORED AS PARQUET`), Hive will create an empty dir for us, which cause our data source `ParquetRelation2` fail to get the schema of the table. See JIRA for the case to reproduce the bug and the exception. This PR is based on #4562 from chenghao-intel. JIRA: https://issues.apache.org/jira/browse/SPARK-5852 Author: Yin Huai <yhuai@databricks.com> Author: Cheng Hao <hao.cheng@intel.com> Closes #4655 from yhuai/CTASParquet and squashes the following commits: b8b3450 [Yin Huai] Update tests. 2ac94f7 [Yin Huai] Update tests. 3db3d20 [Yin Huai] Minor update. d7e2308 [Yin Huai] Revert changes in HiveMetastoreCatalog.scala. 36978d1 [Cheng Hao] Update the code as feedback a04930b [Cheng Hao] fix bug of scan an empty parquet based table 442ffe0 [Cheng Hao] passdown the schema for Parquet File in HiveContext (cherry picked from commit 117121a) Signed-off-by: Michael Armbrust <michael@databricks.com>
…uet table to a data source parquet table. The problem is that after we create an empty hive metastore parquet table (e.g. `CREATE TABLE test (a int) STORED AS PARQUET`), Hive will create an empty dir for us, which cause our data source `ParquetRelation2` fail to get the schema of the table. See JIRA for the case to reproduce the bug and the exception. This PR is based on #4562 from chenghao-intel. JIRA: https://issues.apache.org/jira/browse/SPARK-5852 Author: Yin Huai <yhuai@databricks.com> Author: Cheng Hao <hao.cheng@intel.com> Closes #4655 from yhuai/CTASParquet and squashes the following commits: b8b3450 [Yin Huai] Update tests. 2ac94f7 [Yin Huai] Update tests. 3db3d20 [Yin Huai] Minor update. d7e2308 [Yin Huai] Revert changes in HiveMetastoreCatalog.scala. 36978d1 [Cheng Hao] Update the code as feedback a04930b [Cheng Hao] fix bug of scan an empty parquet based table 442ffe0 [Cheng Hao] passdown the schema for Parquet File in HiveContext
@chenghao-intel can you close it? It is has been fixed by #4655. |
Thank you @yhuai , I am closing this PR. :) |
It's not allowed to be the empty directory for parquet, for example, it will failed when query the following
It throws exception like: