# [SPARK-17970][SQL] store partition spec in metastore for data source table #15515
## Conversation
* [SPARK-16980][SQL] Load only catalog table partition metadata required to answer a query
* Add a new catalyst optimizer rule to SQL core for pruning unneeded partitions' files from a table file catalog
* Include the type of file catalog in the FileSourceScanExec metadata
* TODO: Consider renaming FileCatalog to better differentiate it from BasicFileCatalog (or vice-versa)
* try out parquet case insensitive fallback
* Refactor the FileSourceScanExec.metadata val to make it prettier
* fix and add test for input files
* rename test
* Refactor `TableFileCatalog.listFiles` to call `listDataLeafFiles` once instead of once per partition
* fix it
* more test cases
* also fix a bug with zero partitions selected
* feature flag
* add comments
* extend and fix flakiness in test
* Enhance `ParquetMetastoreSuite` with mixed-case partition columns
* Tidy up a little by removing some unused imports, an unused method and moving a protected method down and making it private
* Put partition count in `FileSourceScanExec.metadata` for partitioned tables
* Fix some errors in my revision of `ParquetSourceSuite`
* Thu Oct 13 17:18:14 PDT 2016
* more generic
* Thu Oct 13 18:09:42 PDT 2016
* Thu Oct 13 18:09:55 PDT 2016
* Thu Oct 13 18:22:31 PDT 2016
Test build #67604 has finished for PR 15515 at commit

Test build #67610 has finished for PR 15515 at commit

Test build #67615 has finished for PR 15515 at commit

Odd that the last commit could cause this, maybe it's a flake? jenkins test this please

retest this please

Test build #67625 has finished for PR 15515 at commit

retest this please

Test build #67632 has finished for PR 15515 at commit
```scala
        df.sparkSession.sqlContext.conf.manageFilesourcePartitions) {
      // Need to recover partitions into the metastore so our saved data is visible.
      val recoverPartitionCmd = AlterTableRecoverPartitionsCommand(tableDesc.identifier)
      Union(createCmd, recoverPartitionCmd)
```
Let's add a special node for running a sequence of commands. We are relying on the implementation of Union here. Let's address this in a follow-up PR.
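As a rough illustration of the idea, such a sequencing node might look like the minimal sketch below. The name `CommandSequence` is hypothetical; this is not the actual follow-up implementation, just a sketch of a node that runs commands strictly in order instead of leaning on `Union`'s child-evaluation order.

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.execution.command.RunnableCommand

// Hypothetical node that runs a list of commands in order, removing the
// dependence on Union evaluating its children left-to-right.
case class CommandSequence(commands: Seq[RunnableCommand]) extends RunnableCommand {
  override def run(sparkSession: SparkSession): Seq[Row] = {
    // Execute each command sequentially, discarding intermediate results.
    commands.foreach(_.run(sparkSession))
    Seq.empty[Row]
  }
}
```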
```diff
@@ -50,7 +50,8 @@ case class AnalyzeColumnCommand(
       AnalyzeTableCommand.calculateTotalSize(sessionState, catalogRel.catalogTable))

     case logicalRel: LogicalRelation if logicalRel.catalogTable.isDefined =>
-      updateStats(logicalRel.catalogTable.get, logicalRel.relation.sizeInBytes)
+      updateStats(logicalRel.catalogTable.get,
+        AnalyzeTableCommand.calculateTotalSize(sessionState, logicalRel.catalogTable.get))
```
How's the cost of `AnalyzeTableCommand.calculateTotalSize`? Also, why is `sizeInBytes` not the latest size?
```scala
        sparkSession.sqlContext.conf.manageFilesourcePartitions =>
      // Need to recover partitions into the metastore so our saved data is visible.
      sparkSession.sessionState.executePlan(
        AlterTableRecoverPartitionsCommand(table.identifier)).toRdd
```
Seems we have similar logic at https://github.com/apache/spark/pull/15515/files#diff-94fbd986b04087223f53697d4b6cab24R396. Are we recovering partitions twice?
I just double-checked with

```sql
create table test_sel2 USING parquet PARTITIONED BY (fieldone, fieldtwo)
AS SELECT id as fieldzero, id as fieldone, id as fieldtwo from range(100)
```

and it is using a different path, so the recovery is not duplicated.
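One quick way to confirm from a shell that the partitions landed in the metastore exactly once (a sketch, assuming a Hive-enabled session; `SHOW PARTITIONS` reflects metastore-managed partitions):

```scala
// After the CTAS above, the metastore should list exactly one partition per
// distinct (fieldone, fieldtwo) pair, with no duplicates.
spark.sql("SHOW PARTITIONS test_sel2").show(100, truncate = false)
```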
```scala
    // This is always the case for Hive format tables, but is not true for Datasource tables created
    // before Spark 2.1 unless they are converted via `msck repair table`.
    spark.sessionState.catalog.alterTable(table.copy(partitionProviderIsHive = true))
    catalog.refreshTable(tableName)
```
Let's also update the doc.
Yeah, this is a pretty major change for 2.1. Shall we do it in a follow-up once the patches for 2.1 are finalized?
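When the doc does get updated, the upgrade path mentioned in the comment above could be illustrated along these lines (a sketch; the table name is hypothetical):

```scala
// A datasource table created before Spark 2.1 keeps its partition metadata out
// of the metastore until it is explicitly repaired:
spark.sql("MSCK REPAIR TABLE my_pre_21_table")
// The repair imports the partitions and flips partitionProviderIsHive to true,
// after which partition pruning can go through the metastore.
```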
Looks good. I left a few questions. Let me know if you want to address them in follow-up PRs.
```scala
        df.sparkSession.sqlContext.conf.manageFilesourcePartitions) {
      // Need to recover partitions into the metastore so our saved data is visible.
      val recoverPartitionCmd = AlterTableRecoverPartitionsCommand(tableDesc.identifier)
      Union(createCmd, recoverPartitionCmd)
```
Makes sense. I'll file a ticket after this is merged.
```diff
@@ -50,7 +50,8 @@ case class AnalyzeColumnCommand(
       AnalyzeTableCommand.calculateTotalSize(sessionState, catalogRel.catalogTable))

     case logicalRel: LogicalRelation if logicalRel.catalogTable.isDefined =>
-      updateStats(logicalRel.catalogTable.get, logicalRel.relation.sizeInBytes)
+      updateStats(logicalRel.catalogTable.get,
+        AnalyzeTableCommand.calculateTotalSize(sessionState, logicalRel.catalogTable.get))
```
The reason is that the relation's size is no longer computed when it is resolved, so we have to force a table scan here to get an updated size.
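Conceptually, the forced computation boils down to listing the table's files and summing their lengths, as in this simplified sketch. It is not the actual `calculateTotalSize` (which also handles partitioned layouts); the function name and signature here are illustrative, and the recursive listing it performs is exactly the cost asked about above.

```scala
import java.net.URI

import org.apache.hadoop.fs.Path
import org.apache.spark.sql.SparkSession

// Simplified sketch: sum the lengths of every file under a table location.
// Recursive listing can be expensive on tables with many files.
def approximateTableSize(spark: SparkSession, location: URI): Long = {
  val path = new Path(location)
  val fs = path.getFileSystem(spark.sparkContext.hadoopConfiguration)
  val files = fs.listFiles(path, /* recursive = */ true)
  var total = 0L
  while (files.hasNext) {
    total += files.next().getLen
  }
  total
}
```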
Weird that GitHub reordered my comment above actually ^
```scala
    if (l.catalogTable.isDefined && l.catalogTable.get.partitionColumnNames.nonEmpty &&
        l.catalogTable.get.partitionProviderIsHive) {
      // TODO(ekl) we should be more efficient here and only recover the newly added partitions
```
I have a follow-up PR for this. Will cc you.
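Until that follow-up lands, the distinction the TODO draws looks like this in DDL terms (a sketch; the table and partition names are hypothetical):

```scala
// Full recovery rescans every directory under the table root:
spark.sql("ALTER TABLE events RECOVER PARTITIONS")

// The more efficient alternative registers only the partitions just written:
spark.sql("ALTER TABLE events ADD IF NOT EXISTS PARTITION (ds = '2016-10-13')")
```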
Cool. I am merging this PR to unblock other tasks.
…table

## What changes were proposed in this pull request?

We should follow Hive tables and also store the partition spec in the metastore for data source tables. This brings 2 benefits:

1. It's more flexible to manage the table data files, as users can use `ADD PARTITION`, `DROP PARTITION` and `RENAME PARTITION`.
2. We don't need to cache all file status for data source tables anymore.

## How was this patch tested?

existing tests.

Author: Eric Liang <ekl@databricks.com>
Author: Michael Allman <michael@videoamp.com>
Author: Eric Liang <ekhliang@gmail.com>
Author: Wenchen Fan <wenchen@databricks.com>

Closes apache#15515 from cloud-fan/partition.
```scala
    }
  }

  test("when partition management is disabled, we preserve the old behavior even for new tables") {
```
I just checked the old behavior. It is different from the existing behavior in our Spark 2.0 build. Let me do a quick fix to resolve it.
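For reference, the feature flag in question can be toggled like this (a sketch, assuming a Hive-enabled session; the conf key is the one this PR's "feature flag" commit introduces):

```scala
// With partition management disabled, new datasource tables fall back to the
// pre-2.1 behavior: partition discovery from the file system, no metastore spec.
spark.conf.set("spark.sql.hive.manageFilesourcePartitions", "false")
spark.sql("CREATE TABLE legacy_tbl (id BIGINT, ds STRING) USING parquet PARTITIONED BY (ds)")
```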
…ABLE EXTENDED

### What changes were proposed in this pull request?

Invoke the check `DDLUtils.verifyPartitionProviderIsHive()` from the V1 implementation of `SHOW TABLE EXTENDED` when partition specs are specified. This PR is a follow-up to #16373 and #15515.

### Why are the changes needed?

To output a user-friendly error with a recommendation like **"... partition metadata is not stored in the Hive metastore. To import this information into the metastore, run `msck repair table tableName`"** instead of silently outputting an empty result.

### Does this PR introduce _any_ user-facing change?

Yes.

### How was this patch tested?

By running the affected test suites, in particular:

```
$ build/sbt -Phive-2.3 -Phive-thriftserver "hive/test:testOnly *PartitionProviderCompatibilitySuite"
```

Closes #30618 from MaxGekk/show-table-extended-verifyPartitionProviderIsHive.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
(The same commit message accompanies the cherry-picks of commit 29096a8 to the maintenance branches, signed off by HyukjinKwon <gurwls223@apache.org> and Max Gekk <max.gekk@gmail.com>.)
…HOW TABLE EXTENDED

The backports of the same change to branch-3.0 and branch-2.4 (cherry-picked from commit 29096a8) were additionally tested with:

```
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *HiveCatalogedDDLSuite"
$ build/sbt -Phive-2.3 -Phive-thriftserver "hive/test:testOnly *PartitionProviderCompatibilitySuite"
```

Closes #30640 from MaxGekk/show-table-extended-verifyPartitionProviderIsHive-3.0 and #30641 from MaxGekk/show-table-extended-verifyPartitionProviderIsHive-2.4.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
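The user-visible effect of that check, sketched (the table name is hypothetical): on a legacy table whose partition metadata is not in the metastore, the command now fails with the `msck repair table` recommendation instead of returning nothing.

```scala
// Before the fix: silently returns an empty result for a legacy table.
// After the fix: raises an error recommending `MSCK REPAIR TABLE legacy_tbl`.
spark.sql("SHOW TABLE EXTENDED LIKE 'legacy_tbl' PARTITION (ds = '2016-10-13')")
```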
## What changes were proposed in this pull request?

We should follow Hive tables and also store the partition spec in the metastore for data source tables. This brings 2 benefits:

1. It's more flexible to manage the table data files, as users can use `ADD PARTITION`, `DROP PARTITION` and `RENAME PARTITION`.
2. We don't need to cache all file status for data source tables anymore.
## How was this patch tested?
existing tests.
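For example, once the partition spec lives in the metastore, the standard partition DDL works directly on a datasource table (a sketch; the table and partition names are hypothetical):

```scala
spark.sql("CREATE TABLE events (id BIGINT, ds STRING) USING parquet PARTITIONED BY (ds)")

// Partition metadata is now managed by the metastore, so the usual DDL applies:
spark.sql("ALTER TABLE events ADD PARTITION (ds = '2016-10-13')")
spark.sql("ALTER TABLE events PARTITION (ds = '2016-10-13') RENAME TO PARTITION (ds = '2016-10-14')")
spark.sql("ALTER TABLE events DROP PARTITION (ds = '2016-10-14')")
```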