
[SPARK-18087] [SQL] Optimize insert to not require REPAIR TABLE #15633


Closed
wants to merge 3 commits

Conversation


@ericl ericl commented Oct 26, 2016

What changes were proposed in this pull request?

When inserting into datasource tables with partitions managed by the Hive metastore, we need to notify the metastore of newly added partitions. Previously this was implemented via `msck repair table`, but that is more expensive than needed.

This optimizes the insertion path to add only the updated partitions.

How was this patch tested?

Existing tests (I verified manually that tests fail if the repair operation is omitted).
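The core idea can be sketched outside of Spark (all names below are illustrative, not the actual Spark internals): rather than asking the metastore to rescan every table directory with `MSCK REPAIR TABLE`, the write path records each partition spec it actually writes and registers only those.

```scala
// Illustrative sketch only; names do not correspond to Spark internals.
object PartitionTracking {
  // Render a partition spec (column -> value) as a Hive-style path fragment.
  // Note: real Hive orders by partition column position; alphabetical order
  // here is a simplification for the sketch.
  def partitionPath(spec: Map[String, String]): String =
    spec.toSeq.sortBy(_._1).map { case (k, v) => s"$k=$v" }.mkString("/")

  // Distinct partitions touched by an insert; only these need to be
  // reported to the metastore, instead of a full table rescan.
  def updatedPartitions(written: Seq[Map[String, String]]): Set[String] =
    written.map(partitionPath).toSet
}
```

For example, writing into `year=2016/month=10` twice and `year=2016/month=11` once yields just two partitions to register.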


SparkQA commented Oct 26, 2016

Test build #67543 has finished for PR 15633 at commit fa91e39.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Oct 26, 2016

Test build #67567 has finished for PR 15633 at commit a361583.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Oct 26, 2016

Test build #67595 has finished for PR 15633 at commit 01e73bb.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Oct 26, 2016

Test build #67601 has finished for PR 15633 at commit 6f8a3a0.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@ericl ericl changed the title [SPARK-18087] [SQL] [WIP] Optimize insert to not require REPAIR TABLE [SPARK-18087] [SQL] Optimize insert to not require REPAIR TABLE Oct 27, 2016

ericl commented Oct 27, 2016

cc @cloud-fan @davies

```
@@ -386,13 +390,18 @@ object WriteOutput extends Logging {
      logDebug(s"Writing partition: $currentKey")

      currentWriter = newOutputWriter(currentKey, getPartitionString)
      val partitionStr = getPartitionString(currentKey).getString(0)
```
Contributor

`partitionStr` => `partitionPath`?

Contributor Author

Done

```
@@ -375,6 +378,7 @@ object WriteOutput extends Logging {

      // If anything below fails, we should abort the task.
      var currentKey: UnsafeRow = null
      var updatedPartitions: List[String] = Nil
```
Contributor

In the case of bucketing, there are multiple files (writers) per partition, so `partitionPath` will contain duplicate values. Should we use `Set[String]` here?

Contributor Author

Done
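The point above can be seen with plain collections (a standalone sketch, not the Spark code): with bucketing, several writers in the same partition report the same partition path, and a `List` keeps the repeats while a `Set` deduplicates them.

```scala
// Standalone sketch: several bucketed writers in one partition each report
// the same partition path.
object DedupSketch {
  val reported = Seq("year=2016/month=10", "year=2016/month=10", "year=2016/month=11")

  val asList: List[String] = reported.toList // keeps duplicates: 3 entries
  val asSet: Set[String]   = reported.toSet  // deduplicated: 2 entries
}
```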


SparkQA commented Oct 27, 2016

Test build #67671 has finished for PR 15633 at commit 8c4ae5e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Oct 28, 2016

Test build #67682 has finished for PR 15633 at commit 4d96725.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

should we also fix `DataFrameWriter.saveAsTable`?


ericl commented Oct 28, 2016

I think that one is OK since we have to scan the full table anyway. If it becomes a performance issue, we can add this optimization there as well.

ghost pushed a commit to dbtsai/spark that referenced this pull request Oct 30, 2016
…nd repair partition commands

## What changes were proposed in this pull request?

The behavior of union is not well defined here. It is safer to explicitly execute these commands in order. The other use of `Union` in this way will be removed by apache#15633

## How was this patch tested?

Existing tests.

cc yhuai cloud-fan

Author: Eric Liang <ekhliang@gmail.com>
Author: Eric Liang <ekl@databricks.com>

Closes apache#15665 from ericl/spark-18146.
```
@@ -179,24 +180,30 @@ case class DataSourceAnalysis(conf: CatalystConf) extends Rule[LogicalPlan] {
          "Cannot overwrite a path that is also being read from.")
      }

      def refreshPartitionsCallback(updatedPartitions: Seq[TablePartitionSpec]): Unit = {
        if (l.catalogTable.isDefined &&
```
Contributor

@cloud-fan cloud-fan Oct 30, 2016

shall we move this `if` out of the function? e.g.

```
val refreshPartitionsCallback = if (...) {
  ...
} else {
  _ => ()
}
```

Contributor Author

imo that is a little harder to read, since you have two anonymous function declarations instead of one.
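The two shapes under discussion can be sketched with hypothetical names (neither is the actual Spark code; returning a count of registered partitions is a tweak added here so the styles can be compared):

```scala
object CallbackStyles {
  // Style A (kept in the PR): one function, with the guard inside it.
  def refreshA(isHivePartitioned: Boolean)(updated: Set[String]): Int =
    if (isHivePartitioned) updated.size else 0

  // Style B (suggested): choose the callback up front, with a no-op branch,
  // at the cost of two anonymous function declarations.
  def refreshB(isHivePartitioned: Boolean): Set[String] => Int =
    if (isHivePartitioned) (updated => updated.size)
    else (_ => 0)
}
```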

```
        if (l.catalogTable.isDefined &&
            l.catalogTable.get.partitionColumnNames.nonEmpty &&
            l.catalogTable.get.partitionProviderIsHive) {
          val metastoreUpdater = AlterTableAddPartitionCommand(
```
Contributor

shall we just copy the main logic of `AlterTableAddPartitionCommand` here? Otherwise we have to fetch the table metadata from the metastore every time.

Contributor Author

I'd rather keep it, since the fetch overhead is pretty small.

@cloud-fan
Contributor

LGTM, pending jenkins

rxin added a commit to rxin/spark that referenced this pull request Nov 1, 2016
[SPARK-18087] [SQL] Optimize insert to not require REPAIR TABLE

SparkQA commented Nov 1, 2016

Test build #3381 has finished for PR 15633 at commit 4d96725.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


rxin commented Nov 1, 2016

Merging in master.

@asfgit asfgit closed this in efc254a Nov 1, 2016
robert3005 pushed a commit to palantir/spark that referenced this pull request Nov 1, 2016

rxin commented Nov 3, 2016

I took a look at this again tonight, in the context of consolidating this with Hive. I think doing it through a callback is actually not ideal, as callbacks are harder to trace. In 2.2 we should make this an explicit action rather than a callback.
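The distinction being drawn here can be sketched abstractly (hypothetical names, not a proposal for the actual API): with a callback, metastore registration happens invisibly inside the write; as an explicit action, the write returns the touched partitions and the caller performs registration as a visible, traceable step.

```scala
object ExplicitVsCallback {
  // Callback style: registration is invoked from inside the write, so the
  // call site does not show it.
  def writeWithCallback(parts: Set[String], refresh: Set[String] => Int): Int =
    refresh(parts)

  // Explicit style: the write returns what it touched; the caller then
  // registers those partitions as a separate, visible step.
  def write(parts: Set[String]): Set[String] = parts
  def register(parts: Set[String]): Int = parts.size
}
```

In the explicit style the two steps appear in order at the call site: `val touched = ExplicitVsCallback.write(...)` followed by `ExplicitVsCallback.register(touched)`.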

uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017
uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017
Author: Eric Liang <ekl@databricks.com>

Closes apache#15633 from ericl/spark-18087.