
[SPARK-20236][SQL] dynamic partition overwrite #18714


Closed
cloud-fan wants to merge 2 commits into apache:master from cloud-fan:overwrite-partition

Conversation

cloud-fan
Contributor

What changes were proposed in this pull request?

When overwriting a partitioned table with dynamic partition columns, the behavior differs between data source and Hive tables.

data source table: delete all partition directories that match the static partition values provided in the insert statement.

Hive table: only delete partition directories that actually have data written into them.

This PR adds a new config that lets users opt into Hive's behavior.
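
A minimal sketch of the difference, assuming a SparkSession named `spark` (as in spark-shell); the table name and data are made up for illustration:

```scala
// Hypothetical table, partitioned by `part`, with two existing partitions.
spark.sql("CREATE TABLE t (id INT, part INT) USING parquet PARTITIONED BY (part)")
spark.sql("INSERT INTO t PARTITION (part=1) SELECT 1")
spark.sql("INSERT INTO t PARTITION (part=2) SELECT 2")

// Overwrite with `part` as a dynamic partition column; only part=1 receives new data.
spark.sql("INSERT OVERWRITE TABLE t PARTITION (part) SELECT 3, 1")

// Data source behavior (the current default): both part=1 and part=2 are deleted
// first, so only (3, 1) remains.
// Hive behavior (what the new config enables): only part=1 is rewritten, so the
// table ends up with (3, 1) and (2, 2).
```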

How was this patch tested?

new tests

@cloud-fan
Contributor Author

cc @gatorsmile @ericl

@SparkQA

SparkQA commented Jul 22, 2017

Test build #79870 has finished for PR 18714 at commit 8e7f5dd.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class HadoopMapReduceCommitProtocol(
  • class SQLHadoopMapReduceCommitProtocol(

@@ -881,6 +881,16 @@ object SQLConf {
.intConf
.createWithDefault(10000)

val HIVE_STYLE_PARTITION_OVERWRITE =
buildConf("spark.sql.hiveStylePartitionOverwrite")
Contributor


I wouldn't name it like this. I'd describe what it actually does: tableOverwrite vs. partitionOverwrite.

Contributor Author


Do you want to hide the Hive reference in the name?

Contributor Author


How about spark.sql.runtimePartitionOverwrite?

@@ -52,12 +55,22 @@ class HadoopMapReduceCommitProtocol(jobId: String, path: String)
*/
@transient private var addedAbsPathFiles: mutable.Map[String, String] = null

@transient private var partitionPaths: mutable.Set[String] = null

@transient private var stagingDir: Path = _
Contributor


Do you need to add these fields? It seems like they can be computed from addedAbsPathFiles and the constructor params respectively.

Member


Maybe faster? We are not deleting the files one by one. We drop the whole staging directory.

Contributor

@ericl ericl Jul 23, 2017


I mean, we can turn stagingDir into private def stagingDir or a local variable in a function.

Similarly, partitionPaths can be computed as filesToMove.map(_.getPath).distinct during the commit phase.

Contributor Author


stagingDir may not be needed, but we do need partitionPaths, which tracks partitions that use the default path.
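
For readers following this thread, here is a rough sketch (not the PR's exact code) of how a commit protocol can use a staging directory plus a set of written partition paths to implement the Hive-style overwrite: only the partitions that received data are deleted at the destination before the staged output is moved into place. The helper name `movePartitionedOutput` is made up; `stagingDir` and `partitionPaths` follow the names in the diff above.

```scala
import org.apache.hadoop.fs.{FileSystem, Path}

// Sketch of the commit-side reconciliation when Hive-style/runtime partition
// overwrite is enabled. `partitionPaths` holds relative paths like "a=1/b=2"
// that tasks wrote under `stagingDir`.
def movePartitionedOutput(
    fs: FileSystem,
    stagingDir: Path,
    destDir: Path,
    partitionPaths: Set[String]): Unit = {
  partitionPaths.foreach { part =>
    val finalPartPath = new Path(destDir, part)
    // Delete only the partition directories we actually wrote to...
    fs.delete(finalPartPath, true)
    fs.mkdirs(finalPartPath.getParent)
    // ...then move the staged partition directory into its final location.
    fs.rename(new Path(stagingDir, part), finalPartPath)
  }
  // Finally drop the whole staging directory.
  fs.delete(stagingDir, true)
}
```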

@@ -881,6 +881,16 @@ object SQLConf {
.intConf
.createWithDefault(10000)

val HIVE_STYLE_PARTITION_OVERWRITE =
buildConf("spark.sql.hiveStylePartitionOverwrite")
.doc("When insert overwrite a partitioned table with dynamic partition columns, Spark " +
Member


dynamic -> dynamic and mixed

class HadoopMapReduceCommitProtocol(
jobId: String,
path: String,
runtimeOverwritePartition: Boolean = false)
Member


It's not easy to understand the purpose of this parameter from the name alone. We might need a @param doc.

"to keep the previous behavior, which means Spark will delete all partition directories " +
"that match the static partition values provided in the insert statement.")
.booleanConf
.createWithDefault(false)
Member

@gatorsmile gatorsmile Jul 23, 2017


Could we turn this on, see how many existing test cases fail, and then turn it off again?

Contributor Author


A lot of tests will fail because we explicitly assert the old behavior, but I can try.

@@ -162,5 +198,8 @@ class HadoopMapReduceCommitProtocol(jobId: String, path: String)
val tmp = new Path(src)
tmp.getFileSystem(taskContext.getConfiguration).delete(tmp, false)
}
if (runtimeOverwritePartition) {
Member


If we read this function without the context of this PR, we might ask why we drop the staging directory only when runtimeOverwritePartition is true.

Is there any reason to keep the behavior unchanged when runtimeOverwritePartition is false?

val stagingDir: String = committer match {
val stagingDir: Path = committer match {
case _ if runtimeOverwritePartition =>
assert(dir.isDefined)
Member


Could you add an error message in case the assertion fails? It helps when reading the log.

* e.g. a=1/b=2. Files under these partitions will be saved into staging directory and moved to
* destination directory at the end, if `runtimeOverwritePartition` is true.
*/
@transient private var partitionPaths: mutable.Set[String] = null
Contributor Author


cc @ericl: addedAbsPathFiles only tracks partitions with a custom path; we still need partitionPaths to track partitions with the default path.

"will delete all partition directories that match the static partition values provided " +
"in the insert statement.")
.booleanConf
.createWithDefault(false)
Contributor Author


cc @gatorsmile I decided not to try it, because this config only takes effect when overwriting a partitioned table with dynamic partition columns, and turning it on changes the behavior and fails all the related tests.

@SparkQA

SparkQA commented Jul 23, 2017

Test build #79887 has finished for PR 18714 at commit 9d6eeaf.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@ericl
Contributor

ericl commented Jul 23, 2017 via email

@cloud-fan cloud-fan changed the title [SPARK-20236][SQL] hive style partition overwrite [SPARK-20236][SQL] runtime partition overwrite Jul 24, 2017
@cloud-fan cloud-fan force-pushed the overwrite-partition branch from 8abffd0 to 0630372 Compare July 24, 2017 16:13
@SparkQA

SparkQA commented Jul 24, 2017

Test build #79910 has finished for PR 18714 at commit 8abffd0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 24, 2017

Test build #79911 has finished for PR 18714 at commit 0630372.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member

retest this please

@SparkQA

SparkQA commented Aug 8, 2017

Test build #80415 has finished for PR 18714 at commit 0630372.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

outputPath = outputPath.toString,
// If there is no matching partitions, overwrite is same as append, so here we only enable
// runtime partition overwrite when there are matching partitions.
runtimeOverwritePartition = runtimePartitionOverwrite && matchingPartitions.nonEmpty)
Member


matchingPartitions.nonEmpty needs to be removed too.

@@ -2658,4 +2659,62 @@ class SQLQuerySuite extends QueryTest with SharedSQLContext {
checkAnswer(sql("SELECT __auto_generated_subquery_name.i from (SELECT i FROM v)"), Row(1))
}
}

test("SPARK-20236: runtime partition overwrite") {
Member

@gatorsmile gatorsmile Aug 9, 2017


Need to move the test cases with more test cases
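
For readers without the full diff, a hedged sketch of what a test along these lines might look like (not the PR's actual test body; `withSQLConf`, `withTable`, and `checkAnswer` are helpers available in Spark's SQL test suites, and the config constant name follows the earlier diff hunk, which was renamed during review):

```scala
test("SPARK-20236: runtime partition overwrite") {
  withSQLConf(SQLConf.HIVE_STYLE_PARTITION_OVERWRITE.key -> "true") {
    withTable("t") {
      sql("CREATE TABLE t (id INT, part INT) USING parquet PARTITIONED BY (part)")
      sql("INSERT INTO t PARTITION (part=1) SELECT 1")
      sql("INSERT INTO t PARTITION (part=2) SELECT 2")
      // Only part=1 receives new data, so part=2 must survive the overwrite.
      sql("INSERT OVERWRITE TABLE t PARTITION (part) SELECT 3, 1")
      checkAnswer(spark.table("t"), Row(3, 1) :: Row(2, 2) :: Nil)
    }
  }
}
```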

@gatorsmile
Member

LGTM except the above comments. Thanks!

@jiangxb1987
Contributor

Do we still want this? @cloud-fan @gatorsmile

@gatorsmile
Member

gatorsmile commented Oct 3, 2017

Yes, this is still needed. The target is the 2.3 release.

@felixcheung
Member

ping, very interested in this.

@jiangxb1987
Contributor

Is this PR still targeted to 2.3? @cloud-fan @gatorsmile

@felixcheung
Member

ah yes, please please :)

@cloud-fan cloud-fan force-pushed the overwrite-partition branch from 0630372 to 65a9741 Compare January 2, 2018 08:14
@cloud-fan cloud-fan changed the title [SPARK-20236][SQL] runtime partition overwrite [SPARK-20236][SQL] dynamic partition overwrite Jan 2, 2018
@SparkQA

SparkQA commented Jan 2, 2018

Test build #85590 has finished for PR 18714 at commit 65a9741.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class HadoopMapReduceCommitProtocol(
  • class SQLHadoopMapReduceCommitProtocol(

class HadoopMapReduceCommitProtocol(
jobId: String,
path: String,
dynamicPartitionOverwrite: Boolean = false)
Member


Indents.

val stagingDir: Path = committer match {
case _ if dynamicPartitionOverwrite =>
assert(dir.isDefined,
"The dataset to be written must be partitioned when runtimeOverwritePartition is true.")
Member


runtimeOverwritePartition -> dynamicPartitionOverwrite

@gatorsmile
Member

LGTM

@SparkQA

SparkQA commented Jan 3, 2018

Test build #85621 has finished for PR 18714 at commit f7745a0.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member

retest this please

@SparkQA

SparkQA commented Jan 3, 2018

Test build #85622 has finished for PR 18714 at commit f7745a0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member

Thanks! Merged to master/2.3

asfgit pushed a commit that referenced this pull request Jan 3, 2018
## What changes were proposed in this pull request?

When overwriting a partitioned table with dynamic partition columns, the behavior differs between data source and Hive tables.

data source table: delete all partition directories that match the static partition values provided in the insert statement.

Hive table: only delete partition directories that actually have data written into them.

This PR adds a new config that lets users opt into Hive's behavior.

## How was this patch tested?

new tests

Author: Wenchen Fan <wenchen@databricks.com>

Closes #18714 from cloud-fan/overwrite-partition.

(cherry picked from commit a66fe36)
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
@asfgit asfgit closed this in a66fe36 Jan 3, 2018
@koertkuipers
Contributor

Should this be exposed per write instead of as a global setting?
e.g.
dataframe.write.csv.partitionMode(Dynamic).partitionBy(...).save(...)

@cloud-fan
Contributor Author

@koertkuipers Makes sense to me, but I won't add a new API in DataFrameWriter for it; I think we can just add a write option for file sources, e.g. df.write.option("partitionOverwriteMode", "dynamic").parquet...
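
A hedged sketch of how the per-write option suggested above might look once it exists (it is only being proposed here and tracked later in this thread as SPARK-24860, so the option name and behavior are assumptions, not a shipped API at this point):

```scala
// Assumes an existing DataFrame `df` with a `part` column; the option name
// "partitionOverwriteMode" is the one proposed in the comment above.
df.write
  .mode("overwrite")
  .option("partitionOverwriteMode", "dynamic")
  .partitionBy("part")
  .parquet("/path/to/table")
```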

@koertkuipers
Contributor

@cloud-fan OK, that works just as well

@koertkuipers
Contributor

@cloud-fan i created SPARK-24860 for this
