
[SPARK-25421][SQL] Abstract an output path field in trait DataWritingCommand #22411


Closed
wants to merge 6 commits

Conversation

LantaoJin
Contributor

What changes were proposed in this pull request?

#22353 introduced a metadata field in SparkPlanInfo so that it can dump the input location for reads. Correspondingly, we need to add a field in DataWritingCommand for the output path.
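
For illustration, a minimal sketch of the shape of the proposed change, assuming the trait gains an optional output-path accessor that SparkPlanInfo.fromSparkPlan can surface as metadata (the member name and default here are assumptions, not the final API):

  import org.apache.hadoop.fs.Path

  // Hypothetical sketch: the write command exposes the location it writes to,
  // so SparkPlanInfo.fromSparkPlan can record it, mirroring the input-location
  // metadata added in #22353. The real trait's other members are omitted.
  trait DataWritingCommand {
    // None for commands whose output location is not a filesystem path.
    def outputPath: Option[Path] = None
  }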

How was this patch tested?

Unit test

@LantaoJin
Contributor Author

Gently ping @cloud-fan @dongjoon-hyun, would you please help review?

@@ -440,7 +440,7 @@ case class DataSource(
// ordering of data.logicalPlan (partition columns are all moved after data column). This
// will be adjusted within InsertIntoHadoopFsRelation.
InsertIntoHadoopFsRelationCommand(
-      outputPath = outputPath,
+      outputFsPath = outputPath,
Member

Could you undo this redundant change, @LantaoJin ?

Contributor Author

This field overrides outputPath in DataWritingCommand, and the return type is different (Path vs Option[Path]), so I renamed it.
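
To make the clash concrete, here is a sketch assuming the trait declares outputPath: Option[Path] as proposed; a case-class parameter of the same name but of type Path cannot override it, hence the rename to outputFsPath (Cmd is a hypothetical stand-in for InsertIntoHadoopFsRelationCommand):

  import org.apache.hadoop.fs.Path

  trait DataWritingCommand {
    def outputPath: Option[Path] = None // hypothetical trait member
  }

  // Would not compile: a case-class val `outputPath: Path` cannot override
  // the inherited `outputPath: Option[Path]` (incompatible types).
  // case class Cmd(outputPath: Path) extends DataWritingCommand

  // Compiles once the parameter is renamed, as in the diff above:
  case class Cmd(outputFsPath: Path) extends DataWritingCommand {
    override def outputPath: Option[Path] = Some(outputFsPath)
  }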

@@ -18,6 +18,7 @@
package org.apache.spark.sql.execution

import org.apache.spark.annotation.DeveloperApi
+import org.apache.spark.sql.execution.command.{DataWritingCommand, DataWritingCommandExec}
Member

Do we need DataWritingCommand here, too?

Contributor Author

Will remove it

@LantaoJin
Contributor Author

Gently ping @dongjoon-hyun @cloud-fan

@cloud-fan
Contributor

This time it's not a regression, right? I'd prefer not to change the interface, but to explicitly match the plans we are interested in inside fromSparkPlan, e.g.

case DataWritingCommandExec(i: InsertIntoHadoopFsRelationCommand, _) =>
  Map("path" -> i.outputPath.toString)

@LantaoJin
Contributor Author

Isn't this common? I'm afraid more than just InsertIntoHadoopFsRelationCommand would need to be added to the case statement.

@LantaoJin
Contributor Author

If almost all implementations need to be added to the case statement, pattern matching each implementation seems weird, and it would be easy to miss one when a new implementation is added in the future.

@cloud-fan
Contributor

Without a thorough design, I hesitate to change the DataWritingCommand interface only for the event log. Do you have any more plans to improve the event log?

@LantaoJin
Contributor Author

Most of the information we want can be analyzed out of the event log, except some executor-side metrics that are not heartbeated to the driver, e.g. the RPC count with the NameNode. Another case is #21221; before that we had to hack the code to get similar metrics. The event log, as a structured, unified, comprehensive and replayable log, makes offline and even real-time analysis possible. We prefer it since the history UI exposes less information than users expect, and is moreover inflexible and hard to customize. We are continuing this work based on the event log. Thanks @cloud-fan; I still suggest adding this interface to DataWritingCommand. Pattern matching each implementation looks tricky, while this field looks common and might also be usable in physical plan optimization in the future.

@cloud-fan
Contributor

Since this is a new feature, we can't just merge it like #22353 without a proper design.

Making the event logs a structured, unified and reliable source for Spark metrics looks like a good idea. Let's write a design doc explaining what we already have in the event logs, what is missing, how to make it reliable, and what the issues are if we read it in real time. It's better to discuss it on the dev list and see if other people have different ideas about how to get Spark metrics.

@LantaoJin
Contributor Author

Agreed. Since this field is important to us, could I refactor it following your advice and file a discussion in another JIRA?

@LantaoJin
Contributor Author

Using pattern matching runs into a problem: InsertIntoHiveDirCommand, CreateHiveTableAsSelectCommand and InsertIntoHiveTable are all in the spark-hive module, so SparkPlanInfo cannot reference them.
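
To illustrate why a trait-level field sidesteps that module boundary: sql/core can match on the shared trait without naming any Hive command (a sketch assuming the hypothetical outputPath: Option[Path] member from above):

  import org.apache.spark.sql.execution.SparkPlan
  import org.apache.spark.sql.execution.command.DataWritingCommandExec

  object OutputPathMetadata {
    // Covers the Hive commands too: sql/core cannot import InsertIntoHiveTable
    // and friends from spark-hive, but it can match on the shared trait,
    // assuming the trait exposes the proposed outputPath accessor.
    def apply(plan: SparkPlan): Map[String, String] = plan match {
      case DataWritingCommandExec(cmd, _) =>
        cmd.outputPath.map(p => Map("OutputPath" -> p.toString)).getOrElse(Map.empty)
      case _ => Map.empty
    }
  }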

@cloud-fan
Contributor

We can't merge new features to maintenance branches (2.4 as well), so we don't need to rush here; this feature can only become available in the next release.

@LantaoJin
Contributor Author

@cloud-fan I refactored it and removed the outputPath function from DataWritingCommand. Besides the unit test you can see, I locally added the test below to HiveQuerySuite.scala:

  test("SPARK-25421 DataWritingCommandExec(hive) should contains 'OutputPath' metadata") {
    withTable("t") {
      sql("CREATE TABLE t(col_I int)")
      val f = sql("INSERT OVERWRITE TABLE t SELECT 1")
      assert(SparkPlanInfo.fromSparkPlan(f.queryExecution.sparkPlan).metadata
        .contains("OutputPath"))
    }
  }

But since HiveQuerySuite cannot access SparkPlanInfo, I removed it again after the test passed locally.

@LantaoJin
Contributor Author

Gently ping @cloud-fan

@AmplabJenkins

Can one of the admins verify this patch?

@github-actions

github-actions bot commented Jan 7, 2020

We're closing this PR because it hasn't been updated in a while.
This isn't a judgement on the merit of the PR in any way. It's just
a way of keeping the PR queue manageable.

If you'd like to revive this PR, please reopen it!

@github-actions github-actions bot added the Stale label Jan 7, 2020
@github-actions github-actions bot closed this Jan 8, 2020