
[SPARK-25421][SQL] Abstract an output path field in trait DataWritingCommand #22411


Closed
wants to merge 6 commits

Conversation

LantaoJin
Contributor

What changes were proposed in this pull request?

#22353 introduced a metadata field in SparkPlanInfo so that it can dump the input location for reads. Correspondingly, we need to add a field in DataWritingCommand for the output path.
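
For illustration, a minimal sketch of the shape of the proposed change, assuming the trait gains an optional output-path accessor that SparkPlanInfo.fromSparkPlan can surface as metadata (the member name and default here are assumptions, not the final API):

  import org.apache.hadoop.fs.Path

  // Hypothetical sketch: the write command exposes the location it writes to,
  // so SparkPlanInfo.fromSparkPlan can record it, mirroring the input-location
  // metadata added in #22353. The real trait's other members are omitted.
  trait DataWritingCommand {
    // None for commands whose output location is not a filesystem path.
    def outputPath: Option[Path] = None
  }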

How was this patch tested?

Unit test

@LantaoJin
Contributor Author

Gently ping @cloud-fan @dongjoon-hyun, would you please help review?

@@ -440,7 +440,7 @@ case class DataSource(
// ordering of data.logicalPlan (partition columns are all moved after data column). This
// will be adjusted within InsertIntoHadoopFsRelation.
InsertIntoHadoopFsRelationCommand(
-      outputPath = outputPath,
+      outputFsPath = outputPath,
Member

Could you undo this redundant change, @LantaoJin ?

Contributor Author

This field overrides outputPath in DataWritingCommand, and the return type is different (Path vs Option[Path]), so I renamed it.
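
To make the clash concrete, here is a sketch assuming the trait declares outputPath: Option[Path] as proposed; a case-class parameter of the same name but of type Path cannot override it, hence the rename to outputFsPath (Cmd is a hypothetical stand-in for InsertIntoHadoopFsRelationCommand):

  import org.apache.hadoop.fs.Path

  trait DataWritingCommand {
    def outputPath: Option[Path] = None // hypothetical trait member
  }

  // Would not compile: a case-class val `outputPath: Path` cannot override
  // the inherited `outputPath: Option[Path]` (incompatible types).
  // case class Cmd(outputPath: Path) extends DataWritingCommand

  // Compiles once the parameter is renamed, as in the diff above:
  case class Cmd(outputFsPath: Path) extends DataWritingCommand {
    override def outputPath: Option[Path] = Some(outputFsPath)
  }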

@@ -18,6 +18,7 @@
package org.apache.spark.sql.execution

import org.apache.spark.annotation.DeveloperApi
+import org.apache.spark.sql.execution.command.{DataWritingCommand, DataWritingCommandExec}
Member

Do we need DataWritingCommand here, too?

Contributor Author

Will remove it

@LantaoJin
Contributor Author

Gently ping @dongjoon-hyun @cloud-fan

@cloud-fan
Contributor

This time it's not a regression, right? I'd prefer not to change the interface, but to explicitly match the plans we are interested in inside fromSparkPlan, e.g.

case DataWritingCommandExec(i: InsertIntoHadoopFsRelationCommand, _) =>
  Map("path" -> i.outputPath.toString)

@LantaoJin
Contributor Author

Isn't this common? I'm afraid more than just InsertIntoHadoopFsRelationCommand would need to be added to the case statement.

@LantaoJin
Contributor Author

If almost all implementations need to be added to the case statement, pattern matching each implementation seems weird, and it would be easy to miss one when a new implementation is added in the future.

@cloud-fan
Contributor

Without a thorough design, I hesitate to change the DataWritingCommand interface only for the event log. Do you have any more plans to improve the event log?

@LantaoJin
Contributor Author

Most of the information we want can be analyzed out of the event log, except some executor-side metrics that are not heartbeated to the driver, e.g. the RPC count with the NameNode. Another case is #21221; before that we had to hack the code to get similar metrics. The event log, as a structured, unified, comprehensive and replayable log, makes offline and even real-time analysis possible. We prefer it since the history UI exposes less information than users expect, and is moreover inflexible and hard to customize. We are continuing this work based on the event log. Thanks @cloud-fan; I still suggest adding this interface to DataWritingCommand. Pattern matching each implementation looks tricky, while this field looks common and might also be usable in physical plan optimization in the future.

@cloud-fan
Contributor

Since this is a new feature, we can't just merge it like #22353 without a proper design.

Making the event logs a structured, unified and reliable source for Spark metrics looks like a good idea. Let's write a design doc explaining what we already have in the event logs, what is missing, how to make it reliable, and what the issues are if we read it in real time. It's better to discuss it on the dev list and see if other people have different ideas about how to get Spark metrics.

@LantaoJin
Contributor Author

Agreed. Since this field is important to us, could I refactor it following your advice and file a discussion in another JIRA?

@LantaoJin
Contributor Author

Using pattern matching runs into a problem: InsertIntoHiveDirCommand, CreateHiveTableAsSelectCommand and InsertIntoHiveTable are all in the spark-hive module, so SparkPlanInfo cannot reference them.
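
To illustrate why a trait-level field sidesteps that module boundary: sql/core can match on the shared trait without naming any Hive command (a sketch assuming the hypothetical outputPath: Option[Path] member from above):

  import org.apache.spark.sql.execution.SparkPlan
  import org.apache.spark.sql.execution.command.DataWritingCommandExec

  object OutputPathMetadata {
    // Covers the Hive commands too: sql/core cannot import InsertIntoHiveTable
    // and friends from spark-hive, but it can match on the shared trait,
    // assuming the trait exposes the proposed outputPath accessor.
    def apply(plan: SparkPlan): Map[String, String] = plan match {
      case DataWritingCommandExec(cmd, _) =>
        cmd.outputPath.map(p => Map("OutputPath" -> p.toString)).getOrElse(Map.empty)
      case _ => Map.empty
    }
  }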

@cloud-fan
Contributor

We can't merge new features to maintenance branches (2.4 as well), so we don't need to rush here; this feature can only become available in the next release.

@LantaoJin
Contributor Author

@cloud-fan I refactored it and removed the outputPath function from DataWritingCommand. Besides the unit test you can see, I locally added the test below to HiveQuerySuite.scala:

  test("SPARK-25421 DataWritingCommandExec(hive) should contains 'OutputPath' metadata") {
    withTable("t") {
      sql("CREATE TABLE t(col_I int)")
      val f = sql("INSERT OVERWRITE TABLE t SELECT 1")
      assert(SparkPlanInfo.fromSparkPlan(f.queryExecution.sparkPlan).metadata
        .contains("OutputPath"))
    }
  }

But since HiveQuerySuite cannot access SparkPlanInfo, I removed it again after the test passed locally.

@LantaoJin
Contributor Author

Gently ping @cloud-fan

@AmplabJenkins

Can one of the admins verify this patch?

@github-actions

github-actions bot commented Jan 7, 2020

We're closing this PR because it hasn't been updated in a while.
This isn't a judgement on the merit of the PR in any way. It's just
a way of keeping the PR queue manageable.

If you'd like to revive this PR, please reopen it!

@github-actions github-actions bot added the Stale label Jan 7, 2020
@github-actions github-actions bot closed this Jan 8, 2020