[SPARK-25421][SQL] Abstract an output path field in trait DataWritingCommand #22411
Conversation
Gently ping @cloud-fan @dongjoon-hyun, would you please help to review?
@@ -440,7 +440,7 @@ case class DataSource(
       // ordering of data.logicalPlan (partition columns are all moved after data column). This
       // will be adjusted within InsertIntoHadoopFsRelation.
       InsertIntoHadoopFsRelationCommand(
-        outputPath = outputPath,
+        outputFsPath = outputPath,
Could you undo this redundant change, @LantaoJin ?
This field overrides the outputPath in DataWritingCommand, and the return type is different (Path vs Option[Path]), so I renamed it.
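To illustrate the conflict, a minimal sketch (the *Sketch names are hypothetical, not the actual Spark classes): if the trait declares an optional output path, an existing concrete field with the same name but a different type cannot override it, hence the rename.

import org.apache.hadoop.fs.Path

// Hypothetical trait member: a command may or may not write to a filesystem path.
trait DataWritingCommandSketch {
  def outputPath: Option[Path] = None
}

// The command already carries a mandatory Path; reusing the name `outputPath`
// would clash with the trait member above (Path vs Option[Path]), so the
// constructor field becomes `outputFsPath` and bridges to the trait method.
case class InsertIntoHadoopFsRelationCommandSketch(outputFsPath: Path)
  extends DataWritingCommandSketch {
  override def outputPath: Option[Path] = Some(outputFsPath)
}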
@@ -18,6 +18,7 @@
 package org.apache.spark.sql.execution

+import org.apache.spark.annotation.DeveloperApi
 import org.apache.spark.sql.execution.command.{DataWritingCommand, DataWritingCommandExec}
Do we need DataWritingCommand here, too?
Will remove it
Gently ping @dongjoon-hyun @cloud-fan
This time it's not a regression, right? I'd prefer not to change the interface, but to explicitly catch the plans we are interested in.
Isn't that common? I am afraid InsertIntoHadoopFsRelation is not the only command that would need to be added to the case statement.
If almost every implementation needs to be added to the case statement, pattern matching on each implementation seems awkward, and it is easy to miss one when a new implementation is added in the future (see the sketch below).
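A minimal sketch of what "explicitly catch the plans we are interested in" could look like; the helper outputPathMetadata is hypothetical, and the field names (cmd, outputPath) assume the Spark internals of that era rather than quoting this PR.

import org.apache.spark.sql.execution.SparkPlan
import org.apache.spark.sql.execution.command.DataWritingCommandExec
import org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand

// Hypothetical helper: derive "OutputPath" metadata by matching concrete plans.
def outputPathMetadata(plan: SparkPlan): Map[String, String] = plan match {
  case exec: DataWritingCommandExec =>
    exec.cmd match {
      // Every writing command needs its own case here; a command added later is
      // silently skipped unless someone remembers to extend this match.
      case c: InsertIntoHadoopFsRelationCommand =>
        Map("OutputPath" -> c.outputPath.toString)
      // case c: InsertIntoHiveTable => ...  // and so on for each implementation
      case _ => Map.empty
    }
  case _ => Map.empty
}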
Without a thorough design, I hesitate to change the
Most of the information we want can be extracted from the event log, except some Executor-side metrics that are not heartbeated to the Driver, e.g. the RPC count with the NameNode. Another case is #21221; before that we had to hack the code to get similar metrics. The event log, as a structured, unified, complete, replayable log, makes offline and even real-time analysis possible. We prefer to use it since the history UI exposes less information than users expect, and is furthermore inflexible and hard to customize. We are working on this based on the event log. Thanks @cloud-fan, I suggest adding this interface to
Since this is a new feature, we can't just merge it like #22353 without a proper design. Making the event logs a structured, unified and reliable source for Spark metrics looks like a good idea. Let's write a design doc to explain what we already have in the event logs, what is missing, how to make it reliable, and what the issues are if we read it in real time. It's better to discuss it on the dev list and see if other people have different ideas about how to get Spark metrics.
Agreed. Since this field is important to us, could I refactor it following your advice and file a discussion in another JIRA?
Using pattern matching runs into a problem.
We can't merge new features to maintenance branches (2.4 included), so we don't need to rush here, as this feature can only be available in the next release.
@cloud-fan I refactored and removed the function outputPath in

test("SPARK-25421 DataWritingCommandExec(hive) should contains 'OutputPath' metadata") {
  withTable("t") {
    sql("CREATE TABLE t(col_I int)")
    val f = sql("INSERT OVERWRITE TABLE t SELECT 1")
    assert(SparkPlanInfo.fromSparkPlan(f.queryExecution.sparkPlan).metadata
      .contains("OutputPath"))
  }
}

But since
Gently ping @cloud-fan
Can one of the admins verify this patch?
We're closing this PR because it hasn't been updated in a while. If you'd like to revive this PR, please reopen it!
What changes were proposed in this pull request?
#22353 imported a metadata field in SparkPlanInfo so that it could dump the input location for reads. Correspondingly, we need to add a field in DataWritingCommand for the output path.

How was this patch tested?
Unit test
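For context, a minimal sketch of the direction the description proposes, under the assumption that the new trait member is an Option[Path] surfaced as an "OutputPath" metadata entry; DataWritingCommandLike, InsertLikeCommand and planMetadata are illustrative names, not the actual Spark API.

import org.apache.hadoop.fs.Path

// Hypothetical trait: every writing command can report an optional output path.
trait DataWritingCommandLike {
  def outputPath: Option[Path] = None   // default: the command writes no FS path
}

case class InsertLikeCommand(outputFsPath: Path) extends DataWritingCommandLike {
  override def outputPath: Option[Path] = Some(outputFsPath)
}

// Metadata can then be built from the trait alone, with no per-command matching.
def planMetadata(cmd: DataWritingCommandLike): Map[String, String] =
  cmd.outputPath match {
    case Some(p) => Map("OutputPath" -> p.toString)
    case None    => Map.empty
  }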