Skip to content

[SPARK-27482][SQL][WEBUI] Show BroadcastHashJoinExec numOutputRows statistics info on SparkSQL UI page #24389

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 6 commits into from

Conversation

pengbo
Copy link
Contributor

@pengbo pengbo commented Apr 17, 2019

What changes were proposed in this pull request?

Currently, the SparkSQL UI page shows only actual metric info in each SparkPlan node. However with statistics info may help us understand how the plan is designed and the reason it runs slowly. This PR is to show BroadcastHashJoinExec numOutputRows statistic info on SparkSQL UI page first when it's available.

The main changes:

  1. Add stats field in SparkPlan
  2. Initialize BroadcastHashJoinExec with the stats in Join LogicalPlan
  3. Add stats field in SQLMetric and show it on UI page when it's available

How was this patch tested?

Regarding unit test has been added, manual UI test has been tested

SPARK-27482-1

@pengbo
Copy link
Contributor Author

pengbo commented Apr 17, 2019

CC @cloud-fan @srowen @dongjoon-hyun

@cloud-fan
Copy link
Contributor

cloud-fan commented Apr 17, 2019

I agree with the use case, but the implementation is too hacky. We need a general approach to propagate the statistics from logical plan to physical plan.

@pengbo
Copy link
Contributor Author

pengbo commented Apr 18, 2019

I agree with the use case, but the implementation is too hacky. We need a general approach to propagate the statistics from logical plan to physical plan.

Thanks for your comments.

How about:

  1. Add stats field and getter/setter in SparkPlanStats (trait of SparkPlan)
  2. Add propagateProperty(candidate, plan) before collectPlaceholders in QueryPlanner.plan which is override by SparkPlanner to set the stats value in candidate SparkPlan from logicalPlan

Your feedback is appreciated. @cloud-fan @dongjoon-hyun

@pengbo
Copy link
Contributor Author

pengbo commented Apr 26, 2019

Propagate the statistics from logical plan to physical plan in Strategy.apply method instead

@dongjoon-hyun
Copy link
Member

ok to test

@dongjoon-hyun dongjoon-hyun changed the title [SPARK-27482][SQL][Web UI]: Show BroadcastHashJoinExec numOutputRows statistics info on SparkSQL UI page [SPARK-27482][SQL][WEBUI] Show BroadcastHashJoinExec numOutputRows statistics info on SparkSQL UI page May 7, 2019
@SparkQA
Copy link

SparkQA commented May 8, 2019

Test build #105235 has finished for PR 24389 at commit b3cd0b0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@pengbo
Copy link
Contributor Author

pengbo commented May 9, 2019

@dongjoon-hyun @cloud-fan
Back to use propagate template method in QueryPlanner.plan method, as the previous way may break the SparkStrategy Extension compatibility.

@pengbo
Copy link
Contributor Author

pengbo commented May 9, 2019

retest it please

@SparkQA
Copy link

SparkQA commented May 9, 2019

Test build #105279 has finished for PR 24389 at commit 83483b1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@pengbo
Copy link
Contributor Author

pengbo commented May 17, 2019

Thanks @cloud-fan has created a more general approach
#24626

@pengbo pengbo closed this May 17, 2019
@cloud-fan
Copy link
Contributor

Hi @pengbo, my PR just adds the foundation: physical plan knows the statistics of its corresponding logical plan. But we still need some work to make the statistics available in the SQL UI. We still need your PR after my PR is merged.

@pengbo
Copy link
Contributor Author

pengbo commented May 17, 2019

Hi @pengbo, my PR just adds the foundation: physical plan knows the statistics of its corresponding logical plan. But we still need some work to make the statistics available in the SQL UI. We still need your PR after my PR is merged.

Thanks for info, I will open another one when your PR is merged.

j-baker pushed a commit to palantir/spark that referenced this pull request Jan 25, 2020
It's pretty useful if we can convert a physical plan back to a logical plan, e.g., in apache#24389

This PR introduces a new feature to `TreeNode`, which allows `TreeNode` to carry some extra information via a mutable map, and keep the information when it's copied.

The planner leverages this feature to put the logical plan into the physical plan.

a test suite that runs all TPCDS queries and checks that some common physical plans contain the corresponding logical plans.

Closes apache#24626 from cloud-fan/link.

Lead-authored-by: Wenchen Fan <wenchen@databricks.com>
Co-authored-by: Peng Bo <bo.peng1019@gmail.com>
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
cloud-fan added a commit to cloud-fan/spark that referenced this pull request Jan 31, 2020
It's pretty useful if we can convert a physical plan back to a logical plan, e.g., in apache#24389

This PR introduces a new feature to `TreeNode`, which allows `TreeNode` to carry some extra information via a mutable map, and keep the information when it's copied.

The planner leverages this feature to put the logical plan into the physical plan.

a test suite that runs all TPCDS queries and checks that some common physical plans contain the corresponding logical plans.

Closes apache#24626 from cloud-fan/link.

Lead-authored-by: Wenchen Fan <wenchen@databricks.com>
Co-authored-by: Peng Bo <bo.peng1019@gmail.com>
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants