Skip to content

backport [SPARK-27747][SPARK-27816][SPARK-28344] #27417

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 5 commits into from

Conversation

cloud-fan
Copy link
Contributor

backport 3 PRs to 2.4 to fix ambiguous self-join. This PR excludes the logical plan link which is not necessary in 2.4.

For more details about ambiguous self-join, please refer to #25107

cloud-fan and others added 4 commits January 31, 2020 18:10
It's pretty useful if we can convert a physical plan back to a logical plan, e.g., in apache#24389

This PR introduces a new feature to `TreeNode`, which allows `TreeNode` to carry some extra information via a mutable map, and keep the information when it's copied.

The planner leverages this feature to put the logical plan into the physical plan.

a test suite that runs all TPCDS queries and checks that some common physical plans contain the corresponding logical plans.

Closes apache#24626 from cloud-fan/link.

Lead-authored-by: Wenchen Fan <wenchen@databricks.com>
Co-authored-by: Peng Bo <bo.peng1019@gmail.com>
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
## What changes were proposed in this pull request?

Add type parameter to `TreeNodeTag`.

## How was this patch tested?

existing tests

Closes apache#24687 from cloud-fan/tag.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
This is an alternative solution of apache#24442 . It fails the query if ambiguous self join is detected, instead of trying to disambiguate it. The problem is that, it's hard to come up with a reasonable rule to disambiguate, the rule proposed by apache#24442 is mostly a heuristic.

This is a long-standing bug and I've seen many people complaining about it in JIRA/dev list.

A typical example:
```
val df1 = …
val df2 = df1.filter(...)
df1.join(df2, df1("a") > df2("a")) // returns empty result
```
The root cause is, `Dataset.apply` is so powerful that users think it returns a column reference which can point to the column of the Dataset at anywhere. This is not true in many cases. `Dataset.apply` returns an `AttributeReference` . Different Datasets may share the same `AttributeReference`. In the example above, `df2` adds a Filter operator above the logical plan of `df1`, and the Filter operator reserves the output `AttributeReference` of its child. This means, `df1("a")` is exactly the same as `df2("a")`, and `df1("a") > df2("a")` always evaluates to false.

We can reuse the infra in apache#24442 :
1. each Dataset has a globally unique id.
2. the `AttributeReference` returned by `Dataset.apply` carries the ID and column position(e.g. 3rd column of the Dataset) via metadata.
3. the logical plan of a `Dataset` carries the ID via `TreeNodeTag`

When self-join happens, the analyzer asks the right side plan of join to re-generate output attributes with new exprIds. Based on it, a simple rule to detect ambiguous self join is:
1. find all column references (i.e. `AttributeReference`s with Dataset ID and col position) in the root node of a query plan.
2. for each column reference, traverse the query plan tree, find a sub-plan that carries Dataset ID and the ID is the same as the one in the column reference.
3. get the corresponding output attribute of the sub-plan by the col position in the column reference.
4. if the corresponding output attribute has a different exprID than the column reference, then it means this sub-plan is on the right side of a self-join and has regenerated its output attributes. This is an ambiguous self join because the column reference points to a table being self-joined.

existing tests and new test cases

Closes apache#25107 from cloud-fan/new-self-join.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
@cloud-fan
Copy link
Contributor Author

cc @dongjoon-hyun

@SparkQA
Copy link

SparkQA commented Jan 31, 2020

Test build #117647 has finished for PR 27417 at commit 11b07e0.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class DetectAmbiguousSelfJoin(conf: SQLConf) extends Rule[LogicalPlan]

@SparkQA
Copy link

SparkQA commented Jan 31, 2020

Test build #117653 has finished for PR 27417 at commit 8199e30.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Copy link
Contributor Author

Seems we need backport more:
#25111
#25055

Shall we reconsider the decision of backport?

@dongjoon-hyun
Copy link
Member

Thank you so much, @cloud-fan . Sure, I'm +1 to removeTarget Version with the reason Technical Difficulty. While removing the Target Version, could you add a comment with the reason Technical Difficulty and a pointer to this PR please?

@dongjoon-hyun
Copy link
Member

You mean this is difficult because it requires at least 5 patches. Did I understand correctly?

@cloud-fan
Copy link
Contributor Author

Yea so far I find that 5 commits need to be backported. It's still possible to backport but we need to re-evaluate the risks.

@dongjoon-hyun
Copy link
Member

Got it~ cc @srowen, @tgravescs , @gatorsmile

@srowen
Copy link
Member

srowen commented Jan 31, 2020

I don't know enough to really eval, but trust @cloud-fan on this one

@tgravescs
Copy link
Contributor

its definitely starting to be a lot of changes to pull back that could cause side affects, I'm ok with not backporting to 2.4.5

@dongjoon-hyun dongjoon-hyun changed the title backport [SPARK-27746][SPARK-27816][SPARK-28344] backport [SPARK-27747][SPARK-27816][SPARK-28344] Jan 31, 2020
@gatorsmile
Copy link
Member

The risk is high based on the code changes in this PR.

@dongjoon-hyun
Copy link
Member

Thank you all for your opinion. Especially, thank you for making this PR, @cloud-fan . I'll remove Target Version: 2.4.5 for now. If we need this in branch-2.4, Target Version will be 2.4.6.

@rshkv
Copy link
Contributor

rshkv commented Feb 9, 2021

@cloud-fan, @tgravescs, @gatorsmile, I was wondering if you could say a little more about the risks you saw here.

We found some "surprising" results and consider taking this backport over on our palantir/spark fork. I just want to better understand why you decided against merging and see if we're willing to take risk. Were you afraid of false positives? Was it the amount of changes? I understand it's been a while since you thought about this.

@cloud-fan
Copy link
Contributor Author

The risk is that many commits need to be backported. And there are probably more now as we have new bug fixes of the self-join detection in 3.x.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants