[SPARK-33071][SPARK-33536][SQL] Avoid changing dataset_id of LogicalPlan in join() to not break DetectAmbiguousSelfJoin#30488
Closed
Ngone51 wants to merge 11 commits into apache:master
Conversation
Member
Author
cc @cloud-fan @xuanyuanking Could you take a look? Thanks!
cloud-fan
approved these changes
Nov 24, 2020
Test build #131668 has finished for PR 30488 at commit
Test build #131728 has finished for PR 30488 at commit
Member
Seems the failed UT is related.
Member
Author
Yeah... I fixed it just now.
Ngone51
commented
Nov 25, 2020
Ngone51
commented
Nov 25, 2020
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/namedExpressions.scala
Outdated
cloud-fan
reviewed
Nov 25, 2020
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/namedExpressions.scala
Outdated
Test build #131778 has finished for PR 30488 at commit
5cff25f to
85f6f12
Test build #131994 has finished for PR 30488 at commit
cloud-fan
reviewed
Dec 1, 2020
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/namedExpressions.scala
cloud-fan
reviewed
Dec 1, 2020
cloud-fan
reviewed
Dec 1, 2020
sql/catalyst/src/main/scala/org/apache/spark/sql/types/Metadata.scala
Outdated
cloud-fan
reviewed
Dec 2, 2020
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/namedExpressions.scala
Outdated
cloud-fan
approved these changes
Dec 2, 2020
Test build #132041 has finished for PR 30488 at commit
Contributor
retest this please
Test build #132054 has finished for PR 30488 at commit
Contributor
GA passed, merging to master, thanks!
Test build #132059 has finished for PR 30488 at commit
HyukjinKwon
added a commit
that referenced
this pull request
Dec 9, 2020
…to nonInheritableMetadataKeys in Alias

### What changes were proposed in this pull request?

This PR is a followup of #30488. This PR proposes to rename `Alias.deniedMetadataKeys` to `Alias.nonInheritableMetadataKeys` to make it less confusing.

### Why are the changes needed?

To make it easier to maintain and read.

### Does this PR introduce _any_ user-facing change?

No. This is rather a code cleanup.

### How was this patch tested?

Ran the unit tests written in the previous PR manually. Jenkins and GitHub Actions in this PR should also test them.

Closes #30682 from HyukjinKwon/SPARK-33071-SPARK-33536.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
HyukjinKwon
added a commit
that referenced
this pull request
Dec 9, 2020
…to nonInheritableMetadataKeys in Alias

### What changes were proposed in this pull request?

This PR is a followup of #30488. This PR proposes to rename `Alias.deniedMetadataKeys` to `Alias.nonInheritableMetadataKeys` to make it less confusing.

### Why are the changes needed?

To make it easier to maintain and read.

### Does this PR introduce _any_ user-facing change?

No. This is rather a code cleanup.

### How was this patch tested?

Ran the unit tests written in the previous PR manually. Jenkins and GitHub Actions in this PR should also test them.

Closes #30682 from HyukjinKwon/SPARK-33071-SPARK-33536.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
(cherry picked from commit b5399d4)
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
a0x8o
added a commit
to a0x8o/spark
that referenced
this pull request
Dec 9, 2020
…to nonInheritableMetadataKeys in Alias
rshkv
pushed a commit
to palantir/spark
that referenced
this pull request
Jan 28, 2021
…lan in join() to not break DetectAmbiguousSelfJoin
Currently, `join()` uses `withPlan(logicalPlan)` as a convenient way to call some Dataset functions. However, this makes the `dataset_id` inconsistent between the `logicalPlan` and the original `Dataset`, because `withPlan(logicalPlan)` creates a new Dataset with a new id and resets the `dataset_id` of the `logicalPlan` to that new id. As a result, it breaks the rule `DetectAmbiguousSelfJoin`.
In this PR, we propose to drop the usage of `withPlan` and use the `logicalPlan` directly so that its `dataset_id` doesn't change.
Besides, this PR also removes the related metadata (`DATASET_ID_KEY`, `COL_POS_KEY`) when an `Alias` constructs its own metadata, because the `Alias` is no longer a reference column after it is converted to an `Attribute`. To achieve that, we add a new field, `deniedMetadataKeys`, to indicate the metadata that needs to be removed.
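As a rough illustration of the metadata filtering described above, the sketch below models an alias that inherits its child's metadata minus a deny list. It is a toy model in plain Scala, not Spark's implementation: the `AliasMetadataSketch` object, the `aliasMetadata` helper, the use of a `Map` in place of Spark's `Metadata` class, and the two key strings are all assumptions made for illustration.

```scala
// Toy model only: a plain Map stands in for Spark's Metadata class.
object AliasMetadataSketch {
  // Key strings are assumed for illustration.
  val DatasetIdKey = "__dataset_id"
  val ColPosKey = "__col_position"

  // Keys that an alias should not inherit from its child expression.
  val deniedMetadataKeys: Seq[String] = Seq(DatasetIdKey, ColPosKey)

  // Copy the child's metadata, dropping every denied key.
  def aliasMetadata(child: Map[String, Any]): Map[String, Any] =
    child.filter { case (key, _) => !deniedMetadataKeys.contains(key) }

  def main(args: Array[String]): Unit = {
    val childMetadata = Map[String, Any](
      DatasetIdKey -> 7L,
      ColPosKey -> 0,
      "comment" -> "employee key")
    // Only the "comment" entry survives the filter.
    println(aliasMetadata(childMetadata))
  }
}
```

Dropping the two reference-tracking keys means the aliased column no longer looks like a column reference from the original Dataset, which matches the reasoning given for `deniedMetadataKeys`.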
For the query below, Spark returns a wrong result when it should throw an ambiguous self-join exception instead:
```scala
// Assumes `case class TestData(key: Int, value: String)` and
// `import spark.implicits._` are in scope, as in Spark's SQL test suites.
val emp1 = Seq[TestData](
  TestData(1, "sales"),
  TestData(2, "personnel"),
  TestData(3, "develop"),
  TestData(4, "IT")).toDS()
val emp2 = Seq[TestData](
  TestData(1, "sales"),
  TestData(2, "personnel"),
  TestData(3, "develop")).toDS()
// emp3 is derived from emp1, so joining emp1 with emp3 is a self-join.
val emp3 = emp1.join(emp2, emp1("key") === emp2("key")).select(emp1("*"))
emp1.join(emp3, emp1.col("key") === emp3.col("key"), "left_outer")
  .select(emp1.col("*"), emp3.col("key").as("e2")).show()
// wrong result
// +---+---------+---+
// |key|    value| e2|
// +---+---------+---+
// |  1|    sales|  1|
// |  2|personnel|  2|
// |  3|  develop|  3|
// |  4|       IT|  4|
// +---+---------+---+
```
This PR fixes the wrong behaviour.
Yes, users hit the exception instead of the wrong result after this PR.
Added a new unit test.
Closes apache#30488 from Ngone51/fix-self-join.
Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
rshkv
pushed a commit
to palantir/spark
that referenced
this pull request
Jan 28, 2021
…to nonInheritableMetadataKeys in Alias
laflechejonathan
pushed a commit
to palantir/spark
that referenced
this pull request
Sep 27, 2021
…lan in join() to not break DetectAmbiguousSelfJoin
laflechejonathan
pushed a commit
to palantir/spark
that referenced
this pull request
Sep 27, 2021
…to nonInheritableMetadataKeys in Alias
### What changes were proposed in this pull request?

Currently, `join()` uses `withPlan(logicalPlan)` as a convenient way to call some Dataset functions. However, this makes the `dataset_id` inconsistent between the `logicalPlan` and the original `Dataset`, because `withPlan(logicalPlan)` creates a new Dataset with a new id and resets the `dataset_id` of the `logicalPlan` to that new id. As a result, it breaks the rule `DetectAmbiguousSelfJoin`.

In this PR, we propose to drop the usage of `withPlan` and use the `logicalPlan` directly so that its `dataset_id` doesn't change.

Besides, this PR also removes the related metadata (`DATASET_ID_KEY`, `COL_POS_KEY`) when an `Alias` constructs its own metadata, because the `Alias` is no longer a reference column after it is converted to an `Attribute`. To achieve that, we add a new field, `deniedMetadataKeys`, to indicate the metadata that needs to be removed.

### Why are the changes needed?

For the self-join query shown in the commit message above, Spark returns a wrong result when it should throw an ambiguous self-join exception instead. This PR fixes the wrong behaviour.

### Does this PR introduce _any_ user-facing change?

Yes, users hit the exception instead of the wrong result after this PR.

### How was this patch tested?

Added a new unit test.
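
The `dataset_id` drift described in this PR can be modeled with a small, self-contained sketch. It is only a toy: the `Plan` and `Ds` classes, the `datasetIdTag` field, and the id counter are hypothetical stand-ins for Spark's `LogicalPlan`, `Dataset`, and its id bookkeeping, and are written under the assumption that each new Dataset stamps its own id onto the plan it wraps, as the description states.

```scala
import java.util.concurrent.atomic.AtomicLong

// Toy model only: hypothetical stand-ins for LogicalPlan and Dataset.
object DatasetIdSketch {
  private val curId = new AtomicLong()

  // A mutable plan node that remembers which dataset last tagged it.
  final class Plan { var datasetIdTag: Option[Long] = None }

  // Every Ds gets a fresh id and stamps it onto the plan it wraps.
  final class Ds(val plan: Plan) {
    val id: Long = curId.getAndIncrement()
    plan.datasetIdTag = Some(id)

    // Models `withPlan`: wrapping a plan creates a new id and re-tags the plan.
    def withPlan(p: Plan): Ds = new Ds(p)
  }

  def main(args: Array[String]): Unit = {
    val original = new Ds(new Plan)
    println(s"tag matches before: ${original.plan.datasetIdTag.contains(original.id)}")
    // Re-wrapping the same plan overwrites the tag with the new wrapper's id.
    original.withPlan(original.plan)
    println(s"tag matches after:  ${original.plan.datasetIdTag.contains(original.id)}")
  }
}
```

In this model the plan's tag no longer matches the original wrapper's id after `withPlan`, which is the kind of inconsistency the PR avoids by using the `logicalPlan` directly.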