-
Notifications
You must be signed in to change notification settings - Fork 28.6k
[SPARK-19065][SQL]Don't inherit expression id in dropDuplicates #16564
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There may be similar issues in other SQL codes when replacing attributes. Maybe we should revert #15427? |
LGTM as long as we decide to preserve #15427. The root cause of this issue is still the |
Hmm, I'm not sure that I agree with the solution from #15427. I do not think that it should be valid to have to different expressions that have the same expression id. There are many case where we break the |
Test build #71275 has finished for PR 16564 at commit
|
Discussed with @marmbrus offline and decided to not support using |
Also cc @cloud-fan |
Test build #71277 has finished for PR 16564 at commit
|
checkDataset( | ||
ds.dropDuplicates("_1").select(ds("_1").as[String], ds("_2").as[Int]), | ||
("a", 1), ("b", 1)) | ||
test("SPARK-19065 dropDuplicates should not create expressions using the same id") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do you have an end-to-end test to show that using same id when alias will cause troubles?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's in my fist commit: 13f54a9
I removed it because it's not a Structured Streaming issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This may introduce other unknown issues because I saw right now SQL rules that replace attributes don't deal with Alias
s.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how about we remove this test and add a new test to show the behavior change more obvious?
val df = ...
val df2 = df.dropDuplicates("i")
intercept[AnalysisException] { df2.select(df("i")) }
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test doesn't work. The test needs to trigger an attribute replacement to expose this bug.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh I didn't mean to test this bug, but the behavior change, i.e. what you discussed with Michael: #16564 (comment)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems weird to me that adding a test to verify that we don't support some feature, so I just added my previous regression test back in order to have a test to catch this issue.
Test build #71487 has started for PR 16564 at commit |
retest this please |
LGTM |
Test build #71506 has finished for PR 16564 at commit
|
## What changes were proposed in this pull request? `dropDuplicates` will create an Alias using the same exprId, so `StreamExecution` should also replace Alias if necessary. ## How was this patch tested? test("SPARK-19065: dropDuplicates should not create expressions using the same id") Author: Shixiong Zhu <shixiong@databricks.com> Closes #16564 from zsxwing/SPARK-19065. (cherry picked from commit a83accf) Signed-off-by: Wenchen Fan <wenchen@databricks.com>
thanks, merging to master/2.1! |
## What changes were proposed in this pull request? `dropDuplicates` will create an Alias using the same exprId, so `StreamExecution` should also replace Alias if necessary. ## How was this patch tested? test("SPARK-19065: dropDuplicates should not create expressions using the same id") Author: Shixiong Zhu <shixiong@databricks.com> Closes apache#16564 from zsxwing/SPARK-19065.
## What changes were proposed in this pull request? `dropDuplicates` will create an Alias using the same exprId, so `StreamExecution` should also replace Alias if necessary. ## How was this patch tested? test("SPARK-19065: dropDuplicates should not create expressions using the same id") Author: Shixiong Zhu <shixiong@databricks.com> Closes apache#16564 from zsxwing/SPARK-19065.
What changes were proposed in this pull request?
dropDuplicates
will create an Alias using the same exprId, soStreamExecution
should also replace Alias if necessary.How was this patch tested?
test("SPARK-19065: dropDuplicates should not create expressions using the same id")