Skip to content

[SPARK-19065][SQL]Don't inherit expression id in dropDuplicates #16564

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 3 commits into from
Closed

[SPARK-19065][SQL]Don't inherit expression id in dropDuplicates #16564

wants to merge 3 commits into from

Conversation

zsxwing
Copy link
Member

@zsxwing zsxwing commented Jan 12, 2017

What changes were proposed in this pull request?

dropDuplicates will create an Alias using the same exprId, so StreamExecution should also replace Alias if necessary.

How was this patch tested?

test("SPARK-19065: dropDuplicates should not create expressions using the same id")

@zsxwing
Copy link
Member Author

zsxwing commented Jan 12, 2017

cc @marmbrus @liancheng

There may be similar issues in other SQL codes when replacing attributes. Maybe we should revert #15427?

@liancheng
Copy link
Contributor

LGTM as long as we decide to preserve #15427.

The root cause of this issue is still the df("col") syntax, which is the motivation behind #15427. We decided not to deprecate/remove this syntax to ensure backward compatibility but maybe we should reconsider that decision as it keeps causing issues and confuses users?

@marmbrus
Copy link
Contributor

Hmm, I'm not sure that I agree with the solution from #15427. I do not think that it should be valid to have to different expressions that have the same expression id. There are many case where we break the df("col") syntax by adding new operations, and I don't think it is worth that hack to preserve this particular case with drop duplicates.

@SparkQA
Copy link

SparkQA commented Jan 12, 2017

Test build #71275 has finished for PR 16564 at commit 13f54a9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zsxwing zsxwing changed the title [SPARK-19065][SS]Rewrite Alias in StreamExecution if necessary [SPARK-19065][SS] Don't inherit expression id in dropDuplicates Jan 12, 2017
@zsxwing
Copy link
Member Author

zsxwing commented Jan 12, 2017

Discussed with @marmbrus offline and decided to not support using df("col") like ds.dropDuplicates("_1").select(ds("_1").as[String], ds("_2").as[Int]).

@zsxwing
Copy link
Member Author

zsxwing commented Jan 12, 2017

Also cc @cloud-fan

@zsxwing zsxwing changed the title [SPARK-19065][SS] Don't inherit expression id in dropDuplicates [SPARK-19065][SQL]Don't inherit expression id in dropDuplicates Jan 12, 2017
@SparkQA
Copy link

SparkQA commented Jan 13, 2017

Test build #71277 has finished for PR 16564 at commit 4f38e3b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

checkDataset(
ds.dropDuplicates("_1").select(ds("_1").as[String], ds("_2").as[Int]),
("a", 1), ("b", 1))
test("SPARK-19065 dropDuplicates should not create expressions using the same id") {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do you have an end-to-end test to show that using same id when alias will cause troubles?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's in my fist commit: 13f54a9

I removed it because it's not a Structured Streaming issue.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This may introduce other unknown issues because I saw right now SQL rules that replace attributes don't deal with Aliass.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how about we remove this test and add a new test to show the behavior change more obvious?

val df = ...
val df2 = df.dropDuplicates("i")
intercept[AnalysisException] { df2.select(df("i")) }

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test doesn't work. The test needs to trigger an attribute replacement to expose this bug.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh I didn't mean to test this bug, but the behavior change, i.e. what you discussed with Michael: #16564 (comment)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems weird to me that adding a test to verify that we don't support some feature, so I just added my previous regression test back in order to have a test to catch this issue.

@SparkQA
Copy link

SparkQA commented Jan 17, 2017

Test build #71487 has started for PR 16564 at commit 26652a0.

@cloud-fan
Copy link
Contributor

retest this please

@cloud-fan
Copy link
Contributor

LGTM

@SparkQA
Copy link

SparkQA commented Jan 17, 2017

Test build #71506 has finished for PR 16564 at commit 26652a0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

asfgit pushed a commit that referenced this pull request Jan 17, 2017
## What changes were proposed in this pull request?

`dropDuplicates` will create an Alias using the same exprId, so `StreamExecution` should also replace Alias if necessary.

## How was this patch tested?

test("SPARK-19065: dropDuplicates should not create expressions using the same id")

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #16564 from zsxwing/SPARK-19065.

(cherry picked from commit a83accf)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
@cloud-fan
Copy link
Contributor

thanks, merging to master/2.1!

@asfgit asfgit closed this in a83accf Jan 17, 2017
@zsxwing zsxwing deleted the SPARK-19065 branch January 17, 2017 19:16
uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017
## What changes were proposed in this pull request?

`dropDuplicates` will create an Alias using the same exprId, so `StreamExecution` should also replace Alias if necessary.

## How was this patch tested?

test("SPARK-19065: dropDuplicates should not create expressions using the same id")

Author: Shixiong Zhu <shixiong@databricks.com>

Closes apache#16564 from zsxwing/SPARK-19065.
cmonkey pushed a commit to cmonkey/spark that referenced this pull request Feb 15, 2017
## What changes were proposed in this pull request?

`dropDuplicates` will create an Alias using the same exprId, so `StreamExecution` should also replace Alias if necessary.

## How was this patch tested?

test("SPARK-19065: dropDuplicates should not create expressions using the same id")

Author: Shixiong Zhu <shixiong@databricks.com>

Closes apache#16564 from zsxwing/SPARK-19065.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants