Skip to content

[SPARK-29947][SQL][followup] ResolveRelations should return relations with fresh attribute IDs #28717

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 1 commit into from

Conversation

cloud-fan
Copy link
Contributor

What changes were proposed in this pull request?

This is a followup of #26589, which caches the table relations to speed up the table lookup. However, it brings some side effects: the rule ResolveRelations may return exactly the same relations, while before it always returns relations with fresh attribute IDs.

This PR is to eliminate this side effect.

Why are the changes needed?

There is no bug report yet, but this side effect may impact things like self-join. It's better to restore the 2.4 behavior and always return refresh relations.

Does this PR introduce any user-facing change?

no

How was this patch tested?

N/A

@cloud-fan
Copy link
Contributor Author

@wangyum

@cloud-fan
Copy link
Contributor Author

also cc @HyukjinKwon @brkyvz

Copy link
Member

@maropu maropu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks resonable changes

Copy link
Member

@HyukjinKwon HyukjinKwon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@SparkQA
Copy link

SparkQA commented Jun 3, 2020

Test build #123472 has finished for PR 28717 at commit 6924118.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Copy link
Contributor Author

retest this please

@SparkQA
Copy link

SparkQA commented Jun 3, 2020

Test build #123485 has finished for PR 28717 at commit 6924118.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Contributor

@brkyvz brkyvz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cloud-fan Can you please add unit test? A triple self-join should be able to reproduce this issue

@@ -1024,7 +1024,12 @@ class Analyzer(
DataSourceV2Relation.create(table, Some(catalog), Some(ident)))
}
val key = catalog.name +: ident.namespace :+ ident.name
Option(AnalysisContext.get.relationCache.getOrElseUpdate(key, loaded.orNull))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not just:

Option(AnalysisContext.get.relationCache.getOrElseUpdate(key, loaded.orNull)).map { rel =>
  rel.transform { case multi: MultiInstanceRelation => multi.newInstance() }
} 

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry just saw your comment after running the merge script.

We only need to refresh the attributes if we get the relation from the cache. Otherwise, it's already a fresh relation.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that is correct, but IMO it's a little cleaner. Creating new attributes should be super cheap, and not worth having a orElses and foreachs.

@cloud-fan
Copy link
Contributor Author

thanks, merging to master/3.0!

@cloud-fan cloud-fan closed this in dc0709f Jun 3, 2020
cloud-fan added a commit that referenced this pull request Jun 3, 2020
… with fresh attribute IDs

### What changes were proposed in this pull request?

This is a followup of #26589, which caches the table relations to speed up the table lookup. However, it brings some side effects: the rule `ResolveRelations` may return exactly the same relations, while before it always returns relations with fresh attribute IDs.

This PR is to eliminate this side effect.

### Why are the changes needed?

There is no bug report yet, but this side effect may impact things like self-join. It's better to restore the 2.4 behavior and always return refresh relations.

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

N/A

Closes #28717 from cloud-fan/fix.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit dc0709f)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
@cloud-fan
Copy link
Contributor Author

cloud-fan commented Jun 3, 2020

@brkyvz I tried triple self-join but can't find a buggy case:

sql("select * from t t1 join t t2 join t t3 where t1.a = t3.a").show

sql("select * from t t1 join t t2 on t1.b = t2.a join t t3 on t1.a = t3.a").show

sql("select * from t t1 join t t2 on t1.b = t2.a join t t3 on t2.a = t3.a").show

Please let me know if you see some broken join queries.

@brkyvz
Copy link
Contributor

brkyvz commented Jun 3, 2020

what if you do something like:

  1. join t with t on a
  2. Do a group by to get distinct values of a and b
  3. Join grouped results again with t on b
    so that the expression Ids pass through

@cloud-fan
Copy link
Contributor Author

sql("select * from (select t1.a, t1.b from t t1 join t t2 on t1.a = t2.a group by t1.a, t1.b) j join t t3 where j.b = t3.b").show

This also works. Please let me know if I miss something, thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants