[SPARK-29947][SQL][followup] ResolveRelations should return relations with fresh attribute IDs #28717

cloud-fan · 2020-06-03T07:59:39Z

What changes were proposed in this pull request?

This is a followup of #26589, which caches table relations to speed up table lookup. However, the cache has a side effect: the rule ResolveRelations may now return exactly the same relation instance for repeated lookups, whereas before it always returned relations with fresh attribute IDs.

This PR is to eliminate this side effect.

Why are the changes needed?

There is no bug report yet, but this side effect may impact things like self-joins, where both sides would share the same attribute IDs. It's better to restore the 2.4 behavior and always return fresh relations.
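To make the side effect concrete, here is a minimal toy model in plain Scala. It is only a sketch: `Attr`, `Relation`, `load`, `lookupShared` and `lookupFresh` are hypothetical stand-ins invented for illustration, not Spark's actual classes or the code in this PR. It shows how a cached lookup can hand back the same attribute IDs twice, and how re-instantiating the cached relation avoids that.

```scala
import java.util.concurrent.atomic.AtomicLong
import scala.collection.mutable

object FreshIdSketch {
  // Global counter standing in for an expression-ID generator (assumption).
  private val nextId = new AtomicLong(0L)

  // Toy stand-ins for an attribute and a relation; NOT Spark's classes.
  final case class Attr(name: String, id: Long)
  final case class Relation(table: String, output: Seq[Attr]) {
    // Same columns, but every attribute gets a newly allocated ID.
    def newInstance(): Relation =
      copy(output = output.map(a => a.copy(id = nextId.getAndIncrement())))
  }

  // Hypothetical loader: every table has columns a and b.
  def load(table: String): Relation =
    Relation(table, Seq("a", "b").map(n => Attr(n, nextId.getAndIncrement())))

  private val cache = mutable.HashMap.empty[String, Relation]

  // Cached lookup that returns the stored instance as-is:
  // two lookups of the same table share attribute IDs.
  def lookupShared(table: String): Relation =
    cache.getOrElseUpdate(table, load(table))

  // Cached lookup that re-IDs the relation on the way out:
  // two lookups of the same table never share attribute IDs.
  def lookupFresh(table: String): Relation =
    cache.getOrElseUpdate(table, load(table)).newInstance()
}
```

In this model, a self-join built from two `lookupShared("t")` calls would have indistinguishable `a` columns on both sides, while two `lookupFresh("t")` calls yield disjoint IDs.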

Does this PR introduce any user-facing change?

no

How was this patch tested?

N/A

cloud-fan · 2020-06-03T08:00:10Z

@wangyum

cloud-fan · 2020-06-03T08:20:34Z

also cc @HyukjinKwon @brkyvz

maropu

These look like reasonable changes.

HyukjinKwon

LGTM

SparkQA · 2020-06-03T13:14:52Z

Test build #123472 has finished for PR 28717 at commit 6924118.

This patch fails PySpark unit tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2020-06-03T13:35:26Z

retest this please

SparkQA · 2020-06-03T18:59:12Z

Test build #123485 has finished for PR 28717 at commit 6924118.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

brkyvz

@cloud-fan Can you please add a unit test? A triple self-join should be able to reproduce this issue.

brkyvz · 2020-06-03T19:06:14Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala

@@ -1024,7 +1024,12 @@ class Analyzer(
                DataSourceV2Relation.create(table, Some(catalog), Some(ident)))
          }
          val key = catalog.name +: ident.namespace :+ ident.name
-          Option(AnalysisContext.get.relationCache.getOrElseUpdate(key, loaded.orNull))


Why not just:

Option(AnalysisContext.get.relationCache.getOrElseUpdate(key, loaded.orNull)).map { rel => rel.transform { case multi: MultiInstanceRelation => multi.newInstance() } }

sorry just saw your comment after running the merge script.

We only need to refresh the attributes if we get the relation from the cache. Otherwise, it's already a fresh relation.

That is correct, but IMO my version is a little cleaner. Creating new attributes should be super cheap, so it's not worth having the extra orElse and foreach calls.
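The two shapes discussed in this thread can be compared in a toy model. This is a hedged sketch in plain Scala: `Attr`, `Relation`, `load`, `refreshOnHit` and `alwaysRefresh` are hypothetical names made up for illustration, and the merged Analyzer code is not shown in this thread, so neither function should be read as the actual implementation.

```scala
import java.util.concurrent.atomic.AtomicLong
import scala.collection.mutable

object RefreshStrategies {
  // Counter standing in for an expression-ID generator (assumption).
  private val nextId = new AtomicLong(0L)

  // Toy stand-ins; NOT Spark's classes.
  final case class Attr(name: String, id: Long)
  final case class Relation(table: String, output: Seq[Attr]) {
    def newInstance(): Relation =
      copy(output = output.map(_.copy(id = nextId.getAndIncrement())))
  }

  // Hypothetical loader: only table "t" exists.
  def load(table: String): Option[Relation] =
    if (table == "t") {
      Some(Relation(table, Seq("a", "b").map(Attr(_, nextId.getAndIncrement()))))
    } else None

  // Shape 1: refresh only on a cache hit. A freshly loaded relation
  // already has new IDs, so it is cached and returned unchanged.
  def refreshOnHit(cache: mutable.Map[String, Relation], table: String): Option[Relation] =
    cache.get(table).map(_.newInstance()).orElse {
      val loaded = load(table)
      loaded.foreach(cache.update(table, _))
      loaded
    }

  // Shape 2: always refresh, mirroring the one-liner suggested above.
  // Note that a failed load stores null in the cache in this shape.
  def alwaysRefresh(cache: mutable.Map[String, Relation], table: String): Option[Relation] =
    Option(cache.getOrElseUpdate(table, load(table).orNull)).map(_.newInstance())
}
```

Both shapes return relations with distinct attribute IDs on repeated lookups. The difference is that the first skips one redundant re-ID on a cache miss, while the second is shorter at the cost of allocating new IDs even for a relation that was just loaded.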

cloud-fan · 2020-06-03T19:07:27Z

thanks, merging to master/3.0!

… with fresh attribute IDs

### What changes were proposed in this pull request?

This is a followup of #26589, which caches the table relations to speed up the table lookup. However, it brings some side effects: the rule `ResolveRelations` may return exactly the same relations, while before it always returns relations with fresh attribute IDs.

This PR is to eliminate this side effect.

### Why are the changes needed?

There is no bug report yet, but this side effect may impact things like self-join. It's better to restore the 2.4 behavior and always return refresh relations.

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

N/A

Closes #28717 from cloud-fan/fix.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit dc0709f)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

cloud-fan · 2020-06-03T19:17:30Z

@brkyvz I tried triple self-join but can't find a buggy case:

sql("select * from t t1 join t t2 join t t3 where t1.a = t3.a").show

sql("select * from t t1 join t t2 on t1.b = t2.a join t t3 on t1.a = t3.a").show

sql("select * from t t1 join t t2 on t1.b = t2.a join t t3 on t2.a = t3.a").show

Please let me know if you see some broken join queries.

brkyvz · 2020-06-03T19:33:55Z

What if you do something like the following, so that the expression IDs pass through:

1. Join t with t on a
2. Do a group by to get distinct values of a and b
3. Join the grouped results again with t on b

cloud-fan · 2020-06-04T04:12:40Z

sql("select * from (select t1.a, t1.b from t t1 join t t2 on t1.a = t2.a group by t1.a, t1.b) j join t t3 where j.b = t3.b").show

This also works. Please let me know if I miss something, thanks!

ResolveRelations should return relations with fresh attribute IDs

6924118

probot-autolabeler bot added the SQL label Jun 3, 2020

maropu approved these changes Jun 3, 2020

wangyum approved these changes Jun 3, 2020

HyukjinKwon approved these changes Jun 3, 2020

brkyvz suggested changes Jun 3, 2020

cloud-fan closed this in dc0709f Jun 3, 2020
