[SPARK-47129][CONNECT][SQL][3.5] Make ResolveRelations cache connect plan properly #46291

Closed

Conversation

@zhengruifeng (Contributor) commented on Apr 30, 2024

What changes were proposed in this pull request?

Make `ResolveRelations` handle the plan id properly.

This cherry-picks the bug fix #45214 to branch-3.5.

Why are the changes needed?

This is a bug fix for Spark Connect; it does not affect classic Spark SQL.

Before this PR:

```python
from pyspark.sql import functions as sf

spark.range(10).withColumn("value_1", sf.lit(1)).write.saveAsTable("test_table_1")
spark.range(10).withColumnRenamed("id", "index").withColumn("value_2", sf.lit(2)).write.saveAsTable("test_table_2")

df1 = spark.read.table("test_table_1")
df2 = spark.read.table("test_table_2")
df3 = spark.read.table("test_table_1")

join1 = df1.join(df2, on=df1.id == df2.index).select(df2.index, df2.value_2)
join2 = df3.join(join1, how="left", on=join1.index == df3.id)

join2.schema
```

fails with:

```
AnalysisException: [CANNOT_RESOLVE_DATAFRAME_COLUMN] Cannot resolve dataframe column "id". It's probably because of illegal references like `df1.select(df2.col("a"))`. SQLSTATE: 42704
```

That is because the existing plan caching in `ResolveRelations` does not work correctly with Spark Connect:

```
=== Applying Rule org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations ===
 '[#12]Join LeftOuter, '`==`('index, 'id)                     '[#12]Join LeftOuter, '`==`('index, 'id)
!:- '[#9]UnresolvedRelation [test_table_1], [], false         :- '[#9]SubqueryAlias spark_catalog.default.test_table_1
!+- '[#11]Project ['index, 'value_2]                          :  +- 'UnresolvedCatalogRelation `spark_catalog`.`default`.`test_table_1`, [], false
!   +- '[#10]Join Inner, '`==`('id, 'index)                   +- '[#11]Project ['index, 'value_2]
!      :- '[#7]UnresolvedRelation [test_table_1], [], false      +- '[#10]Join Inner, '`==`('id, 'index)
!      +- '[#8]UnresolvedRelation [test_table_2], [], false         :- '[#9]SubqueryAlias spark_catalog.default.test_table_1
!                                                                   :  +- 'UnresolvedCatalogRelation `spark_catalog`.`default`.`test_table_1`, [], false
!                                                                   +- '[#8]SubqueryAlias spark_catalog.default.test_table_2
!                                                                      +- 'UnresolvedCatalogRelation `spark_catalog`.`default`.`test_table_2`, [], false

Can not resolve 'id with plan 7
```

`[#7]UnresolvedRelation [test_table_1], [], false` was wrongly resolved to the cached plan

```
:- '[#9]SubqueryAlias spark_catalog.default.test_table_1
   +- 'UnresolvedCatalogRelation `spark_catalog`.`default`.`test_table_1`, [], false
```

which carries plan id 9 instead of 7, so the reference `df1.id` (bound to plan id 7) can no longer be resolved.
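For context, the plan id is what lets Spark Connect tie a column reference such as `df1.id` back to the DataFrame it came from; the error above is normally raised only for genuinely illegal cross-DataFrame references, which is not the case in this reproduction. A minimal sketch of the legal vs. illegal patterns (reusing the tables created above; illustrative only):

```python
df1 = spark.read.table("test_table_1")
df2 = spark.read.table("test_table_2")

# Legal: df1.id is bound (via its plan id) to df1's own plan.
df1.select(df1.id)

# Illegal: df2.index belongs to df2's plan, which is not part of df1's plan.
# This is the kind of reference CANNOT_RESOLVE_DATAFRAME_COLUMN is meant to flag.
# df1.select(df2.index)
```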

Does this PR introduce any user-facing change?

Yes, this is a bug fix.

How was this patch tested?

Added a unit test.
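A test along the following lines (a hypothetical sketch built from the reproduction above, not necessarily the exact unit test added by this PR) exercises the fixed path end to end:

```python
from pyspark.sql import functions as sf

# Hypothetical test sketch; table names are taken from the reproduction above.
spark.range(10).withColumn("value_1", sf.lit(1)).write.saveAsTable("test_table_1")
spark.range(10).withColumnRenamed("id", "index").withColumn("value_2", sf.lit(2)).write.saveAsTable("test_table_2")

df1 = spark.read.table("test_table_1")
df2 = spark.read.table("test_table_2")
df3 = spark.read.table("test_table_1")

join1 = df1.join(df2, on=df1.id == df2.index).select(df2.index, df2.value_2)
join2 = df3.join(join1, how="left", on=join1.index == df3.id)

# Before the fix this raised CANNOT_RESOLVE_DATAFRAME_COLUMN when the plan was
# analyzed; after the fix the schema contains the columns from both sides.
assert {f.name for f in join2.schema.fields} == {"id", "value_1", "index", "value_2"}
```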

Was this patch authored or co-authored using generative AI tooling?

ci

@dongjoon-hyun (Member) commented:

Please update the JIRA information first because we cannot backport an improvement, @zhengruifeng .

[Screenshot attached: 2024-04-29 at 19:08]

@zhengruifeng (Contributor, Author) replied:

Thanks @dongjoon-hyun for the reminder; I have updated the Type and versions.

@dongjoon-hyun (Member) left a review comment:

+1, LGTM. Thank you, @zhengruifeng .

@zhengruifeng (Contributor, Author) commented:

The failed Java linter seems unrelated; it was already broken in branch-3.5: https://github.com/apache/spark/actions/runs/8886943128/job/24404782422

zhengruifeng added a commit that referenced this pull request on Apr 30, 2024: "[SPARK-47129][CONNECT][SQL][3.5] Make ResolveRelations cache connect plan properly"

Closes #46291 from zhengruifeng/connect_fix_read_join_35.

Authored-by: Ruifeng Zheng <ruifengz@apache.org>
Signed-off-by: Ruifeng Zheng <ruifengz@apache.org>
@zhengruifeng (Contributor, Author) commented:

Thanks @dongjoon-hyun so much!

Merged to branch-3.5.

@zhengruifeng deleted the connect_fix_read_join_35 branch on April 30, 2024 at 04:45.