[SPARK-50903][CONNECT] Cache logical plans after analysis #49584
Conversation
@xi-db FYI
sql/connect/server/src/main/scala/org/apache/spark/sql/connect/service/SessionHolder.scala
PR looks good.
logDebug(s"Cache a lazily analyzed logical plan for '$rel': $plan")
planCache.get.put(rel, plan)
} else {
  val plan = df.queryExecution.analyzed
We may have to add some invalidation logic here. The problem is that some of the objects (tables/views/UDFs) referenced by a query can change; in that case we want to validate that the cached plan still resolves to the most recent versions of those objects.
Agreed. I'll need to think about the interface.
Added LogicalPlan.isOutdated and updated the usePlanCache method. Calling 'refresh' seemed suboptimal, as schema changes often require re-planning the query.
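The invalidation idea above can be sketched with a minimal, self-contained model. This is a hypothetical illustration, not Spark's actual implementation: `Plan`, `CachedPlan`, and `lookup` are stand-ins for `LogicalPlan` with an `isOutdated`-style hook, where an entry is dropped from the cache when an object it references has changed since analysis.

```scala
// Hypothetical sketch: a plan cache lookup that re-validates entries
// whose underlying objects (tables/views/UDFs) may have changed.
object PlanCacheSketch {
  // Stand-in for a LogicalPlan carrying an isOutdated-style check.
  trait Plan { def isOutdated: Boolean }

  // A cached plan remembers the catalog version it was analyzed against;
  // it is outdated once the current version has moved on.
  final case class CachedPlan(version: Long, currentVersion: () => Long) extends Plan {
    def isOutdated: Boolean = version != currentVersion()
  }

  // Return the cached plan only if it is still valid; otherwise evict it
  // so the caller falls back to re-analyzing the query.
  def lookup(cache: scala.collection.mutable.Map[String, Plan], key: String): Option[Plan] =
    cache.get(key) match {
      case Some(p) if p.isOutdated =>
        cache.remove(key)
        None
      case other => other
    }
}
```

A version counter is only one possible staleness signal; the real check would consult catalog state, but the shape of the lookup (validate, then evict or return) is the same.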
What changes were proposed in this pull request?
Update the plan cache after logical plans are analyzed.
Additionally, this PR includes,
Why are the changes needed?
The session-local plan cache may store unanalyzed logical plans, causing them to be re-analyzed on every cache hit and degrading performance.
To be specific, in the session-local plan cache usage pattern, an unanalyzed plan may be cached during transformation, and Dataset.ofRows then analyzes it. A subsequent identical query can therefore hit the cache, receive the still-unanalyzed plan, and Dataset.ofRows analyzes the same plan again.
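The fix described above amounts to writing the analyzed plan back into the cache, so repeated hits skip re-analysis. A minimal sketch, with hypothetical names (`analyze` stands in for the analysis done by Dataset.ofRows, `transform` for the relation-to-plan path):

```scala
// Hypothetical sketch: cache the *analyzed* plan so that repeated
// lookups for the same relation do not re-run analysis.
object AnalyzedCacheSketch {
  var analyzeCalls = 0

  // Stand-in for the analysis step performed by Dataset.ofRows.
  def analyze(rel: String): String = {
    analyzeCalls += 1
    s"analyzed($rel)"
  }

  private val planCache = scala.collection.mutable.Map.empty[String, String]

  // On a miss, analyze once and store the analyzed result;
  // on a hit, return the already-analyzed plan without re-analysis.
  def transform(rel: String): String =
    planCache.getOrElseUpdate(rel, analyze(rel))
}
```

With the old behavior (caching the unanalyzed plan), `analyze` would run on every lookup for the same relation; here it runs once per distinct relation.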
Does this PR introduce any user-facing change?
No.
How was this patch tested?
testOnly org.apache.spark.sql.connect.service.SparkConnectSessionHolderSuite
Was this patch authored or co-authored using generative AI tooling?
No.