Spark: Analyze but don't optimize view body during creation#14681
Spark: Analyze but don't optimize view body during creation#14681nastra merged 5 commits intoapache:mainfrom
Conversation
| rewritten: Boolean = false) extends BinaryCommand { | ||
| override def left: LogicalPlan = child | ||
| rewritten: Boolean = false, | ||
| isAnalyzed: Boolean = false) extends AnalysisOnlyCommand { |
There was a problem hiding this comment.
nit: Can we add a comment to explain why we want to use AnalysisOnlyCommand?
There was a problem hiding this comment.
Good call! See db21f91 for added comments
|
LGTM. This aligns Iceberg |
|
🙇 Thank you for the review @huaxingao ! Any timeline on when this can be merged? |
|
@jbewing I'll also take a look at this PR this week |
|
Sounds good! Thank you @nastra ! |
| override protected def withNewChildrenInternal( | ||
| newLeft: LogicalPlan, newRight: LogicalPlan): LogicalPlan = | ||
| copy(child = newLeft, query = newRight) | ||
| def markAsAnalyzed(analysisContext: AnalysisContext): LogicalPlan = { |
There was a problem hiding this comment.
override def markAsAnalyzed(analysisContext: AnalysisContext): LogicalPlan = {
copy(isAnalyzed = true)
}
There was a problem hiding this comment.
can you please also fix the other spark versions?
There was a problem hiding this comment.
@jbewing I think this is missing the override for markAsAnalyzed?
| rewritten: Boolean = false) extends BinaryCommand { | ||
| override def left: LogicalPlan = child | ||
| rewritten: Boolean = false, | ||
| // Align Iceberg CreateIcebergView with Spark’s CreateViewCommand by extending AnalysisOnlyCommand. |
There was a problem hiding this comment.
| // Align Iceberg CreateIcebergView with Spark’s CreateViewCommand by extending AnalysisOnlyCommand. | |
| // Align Iceberg's CreateIcebergView with Spark’s CreateViewCommand by extending AnalysisOnlyCommand. |
There was a problem hiding this comment.
also can we move the comment right above case class CreateIcebergView?
- Improve + move CreateIcebergView comment for clarity - Remove excessive indentation in `markAsAnalyzed` scala function
nastra
left a comment
There was a problem hiding this comment.
LGTM with one final comment
|
Actually, hold for a re-review as a recent PR #8023 has been merged which creates a ton of conflicts here that I need to resolve |
|
Alright, I've rebased on master! |
What
This PR updates view creation from Spark 4, 3.5, & 3.4 to analyze, but not optimize the view body when creating a view. Previously, the view body would be optimized which could result in long view creation times with larger tables. When creating views over a larger table (hundreds of TBs), creating a small number of views (say just a couple thousand) takes about ~12 hours and requires a moderately sized Spark cluster (~100 CPUs). Without running optimization over a view body, the view body is still analyzed for invalid syntax or references.
How
Upon looking upstream at Spark, we can see that for similar pieces of the view creation logic that views have some explicit code that enables the view body to be analyzed, but not optimized.
In Iceberg, we hijack the regular upstream Spark code and run our own variants of view creation that don't pull in this optimization. Given that a table scan planning is both redundant and slow in this case, we should update the internal Iceberg view creation code to only be used in Spark analysis, but not optimization phases.
Testing
I've run the existing test suite locally for Spark 3.4, 3.5, & 4 to verify that they still pass. Additionally, I've run this iceberg patch on an fork of Iceberg 1.10.0 on a fork of Spark 3.5 an observed in a staging environment that a task which creates some views over a smaller (~10TB) table that used to take 2 hours now takes 14 minutes consistently. Additionally, no errors or bugs were observed with the created views when testing in this staging environment.
Issue: #14680