Skip to content

Spark: Analyze but don't optimize view body during creation#14681

Merged
nastra merged 5 commits intoapache:mainfrom
jbewing:analyze-dont-optimize-view-creation
Dec 3, 2025
Merged

Spark: Analyze but don't optimize view body during creation#14681
nastra merged 5 commits intoapache:mainfrom
jbewing:analyze-dont-optimize-view-creation

Conversation

@jbewing
Copy link
Contributor

@jbewing jbewing commented Nov 25, 2025

What

This PR updates view creation from Spark 4, 3.5, & 3.4 to analyze, but not optimize the view body when creating a view. Previously, the view body would be optimized which could result in long view creation times with larger tables. When creating views over a larger table (hundreds of TBs), creating a small number of views (say just a couple thousand) takes about ~12 hours and requires a moderately sized Spark cluster (~100 CPUs). Without running optimization over a view body, the view body is still analyzed for invalid syntax or references.

How

Upon looking upstream at Spark, we can see that for similar pieces of the view creation logic that views have some explicit code that enables the view body to be analyzed, but not optimized.

In Iceberg, we hijack the regular upstream Spark code and run our own variants of view creation that don't pull in this optimization. Given that a table scan planning is both redundant and slow in this case, we should update the internal Iceberg view creation code to only be used in Spark analysis, but not optimization phases.

Testing

I've run the existing test suite locally for Spark 3.4, 3.5, & 4 to verify that they still pass. Additionally, I've run this iceberg patch on an fork of Iceberg 1.10.0 on a fork of Spark 3.5 an observed in a staging environment that a task which creates some views over a smaller (~10TB) table that used to take 2 hours now takes 14 minutes consistently. Additionally, no errors or bugs were observed with the created views when testing in this staging environment.

Issue: #14680

@github-actions github-actions bot added the spark label Nov 25, 2025
@nastra nastra requested review from huaxingao and nastra November 25, 2025 08:30
rewritten: Boolean = false) extends BinaryCommand {
override def left: LogicalPlan = child
rewritten: Boolean = false,
isAnalyzed: Boolean = false) extends AnalysisOnlyCommand {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Can we add a comment to explain why we want to use AnalysisOnlyCommand?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good call! See db21f91 for added comments

@huaxingao
Copy link
Contributor

LGTM. This aligns Iceberg CreateIcebergView with Spark’s CreateViewCommand by extending AnalysisOnlyCommand. The command’s children are analyzed then hidden, so the optimizer/planner won’t traverse the view body.

@jbewing jbewing requested a review from huaxingao November 25, 2025 22:23
Copy link
Contributor

@huaxingao huaxingao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@jbewing
Copy link
Contributor Author

jbewing commented Nov 26, 2025

🙇 Thank you for the review @huaxingao ! Any timeline on when this can be merged?

@nastra
Copy link
Contributor

nastra commented Nov 26, 2025

@jbewing I'll also take a look at this PR this week

@jbewing
Copy link
Contributor Author

jbewing commented Nov 26, 2025

Sounds good! Thank you @nastra !

override protected def withNewChildrenInternal(
newLeft: LogicalPlan, newRight: LogicalPlan): LogicalPlan =
copy(child = newLeft, query = newRight)
def markAsAnalyzed(analysisContext: AnalysisContext): LogicalPlan = {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

override def markAsAnalyzed(analysisContext: AnalysisContext): LogicalPlan = {
    copy(isAnalyzed = true)
  }

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you please also fix the other spark versions?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in 1a09db6!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jbewing I think this is missing the override for markAsAnalyzed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're absolutely right @nastra! I've addressed that in f77a893!

rewritten: Boolean = false) extends BinaryCommand {
override def left: LogicalPlan = child
rewritten: Boolean = false,
// Align Iceberg CreateIcebergView with Spark’s CreateViewCommand by extending AnalysisOnlyCommand.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// Align Iceberg CreateIcebergView with Spark’s CreateViewCommand by extending AnalysisOnlyCommand.
// Align Iceberg's CreateIcebergView with Spark’s CreateViewCommand by extending AnalysisOnlyCommand.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also can we move the comment right above case class CreateIcebergView?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in 1a09db6

- Improve + move CreateIcebergView comment for clarity
- Remove excessive indentation in `markAsAnalyzed` scala function
@jbewing jbewing requested a review from nastra December 1, 2025 18:34
Copy link
Contributor

@nastra nastra left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM with one final comment

@jbewing
Copy link
Contributor Author

jbewing commented Dec 2, 2025

Thank you for the review @nastra! I've addressed your final comment in f77a893

@jbewing
Copy link
Contributor Author

jbewing commented Dec 2, 2025

Actually, hold for a re-review as a recent PR #8023 has been merged which creates a ton of conflicts here that I need to resolve

@jbewing jbewing requested a review from nastra December 2, 2025 19:33
@jbewing
Copy link
Contributor Author

jbewing commented Dec 2, 2025

Alright, I've rebased on master!

@nastra nastra merged commit fac485c into apache:main Dec 3, 2025
29 checks passed
thomaschow pushed a commit to thomaschow/iceberg that referenced this pull request Jan 19, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants