[SPARK-24172][SQL] we should not apply operator pushdown to data source v2 many times #21230
Conversation
Test build #90140 has finished for PR 21230 at commit
var pushed = false
plan transformDown {
  // PhysicalOperation guarantees that filters are deterministic; no need to check
  case PhysicalOperation(project, filters, relation: DataSourceV2Relation) if !pushed =>
Is it possible that one plan has multiple `PhysicalOperation` matches?
`PhysicalOperation` just accumulates the projects and filters above a specific node. If we transform down a tree, and only transform once, we will never hit `PhysicalOperation` more than once.
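The guard described above can be sketched with a toy tree (a minimal sketch; these classes are not Spark's real `LogicalPlan` or `PhysicalOperation`). A top-down transform visits every node, so both the `Project` and the `Filter` under it would match an operator-chain pattern; the `pushed` flag limits the rule to its first, outermost match:

```scala
// Minimal sketch, not Spark's real classes: a tiny plan tree whose
// transformDown applies a rule top-down at every node.
sealed trait Plan {
  def children: Seq[Plan]
  def withChildren(cs: Seq[Plan]): Plan
  def transformDown(rule: PartialFunction[Plan, Plan]): Plan = {
    val applied = rule.applyOrElse(this, identity[Plan])
    applied.withChildren(applied.children.map(_.transformDown(rule)))
  }
}
case class Project(child: Plan) extends Plan {
  def children = Seq(child); def withChildren(cs: Seq[Plan]) = copy(child = cs.head)
}
case class Filter(child: Plan) extends Plan {
  def children = Seq(child); def withChildren(cs: Seq[Plan]) = copy(child = cs.head)
}
case object Relation extends Plan {
  def children = Nil; def withChildren(cs: Seq[Plan]) = this
}

object GuardDemo {
  def run(): Int = {
    var pushed = false
    var fired = 0
    Project(Filter(Relation)).transformDown {
      // stand-in for the PhysicalOperation match; the guard makes it one-shot
      case p @ (_: Project | _: Filter) if !pushed =>
        pushed = true; fired += 1; p
    }
    fired
  }
  def main(args: Array[String]): Unit = println(run()) // prints 1
}
```

Without the `!pushed` guard the same rule body would run at both the `Project` and the `Filter`.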
That approach fits with what I suggested on #21118. We could have the scan node handle the filter and the projection so that it doesn't matter whether the source produces …
@cloud-fan, I opened #21262, which is similar to this but does pushdown when converting to a physical plan. You might like that as an alternative because it cleans up …

The drawback to that approach, which I had forgotten about, is that it breaks …

Up to you how to continue with this work; I just think we should consider the other approach since it solves a few problems. And …
Hi @rdblue, thanks for your new approach! Like you said, the major problem is statistics. This is unfortunately a problem of Spark's CBO design: statistics should belong to the physical node, but they currently belong to the logical node. For file-based data sources, since they are builtin sources, we can create rules to update statistics at the logical phase, i.e. …

That said, I do like your approach if we can fix the statistics problem first. I'm not sure how hard it will be or how soon it can be fixed, cc @wzhfy. Before that, I'd like to keep the pushdown logic in the optimizer and leave the hard work to Spark instead of users. What do you think?
Sounds good to me. Let's plan on getting this one in to fix the current problem, and commit the other approach when stats are fixed.
retest this please
Test build #90436 has finished for PR 21230 at commit
+1 (assuming tests pass)
@@ -23,17 +23,10 @@ import org.apache.spark.sql.catalyst.plans.logical.{Filter, LogicalPlan, Project
 import org.apache.spark.sql.catalyst.rules.Rule

 object PushDownOperatorsToDataSource extends Rule[LogicalPlan] {
-  override def apply(plan: LogicalPlan): LogicalPlan = plan transformUp {
+  override def apply(plan: LogicalPlan): LogicalPlan = plan.mapChildren {
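As a sketch of why this change helps (a toy tree, not Spark's `TreeNode` API): `mapChildren` rewrites only the direct children of a node and does not recurse, while a `transform` visits every node in the tree, so a pattern can fire once per child rather than once per matching node:

```scala
sealed trait Node {
  def children: Seq[Node]
  def withChildren(cs: Seq[Node]): Node
  // Non-recursive: applies f to direct children only.
  def mapChildren(f: Node => Node): Node = withChildren(children.map(f))
  // Recursive: visits every node in the tree, top-down.
  def transformDown(f: Node => Node): Node = f(this).mapChildren(_.transformDown(f))
}
case class Op(name: String, children: Seq[Node]) extends Node {
  def withChildren(cs: Seq[Node]) = copy(children = cs)
}

object MapChildrenDemo {
  // Root -> Project -> Filter -> Relation
  val tree = Op("Root", Seq(Op("Project", Seq(Op("Filter", Seq(Op("Relation", Nil)))))))

  def countMapChildren(): Int = {
    var n = 0; tree.mapChildren { c => n += 1; c }; n // touches only the direct child
  }
  def countTransformDown(): Int = {
    var n = 0; tree.transformDown { c => n += 1; c }; n // touches all four nodes
  }
  def main(args: Array[String]): Unit =
    println((countMapChildren(), countTransformDown())) // prints (1,4)
}
```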
Could you update the PR description (`transformDown` -> `mapChildren`), too?
Test build #90467 has finished for PR 21230 at commit
+1, LGTM.
Test build #90488 has finished for PR 21230 at commit
retest this please
Test build #90506 has finished for PR 21230 at commit
LGTM, thanks! Merged to master.
LGTM
[SPARK-24172][SQL] we should not apply operator pushdown to data source v2 many times

## What changes were proposed in this pull request?

In `PushDownOperatorsToDataSource`, we use `transformUp` to match `PhysicalOperation` and apply pushdown. This is problematic if we have multiple `Filter` and `Project` above the data source v2 relation. e.g. for a query

```
Project
  Filter
    DataSourceV2Relation
```

The pattern match will be triggered twice and we will do operator pushdown twice. This is unnecessary; we can use `mapChildren` to only apply pushdown once.

## How was this patch tested?

existing test

Author: Wenchen Fan <wenchen@databricks.com>

Closes apache#21230 from cloud-fan/step2.
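The double trigger described in the commit message can be reproduced on a toy tree (a minimal sketch; none of these classes are Spark's real ones, and the pattern is a simplified stand-in for `PhysicalOperation`). With a bottom-up `transformUp`, a pattern matching any operator chain over the relation fires at the `Filter` and again at the `Project` above it:

```scala
sealed trait Plan {
  def children: Seq[Plan]
  def withChildren(cs: Seq[Plan]): Plan
  // Bottom-up rewrite, as in Catalyst's transformUp: children first, then this node.
  def transformUp(rule: PartialFunction[Plan, Plan]): Plan = {
    val withNewChildren = withChildren(children.map(_.transformUp(rule)))
    rule.applyOrElse(withNewChildren, identity[Plan])
  }
}
case class Project(child: Plan) extends Plan {
  def children = Seq(child); def withChildren(cs: Seq[Plan]) = copy(child = cs.head)
}
case class Filter(child: Plan) extends Plan {
  def children = Seq(child); def withChildren(cs: Seq[Plan]) = copy(child = cs.head)
}
case object Relation extends Plan {
  def children = Nil; def withChildren(cs: Seq[Plan]) = this
}

object DoublePushdownDemo {
  def run(): Int = {
    var pushdowns = 0
    Project(Filter(Relation)).transformUp {
      // stand-in for the PhysicalOperation pattern: fires at every
      // Project/Filter node, so a nested chain triggers it repeatedly
      case p @ (_: Project | _: Filter) => pushdowns += 1; p
    }
    pushdowns
  }
  def main(args: Array[String]): Unit = println(run()) // prints 2
}
```

Applying the rule once from the root (the `mapChildren` fix) avoids the second firing.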