-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support limit pushdown through left right outer join #2580
Conversation
apache#2569) * add scan_empty method to tests * update tests to use new scan_empty test method * remove LogicalPlanBuilder::scan_empty * LogicalPlanBuilder now uses TableSource instead of TableProvider
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the contribution @Ted-Jiang , howe I don't think this is a valid optimization.
Specifically, because a join can filter out rows, if you limit the input you may actually end up with fewer output rows than the limit.
Consider this input:
left
:
l |
---|
1 |
2 |
... |
100 |
right
:
r |
---|
99 |
100 |
The output of select * from left LEFT JOIN right ON (l = r)
should be:
l | r |
---|---|
99 | 99 |
100 | 100 |
However, if you push the limit down to the scan on left
it would only send this into the JOIN
left
:
l |
---|
1 |
2 |
And thus would produce no output.
If we want to optimize limits in Joins, I think it would have to be done in the Join Operator itself (to stop producing rows once the limit is hit). However, as long as the Join operator is producing rows in batches, the effect of implementing an internal limit will likely be small (because it would save only a part of one output batch, when the limit on the output of a join is hit)
I think this situation is
But this rule will apply
@alamb is this situation make sense |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One question is df LEFT JOIN
is equal to LEFT OUTER JOIN
?
Yes |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, you are right @Ted-Jiang -- I was confused about the side of the join 🤦
Upon more thought I think this is correct.
Thank you again.
I'll leave it open for another day or so before merging in case anyone else has thoughts
Thanks @alamb ❤️. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM.
It can reduce data source in the side of join.
Great job❤️.
Co-authored-by: jakevin <30525741+jackwener@users.noreply.github.com>
it seems as if there is a clippy error now 😢 |
And now there is a conflicts 🤦 |
# Conflicts: # datafusion/core/src/optimizer/limit_push_down.rs
reopened as part of #2596 |
#2596 is now merged 👏 |
Which issue does this PR close?
run
before
after:
Closes #2579.
Rationale for this change
What changes are included in this PR?
Are there any user-facing changes?