-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-13495][SQL] Add Null Filters in the query plan for Filters/Joins based on their data constraints #11372
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Test build #51980 has finished for PR 11372 at commit
|
06d74da to
2345075
Compare
|
Test build #51994 has finished for PR 11372 at commit
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am wondering if the optimizer is the right place for this rule. My main concern is that if we can preserve this ordering through the rest of query compilation. Will it be better to do it inside the physical Filter operator (just before we start to generate the code)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, that sounds like a good idea! Could there be any other downside of not doing it in the optimizer? /cc @nongli
2345075 to
c08b7fb
Compare
c08b7fb to
28050b3
Compare
|
Test build #52338 has finished for PR 11372 at commit
|
|
Test build #52383 has finished for PR 11372 at commit
|
|
Test build #52406 has finished for PR 11372 at commit
|
|
Test build #52416 has finished for PR 11372 at commit
|
|
test this please |
|
Test build #52431 has finished for PR 11372 at commit
|
| } | ||
|
|
||
| /** | ||
| * Attempts to eliminate reading (unnecessary) NULL values if they are not required for correctness |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"in the query plan"
|
Thanks @nongli, all comments addressed. |
|
Test build #52494 has finished for PR 11372 at commit
|
|
LGTM |
…ns based on their data constraints ## What changes were proposed in this pull request? This PR adds an optimizer rule to eliminate reading (unnecessary) NULL values if they are not required for correctness by inserting `isNotNull` filters is the query plan. These filters are currently inserted beneath existing `Filter` and `Join` operators and are inferred based on their data constraints. Note: While this optimization is applicable to all types of join, it primarily benefits `Inner` and `LeftSemi` joins. ## How was this patch tested? 1. Added a new `NullFilteringSuite` that tests for `IsNotNull` filters in the query plan for joins and filters. Also, tests interaction with the `CombineFilters` optimizer rules. 2. Test generated ExpressionTrees via `OrcFilterSuite` 3. Test filter source pushdown logic via `SimpleTextHadoopFsRelationSuite` cc yhuai nongli Author: Sameer Agarwal <sameer@databricks.com> Closes apache#11372 from sameeragarwal/gen-isnotnull.
What changes were proposed in this pull request?
This PR adds an optimizer rule to eliminate reading (unnecessary) NULL values if they are not required for correctness by inserting
isNotNullfilters is the query plan. These filters are currently inserted beneath existingFilterandJoinoperators and are inferred based on their data constraints.Note: While this optimization is applicable to all types of join, it primarily benefits
InnerandLeftSemijoins.How was this patch tested?
NullFilteringSuitethat tests forIsNotNullfilters in the query plan for joins and filters. Also, tests interaction with theCombineFiltersoptimizer rules.OrcFilterSuiteSimpleTextHadoopFsRelationSuitecc @yhuai @nongli