[SPARK-12639][SQL] Improve Explain for Datasources with Handled Predicates #10655
Conversation
SPARK-11661 makes all predicates be pushed down to the underlying Datasources regardless of whether the source can handle them or not. This makes the explain command slightly confusing, as it will always list all filters whether or not the underlying source can actually use them. Instead, we should only list those filters which are expressly handled by the underlying source.
```diff
-    if (pushedFilters.nonEmpty) {
-      pairs += (PUSHED_FILTERS -> pushedFilters.mkString("[", ", ", "]"))
+    if (handledPredicates.nonEmpty) {
+      pairs += (HANDLED_FILTERS -> handledPredicates.mkString("[", ", ", "]"))
```
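The pushed-vs-handled split can be modeled with a small, self-contained sketch. The names below (`ToyRelation`, `Demo`, the `Filter` case classes) are illustrative rather than Spark's actual classes, but the shape mirrors `BaseRelation.unhandledFilters`: the source reports which of the pushed filters it will *not* evaluate itself, and the handled set is the difference.

```scala
// Minimal model of the pushed-vs-handled filter split.
// Illustrative names only; not Spark's real API surface.
sealed trait Filter
case class EqualTo(attribute: String, value: Any) extends Filter
case class GreaterThan(attribute: String, value: Any) extends Filter

object ToyRelation {
  // Hypothetical source that can only evaluate equality on plate_number.
  // Mirrors BaseRelation.unhandledFilters: return the filters the source
  // will NOT evaluate itself (Spark's default returns all of them).
  def unhandledFilters(filters: Seq[Filter]): Seq[Filter] =
    filters.filterNot {
      case EqualTo("plate_number", _) => true
      case _                          => false
    }
}

object Demo extends App {
  val pushed = Seq(
    EqualTo("plate_number", "X-123"),
    GreaterThan("mileage", 1000))

  // handledPredicates = pushed filters minus the unhandled ones
  val handled = pushed.diff(ToyRelation.unhandledFilters(pushed))

  println(handled.mkString("[", ", ", "]"))
  // prints: [EqualTo(plate_number,X-123)]
}
```

Under this model, only the handled set would be printed in explain output, which is exactly the change the diff above makes.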
Should we also keep the pushed filters? For some data sources, like ORC, a pushed filter will be evaluated at a coarse-grained level instead of on every row. I think it is better to keep that information.
I thought 11663 meant all filters are pushed down regardless, so I wondered if that was redundant? It's also still a bit confusing, since although it says the filters are "pushed" there is no guarantee that the underlying source will do anything with them at all.
ah sorry. I think I understand the change now. `handledPredicates` contains all filters that are pushed to the data source except those filters returned by the `unhandledFilters` method. I think the change is good, and `HandledFilters` is a proper name.
Can you include a before/after comparison in the pull request description?
@rxin Added. Basically, I think the current "PushedFilters" list isn't very valuable if everything is listed there. So instead we should just list those filters which the source can actually do something with. If there is a possibility that a source might do something (bloom filtering, say) we should have a third category, but currently there is no way of knowing (I think).
Thanks @RussellSpitzer. I will let @yhuai review and merge this. One question: do you know why the filter is "if (isnull(acc#2)) null else CASE 1000 WHEN 1 THEN acc#2 WHEN 0 THEN NOT acc#2 ELSE false"? It seems so complicated for "acc = 1000".
OK, I think I figured out why: "acc" is a boolean column.
@RussellSpitzer Thank you for the PR! The change looks good. Can you also try ORC and Parquet tables and attach the before/after change to the PR description?
ok to test |
Test build #49155 has finished for PR 10655 at commit
@yhuai I removed the PushedFilters and added the other examples. We could re-add the "PushedFilters" if you like; I wasn't sure if you still wanted that. I'm still not sure it's very valuable info, since everything is pushed down anyway.
Test build #49502 has finished for PR 10655 at commit
@RussellSpitzer Thanks for the change. I have thought about it again. My only concern is that for ORC/Parquet we will no longer see the pushed filters in the explain output. So, how about we still show PushedFilters?
I personally think the ambiguous "PushedFilters" list is the confusing part. In my mind there are 3 categories of predicates:
1. Predicates which the underlying source handles completely
2. Predicates which the source may partially handle, e.g. at a coarse-grained level like ORC
3. Predicates which the source cannot handle at all
Currently we can only tell whether a predicate is in one of the first two categories or in the third. This leaves us awkwardly stating that the source has had a predicate "pushed" to it even when the source may ignore it entirely. So to me it is more confusing to say something is "pushed" than to only list what is actually handled. If you want to just go with asterisks that's fine with me too, just wanted to make my argument :D
@RussellSpitzer Thank you for the comment. I totally agree with you. As mentioned before, my only concern is that for ORC/Parquet, we will not be able to see pushed filters in the explain output after the current change in this PR. As a user of Parquet/ORC, I do want to see that a filter has been pushed down even if this filter will not be applied to every row. How about we go with asterisks for now? You may need to keep the expression that maps to a data source filter (https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala#L536) to generate the string.
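The asterisk idea could look roughly like this sketch. `formatPushedFilters` is a hypothetical helper, not Spark's implementation: it lists every pushed filter and prefixes the ones the source fully handles with `*`.

```scala
object ExplainFormat {
  // Hypothetical helper (not Spark's actual code): render all pushed
  // filters, marking the ones the source fully handles with a "*".
  def formatPushedFilters(pushed: Seq[String], handled: Set[String]): String =
    pushed
      .map(f => if (handled.contains(f)) s"*$f" else f)
      .mkString("[", ", ", "]")
}

object FormatDemo extends App {
  // Only the plate_number equality is fully handled by the source.
  val rendered = ExplainFormat.formatPushedFilters(
    Seq("EqualTo(plate_number,X-123)", "GreaterThan(mileage,1000)"),
    Set("EqualTo(plate_number,X-123)"))

  println(rendered)
  // prints: [*EqualTo(plate_number,X-123), GreaterThan(mileage,1000)]
}
```

This keeps the full pushed list visible (addressing the ORC/Parquet concern) while still distinguishing the handled subset.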
Haven't forgotten this; it will have a new PR soon :)
no problem! Thank you :) |
We might have to correct the title, just like the others.
ping @RussellSpitzer |
Sorry, I forgot about this. I'll clean this up tomorrow and get it ready.
test this please |
Test build #65267 has finished for PR 10655 at commit
We fixed this in a different PR: #11317
All predicates are pushed down, so there really isn't any value in listing them all. In the following example, `plate_number` is a predicate which can be pushed to the underlying source and handled at the source level; `mileage` and `acc` cannot be handled in any way by the underlying source.

Previously:

With this patch:
TLDR; before/after examples:
- CassandraSource
- Parquet
- Orc