Support RowFilter in ParquetExec #3360
Comments
@tustvold @Ted-Jiang I've been tinkering with this and can submit an RFC today or tomorrow for feedback. If you have already thought this through, we can discuss here :)
100% we should get this integrated 👍, awesome that you're working on this. Some miscellaneous thoughts:
@thinkharderdev Wow! So looking forward! 💪
Nice write-up! Thanks 👍 I think one thing we should talk about is how to define the [...]
Maybe I will check how Impala did this 😂
This isn't necessarily the case. Even if we don't prune any pages it can still be a pretty significant performance boost to skip decoding. The general problem with selectivity is that we really don't have much to go on at the time we need to build the filters. We have parquet metadata but that isn't much :). I think the approach I'll go with for the draft PR is something like:
From there we can tweak it to include fancier heuristics (null counts, etc.)
Thank you @thinkharderdev @tustvold and @Ted-Jiang for driving this forward
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
A clear and concise description of what the problem is. Ex. I'm always frustrated when [...]
(This section helps Arrow developers understand the context and why for this feature, in addition to the what)
Describe the solution you'd like
arrow-rs has recently added the ability to do row-level filtering while decoding parquet files. This can dramatically reduce decoding and IO overhead when appropriately selective pruning predicates are pushed down to the table scan. We should support this in ParquetExec.
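To make the idea concrete, here is a minimal std-only Rust sketch of row-level filtering during decoding. It is a toy model under stated assumptions, not the arrow-rs API: Column, decode_all, decode_selected and scan_with_row_filter are hypothetical names, and "decoding" is simulated with a counter. It only illustrates that the predicate column is decoded in full while the projected column is decoded just for the rows that pass.

```rust
// Toy model only: real Parquet decoding in arrow-rs is far more involved.
// Every name here (Column, decode_all, decode_selected, scan_with_row_filter)
// is hypothetical and exists purely for illustration.

/// A "column" whose values are costly to decode; `decoded` counts the work done.
struct Column {
    encoded: Vec<i64>,
    decoded: usize,
}

impl Column {
    fn new(encoded: Vec<i64>) -> Self {
        Column { encoded, decoded: 0 }
    }

    /// Decode every row (what a scan without a row filter has to do).
    fn decode_all(&mut self) -> Vec<i64> {
        self.decoded += self.encoded.len();
        self.encoded.clone()
    }

    /// Decode only the rows whose selection flag is set.
    fn decode_selected(&mut self, selection: &[bool]) -> Vec<i64> {
        let mut out = Vec::new();
        for (value, keep) in self.encoded.iter().zip(selection) {
            if *keep {
                self.decoded += 1;
                out.push(*value);
            }
        }
        out
    }
}

/// Evaluate `predicate` on the filter column first, then decode the
/// projected column only for the rows that passed.
fn scan_with_row_filter(
    filter_col: &mut Column,
    projected_col: &mut Column,
    predicate: impl Fn(i64) -> bool,
) -> Vec<i64> {
    let selection: Vec<bool> = filter_col
        .decode_all()
        .into_iter()
        .map(|v| predicate(v))
        .collect();
    projected_col.decode_selected(&selection)
}
```

If the predicate keeps only a small fraction of rows, the projected column's decode count in this model shrinks accordingly, which is the kind of saving this issue is after.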
Describe alternatives you've considered
If a ParquetExec has a PruningPredicate, it should be "compiled" to a vector of ArrowPredicate(s) and supplied as a RowFilter to the ParquetRecordBatchStream. We can implement this a couple of different ways:
1. [...] Expr, create a PhysicalExpr and then implement a single ArrowPredicateFn which evaluates it.
2. [...] Expr and compile each to a separate ArrowPredicateFn that will be applied sequentially. We can either take the ordering as given or apply some heuristic to determine the ordering.
Some considerations:
1. [...] TableProvider they can control what gets pushed down. Is that enough configurability?
Additional context
Add any other context or screenshots about the feature request here.
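As a rough illustration of the second option above (one predicate per conjunct, applied sequentially), here is a std-only Rust sketch. It is a toy model and an assumption about how such a scheme could look: Expr, Batch, split_conjunction, eval and apply_sequentially are hypothetical stand-ins, not the DataFusion Expr or the arrow-rs ArrowPredicateFn. Each predicate only evaluates rows that survived the previous ones, and the function reports how many row evaluations were needed.

```rust
// Toy model only: `Expr`, `Batch`, `split_conjunction`, `eval` and
// `apply_sequentially` are hypothetical stand-ins, not DataFusion or arrow-rs types.

/// A tiny expression tree: column-vs-literal comparisons joined by AND.
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Gt { col: usize, value: i64 },
    Lt { col: usize, value: i64 },
    And(Box<Expr>, Box<Expr>),
}

/// Rows stored column-wise: `columns[c][r]` is row `r` of column `c`.
struct Batch {
    columns: Vec<Vec<i64>>,
}

/// Flatten nested ANDs into a list of independent conjuncts.
fn split_conjunction(expr: &Expr, out: &mut Vec<Expr>) {
    match expr {
        Expr::And(left, right) => {
            split_conjunction(left, out);
            split_conjunction(right, out);
        }
        leaf => out.push(leaf.clone()),
    }
}

/// Evaluate an expression for a single row.
fn eval(expr: &Expr, batch: &Batch, row: usize) -> bool {
    match expr {
        Expr::Gt { col, value } => batch.columns[*col][row] > *value,
        Expr::Lt { col, value } => batch.columns[*col][row] < *value,
        Expr::And(l, r) => eval(l, batch, row) && eval(r, batch, row),
    }
}

/// Apply the predicates one after another. Each predicate only looks at rows
/// that survived the previous ones. Returns the surviving row indices and
/// the total number of row evaluations performed.
fn apply_sequentially(predicates: &[Expr], batch: &Batch) -> (Vec<usize>, usize) {
    let num_rows = batch.columns.first().map_or(0, |c| c.len());
    let mut selected: Vec<usize> = (0..num_rows).collect();
    let mut evaluations = 0;
    for predicate in predicates {
        evaluations += selected.len();
        selected.retain(|&row| eval(predicate, batch, row));
    }
    (selected, evaluations)
}
```

How much work the sequential variant saves depends on ordering: putting the most selective conjunct first shrinks the selection sooner, so later predicates evaluate fewer rows. That is where an ordering heuristic would come in.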