Description
openedon Oct 22, 2024
Is your feature request related to a problem? Please describe.
In cudf-polars, predicate pushdown can result in arbitrary expressions being part of the parquet read phase. Not all of these expressions make sense for discarding rows at the row group level based on statistics, however, they can still be applied in a post-filtering stage.
If I naively translate the generic expression I get from polars to a libcudf expression and use it in the parquet reader, libcudf might throw at runtime with an unsupported operation. I must therefore encode in my transliteration, exactly which ast expressions the parquet reader does support in its statistics filters and only deliver the filter to the parquet reader if it is one that is understood.
For example, column_name_reference("a") < literal(...)
is a supported expression, but literal(...) > column_name_reference("a")
is not (this one I translate to something that is supported). But if the parquet reader were extended to handle both types, I'd now be doing unnecessary work.
This is suboptimal in two ways:
- Exactly which filter expressions are supported is now encoded in two places, and I might have got it wrong
- If parts of the whole filter expression are supported by the parquet reader, they are still applied in a post-filter stage, rather than being applied at the row group level.
Describe the solution you'd like
- I'd like the parquet reader to accept arbitrary expressions as filters and do the right thing. Much of the facility already exists since for correctness, the filter must always be applied as a post-filter even after row groups have been discarded.
- Bonus points if expressions where only part of the expression is supported by the statistics filters still uses statistics to discard row groups.
Describe alternatives you've considered
For point one, I can do the thing I'm doing right now and just bail if I hit a feature I've determined as unsupported.
For point two, I can convert to some kind of normal form and pick apart the pieces that are supported and deliver those to the parquet reader. However, I'd love not to have to write another propositional formula -> CNF converter :), and this still suffers from point 1: the final decision to discard things encodes information in two places.
Additional context
What I'm doing now: #17141
Additional feature req: support filtering row groups based on nulls, i.e. support is_null(column_name_reference(...))
in the statistics reader.
Activity