[FEA] Parquet reader filter improvements

**Is your feature request related to a problem? Please describe.**

In cudf-polars, predicate pushdown can result in arbitrary expressions being part of the parquet read phase. Not all of these expressions make sense for discarding rows at the row group level based on statistics, however, they can still be applied in a post-filtering stage.

If I naively translate the generic expression I get from polars to a libcudf expression and use it in the parquet reader, libcudf might throw at runtime with an unsupported operation. I must therefore encode in my transliteration, exactly which ast expressions the parquet reader _does_ support in its statistics filters and only deliver the filter to the parquet reader if it is one that is understood.

For example, `column_name_reference("a") < literal(...)` is a supported expression, but `literal(...) > column_name_reference("a")` is _not_ (this one I translate to something that is supported). But if the parquet reader were extended to handle both types, I'd now be doing unnecessary work.

This is suboptimal in two ways:

1. Exactly which filter expressions are supported is now encoded in two places, and I might have got it wrong
2. If _parts_ of the whole filter expression are supported by the parquet reader, they are still applied in a post-filter stage, rather than being applied at the row group level.

**Describe the solution you'd like**

- I'd like the parquet reader to accept arbitrary expressions as filters and do the right thing. Much of the facility already exists since for correctness, the filter must always be applied as a post-filter even after row groups have been discarded.
- Bonus points if expressions where only part of the expression is supported by the statistics filters still uses statistics to discard row groups.

**Describe alternatives you've considered**

For point one, I can do the thing I'm doing right now and just bail if I hit a feature I've determined as unsupported.

For point two, I can convert to some kind of normal form and pick apart the pieces that are supported and deliver those to the parquet reader. However, I'd love not to have to write another propositional formula -> CNF converter :), and this still suffers from point 1: the final decision to discard things encodes information in two places.

**Additional context**

What I'm doing now: https://github.com/rapidsai/cudf/pull/17141

Additional feature req: support filtering row groups based on nulls, i.e. support `is_null(column_name_reference(...))` in the statistics reader.


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[FEA] Parquet reader filter improvements #17142

wence-
openedon Oct 22, 2024

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

[FEA] Parquet reader filter improvements #17142

Description

wence-openedon Oct 22, 2024

Activity

Metadata