Skip to content

[FEA] Parquet reader filter improvements #17142

Open

Description

Is your feature request related to a problem? Please describe.

In cudf-polars, predicate pushdown can result in arbitrary expressions being part of the parquet read phase. Not all of these expressions make sense for discarding rows at the row group level based on statistics, however, they can still be applied in a post-filtering stage.

If I naively translate the generic expression I get from polars to a libcudf expression and use it in the parquet reader, libcudf might throw at runtime with an unsupported operation. I must therefore encode in my transliteration, exactly which ast expressions the parquet reader does support in its statistics filters and only deliver the filter to the parquet reader if it is one that is understood.

For example, column_name_reference("a") < literal(...) is a supported expression, but literal(...) > column_name_reference("a") is not (this one I translate to something that is supported). But if the parquet reader were extended to handle both types, I'd now be doing unnecessary work.

This is suboptimal in two ways:

  1. Exactly which filter expressions are supported is now encoded in two places, and I might have got it wrong
  2. If parts of the whole filter expression are supported by the parquet reader, they are still applied in a post-filter stage, rather than being applied at the row group level.

Describe the solution you'd like

  • I'd like the parquet reader to accept arbitrary expressions as filters and do the right thing. Much of the facility already exists since for correctness, the filter must always be applied as a post-filter even after row groups have been discarded.
  • Bonus points if expressions where only part of the expression is supported by the statistics filters still uses statistics to discard row groups.

Describe alternatives you've considered

For point one, I can do the thing I'm doing right now and just bail if I hit a feature I've determined as unsupported.

For point two, I can convert to some kind of normal form and pick apart the pieces that are supported and deliver those to the parquet reader. However, I'd love not to have to write another propositional formula -> CNF converter :), and this still suffers from point 1: the final decision to discard things encodes information in two places.

Additional context

What I'm doing now: #17141

Additional feature req: support filtering row groups based on nulls, i.e. support is_null(column_name_reference(...)) in the statistics reader.

Activity

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Metadata

Assignees

No one assigned

    Labels

    cuIOcuIO issuecudf.polarsIssues specific to cudf.polarsfeature requestNew feature or requestlibcudfAffects libcudf (C++/CUDA) code.

    Type

    No type

    Projects

    • Status

      In Progress
    • Status

      Needs owner
    • Status

      Todo

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions