Add conversion from cudf-polars expressions to libcudf ast for parquet filters #17141
Conversation
We will use this for inequality joins and filter pushdown in the parquet reader. The handling is a bit complicated, since the subset of expressions that the parquet filter accepts is smaller than the full set of expressions we can convert. Because much of the logic is shared, however, we dispatch on a transformer state variable to determine which case we're handling.
We attempt to turn the predicate into a filter expression that the parquet reader understands. If successful then we don't have to apply the predicate as a post-filter. We can only do this when a row index is not requested.
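The decision described above can be sketched as follows. This is a minimal illustration with hypothetical names, not the actual cudf-polars code: only attempt pushdown when no row index is requested, and keep the predicate as a post-filter when lowering fails.

```python
def plan_read(predicate, with_row_index, lower):
    """Return (filter_for_reader, post_filter_predicate).

    `lower` is a hypothetical callable that converts the predicate to a
    reader filter expression, returning None when it cannot.
    """
    if predicate is None or with_row_index:
        # A row index refers to pre-filter row positions, so the predicate
        # must be applied after the read (still on the GPU).
        return None, predicate
    filter_expr = lower(predicate)
    if filter_expr is None:
        return None, predicate  # lowering failed: apply as post-filter
    return filter_expr, None    # pushed down: no post-filter needed
```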
I'm not as familiar with expressions in libcudf or `cudf::compute_column`, but here are a couple of smaller things I noticed.
    if isinstance(haystack, expr.LiteralColumn) and len(haystack.value) < 16:
        # 16 is an arbitrary limit
I'm confused, what is the purpose of this limit?
I have to make one scalar for every value and upload it to the device, so I just picked a value as a cutoff.
Is the idea here that you think once we need to create more than a certain number of scalars the cost of allocation will be high enough that we will underperform the CPU? The end result here is that we raise and fall back when there are more than 16 scalars, right?
It means that (for example) we will do the parquet filter as a post-filter (still on the GPU) rather than during the read.
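The trade-off in this exchange can be illustrated with a hedged sketch (hypothetical names, not cudf's actual implementation) of lowering a membership test against a literal column into a chain of equality comparisons, with the arbitrary 16-value cutoff:

```python
def lower_isin(needle, values, limit=16):
    """Lower `needle IN values` to (needle == v0) | (needle == v1) | ...

    Each literal becomes one device scalar that must be uploaded, so an
    arbitrary cutoff bounds the number of uploads; past it we raise and
    the caller applies the predicate as a post-filter (still on the GPU).
    """
    if not values or len(values) >= limit:
        raise NotImplementedError("too many literals; fall back to post-filter")
    terms = [("eq", needle, ("scalar", v)) for v in values]
    out = terms[0]
    for term in terms[1:]:
        out = ("or", out, term)
    return out
```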
Description
Previously, we always applied parquet filters by post-filtering. This negates much of the potential gain from having filters available at read time, namely discarding row groups. To fix this, we use the new visitor system of #17016 to implement conversion to pylibcudf expressions.
We must distinguish two types of expressions: those we can evaluate via `cudf::compute_column`, and the more restricted set of expressions that the parquet reader understands. This is handled by a state variable that tracks the usage. The former style will be useful when we implement inequality joins. While here, extend the support in pylibcudf expressions to handle all supported literal types, and expose `compute_column` so we can test the correctness of the broader (non-parquet) implementation.