Add conversion from cudf-polars expressions to libcudf ast for parquet filters #17141
Conversation
We will use this for inequality joins and filter pushdown in the parquet reader. The handling is a bit complicated, since the subset of expressions that the parquet filter accepts is smaller than the full set of expressions we can convert. Because much of the logic is shared, however, we dispatch on a transformer state variable to determine which case we're handling.
We attempt to turn the predicate into a filter expression that the parquet reader understands. If successful then we don't have to apply the predicate as a post-filter. We can only do this when a row index is not requested.
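The decision described above can be sketched as follows. This is a minimal illustration with hypothetical names, not the actual cudf-polars code: only attempt pushdown when no row index is requested, and keep the predicate as a post-filter when lowering fails.

```python
def plan_read(predicate, with_row_index, lower):
    """Return (filter_for_reader, post_filter_predicate).

    `lower` is a hypothetical callable that converts the predicate to a
    reader filter expression, returning None when it cannot.
    """
    if predicate is None or with_row_index:
        # A row index refers to pre-filter row positions, so the predicate
        # must be applied after the read (still on the GPU).
        return None, predicate
    filter_expr = lower(predicate)
    if filter_expr is None:
        return None, predicate  # lowering failed: apply as post-filter
    return filter_expr, None    # pushed down: no post-filter needed
```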
I'm not as familiar with expressions in libcudf or `cudf::compute_column`, but here are a couple of smaller things I noticed.
    if isinstance(haystack, expr.LiteralColumn) and len(haystack.value) < 16:
        # 16 is an arbitrary limit
I'm confused, what is the purpose of this limit?
I have to make one scalar for every value and upload it to the device, so I just picked a value as a cutoff.
Is the idea here that you think once we need to create more than a certain number of scalars the cost of allocation will be high enough that we will underperform the CPU? The end result here is that we raise and fall back when there are more than 16 scalars, right?
It means that (for example) we will do the parquet filter as a post-filter (still on the GPU) rather than during the read.
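The trade-off in this exchange can be illustrated with a hedged sketch (hypothetical names, not cudf's actual implementation) of lowering a membership test against a literal column into a chain of equality comparisons, with the arbitrary 16-value cutoff:

```python
def lower_isin(needle, values, limit=16):
    """Lower `needle IN values` to (needle == v0) | (needle == v1) | ...

    Each literal becomes one device scalar that must be uploaded, so an
    arbitrary cutoff bounds the number of uploads; past it we raise and
    the caller applies the predicate as a post-filter (still on the GPU).
    """
    if not values or len(values) >= limit:
        raise NotImplementedError("too many literals; fall back to post-filter")
    terms = [("eq", needle, ("scalar", v)) for v in values]
    out = terms[0]
    for term in terms[1:]:
        out = ("or", out, term)
    return out
```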
Description
Previously, we always applied parquet filters by post-filtering. This negates much of the potential gain from having filters available at read time, namely discarding row groups. To fix this, we use the new visitor system of #17016 to implement conversion to pylibcudf expressions.
We must distinguish two types of expressions: those we can evaluate via `cudf::compute_column`, and the more restricted set of expressions that the parquet reader understands. This is handled by a state variable that tracks the usage. The former style will be useful when we implement inequality joins. While here, extend the support in pylibcudf expressions to handle all supported literal types, and expose `compute_column` so we can test the correctness of the broader (non-parquet) implementation.