fix: filter pushdown for nested fields #5406
base: develop
Conversation
Force-pushed from 09fcef5 to 7046d3a
Codecov Report: ❌ Patch coverage is … ☔ View the full report in Codecov by Sentry.
I'm actually fairly confident that we just need to validate that the ScalarFunction being pushed down to us is `get_field`.
The case I found while reading the Parquet logic for this is that there might be a constant synthetic column created, and you might have a filter on it where you could get to see the column. There might be test cases for this in DataFusion's Parquet code.
The reason I added this check was that I was seeing `get_field` fail at execution time unexpectedly: the table schema (the merged schema over all files) does have the field, so the DF planning works fine (as it should), but a specific file does not. I do think we need to check field existence.
That makes sense, I wasn't thinking about schema evolution. The Source reports what can be pushed down, and it has access to the table schema but doesn't know the individual file schemas. So I think we should be adapting the predicate in the FileOpener instead of the FileSource. I've also noticed that some of these APIs have changed in DF 51, so I can double-check that today.
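To make the schema-evolution concern concrete, here's a minimal sketch of the per-file check being discussed (the helper name `nested_field_exists` is made up; the `arrow_schema` types are real):

```rust
use arrow_schema::{DataType, Schema};

/// Sketch: does a nested path (e.g. ["payload", "ref"]) exist in this
/// particular file's schema? The merged table schema may have it even
/// when an individual file does not.
fn nested_field_exists(schema: &Schema, path: &[&str]) -> bool {
    let Some((root, rest)) = path.split_first() else {
        return false;
    };
    let Ok(field) = schema.field_with_name(root) else {
        return false;
    };
    // Walk struct children for the remainder of the path.
    let mut dt = field.data_type();
    for name in rest {
        let DataType::Struct(fields) = dt else {
            return false;
        };
        match fields.iter().find(|f| f.name().as_str() == *name) {
            Some(child) => dt = child.data_type(),
            None => return false,
        }
    }
    true
}
```

The point is that this has to run per file, against the file's physical schema, not once against the merged table schema.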
Ahh, that makes sense. We already do pushdown checks in the opener against the file schema, so …
vortex/vortex-datafusion/src/persistent/opener.rs Lines 297 to 308 in 879a53b
I think we should error here instead of dropping the predicate silently after we had told DF that we're going to handle it. By my reading of DF, when we report … I think once we add some protection there this should be gtg.
Force-pushed from eae7eed to 0426d3d
```rust
// SPDX-License-Identifier: Apache-2.0
// SPDX-FileCopyrightText: Copyright the Vortex contributors

use std::ops::Range;
```
this just got moved out of opener.rs since it was getting long, and having these tests there was distracting
```rust
| Timestamp(_, _)
| Time32(_)
| Time64(_)
| Struct(_)
```
I think we should probably expand this list further? I just added Struct to make one of the tests pass
I think we are missing List/ListView/FixedSizeList and the interval types; Map and Union as well.
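A hedged sketch of the expanded match, reusing the `supported_data_types` name from this PR (variant names are from arrow's `DataType`; which of these Vortex can actually evaluate is a project decision, not established by this thread):

```rust
use arrow_schema::DataType;
use arrow_schema::DataType::*;

/// Sketch only: enumerates the variants mentioned in the review thread,
/// on top of the ones already in the PR.
fn supported_data_types(dt: &DataType) -> bool {
    matches!(
        dt,
        Timestamp(_, _)
            | Time32(_)
            | Time64(_)
            | Struct(_)
            | List(_)
            | LargeList(_)
            | ListView(_)
            | LargeListView(_)
            | FixedSizeList(_, _)
            | Interval(_)
            | Map(_, _)
            | Union(_, _)
    )
}
```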
CodSpeed Performance Report: merging #5406 will improve performance by 17.18%.
```rust
if !can_be_pushed_down(expr, &predicate_file_schema) {
    internal_datafusion_err!(
        "DataFusion predicate {expr} cannot be pushed down to Vortex file {} with schema {predicate_file_schema}",
        file_meta.object_meta.location
    );
}
```
here be dragons when we have filters being applied to a column that doesn't exist in every file in the source.
if we have filters that touch columns which are not in the file's physical schema, we can't just skip them, because the default value returned by the schema adapter might actually have failed the filter (e.g. for `x = 5` on a file missing `x`, the adapter fills in null, which should fail the filter; skipping the filter while reporting the pushdown as exact would wrongly keep those rows)
I'm not sure about this behavior though: doesn't this break cases where we would like to push down an expression as much as possible, even if it operates on a column that doesn't exist in certain files? Wouldn't an error here fail the whole query?
Yes, you're right, the correct thing to do is to ignore any predicates over missing columns in the opener (the develop behavior) and just rely on DF to post-filter for us.
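A minimal sketch of that resolution, assuming the predicate is a conjunction and the pushdown was reported as inexact so DF still post-filters (`split_conjunction` and `collect_columns` are real `datafusion_physical_expr::utils` helpers; the function name and wiring are illustrative):

```rust
use std::sync::Arc;

use arrow_schema::Schema;
use datafusion_expr::Operator;
use datafusion_physical_expr::expressions::BinaryExpr;
use datafusion_physical_expr::utils::{collect_columns, split_conjunction};
use datafusion_physical_expr::PhysicalExpr;

/// Keep only the conjuncts whose columns all exist in this file's schema.
/// Since DataFusion re-applies the full predicate above the scan, dropping
/// a conjunct here costs performance, not correctness.
fn prune_predicate_for_file(
    predicate: &Arc<dyn PhysicalExpr>,
    file_schema: &Schema,
) -> Option<Arc<dyn PhysicalExpr>> {
    split_conjunction(predicate)
        .into_iter()
        .filter(|conjunct| {
            collect_columns(conjunct)
                .iter()
                .all(|col| file_schema.field_with_name(col.name()).is_ok())
        })
        .cloned()
        // AND the survivors back together; None means there is nothing
        // left to push down for this file.
        .reduce(|l, r| {
            let and: Arc<dyn PhysicalExpr> = Arc::new(BinaryExpr::new(l, Operator::And, r));
            and
        })
}
```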
```rust
        return false;
    }

    let _expr_str = format!("{:?}", df_expr);
```
I assume this is debugging? `dbg!` is useful in these cases
This was leftover from debugging, but the rationale was that I wanted a string variable when I stepped through this in the debugger.
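For reference, `dbg!` prints the file, line, and `Debug` form to stderr and then returns the value, so it can wrap an expression in place (toy example; `build_expr` is hypothetical):

```rust
fn build_expr() -> String {
    "payload.ref = 'refs/head/main'".to_string()
}

fn main() {
    // Prints `[src/main.rs:7:16] build_expr() = "..."` and still binds
    // the value, so no restructuring of the code is needed.
    let expr = dbg!(build_expr());
    assert!(!expr.is_empty());
}
```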
I had a more general message on Discord. But related to this PR is the opener erroring out on the predicate: I think it should not error even if the expression is on a missing column.
Deploying vortex-bench with Cloudflare Pages

Latest commit: 0959703
Status: ✅ Deploy successful!
Preview URL: https://3e31eebe.vortex-93b.pages.dev
Branch Preview URL: https://aduffy-filter-pushdown-fix.vortex-93b.pages.dev
In #5295, we accidentally broke nested filter pushdown. The issue is that `FileSource::try_pushdown_filters` seems like it's meant to evaluate using the whole file schema, rather than any projected schema. As an example, in the GitHub Archive benchmark dataset, we have the following query, which should trivially push down and be pruned, executing in about 30ms:

```sql
SELECT COUNT(*) FROM events WHERE payload.ref = 'refs/head/main'
```

However, after this change, pushdown of this field was failing, pushing query time up 100x. The root cause is that the old logic attempted to apply the file schema to the `source_expr` directly. Concretely, for the gharchive query, the whole expression is something like:

```text
BinaryExpr {
    lhs: GetField {
        source_expr: Column { name: "payload", index: 0 },
        field_expr: Literal { value: "ref" },
    },
    rhs: Literal { value: "refs/head/main" },
    operator: Eq,
}
```

The issue is that the column index 0 is wrong relative to the whole file schema. Instead, we need to recursively ensure that the `source_expr` is a valid sequence of Column and GetField expressions that resolve properly. Note how we already were doing this for checking if a standalone Column expression can be pushed down:

```rust
} else if let Some(col) = expr.downcast_ref::<df_expr::Column>() {
    schema
        .field_with_name(col.name())
        .ok()
        .is_some_and(|field| supported_data_types(field.data_type()))
```

Signed-off-by: Andrew Duffy <andrew@a10y.dev>
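A hedged sketch of that recursive walk (illustrative, not the PR's exact code: `resolve_field_chain` is a made-up name, and it assumes DataFusion's physical `get_field` surfaces as a `ScalarFunctionExpr` whose UDF is named "get_field" with a string-literal key):

```rust
use std::sync::Arc;

use arrow_schema::{DataType, Schema};
use datafusion_common::ScalarValue;
use datafusion_physical_expr::expressions::{Column, Literal};
use datafusion_physical_expr::{PhysicalExpr, ScalarFunctionExpr};

/// Returns the resolved data type if `expr` is a Column / get_field chain
/// that exists in `schema`, or None if it can't be pushed down.
fn resolve_field_chain(expr: &Arc<dyn PhysicalExpr>, schema: &Schema) -> Option<DataType> {
    // Base case: resolve a column by *name*; the index baked into the
    // expression may be wrong for this particular file.
    if let Some(col) = expr.as_any().downcast_ref::<Column>() {
        return schema
            .field_with_name(col.name())
            .ok()
            .map(|f| f.data_type().clone());
    }
    // Recursive case: a get_field(source, "name") call.
    let func = expr.as_any().downcast_ref::<ScalarFunctionExpr>()?;
    if func.fun().name() != "get_field" {
        return None;
    }
    let [source, key] = func.args() else {
        return None;
    };
    let lit = key.as_any().downcast_ref::<Literal>()?;
    let ScalarValue::Utf8(Some(name)) = lit.value() else {
        return None;
    };
    // Resolve the source first, then look up the child field in it.
    match resolve_field_chain(source, schema)? {
        DataType::Struct(fields) => fields
            .iter()
            .find(|f| f.name() == name)
            .map(|f| f.data_type().clone()),
        _ => None,
    }
}
```

Under these assumptions, the `BinaryExpr` above is pushable iff `resolve_field_chain` succeeds on its `GetField` side and the resolved type passes `supported_data_types`.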
…tex (#5521) requires some non-auto changes to `bench-vortex`. Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
Force-pushed from 0959703 to 5ddd409
Here's the GHArchive query. Post-filtering the string match in DF is actually a trivial amount of the overall runtime (0.3%).

The bigger problem is that when we tell DF that we can't push the filter, it prompts us to return a projection of …
Force-pushed from 8a0c97f to 373a365



