Description
Is your feature request related to a problem or challenge?
At query time, our use case requires that we evaluate predicates against in-memory data that may have a schema that is a subset of the table schema. The predicate can reference columns that are not currently in memory or known at query time.
For example, given the following in-memory data:
| col_a | value |
|---|---|
| A | 42 |
We may have to evaluate a predicate such as col_a != A AND col_b = bananas, where col_b is not present in the in-memory schema (it is unknown at pruning time) but is a valid column for the table in the system as a whole.
Because only a limited subset of the schema is available at query time, the schema and statistics provided when constructing the PruningPredicate cover only col_a and value.
However, the col_a != A portion of the predicate can be proven FALSE irrespective of col_b, and because the two parts are joined by AND, the predicate as a whole is then FALSE. Unfortunately, constructing the PruningPredicate eagerly validates the presence of statistics for all columns in the predicate, and fails with an error stating that there are no fields named col_b before attempting to evaluate any portion of the predicate.
Describe the solution you'd like
Attempt to evaluate the predicate based on the available statistics, and return FALSE if possible. If the predicate cannot be proven FALSE, return a "missing column" error as it does today.
For the example above, pruning should ideally return FALSE, as it can be proven that col_a != A is FALSE even though col_b is unknown at pruning time.
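To make the requested semantics concrete, here is a minimal, self-contained sketch of the evaluation order described above. It does not use DataFusion's API: the Prune, MinMax, and Expr types and the evaluate and issue_example functions are hypothetical names invented for illustration, and the predicate language is reduced to =, != and AND over string min/max statistics.

```rust
use std::collections::HashMap;

/// Outcome of evaluating a predicate against container-level statistics.
/// (Hypothetical type, not part of DataFusion.)
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Prune {
    /// No row in the container can satisfy the predicate.
    ProvenFalse,
    /// Statistics exist but cannot rule the container out.
    Unknown,
    /// The expression references a column that has no statistics at all.
    MissingColumn,
}

/// Min/max statistics for one column (strings keep the sketch small).
struct MinMax {
    min: String,
    max: String,
}

/// A deliberately tiny predicate language: column-vs-literal comparisons and AND.
enum Expr {
    Eq(String, String),
    NotEq(String, String),
    And(Box<Expr>, Box<Expr>),
}

fn evaluate(expr: &Expr, stats: &HashMap<String, MinMax>) -> Prune {
    match expr {
        Expr::Eq(col, lit) => match stats.get(col) {
            None => Prune::MissingColumn,
            // Literal lies outside [min, max]: equality can never hold.
            Some(s) if lit < &s.min || lit > &s.max => Prune::ProvenFalse,
            Some(_) => Prune::Unknown,
        },
        Expr::NotEq(col, lit) => match stats.get(col) {
            None => Prune::MissingColumn,
            // Every non-null value equals the literal: inequality can never hold.
            Some(s) if &s.min == lit && &s.max == lit => Prune::ProvenFalse,
            Some(_) => Prune::Unknown,
        },
        Expr::And(left, right) => match (evaluate(left, stats), evaluate(right, stats)) {
            // One false conjunct is enough, even if the other side is missing.
            (Prune::ProvenFalse, _) | (_, Prune::ProvenFalse) => Prune::ProvenFalse,
            // Otherwise a missing column is still reported, as it is today.
            (Prune::MissingColumn, _) | (_, Prune::MissingColumn) => Prune::MissingColumn,
            _ => Prune::Unknown,
        },
    }
}

/// The example from this issue: statistics cover col_a only, col_b is unknown.
fn issue_example() -> Prune {
    let mut stats = HashMap::new();
    stats.insert(
        "col_a".to_string(),
        MinMax { min: "A".to_string(), max: "A".to_string() },
    );
    let predicate = Expr::And(
        Box::new(Expr::NotEq("col_a".to_string(), "A".to_string())),
        Box::new(Expr::Eq("col_b".to_string(), "bananas".to_string())),
    );
    evaluate(&predicate, &stats)
}
```

In this sketch a caller would translate MissingColumn into the existing "missing column" error, so behaviour only changes for predicates that can be proven FALSE from the statistics that are available.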
Describe alternatives you've considered
Inserting NULL statistics for the missing columns into the pruning schema to satisfy the presence check. This works around the issue, but requires extra processing whose only purpose is to prevent the missing field error.
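The extra processing amounts to roughly the following, shown here as a self-contained sketch rather than our actual code. ColumnStats and pad_with_null_stats are hypothetical names, a plain map stands in for the pruning schema and statistics, and None stands in for NULL statistics.

```rust
use std::collections::{BTreeSet, HashMap};

/// Hypothetical stand-in for per-column (min, max) statistics.
/// `None` means "no statistics known", i.e. NULL statistics.
type ColumnStats = Option<(String, String)>;

/// Return a copy of `known` padded with a `None` entry for every column
/// that the predicate references but the in-memory data does not carry.
fn pad_with_null_stats(
    known: &HashMap<String, ColumnStats>,
    referenced: &BTreeSet<String>,
) -> HashMap<String, ColumnStats> {
    let mut padded = known.clone();
    for col in referenced {
        // Existing statistics are kept; only absent columns get a placeholder.
        padded.entry(col.clone()).or_insert(None);
    }
    padded
}
```

The set of referenced columns has to be collected from the predicate before every pruning call, which is the overhead this request aims to avoid.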
Additional context
This change in behaviour may need to be placed behind a flag/option that users opt into, rather than being the default.