Skip to content

Evaluate filter pushdown against the physical schema for performance and correctness #15780

@adriangb

Description

@adriangb

Describe the bug

Consider the following test:

COPY  (
  SELECT arrow_cast(a, 'Int16') AS a
  FROM ( VALUES (1), (2), (3) ) AS t(a)
)  TO 'test_files/scratch/parquet_filter_pushdown/parquet_table/1.parquet'
STORED AS PARQUET;

set datafusion.execution.parquet.pushdown_filters = true;

CREATE EXTERNAL TABLE t_pushdown(a int) STORED AS PARQUET
LOCATION 'test_files/scratch/parquet_filter_pushdown/parquet_table/';

select * from t_pushdown where a = arrow_cast(2, 'Int8');

At some point DataFusion optimizes the Int8 filter by casting the filter to Int32 (matching the table schema, thus avoiding having to cast the column).

So when the filter gets into ParquetSource it's an Int32 filter. But when we read the file schema it's actually an Int8! Since we now build pruning predicates, etc. on a per-file basis using the physical file schema this can introduce casting of the data from Int8 to Int32 which is unnecessary because (1) we could cast the filter instead which would be much cheaper and (2) if the file type and filter type were both Int8 or Int16 in this example (as might happen if one changes the table schema but not old data or old queries) we would actually be closer to the original intent of the query.

To be clear, I do not mean that this is a new regression. I believe this has always been the case but now we can actually fix it and before we could not.

This applies not only to stats filtering (where the impact is likely negligible) but also to predicate pushdown where I expect the impact may be much larger especially for cases where we never end up materializing the columns (and thus don't have to cast them to the table's data type at all). I don't know that any benchmark measures this case at the moment though.

To resolve this I think we just need to call optimize_casts(physical_expr, physical_file_schema) (a made up function) but I don't know where or howoptimize_casts exists (I feel like it must already exist, maybe it's at the logical expr level?). Does anyone know where this exists?

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't workinghelp wantedExtra attention is needed

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions