Skip to content

Bug: the new filter pushdown optimizer rule in physical layer will miss the equivalence info in filter #16563

@xudong963

Description

@xudong963

Describe the bug

# create table
COPY (SELECT * FROM values (1, 'a'), (2, 'b') t(int_col, str_col)) to 'test_files/scratch/table/1.parquet';
COPY (SELECT * FROM values ('c', 3), ('d', -1) t(str_col, int_col)) to 'test_files/scratch/table/2.parquet';
create external table t stored as parquet location 'test_files/scratch/table';
set datafusion.explain.format = indent;

# No filter pushdown in datasource:
set datafusion.execution.parquet.pushdown_filters = false; (default false, I guess that's why exising tests aren't triggered)

# No SortExec
explain select * from t where t.int_col = 2 order by int_col;
+---------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| plan_type     | plan                                                                                                                                                                                                                                                                                                                                                                                                                     |
+---------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| logical_plan  | Sort: t.int_col ASC NULLS LAST                                                                                                                                                                                                                                                                                                                                                                                           |
|               |   Filter: t.int_col = Int64(2)                                                                                                                                                                                                                                                                                                                                                                                           |
|               |     TableScan: t projection=[int_col, str_col], partial_filters=[t.int_col = Int64(2)]                                                                                                                                                                                                                                                                                                                                   |
| physical_plan | CoalescePartitionsExec                                                                                                                                                                                                                                                                                                                                                                                                   |
|               |   CoalesceBatchesExec: target_batch_size=8192                                                                                                                                                                                                                                                                                                                                                                            |
|               |     FilterExec: int_col@0 = 2                                                                                                                                                                                                                                                                                                                                                                                            |
|               |       RepartitionExec: partitioning=RoundRobinBatch(12), input_partitions=2                                                                                                                                                                                                                                                                                                                                              |
|               |         DataSourceExec: file_groups={2 groups: [[Users/xudong/opensource/datafusion/test_files/scratch/table/1.parquet], [Users/xudong/opensource/datafusion/test_files/scratch/table/2.parquet]]}, projection=[int_col, str_col], file_type=parquet, predicate=int_col@0 = 2, pruning_predicate=int_col_null_count@2 != row_count@3 AND int_col_min@0 <= 2 AND 2 <= int_col_max@1, required_guarantees=[int_col in (2)] |
|               |                                                                                                                                                                                                                                                                                                                                                                                                                          |
|               |                                                                                                                                                                                                                                                                                                                                                                                                                          |
+---------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
 
# Enable fitler push in datasource
set datafusion.execution.parquet.pushdown_filters = true;
 
# SortExec isn't removed, because the equivalence info from filter misses
explain select * from t where t.int_col = 2 order by int_col;
+---------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| plan_type     | plan                                                                                                                                                                                                                                                                                                                                                                                                                 |
+---------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| logical_plan  | Sort: t.int_col ASC NULLS LAST                                                                                                                                                                                                                                                                                                                                                                                       |
|               |   Filter: t.int_col = Int64(2)                                                                                                                                                                                                                                                                                                                                                                                       |
|               |     TableScan: t projection=[int_col, str_col], partial_filters=[t.int_col = Int64(2)]                                                                                                                                                                                                                                                                                                                               |
| physical_plan | SortPreservingMergeExec: [int_col@0 ASC NULLS LAST]                                                                                                                                                                                                                                                                                                                                                                  |
|               |   SortExec: expr=[int_col@0 ASC NULLS LAST], preserve_partitioning=[true]                                                                                                                                                                                                                                                                                                                                            |
|               |     DataSourceExec: file_groups={2 groups: [[Users/xudong/opensource/datafusion/test_files/scratch/table/1.parquet], [Users/xudong/opensource/datafusion/test_files/scratch/table/2.parquet]]}, projection=[int_col, str_col], file_type=parquet, predicate=int_col@0 = 2, pruning_predicate=int_col_null_count@2 != row_count@3 AND int_col_min@0 <= 2 AND 2 <= int_col_max@1, required_guarantees=[int_col in (2)] |
|               |                                                                                                                                                                                                                                                                                                                                                                                                                      |
|               |                                                                                                                                                                                                                                                                                                                                                                                                                      |
+---------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

To Reproduce

See above

Expected behavior

The SortExec in the case should be removed

Additional context

DF47 works.

The bug is found during upgrading DF48.

I may not have time to fix it until next next week, if anyone is interested in it, feel free to pick.

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't workinghelp wantedExtra attention is needed

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions