Skip to content

ParquetScan with filter_pushdown enabled filter takes too much time to process #13298

@Lordworms

Description

@Lordworms

Describe the bug

I was doing #13054 and after rebase main I find out that the current parquet predicate pushdown may have some problem, it is 7 times slower than without a predicate
image

To Reproduce

the script to generate the parquet file

import pandas as pd
import numpy as np

part = pd.DataFrame({
    'p_partkey': np.random.randint(10, 21, size=1000),
    'p_brand': np.random.choice(['Brand#1', 'Brand#2', 'Brand#3'], 1000),
    'p_container': np.random.choice(['SM BOX', 'LG BOX', 'MED BOX'], 1000)
})

lineitem = pd.DataFrame({
    'l_partkey': np.random.randint(1, 1000001, size=100000000),
    'l_quantity': np.random.uniform(1, 50, size=100000000),
    'l_extendedprice': np.random.uniform(100, 10000, size=100000000)
})

part.to_parquet('/Users/yxiang1/Personal/code/datafusion/.vscode/part.parquet', index=False)
lineitem.to_parquet('/Users/yxiang1/Personal/code/datafusion/.vscode/lineitem.parquet', index=False)

The results

> CREATE EXTERNAL TABLE lineitem (
    l_partkey BIGINT,
    l_quantity DOUBLE,
    l_extendedprice DOUBLE
)
STORED AS PARQUET
LOCATION '/Users/yxiang1/Personal/code/datafusion/.vscode/lineitem.parquet';
0 row(s) fetched. 
Elapsed 0.016 seconds.

> set datafusion.execution.target_partitions = 2;
0 row(s) fetched. 
Elapsed 0.002 seconds.

> select count(*) from linitem where l_partkey >= 10 and l_partkey <= 20;
Error during planning: table 'datafusion.public.linitem' not found
> select count(*) from lineitem where l_partkey >= 10 and l_partkey <= 20;
+----------+
| count(*) |
+----------+
| 1093     |
+----------+
1 row(s) fetched. 
Elapsed 7.037 seconds.

> select count(*) from lineitem;
+-----------+
| count(*)  |
+-----------+
| 100000000 |
+-----------+
1 row(s) fetched. 
Elapsed 1.022 seconds.

> set datafusion.execution.parquet.pushdown_filters = true;
0 row(s) fetched. 
Elapsed 0.002 seconds.

> select count(*) from lineitem;
+-----------+
| count(*)  |
+-----------+
| 100000000 |
+-----------+
1 row(s) fetched. 
Elapsed 1.029 seconds.

> select count(*) from lineitem where l_partkey >= 10 and l_partkey <= 20;
+----------+
| count(*) |
+----------+
| 1093     |
+----------+
1 row(s) fetched. 
Elapsed 6.748 seconds.

Expected behavior

No response

Additional context

No response

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't working

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions