-
Notifications
You must be signed in to change notification settings - Fork 1.8k
Closed
Labels
bugSomething isn't workingSomething isn't working
Description
Describe the bug
I was doing #13054 and after rebase main I find out that the current parquet predicate pushdown may have some problem, it is 7 times slower than without a predicate

To Reproduce
the script to generate the parquet file
import pandas as pd
import numpy as np
part = pd.DataFrame({
'p_partkey': np.random.randint(10, 21, size=1000),
'p_brand': np.random.choice(['Brand#1', 'Brand#2', 'Brand#3'], 1000),
'p_container': np.random.choice(['SM BOX', 'LG BOX', 'MED BOX'], 1000)
})
lineitem = pd.DataFrame({
'l_partkey': np.random.randint(1, 1000001, size=100000000),
'l_quantity': np.random.uniform(1, 50, size=100000000),
'l_extendedprice': np.random.uniform(100, 10000, size=100000000)
})
part.to_parquet('/Users/yxiang1/Personal/code/datafusion/.vscode/part.parquet', index=False)
lineitem.to_parquet('/Users/yxiang1/Personal/code/datafusion/.vscode/lineitem.parquet', index=False)The results
> CREATE EXTERNAL TABLE lineitem (
l_partkey BIGINT,
l_quantity DOUBLE,
l_extendedprice DOUBLE
)
STORED AS PARQUET
LOCATION '/Users/yxiang1/Personal/code/datafusion/.vscode/lineitem.parquet';
0 row(s) fetched.
Elapsed 0.016 seconds.
> set datafusion.execution.target_partitions = 2;
0 row(s) fetched.
Elapsed 0.002 seconds.
> select count(*) from linitem where l_partkey >= 10 and l_partkey <= 20;
Error during planning: table 'datafusion.public.linitem' not found
> select count(*) from lineitem where l_partkey >= 10 and l_partkey <= 20;
+----------+
| count(*) |
+----------+
| 1093 |
+----------+
1 row(s) fetched.
Elapsed 7.037 seconds.
> select count(*) from lineitem;
+-----------+
| count(*) |
+-----------+
| 100000000 |
+-----------+
1 row(s) fetched.
Elapsed 1.022 seconds.
> set datafusion.execution.parquet.pushdown_filters = true;
0 row(s) fetched.
Elapsed 0.002 seconds.
> select count(*) from lineitem;
+-----------+
| count(*) |
+-----------+
| 100000000 |
+-----------+
1 row(s) fetched.
Elapsed 1.029 seconds.
> select count(*) from lineitem where l_partkey >= 10 and l_partkey <= 20;
+----------+
| count(*) |
+----------+
| 1093 |
+----------+
1 row(s) fetched.
Elapsed 6.748 seconds.
Expected behavior
No response
Additional context
No response
Metadata
Metadata
Assignees
Labels
bugSomething isn't workingSomething isn't working