WIP: Test DataFusion with experimental Parquet Filter Pushdown #16222

Draft
wants to merge 4 commits into
base: main
from

Conversation

alamb
Contributor

@alamb alamb commented Jun 1, 2025

This PR is for testing DataFusion with the code in the following PR

This is the second of 2 experiments:

  1. Does ClickBench performance improve with pushdown_filters enabled?

The first experiment is in

@alamb alamb changed the title WIP: Test DataFusion with experimental IncrementalRecordBatchBuilder WIP: Test DataFusion with experimental Parquet Filter Pushdown Jun 1, 2025
@alamb
Contributor Author

alamb commented Jun 1, 2025

🤖 ./gh_compare_branch.sh Benchmark Script Running
Linux aal-dev 6.11.0-1013-gcp #13~24.04.1-Ubuntu SMP Wed Apr 2 16:34:16 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing alamb/test_actual_pushdown (c646027) to 7002a00 diff
Benchmarks: tpch_mem clickbench_partitioned clickbench_extended
Results will be posted here when complete

@alamb
Contributor Author

alamb commented Jun 1, 2025

🤖: Benchmark completed

Details

Comparing HEAD and alamb_test_actual_pushdown
--------------------
Benchmark clickbench_extended.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┓
┃ Query        ┃       HEAD ┃ alamb_test_actual_pushdown ┃         Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━┩
│ QQuery 0     │  1911.99ms │                  1914.70ms │      no change │
│ QQuery 1     │   699.11ms │                   719.00ms │      no change │
│ QQuery 2     │  1398.50ms │                  1478.56ms │   1.06x slower │
│ QQuery 3     │   689.24ms │                   712.50ms │      no change │
│ QQuery 4     │  1452.92ms │                  1595.74ms │   1.10x slower │
│ QQuery 5     │ 15657.55ms │                 15569.30ms │      no change │
│ QQuery 6     │  2032.14ms │                   141.07ms │ +14.41x faster │
│ QQuery 7     │  2123.18ms │                  2226.42ms │      no change │
│ QQuery 8     │   837.16ms │                   851.36ms │      no change │
└──────────────┴────────────┴────────────────────────────┴────────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary                         ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (HEAD)                         │ 26801.79ms │
│ Total Time (alamb_test_actual_pushdown)   │ 25208.64ms │
│ Average Time (HEAD)                       │  2977.98ms │
│ Average Time (alamb_test_actual_pushdown) │  2800.96ms │
│ Queries Faster                            │          1 │
│ Queries Slower                            │          2 │
│ Queries with No Change                    │          6 │
└───────────────────────────────────────────┴────────────┘
--------------------
Benchmark clickbench_partitioned.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃       HEAD ┃ alamb_test_actual_pushdown ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 0     │    16.19ms │                    16.24ms │     no change │
│ QQuery 1     │    33.68ms │                    35.46ms │  1.05x slower │
│ QQuery 2     │    82.89ms │                    82.60ms │     no change │
│ QQuery 3     │    97.24ms │                    94.45ms │     no change │
│ QQuery 4     │   607.01ms │                   675.90ms │  1.11x slower │
│ QQuery 5     │   858.70ms │                   919.45ms │  1.07x slower │
│ QQuery 6     │    22.87ms │                    24.63ms │  1.08x slower │
│ QQuery 7     │    38.68ms │                    43.00ms │  1.11x slower │
│ QQuery 8     │   945.68ms │                   939.59ms │     no change │
│ QQuery 9     │  1243.06ms │                  1246.08ms │     no change │
│ QQuery 10    │   268.45ms │                   280.56ms │     no change │
│ QQuery 11    │   299.09ms │                   319.60ms │  1.07x slower │
│ QQuery 12    │   919.23ms │                  1014.32ms │  1.10x slower │
│ QQuery 13    │  1337.56ms │                  1523.46ms │  1.14x slower │
│ QQuery 14    │   856.70ms │                  1029.41ms │  1.20x slower │
│ QQuery 15    │   847.58ms │                   839.69ms │     no change │
│ QQuery 16    │  1730.92ms │                  1729.51ms │     no change │
│ QQuery 17    │  1607.73ms │                  1593.69ms │     no change │
│ QQuery 18    │  3099.12ms │                  3197.56ms │     no change │
│ QQuery 19    │    84.99ms │                    91.43ms │  1.08x slower │
│ QQuery 20    │  1121.22ms │                  1197.78ms │  1.07x slower │
│ QQuery 21    │  1345.40ms │                  1361.52ms │     no change │
│ QQuery 22    │  2199.07ms │                  2431.34ms │  1.11x slower │
│ QQuery 23    │  8112.04ms │                  3547.80ms │ +2.29x faster │
│ QQuery 24    │   475.29ms │                   631.80ms │  1.33x slower │
│ QQuery 25    │   393.16ms │                   432.03ms │  1.10x slower │
│ QQuery 26    │   537.95ms │                   692.90ms │  1.29x slower │
│ QQuery 27    │  1574.43ms │                  1809.52ms │  1.15x slower │
│ QQuery 28    │ 12663.96ms │                 12974.85ms │     no change │
│ QQuery 29    │   530.66ms │                   524.38ms │     no change │
│ QQuery 30    │   811.90ms │                  1288.32ms │  1.59x slower │
│ QQuery 31    │   875.23ms │                  1301.29ms │  1.49x slower │
│ QQuery 32    │  2699.84ms │                  2755.06ms │     no change │
│ QQuery 33    │  3352.11ms │                  3436.53ms │     no change │
│ QQuery 34    │  3410.67ms │                  3433.96ms │     no change │
│ QQuery 35    │  1338.26ms │                  1331.31ms │     no change │
│ QQuery 36    │   126.01ms │                    28.46ms │ +4.43x faster │
│ QQuery 37    │    56.31ms │                    29.07ms │ +1.94x faster │
│ QQuery 38    │   124.21ms │                    28.50ms │ +4.36x faster │
│ QQuery 39    │   198.12ms │                    28.56ms │ +6.94x faster │
│ QQuery 40    │    49.66ms │                    27.47ms │ +1.81x faster │
│ QQuery 41    │    44.20ms │                    26.20ms │ +1.69x faster │
│ QQuery 42    │    39.80ms │                    26.77ms │ +1.49x faster │
└──────────────┴────────────┴────────────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary                         ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (HEAD)                         │ 57076.85ms │
│ Total Time (alamb_test_actual_pushdown)   │ 55042.05ms │
│ Average Time (HEAD)                       │  1327.37ms │
│ Average Time (alamb_test_actual_pushdown) │  1280.05ms │
│ Queries Faster                            │          8 │
│ Queries Slower                            │         18 │
│ Queries with No Change                    │         17 │
└───────────────────────────────────────────┴────────────┘
--------------------
Benchmark tpch_mem_sf1.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┓
┃ Query        ┃     HEAD ┃ alamb_test_actual_pushdown ┃       Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━┩
│ QQuery 1     │ 118.82ms │                   116.19ms │    no change │
│ QQuery 2     │  22.15ms │                    22.21ms │    no change │
│ QQuery 3     │  35.15ms │                    36.53ms │    no change │
│ QQuery 4     │  20.36ms │                    21.03ms │    no change │
│ QQuery 5     │  53.46ms │                    54.15ms │    no change │
│ QQuery 6     │  12.10ms │                    13.53ms │ 1.12x slower │
│ QQuery 7     │  98.73ms │                    97.33ms │    no change │
│ QQuery 8     │  26.26ms │                    27.08ms │    no change │
│ QQuery 9     │  56.46ms │                    59.54ms │ 1.05x slower │
│ QQuery 10    │  58.11ms │                    58.47ms │    no change │
│ QQuery 11    │  11.75ms │                    11.81ms │    no change │
│ QQuery 12    │  42.19ms │                    45.28ms │ 1.07x slower │
│ QQuery 13    │  27.50ms │                    27.43ms │    no change │
│ QQuery 14    │   9.99ms │                    10.24ms │    no change │
│ QQuery 15    │  22.87ms │                    23.76ms │    no change │
│ QQuery 16    │  21.18ms │                    21.12ms │    no change │
│ QQuery 17    │  95.83ms │                    97.23ms │    no change │
│ QQuery 18    │ 209.53ms │                   219.14ms │    no change │
│ QQuery 19    │  25.82ms │                    27.12ms │ 1.05x slower │
│ QQuery 20    │  34.43ms │                    37.32ms │ 1.08x slower │
│ QQuery 21    │ 158.77ms │                   162.07ms │    no change │
│ QQuery 22    │  16.35ms │                    16.34ms │    no change │
└──────────────┴──────────┴────────────────────────────┴──────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Benchmark Summary                         ┃           ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ Total Time (HEAD)                         │ 1177.83ms │
│ Total Time (alamb_test_actual_pushdown)   │ 1204.93ms │
│ Average Time (HEAD)                       │   53.54ms │
│ Average Time (alamb_test_actual_pushdown) │   54.77ms │
│ Queries Faster                            │         0 │
│ Queries Slower                            │         5 │
│ Queries with No Change                    │        17 │
└───────────────────────────────────────────┴───────────┘
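The Change column can be reproduced from the two timing columns. The following is a minimal sketch, assuming a 5% "no change" band inferred from the rows above rather than read from the benchmark script; the function name `classify_change` is hypothetical.

```python
def classify_change(head_ms: float, branch_ms: float, threshold: float = 0.05) -> str:
    """Label a pair of timings the way the Change column reads.

    The 5% threshold is an assumption inferred from the tables above,
    not taken from the benchmark script itself.
    """
    if branch_ms > head_ms * (1 + threshold):
        # Branch is slower: report how many times longer it took.
        return f"{branch_ms / head_ms:.2f}x slower"
    if head_ms > branch_ms * (1 + threshold):
        # Branch is faster: report the speedup over HEAD.
        return f"+{head_ms / branch_ms:.2f}x faster"
    return "no change"


# Rows copied from the ClickBench tables above.
print(classify_change(856.70, 1029.41))   # clickbench_partitioned Q14
print(classify_change(2032.14, 141.07))   # clickbench_extended Q6
print(classify_change(2123.18, 2226.42))  # clickbench_extended Q7
```

Under that assumption the three calls give `1.20x slower`, `+14.41x faster`, and `no change`, matching the table entries for those rows.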

@zhuqi-lucas
Contributor

ClickBench has only a few cases with a real regression > 20%, and I believe those cases can be improved by combining this with adaptive filtering. I think we are in a good state.
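To make the "> 20%" cut concrete, here is a small sketch that filters the clickbench_partitioned rows marked slower in the results above. The timings are copied from that table; the helper name `regressions_over` and the 1.20 cutoff parameter are illustrative assumptions.

```python
# (query, HEAD ms, branch ms) for the 18 clickbench_partitioned rows marked "slower".
SLOWER_ROWS = [
    (1, 33.68, 35.46), (4, 607.01, 675.90), (5, 858.70, 919.45),
    (6, 22.87, 24.63), (7, 38.68, 43.00), (11, 299.09, 319.60),
    (12, 919.23, 1014.32), (13, 1337.56, 1523.46), (14, 856.70, 1029.41),
    (19, 84.99, 91.43), (20, 1121.22, 1197.78), (22, 2199.07, 2431.34),
    (24, 475.29, 631.80), (25, 393.16, 432.03), (26, 537.95, 692.90),
    (27, 1574.43, 1809.52), (30, 811.90, 1288.32), (31, 875.23, 1301.29),
]


def regressions_over(rows, min_ratio=1.20):
    """Return (query, slowdown ratio) pairs above min_ratio, worst first."""
    hits = [(q, branch / head) for q, head, branch in rows if branch / head > min_ratio]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


for query, ratio in regressions_over(SLOWER_ROWS):
    print(f"Q{query}: {ratio:.2f}x slower")
```

With these rows the sketch lists Q30, Q31, Q24, Q26, and Q14, in that order.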

@alamb
Contributor Author

alamb commented Jun 1, 2025

ClickBench has only a few cases with a real regression > 20%, and I believe those cases can be improved by combining this with adaptive filtering. I think we are in a good state.

I agree -- thank you @zhuqi-lucas

I have a few other optimization ideas on #16208 (comment) that will help this case too.

It would also be super helpful to profile and review the queries where performance slows down, like Q14 and Q21, and see whether those are the ones where adaptive filtering would help

│ QQuery 14    │   856.70ms │                  1029.41ms │  1.20x slower │
│ QQuery 22    │  2199.07ms │                  2431.34ms │  1.11x slower │

Q14:

SELECT "SearchEngineID", "SearchPhrase", COUNT(*) AS c FROM hits WHERE "SearchPhrase" <> '' GROUP BY "SearchEngineID", "SearchPhrase" ORDER BY c DESC LIMIT 10;

Q21:

SELECT "SearchPhrase", MIN("URL"), COUNT(*) AS c FROM hits WHERE "URL" LIKE '%google%' AND "SearchPhrase" <> '' GROUP BY "SearchPhrase" ORDER BY c DESC LIMIT 10;

@zhuqi-lucas
Contributor

#16208 (comment)

apache/arrow-rs#7524 (comment)

Thank you @alamb. Based on the previous results, it will help Q14, Q24, Q30, and Q31, which are the major regressions in this PR's benchmark results, but it does not seem to help Q21/Q22.

│ QQuery 14 │ 856.70ms │ 1029.41ms │ 1.20x slower │
│ QQuery 24 │ 475.29ms │ 631.80ms │ 1.33x slower │
│ QQuery 30 │ 811.90ms │ 1288.32ms │ 1.59x slower │
│ QQuery 31 │ 875.23ms │ 1301.29ms │ 1.49x slower │
