Upgrade to arrow/parquet 55, and object_store to 0.12.0 and pyo3 to 0.24.0 #15466

alamb merged 24 commits into apache:main

Conversation
```diff
 &self,
 prefix: Option<&Path>,
-) -> BoxStream<'_, object_store::Result<ObjectMeta>> {
+) -> BoxStream<'static, object_store::Result<ObjectMeta>> {
```
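For context, a minimal sketch of what callers see from the object_store 0.12 `list` API, where the returned stream now has a `'static` lifetime; the `InMemory` store, the paths, and the tokio/bytes dependencies here are illustrative and not part of this PR:

```rust
use futures::TryStreamExt;
use object_store::memory::InMemory;
use object_store::path::Path;
use object_store::ObjectStore;

#[tokio::main]
async fn main() -> object_store::Result<()> {
    let store = InMemory::new();

    // Write a couple of illustrative objects
    store
        .put(&Path::from("data/a.parquet"), bytes::Bytes::from_static(b"a").into())
        .await?;
    store
        .put(&Path::from("data/b.parquet"), bytes::Bytes::from_static(b"b").into())
        .await?;

    // In object_store 0.12, `list` returns BoxStream<'static, _>, so the
    // stream can be held independently of the borrow used to create it
    let stream = store.list(Some(&Path::from("data")));
    let objects = stream.try_collect::<Vec<_>>().await?;
    println!("found {} objects", objects.len());
    Ok(())
}
```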
```diff
 // read footer according to footer_len
 let get_option = GetOptions {
-    range: Some(GetRange::Suffix(10 + footer_len)),
+    range: Some(GetRange::Suffix(10 + (footer_len as u64))),
```
The changes from `usize` to `u64` are for better wasm support.
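To make the change concrete, here is a minimal sketch (not DataFusion's actual code) of building a suffix-range request against the object_store 0.12 API, where range lengths are `u64` rather than `usize`; the `footer_request` name and `footer_len` parameter are illustrative:

```rust
use object_store::{GetOptions, GetRange};

/// Request only the tail of an object: the footer plus a 10-byte margin.
/// With object_store 0.12 the suffix length is a u64, which avoids
/// overflow concerns on 32-bit targets such as wasm32.
fn footer_request(footer_len: u64) -> GetOptions {
    GetOptions {
        range: Some(GetRange::Suffix(10 + footer_len)),
        ..Default::default()
    }
}
```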
```diff
-        self.inner.get_metadata()
+    fn get_metadata<'a>(
+        &'a mut self,
+        options: Option<&'a ArrowReaderOptions>,
```
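For reference, a sketch of what a delegating `AsyncFileReader` implementation looks like against the parquet 55 trait, where `get_metadata` now takes an optional `ArrowReaderOptions`. The wrapper type is hypothetical, only the two required methods are shown, and the signatures assume the parquet 55 API from the diff above:

```rust
use std::ops::Range;
use std::sync::Arc;

use bytes::Bytes;
use futures::future::BoxFuture;
use parquet::arrow::arrow_reader::ArrowReaderOptions;
use parquet::arrow::async_reader::AsyncFileReader;
use parquet::errors::Result as ParquetResult;
use parquet::file::metadata::ParquetMetaData;

/// Hypothetical wrapper that forwards to an inner reader.
struct DelegatingReader<R> {
    inner: R,
}

impl<R: AsyncFileReader> AsyncFileReader for DelegatingReader<R> {
    // parquet 55 also switched byte ranges to u64
    fn get_bytes(&mut self, range: Range<u64>) -> BoxFuture<'_, ParquetResult<Bytes>> {
        self.inner.get_bytes(range)
    }

    // New in parquet 55: the reader options are threaded through so
    // metadata loading can honor settings such as the page index
    fn get_metadata<'a>(
        &'a mut self,
        options: Option<&'a ArrowReaderOptions>,
    ) -> BoxFuture<'a, ParquetResult<Arc<ParquetMetaData>>> {
        self.inner.get_metadata(options)
    }
}
```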
```diff
 futures = { workspace = true }
 log = { workspace = true }
-object_store = { workspace = true }
+object_store = { workspace = true, features = ["fs"] }
```
```diff
 SELECT extract(minute from arrow_cast('14400 minutes', 'Interval(DayTime)'))
 ----
-14400
+0
```
```rust
use rand::prelude::StdRng;
use rand::Rng;
use rand::SeedableRng;
```
I inlined the small amount of code from bench_util so this benchmark is standalone and it is easier to see what is being tested.
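The inlined helper is not reproduced here, but the idea is roughly the following sketch (the `seeded_values` name and values are made up, assuming the rand 0.8-style `gen_range` API): generate benchmark data from a fixed seed so runs are self-contained and reproducible.

```rust
use rand::prelude::StdRng;
use rand::Rng;
use rand::SeedableRng;

/// Hypothetical stand-in for a helper previously imported from bench_util:
/// deterministic pseudo-random values from a fixed seed, so repeated
/// benchmark runs operate on identical data.
fn seeded_values(n: usize, seed: u64) -> Vec<i64> {
    let mut rng = StdRng::seed_from_u64(seed);
    (0..n).map(|_| rng.gen_range(0..1_000_000)).collect()
}

fn main() {
    let values = seeded_values(8192, 42);
    println!("generated {} values, first = {}", values.len(), values[0]);
}
```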
🤖: Benchmark completed
I tried briefly to reproduce the performance improvements reported above and it seems like I can:

```shell
andrewlamb@Andrews-MacBook-Pro-2:~/Downloads$ for i in `seq 1 5`; do datafusion-cli -f q14.sql ; done | grep seconds
Elapsed 0.374 seconds.
Elapsed 0.386 seconds.
Elapsed 0.378 seconds.
Elapsed 0.366 seconds.
Elapsed 0.366 seconds.
```

I poked around and I can't quite figure out if the improvement is related to concat batches or maybe reducing the number of IOs for reading the parquet metadata 🤔
My theory is that the improvement is due to @rluvaton's PR to improve `concat` (arrow-rs#7309). I thought it might be related to improved pre-fetching / fewer IOs due to …. However, I tried an experiment on main to reduce IO and it doesn't seem to have changed anything.
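For anyone unfamiliar with what that PR touches, the relevant operation is roughly the following self-contained illustration using the public `concat_batches` API (not DataFusion's internal call site): merging many small `RecordBatch`es into one larger batch, the kind of hot path that `CoalesceBatchesExec`-style operators hit.

```rust
use std::sync::Arc;

use arrow::array::{ArrayRef, Int64Array};
use arrow::compute::concat_batches;
use arrow::datatypes::{DataType, Field, Schema};
use arrow::error::ArrowError;
use arrow::record_batch::RecordBatch;

fn main() -> Result<(), ArrowError> {
    let schema = Arc::new(Schema::new(vec![Field::new("v", DataType::Int64, false)]));

    // Many small batches, as a selective filter or a scan might produce
    let batches = (0..100)
        .map(|i| {
            RecordBatch::try_new(
                schema.clone(),
                vec![Arc::new(Int64Array::from(vec![i as i64; 1024])) as ArrayRef],
            )
        })
        .collect::<Result<Vec<_>, _>>()?;

    // Merge them into one larger batch; this is the operation the
    // arrow-rs concat improvements target
    let merged = concat_batches(&schema, &batches)?;
    assert_eq!(merged.num_rows(), 100 * 1024);
    Ok(())
}
```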
This should be easy to confirm with …

Happy to improve performance 😄 I got more in my chamber

Thanks everyone!
…to `0.24.0` (apache#15466)

* Temp pin to datafusion main
* Update cargo lock
* update pyo3
* vendor random generation
* Update error message
* Update for extraction
* Update pin
* Upgrade object_store
* fix feature
* Update file size handling
* bash for object store API changes
* few more
* Update APIs more
* update expected message
* update error messages
* Update to apache
* Update API for nicer parquet u64s
* Fix wasm build
* Remove pin
* Fix signature
This PR upgrades to the latest arrow/parquet release and all the dependencies.
I used this PR to test the arrow/parquet 55 release prior to creating an RC.
Some benchmark results show performance is as good or better (see below)
I haven't figured out exactly why, but I theorize it is at least in part due to @rluvaton's PR to improve `concat` performance and add `append_array` for some array builder implementations: arrow-rs#7309
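As a concrete illustration of the `append_array` addition, here is a minimal sketch assuming `Int64Builder` is among the builders that gained the method in arrow-rs#7309:

```rust
use arrow::array::{Array, Int64Array, Int64Builder};

fn main() {
    let src = Int64Array::from(vec![1, 2, 3, 4]);

    let mut builder = Int64Builder::new();
    // Bulk-append an existing array instead of pushing values one at a time
    builder.append_array(&src);
    builder.append_value(5);

    let out = builder.finish();
    assert_eq!(out.len(), 5);
}
```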