Upgrade to arrow/parquet 55, and object_store to 0.12.0 and pyo3 to 0.24.0 #15466

alamb merged 24 commits into apache:main

Conversation
```diff
 &self,
 prefix: Option<&Path>,
-) -> BoxStream<'_, object_store::Result<ObjectMeta>> {
+) -> BoxStream<'static, object_store::Result<ObjectMeta>> {
```
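For context, a minimal sketch of what callers see from the object_store 0.12 `list` API, where the returned stream now has a `'static` lifetime; the `InMemory` store, the paths, and the tokio/bytes dependencies here are illustrative and not part of this PR:

```rust
use futures::TryStreamExt;
use object_store::memory::InMemory;
use object_store::path::Path;
use object_store::ObjectStore;

#[tokio::main]
async fn main() -> object_store::Result<()> {
    let store = InMemory::new();

    // Write a couple of illustrative objects
    store
        .put(&Path::from("data/a.parquet"), bytes::Bytes::from_static(b"a").into())
        .await?;
    store
        .put(&Path::from("data/b.parquet"), bytes::Bytes::from_static(b"b").into())
        .await?;

    // In object_store 0.12, `list` returns BoxStream<'static, _>, so the
    // stream can be held independently of the borrow used to create it
    let stream = store.list(Some(&Path::from("data")));
    let objects = stream.try_collect::<Vec<_>>().await?;
    println!("found {} objects", objects.len());
    Ok(())
}
```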
```diff
 // read footer according to footer_len
 let get_option = GetOptions {
-    range: Some(GetRange::Suffix(10 + footer_len)),
+    range: Some(GetRange::Suffix(10 + (footer_len as u64))),
```
The changes from `usize` to `u64` are for better wasm support.
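To make the change concrete, here is a minimal sketch (not DataFusion's actual code) of building a suffix-range request against the object_store 0.12 API, where range lengths are `u64` rather than `usize`; the `footer_request` name and `footer_len` parameter are illustrative:

```rust
use object_store::{GetOptions, GetRange};

/// Request only the tail of an object: the footer plus a 10-byte margin.
/// With object_store 0.12 the suffix length is a u64, which avoids
/// overflow concerns on 32-bit targets such as wasm32.
fn footer_request(footer_len: u64) -> GetOptions {
    GetOptions {
        range: Some(GetRange::Suffix(10 + footer_len)),
        ..Default::default()
    }
}
```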
```diff
-        self.inner.get_metadata()
+    fn get_metadata<'a>(
+        &'a mut self,
+        options: Option<&'a ArrowReaderOptions>,
```
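For reference, a sketch of what a delegating `AsyncFileReader` implementation looks like against the parquet 55 trait, where `get_metadata` now takes an optional `ArrowReaderOptions`. The wrapper type is hypothetical, only the two required methods are shown, and the signatures assume the parquet 55 API from the diff above:

```rust
use std::ops::Range;
use std::sync::Arc;

use bytes::Bytes;
use futures::future::BoxFuture;
use parquet::arrow::arrow_reader::ArrowReaderOptions;
use parquet::arrow::async_reader::AsyncFileReader;
use parquet::errors::Result as ParquetResult;
use parquet::file::metadata::ParquetMetaData;

/// Hypothetical wrapper that forwards to an inner reader.
struct DelegatingReader<R> {
    inner: R,
}

impl<R: AsyncFileReader> AsyncFileReader for DelegatingReader<R> {
    // parquet 55 also switched byte ranges to u64
    fn get_bytes(&mut self, range: Range<u64>) -> BoxFuture<'_, ParquetResult<Bytes>> {
        self.inner.get_bytes(range)
    }

    // New in parquet 55: the reader options are threaded through so
    // metadata loading can honor settings such as the page index
    fn get_metadata<'a>(
        &'a mut self,
        options: Option<&'a ArrowReaderOptions>,
    ) -> BoxFuture<'a, ParquetResult<Arc<ParquetMetaData>>> {
        self.inner.get_metadata(options)
    }
}
```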
```diff
 futures = { workspace = true }
 log = { workspace = true }
-object_store = { workspace = true }
+object_store = { workspace = true, features = ["fs"] }
```
```diff
 SELECT extract(minute from arrow_cast('14400 minutes', 'Interval(DayTime)'))
 ----
-14400
+0
```
```rust
use rand::prelude::StdRng;
use rand::Rng;
use rand::SeedableRng;
```
I inlined the small amount of code from bench_util so this benchmark is standalone and it is easier to see what is being tested.
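The inlined helper is not reproduced here, but the idea is roughly the following sketch (the `seeded_values` name and values are made up, assuming the rand 0.8-style `gen_range` API): generate benchmark data from a fixed seed so runs are self-contained and reproducible.

```rust
use rand::prelude::StdRng;
use rand::Rng;
use rand::SeedableRng;

/// Hypothetical stand-in for a helper previously imported from bench_util:
/// deterministic pseudo-random values from a fixed seed, so repeated
/// benchmark runs operate on identical data.
fn seeded_values(n: usize, seed: u64) -> Vec<i64> {
    let mut rng = StdRng::seed_from_u64(seed);
    (0..n).map(|_| rng.gen_range(0..1_000_000)).collect()
}

fn main() {
    let values = seeded_values(8192, 42);
    println!("generated {} values, first = {}", values.len(), values[0]);
}
```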
🤖: Benchmark completed
I tried briefly to reproduce the performance improvements reported above and it seems like I can:

```shell
andrewlamb@Andrews-MacBook-Pro-2:~/Downloads$ for i in `seq 1 5`; do datafusion-cli -f q14.sql ; done | grep seconds
Elapsed 0.374 seconds.
Elapsed 0.386 seconds.
Elapsed 0.378 seconds.
Elapsed 0.366 seconds.
Elapsed 0.366 seconds.
```

I poked around and I can't quite figure out if the improvement is related to concat batches or maybe reducing the number of IOs for reading the parquet metadata 🤔
My theory is that the improvement is due to @rluvaton's PR to improve `concat` (arrow-rs#7309). I thought it might be related to improved pre-fetching / fewer IOs due to …. However, I tried an experiment on main to reduce IO and it doesn't seem to have changed anything.
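For anyone unfamiliar with what that PR touches, the relevant operation is roughly the following self-contained illustration using the public `concat_batches` API (not DataFusion's internal call site): merging many small `RecordBatch`es into one larger batch, the kind of hot path that `CoalesceBatchesExec`-style operators hit.

```rust
use std::sync::Arc;

use arrow::array::{ArrayRef, Int64Array};
use arrow::compute::concat_batches;
use arrow::datatypes::{DataType, Field, Schema};
use arrow::error::ArrowError;
use arrow::record_batch::RecordBatch;

fn main() -> Result<(), ArrowError> {
    let schema = Arc::new(Schema::new(vec![Field::new("v", DataType::Int64, false)]));

    // Many small batches, as a selective filter or a scan might produce
    let batches = (0..100)
        .map(|i| {
            RecordBatch::try_new(
                schema.clone(),
                vec![Arc::new(Int64Array::from(vec![i as i64; 1024])) as ArrayRef],
            )
        })
        .collect::<Result<Vec<_>, _>>()?;

    // Merge them into one larger batch; this is the operation the
    // arrow-rs concat improvements target
    let merged = concat_batches(&schema, &batches)?;
    assert_eq!(merged.num_rows(), 100 * 1024);
    Ok(())
}
```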
This should be easy to confirm with …

Happy to improve performance 😄 I got more in my chamber

Thanks everyone!
…to `0.24.0` (apache#15466)

* Temp pin to datafusion main
* Update cargo lock
* update pyo3
* vendor random generation
* Update error message
* Update for extraction
* Update pin
* Upgrade object_store
* fix feature
* Update file size handling
* bash for object store API changes
* few more
* Update APIs more
* update expected message
* update error messages
* Update to apache
* Update API for nicer parquet u64s
* Fix wasm build
* Remove pin
* Fix signature
This PR upgrades to the latest arrow/parquet release and all the dependencies.
I used this PR to test the arrow/parquet 55 release prior to creating an RC.
Some benchmark results show performance is as good or better (see below)
I haven't figured out exactly why, but I theorize it is at least in part due to @rluvaton's PR to improve `concat` performance and add `append_array` for some array builder implementations: arrow-rs#7309
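As a concrete illustration of the `append_array` addition, here is a minimal sketch assuming `Int64Builder` is among the builders that gained the method in arrow-rs#7309:

```rust
use arrow::array::{Array, Int64Array, Int64Builder};

fn main() {
    let src = Int64Array::from(vec![1, 2, 3, 4]);

    let mut builder = Int64Builder::new();
    // Bulk-append an existing array instead of pushing values one at a time
    builder.append_array(&src);
    builder.append_value(5);

    let out = builder.finish();
    assert_eq!(out.len(), 5);
}
```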