Skip to content

Conversation

@kevinjqliu
Copy link
Collaborator

@kevinjqliu kevinjqliu commented Nov 1, 2025

Closes #197 #198 #199

@kevinjqliu
Copy link
Collaborator Author

looks like we need to refactor get_column_writers since its now deprecated

@clflushopt
Copy link
Owner

I am wondering what the impact is of moving to 57 for all three deps

@kevinjqliu
Copy link
Collaborator Author

I am wondering what the impact is of moving to 57 for all three deps

in terms of performance?

Copy link
Collaborator

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @kevinjqliu and @clflushopt

I ran some unscientific benchmarks on my laptop and indeed it seems like the upgrade makes writing 10% slower for some reason 🤔

Benchmark

rm -rf lineitem && time tpchgen-cli --scale-factor=100 --tables=lineitem --parts=10 --format=parquet
Release Time
main 0m25.369s
main 0m25.729s
this PR 0m28.516s
this PR 0m28.682s

I'll do some profiling and see if I can see any reason

// Create writers for each of the leaf columns
let mut col_writers = get_column_writers(&parquet_schema, &writer_properties, &schema).unwrap();
#[allow(deprecated)]
let mut col_writers = parquet::arrow::arrow_writer::get_column_writers(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we just need to change this to follow the example here: https://docs.rs/parquet/latest/parquet/arrow/arrow_writer/struct.ArrowColumnWriter.html

@kevinjqliu
Copy link
Collaborator Author

Not scientific either, this shows improvements for this PR vs last release (2.0.1)
only changes since 2.0.1 release are related to github actions, v2.0.1...main

➜ hyperfine --warmup 5 --runs 30 './target/release/tpchgen-cli -s 10'
Benchmark 1: ./target/release/tpchgen-cli -s 10
  Time (mean ± σ):     729.7 ms ±   7.6 ms    [User: 702.9 ms, System: 20.1 ms]
  Range (min … max):   714.5 ms … 748.4 ms    30 runs
 
➜ hyperfine --warmup 5 --runs 30 'uvx tpchgen-cli -s 10'
Benchmark 1: uvx tpchgen-cli -s 10
  Time (mean ± σ):     825.4 ms ±  12.3 ms    [User: 706.8 ms, System: 29.4 ms]
  Range (min … max):   812.7 ms … 873.6 ms    30 runs

@alamb
Copy link
Collaborator

alamb commented Nov 4, 2025

Not scientific either, this shows improvements for this PR vs last release (2.0.1) only changes since 2.0.1 release are related to github actions, v2.0.1...main

➜ hyperfine --warmup 5 --runs 30 './target/release/tpchgen-cli -s 10'
Benchmark 1: ./target/release/tpchgen-cli -s 10
  Time (mean ± σ):     729.7 ms ±   7.6 ms    [User: 702.9 ms, System: 20.1 ms]
  Range (min … max):   714.5 ms … 748.4 ms    30 runs
 
➜ hyperfine --warmup 5 --runs 30 'uvx tpchgen-cli -s 10'
Benchmark 1: uvx tpchgen-cli -s 10
  Time (mean ± σ):     825.4 ms ±  12.3 ms    [User: 706.8 ms, System: 29.4 ms]
  Range (min … max):   812.7 ms … 873.6 ms    30 runs

I think by default tpchgen-cli makes TBL files (not parquet) so this command is likely not testing any changes related to arrow/parquet

@alamb
Copy link
Collaborator

alamb commented Nov 4, 2025

I filed a ticket upstream to investigate and will post my findings there

@kevinjqliu
Copy link
Collaborator Author

I think by default tpchgen-cli makes TBL files (not parquet) so this command is likely not testing any changes related to arrow/parquet

oh yea, good point. I did it agains with the same options you used above. This PR is faster than v2.0.1

➜ hyperfine --warmup 5 --runs 30 './target/release/tpchgen-cli --scale-factor=100 --tables=lineitem --parts=10 --format=parquet'
Benchmark 1: ./target/release/tpchgen-cli --scale-factor=100 --tables=lineitem --parts=10 --format=parquet
  Time (mean ± σ):     784.5 ms ±   3.4 ms    [User: 757.3 ms, System: 20.9 ms]
  Range (min … max):   777.0 ms … 791.4 ms    30 runs

➜ hyperfine --warmup 5 --runs 30 'uvx tpchgen-cli --scale-factor=100 --tables=lineitem --parts=10 --format=parquet'
Benchmark 1: uvx tpchgen-cli --scale-factor=100 --tables=lineitem --parts=10 --format=parquet
  Time (mean ± σ):     869.3 ms ±   5.3 ms    [User: 753.4 ms, System: 29.8 ms]
  Range (min … max):   860.0 ms … 882.8 ms    30 runs

@alamb
Copy link
Collaborator

alamb commented Nov 4, 2025

I think we have figured out what was going on

@clflushopt
Copy link
Owner

I wanted to confirm similar numbers, I spent a little while thinking I was doing something wrong 😮‍💨

@alamb
Copy link
Collaborator

alamb commented Nov 5, 2025

I wanted to confirm similar numbers, I spent a little while thinking I was doing something wrong 😮‍💨

The good (amazing ❤️ ) news is that @etseidl has a fix already (will be released in 57.1.0):

I verified with the code in apache/arrow-rs#8786, this PR now has roughly similar performance to 56

It still seems a few percent slower (0m25.957s vs 0m25.378s) but I will file some issues upstream to further optimize things

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants