-
Notifications
You must be signed in to change notification settings - Fork 45
chore(deps): update parquet/arrow/arrow-csv from 56 to 57 #200
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
chore(deps): update parquet/arrow/arrow-csv from 56 to 57 #200
Conversation
|
looks like we need to refactor |
|
I am wondering what the impact is of moving to 57 for all three deps |
in terms of performance? |
alamb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @kevinjqliu and @clflushopt
I ran some unscientific benchmarks on my laptop and indeed it seems like the upgrade makes writing 10% slower for some reason 🤔
Benchmark
rm -rf lineitem && time tpchgen-cli --scale-factor=100 --tables=lineitem --parts=10 --format=parquet| Release | Time |
|---|---|
| main | 0m25.369s |
| main | 0m25.729s |
| this PR | 0m28.516s |
| this PR | 0m28.682s |
I'll do some profiling and see if I can see any reason
| // Create writers for each of the leaf columns | ||
| let mut col_writers = get_column_writers(&parquet_schema, &writer_properties, &schema).unwrap(); | ||
| #[allow(deprecated)] | ||
| let mut col_writers = parquet::arrow::arrow_writer::get_column_writers( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we just need to change this to follow the example here: https://docs.rs/parquet/latest/parquet/arrow/arrow_writer/struct.ArrowColumnWriter.html
|
Not scientific either, this shows improvements for this PR vs last release (2.0.1) |
I think by default tpchgen-cli makes TBL files (not parquet) so this command is likely not testing any changes related to arrow/parquet |
|
I filed a ticket upstream to investigate and will post my findings there |
oh yea, good point. I did it agains with the same options you used above. This PR is faster than v2.0.1 |
|
I think we have figured out what was going on |
|
I wanted to confirm similar numbers, I spent a little while thinking I was doing something wrong 😮💨 |
The good (amazing ❤️ ) news is that @etseidl has a fix already (will be released in 57.1.0): I verified with the code in apache/arrow-rs#8786, this PR now has roughly similar performance to 56 It still seems a few percent slower (0m25.957s vs 0m25.378s) but I will file some issues upstream to further optimize things |
Closes #197 #198 #199