Conversation
@alamb I borrowed liberally from your parquet footer code 😉
Example run (details omitted)
alamb left a comment:
Thanks @etseidl -- this looks great
I had a few suggestions, but nothing I think is required to merge
I also ran some basic profiling on these benchmarks to see what they measure:

`samply record -- cargo bench --bench parquet_round_trip -- "write int64 dict"`

And it looks like it is measuring what I would expect.
I have a good feeling that the encoder/decoder is about to get a lot faster...

    /// Creates a [`PrimitiveArray`] of a given `size` and `null_density`
    /// filling it with random numbers generated using the provided `seed`.
    pub fn create_primitive_array_with_seed<T>(
Do we need a separate copy of these functions? Maybe we can reuse the existing functions in bench_utils.rs:
Ha, I just copied from https://github.com/alamb/parquet_footer_parsing/blob/main/src/datagen.rs 😅
I'll switch over 👍
(I totally copy/pasted from arrow for that benchmark of course -- it all came full circle!)
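(For reference, a rough sketch of what reusing arrow's existing seeded generator could look like; it assumes `create_primitive_array_with_seed` is available from `arrow::util::bench_util` behind the `test_utils` feature, and the column shape shown is only an example.)

```rust
use arrow::array::PrimitiveArray;
use arrow::datatypes::Int64Type;
use arrow::util::bench_util::create_primitive_array_with_seed;

// Reuse arrow's seeded generator instead of a local copy: a fixed seed means
// every benchmark run encodes identical data.
fn int64_column(num_rows: usize) -> PrimitiveArray<Int64Type> {
    // 10% nulls, seed 42 (values chosen only for illustration)
    create_primitive_array_with_seed::<Int64Type>(num_rows, 0.1, 42)
}
```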

    pub fn file_from_spec(spec: ParquetFileSpec, buf_size: Option<usize>) -> Bytes {
        // ...
    }

    fn read_write(c: &mut Criterion, spec: ParquetFileSpec, msg: &str) {
        let f = file_from_spec(spec, None);
rather than passing in the buffer size, maybe the test could pass in the buffer directly (reusing it across calls) to avoid all output buffer allocations 🤔
Yes, I was thinking that too after submitting. I'll try switching it up some.
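A minimal sketch of that direction, assuming the caller hands in a reusable `Vec<u8>` (the helper name `write_to_buffer` is made up for illustration):

```rust
use arrow::record_batch::RecordBatch;
use parquet::arrow::ArrowWriter;
use parquet::errors::Result;

// The benchmark owns `buf` and passes it in on every iteration, so the
// output buffer's allocation is amortized instead of repeated per call.
fn write_to_buffer(batch: &RecordBatch, buf: &mut Vec<u8>) -> Result<()> {
    buf.clear(); // keep capacity from earlier iterations, drop old contents
    let mut writer = ArrowWriter::try_new(buf, batch.schema(), None)?;
    writer.write(batch)?;
    writer.close()?;
    Ok(())
}
```

Inside the Criterion closure this would become something like `b.iter(|| write_to_buffer(&batch, &mut buf))`, so only the first iteration pays for growing the buffer.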

    for rg in 0..spec.num_row_groups {
        let col_writers = row_group_factory.create_column_writers(rg).unwrap();

        let encoded_columns = encode_row_group(&schema, &spec, col_writers);
What is the reason to use this lower level API (rather than just `writer.write` to write the whole batch)?
Using `writer.write` would likely be less code, and I think it might more closely mirror the API people actually use (though now I write this I am not sure I really know what people use).
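(To make the suggestion concrete, here is a hedged sketch of the `ArrowWriter`-based path; the function name and the choice of properties are illustrative, not what this PR does.)

```rust
use arrow::record_batch::RecordBatch;
use parquet::arrow::ArrowWriter;
use parquet::basic::Encoding;
use parquet::errors::Result;
use parquet::file::properties::WriterProperties;

// The high-level API: ArrowWriter handles column chunking and page splitting,
// while WriterProperties still pins down row-group size and the encoding
// under test. Assumes `batches` is non-empty.
fn write_with_arrow_writer(
    batches: &[RecordBatch],
    rows_per_group: usize,
    encoding: Encoding,
) -> Result<Vec<u8>> {
    let props = WriterProperties::builder()
        .set_max_row_group_size(rows_per_group)
        .set_dictionary_enabled(false)
        .set_encoding(encoding)
        .build();
    let mut out = Vec::new();
    let mut writer = ArrowWriter::try_new(&mut out, batches[0].schema(), Some(props))?;
    for batch in batches {
        writer.write(batch)?;
    }
    writer.close()?;
    Ok(out)
}
```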

    c.bench_function(&format!("read {msg}"), |b| {
        b.iter(|| {
            let record_reader = ParquetRecordBatchReaderBuilder::try_new(f.clone())
I had to double check that `f` is `Bytes` here (and thus this clone is cheap)
I'll change to a more meaningful name 😅
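(Side note on why the clone is cheap, with a more descriptive variable name as an example — `parquet_bytes` is not the name used in the PR.)

```rust
use bytes::Bytes;
use parquet::arrow::arrow_reader::ParquetRecordBatchReaderBuilder;

// `Bytes` is a reference-counted slice, so `clone()` just bumps a refcount;
// the encoded file is not copied on each benchmark iteration.
fn count_rows(parquet_bytes: &Bytes) -> usize {
    let reader = ParquetRecordBatchReaderBuilder::try_new(parquet_bytes.clone())
        .unwrap()
        .build()
        .unwrap();
    reader.map(|batch| batch.unwrap().num_rows()).sum()
}
```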
Thanks @alamb. I was a bit surprised that the read and write times were so different.
Hopefully. One thing to hit first is the fixed-len binary encoder. That seems to do a lot of allocations/copies that we should be able to avoid.

    // arrow::util::bench_util::create_fsb_array with a seed

    /// Creates a random (but fixed-seeded) array of fixed size with a given null density and length
    fn create_fsb_array_with_seed(
Should I just move this to bench_util.rs?
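(If it does move, one possible shape for the helper — a sketch only, assuming `FixedSizeBinaryArray::try_from_sparse_iter_with_size` and a seeded `StdRng`; the actual PR code may differ.)

```rust
use arrow::array::FixedSizeBinaryArray;
use rand::{rngs::StdRng, Rng, SeedableRng};

/// Sketch of a seeded fixed-size-binary generator: same shape as
/// `create_fsb_array`, but deterministic across benchmark runs.
fn create_fsb_array_with_seed(
    size: usize,
    null_density: f32,
    value_len: i32,
    seed: u64,
) -> FixedSizeBinaryArray {
    let mut rng = StdRng::seed_from_u64(seed);
    FixedSizeBinaryArray::try_from_sparse_iter_with_size(
        (0..size).map(|_| {
            if rng.gen::<f32>() < null_density {
                None
            } else {
                Some((0..value_len).map(|_| rng.gen::<u8>()).collect::<Vec<u8>>())
            }
        }),
        value_len,
    )
    .unwrap()
}
```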
Thanks @etseidl
Which issue does this PR close?
Rationale for this change
Provides a baseline for measuring future improvements to the Parquet read and write paths.
What changes are included in this PR?
Adds new benchmarks for reading and writing. Currently uses a fixed number of row groups, pages, and rows. Cycles through data types and encodings.
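(As a rough illustration of the benchmark matrix described above; the struct and field names here are hypothetical, not the PR's actual `ParquetFileSpec`.)

```rust
use arrow::datatypes::DataType;
use parquet::basic::Encoding;

// Hypothetical shape of the benchmark spec: the file geometry is fixed,
// while the column data type and encoding are cycled across benchmark cases.
struct BenchSpec {
    num_row_groups: usize,   // fixed
    pages_per_column: usize, // fixed
    rows_per_page: usize,    // fixed
    data_type: DataType,     // varied: Int64, Utf8, FixedSizeBinary, ...
    encoding: Encoding,      // varied: PLAIN, RLE_DICTIONARY, DELTA_BINARY_PACKED, ...
}
```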
Are these changes tested?
N/A
Are there any user-facing changes?
No