Conversation

@etseidl (Contributor) commented Oct 8, 2025

Which issue does this PR close?

Closes [thrift-remodel] Optimize convert_row_groups (listed under Development below).
Rationale for this change

A good bit (around 15%) of Parquet metadata parsing involves first decoding to thrift structs (FileMetaData, RowGroup, etc), and then converting to the metadata structs used by this crate (ParquetMetaData, RowGroupMetaData, etc). This PR removes some of the intermediate structures and parses straight to the crate structs.
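
Schematically, the change replaces a two-step decode with a single pass. A minimal sketch with hypothetical types (not the crate's actual API):

struct ThriftRowGroup { num_rows: i64 }   // generated intermediate (before)
struct RowGroupMetaData { num_rows: i64 } // crate struct (what callers use)

fn read_i64(bytes: &[u8]) -> i64 {
    // stand-in for real thrift compact-protocol decoding
    i64::from_le_bytes(bytes[..8].try_into().unwrap())
}

// Before: decode to the thrift struct, then convert.
fn decode_two_step(bytes: &[u8]) -> RowGroupMetaData {
    let t = ThriftRowGroup { num_rows: read_i64(bytes) };
    RowGroupMetaData { num_rows: t.num_rows }
}

// After: parse straight into the crate struct, no intermediate.
fn decode_direct(bytes: &[u8]) -> RowGroupMetaData {
    RowGroupMetaData { num_rows: read_i64(bytes) }
}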

What changes are included in this PR?

Some thrift-generated structures are removed, and the code necessary to decode them has been hand-written. This will enable future optimizations such as selectively decoding parts of the metadata.

In addition to the above, this PR cleans up some of the memory size computation, and also boxes some of the objects used for decryption.

Are these changes tested?

Should be covered by existing tests.

Are there any user-facing changes?

No, only private APIs are changed.

@github-actions bot added the parquet label Oct 8, 2025
@etseidl (Contributor, Author) commented Oct 8, 2025

Benchmarks relative to current main.

default features
group                             main_57_0_default                      no_conv
-----                             -----------------                      -------
decode parquet metadata           1.10     19.2±0.29µs        ? ?/sec    1.00     17.4±0.31µs        ? ?/sec
decode parquet metadata (wide)    1.15     85.1±1.73ms        ? ?/sec    1.00     74.2±1.58ms        ? ?/sec
open(default)                     1.10     20.0±0.34µs        ? ?/sec    1.00     18.1±0.41µs        ? ?/sec
open(page index)                  1.02    257.3±7.66µs        ? ?/sec    1.00    252.9±4.98µs        ? ?/sec

all features
group                             main_57_0                              no_conv_encr
-----                             ---------                              ------------
decode parquet metadata           1.08     19.5±0.39µs        ? ?/sec    1.00     18.0±0.32µs        ? ?/sec
decode parquet metadata (wide)    1.14     89.2±1.68ms        ? ?/sec    1.00     78.5±1.69ms        ? ?/sec
open(default)                     1.14     21.3±0.38µs        ? ?/sec    1.00     18.8±0.39µs        ? ?/sec
open(page index)                  1.02    259.8±5.41µs        ? ?/sec    1.00    254.6±5.31µs        ? ?/sec

Comment on lines 59 to 63
impl<T: HeapSize> HeapSize for Box<T> {
    fn heap_size(&self) -> usize {
        self.as_ref().heap_size()
    }
}
@etseidl (Contributor, Author) Oct 8, 2025

Not sure if this is correct. Should this also include the size of T?

@alamb (Contributor)

Per the description of

/// Return the size of any bytes allocated on the heap by this object,
/// including heap memory in those structures
///
/// Note that the size of the type itself is not included in the result --
/// instead, that size is added by the caller (e.g. container).

So in that case I do think the size of T should be included. The size of self (the pointer/Box) should not be included, but the memory it points to should be.

@etseidl (Contributor, Author)

Thanks, I'll fix.
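
(A minimal sketch of the resulting fix, assuming the HeapSize trait contract quoted above:)

impl<T: HeapSize> HeapSize for Box<T> {
    fn heap_size(&self) -> usize {
        // The Box pointer itself is counted by the containing object; the
        // boxed value lives on the heap, so include its size plus any heap
        // memory it owns in turn.
        std::mem::size_of::<T>() + self.as_ref().heap_size()
    }
}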

}
);

// TODO(ets): move a lot of the encryption stuff to its own module
@etseidl (Contributor, Author)

There is a follow-on PR ready to do just this.

Comment on lines +95 to +96
thrift_struct!(
    pub(crate) struct Statistics<'a> {
@etseidl (Contributor, Author)

this just moved from below

return Err(general_err!("Column order length mismatch"));
// using ThriftSliceInputProtocol rather than the ThriftCompactInputProtocol trait because
// these are all internal and operate on slices.
fn read_column_chunk<'a>(
@etseidl (Contributor, Author)

I realize this is quite ugly, but it's necessary both for performance and for stats skipping and other optimizations. I do have a simplified version of this function in the queue, so it will get better with time.

@alamb (Contributor) commented Oct 9, 2025

👀

@alamb (Contributor) left a comment

Thanks @etseidl -- I started going through this PR but didn't make it. I will finish up tomorrow

+ self.unencoded_byte_array_data_bytes.heap_size()
+ self.repetition_level_histogram.heap_size()
+ self.definition_level_histogram.heap_size()
+ self.geo_statistics.heap_size()
@alamb (Contributor)

💯


impl HeapSize for FileMetaData {
    fn heap_size(&self) -> usize {
        #[cfg(feature = "encryption")]
@alamb (Contributor)

@etseidl (Contributor, Author)

No, we still need to implement HeapSize for FileDecryptor.

#[test]
fn test_parquet_1481() {
    let err = read_file("PARQUET-1481.parquet").unwrap_err();
    #[cfg(feature = "encryption")]
@alamb (Contributor)

Why did this test change?

@etseidl (Contributor, Author)

Because I've unified the file metadata decoding for both paths, so now we get the same error regardless of whether encryption is enabled or not.

@etseidl (Contributor, Author) commented Oct 9, 2025

> Thanks @etseidl -- I started going through this PR but didn't make it. I will finish up tomorrow

Thanks @alamb, I know it's a lot to slog through. This is the largest one I've got queued up 😅

@alamb (Contributor) commented Oct 10, 2025

🤖 ./gh_compare_arrow.sh Benchmark Script Running
Linux aal-dev 6.14.0-1016-gcp #17~24.04.1-Ubuntu SMP Wed Sep 3 01:55:36 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing no_row_group_conv (12a8c2f) to 8e669e7
BENCH_NAME=metadata
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental --bench metadata
BENCH_FILTER=
BENCH_BRANCH_NAME=no_row_group_conv
Results will be posted here when complete

@alamb (Contributor) commented Oct 10, 2025

🤖: Benchmark completed


group                             main                                   no_row_group_conv
-----                             ----                                   -----------------
decode parquet metadata           1.07     13.0±0.07µs        ? ?/sec    1.00     12.1±0.13µs        ? ?/sec
decode parquet metadata (wide)    1.19     71.6±1.05ms        ? ?/sec    1.00     60.1±1.02ms        ? ?/sec
open(default)                     1.11     13.0±0.05µs        ? ?/sec    1.00     11.7±0.09µs        ? ?/sec
open(page index)                  1.01    203.5±1.58µs        ? ?/sec    1.00    201.8±2.20µs        ? ?/sec

@alamb (Contributor) commented Oct 10, 2025

> 🤖: Benchmark completed

🚀 that is quite nice @etseidl -- love it.

@alamb (Contributor) left a comment

Thank you @etseidl -- this is a great example of what the new thrift decoding infrastructure allows ❤️

}

/// Create a [`crate::file::statistics::Statistics`] from a thrift [`Statistics`] object.
pub(crate) fn convert_stats(
@alamb (Contributor)

This is probably a good target to consider reworking eventually as well if it shows up in traces.

@etseidl (Contributor, Author)

Added to the list.


let compression = codec;

// NOTE: I tried using the builder for this, but it added 20% to the execution time
@alamb (Contributor)

that is interesting and unexpected. Sounds like a good idea to keep it like this

@etseidl (Contributor, Author)

Spoiler: I eventually take another page from @jhorstmann's book and use the builder to create a default initialized ColumnChunkMetaData whose fields I then fill in directly. It reduces the code bloat and allows me to again split out the ColumnMetaData parsing without any drop in performance.
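
(The pattern described, as a self-contained sketch with hypothetical types rather than the crate's actual ones: build one default-initialized value, then fill its fields in place while decoding instead of constructing and moving a temporary.)

#[derive(Default)]
struct ColumnMeta {
    num_values: i64,
    total_byte_size: i64,
}

fn decode_column(values: i64, bytes: i64) -> ColumnMeta {
    // Start from a default-initialized struct (what the builder produces),
    // then assign fields directly -- no temporary struct to move.
    let mut meta = ColumnMeta::default();
    meta.num_values = values;
    meta.total_byte_size = bytes;
    meta
}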

}

Ok(ParquetMetaData::new(fmd, row_groups))
/// Create [`ParquetMetaData`] from thrift input. Note that this only decodes the file metadata in
@alamb (Contributor)

this is really neat. With this code, I can easily see a bunch more potential optimizations (for example, use a single Vec of ColumnChunkMetadata and have each RowGroup store its starting offset into that Vec, rather than Vecs of Vecs).

(not for this PR obviously, but for a future one)
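
(A sketch of the flattened layout being suggested, with hypothetical types:)

struct ColumnChunk { /* per-column-chunk metadata */ }

struct RowGroup {
    first_column: usize, // offset into the shared columns Vec
    num_columns: usize,
}

struct ParquetMetaDataFlat {
    columns: Vec<ColumnChunk>, // all columns of all row groups, contiguous
    row_groups: Vec<RowGroup>,
}

impl ParquetMetaDataFlat {
    fn columns_of(&self, rg: &RowGroup) -> &[ColumnChunk] {
        &self.columns[rg.first_column..rg.first_column + rg.num_columns]
    }
}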

file_offset = Some(i64::read_thrift(&mut *prot)?);
}
3 => {
// `ColumnMetaData`. Read inline for performance sake.
@alamb (Contributor)

what do you mean by "for performance sake" -- does that mean the rust compiler isn't inlining this when using the macro? Maybe we should sprinkle some #[inline] 🤔

@etseidl (Contributor, Author)

It was hard to tell from profiling exactly what was adding to the overhead, but splitting this into a separate call that decodes and returns a ColumnMetaData object, which is then moved into the ColumnChunkMetaData, added about 5% to execution time IIRC. As mentioned above, my latest code passes the ColumnChunkMetaData in, so no temporary object is created. Then I get some code reuse and this becomes easier to read 😄

@alamb (Contributor)

Another benefit of all this thrift remodel work is that we can do this level of optimization (which is basically impractical with generated code)

@etseidl (Contributor, Author)

especially when it comes to things like recasting the encodings vector as a bitmask. That gets rid of a ton of allocations 😄
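
(A self-contained sketch of that idea, with a hypothetical mask type; the encoding discriminants follow the Parquet spec values. Recording an encoding becomes a bit-or instead of a Vec push and its allocation:)

#[derive(Clone, Copy)]
enum Encoding {
    Plain = 0,
    Rle = 3,
    DeltaBinaryPacked = 5,
}

#[derive(Default, Clone, Copy)]
struct EncodingMask(u32);

impl EncodingMask {
    fn insert(&mut self, e: Encoding) {
        self.0 |= 1 << (e as u32); // set the bit for this encoding
    }
    fn contains(&self, e: Encoding) -> bool {
        self.0 & (1 << (e as u32)) != 0
    }
}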

Comment on lines +1058 to +1062
let mut cols = Vec::with_capacity(list_ident.size as usize);
for i in 0..list_ident.size as usize {
    let col = read_column_chunk(prot, &schema_descr.columns()[i])?;
    cols.push(col);
}
@alamb (Contributor)

I am just curious -- did you try a more functional form? (I never know quite what the rust compiler knows how to optimize.)

Sometimes using this type of loop lets the compiler avoid the bounds checks, which may or may not help in this case

Suggested change
let mut cols = Vec::with_capacity(list_ident.size as usize);
for i in 0..list_ident.size as usize {
    let col = read_column_chunk(prot, &schema_descr.columns()[i])?;
    cols.push(col);
}
let cols = schema_descr
    .columns()
    .iter()
    .map(|c| read_column_chunk(prot, c))
    .collect::<Result<Vec<_>>>()?;

(I don't think we should do this as part of this PR, but it might be interesting to try in the future)

@etseidl (Contributor, Author)

I tried that last night 😅. It was a bit slower. That pattern has been a bit of a mixed bag. I have used it where I can.
https://github.com/etseidl/arrow-rs/blob/12a8c2f34dd9410d449b8562a72665670a6b65d7/parquet/src/file/metadata/thrift_gen.rs#L656-L659

I keep trying it here and there, and I keep it when it doesn't hurt performance.

@alamb (Contributor)

fascinating!

@etseidl (Contributor, Author) commented Oct 10, 2025

Going to merge this and move on to the encryption refactor + part of #8518 (the latter has a small breaking change I want to get in while I can 😁).

@etseidl merged commit b51a000 into apache:main Oct 10, 2025
16 checks passed
@etseidl deleted the no_row_group_conv branch October 12, 2025

Labels

parquet (Changes to the parquet crate), performance


Development

Successfully merging this pull request may close these issues.

[thrift-remodel] Optimize convert_row_groups
